Sample a BAM file

When working out a pipeline, especially one involving BAM files, I often find it useful to create some toy sample data to push through it, just to get it working. Most BAM files I use contain millions of reads and are often gigabytes in size. Using full-size files can burn a lot of time just to find out that an error has occurred, whereas using a file with only a few reads lets you spot errors instantly. What if you simply want 1000 sequence reads from a BAM file? What is the easiest/quickest way to grab them? Use samtools view to grab the reads and specify how many you'd like to keep using "head".

samtools view -h in.bam | head -n 1000 | samtools view -b - > sample.bam
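
As a quick sanity check (a sketch using the same file names as above), you can count how many reads actually ended up in the sample and confirm the header was carried across:

samtools view -c sample.bam
samtools view -H sample.bam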

The head-based command above won't return exactly the number of reads you specify, because the header lines of the BAM file count towards the number you pass to head. The point of taking a small sample this way is to avoid reading through the whole file just to get a few reads. Another option, which is generally more robust but does read through the whole file, is to subsample it. For instance, if you kept roughly one read in ten, you'd have a file that is approximately 10 times smaller, and there's a samtools option for that:

samtools view -s 0.1 -b in.bam > out.bam

The -s option subsamples the file, keeping only a fraction of the alignments chosen at random; here 0.1 keeps roughly 10% of the reads. The -b option in the command above specifies that the output should be in BAM format.
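
If you want the same random sample every time, which is handy when debugging a pipeline, samtools view lets you set the random seed via the integer part of the -s argument, with the fraction to keep after the decimal point. The seed value below (42) is just an example:

samtools view -s 42.1 -b in.bam > out.bam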
