BED files into R

Sometimes BED files drive me nuts. They’re great for their simplicity in specifying a genomic feature, but can be difficult to work with across programs because only the first three columns are required (chromosome, start, end) and there are several ways to specify any number of remaining columns. Well, maybe not any number, but looking at the following names, tell me if clarity jumps from the page: BED, BED6, BED8, BED12, BED6+4, BED15, any others? Even the spec at UCSC doesn’t explain the conventions behinds those names.

Working with BED files in R is typically fairly easy. The rtracklayer library contains import/export functions for reading or creating bed files. For instance, reading a bed file into a GRanges object is as simple as:

foo <- import("myPeaks.bed")

However, when those BED files are produced by programs like MACS2 that muck around with some of the fields, then things break, and you have to start figuring out all those unwritten rules about BED files. Typical UCSC BED format has 12 columns. MACS2 produces a narrowPeak file described as BED6+4. The R help for the import() function states the problem explicitly, and offers a solution: …many tools and organizations have extended BED with additional columns. These are not officially valid BED files…To import one of the multitude of BEDX+Y formats, such as those used to distribute ENCODE data through UCSC (narrowPeaks, etc), specify the ‘extraCols’ argument to indicate the expected names and classes of the special columns.

Now the problem is simply that we don’t know which of the columns are standard and which are “special”, and none of the columns of a BED file are “named”, leaving this bit of help a little confusing. But we can make some good guesses, and it turns out they work.

The first 6 fields of the MACS2 narrowPeak file have the properties expected by import() and conform mostly to the UCSC standard (the 5th column is supposed to be an integer between 0 and 1000, but for MACS2 it is simply the -log10 transformed peak Q-Value multiplied by 10 and converted to an integer). But the remaining 4 fields DO NOT match what import() expects, and thus have to be explicitly specified. Thus to get our MACS2 results into R we can do something like the following:

# declare names and types for the extra BED columns
extraCols_narrowPeak <- c(FoldChange="numeric", pVal="numeric",
qVal="numeric", summit="integer")

# import our peak data
peaks <- import.bed("my_peaks.narrowPeak", extraCols=extraCols_narrowPeak)

The result is a GRanges object called “peaks” with meta columns named based on what you gave to extraCols.