Calculate the number of singletons in R

I have some RNA-seq data and I need to calculate the number of singletons. We define a singleton as a read with no other reads mapped nearby (within 100 bases on either side).
I have a dataframe with the begin coordinate and the end coordinate of each read. I'm using R to do this.
I have written the code below, but the apply is not correct and therefore gives an error.
begin_end <- data.frame(begin_coordinate, final_coordinate)
apply(begin_end, 1, function(x) x[,1]-(x-1)[,2])
The first lines of the dataframe are:
> head(begin_end)
  begin final
1 60507 60551
2 60790 60840
3 62004 62051
4 62819 62868
5 65141 65187
The first one seems to be a singleton because the next read starts more than 100 bases after it ends, and so do the rest of the reads in these first lines. But the dataframe is long, and I hope not all the reads are singletons.

Here's the same thing @jeremycg did with dplyr's lag and lead, but in data.table:
library(data.table)
setDT(begin_end)
begin_end[{
  d = begin - shift(final, type="lag")
  pmin(d, shift(d, type="lead"), na.rm=TRUE) > 100
}]
Comment. The basic data.table syntax is DT[i, j]: i is for filtering the input, while j is for modifying the output.
We used i above, but to examine how it works, we can toss the relevant vectors into j:
begin_end[, {
  d = begin - shift(final, type="lag")
  d_lead = shift(d, type="lead")
  my_pmin = pmin(d, d_lead, na.rm=TRUE)
  c(.SD, list(d = d, d_lead = d_lead, my_pmin = my_pmin))
}]
#    begin final    d d_lead my_pmin
# 1: 60507 60551   NA    239     239
# 2: 60790 60840  239   1164     239
# 3: 62004 62051 1164    768     768
# 4: 62819 62868  768   2273     768
# 5: 65141 65187 2273     NA    2273
.SD is a list of the column vectors already in the table; it's short for Subset of Data.
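To see .SD on its own (no new objects assumed, just the table from above):
begin_end[, .SD]  # returns the table's columns as-is; c(.SD, list(...)) above appends to them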

You seem to be trying to get the previous row's end value out of the apply using (x-1). Unfortunately, you can't do this inside the apply family: each function call sees only the current row.
Luckily, there is a function called lag (there are several, so I'll use the one from dplyr). It lets us lag a column by a given number of entries:
begin_end$space <- begin_end$begin - dplyr::lag(begin_end$final)
Here's the output:
  begin final space
1 60507 60551    NA
2 60790 60840   239
3 62004 62051  1164
4 62819 62868   768
5 65141 65187  2273
Then you can flag the reads whose gap to the previous read exceeds 100 bases:
begin_end$issingle <- begin_end$space > 100
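Note that space only measures the gap to the previous read; the definition requires clearance on both sides. As a minimal sketch (reusing the space column above, and assuming rows are sorted by begin), dplyr::lead supplies the gap to the next read:
begin_end$space_next <- dplyr::lead(begin_end$begin) - begin_end$final
# singleton: gaps to both neighbours exceed 100; NA at either end means no neighbour there
begin_end$issingle <- pmin(begin_end$space, begin_end$space_next, na.rm = TRUE) > 100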

Using Bioconductor's GenomicRanges, the idea would be to create a GRanges() from your reads (perhaps by reading the data with GenomicAlignments::readGAlignments() or via makeGRangesFromDataFrame()), extend each read in both directions using resize(), then use findOverlaps() to identify singletons as the reads that only overlap themselves. Roughly:
library(GenomicRanges)
gr = GRanges(seqnames="chr1",
             IRanges(start=c(1000, 1150, 1500), width=100))
gr100 = resize(gr, width(gr) + 200, fix="center")
hits = findOverlaps(gr100)
gr100[tabulate(queryHits(hits), queryLength(hits)) == 1]
leading to
> gr100[tabulate(queryHits(hits), queryLength(hits)) == 1]
GRanges object with 1 range and 0 metadata columns:
      seqnames       ranges strand
         <Rle>    <IRanges>  <Rle>
  [1]     chr1 [1400, 1699]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
This will be fast for millions of records.
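If you'd rather recover the original (unextended) reads, countOverlaps() expresses the same idea; a small sketch assuming the gr and gr100 objects from above:
# every extended read overlaps at least itself, so a count of exactly 1 means singleton
singletons <- gr[countOverlaps(gr100, gr100) == 1]
length(singletons)  # the number of singletons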

Related

Break region into smaller regions based on cutoff

This is, I assume, a somewhat simple programming issue, but I've been struggling with it, mostly because I don't know the right words to search for.
Given a set of "ranges" (in the form of 1-a set of numbers as below, 2-IRanges, or 3-GenomicRanges), I'd like to split it into a set of smaller ranges.
Example Beginning:
Chr Start   End
  1     1 10000
  2     1  5000
Example size of breaks: 2000
New dataset:
Chr Start   End
  1     1  2000
  1  2001  4000
  1  4001  6000
  1  6001  8000
  1  8001 10000
  2     1  2000
  2  2001  4000
  2  4001  5000
I'm doing this in R. I know I could generate these simply with seq, but I'd like to be able to do it based on a list/df of regions instead of having to manually do it every time I have a new list of regions.
Here's an example I've made using seq:
Given 22 chromosomes, loop through them and break each into pieces
# initialize df
Regions <- data.frame(Chromosome = c(), Start = c(), End = c())
# for each row, do the following
for(i in 1:nrow(Chromosomes)){
  # create a sequence from the minimum start to the max end by some value
  breks <- seq(min(Chromosomes$Start[Chromosomes$Chromosome == i]),
               max(Chromosomes$End[Chromosomes$Chromosome == i]), by=2000000)
  # put this into a dataframe
  database <- data.frame(Chromosome = i, Start = breks,
                         End = c(breks[2:length(breks)] - 1,
                                 max(Chromosomes$End[Chromosomes$Chromosome == i])))
  # bind with what we already have
  Regions <- rbind(Regions, database)
  rm(database)
}
This works fine; I'm wondering if there is something already built into a package that does this as a one-liner, or that is more flexible, as this approach has its limitations.
Using the R / Bioconductor package GenomicRanges, here are your initial ranges
library(GenomicRanges)
rngs = GRanges(1:2, IRanges(1, c(10000, 5000)))
and then create sliding windows across the genome, generated first as a list (one set of tiles per chromosome) and then unlisted into the format in your question:
> windows = slidingWindows(rngs, width=2000, step=2000)
> unlist(windows)
GRanges object with 8 ranges and 0 metadata columns:
      seqnames        ranges strand
         <Rle>     <IRanges>  <Rle>
  [1]        1 [   1,  2000]      *
  [2]        1 [2001,  4000]      *
  [3]        1 [4001,  6000]      *
  [4]        1 [6001,  8000]      *
  [5]        1 [8001, 10000]      *
  [6]        2 [   1,  2000]      *
  [7]        2 [2001,  4000]      *
  [8]        2 [4001,  5000]      *
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths
Coerce from / to a data.frame with as(df, "GRanges") or as(unlist(windows), "data.frame").
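For example, a round trip with the windows object above (df and gr2 are just illustrative names):
df <- as(unlist(windows), "data.frame")  # GRanges -> data.frame
gr2 <- as(df, "GRanges")                 # data.frame -> GRanges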
Find help at ?"slidingWindows,GenomicRanges-method" (tab completion is your friend: ?"slidingW<tab>").
Embarrassingly, this seems to be implemented only in the 'devel' version of GenomicRanges (v. 1.25.93?); tile() does something similar, but rounds the widths of the ranges to be approximately equal while spanning the full width of the GRanges. Here is a poor man's version:
windows <- function(gr, width, withMcols=FALSE) {
  starts <- Map(seq, start(gr), end(gr), by=width)
  ends <- Map(function(starts, len) c(tail(starts, -1) - 1L, len),
              starts, end(gr))
  seq <- rep(seqnames(gr), lengths(starts))
  strand <- rep(strand(gr), lengths(starts))
  result <- GRanges(seq, IRanges(unlist(starts), unlist(ends)), strand)
  seqinfo(result) <- seqinfo(gr)
  if (withMcols) {
    idx <- rep(seq_along(gr), lengths(starts))
    mcols(result) <- mcols(gr)[idx, , drop=FALSE]
  }
  result
}
invoked as
> windows(rngs, 2000)
If the approach is useful, consider asking follow-up questions on the Bioconductor support site.

Quickly create new columns in dataframe using lists - R

I have data containing quotations of indexes (S&P 500, CAC 40, ...) for every 5 minutes of the last 3 years, which makes it quite large. I am trying to create new columns containing the performance of each index at each time (i.e. (quotation at [TIME] / quotation at yesterday's close) - 1). I began this way (my data is named temp):
listIndexes <- list("CAC","SP","MIB") # there are a lot more
listTime <- list(900,905,910,...1735) # every 5 minutes
for (j in 1:length(listTime)){
  Time <- listTime[j]
  for (i in 1:length(listIndexes)) {
    Index <- listIndexes[i]
    temp[[paste0(Index,"perf",Time)]] <- temp[[paste0(Index,Time)]]/temp[[paste0(Index,"close")]] - 1
    # other stuff to do but with the same concept
  }
}
but it is quite slow. Is there a way to get rid of the for loops, or to make the creation of those variables quicker? I read about the apply family of functions and its derivatives, but I don't see if and how they could be used here.
My data looks like this :
date     CACcloseyesterday CAC1000 CAC1005 ... CACclose ... SP1000 ... SPclose
20140105              3999    4000  4001.2 ...     4005 ...   2000 ...    2003
20140106              4005    4004  4003.5 ...     4002 ...   2005 ...    2002
...
and my desired output would be a new column for each time and each index, added to temp:
date     CACperf1000   CACperf1005     ... SPperf1000    ...
20140106 (4004/4005)-1 (4003.5/4005)-1 ... (2005/2003)-1 ... # the close used is the one of the day before
and the same for each following day.
I wrote (4004/4005)-1 just to show the calculation, but the result should be a number: -0.0002496879.
It looks like you want to generate every combination of Index and Time. Each Index-Time combination is a column in temp and you want to calculate a new perf column by comparing each Index-Time column against a specific Index close column. And your problem is that you think there should be an easier (less error-prone) way to do this.
We can remove one of the for-loops by generating all the necessary column names beforehand using something like expand.grid.
listIndexes <-list("CAC","SP","MIB")
listTime <- list(900, 905, 910, 915, 920)
df <- expand.grid(Index = listIndexes, Time = listTime,
                  stringsAsFactors = FALSE)
df$c1 <- paste0(df$Index, "perf", df$Time)
df$c2 <- paste0(df$Index, df$Time)
df$c3 <- paste0(df$Index, "close")
head(df)
#>   Index Time         c1     c2       c3
#> 1   CAC  900 CACperf900 CAC900 CACclose
#> 2    SP  900  SPperf900  SP900  SPclose
#> 3   MIB  900 MIBperf900 MIB900 MIBclose
#> 4   CAC  905 CACperf905 CAC905 CACclose
#> 5    SP  905  SPperf905  SP905  SPclose
#> 6   MIB  905 MIBperf905 MIB905 MIBclose
Then only one loop is required, and it's for iterating over each batch of column names and doing the calculation.
for (row_i in seq_len(nrow(df))) {
  this_row <- df[row_i, ]
  temp[[this_row$c1]] <- temp[[this_row$c2]] / temp[[this_row$c3]] - 1
}
An alternative solution would be to reshape your data into a form that makes this transformation much simpler: convert it to a long, tidy format with columns for Date, Index, Time, and Value plus a ClosingValue column, and operate directly on just the two relevant columns, as sketched below.
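A minimal sketch of that reshaping with data.table's melt(), assuming temp has the wide layout shown in the question and that the index prefixes are known; the names col, index, time, closes, and perf are illustrative:
library(data.table)
setDT(temp)
# one row per date/column combination
long <- melt(temp, id.vars = "date", variable.name = "col", value.name = "value")
long[, col := as.character(col)]
# split "CAC1005" into index "CAC" and time "1005"; "CACclose" gets time "close"
long[, index := sub("^(CAC|SP|MIB)(.*)$", "\\1", col)]
long[, time  := sub("^(CAC|SP|MIB)(.*)$", "\\2", col)]
# pull each index's close per date, join it back, and compute the performance
closes <- long[time == "close", .(date, index, close = value)]
perf <- closes[long[grepl("^[0-9]+$", time)], on = .(date, index)]
perf[, perf := value / close - 1]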

Find the intersection of overlapping ranges in two tables using data.table function foverlaps

I would like to use foverlaps to find the intersecting ranges of two bed files, and collapse any rows containing overlapping ranges into a single row. In the example below I have two tables with genomic ranges. The tables are called "bed" files, which have zero-based start coordinates and one-based end positions of features in chromosomes. For example, START=9, STOP=20 spans bases 10 through 20, inclusive. These bed files can contain millions of rows. The solution needs to give the same result regardless of the order in which the two files to be intersected are provided.
First Table
> table1
   CHROMOSOME START STOP
1:          1     1   10
2:          1    20   50
3:          1    70  130
4:          X     1   20
5:          Y     5  200
Second Table
> table2
   CHROMOSOME START STOP
1:          1     5   12
2:          1    15   55
3:          1    60   65
4:          1   100  110
5:          1   130  131
6:          X    60   80
7:          Y     1   15
8:          Y    10   50
I was thinking that the new foverlaps function could be a very fast way to find the intersecting ranges in these two tables and produce a table that would look like this:
Result Table:
> resultTable
   CHROMOSOME START STOP
1:          1     5   10
2:          1    20   50
3:          1   100  110
4:          Y     5   50
Is that possible, or is there a better way to do that in data.table?
I'd also like to first confirm that, within each table, the STOP coordinate of one row does not overlap the START coordinate of the next row for any given CHROMOSOME. For example, CHROMOSOME Y:1-15 and CHROMOSOME Y:10-50 would need to be collapsed to CHROMOSOME Y:1-50 (see Second Table, rows 7 and 8). This should not be the case, but the function should probably check for it. A real-life example of how potential overlaps should be collapsed is below:
   CHROM  START   STOP
1:     1 721281 721619
2:     1 721430 721906
3:     1 721751 722042
Desired output:
   CHROM  START   STOP
1:     1 721281 722042
Functions to create example tables are as follows:
table1 <- data.table(
  CHROMOSOME = as.character(c("1","1","1","X","Y")),
  START = c(1, 20, 70, 1, 5),
  STOP = c(10, 50, 130, 20, 200)
)
table2 <- data.table(
  CHROMOSOME = as.character(c("1","1","1","1","1","X","Y","Y")),
  START = c(5, 15, 60, 100, 130, 60, 1, 10),
  STOP = c(12, 55, 65, 110, 131, 80, 15, 50)
)
@Seth provided the fastest way to solve the problem of intersection overlaps using the data.table foverlaps function. However, his solution did not take into account the fact that the input bed files may have overlapping ranges that need to be reduced into single regions. @Martin Morgan solved that with his solution using the GenomicRanges package, which did both the intersecting and the range reducing; however, it didn't use the foverlaps function. @Arun pointed out that collapsing overlapping ranges in different rows within a table is not currently possible with foverlaps. Thanks to the answers provided, and some additional research on Stack Overflow, I came up with this hybrid solution.
Create example BED files without overlapping regions within each file.
chr <- c(1:22, "X", "Y", "MT")
#bedA contains 5 million rows
bedA <- data.table(
  CHROM = as.vector(sapply(chr, function(x) rep(x, 200000))),
  START = rep(as.integer(seq(1, 200000000, 1000)), 25),
  STOP = rep(as.integer(seq(500, 200000000, 1000)), 25),
  key = c("CHROM","START","STOP")
)
#bedB contains 500 thousand rows
bedB <- data.table(
  CHROM = as.vector(sapply(chr, function(x) rep(x, 20000))),
  START = rep(as.integer(seq(200, 200000000, 10000)), 25),
  STOP = rep(as.integer(seq(600, 200000000, 10000)), 25),
  key = c("CHROM","START","STOP")
)
Now create a new bed file containing the intersecting regions in bedA and bedB.
#This solution uses foverlaps
system.time(tmpA <- intersectBedFiles.foverlaps(bedA,bedB))
user system elapsed
1.25 0.02 1.37
#This solution uses GenomicRanges
system.time(tmpB <- intersectBedFiles.GR(bedA,bedB))
user system elapsed
12.95 0.06 13.04
identical(tmpA,tmpB)
[1] TRUE
Now, modify bedA and bedB such that they contain overlapping regions:
#Create overlapping ranges
makeOverlaps <- as.integer(c(0,0,600,0,0,0,600,0,0,0))
bedC <- bedA[, STOP := STOP + makeOverlaps, by=CHROM]
bedD <- bedB[, STOP := STOP + makeOverlaps, by=CHROM]
Test the time to intersect bed files with overlapping ranges using either the foverlaps or the GenomicRanges functions.
#This solution uses foverlaps to find the intersection and then run GenomicRanges on the result
system.time(tmpC <- intersectBedFiles.foverlaps(bedC,bedD))
user system elapsed
1.83 0.05 1.89
#This solution uses GenomicRanges
system.time(tmpD <- intersectBedFiles.GR(bedC,bedD))
user system elapsed
12.95 0.04 12.99
identical(tmpC,tmpD)
[1] TRUE
The winner: foverlaps!
FUNCTIONS USED
This is the function based upon foverlaps; it will only call the GenomicRanges function (reduceBed.GenomicRanges) if there are overlapping ranges (which are checked for using the rowShift function).
intersectBedFiles.foverlaps <- function(bed1, bed2) {
  require(data.table)
  bedKey <- c("CHROM","START","STOP")
  if(nrow(bed1) > nrow(bed2)) {
    bed <- foverlaps(bed1, bed2, nomatch = 0)
  } else {
    bed <- foverlaps(bed2, bed1, nomatch = 0)
  }
  bed[, START := pmax(START, i.START)]
  bed[, STOP := pmin(STOP, i.STOP)]
  bed[, `:=`(i.START = NULL, i.STOP = NULL)]
  if(!identical(key(bed), bedKey)) setkeyv(bed, bedKey)
  if(any(bed[, STOP + 1 >= rowShift(START), by=CHROM][, V1], na.rm = TRUE)) {
    bed <- reduceBed.GenomicRanges(bed)
  }
  return(bed)
}
rowShift <- function(x, shiftLen = 1L) {
  # Note: this function was described in this thread:
  # http://stackoverflow.com/questions/14689424/use-a-value-from-the-previous-row-in-an-r-data-table-calculation
  r <- (1L + shiftLen):(length(x) + shiftLen)
  r[r < 1] <- NA
  return(x[r])
}
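A quick sanity check of rowShift(): with the default shiftLen = 1L it returns each row's next value, padding with NA at the end.
rowShift(c(10, 20, 30))
# [1] 20 30 NA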
reduceBed.GenomicRanges <- function(bed) {
  require(data.table)
  require(GenomicRanges)
  bedKey <- c("CHROM","START","STOP")
  setnames(bed, colnames(bed), bedKey)
  if(!identical(key(bed), bedKey)) setkeyv(bed, bedKey)
  grBed <- makeGRangesFromDataFrame(bed,
    seqnames.field = "CHROM", start.field = "START", end.field = "STOP")
  grBed <- reduce(grBed)
  grBed <- data.table(
    CHROM = as.character(seqnames(grBed)),
    START = start(grBed),
    STOP = end(grBed),
    key = c("CHROM","START","STOP"))
  return(grBed)
}
This function uses only the GenomicRanges package. It produces the same result, but is about 10-fold slower than the foverlaps function.
intersectBedFiles.GR <- function(bed1, bed2) {
  require(data.table)
  require(GenomicRanges)
  bed1 <- makeGRangesFromDataFrame(bed1,
    seqnames.field = "CHROM", start.field = "START", end.field = "STOP")
  bed2 <- makeGRangesFromDataFrame(bed2,
    seqnames.field = "CHROM", start.field = "START", end.field = "STOP")
  grMerge <- suppressWarnings(intersect(bed1, bed2))
  resultTable <- data.table(
    CHROM = as.character(seqnames(grMerge)),
    START = start(grMerge),
    STOP = end(grMerge),
    key = c("CHROM","START","STOP"))
  return(resultTable)
}
An additional comparison using IRanges
I found a solution to collapse overlapping regions using IRanges but it is more than 10 fold slower than GenomicRanges.
reduceBed.IRanges <- function(bed) {
  require(data.table)
  require(IRanges)
  bedKey <- c("CHROM","START","STOP")
  bed.tmp <- bed
  bed.tmp[, group := {
    ir <- IRanges(START, STOP)
    subjectHits(findOverlaps(ir, reduce(ir)))
  }, by=CHROM]
  bed.tmp <- bed.tmp[, list(CHROM=unique(CHROM),
                            START=min(START),
                            STOP=max(STOP)),
                     by=list(group, CHROM)]
  setkeyv(bed.tmp, bedKey)
  bed[, group := NULL]
  return(bed.tmp[, -(1:2)])
}
system.time(bedC.reduced <- reduceBed.GenomicRanges(bedC))
user system elapsed
10.86 0.01 10.89
system.time(bedD.reduced <- reduceBed.IRanges(bedC))
user system elapsed
137.12 0.14 137.58
identical(bedC.reduced,bedD.reduced)
[1] TRUE
foverlaps() will do nicely.
First set the keys for both of the tables:
setkey(table1, CHROMOSOME, START, STOP)
setkey(table2, CHROMOSOME, START, STOP)
Now join them using foverlaps() with nomatch = 0 to drop rows without an overlap.
resultTable <- foverlaps(table1, table2, nomatch = 0)
Next choose the appropriate values for START and STOP, and drop the extra columns.
resultTable[, START := pmax(START, i.START)]
resultTable[, STOP := pmin(STOP, i.STOP)]
resultTable[, `:=`(i.START = NULL, i.STOP = NULL)]
The overlapping STOP to a future START should be a different question. It's actually one that I have, so maybe I'll ask it and come back to it here when I have a good answer.
In case you're not stuck on a data.table solution, GenomicRanges
source("http://bioconductor.org/biocLite.R")
biocLite("GenomicRanges")
gives
> library(GenomicRanges)
> intersect(makeGRangesFromDataFrame(table1), makeGRangesFromDataFrame(table2))
GRanges object with 5 ranges and 0 metadata columns:
      seqnames     ranges strand
         <Rle>  <IRanges>  <Rle>
  [1]        1 [  5,  10]      *
  [2]        1 [ 20,  50]      *
  [3]        1 [100, 110]      *
  [4]        1 [130, 130]      *
  [5]        Y [  5,  50]      *
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths
In most overlapping ranges problems in genomics, we have one large data set x (usually sequenced reads) and another smaller data set y (usually the gene model, exons, introns etc.). We are tasked with finding which intervals in x overlap with which intervals in y or how many intervals in x overlap for each y interval.
In foverlaps(), we don't have to setkey() on the larger data set x, as that is quite an expensive operation; but y needs to have its key set. For your case, from this example, it seems like table2 is the larger one (= x) and table1 = y.
require(data.table)
setkey(table1) # key columns = CHROMOSOME, START, STOP
ans = foverlaps(table2, table1, type="any", nomatch=0L)
ans[, `:=`(i.START = pmax(START, i.START),
           i.STOP = pmin(STOP, i.STOP))]
ans = ans[, .(i.START[1L], i.STOP[.N]), by=.(CHROMOSOME, START, STOP)]
#    CHROMOSOME START STOP  V1  V2
# 1:          1     1   10   5  10
# 2:          1    20   50  20  50
# 3:          1    70  130 100 130
# 4:          Y     5  200   5  50
But I agree it'd be great to be able to do this in one step. Not sure how yet; maybe by allowing additional values such as reduce and intersect for the mult= argument.
Here's a solution entirely in data.table, based on Pete's answer. It's actually slower than his solution that combines GenomicRanges and data.table, but still faster than the solution that uses only GenomicRanges.
intersectBedFiles.foverlaps2 <- function(bed1, bed2) {
  require(data.table)
  bedKey <- c("CHROM","START","STOP")
  if(nrow(bed1) > nrow(bed2)) {
    if(!identical(key(bed2), bedKey)) setkeyv(bed2, bedKey)
    bed <- foverlaps(bed1, bed2, nomatch = 0)
  } else {
    if(!identical(key(bed1), bedKey)) setkeyv(bed1, bedKey)
    bed <- foverlaps(bed2, bed1, nomatch = 0)
  }
  bed[, row_id := 1:nrow(bed)]
  bed[, START := pmax(START, i.START)]
  bed[, STOP := pmin(STOP, i.STOP)]
  bed[, `:=`(i.START = NULL, i.STOP = NULL)]
  setkeyv(bed, bedKey)
  temp <- foverlaps(bed, bed)
  temp[, `:=`(c("START","STOP"),
              list(min(START, i.START), max(STOP, i.STOP))), by=row_id]
  temp[, `:=`(c("START","STOP"),
              list(min(START, i.START), max(STOP, i.STOP))), by=i.row_id]
  out <- unique(temp[, .(CHROM, START, STOP)])
  setkeyv(out, bedKey)
  out
}

How to join data.tables when one is a lookup table?

I'm having trouble applying a simple data.table join example to a larger (10GB) data set. merge() works just fine on data.frames with the larger dataset, although I'd love to take advantage of the speed in data.table. Could anyone point out what I'm misunderstanding about data.table (and the error message in particular)?
Here is the simple example (derived from this thread: Join of two data.tables fails).
# The data of interest.
(DT <- data.table(id = c(rep(1154:1155, 2), 1160),
                  price = c(1.99, 2.50, 15.63, 15.00, 0.75),
                  key = "id"))
     id price
1: 1154  1.99
2: 1154 15.63
3: 1155  2.50
4: 1155 15.00
5: 1160  0.75
# Lookup table.
(lookup <- data.table(id = 1153:1160,
                      version = c(1,1,3,4,2,1,1,2),
                      yr = rep(2006, 4),
                      key = "id"))
     id version   yr
1: 1153       1 2006
2: 1154       1 2006
3: 1155       3 2006
4: 1156       4 2006
5: 1157       2 2006
6: 1158       1 2006
7: 1159       1 2006
8: 1160       2 2006
# The desired table. Note: lookup[DT] works as well.
DT[lookup, allow.cartesian = TRUE, nomatch = 0]
     id price version   yr
1: 1154  1.99       1 2006
2: 1154 15.63       1 2006
3: 1155  2.50       3 2006
4: 1155 15.00       3 2006
5: 1160  0.75       2 2006
The larger data set consists of two data.frames: temp.3561 (the dataset of interest) and temp.versions (the lookup dataset). They have the same structure as DT and lookup (above), respectively. Using merge() works well, however my application of data.table is clearly flawed:
# Merge data.frames: works just fine
long.merged <- merge(temp.versions, temp.3561, by = "id")
# Convert the data.frames to data.tables
DTtemp.3561 <- as.data.table(temp.3561)
DTtemp.versions <- as.data.table(temp.versions)
# Merge the data.tables: doesn't work
setkey(DTtemp.3561, id)
setkey(DTtemp.versions, id)
DTlong.merged <- merge(DTtemp.versions, DTtemp.3561, by = "id")
Error in vecseq(f__, len__, if (allow.cartesian) NULL else as.integer(max(nrow(x), :
Join results in 11277332 rows; more than 7946667 = max(nrow(x),nrow(i)). Check for duplicate
key values in i, each of which join to the same group in x over and over again. If that's ok,
try including `j` and dropping `by` (by-without-by) so that j runs for each group to avoid the
large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE.
Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-
help for advice.
DTtemp.versions has the same structure as lookup (in the simple example), and the key "id" consists of 779,473 unique values (no duplicates).
DTtemp3561 has the same structure as DT (in the simple example) plus a few other variables, but its key "id" only has 829 unique values despite the 7,946,667 observations (lots of duplicates).
Since I'm just trying to add version numbers and years from DTtemp.versions to each observation in DTtemp.3561, the merged data.table should have the same number of observations as DTtemp.3561 (7,946,667). Specifically, I don't understand why merge() generates "excess" observations when using data.table but not when using data.frame.
Likewise
# Same error message, but with 12,055,777 observations
altDTlong.merged <- DTtemp.3561[DTtemp.versions]
# Same error message, but with 11,277,332 observations
alt2DTlong.merged <- DTtemp.versions[DTtemp.3561]
Including allow.cartesian=T and nomatch=0 doesn't drop the "excess" observations.
Oddly, if I truncate the dataset of interest to 10 observations, merge() works fine on both the data.frames and the data.tables.
# Merge short DF: works just fine
short.3561 <- temp.3561[-(11:7946667),]
short.merged <- merge(temp.versions, short.3561, by = "id")
# Merge short DT
DTshort.3561 <- data.table(short.3561, key = "id")
DTshort.merged <- merge(DTtemp.versions, DTshort.3561, by = "id")
I've been through the FAQ (http://datatable.r-forge.r-project.org/datatable-faq.pdf, section 1.12 in particular). How would you suggest thinking about this?
Could anyone point out what I'm misunderstanding about data.table (and the error message in particular)?
Taking your question directly. The error message
Join results in 11277332 rows; more than 7946667 = max(nrow(x),nrow(i)). Check for duplicate key values in i...
states that the result of your join has more rows than the usual case expects. This means the lookup table's key has duplicate values, which result in multiple matches during the join.
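A quick way to check for this, as a sketch using the table names from the question:
# count rows per id in the lookup table; any N > 1 is a duplicated key
DTtemp.versions[, .N, by = id][N > 1L]
# or simply:
anyDuplicated(DTtemp.versions$id)  # 0 means no duplicated ids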
If that doesn't answer your question, you should restate it.
