Removing overlapping sites within a data frame in R

I have a data frame that contains overlapping sites, and I want to remove the overlapping sites according to this rule:
if End(B) > Start(A) & End(A) > Start(B), then we remove site B.
For example:
Chrom Start End
scaffold_98 8309 8313
scaffold_98 8311 8320
scaffold_98 8811 8815
In this case, I should remove site (8311, 8320) because it overlaps with (8309, 8313).
Is there a fast way to do this when working with large data?

Using GenomicRanges is not strictly necessary here, but I recommend it for more complex operations involving genomic coordinates. It is a very powerful library that has been designed for this kind of operation.
Here is a solution using GenomicRanges::findOverlaps.
Load the library.
library(GenomicRanges);
Read-in the sample data and turn into GRanges object.
df <- read.table(text =
"Chrom Start End
scaffold_98 8309 8313
scaffold_98 8311 8320
scaffold_98 8811 8815", header = TRUE)
gr <- makeGRangesFromDataFrame(df);
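Select non-overlapping regions using findOverlaps. With select = "first", findOverlaps returns, for each range, the index of the first range it overlaps (possibly itself); unique() then gives the indices to keep.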
gr[unique(findOverlaps(gr, type = "any", select = "first"))];
#GRanges object with 2 ranges and 0 metadata columns:
# seqnames ranges strand
# <Rle> <IRanges> <Rle>
# [1] scaffold_98 [8309, 8313] *
# [2] scaffold_98 [8811, 8815] *
# -------
# seqinfo: 1 sequence from an unspecified genome; no seqlengths

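Assuming that your df is sorted by the Start column and contains a single chromosome, this might work:
remove <- integer(0)
for (i in seq_len(nrow(df))[-1]) {
  # columns 2 and 3 are Start and End; compare each row with the one above it
  if (df[i, 3] > df[i - 1, 2] && df[i - 1, 3] > df[i, 2]) {
    remove <- c(remove, i)
  }
}
# df[-remove, ] would return zero rows if nothing was flagged, so guard that case
if (length(remove) > 0) df[-remove, ] else df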
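When three or more sites overlap in a chain, comparing only with the previous row and selecting the first overlapping range can keep different sites. As a rough base-R sketch of one possible reading of the rule (a site is dropped only if it overlaps a site that was kept), the hypothetical helper `remove_overlapping()` below sorts by chromosome and start and compares each site with the last kept site. The function name and the greedy interpretation are assumptions, not part of the original answers.

```r
# Hypothetical helper: greedy removal of overlapping sites (assumed interpretation).
# Expects columns Chrom, Start, End as in the question.
remove_overlapping <- function(df) {
  df <- df[order(df$Chrom, df$Start, df$End), , drop = FALSE]
  keep <- logical(nrow(df))
  last <- 0L  # row index of the last kept site, 0 if none yet
  for (i in seq_len(nrow(df))) {
    same_chrom <- last > 0L && df$Chrom[i] == df$Chrom[last]
    # The rule from the question: End(B) > Start(A) & End(A) > Start(B)
    overlaps <- same_chrom &&
      df$End[i] > df$Start[last] && df$End[last] > df$Start[i]
    if (!overlaps) {
      keep[i] <- TRUE
      last <- i
    }
  }
  df[keep, , drop = FALSE]
}

sites <- data.frame(
  Chrom = rep("scaffold_98", 3),
  Start = c(8309, 8311, 8811),
  End   = c(8313, 8320, 8815)
)
kept <- remove_overlapping(sites)
kept
```

On the example from the question this keeps (8309, 8313) and (8811, 8815). The loop is linear in the number of rows after sorting, but for very large inputs the package-based solutions are likely to be faster.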

Related

Problem with CoverageHeatmap (Bioconductor) function in R

I have 2 sets of pairwise alignments, where query genome 1 (q1) is aligned to the reference genome and query genome 2 (q2) is aligned to the same reference genome. Therefore, I have both alignments with a coordinate system in the reference genome. The alignments are in the form of GRanges objects.
I would like to project the breakpoints of q2 onto q1, by aligning the breakpoints of q1 in the center, and look for any clustering of q2 breakpoints around the q1 breakpoints, all in the reference genome coordinate system.
Therefore, I make a GRanges object of q1 with its breakpoints in the center. For example, if there is a breakpoint in q1 relative to the reference genome at scaffold 1, bp 833, then taking a window of 500 bp on either side of it, the q1 GRanges object will have an element:
GRanges object with 1 range and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] S1 333-1333 *
-------
seqinfo: 576 sequences from an unspecified genome; no seqlengths
I then construct a GRanges object of the breakpoints on q2, in which all ranges have width 1. I intersect this with the q1 GRanges object, so that q2 only retains points that can be projected onto q1.
The CoverageHeatmap function requires:
windows:
A set of GRanges of equal length
track:
A GRanges or RleList object specifying coverage
When I call the CoverageHeatmap function, I always get this error and warning message:
Error: subscript contains out-of-bounds ranges
In addition: Warning message:
In e1 == Rle(e2) :
longer object length is not a multiple of shorter object length
Called from: S4Vectors:::.subscript_error("subscript contains out-of-bounds ",
"ranges")
I've tried a bunch of things to try and make this work and still get the same error and warning message. This is my code (including when I've tried the function with q2 as a GRanges object and an RleList)
## BP Pairwise comparison, using 3rd genome as co-ordinate reference
# q1 is used as the centre point reference, with q2 bps projected on to it.
# gr_ref_q1 is the pw alignment between the reference and query genome 1
# gr_ref_q2 is the pw alignment between the reference and query genome 2
# We construct two GRanges objects to feed into CoverageHeatMaps
library(schoolmath)
library(heatmaps)
library(IRanges)
bp_3gen_v2 <- function(gr_ref_q1, gr_ref_q2, win){
# Failsafes (check ref genome is the same, etc)
if(!(is.even(win))){stop("win should be an even number")}
## Construct g1_rco (1st GRanges object)
# IRanges object
q1_starts1 <- start(ranges(gr_ref_q1)) - (win*0.5)
q1_starts2 <- end(ranges(gr_ref_q1)) - (win*0.5)
q1_starts <- c(q1_starts1, q1_starts2)
q1_ends1 <- start(ranges(gr_ref_q1)) + (win*0.5)
q1_ends2 <- end(ranges(gr_ref_q1)) + (win*0.5)
q1_ends <- c(q1_ends1, q1_ends2)
q1_ir_ob <- IRanges(start = q1_starts, end = q1_ends)
# GR object
g1_vec_seq <- as.vector(seqnames(gr_ref_q1))
gr1_seqnames <- c(g1_vec_seq, g1_vec_seq)
g1_rco <- GRanges(seqnames = gr1_seqnames, ranges = q1_ir_ob,
seqinfo = seqinfo(gr_ref_q1))
# Remove negative ranges from GR object
g1_rco <- g1_rco[!(start(ranges(g1_rco)) < 0)]
## Construct g2_rco (2nd GRanges object)
# IRanges object
q2_starts <- start(ranges(gr_ref_q2))
q2_ends <- end(ranges(gr_ref_q2))
q2_bps <- c(q2_starts, q2_ends)
q2_ir_ob <- IRanges(start = q2_bps, end = q2_bps)
# GR object
g2_vec_seq <- as.vector(seqnames(gr_ref_q2))
gr2_seqnames <- c(g2_vec_seq, g2_vec_seq)
g2_rco <- GRanges(seqnames = gr2_seqnames, ranges = q2_ir_ob,
seqinfo = seqinfo(gr_ref_q2))
# Try removing anywhere in g2_rco that is not present in g1_rco
# find intersection of seqnames
g_inter <- intersect(g1_vec_seq, g2_vec_seq)
# apply to g2_rco to remove out of bound scaffols
g2_rco <- g2_rco[seqnames(g2_rco) == g_inter]
# now to remove out of bound ranges (GRanges object)
g2_red <- intersect(g1_rco, g2_rco)
# And try as RleList object
g2_red_rle <- coverage(g2_red)
# Heatmap
heat_map <- CoverageHeatmap(windows = g1_rco, track = g2_red_rle)
}
To avoid these problems and achieve what you need, the simplest solution is to give both GRanges objects the same seqlevels and seqlengths. If you know these for your reference, provide them; if not, try the following.
First, some example datasets:
library(heatmaps)
gr1 = GRanges(seqnames=c(1,2,3),
IRanges(start=c(1,101,1001),end=c(500,600,1500)))
gr2 = GRanges(seqnames=c(2,2,3,3),
IRanges(start=c(1,301,1,1201),end=c(2500,4800,3500,9700)))
Then we make a combined range to get the levels and lengths:
combined= range(c(gr1,gr2))
seqlevels(gr1) = as.character(seqnames(combined))
seqlevels(gr2) = as.character(seqnames(combined))
seqlengths(gr1) = end(combined)
seqlengths(gr2) = end(combined)
Then the heatmap can be easily obtained by:
CoverageHeatmap(gr1,coverage(gr2))
Or if you only want to look at gr1 windows that have some values in gr2, then do:
CoverageHeatmap(gr1[countOverlaps(gr1,gr2)>0],coverage(gr2))
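One possible source of the "longer object length is not a multiple of shorter object length" warning in the question is the `seqnames(g2_rco) == g_inter` comparison: `==` recycles the shorter vector element by element, whereas `%in%` tests set membership. The base-R sketch below uses made-up scaffold names to show the difference. Whether this is the only cause of the error in the original code is an assumption.

```r
# Made-up scaffold names, for illustration only
scaffolds <- c("S1", "S2", "S3", "S1", "S4")
shared    <- c("S1", "S3")

# `==` compares position by position and recycles `shared`, which triggers
# the "longer object length" warning (suppressed here) and misses matches
by_equals <- suppressWarnings(scaffolds == shared)

# `%in%` asks, for every element, whether it occurs anywhere in `shared`
by_in <- scaffolds %in% shared

scaffolds[by_equals]  # "S1"
scaffolds[by_in]      # "S1" "S3" "S1"
```

In the question's function, the analogous change would presumably be `g2_rco[seqnames(g2_rco) %in% g_inter]`.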

Break region into smaller regions based on cutoff

This is, I assume, a fairly simple programming issue, but I've been struggling with it, perhaps mostly because I don't know the right words to use.
Given a set of "ranges" (in the form of (1) a set of numbers as below, (2) IRanges, or (3) GenomicRanges), I'd like to split it into a set of smaller ranges.
Example Beginning:
Chr Start End
1 1 10000
2 1 5000
Example size of breaks: 2000
New dataset:
Chr Start End
1 1 2000
1 2001 4000
1 4001 6000
1 6001 8000
1 8001 10000
2 1 2000
2 2001 4000
2 4001 5000
I'm doing this in R. I know I could generate these simply with seq, but I'd like to be able to do it based on a list/df of regions instead of having to manually do it every time I have a new list of regions.
Here's an example I've made using seq:
Given 22 chromosomes, loop through them and break each into pieces
# initialize df
Regions <- data.frame(Chromosome = c(), Start = c(), End = c())
# for each row, do the following
for(i in 1:nrow(Chromosomes)){
# create a sequence from the minimum start to the max end by some value
breks <- seq(min(Chromosomes$Start[Chromosomes$Chromosome == i]), max(Chromosomes$End[Chromosomes$Chromosome == i]), by=2000000)
# put this into a dataframe
database <- data.frame(Chromosome = i, Start = breks, End = c(breks[2:length(breks)]-1, max(Chromosomes$End[Chromosomes$Chromosome == i])))
# bind with what we already have
Regions <- rbind(Regions, database)
rm(database)
}
This works fine, but I'm wondering if there is something already built into a package that does this as a one-liner or that is more flexible, as this has its limitations.
Using the R / Bioconductor package GenomicRanges, here are your initial ranges
library(GenomicRanges)
rngs = GRanges(1:2, IRanges(1, c(10000, 5000)))
and then create a sliding window across the genome, generated first as a list (one set of tiles per chromosome) and then unlisted for the format you have in your question
> windows = slidingWindows(rngs, width=2000, step=2000)
> unlist(windows)
GRanges object with 8 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] 1 [ 1, 2000] *
[2] 1 [2001, 4000] *
[3] 1 [4001, 6000] *
[4] 1 [6001, 8000] *
[5] 1 [8001, 10000] *
[6] 2 [ 1, 2000] *
[7] 2 [2001, 4000] *
[8] 2 [4001, 5000] *
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
Coerce from / to a data.frame with as(df, "GRanges") or as(unlist(windows), "data.frame").
Find help at ?"slidingWindows,GenomicRanges-method" (tab completion is your friend, ?"slidingW<tab>").
Embarrassingly, this seems to be implemented only in the 'devel' version of GenomicRanges (v. 1.25.93?); tile does something similar but rounds the width of ranges to be approximately equal while spanning the width of the GRanges. Here is a poor-man's version
windows <- function(gr, width, withMcols=FALSE) {
starts <- Map(seq, start(gr), end(gr), by=width)
ends <- Map(function(starts, len) c(tail(starts, -1) - 1L, len),
starts, end(gr))
seq <- rep(seqnames(gr), lengths(starts))
strand <- rep(strand(gr), lengths(starts))
result <- GRanges(seq, IRanges(unlist(starts), unlist(ends)), strand)
seqinfo(result) <- seqinfo(gr)
if (withMcols) {
idx <- rep(seq_along(gr), lengths(starts))
mcols(result) = mcols(gr)[idx,,drop=FALSE]
}
result
}
invoked as
> windows(rngs, 2000)
If the approach is useful, consider asking follow-up questions on the Bioconductor support site.
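For readers without Bioconductor installed, the same tiling can be sketched in base R. The helper name `split_regions()` and the column names Chr, Start, End are assumptions taken from the question's example; the last piece of each region is simply truncated at the region's end.

```r
# Hypothetical helper: cut each region into consecutive pieces of at most `width`
split_regions <- function(regions, width) {
  pieces <- lapply(seq_len(nrow(regions)), function(i) {
    starts <- seq(regions$Start[i], regions$End[i], by = width)
    # each piece ends width - 1 after its start, but never past the region end
    ends <- pmin(starts + width - 1, regions$End[i])
    data.frame(Chr = regions$Chr[i], Start = starts, End = ends)
  })
  do.call(rbind, pieces)
}

regions <- data.frame(Chr = c(1, 2), Start = c(1, 1), End = c(10000, 5000))
tiles <- split_regions(regions, 2000)
tiles
```

This produces the eight rows listed in the question. Any extra columns in `regions` are not carried over in this sketch.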

Calculate number of singletons in R

I have some RNA-seq data and I need to calculate the number of singletons. We define a singleton as a read that does not have any other reads mapped close by (within 100 bases on either side).
I have a dataframe with the begin coordinate and the end coordinate of each read. I'm using R to do this.
I have written this code for the moment, but the apply is not correct and therefore is giving an error.
begin_end <- data.frame(begin_coordinate, final_coordinate)
apply(begin_end, 1, function(x) x[,1]-(x-1)[,2])
The first lines of the dataframe are:
> head(begin_end)
begin final
1 60507 60551
2 60790 60840
3 62004 62051
4 62819 62868
5 65141 65187
The first one seems to be a singleton because the next read starts more than 100 bases after it ends, and so do the rest of the reads in the first lines of the dataset. But the dataframe is long and I hope not all the reads are singletons.
Here's the same thing @jeremycg did with dplyr's lag and lead, but in data.table:
library(data.table)
setDT(begin_end)
begin_end[{
d = begin - shift(final, type="lag")
pmin(d, shift(d, type="lead"), na.rm=TRUE) > 100
}]
Comment. The basic data.table syntax is DT[i,j]. i is for filtering the input while j is for modifying the output.
We used i above, but to examine how it works, we can toss the relevant vectors into j:
begin_end[,{
d = begin - shift(final, type="lag")
d_lead = shift(d, type="lead")
my_pmin = pmin(d, d_lead, na.rm=TRUE)
c(.SD, list(d = d, d_lead = d_lead, my_pmin = my_pmin))
}]
# begin final d d_lead my_pmin
# 1: 60507 60551 NA 239 239
# 2: 60790 60840 239 1164 239
# 3: 62004 62051 1164 768 768
# 4: 62819 62868 768 2273 768
# 5: 65141 65187 2273 NA 2273
.SD is a list of column vectors already in the table, short for Subset of Data.
You seem to be trying to get the previous end value inside the apply using (x-1). Unfortunately, that doesn't work, because the function passed to apply sees only one row at a time.
Luckily, there is a function called lag (there are several, so I'll use the one from dplyr). This lets us lag a column by a given number of entries:
begin_end$space <- begin_end$begin - dplyr::lag(begin_end$final)
here's the output:
begin final space
1 60507 60551 NA
2 60790 60840 239
3 62004 62051 1164
4 62819 62868 768
5 65141 65187 2273
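Then you can flag reads that start at least 100 bases after the previous read ends (the first read has no predecessor, so its value is NA; the gap to the next read can be checked the same way with dplyr::lead):
begin_end$issingle <- begin_end$space >= 100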
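The lag/lead idea can also be written in base R without extra packages. This is a minimal sketch, assuming the reads are sorted by begin coordinate, lie on a single chromosome, and do not overlap one another; `gap_prev`, `gap_next`, `nearest`, and `singleton` are names introduced here for illustration.

```r
# First rows of the data frame shown in the question
begin_end <- data.frame(
  begin = c(60507, 60790, 62004, 62819, 65141),
  final = c(60551, 60840, 62051, 62868, 65187)
)

n <- nrow(begin_end)
# distance from the end of the previous read to the start of this one
gap_prev <- c(NA, begin_end$begin[-1] - begin_end$final[-n])
# distance from the end of this read to the start of the next one
gap_next <- c(gap_prev[-1], NA)
# nearest neighbour on either side; NA is ignored for the first and last read
nearest <- pmin(gap_prev, gap_next, na.rm = TRUE)

begin_end$singleton <- nearest > 100
sum(begin_end$singleton)  # number of singletons
```

For these five reads the nearest-neighbour distances are 239, 239, 768, 768, and 2273, so all five are counted as singletons.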
Using Bioconductor's GenomicRanges I think the idea would be to create a GRanges() (maybe from reading the data using GenomicAlignments::readGAlignments() or makeGRangesFromDataFrame()) from your reads, extend them in each direction using resize(), then use findOverlaps() to compare the extended reads against the original reads and identify singletons as the reads whose extended range overlaps only the read itself. Roughly
library(GenomicRanges)
gr = GRanges(seqnames="chr1",
IRanges(start=c(1000, 1150, 1500), width=100))
gr100 = resize(gr, width(gr) + 200, fix="center")
# query the extended reads against the original reads, so that only
# neighbours within 100 bases count as overlaps
hits = findOverlaps(gr100, gr)
gr[tabulate(queryHits(hits), queryLength(hits)) == 1]
leading to
> gr[tabulate(queryHits(hits), queryLength(hits)) == 1]
GRanges object with 1 range and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 [1500, 1599] *
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
This will be fast for millions of records.

Find the intersection of overlapping ranges in two tables using data.table function foverlaps

I would like to use foverlaps to find the intersecting ranges of two bed files, and collapse any rows containing overlapping ranges into a single row. In the example below I have two tables with genomic ranges. The tables are so-called "bed" files, which have zero-based start coordinates and one-based end positions of features in chromosomes. For example, START=9, STOP=20 is interpreted to span bases 10 through 20, inclusive. These bed files can contain millions of rows. The solution would need to give the same result regardless of the order in which the two files to be intersected are provided.
First Table
> table1
CHROMOSOME START STOP
1: 1 1 10
2: 1 20 50
3: 1 70 130
4: X 1 20
5: Y 5 200
Second Table
> table2
CHROMOSOME START STOP
1: 1 5 12
2: 1 15 55
3: 1 60 65
4: 1 100 110
5: 1 130 131
6: X 60 80
7: Y 1 15
8: Y 10 50
I was thinking that the new foverlaps function could be a very fast way to find the intersecting ranges in these two tables to produce a table that would look like:
Result Table:
> resultTable
CHROMOSOME START STOP
1: 1 5 10
2: 1 20 50
3: 1 100 110
4: Y 5 50
Is that possible, or is there a better way to do that in data.table?
I'd also like to first confirm that in one table, for any given CHROMOSOME, the STOP coordinate does not overlap with the start coordinate of the next row. For example, CHROMOSOME Y:1-15 and CHROMOSOME Y:10-50 would need to be collapsed to CHROMOSOME Y:1-50 (see Second Table Rows 7 and 8). This should not be the case, but the function should probably check for that. A real life example of how potential overlaps should be collapsed is below:
CHROM START STOP
1: 1 721281 721619
2: 1 721430 721906
3: 1 721751 722042
Desired output:
CHROM START STOP
1: 1 721281 722042
Functions to create example tables are as follows:
table1 <- data.table(
CHROMOSOME = as.character(c("1","1","1","X","Y")) ,
START = c(1,20,70,1,5) ,
STOP = c(10,50,130,20,200)
)
table2 <- data.table(
CHROMOSOME = as.character(c("1","1","1","1","1","X","Y","Y")) ,
START = c(5,15,60,100,130,60,1,10) ,
STOP = c(12,55,65,110,131,80,15,50)
)
@Seth provided the fastest way to solve the problem of intersecting overlaps using the data.table foverlaps function. However, that solution did not take into account the fact that the input bed files may contain overlapping ranges that need to be reduced into single regions. @Martin Morgan solved that with his solution using the GenomicRanges package, which did both the intersecting and the range reducing. However, Martin's solution didn't use the foverlaps function. @Arun pointed out that collapsing overlapping ranges in different rows within a table was not currently possible using foverlaps. Thanks to the answers provided, and some additional research on Stack Overflow, I came up with this hybrid solution.
Create example BED files without overlapping regions within each file.
chr <- c(1:22,"X","Y","MT")
#bedA contains 5 million rows
bedA <- data.table(
CHROM = as.vector(sapply(chr, function(x) rep(x,200000))),
START = rep(as.integer(seq(1,200000000,1000)),25),
STOP = rep(as.integer(seq(500,200000000,1000)),25),
key = c("CHROM","START","STOP")
)
#bedB contains 500 thousand rows
bedB <- data.table(
CHROM = as.vector(sapply(chr, function(x) rep(x,20000))),
START = rep(as.integer(seq(200,200000000,10000)),25),
STOP = rep(as.integer(seq(600,200000000,10000)),25),
key = c("CHROM","START","STOP")
)
Now create a new bed file containing the intersecting regions in bedA and bedB.
#This solution uses foverlaps
system.time(tmpA <- intersectBedFiles.foverlaps(bedA,bedB))
user system elapsed
1.25 0.02 1.37
#This solution uses GenomicRanges
system.time(tmpB <- intersectBedFiles.GR(bedA,bedB))
user system elapsed
12.95 0.06 13.04
identical(tmpA,tmpB)
[1] TRUE
Now, modify bedA and bedB such that they contain overlapping regions:
#Create overlapping ranges
makeOverlaps <- as.integer(c(0,0,600,0,0,0,600,0,0,0))
bedC <- bedA[, STOP := STOP + makeOverlaps, by=CHROM]
bedD <- bedB[, STOP := STOP + makeOverlaps, by=CHROM]
Test the time to intersect bed files with overlapping ranges using either the foverlaps or GenomicRanges functions.
#This solution uses foverlaps to find the intersection and then run GenomicRanges on the result
system.time(tmpC <- intersectBedFiles.foverlaps(bedC,bedD))
user system elapsed
1.83 0.05 1.89
#This solution uses GenomicRanges
system.time(tmpD <- intersectBedFiles.GR(bedC,bedD))
user system elapsed
12.95 0.04 12.99
identical(tmpC,tmpD)
[1] TRUE
The winner: foverlaps!
FUNCTIONS USED
This is the function based upon foverlaps, and will only call the GenomicRanges function (reduceBed.GenomicRanges) if there are overlapping ranges (which are checked for using the rowShift function).
intersectBedFiles.foverlaps <- function(bed1,bed2) {
require(data.table)
bedKey <- c("CHROM","START","STOP")
if(nrow(bed1)>nrow(bed2)) {
bed <- foverlaps(bed1, bed2, nomatch = 0)
} else {
bed <- foverlaps(bed2, bed1, nomatch = 0)
}
bed[, START := pmax(START, i.START)]
bed[, STOP := pmin(STOP, i.STOP)]
bed[, `:=`(i.START = NULL, i.STOP = NULL)]
if(!identical(key(bed),bedKey)) setkeyv(bed,bedKey)
if(any(bed[, STOP+1 >= rowShift(START), by=CHROM][,V1], na.rm = TRUE)) {
bed <- reduceBed.GenomicRanges(bed)
}
return(bed)
}
rowShift <- function(x, shiftLen = 1L) {
#Note this function was described in this thread:
#http://stackoverflow.com/questions/14689424/use-a-value-from-the-previous-row-in-an-r-data-table-calculation
r <- (1L + shiftLen):(length(x) + shiftLen)
r[r<1] <- NA
return(x[r])
}
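reduceBed.GenomicRanges <- function(bed) {
require(GenomicRanges)
# bedKey is local to intersectBedFiles.foverlaps, so define it here as well
bedKey <- c("CHROM","START","STOP")
setnames(bed,colnames(bed),bedKey)
if(!identical(key(bed),bedKey)) setkeyv(bed,bedKey)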
grBed <- makeGRangesFromDataFrame(bed,
seqnames.field = "CHROM",start.field="START",end.field="STOP")
grBed <- reduce(grBed)
grBed <- data.table(
CHROM=as.character(seqnames(grBed)),
START=start(grBed),
STOP=end(grBed),
key = c("CHROM","START","STOP"))
return(grBed)
}
This function uses only the GenomicRanges package and produces the same result, but is about 10-fold slower than the foverlaps function.
intersectBedFiles.GR <- function(bed1,bed2) {
require(data.table)
require(GenomicRanges)
bed1 <- makeGRangesFromDataFrame(bed1,
seqnames.field = "CHROM",start.field="START",end.field="STOP")
bed2 <- makeGRangesFromDataFrame(bed2,
seqnames.field = "CHROM",start.field="START",end.field="STOP")
grMerge <- suppressWarnings(intersect(bed1,bed2))
resultTable <- data.table(
CHROM=as.character(seqnames(grMerge)),
START=start(grMerge),
STOP=end(grMerge),
key = c("CHROM","START","STOP"))
return(resultTable)
}
An additional comparison using IRanges
I found a solution to collapse overlapping regions using IRanges but it is more than 10 fold slower than GenomicRanges.
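reduceBed.IRanges <- function(bed) {
# bedKey must be defined here; it is not visible from intersectBedFiles.foverlaps
bedKey <- c("CHROM","START","STOP")
bed.tmp <- bed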
bed.tmp[,group := {
ir <- IRanges(START, STOP);
subjectHits(findOverlaps(ir, reduce(ir)))
}, by=CHROM]
bed.tmp <- bed.tmp[, list(CHROM=unique(CHROM),
START=min(START),
STOP=max(STOP)),
by=list(group,CHROM)]
setkeyv(bed.tmp,bedKey)
bed[,group := NULL]
return(bed.tmp[, -(1:2)])
}
system.time(bedC.reduced <- reduceBed.GenomicRanges(bedC))
user system elapsed
10.86 0.01 10.89
system.time(bedD.reduced <- reduceBed.IRanges(bedC))
user system elapsed
137.12 0.14 137.58
identical(bedC.reduced,bedD.reduced)
[1] TRUE
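The collapsing step itself can be sketched in base R with a running maximum. The hypothetical helper `collapse_ranges()` below works on the START and STOP vectors of a single chromosome (it would need to be applied per chromosome) and merges only ranges that actually overlap; merging ranges that are merely adjacent, as the STOP+1 check above suggests, would need a small change to the comparison. It is an illustration of the idea, not the method used in the benchmarks.

```r
# Hypothetical helper: merge overlapping intervals on one chromosome
collapse_ranges <- function(starts, stops) {
  if (length(starts) == 0) return(data.frame(START = starts, STOP = stops))
  ord <- order(starts, stops)
  starts <- starts[ord]
  stops <- stops[ord]
  # Largest STOP seen before each row; a new group begins when START exceeds it
  prev_max <- c(-Inf, cummax(stops)[-length(stops)])
  group <- cumsum(starts > prev_max)
  data.frame(
    START = as.vector(tapply(starts, group, min)),
    STOP  = as.vector(tapply(stops, group, max))
  )
}

# The "real life" example from the question
collapsed <- collapse_ranges(
  starts = c(721281, 721430, 721751),
  stops  = c(721619, 721906, 722042)
)
collapsed
```

For the three example rows this returns the single range 721281 to 722042.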
foverlaps() will do nicely.
First set the keys for both of the tables:
setkey(table1, CHROMOSOME, START, STOP)
setkey(table2, CHROMOSOME, START, STOP)
Now join them using foverlaps() with nomatch = 0 so that only overlapping pairs of rows are kept.
resultTable <- foverlaps(table1, table2, nomatch = 0)
Next choose the appropriate values for START and STOP, and drop the extra columns.
resultTable[, START := pmax(START, i.START)]
resultTable[, STOP := pmin(STOP, i.STOP)]
resultTable[, `:=`(i.START = NULL, i.STOP = NULL)]
The overlapping STOP to a future START should be a different question. It's actually one that I have, so maybe I'll ask it and come back to it here when I have a good answer.
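What the join plus pmax()/pmin() steps compute can be mimicked in base R on the small example tables. This sketch pairs every row of one table with every row of the other on the same chromosome, so it is meant only to illustrate the logic, not to scale to millions of rows; the names `t1`, `t2` and the `.1`/`.2` suffixes are introduced here.

```r
t1 <- data.frame(
  CHROMOSOME = c("1", "1", "1", "X", "Y"),
  START = c(1, 20, 70, 1, 5),
  STOP  = c(10, 50, 130, 20, 200)
)
t2 <- data.frame(
  CHROMOSOME = c("1", "1", "1", "1", "1", "X", "Y", "Y"),
  START = c(5, 15, 60, 100, 130, 60, 1, 10),
  STOP  = c(12, 55, 65, 110, 131, 80, 15, 50)
)

# All pairs of rows that share a chromosome
pairs <- merge(t1, t2, by = "CHROMOSOME", suffixes = c(".1", ".2"))
# Keep pairs whose closed intervals overlap
pairs <- pairs[pairs$START.1 <= pairs$STOP.2 & pairs$START.2 <= pairs$STOP.1, ]

# The intersection runs from the later start to the earlier stop
result <- data.frame(
  CHROMOSOME = pairs$CHROMOSOME,
  START = pmax(pairs$START.1, pairs$START.2),
  STOP  = pmin(pairs$STOP.1, pairs$STOP.2)
)
result <- result[order(result$CHROMOSOME, result$START), ]
result
```

The two Y rows in the result (5 to 15 and 10 to 50) overlap each other, which is why a collapsing step is still needed to reach the desired 5 to 50.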
In case you're not stuck on a data.table solution, GenomicRanges
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("GenomicRanges")
gives
> library(GenomicRanges)
> intersect(makeGRangesFromDataFrame(table1), makeGRangesFromDataFrame(table2))
GRanges object with 5 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] 1 [ 5, 10] *
[2] 1 [ 20, 50] *
[3] 1 [100, 110] *
[4] 1 [130, 130] *
[5] Y [ 5, 50] *
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
In most overlapping ranges problems in genomics, we have one large data set x (usually sequenced reads) and another smaller data set y (usually the gene model, exons, introns etc.). We are tasked with finding which intervals in x overlap with which intervals in y or how many intervals in x overlap for each y interval.
In foverlaps(), we don't have to setkey() on the larger data set x, which is quite an expensive operation. But y needs to have its key set. For your case, from this example it seems like table2 is the larger one (x) and table1 is y.
require(data.table)
setkey(table1) # key columns = chr, start, end
ans = foverlaps(table2, table1, type="any", nomatch=0L)
ans[, `:=`(i.START = pmax(START, i.START),
i.STOP = pmin(STOP, i.STOP))]
ans = ans[, .(i.START[1L], i.STOP[.N]), by=.(CHROMOSOME, START, STOP)]
# CHROMOSOME START STOP V1 V2
# 1: 1 1 10 5 10
# 2: 1 20 50 20 50
# 3: 1 70 130 100 130
# 4: Y 5 200 5 50
But I agree it'd be great to be able to do this in one step. Not sure how yet, but maybe using additional values reduce and intersect for mult= argument.
Here's a solution entirely in data.table based on Pete's answer. It's actually slower than his solution that uses GenomicRanges and data.table, but still faster than the solution that uses only GenomicRanges.
intersectBedFiles.foverlaps2 <- function(bed1,bed2) {
require(data.table)
bedKey <- c("CHROM","START","STOP")
if(nrow(bed1)>nrow(bed2)) {
if(!identical(key(bed2),bedKey)) setkeyv(bed2,bedKey)
bed <- foverlaps(bed1, bed2, nomatch = 0)
} else {
if(!identical(key(bed1),bedKey)) setkeyv(bed1,bedKey)
bed <- foverlaps(bed2, bed1, nomatch = 0)
}
bed[,row_id:=1:nrow(bed)]
bed[, START := pmax(START, i.START)]
bed[, STOP := pmin(STOP, i.STOP)]
bed[, `:=`(i.START = NULL, i.STOP = NULL)]
setkeyv(bed,bedKey)
temp <- foverlaps(bed,bed)
temp[, `:=`(c("START","STOP"),list(min(START,i.START),max(STOP,i.STOP))),by=row_id]
temp[, `:=`(c("START","STOP"),list(min(START,i.START),max(STOP,i.STOP))),by=i.row_id]
out <- unique(temp[,.(CHROM,START,STOP)])
setkeyv(out,bedKey)
out
}
