I would like to use foverlaps to find the intersecting ranges of two bed files, and collapse any rows containing overlapping ranges into a single row. In the example below I have two tables with genomic ranges. The tables are "bed" files, which have zero-based start coordinates and one-based ending positions of features in chromosomes. For example, START=9, STOP=20 is interpreted to span bases 10 through 20, inclusive. These bed files can contain millions of rows. The solution would need to give the same result regardless of the order in which the two files to be intersected are provided.
First Table
> table1
CHROMOSOME START STOP
1: 1 1 10
2: 1 20 50
3: 1 70 130
4: X 1 20
5: Y 5 200
Second Table
> table2
CHROMOSOME START STOP
1: 1 5 12
2: 1 15 55
3: 1 60 65
4: 1 100 110
5: 1 130 131
6: X 60 80
7: Y 1 15
8: Y 10 50
I was thinking that the new foverlaps function could be a very fast way to find the intersecting ranges in these two tables and produce a table that would look like:
Result Table:
> resultTable
CHROMOSOME START STOP
1: 1 5 10
2: 1 20 50
3: 1 100 110
4: Y 5 50
Is that possible, or is there a better way to do that in data.table?
I'd also like to first confirm that, within a single table, for any given CHROMOSOME the STOP coordinate of one row does not overlap the START coordinate of the next row. For example, CHROMOSOME Y:1-15 and CHROMOSOME Y:10-50 would need to be collapsed to CHROMOSOME Y:1-50 (see Second Table rows 7 and 8). This should not normally be the case, but the function should probably check for it. A real-life example of how potential overlaps should be collapsed is below (a small data.table sketch of this collapsing step follows the desired output):
CHROM START STOP
1: 1 721281 721619
2: 1 721430 721906
3: 1 721751 722042
Desired output:
CHROM START STOP
1: 1 721281 722042
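For reference, here is a minimal data.table-only sketch of that collapsing (reduce) step applied to the real-life example above. This is just an illustration, not part of any answer below; it assumes data.table >= 1.9.6 for shift().
library(data.table)
dt <- data.table(CHROM = "1",
                 START = c(721281L, 721430L, 721751L),
                 STOP  = c(721619L, 721906L, 722042L))
setkey(dt, CHROM, START, STOP)
# a new group starts whenever a row begins after the running maximum STOP of all previous rows
dt[, grp := cumsum(START > cummax(shift(STOP, fill = STOP[1L]))), by = CHROM]
dt[, .(START = min(START), STOP = max(STOP)), by = .(CHROM, grp)][, grp := NULL][]
#    CHROM  START   STOP
# 1:     1 721281 722042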
Functions to create example tables are as follows:
table1 <- data.table(
CHROMOSOME = as.character(c("1","1","1","X","Y")) ,
START = c(1,20,70,1,5) ,
STOP = c(10,50,130,20,200)
)
table2 <- data.table(
CHROMOSOME = as.character(c("1","1","1","1","1","X","Y","Y")) ,
START = c(5,15,60,100,130,60,1,10) ,
STOP = c(12,55,65,110,131,80,15,50)
)
@Seth provided the fastest way to solve the problem of intersection overlaps using the data.table foverlaps function. However, this solution did not take into account the fact that the input bed files may have overlapping ranges that need to be reduced into single regions. @Martin Morgan solved that with his solution using the GenomicRanges package, which did both the intersecting and the range reducing. However, Martin's solution didn't use the foverlaps function. @Arun pointed out that handling overlapping ranges in different rows within a table was not currently possible with foverlaps. Thanks to the answers provided, and some additional research on Stack Overflow, I came up with this hybrid solution.
Create example BED files without overlapping regions within each file.
chr <- c(1:22,"X","Y","MT")
#bedA contains 5 million rows
bedA <- data.table(
CHROM = as.vector(sapply(chr, function(x) rep(x,200000))),
START = rep(as.integer(seq(1,200000000,1000)),25),
STOP = rep(as.integer(seq(500,200000000,1000)),25),
key = c("CHROM","START","STOP")
)
#bedB contains 500 thousand rows
bedB <- data.table(
CHROM = as.vector(sapply(chr, function(x) rep(x,20000))),
START = rep(as.integer(seq(200,200000000,10000)),25),
STOP = rep(as.integer(seq(600,200000000,10000)),25),
key = c("CHROM","START","STOP")
)
Now create a new bed file containing the intersecting regions in bedA and bedB.
#This solution uses foverlaps
system.time(tmpA <- intersectBedFiles.foverlaps(bedA,bedB))
user system elapsed
1.25 0.02 1.37
#This solution uses GenomicRanges
system.time(tmpB <- intersectBedFiles.GR(bedA,bedB))
user system elapsed
12.95 0.06 13.04
identical(tmpA,tmpB)
[1] TRUE
Now, modify bedA and bedB such that they contain overlapping regions:
#Create overlapping ranges
makeOverlaps <- as.integer(c(0,0,600,0,0,0,600,0,0,0))
bedC <- bedA[, STOP := STOP + makeOverlaps, by=CHROM]
bedD <- bedB[, STOP := STOP + makeOverlaps, by=CHROM]
Test the time to intersect bed files with overlapping ranges using either the foverlaps or GenomicRanges functions.
#This solution uses foverlaps to find the intersection and then run GenomicRanges on the result
system.time(tmpC <- intersectBedFiles.foverlaps(bedC,bedD))
user system elapsed
1.83 0.05 1.89
#This solution uses GenomicRanges
system.time(tmpD <- intersectBedFiles.GR(bedC,bedD))
user system elapsed
12.95 0.04 12.99
identical(tmpC,tmpD)
[1] TRUE
The winner: foverlaps!
FUNCTIONS USED
This is the function based upon foverlaps; it only calls the GenomicRanges function (reduceBed.GenomicRanges) if there are overlapping ranges within a table, which are checked for using the rowShift function.
intersectBedFiles.foverlaps <- function(bed1,bed2) {
require(data.table)
bedKey <- c("CHROM","START","STOP")
if(nrow(bed1)>nrow(bed2)) {
bed <- foverlaps(bed1, bed2, nomatch = 0)
} else {
bed <- foverlaps(bed2, bed1, nomatch = 0)
}
bed[, START := pmax(START, i.START)]
bed[, STOP := pmin(STOP, i.STOP)]
bed[, `:=`(i.START = NULL, i.STOP = NULL)]
if(!identical(key(bed),bedKey)) setkeyv(bed,bedKey)
if(any(bed[, STOP+1 >= rowShift(START), by=CHROM][,V1], na.rm = T)) {
bed <- reduceBed.GenomicRanges(bed)
}
return(bed)
}
rowShift <- function(x, shiftLen = 1L) {
#Note this function was described in this thread:
#http://stackoverflow.com/questions/14689424/use-a-value-from-the-previous-row-in-an-r-data-table-calculation
r <- (1L + shiftLen):(length(x) + shiftLen)
r[r<1] <- NA
return(x[r])
}
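For clarity, a quick example of what rowShift() returns with the default shiftLen = 1L (the next row's value, with NA padding at the end):
rowShift(c(1, 15, 60))
# [1] 15 60 NA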
reduceBed.GenomicRanges <- function(bed) {
require(GenomicRanges)
bedKey <- c("CHROM","START","STOP") # defined locally; the bedKey inside intersectBedFiles.foverlaps is not in scope here
setnames(bed,colnames(bed),bedKey)
if(!identical(key(bed),bedKey)) setkeyv(bed,bedKey)
grBed <- makeGRangesFromDataFrame(bed,
seqnames.field = "CHROM",start.field="START",end.field="STOP")
grBed <- reduce(grBed)
grBed <- data.table(
CHROM=as.character(seqnames(grBed)),
START=start(grBed),
STOP=end(grBed),
key = c("CHROM","START","STOP"))
return(grBed)
}
This function uses only the GenomicRanges package. It produces the same result, but is about 10-fold slower than the foverlaps-based function.
intersectBedFiles.GR <- function(bed1,bed2) {
require(data.table)
require(GenomicRanges)
bed1 <- makeGRangesFromDataFrame(bed1,
seqnames.field = "CHROM",start.field="START",end.field="STOP")
bed2 <- makeGRangesFromDataFrame(bed2,
seqnames.field = "CHROM",start.field="START",end.field="STOP")
grMerge <- suppressWarnings(intersect(bed1,bed2))
resultTable <- data.table(
CHROM=as.character(seqnames(grMerge)),
START=start(grMerge),
STOP=end(grMerge),
key = c("CHROM","START","STOP"))
return(resultTable)
}
An additional comparison using IRanges
I found a solution to collapse overlapping regions using IRanges but it is more than 10 fold slower than GenomicRanges.
reduceBed.IRanges <- function(bed) {
require(IRanges)
bedKey <- c("CHROM","START","STOP") # needed for setkeyv() below; not inherited from the caller
bed.tmp <- bed
bed.tmp[,group := {
ir <- IRanges(START, STOP);
subjectHits(findOverlaps(ir, reduce(ir)))
}, by=CHROM]
bed.tmp <- bed.tmp[, list(CHROM=unique(CHROM),
START=min(START),
STOP=max(STOP)),
by=list(group,CHROM)]
setkeyv(bed.tmp,bedKey)
bed[,group := NULL]
return(bed.tmp[, -(1:2)])
}
system.time(bedC.reduced <- reduceBed.GenomicRanges(bedC))
user system elapsed
10.86 0.01 10.89
system.time(bedD.reduced <- reduceBed.IRanges(bedC))
user system elapsed
137.12 0.14 137.58
identical(bedC.reduced,bedD.reduced)
[1] TRUE
foverlaps() will do nicely.
First set the keys for both of the tables:
setkey(table1, CHROMOSOME, START, STOP)
setkey(table2, CHROMOSOME, START, STOP)
Now join them using foverlaps() with nomatch = 0 to drop rows of table1 that have no overlapping range in table2.
resultTable <- foverlaps(table1, table2, nomatch = 0)
Next choose the appropriate values for START and STOP, and drop the extra columns.
resultTable[, START := pmax(START, i.START)]
resultTable[, STOP := pmin(STOP, i.STOP)]
resultTable[, `:=`(i.START = NULL, i.STOP = NULL)]
The issue of a STOP overlapping a subsequent START should be a different question. It's actually one that I have, so maybe I'll ask it and come back to it here when I have a good answer.
In case you're not stuck on a data.table solution, GenomicRanges handles this directly:
source("http://bioconductor.org/biocLite.R")
biocLite("GenomicRanges")
gives
> library(GenomicRanges)
> intersect(makeGRangesFromDataFrame(table1), makeGRangesFromDataFrame(table2))
GRanges object with 5 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] 1 [ 5, 10] *
[2] 1 [ 20, 50] *
[3] 1 [100, 110] *
[4] 1 [130, 130] *
[5] Y [ 5, 50] *
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
In most overlapping ranges problems in genomics, we have one large data set x (usually sequenced reads) and another smaller data set y (usually the gene model, exons, introns etc.). We are tasked with finding which intervals in x overlap with which intervals in y or how many intervals in x overlap for each y interval.
In foverlaps(), we don't have to setkey() on the larger data set x - it's quite an expensive operation. But y needs to have its key set. For your case, from this example it seems like table2 is larger (= x) and table1 = y.
require(data.table)
setkey(table1) # key columns = chr, start, end
ans = foverlaps(table2, table1, type="any", nomatch=0L)
ans[, `:=`(i.START = pmax(START, i.START),
i.STOP = pmin(STOP, i.STOP))]
ans = ans[, .(i.START[1L], i.STOP[.N]), by=.(CHROMOSOME, START, STOP)]
# CHROMOSOME START STOP V1 V2
# 1: 1 1 10 5 10
# 2: 1 20 50 20 50
# 3: 1 70 130 100 130
# 4: Y 5 200 5 50
But I agree it'd be great to be able to do this in one step. Not sure how yet, but maybe by allowing additional values such as reduce and intersect for the mult= argument.
Here's a solution entirely in data.table based on Pete's answer. It's actually slower than his solution that uses GenomicRanges and data.table, but still faster than the solution that uses only GenomicRanges.
intersectBedFiles.foverlaps2 <- function(bed1,bed2) {
require(data.table)
bedKey <- c("CHROM","START","STOP")
if(nrow(bed1)>nrow(bed2)) {
if(!identical(key(bed2),bedKey)) setkeyv(bed2,bedKey)
bed <- foverlaps(bed1, bed2, nomatch = 0)
} else {
if(!identical(key(bed1),bedKey)) setkeyv(bed1,bedKey)
bed <- foverlaps(bed2, bed1, nomatch = 0)
}
bed[,row_id:=1:nrow(bed)]
bed[, START := pmax(START, i.START)]
bed[, STOP := pmin(STOP, i.STOP)]
bed[, `:=`(i.START = NULL, i.STOP = NULL)]
setkeyv(bed,bedKey)
temp <- foverlaps(bed,bed)
temp[, `:=`(c("START","STOP"),list(min(START,i.START),max(STOP,i.STOP))),by=row_id]
temp[, `:=`(c("START","STOP"),list(min(START,i.START),max(STOP,i.STOP))),by=i.row_id]
out <- unique(temp[,.(CHROM,START,STOP)])
setkeyv(out,bedKey)
out
}
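As a quick, hypothetical consistency check (assuming bedC, bedD and tmpC from the benchmarks above are still in the workspace):
tmpE <- intersectBedFiles.foverlaps2(bedC, bedD)
identical(tmpE, tmpC)
# expected TRUE if the three approaches agree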
Last week I asked this question: How to efficiently calculate distance between GPS points in one dataset and GPS points in another data set using data.table, about matching the GPS coordinates of a GPS tracker with the GPS coordinates of bus stops. User Hugh gave an answer that works many times faster than the solution I eventually came up with. However, there is one problem I cannot get fixed in his code (I feel like I really tried all possible variations of the code): I need a rolling join in which multiple bus stops that are near a GPS point get listed. In the current code only the nearest one is given, but a GPS coordinate can of course be near multiple bus stops. I know this means I will have to turn around the Y and the X in the rolling join somewhere, but somehow it just doesn't work. Can someone maybe point me towards a solution?
All the code and full explanation can be found in the original question, but this is how I create the data:
# create GPS data
number_of_GPS_coordinates <- 1000000
set.seed(1)
gpsdata<-data.frame(id=1:number_of_GPS_coordinates,
device_id = 1,
latitude=runif(number_of_GPS_coordinates,50.5,53.5),
longitude=runif(number_of_GPS_coordinates,4,7))
# create Bus stop data
set.seed(1)
number_of_bus_stops <- 55000
busdata<-data.frame(id=1:number_of_bus_stops,
name=replicate(number_of_bus_stops, paste(sample(LETTERS, 15, replace=TRUE), collapse="")),
latitude=runif(number_of_bus_stops,50.5,53.5),
longitude=runif(number_of_bus_stops,4,7))
And this is the function that does work for me, but is still not very fast (much faster than my first attempt):
check_if_close <- function(dataset1 = GPS.Utrecht.to.Gouda,
dataset2 = bus_stops,
n.splits = 500,
desired.dist = .2){
# dataset1 needs at least the columns
# - "id",
# - "device_id"
# - "latitude"
# - "longitude"
# dataset2 needs at least the columns
# - "id",
# - "name"
# - "latitude"
# - "longitude"
# these are the average coordinates of the Netherlands. A change of .0017 in latitude leads to a change of 189 meters
# spDistsN1(matrix(c(5.2913, 52.1326), ncol=2), matrix(c(5.2913, 52.1326+.0017), ncol=2), longlat=TRUE)*1000
# [1] 189.1604
# this means that the latitude slices we cut (the subsection of) the Netherlands into have to be at least .0017 wide.
# if we look at the Netherlands as a whole, this would mean we can use at most (53.5-50.5)/.0017 = 1765 slices.
# if we look only at a small subsection (because we are only looking at a single trip, for example) we need far fewer slices.
# 1) we only select the variables we need from dataset 1
dataset1 <- setDT(dataset1)[,c("id", "device_id", "latitude", "longitude")]
setnames(dataset1, old = c("id", "latitude", "longitude"), new = c("id_dataset1", "latitude_gps", "longitude_gps"))
# 2) we only select the variables we need from dataset 2
dataset2 <- setDT(dataset2)[,c("id", "name", "latitude", "longitude")]
setnames(dataset2, old = c("id", "latitude", "longitude"), new = c("id_dataset2", "latitude_feature", "longitude_feature"))
# 3) only keep the subset of dataset2 that falls within dataset1.
# There is no reason to check whether features are close if they already fall outside the GPS coordinates of the trip we want to check.
# We do add a 0.01 point margin around it to be on the safe side: maybe a feature falls just outside the GPS coordinates,
# but is still near to a GPS point
dataset2 <- dataset2[latitude_feature %between% (range(dataset1$latitude_gps) + c(-0.01, +0.01))
& longitude_feature %between% (range(dataset1$longitude_gps) + c(-0.01, +0.01)), ]
# 4) we cut the dataset2 into slices on the latitude dimension
# some trial and error is involved in getting the right amount. if you use too many you get a large and redundant amount of empty values
# if you use too few, you need to check too many GPS-to-feature distances per slice
dataset2[, range2 := as.numeric(Hmisc::cut2(dataset2$latitude_feature, g=n.splits))]
# 5) calculate the ranges of the slices we just created
ranges <- dataset2[,list(Min=min(latitude_feature), Max= max(latitude_feature)), by=range2][order(range2)]
setnames(ranges, old = c("range2", "Min", "Max"), new = c("latitude_range", "start", "end"))
# 6) now we assign to which slice every GPS coordinate in our dataset1 belongs
# this is super fast when using data.table grammar
elements1 <- dataset1$latitude_gps
ranges <- setDT(ranges)[data.table(elements1), on = .(start <= elements1, end >=elements1)]
ranges[, rowID := seq_len(.N)]
dataset1[,rowID := seq_len(.N)]
setkey(dataset1, rowID)
setkey(ranges, rowID)
dataset1<-dataset1[ranges]
# 7) this is the actual function we use to check if a datapoint is nearby.
# potentially there are faster functions to do this??
checkdatapoint <- function(p, h, dist=desired.dist) {
distances <- spDistsN1(data.matrix(filter(dataset1,latitude_range==h)[,c("longitude_gps","latitude_gps")]),
p,
longlat=TRUE) # in km
return(which(distances <= dist)) # distance is now set to 200 meters
}
# 8) we assign an ID to dataset1, starting again at every slice.
# we need this to later match the data again
dataset1[, ID2 := sequence(.N), by = latitude_range]
# 9) here we loop over all the splits and for every point check if there is a feature nearby in the slice it falls in
# to be on the safe side we also check the slices left and right of it, just to make sure we do not miss features that
# are nearby, but just fall in a different slice.
# 9a: create an empty list we fill with dataframes later
TT<-vector("list", length=n.splits)
# 9b: loop over the number of slices using above defined function
for(i in 1:n.splits){
datapoints.near.feature<-apply(data.matrix(dataset2[range2 %in% c(i-1,i, i+1), c("longitude_feature","latitude_feature")]), 1, checkdatapoint, h=i)
# 9c: if in that slice there was no match between a GPS coordinate and a nearby feature, we create an empty list entry
if(class(datapoints.near.feature)=="integer"|class(datapoints.near.feature)=="matrix"){
TT[[i]] <-NULL
} else {
# 9d: if there was a match we get a list of data points that are named
names(datapoints.near.feature) <- dataset2[range2 %in% c(i-1,i, i+1), name]
# 9e: then we 'melt' this list into a data.frame
temp <- melt(datapoints.near.feature)
# 9f: then we transform it into a data.table and change the names
setDT(temp)
setnames(temp, old=c("value", "L1"), new= c("value", "feature_name"))
# 9h: then we only select the data points in dataset1 that fall in the current slice, give them an
# ID and merge them with the file of nearby bus stops
gpsdata.f <- dataset1[latitude_range==i, ]
gpsdata.f[, rowID2 := seq_len(.N)]
setkey(gpsdata.f, key = "rowID2")
setkey(temp, key = "value")
GPS.joined.temp <- merge(x = gpsdata.f, y = temp, by.x= "rowID2", by.y= "value", all.x=TRUE)
# 9i: we only keep the unique entries and for every slice save them to the list
GPS.joined.unique.temp <- unique(GPS.joined.temp, by=c("id_dataset1", "feature_name"))
TT[[i]] <- GPS.joined.unique.temp
cat(paste0(round(i/n.splits*100), '% completed'), " \r"); flush.console()
#cat(i/n.splits*100, " \r"); flush.console()
}
}
# 10) now we left join the original dataset and the data points that are near a feature
finallist<- merge(x = dataset1,
y = rbindlist(TT[vapply(TT, Negate(is.null), NA)]),
by.x= "id_dataset1",
by.y= "id_dataset1",
all.x=TRUE)
# 11) we add a new logical variable to check if any bus stop is near
finallist[, nearby := TRUE][is.na(feature_name), nearby := FALSE] # add a dummy to check if any bus stop is nearby.
# 12) if a point is near multiple features at once these are listed in a vector,
# instead of having duplicate rows with the same id but different features
finallist <- unique(setDT(finallist)[order(id_dataset1, feature_name), list(feature_name=list(feature_name), id=id_dataset1, lat=latitude_gps.x, lon=longitude_gps.x, nearby=nearby), by=id_dataset1], by="id_dataset1")
return(finallist)
}
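A hypothetical call with the example data created above. Note that the function body relies on packages it does not load itself (sp for spDistsN1(), dplyr for filter(), reshape2 for melt() on a named list, and Hmisc, which is called as Hmisc::cut2()), so these must be installed:
library(data.table)
library(sp)       # spDistsN1()
library(dplyr)    # filter()
library(reshape2) # melt() on a named list
nearby <- check_if_close(dataset1 = gpsdata,
                         dataset2 = busdata,
                         n.splits = 500,
                         desired.dist = .2)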
This is the code @Hugh proposed, which is super fast and almost does the same thing, except that multiple bus stops near a point are not listed, only the nearest one.
# create GPS data
number_of_GPS_coordinates <- 20000
set.seed(1)
gpsdata<-as.data.frame(cbind(id=1:number_of_GPS_coordinates,
lat=runif(number_of_GPS_coordinates,50.5,53.5),
lon=runif(number_of_GPS_coordinates,4,7)))
# create some busstop data. In this case only 2000 bus stops
set.seed(1)
number_of_bus_stops <- 2000
stop<-as.data.frame(gpsdata[sample(nrow(gpsdata), number_of_bus_stops), -1]) # of course do not keep id variable
stop$lat<-stop$lat+rnorm(number_of_bus_stops,0,.0005)
stop$lon<-stop$lon+rnorm(number_of_bus_stops,0,.0005)
busdata.data<-cbind(stop, name=replicate(number_of_bus_stops, paste(sample(LETTERS, 15, replace=TRUE), collapse="")))
names(busdata.data) <- c("latitude_bustops", "longitude_bustops", "name")
library(data.table)
library(hutils)
setDT(gpsdata)
setDT(busdata.data)
gps_orig <- copy(gpsdata)
busdata.orig <- copy(busdata.data)
setkey(gpsdata, lat)
# Just to take note of the originals
gpsdata[, gps_lat := lat + 0]
gpsdata[, gps_lon := lon + 0]
busdata.data[, lat := latitude_bustops + 0]
busdata.data[, lon := longitude_bustops + 0]
setkey(busdata.data, lat)
gpsID_by_lat <-
gpsdata[, .(id), keyby = "lat"]
By_latitude <-
busdata.data[gpsdata,
on = "lat",
# within 0.5 degrees of latitude
roll = 0.5,
# +/-
rollends = c(TRUE, TRUE),
# and remove those beyond 0.5 degrees
nomatch=0L] %>%
.[, .(id_lat = id,
name_lat = name,
bus_lat = latitude_bustops,
bus_lon = longitude_bustops,
gps_lat,
gps_lon),
keyby = .(lon = gps_lon)]
setkey(busdata.data, lon)
By_latlon <-
busdata.data[By_latitude,
on = c("name==name_lat", "lon"),
# within 0.5 degrees of latitude
roll = 0.5,
# +/-
rollends = c(TRUE, TRUE),
# and remove those beyond 0.5 degrees
nomatch=0L]
By_latlon[, distance := haversine_distance(lat1 = gps_lat,
lon1 = gps_lon,
lat2 = bus_lat,
lon2 = bus_lon)]
By_latlon[distance < 0.2]
Can someone help me adjust the code in such a way that not only the nearest bus stop, but all stops within 200 meters are given (in a list, or in separate rows)?
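Not from the original thread, but as a rough sketch of one way to get all stops within 200 m (rather than only the nearest): a non-equi join on a small latitude/longitude bounding box around each GPS point, followed by an exact haversine filter. Column names are taken from the gpsdata / busdata.data objects built above; the degree margins and the haversine_km helper are my own assumptions, and the join syntax needs data.table >= 1.9.8.
library(data.table)
# plain haversine distance in km (no external package assumed)
haversine_km <- function(lat1, lon1, lat2, lon2) {
  rad <- pi / 180
  a <- sin((lat2 - lat1) * rad / 2)^2 +
       cos(lat1 * rad) * cos(lat2 * rad) * sin((lon2 - lon1) * rad / 2)^2
  6371 * 2 * asin(pmin(1, sqrt(a)))
}
# 0.2 km expressed in degrees, made generous so the box never cuts off a true match
margin_lat <- 0.2 / 110   # ~110-111 km per degree of latitude
margin_lon <- 0.2 / 65    # ~111 * cos(53.5 deg) km per degree of longitude in NL
bus <- copy(busdata.data)
bus[, `:=`(lat_lo = latitude_bustops - margin_lat, lat_hi = latitude_bustops + margin_lat,
           lon_lo = longitude_bustops - margin_lon, lon_hi = longitude_bustops + margin_lon)]
# non-equi join: every (GPS point, candidate stop) pair whose box contains the point
pairs <- bus[gpsdata,
             on = .(lat_lo <= lat, lat_hi >= lat, lon_lo <= lon, lon_hi >= lon),
             .(id = i.id, gps_lat = i.lat, gps_lon = i.lon,
               name, bus_lat = latitude_bustops, bus_lon = longitude_bustops),
             nomatch = 0L, allow.cartesian = TRUE]
pairs[, distance := haversine_km(gps_lat, gps_lon, bus_lat, bus_lon)]
all_within_200m <- pairs[distance < 0.2]   # one row per (GPS point, stop) pair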
I have data consisting of tree growth measurements (diameter and height) for trees at known X & Y coordinates. I'd like to determine the distance to each tree's nearest neighbor of equal or greater size.
I've seen other SE questions asking about nearest neighbor calculations (e.g., see here, here, here, here, etc.), but none specify constraints on the nearest neighbor to be searched.
Is there a function (or other work around) that would allow me to determine the distance of a point's nearest neighbor given that nearest point meets some criteria (e.g., must be equal to or greater in size than the point of interest)?
[An even more complex set of constraints would be even more helpful...]
For my example: specifying that the nearest neighbor must also be in the same plot as the tree of interest, or must be the same species as the tree of interest.
I'd do it with non-equijoins and data.table
EDIT: (fyi, this requires data.table 1.9.7, which you can get from github)
EDIT2: did it with a copy of the data.table, since it seems like it was joining on its own thresholds. I'll fix that in the future, but this works for now.
library(data.table)
dtree <- data.table(id = 1:1000,
x = runif(1000),
y = runif(1000),
height = rnorm(1000,mean = 100,sd = 10),
species = sample(LETTERS[1:3],1000,replace = TRUE),
plot = sample(1:3,1000, replace = TRUE))
dtree_self <- copy(dtree)
dtree_self[,thresh1 := height + 10]
dtree_self[,thresh2 := height - 10]
# Join on a range, must be a cartesian join, since there are many candidates
test <- dtree[dtree_self, on = .(height >= thresh2,
height <= thresh1),
allow.cartesian = TRUE]
# Calculate the (squared Euclidean) distance
test[, dist := (x - i.x)**2 + (y - i.y)**2]
# Exclude identical matches and
# Take the minimum distance grouped by id
final <- test[id != i.id, .SD[which.min(dist)],by = id]
The final dataset contains each pair, according to the given thresholds.
EDIT:
With Additional variables:
If you want to join on additional parameters, this allows you to do it, (It's probably even faster if you additionally join on things like plot or species, since the cartesian join will be smaller)
Here's an example joining on two additional categorical variables, species and plot:
library(data.table)
dtree <- data.table(id = 1:1000,
x = runif(1000),
y = runif(1000),
height = rnorm(1000,mean = 100,sd = 10),
species = sample(LETTERS[1:3],1000,replace = TRUE),
plot = sample(1:3,1000, replace = TRUE))
dtree_self <- copy(dtree)
dtree_self[,thresh1 := height + 10]
dtree_self[,thresh2 := height - 10]
# Join on a range, must be a cartesian join, since there are many candidates
test <- dtree[dtree_self, on = .(height >= thresh2,
height <= thresh1,
species == species,
plot == plot),
nomatch = NA,
allow.cartesian = TRUE]
# Calculate the (squared Euclidean) distance
test[, dist := (x - i.x)**2 + (y - i.y)**2]
# Exclude identical matches and
# Take the minimum distance grouped by id
final <- test[id != i.id, .SD[which.min(dist)],by = id]
final
> final
id x y height species plot height.1 i.id i.x i.y i.height dist
1: 3 0.4837348 0.4325731 91.53387 C 2 111.53387 486 0.5549221 0.4395687 101.53387 0.005116568
2: 13 0.8267298 0.3137061 94.58949 C 2 114.58949 754 0.8408547 0.2305702 104.58949 0.007111079
3: 29 0.2905729 0.4952757 89.52128 C 2 109.52128 333 0.2536760 0.5707272 99.52128 0.007054301
4: 37 0.4534841 0.5249862 89.95493 C 2 109.95493 72 0.4807242 0.6056771 99.95493 0.007253044
5: 63 0.1678515 0.8814829 84.77450 C 2 104.77450 289 0.1151764 0.9728488 94.77450 0.011122404
---
994: 137 0.8696393 0.2226888 66.57792 C 2 86.57792 473 0.4467795 0.6881008 76.57792 0.395418724
995: 348 0.3606249 0.1245749 110.14466 A 2 130.14466 338 0.1394011 0.1200064 120.14466 0.048960849
996: 572 0.6562758 0.1387882 113.61821 A 2 133.61821 348 0.3606249 0.1245749 123.61821 0.087611511
997: 143 0.9170504 0.1171652 71.39953 C 3 91.39953 904 0.6954973 0.3690599 81.39953 0.112536771
998: 172 0.6834473 0.6221259 65.52187 A 2 85.52187 783 0.4400028 0.9526355 75.52187 0.168501816
>
NOTE: in the final answer there are columns height and height.1; these appear to result from data.table's non-equi join and represent the upper and lower join boundaries, respectively.
A Mem-efficient solution
One of the issues here for @theforestecologist was that this requires a lot of memory
(in that case, there were an additional 42 columns being multiplied by the cartesian join, which caused memory issues).
However, we can do this in a more memory-efficient way by using .EACHI (I believe), since we will not load the full table into memory. That solution follows:
library(data.table)
dtree <- data.table(id = 1:1000,
x = runif(1000),
y = runif(1000),
height = rnorm(1000,mean = 100,sd = 10),
species = sample(LETTERS[1:3],1000,replace = TRUE),
plot = sample(1:3,1000, replace = TRUE))
dtree_self <- copy(dtree)
dtree_self[,thresh1 := height + 10]
dtree_self[,thresh2 := height - 10]
# In order to navigate the sometimes unusual nature of scoping inside a
# data.table join, I set the second table to have its own uniquely named id
dtree_self[,id2 := id]
dtree_self[,id := NULL]
# for clarity inside the brackets,
# I define the squared euclid distance
eucdist <- function(x,xx,y,yy) (x - xx)**2 + (y - yy)**2
# Join on a range, must be a cartesian join, since there are many candidates
# Return a table of matches, using .EACHI to keep from loading too much into mem
test <- dtree[dtree_self, on = .(height >= thresh2,
height <= thresh1,
species,
plot),
.(id2, id[{z = eucdist(x,i.x,y,i.y); mz <- min(z[id2 != id]); mz == z}]),
by = .EACHI,
nomatch = NA,
allow.cartesian = TRUE]
# join the metadata back onto each id
test <- dtree[test, on = .(id = V2), nomatch = NA]
test <- dtree[test, on = .(id = id2), nomatch = NA]
> test
id x y height species plot i.id i.x i.y i.height i.species i.plot i.height.2 i.height.1 i.species.1 i.plot.1
1: 1 0.17622235 0.66547312 84.68450 B 2 965 0.17410840 0.63219350 93.60226 B 2 74.68450 94.68450 B 2
2: 2 0.04523011 0.33813054 89.46288 B 2 457 0.07267547 0.35725229 88.42827 B 2 79.46288 99.46288 B 2
3: 3 0.24096368 0.32649256 103.85870 C 3 202 0.20782303 0.38422814 94.35898 C 3 93.85870 113.85870 C 3
4: 4 0.53160655 0.06636979 101.50614 B 1 248 0.47382417 0.01535036 103.74101 B 1 91.50614 111.50614 B 1
5: 5 0.83426727 0.55380451 101.93408 C 3 861 0.78210747 0.52812487 96.71422 C 3 91.93408 111.93408 C 3
This way we should keep total memory usage low.
I want to match the elements of two unequal-length columns from two different data frames if they fall within the range 1 to 3 (2 +/- 1).
My data frames:
dat1:
Number status
10023 T
10324 F
12277 F
12888 T
12000 T
dat2:
Number status
10020 T
10002 F
12279 F
12888 T
Required output:
10023 10020 T
12277 12279 F
My attempt (below) did not work:
diff <- 2
allow <- 1
NewData <- dat1$Number %in% (dat2$Number<=diff+allow | dat2$Number>=diff+allow)
Help will be appreciated.
This looks like a natural use case for data.table::foverlaps to me.
The workflow is to create start and end columns within both data sets, building the actual tolerance range only in the second data set. Then we key both and simply run foverlaps:
library(data.table)
diff <- 2
allow <- 1
setDT(dat1)[, `:=`(start = Number, end = Number)]
setkey(dat1, status, start, end)
setDT(dat2)[, `:=`(start = Number - (diff + allow), end = Number + diff + allow)]
setkey(dat2, status, start, end)
foverlaps(dat2, dat1, nomatch = 0L)[, .(Numdf1 = Number, Numdf2 = i.Number, status)]
# Numdf1 Numdf2 status
# 1: 12277 12279 FALSE
# 2: 10023 10020 TRUE
# 3: 12888 12888 TRUE ### <- I'm assuming you had an error in the desired output
I have a large dataset with 1,008,412 observations.
The columns are customer_id (int), visit_date (Date, format "2010-04-04"), and visit_spend (float).
This date function for the aggregate maps week numbers of interest to the range 13-65:
weekofperiod <- function(dt) {
as.numeric(format(as.Date(dt), "%W")) + 52 * (as.numeric(format(as.Date(dt), "%Y"))-2010)
}
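As a quick sanity check of this mapping, the boundary dates of the period used below (early April 2010 through the end of March 2011) land on weeks 13 and 65, matching the stated range:
weekofperiod("2010-04-01")   # 13
weekofperiod("2011-03-31")   # 65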
Each customer_id has a variable number of total visits over a 53-week period.
For each customer_id, I want to get the aggregate of spend_per_week, by weekofperiod().
The code below is functionally correct but very slow - any comments on how to make it faster?
Also, aggregate() produces sparse output in which weeks without visits are missing. I initialize spend_per_week to 0, then row-wise manually assign the non-zero results from aggregate() to make sure the result always has 53 rows. Surely that can be done better?
Sample dataset lines look like:
customer_id visit_date visit_spend
72 40 2011-03-15 18.38
73 40 2011-03-20 23.45
74 79 2010-04-07 150.87
75 79 2010-04-17 101.90
76 79 2010-05-02 111.90
and here's the code with aggregate call and adjustment for empty weeks:
for (cid in all_tt_cids) {
print_pnq('Getting statistics for cid', cid)
# Get row indices of the selected subset, for just this cid's records
I <- which(tt$customer_id==cid & tt$visit_date<="2011-03-31")
# (other code to compute other per-cid statistics)
# spend_per_week (mode;mean;sd)
# Aggregate spend_per_week, but beware this should be 0 for those weeks with no visits
spend_per_week <- data.frame(c(list('weekofperiod'=13:65), list('spendperweek'=0)) )
nonzero_spends_per_week <- aggregate(tt$visit_spend[I], list('weekofperiod'=weekofperiod(tt$visit_date[I])), FUN="sum")
for (i in 1:nrow(nonzero_spends_per_week)) {
spend_per_week[spend_per_week$weekofperiod==nonzero_spends_per_week[i,1],2] <- nonzero_spends_per_week[i,2]
}
colnames(spend_per_week)[2] <- 'spend_per_week'
# (code to compute and store per-cid statistics on spend_per_week)
}
Your biggest speed-up is going to come from replacing the for loops. I can't quite tell from your example, because you overwrite each customer in the loop, but here's one way to do it if you want to keep the info for all subjects.
For testing, first define functions for the original method, and a new method without loops:
weekofperiod <- function(dt) {
as.numeric(format(as.Date(dt), "%W")) + 52 * (as.numeric(format(as.Date(dt), "%Y"))-2010)
}
FastMethod <- function(tt) {
tt$week = weekofperiod(tt$visit_date)
spend_per_week.tmp = as.data.frame(tapply(tt$visit_spend, tt[,c(1,4)], sum))
spend_per_week = data.frame(matrix(0, nrow=nrow(spend_per_week.tmp), ncol=length(13:65)))
colnames(spend_per_week) = 13:65
rownames(spend_per_week) = rownames(spend_per_week.tmp)
spend_per_week[, colnames(spend_per_week.tmp)] = spend_per_week.tmp
spend_per_week
}
OrigMethod <- function(tt) {
all_tt_cids = unique(tt$customer_id)
for (cid in all_tt_cids) {
# Get row indices of the selected subset, for just this cid's records
I <- which(tt$customer_id==cid & tt$visit_date<="2011-03-31")
# Aggregate spend_per_week, but beware this should be 0 for those weeks with no visits
spend_per_week <- data.frame(c(list('weekofperiod'=13:65), list('spendperweek'=0)))
nonzero_spends_per_week <- aggregate(tt$visit_spend[I], list('weekofperiod'=weekofperiod(tt$visit_date[I])), FUN="sum")
for (i in 1:nrow(nonzero_spends_per_week)) {
spend_per_week[spend_per_week$weekofperiod==nonzero_spends_per_week[i,1],2] <- nonzero_spends_per_week[i,2]
}
colnames(spend_per_week)[2] <- 'spend_per_week'
}
spend_per_week
}
Now simulate a larger dataset so it's easier to compare:
n.row = 10^4
n.cust = 10^3
customer_id = 1:n.cust
dates = seq(as.Date('2010-04-01'), as.Date('2011-03-31'), by=1)
visit_date = sample(dates, n.row, replace=T)
visit_spend = runif(n.row, 0, 200)
tt = data.frame(customer_id, visit_date, visit_spend)
Finally, compare the two methods:
> system.time(FastMethod(tt))
user system elapsed
0.082 0.001 0.083
> system.time(OrigMethod(tt))
user system elapsed
4.505 0.007 4.514
This is already 50x faster, and I bet you can make it even better with more optimization. Good luck!
Here is a faster method using data.table, which is also easier to read.
FasterMethod <- function(tt){
# LOAD LIBRARIES
require(reshape2)
require(data.table)
tt <- transform(tt, week_of_period = weekofperiod(visit_date))
# AGGREGATE SPEND BY CUSTOMER AND WEEK OF PERIOD
tt <- data.table(tt)
ans <- tt[,list(spend = sum(visit_spend)), 'customer_id, week_of_period']
# RESHAPE TO CUSTOMER ID VS. WEEK OF PERIOD
dcast(ans, customer_id ~ week_of_period, value.var = 'spend')
}
We can benchmark this against FastMethod using rbenchmark, and see that we gain a 1.3x speedup over FastMethod and roughly a 70x speedup overall relative to OrigMethod:
library(rbenchmark)
benchmark(FastMethod(tt), FasterMethod(tt), replications = 40)
test elapsed relative
FastMethod(tt) 5.594 1.346654
FasterMethod(tt) 4.154 1.000000
You can speed it up even further (2.5x compared to FastMethod) if you do not care about reshaping the final output to customer id vs. week of period - see the sketch below.
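For completeness, a minimal sketch of that "no reshape" variant. FasterMethodLong is a hypothetical name; it reuses the weekofperiod() helper defined above and simply skips the dcast step, leaving the result in long format:
FasterMethodLong <- function(tt) {
  require(data.table)
  tt <- data.table(tt)
  tt[, week_of_period := weekofperiod(visit_date)]
  # long format: one row per customer_id x week_of_period with the summed spend
  tt[, .(spend = sum(visit_spend)), by = .(customer_id, week_of_period)]
}
system.time(FasterMethodLong(tt))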