match 2 column elements based on a differece within a range - r

I want to match the elements of two unequal columns from two different data frames if they fall within the range: 1 to 3 (2+/-1)
My data frames:
dat1:
Number status
10023 T
10324 F
12277 F
12888 T
12000 T
dat2:
Number status
10020 T
10002 F
12279 F
12888 T
Required ouput:
10023 10020 T
12277 12279 F
My attempt (below) did not work:
diff <- 2
allow <- 1
NewData <- dat1$Number %in% (dat2$Number<=diff+allow | dat2$Number>=diff+allow)
Help will be appreciated.

This looks like a must case for data.table::foverlaps to me.
The workflow is to create start and end columns within both data sets, while we will create the range within the second data set. Then, we will key both and simply run foverlaps
library(data.table)
diff <- 2
allow <- 1
setDT(dat1)[, `:=`(start = Number, end = Number)]
setkey(dat1, status, start, end)
setDT(dat2)[, `:=`(start = Number - (diff + allow), end = Number + diff + allow)]
setkey(dat2, status, start, end)
foverlaps(dat2, dat1, nomatch = 0L)[, .(Numdf1 = Number, Numdf2 = i.Number, status)]
# Numdf1 Numdf2 status
# 1: 12277 12279 FALSE
# 2: 10023 10020 TRUE
# 3: 12888 12888 TRUE ### <- I'm assuming you had an error in the desirred output

Related

R index and match with multiple conditions [duplicate]

I would like to use foverlaps to find the intersecting ranges of two bed files, and collapse any rows containing overlapping ranges into a single row. In the example below I have two tables with genomic ranges. The tables are called "bed" files that have zero-based start coordinates and one-based ending positions of features in chromosomes. For example, START=9, STOP=20 is interpreted to span bases 10 through 20, inclusive. These bed files can contain millions of rows. The solution would need to give the same result, regardless of the order in which the two files to be intersected are provided.
First Table
> table1
CHROMOSOME START STOP
1: 1 1 10
2: 1 20 50
3: 1 70 130
4: X 1 20
5: Y 5 200
Second Table
> table2
CHROMOSOME START STOP
1: 1 5 12
2: 1 15 55
3: 1 60 65
4: 1 100 110
5: 1 130 131
6: X 60 80
7: Y 1 15
8: Y 10 50
I was thinking that the new foverlaps function could be a very fast way to find the intersecting ranges in these two table to produce a table that would look like:
Result Table:
> resultTable
CHROMOSOME START STOP
1: 1 5 10
2: 1 20 50
3: 1 100 110
4: Y 5 50
Is that possible, or is there a better way to do that in data.table?
I'd also like to first confirm that in one table, for any given CHROMOSOME, the STOP coordinate does not overlap with the start coordinate of the next row. For example, CHROMOSOME Y:1-15 and CHROMOSOME Y:10-50 would need to be collapsed to CHROMOSOME Y:1-50 (see Second Table Rows 7 and 8). This should not be the case, but the function should probably check for that. A real life example of how potential overlaps should be collapsed is below:
CHROM START STOP
1: 1 721281 721619
2: 1 721430 721906
3: 1 721751 722042
Desired output:
CHROM START STOP
1: 1 721281 722042
Functions to create example tables are as follows:
table1 <- data.table(
CHROMOSOME = as.character(c("1","1","1","X","Y")) ,
START = c(1,20,70,1,5) ,
STOP = c(10,50,130,20,200)
)
table2 <- data.table(
CHROMOSOME = as.character(c("1","1","1","1","1","X","Y","Y")) ,
START = c(5,15,60,100,130,60,1,10) ,
STOP = c(12,55,65,110,131,80,15,50)
)
#Seth provided the fastest way to solve the problem of intersection overlaps using the data.table foverlaps function. However, this solution did not take into account the fact that the input bed files may have overlapping ranges that needed to be reduced into single regions. #Martin Morgan solved that with his solution using the GenomicRanges package, that did both the intersecting and range reducing. However, Martin's solution didn't use the foverlaps function. #Arun pointed out that the overlapping ranges in different rows within a table was not currently possible using foverlaps. Thanks to the answers provided, and some additional research on stackoverflow, I came up with this hybrid solution.
Create example BED files without overlapping regions within each file.
chr <- c(1:22,"X","Y","MT")
#bedA contains 5 million rows
bedA <- data.table(
CHROM = as.vector(sapply(chr, function(x) rep(x,200000))),
START = rep(as.integer(seq(1,200000000,1000)),25),
STOP = rep(as.integer(seq(500,200000000,1000)),25),
key = c("CHROM","START","STOP")
)
#bedB contains 500 thousand rows
bedB <- data.table(
CHROM = as.vector(sapply(chr, function(x) rep(x,20000))),
START = rep(as.integer(seq(200,200000000,10000)),25),
STOP = rep(as.integer(seq(600,200000000,10000)),25),
key = c("CHROM","START","STOP")
)
Now create a new bed file containing the intersecting regions in bedA and bedB.
#This solution uses foverlaps
system.time(tmpA <- intersectBedFiles.foverlaps(bedA,bedB))
user system elapsed
1.25 0.02 1.37
#This solution uses GenomicRanges
system.time(tmpB <- intersectBedFiles.GR(bedA,bedB))
user system elapsed
12.95 0.06 13.04
identical(tmpA,tmpB)
[1] TRUE
Now, modify bedA and bedB such that they contain overlapping regions:
#Create overlapping ranges
makeOverlaps <- as.integer(c(0,0,600,0,0,0,600,0,0,0))
bedC <- bedA[, STOP := STOP + makeOverlaps, by=CHROM]
bedD <- bedB[, STOP := STOP + makeOverlaps, by=CHROM]
Test time to intersect bed files with overlapping ranges using either the foverlaps or GenomicRanges fucntions.
#This solution uses foverlaps to find the intersection and then run GenomicRanges on the result
system.time(tmpC <- intersectBedFiles.foverlaps(bedC,bedD))
user system elapsed
1.83 0.05 1.89
#This solution uses GenomicRanges
system.time(tmpD <- intersectBedFiles.GR(bedC,bedD))
user system elapsed
12.95 0.04 12.99
identical(tmpC,tmpD)
[1] TRUE
The winner: foverlaps!
FUNCTIONS USED
This is the function based upon foverlaps, and will only call the GenomicRanges function (reduceBed.GenomicRanges) if there are overlapping ranges (which are checked for using the rowShift function).
intersectBedFiles.foverlaps <- function(bed1,bed2) {
require(data.table)
bedKey <- c("CHROM","START","STOP")
if(nrow(bed1)>nrow(bed2)) {
bed <- foverlaps(bed1, bed2, nomatch = 0)
} else {
bed <- foverlaps(bed2, bed1, nomatch = 0)
}
bed[, START := pmax(START, i.START)]
bed[, STOP := pmin(STOP, i.STOP)]
bed[, `:=`(i.START = NULL, i.STOP = NULL)]
if(!identical(key(bed),bedKey)) setkeyv(bed,bedKey)
if(any(bed[, STOP+1 >= rowShift(START), by=CHROM][,V1], na.rm = T)) {
bed <- reduceBed.GenomicRanges(bed)
}
return(bed)
}
rowShift <- function(x, shiftLen = 1L) {
#Note this function was described in this thread:
#http://stackoverflow.com/questions/14689424/use-a-value-from-the-previous-row-in-an-r-data-table-calculation
r <- (1L + shiftLen):(length(x) + shiftLen)
r[r<1] <- NA
return(x[r])
}
reduceBed.GenomicRanges <- function(bed) {
setnames(bed,colnames(bed),bedKey)
if(!identical(key(bed),bedKey)) setkeyv(bed,bedKey)
grBed <- makeGRangesFromDataFrame(bed,
seqnames.field = "CHROM",start.field="START",end.field="STOP")
grBed <- reduce(grBed)
grBed <- data.table(
CHROM=as.character(seqnames(grBed)),
START=start(grBed),
STOP=end(grBed),
key = c("CHROM","START","STOP"))
return(grBed)
}
This function strictly used the GenomicRanges package, produces the same result, but is about 10 fold slower that the foverlaps funciton.
intersectBedFiles.GR <- function(bed1,bed2) {
require(data.table)
require(GenomicRanges)
bed1 <- makeGRangesFromDataFrame(bed1,
seqnames.field = "CHROM",start.field="START",end.field="STOP")
bed2 <- makeGRangesFromDataFrame(bed2,
seqnames.field = "CHROM",start.field="START",end.field="STOP")
grMerge <- suppressWarnings(intersect(bed1,bed2))
resultTable <- data.table(
CHROM=as.character(seqnames(grMerge)),
START=start(grMerge),
STOP=end(grMerge),
key = c("CHROM","START","STOP"))
return(resultTable)
}
An additional comparison using IRanges
I found a solution to collapse overlapping regions using IRanges but it is more than 10 fold slower than GenomicRanges.
reduceBed.IRanges <- function(bed) {
bed.tmp <- bed
bed.tmp[,group := {
ir <- IRanges(START, STOP);
subjectHits(findOverlaps(ir, reduce(ir)))
}, by=CHROM]
bed.tmp <- bed.tmp[, list(CHROM=unique(CHROM),
START=min(START),
STOP=max(STOP)),
by=list(group,CHROM)]
setkeyv(bed.tmp,bedKey)
bed[,group := NULL]
return(bed.tmp[, -(1:2)])
}
system.time(bedC.reduced <- reduceBed.GenomicRanges(bedC))
user system elapsed
10.86 0.01 10.89
system.time(bedD.reduced <- reduceBed.IRanges(bedC))
user system elapsed
137.12 0.14 137.58
identical(bedC.reduced,bedD.reduced)
[1] TRUE
foverlaps() will do nicely.
First set the keys for both of the tables:
setkey(table1, CHROMOSOME, START, STOP)
setkey(table2, CHROMOSOME, START, STOP)
Now join them using foverlaps() with nomatch = 0 to drop unmatched rows in table2.
resultTable <- foverlaps(table1, table2, nomatch = 0)
Next choose the appropriate values for START and STOP, and drop the extra columns.
resultTable[, START := pmax(START, i.START)]
resultTable[, STOP := pmin(STOP, i.STOP)]
resultTable[, `:=`(i.START = NULL, i.STOP = NULL)]
The overlapping STOP to a future START should be a different question. It's actually one that I have, so maybe I'll ask it and come back to it here when I have a good answer.
In case you're not stuck on a data.table solution, GenomicRanges
source("http://bioconductor.org/biocLite.R")
biocLite("GenomicRanges")
gives
> library(GenomicRanges)
> intersect(makeGRangesFromDataFrame(table1), makeGRangesFromDataFrame(table2))
GRanges object with 5 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] 1 [ 5, 10] *
[2] 1 [ 20, 50] *
[3] 1 [100, 110] *
[4] 1 [130, 130] *
[5] Y [ 5, 50] *
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
In most overlapping ranges problems in genomics, we have one large data set x (usually sequenced reads) and another smaller data set y (usually the gene model, exons, introns etc.). We are tasked with finding which intervals in x overlap with which intervals in y or how many intervals in x overlap for each y interval.
In foverlaps(), we don't have to setkey() on the larger data set x - it's quite an expensive operation. But y needs to have it's key set. For your case, from this example it seems like table2 is larger = x, and table1 = y.
require(data.table)
setkey(table1) # key columns = chr, start, end
ans = foverlaps(table2, table1, type="any", nomatch=0L)
ans[, `:=`(i.START = pmax(START, i.START),
i.STOP = pmin(STOP, i.STOP))]
ans = ans[, .(i.START[1L], i.STOP[.N]), by=.(CHROMOSOME, START, STOP)]
# CHROMOSOME START STOP V1 V2
# 1: 1 1 10 5 10
# 2: 1 20 50 20 50
# 3: 1 70 130 100 130
# 4: Y 5 200 5 50
But I agree it'd be great to be able to do this in one step. Not sure how yet, but maybe using additional values reduce and intersect for mult= argument.
Here's a solution entirely in data.table based on Pete's answer. It's actually slower than his solution that uses GenomicRanges and data.table, but still faster than the solution that uses only GenomicRanges.
intersectBedFiles.foverlaps2 <- function(bed1,bed2) {
require(data.table)
bedKey <- c("CHROM","START","STOP")
if(nrow(bed1)>nrow(bed2)) {
if(!identical(key(bed2),bedKey)) setkeyv(bed2,bedKey)
bed <- foverlaps(bed1, bed2, nomatch = 0)
} else {
if(!identical(key(bed1),bedKey)) setkeyv(bed1,bedKey)
bed <- foverlaps(bed2, bed1, nomatch = 0)
}
bed[,row_id:=1:nrow(bed)]
bed[, START := pmax(START, i.START)]
bed[, STOP := pmin(STOP, i.STOP)]
bed[, `:=`(i.START = NULL, i.STOP = NULL)]
setkeyv(bed,bedKey)
temp <- foverlaps(bed,bed)
temp[, `:=`(c("START","STOP"),list(min(START,i.START),max(STOP,i.STOP))),by=row_id]
temp[, `:=`(c("START","STOP"),list(min(START,i.START),max(STOP,i.STOP))),by=i.row_id]
out <- unique(temp[,.(CHROM,START,STOP)])
setkeyv(out,bedKey)
out
}

Calculate the mean per subject and repeat the value for each subject's row

This is the first time that I ask a question on stack overflow. I have tried searching for the answer but I cannot find exactly what I am looking for. I hope someone can help.
I have a huge data set of 20416 observation. Basically, I have 83 subjects and for each subject I have several observations. However, the number of observations per subject is not the same (e.g. subject 1 has 256 observations, while subject 2 has only 64 observations).
I want to add an extra column containing the mean of the observations for each subject (the observations are reading times (RT)).
I tried with the aggregate function:
aggregate (RT ~ su, data, mean)
This formula returns the correct mean per subject. But then I cannot simply do the following:
data$mean <- aggregate (RT ~ su, data, mean)
as R returns this error:
Error in $<-.data.frame(tmp, "mean", value = list(su = 1:83, RT
= c(378.1328125, : replacement has 83 rows, data has 20416
I understand that the formula lacks a command specifying that the mean for each subject has to be repeated for all the subject's rows (e.g. if subject 1 has 256 rows, the mean for subject 1 has to be repeated for 256 rows, if subject 2 has 64 rows, the mean for subject 2 has to be repeated for 64 rows and so forth).
How can I achieve this in R?
The data.table syntax lends itself well to this kind of problem:
Dt[, Mean := mean(Value), by = "ID"][]
# ID Value Mean
# 1: a 0.05881156 0.004426491
# 2: a -0.04995858 0.004426491
# 3: b 0.64054432 0.038809830
# 4: b -0.56292466 0.038809830
# 5: c 0.44254622 0.099747707
# 6: c -0.10771992 0.099747707
# 7: c -0.03558318 0.099747707
# 8: d 0.56727423 0.532377247
# 9: d -0.60962095 0.532377247
# 10: d 1.13808538 0.532377247
# 11: d 1.03377033 0.532377247
# 12: e 1.38789640 0.568760936
# 13: e -0.57420308 0.568760936
# 14: e 0.89258949 0.568760936
As we are applying a grouped operation (by = "ID"), data.table will automatically replicate each group's mean(Value) the appropriate number of times (avoiding the error you ran into above).
Data:
Dt <- data.table::data.table(
ID = sample(letters[1:5], size = 14, replace = TRUE),
Value = rnorm(14))[order(ID)]
Staying in Base R, ave is intended for this use:
data$mean = with(data, ave(x = RT, su, FUN = mean))
Simply merge your aggregated means data with full dataframe joined by the subject:
aggdf <- aggregate (RT ~ su, data, mean)
names(aggdf)[2] <- "MeanOfRT"
df <- merge(df, aggdf, by="su")
Another compelling way of handling this without generating extra data objects is by using group_by of dplyr package:
# Generating some data
data <- data.table::data.table(
su = sample(letters[1:5], size = 14, replace = TRUE),
RT = rnorm(14))[order(su)]
# Performing
> data %>% group_by(su) %>%
+ mutate(Mean = mean(RT)) %>%
+ ungroup()
Source: local data table [14 x 3]
su RT Mean
1 a -1.62841746 0.2096967
2 a 0.07286149 0.2096967
3 a 0.02429030 0.2096967
4 a 0.98882343 0.2096967
5 a 0.95407214 0.2096967
6 a 1.18823435 0.2096967
7 a -0.13198711 0.2096967
8 b -0.34897914 0.1469982
9 b 0.64297557 0.1469982
10 c -0.58995261 -0.5899526
11 d -0.95995198 0.3067978
12 d 1.57354754 0.3067978
13 e 0.43071258 0.2462978
14 e 0.06188307 0.2462978

Histogram-like summary for interval data

How do I get a histogram-like summary of interval data in R?
My MWE data has four intervals.
interval range
Int1 2-7
Int2 10-14
Int3 12-18
Int4 25-28
I want a histogram-like function which counts how the intervals Int1-Int4 span a range split across fixed-size bins.
The function output should look like this:
bin count which
[0-4] 1 Int1
[5-9] 1 Int1
[10-14] 2 Int2 and Int3
[15-19] 1 Int3
[20-24] 0 None
[25-29] 1 Int4
Here the range is [minfloor(Int1, Int2, Int3, Int40), maxceil(Int1, Int2, Int3, Int4)) = [0,30) and there are six bins of size = 5.
I would greatly appreciate any pointers to R packages or functions that implement the functionality I want.
Update:
So far, I have a solution from the IRanges package which uses a fast data structure called NCList, which is faster than Interval Search Trees according to users.
> library(IRanges)
> subject <- IRanges(c(2,10,12,25), c(7,14,18,28))
> query <- IRanges(c(0,5,10,15,20,25), c(4,9,14,19,24,29))
> countOverlaps(query, subject)
[1] 1 1 2 1 0 1
But I am still unable to get which are the ranges that overlap. Will update if I get through.
Using IRanges, you should use findOverlaps or mergeByOverlaps instead of countOverlaps. It, by default, doesn't return no matches though.
I'll leave that to you. Instead, will show an alternate method using foverlaps() from data.table package:
require(data.table)
subject <- data.table(interval = paste("int", 1:4, sep=""),
start = c(2,10,12,25),
end = c(7,14,18,28))
query <- data.table(start = c(0,5,10,15,20,25),
end = c(4,9,14,19,24,29))
setkey(subject, start, end)
ans = foverlaps(query, subject, type="any")
ans[, .(count = sum(!is.na(start)),
which = paste(interval, collapse=", ")),
by = .(i.start, i.end)]
# i.start i.end count which
# 1: 0 4 1 int1
# 2: 5 9 1 int1
# 3: 10 14 2 int2, int3
# 4: 15 19 1 int3
# 5: 20 24 0 NA
# 6: 25 29 1 int4

Find the intersection of overlapping ranges in two tables using data.table function foverlaps

I would like to use foverlaps to find the intersecting ranges of two bed files, and collapse any rows containing overlapping ranges into a single row. In the example below I have two tables with genomic ranges. The tables are called "bed" files that have zero-based start coordinates and one-based ending positions of features in chromosomes. For example, START=9, STOP=20 is interpreted to span bases 10 through 20, inclusive. These bed files can contain millions of rows. The solution would need to give the same result, regardless of the order in which the two files to be intersected are provided.
First Table
> table1
CHROMOSOME START STOP
1: 1 1 10
2: 1 20 50
3: 1 70 130
4: X 1 20
5: Y 5 200
Second Table
> table2
CHROMOSOME START STOP
1: 1 5 12
2: 1 15 55
3: 1 60 65
4: 1 100 110
5: 1 130 131
6: X 60 80
7: Y 1 15
8: Y 10 50
I was thinking that the new foverlaps function could be a very fast way to find the intersecting ranges in these two table to produce a table that would look like:
Result Table:
> resultTable
CHROMOSOME START STOP
1: 1 5 10
2: 1 20 50
3: 1 100 110
4: Y 5 50
Is that possible, or is there a better way to do that in data.table?
I'd also like to first confirm that in one table, for any given CHROMOSOME, the STOP coordinate does not overlap with the start coordinate of the next row. For example, CHROMOSOME Y:1-15 and CHROMOSOME Y:10-50 would need to be collapsed to CHROMOSOME Y:1-50 (see Second Table Rows 7 and 8). This should not be the case, but the function should probably check for that. A real life example of how potential overlaps should be collapsed is below:
CHROM START STOP
1: 1 721281 721619
2: 1 721430 721906
3: 1 721751 722042
Desired output:
CHROM START STOP
1: 1 721281 722042
Functions to create example tables are as follows:
table1 <- data.table(
CHROMOSOME = as.character(c("1","1","1","X","Y")) ,
START = c(1,20,70,1,5) ,
STOP = c(10,50,130,20,200)
)
table2 <- data.table(
CHROMOSOME = as.character(c("1","1","1","1","1","X","Y","Y")) ,
START = c(5,15,60,100,130,60,1,10) ,
STOP = c(12,55,65,110,131,80,15,50)
)
#Seth provided the fastest way to solve the problem of intersection overlaps using the data.table foverlaps function. However, this solution did not take into account the fact that the input bed files may have overlapping ranges that needed to be reduced into single regions. #Martin Morgan solved that with his solution using the GenomicRanges package, that did both the intersecting and range reducing. However, Martin's solution didn't use the foverlaps function. #Arun pointed out that the overlapping ranges in different rows within a table was not currently possible using foverlaps. Thanks to the answers provided, and some additional research on stackoverflow, I came up with this hybrid solution.
Create example BED files without overlapping regions within each file.
chr <- c(1:22,"X","Y","MT")
#bedA contains 5 million rows
bedA <- data.table(
CHROM = as.vector(sapply(chr, function(x) rep(x,200000))),
START = rep(as.integer(seq(1,200000000,1000)),25),
STOP = rep(as.integer(seq(500,200000000,1000)),25),
key = c("CHROM","START","STOP")
)
#bedB contains 500 thousand rows
bedB <- data.table(
CHROM = as.vector(sapply(chr, function(x) rep(x,20000))),
START = rep(as.integer(seq(200,200000000,10000)),25),
STOP = rep(as.integer(seq(600,200000000,10000)),25),
key = c("CHROM","START","STOP")
)
Now create a new bed file containing the intersecting regions in bedA and bedB.
#This solution uses foverlaps
system.time(tmpA <- intersectBedFiles.foverlaps(bedA,bedB))
user system elapsed
1.25 0.02 1.37
#This solution uses GenomicRanges
system.time(tmpB <- intersectBedFiles.GR(bedA,bedB))
user system elapsed
12.95 0.06 13.04
identical(tmpA,tmpB)
[1] TRUE
Now, modify bedA and bedB such that they contain overlapping regions:
#Create overlapping ranges
makeOverlaps <- as.integer(c(0,0,600,0,0,0,600,0,0,0))
bedC <- bedA[, STOP := STOP + makeOverlaps, by=CHROM]
bedD <- bedB[, STOP := STOP + makeOverlaps, by=CHROM]
Test time to intersect bed files with overlapping ranges using either the foverlaps or GenomicRanges fucntions.
#This solution uses foverlaps to find the intersection and then run GenomicRanges on the result
system.time(tmpC <- intersectBedFiles.foverlaps(bedC,bedD))
user system elapsed
1.83 0.05 1.89
#This solution uses GenomicRanges
system.time(tmpD <- intersectBedFiles.GR(bedC,bedD))
user system elapsed
12.95 0.04 12.99
identical(tmpC,tmpD)
[1] TRUE
The winner: foverlaps!
FUNCTIONS USED
This is the function based upon foverlaps, and will only call the GenomicRanges function (reduceBed.GenomicRanges) if there are overlapping ranges (which are checked for using the rowShift function).
intersectBedFiles.foverlaps <- function(bed1,bed2) {
require(data.table)
bedKey <- c("CHROM","START","STOP")
if(nrow(bed1)>nrow(bed2)) {
bed <- foverlaps(bed1, bed2, nomatch = 0)
} else {
bed <- foverlaps(bed2, bed1, nomatch = 0)
}
bed[, START := pmax(START, i.START)]
bed[, STOP := pmin(STOP, i.STOP)]
bed[, `:=`(i.START = NULL, i.STOP = NULL)]
if(!identical(key(bed),bedKey)) setkeyv(bed,bedKey)
if(any(bed[, STOP+1 >= rowShift(START), by=CHROM][,V1], na.rm = T)) {
bed <- reduceBed.GenomicRanges(bed)
}
return(bed)
}
rowShift <- function(x, shiftLen = 1L) {
#Note this function was described in this thread:
#http://stackoverflow.com/questions/14689424/use-a-value-from-the-previous-row-in-an-r-data-table-calculation
r <- (1L + shiftLen):(length(x) + shiftLen)
r[r<1] <- NA
return(x[r])
}
reduceBed.GenomicRanges <- function(bed) {
setnames(bed,colnames(bed),bedKey)
if(!identical(key(bed),bedKey)) setkeyv(bed,bedKey)
grBed <- makeGRangesFromDataFrame(bed,
seqnames.field = "CHROM",start.field="START",end.field="STOP")
grBed <- reduce(grBed)
grBed <- data.table(
CHROM=as.character(seqnames(grBed)),
START=start(grBed),
STOP=end(grBed),
key = c("CHROM","START","STOP"))
return(grBed)
}
This function strictly used the GenomicRanges package, produces the same result, but is about 10 fold slower that the foverlaps funciton.
intersectBedFiles.GR <- function(bed1,bed2) {
require(data.table)
require(GenomicRanges)
bed1 <- makeGRangesFromDataFrame(bed1,
seqnames.field = "CHROM",start.field="START",end.field="STOP")
bed2 <- makeGRangesFromDataFrame(bed2,
seqnames.field = "CHROM",start.field="START",end.field="STOP")
grMerge <- suppressWarnings(intersect(bed1,bed2))
resultTable <- data.table(
CHROM=as.character(seqnames(grMerge)),
START=start(grMerge),
STOP=end(grMerge),
key = c("CHROM","START","STOP"))
return(resultTable)
}
An additional comparison using IRanges
I found a solution to collapse overlapping regions using IRanges but it is more than 10 fold slower than GenomicRanges.
reduceBed.IRanges <- function(bed) {
bed.tmp <- bed
bed.tmp[,group := {
ir <- IRanges(START, STOP);
subjectHits(findOverlaps(ir, reduce(ir)))
}, by=CHROM]
bed.tmp <- bed.tmp[, list(CHROM=unique(CHROM),
START=min(START),
STOP=max(STOP)),
by=list(group,CHROM)]
setkeyv(bed.tmp,bedKey)
bed[,group := NULL]
return(bed.tmp[, -(1:2)])
}
system.time(bedC.reduced <- reduceBed.GenomicRanges(bedC))
user system elapsed
10.86 0.01 10.89
system.time(bedD.reduced <- reduceBed.IRanges(bedC))
user system elapsed
137.12 0.14 137.58
identical(bedC.reduced,bedD.reduced)
[1] TRUE
foverlaps() will do nicely.
First set the keys for both of the tables:
setkey(table1, CHROMOSOME, START, STOP)
setkey(table2, CHROMOSOME, START, STOP)
Now join them using foverlaps() with nomatch = 0 to drop unmatched rows in table2.
resultTable <- foverlaps(table1, table2, nomatch = 0)
Next choose the appropriate values for START and STOP, and drop the extra columns.
resultTable[, START := pmax(START, i.START)]
resultTable[, STOP := pmin(STOP, i.STOP)]
resultTable[, `:=`(i.START = NULL, i.STOP = NULL)]
The overlapping STOP to a future START should be a different question. It's actually one that I have, so maybe I'll ask it and come back to it here when I have a good answer.
In case you're not stuck on a data.table solution, GenomicRanges
source("http://bioconductor.org/biocLite.R")
biocLite("GenomicRanges")
gives
> library(GenomicRanges)
> intersect(makeGRangesFromDataFrame(table1), makeGRangesFromDataFrame(table2))
GRanges object with 5 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] 1 [ 5, 10] *
[2] 1 [ 20, 50] *
[3] 1 [100, 110] *
[4] 1 [130, 130] *
[5] Y [ 5, 50] *
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
In most overlapping ranges problems in genomics, we have one large data set x (usually sequenced reads) and another smaller data set y (usually the gene model, exons, introns etc.). We are tasked with finding which intervals in x overlap with which intervals in y or how many intervals in x overlap for each y interval.
In foverlaps(), we don't have to setkey() on the larger data set x - it's quite an expensive operation. But y needs to have it's key set. For your case, from this example it seems like table2 is larger = x, and table1 = y.
require(data.table)
setkey(table1) # key columns = chr, start, end
ans = foverlaps(table2, table1, type="any", nomatch=0L)
ans[, `:=`(i.START = pmax(START, i.START),
i.STOP = pmin(STOP, i.STOP))]
ans = ans[, .(i.START[1L], i.STOP[.N]), by=.(CHROMOSOME, START, STOP)]
# CHROMOSOME START STOP V1 V2
# 1: 1 1 10 5 10
# 2: 1 20 50 20 50
# 3: 1 70 130 100 130
# 4: Y 5 200 5 50
But I agree it'd be great to be able to do this in one step. Not sure how yet, but maybe using additional values reduce and intersect for mult= argument.
Here's a solution entirely in data.table based on Pete's answer. It's actually slower than his solution that uses GenomicRanges and data.table, but still faster than the solution that uses only GenomicRanges.
intersectBedFiles.foverlaps2 <- function(bed1,bed2) {
require(data.table)
bedKey <- c("CHROM","START","STOP")
if(nrow(bed1)>nrow(bed2)) {
if(!identical(key(bed2),bedKey)) setkeyv(bed2,bedKey)
bed <- foverlaps(bed1, bed2, nomatch = 0)
} else {
if(!identical(key(bed1),bedKey)) setkeyv(bed1,bedKey)
bed <- foverlaps(bed2, bed1, nomatch = 0)
}
bed[,row_id:=1:nrow(bed)]
bed[, START := pmax(START, i.START)]
bed[, STOP := pmin(STOP, i.STOP)]
bed[, `:=`(i.START = NULL, i.STOP = NULL)]
setkeyv(bed,bedKey)
temp <- foverlaps(bed,bed)
temp[, `:=`(c("START","STOP"),list(min(START,i.START),max(STOP,i.STOP))),by=row_id]
temp[, `:=`(c("START","STOP"),list(min(START,i.START),max(STOP,i.STOP))),by=i.row_id]
out <- unique(temp[,.(CHROM,START,STOP)])
setkeyv(out,bedKey)
out
}

How to find sum and average for some columns based on the numbers from another column in R

GIVEN DATA
I have 6 columns of data of vehicle trajectory (observation of vehicles' change in position, velocity, etc over time) a part of which is shown below:
Vehicle ID Frame ID Global X Vehicle class Vehicle velocity Lane
1 177 6451181 2 24.99 5
1 178 6451182 2 24.95 5
1 179 6451184 2 24.91 5
1 180 6451186 2 24.90 5
1 181 6451187 2 24.96 5
1 182 6451189 2 25.08 5
Vehicle ID is the identification of individual vehicles e.g. vehicle 1, vehicle 2, etc. It is repeated in the column for each frame in which it was observed. Please note that each frame is 0.1 seconds long so 10 frames make 1 second. The IDs of frames is in Frame ID column. Vehicle class is the type of vehicle (1=motorcycle, 2=car, 3=truck). Vehicle velocity column represents instantaneous speed of vehicle in that instant of time i.e. in a frame. Lane represents the number or ID of the lane in which vehicle is present in a particular frame.
WHAT I NEED TO FIND
The data I have is for 15 minutes period. The minimum frame ID is 5 and maximum frame ID is 9952. I need to find the total number of vehicles in every 30 seconds time period. This means that starting from the first 30 seconds (frame ID 5 to frame ID 305), I need to know the unique vehicle IDs observed. Also, for these 30 seconds period, I need to find the average velocity of each vehicle class. This means that e.g. for cars I need to find the average of all velocities of those vehicles whose vehicle class is 2.
I need to find this for all 30 seconds time period i.e. 5-305, 305-605, 605-905,..., 9605-9905. The ouput should tables for cars, trucks and motorcycles like this:
Time Slots Total Cars Average Velocity
5-305 xx xx
305-605 xx xx
. . .
. . .
9605-9905 xx xx
WHAT I HAVE TRIED SO FAR
# Finding the minimum and maximum Frame ID for creating 30-seconds time slots
minfid <- min(data$'Frame ID') # this was 5
maxfid <- max(data$'Frame ID') # this was 9952
for (i in 'Frame ID'==5:Frame ID'==305) {
table ('Vehicle ID')
mean('Vehicle Velocity', 'Vehicle class'==2)
} #For cars in first 30 seconds
I can't generate the required output and I don't know how can I do this for all 30 second periods. Please help.
It's a bit tough to make sure code is completely correct with your data since there is only one vehicle in the sample you show. That said, this is a typical split-apply-combine type analysis you can execute easily with the data.table package:
library(data.table)
dt <- data.table(df) # I just did a `read.table` on the text you posted
dt[, frame.group:=cut(Frame_ID, seq(5, 9905, by=300), include.lowest=T)]
Here, I just converted your data into a data.table (df was a direct import of your data posted above), and then created 300 frame buckets using cut. Then, you just let data.table do the work. In the first expression we calculate total unique vehicles per frame.group
dt[, list(tot.vehic=length(unique(Vehicle_ID))), by=frame.group]
# frame.group tot.vehic
# 1: [5,305] 1
Now we group by frame.group and Vehicle_class to get average speed and count for those combinations:
dt[, list(tot.vehic=length(unique(Vehicle_ID)), mean.speed=mean(Vehicle_velocity)), by=list(frame.group, Vehicle_class)]
# frame.group Vehicle_class tot.vehic mean.speed
# 1: [5,305] 2 1 24.965
Again, a bit silly when we only have one vehicle, but this should work for your data set.
EDIT: to show that it works:
library(data.table)
set.seed(101)
dt <- data.table(
Frame_ID=sample(5:9905, 50000, rep=T),
Vehicle_ID=sample(1:400, 50000, rep=T),
Vehicle_velocity=runif(50000, 25, 100)
)
dt[, frame.group:=cut(Frame_ID, seq(5, 9905, by=300), include.lowest=T)]
dt[, Vehicle_class:=Vehicle_ID %% 3]
head(
dt[order(frame.group, Vehicle_class), list(tot.vehic=length(unique(Vehicle_ID)), mean.speed=mean(Vehicle_velocity)), by=list(frame.group, Vehicle_class)]
)
# frame.group Vehicle_class tot.vehic mean.speed
# 1: [5,305] 0 130 63.34589
# 2: [5,305] 1 131 61.84366
# 3: [5,305] 2 129 64.13968
# 4: (305,605] 0 132 61.85548
# 5: (305,605] 1 132 64.76820
# 6: (305,605] 2 133 61.57129
Maybe it's your data?
Here is a plyr version:
data$timeSlot <- cut(data$FrameID,
breaks = seq(5, 9905, by=300),
dig.lab=5,
include.lowest=TRUE)
# split & combine
library(plyr)
data.sum1 <- ddply(.data = data,
.variables = c("timeSlot"),
.fun = summarise,
totalCars = length(unique(VehicleID)),
AverageVelocity = mean(velocity)
)
# include VehicleClass
data.sum2 <- ddply(.data = data,
.variables = c("timeSlot", "VehicleClass"),
.fun = summarise,
totalCars = length(unique(VehicleID)),
AverageVelocity = mean(velocity)
)
The column names like FrameID would have to be edited to match the ones you use:
data <- read.table(sep = "", header = TRUE, text = "
VehicleID FrameID GlobalX VehicleClass velocity Lane
1 177 6451181 2 24.99 5
1 178 6451182 2 24.95 5
1 179 6451184 2 24.91 5
1 180 6451186 2 24.90 5
1 181 6451187 2 24.96 5
1 182 6451189 2 25.08 5")
data.sum1
# timeSlot totalCars AverageVelocity
# 1 [5,305] 1 24.965
data.sum2
# timeSlot VehicleClass totalCars AverageVelocity
# 1 [5,305] 2 1 24.965

Resources