R function to calculate nearest neighbor distance given [inconsistent] constraint? - r

I have data consisting of tree growth measurements (diameter and height) for trees at known X & Y coordinates. I'd like to determine the distance to each tree's nearest neighbor of equal or greater size.
I've seen other SE questions asking about nearest neighbor calculations (e.g., see here, here, here, here, etc.), but none specify constraints on the nearest neighbor to be searched.
Is there a function (or other work around) that would allow me to determine the distance of a point's nearest neighbor given that nearest point meets some criteria (e.g., must be equal to or greater in size than the point of interest)?
[An even more complex set of constraints would be even more helpful...]
For my example: specifying that a tree must also be in the same plot as the tree of interest or is the same species as the tree of interest

I'd do it with non-equijoins and data.table
EDIT: (fyi, this requires data.table 1.9.7, which you can get from github)
EDIT2: did it with a copy of the data.table, since it seems like it was joining on its own threshholds. I'll fix that in future, but this works for now.
library(data.table)
dtree <- data.table(id = 1:1000,
x = runif(1000),
y = runif(1000),
height = rnorm(1000,mean = 100,sd = 10),
species = sample(LETTERS[1:3],1000,replace = TRUE),
plot = sample(1:3,1000, replace = TRUE))
dtree_self <- copy(dtree)
dtree_self[,thresh1 := height + 10]
dtree_self[,thresh2 := height - 10]
# Join on a range, must be a cartesian join, since there are many candidates
test <- dtree[dtree_self, on = .(height >= thresh2,
height <= thresh1),
allow.cartesian = TRUE]
# Calculate the distance
test[, dist := (x - i.x)**2 + (y - i.y)**2]
# Exclude identical matches and
# Take the minimum distance grouped by id
final <- test[id != i.id, .SD[which.min(dist)],by = id]
The final dataset contains each pair, according to the given threshholds
EDIT:
With Additional variables:
If you want to join on additional parameters, this allows you to do it, (It's probably even faster if you additionally join on things like plot or species, since the cartesian join will be smaller)
Here's an example joining on two additional categorical variables, species and plot:
library(data.table)
dtree <- data.table(id = 1:1000,
x = runif(1000),
y = runif(1000),
height = rnorm(1000,mean = 100,sd = 10),
species = sample(LETTERS[1:3],1000,replace = TRUE),
plot = sample(1:3,1000, replace = TRUE))
dtree_self <- copy(dtree)
dtree_self[,thresh1 := height + 10]
dtree_self[,thresh2 := height - 10]
# Join on a range, must be a cartesian join, since there are many candidates
test <- dtree[dtree_self, on = .(height >= thresh2,
height <= thresh1,
species == species,
plot == plot),
nomatch = NA,
allow.cartesian = TRUE]
# Calculate the distance
test[, dist := (x - i.x)**2 + (y - i.y)**2]
# Exclude identical matches and
# Take the minimum distance grouped by id
final <- test[id != i.id, .SD[which.min(dist)],by = id]
final
> final
id x y height species plot height.1 i.id i.x i.y i.height dist
1: 3 0.4837348 0.4325731 91.53387 C 2 111.53387 486 0.5549221 0.4395687 101.53387 0.005116568
2: 13 0.8267298 0.3137061 94.58949 C 2 114.58949 754 0.8408547 0.2305702 104.58949 0.007111079
3: 29 0.2905729 0.4952757 89.52128 C 2 109.52128 333 0.2536760 0.5707272 99.52128 0.007054301
4: 37 0.4534841 0.5249862 89.95493 C 2 109.95493 72 0.4807242 0.6056771 99.95493 0.007253044
5: 63 0.1678515 0.8814829 84.77450 C 2 104.77450 289 0.1151764 0.9728488 94.77450 0.011122404
---
994: 137 0.8696393 0.2226888 66.57792 C 2 86.57792 473 0.4467795 0.6881008 76.57792 0.395418724
995: 348 0.3606249 0.1245749 110.14466 A 2 130.14466 338 0.1394011 0.1200064 120.14466 0.048960849
996: 572 0.6562758 0.1387882 113.61821 A 2 133.61821 348 0.3606249 0.1245749 123.61821 0.087611511
997: 143 0.9170504 0.1171652 71.39953 C 3 91.39953 904 0.6954973 0.3690599 81.39953 0.112536771
998: 172 0.6834473 0.6221259 65.52187 A 2 85.52187 783 0.4400028 0.9526355 75.52187 0.168501816
>
NOTE: in the final answer, there are columns height and height.1, the latter appears to result from data.table's equi join and represent the upper and lower boundary respectively.
A Mem-efficient solution
One of the issues here for #theforestecologist was that this requires a lot of memory to do,
(in that case, there were an additional 42 columns being multiplied by the cartesian join, which caused mem issues),
However, we can do this in a more memory efficient way by using .EACHI (I believe). Since we will not load the full table into memory. That solution follows:
library(data.table)
dtree <- data.table(id = 1:1000,
x = runif(1000),
y = runif(1000),
height = rnorm(1000,mean = 100,sd = 10),
species = sample(LETTERS[1:3],1000,replace = TRUE),
plot = sample(1:3,1000, replace = TRUE))
dtree_self <- copy(dtree)
dtree_self[,thresh1 := height + 10]
dtree_self[,thresh2 := height - 10]
# In order to navigate the sometimes unusual nature of scoping inside a
# data.table join, I set the second table to have its own uniquely named id
dtree_self[,id2 := id]
dtree_self[,id := NULL]
# for clarity inside the brackets,
# I define the squared euclid distance
eucdist <- function(x,xx,y,yy) (x - xx)**2 + (y - yy)**2
# Join on a range, must be a cartesian join, since there are many candidates
# Return a table of matches, using .EACHI to keep from loading too much into mem
test <- dtree[dtree_self, on = .(height >= thresh2,
height <= thresh1,
species,
plot),
.(id2, id[{z = eucdist(x,i.x,y,i.y); mz <- min(z[id2 != id]); mz == z}]),
by = .EACHI,
nomatch = NA,
allow.cartesian = TRUE]
# join the metadata back onto each id
test <- dtree[test, on = .(id = V2), nomatch = NA]
test <- dtree[test, on = .(id = id2), nomatch = NA]
> test
id x y height species plot i.id i.x i.y i.height i.species i.plot i.height.2 i.height.1 i.species.1 i.plot.1
1: 1 0.17622235 0.66547312 84.68450 B 2 965 0.17410840 0.63219350 93.60226 B 2 74.68450 94.68450 B 2
2: 2 0.04523011 0.33813054 89.46288 B 2 457 0.07267547 0.35725229 88.42827 B 2 79.46288 99.46288 B 2
3: 3 0.24096368 0.32649256 103.85870 C 3 202 0.20782303 0.38422814 94.35898 C 3 93.85870 113.85870 C 3
4: 4 0.53160655 0.06636979 101.50614 B 1 248 0.47382417 0.01535036 103.74101 B 1 91.50614 111.50614 B 1
5: 5 0.83426727 0.55380451 101.93408 C 3 861 0.78210747 0.52812487 96.71422 C 3 91.93408 111.93408 C 3
This way we should keep total memory usage low.

Related

R: Summing of the column values by ranged values of another column

Good day!
I’ve got a table of two columns. In the first column (x) there are values which I want to divide in into categories according to the specified range of values (in my instance – 300). And then using these categories I want to sum values in anther column (v). For instance, using my test data: The first category is from 65100 to 65400 (65100
The result: there is a table of two columns. The first one is the categories of x; the second column is the sum of according values of v.
Thank you!!!
# data
set.seed(1)
x <- sample(seq(65100, 67900, by=5), 100, replace = TRUE)
v <- sample(seq(1000, 8000), 100, replace = TRUE)
tabl <- data.frame(x=c(x), v=c(v))
attach(tabl)
#categories
seq(((min(x) - min(x)%%300) + 300), ((max(x) - max(x)%%300) + 300), by =300)
I understood you want to:
Cut vector x,
Using pre-calculated cut-off thresholds
Compute sums over vector v using those groupings
This is one line of code with data.table and chaining. Your data are in data.table named DT.
DT[, CUT := cut(x, breaks)][, sum(v), by=CUT]
Explanation:
First, assign cut-offs to variable breaks like so.
breaks <- seq(((min(x) - min(x) %% 300) + 300), ((max(x) - max(x) %% 300) + 300), by =300)
Second, compute a new column CUT to group rows by the data in breaks.
DT[, CUT := cut(x, breaks)]
Third, sum on column v in groups, using by=. I have chained this operation with the previous.
DT[, CUT := cut(x, breaks)][, sum(v), by=CUT]
Convert your data.frame to data.table like so.
library(data.table)
DT <- as.data.table(tabl)
This is the final result:
CUT V1
1: (6.57e+04,6.6e+04] 45493
2: (6.6e+04,6.63e+04] 77865
3: (6.66e+04,6.69e+04] 22893
4: (6.75e+04,6.78e+04] 61738
5: (6.54e+04,6.57e+04] 44805
6: (6.69e+04,6.72e+04] 64079
7: NA 33234
8: (6.72e+04,6.75e+04] 66517
9: (6.63e+04,6.66e+04] 43887
10: (6.78e+04,6.81e+04] 172
You can dress this up to improve aesthetics. For example, you can reset the factor levels for ease of reading.
When I use dplyr I am used to do it like this. Although I like the cut solution too.
# data
set.seed(1)
x <- sample(seq(65100, 67900, by=5), 100, replace = TRUE)
v <- sample(seq(1000, 8000), 100, replace = TRUE)
tabl <- data.frame(group=c(x), value=c(v))
attach(tabl)
#categories
s <- seq(((min(x) - min(x)%%300) + 300), ((max(x) - max(x)%%300) + 300), by =300)
tabl %>% rowwise() %>% mutate(g = s[min(which(group < s), na.rm=T)]) %>% ungroup() %>%
group_by(g) %>% summarise(sumvalue = sum(value))
result:
g sumvalue
<dbl> <int>
65400 28552
65700 49487
66000 45493
66300 77865
66600 43887
66900 21187
67200 65785
67500 66517
67800 61738
68100 1722
Try this (no package needed):
s <- seq(65100, max(tabl$x)+300, 300)
tabl$col = as.vector(cut(tabl$x, breaks = s, labels = 1:10))
df <- aggregate(v~col, tabl, sum)
# col v
# 1 1 33234
# 2 2 44805
# 3 3 45493
# 4 4 77865
# 5 5 43887
# 6 6 22893
# 7 7 64079
# 8 8 66517
# 9 9 61738
# 10 10 1722

R index and match with multiple conditions [duplicate]

I would like to use foverlaps to find the intersecting ranges of two bed files, and collapse any rows containing overlapping ranges into a single row. In the example below I have two tables with genomic ranges. The tables are called "bed" files that have zero-based start coordinates and one-based ending positions of features in chromosomes. For example, START=9, STOP=20 is interpreted to span bases 10 through 20, inclusive. These bed files can contain millions of rows. The solution would need to give the same result, regardless of the order in which the two files to be intersected are provided.
First Table
> table1
CHROMOSOME START STOP
1: 1 1 10
2: 1 20 50
3: 1 70 130
4: X 1 20
5: Y 5 200
Second Table
> table2
CHROMOSOME START STOP
1: 1 5 12
2: 1 15 55
3: 1 60 65
4: 1 100 110
5: 1 130 131
6: X 60 80
7: Y 1 15
8: Y 10 50
I was thinking that the new foverlaps function could be a very fast way to find the intersecting ranges in these two table to produce a table that would look like:
Result Table:
> resultTable
CHROMOSOME START STOP
1: 1 5 10
2: 1 20 50
3: 1 100 110
4: Y 5 50
Is that possible, or is there a better way to do that in data.table?
I'd also like to first confirm that in one table, for any given CHROMOSOME, the STOP coordinate does not overlap with the start coordinate of the next row. For example, CHROMOSOME Y:1-15 and CHROMOSOME Y:10-50 would need to be collapsed to CHROMOSOME Y:1-50 (see Second Table Rows 7 and 8). This should not be the case, but the function should probably check for that. A real life example of how potential overlaps should be collapsed is below:
CHROM START STOP
1: 1 721281 721619
2: 1 721430 721906
3: 1 721751 722042
Desired output:
CHROM START STOP
1: 1 721281 722042
Functions to create example tables are as follows:
table1 <- data.table(
CHROMOSOME = as.character(c("1","1","1","X","Y")) ,
START = c(1,20,70,1,5) ,
STOP = c(10,50,130,20,200)
)
table2 <- data.table(
CHROMOSOME = as.character(c("1","1","1","1","1","X","Y","Y")) ,
START = c(5,15,60,100,130,60,1,10) ,
STOP = c(12,55,65,110,131,80,15,50)
)
#Seth provided the fastest way to solve the problem of intersection overlaps using the data.table foverlaps function. However, this solution did not take into account the fact that the input bed files may have overlapping ranges that needed to be reduced into single regions. #Martin Morgan solved that with his solution using the GenomicRanges package, that did both the intersecting and range reducing. However, Martin's solution didn't use the foverlaps function. #Arun pointed out that the overlapping ranges in different rows within a table was not currently possible using foverlaps. Thanks to the answers provided, and some additional research on stackoverflow, I came up with this hybrid solution.
Create example BED files without overlapping regions within each file.
chr <- c(1:22,"X","Y","MT")
#bedA contains 5 million rows
bedA <- data.table(
CHROM = as.vector(sapply(chr, function(x) rep(x,200000))),
START = rep(as.integer(seq(1,200000000,1000)),25),
STOP = rep(as.integer(seq(500,200000000,1000)),25),
key = c("CHROM","START","STOP")
)
#bedB contains 500 thousand rows
bedB <- data.table(
CHROM = as.vector(sapply(chr, function(x) rep(x,20000))),
START = rep(as.integer(seq(200,200000000,10000)),25),
STOP = rep(as.integer(seq(600,200000000,10000)),25),
key = c("CHROM","START","STOP")
)
Now create a new bed file containing the intersecting regions in bedA and bedB.
#This solution uses foverlaps
system.time(tmpA <- intersectBedFiles.foverlaps(bedA,bedB))
user system elapsed
1.25 0.02 1.37
#This solution uses GenomicRanges
system.time(tmpB <- intersectBedFiles.GR(bedA,bedB))
user system elapsed
12.95 0.06 13.04
identical(tmpA,tmpB)
[1] TRUE
Now, modify bedA and bedB such that they contain overlapping regions:
#Create overlapping ranges
makeOverlaps <- as.integer(c(0,0,600,0,0,0,600,0,0,0))
bedC <- bedA[, STOP := STOP + makeOverlaps, by=CHROM]
bedD <- bedB[, STOP := STOP + makeOverlaps, by=CHROM]
Test time to intersect bed files with overlapping ranges using either the foverlaps or GenomicRanges fucntions.
#This solution uses foverlaps to find the intersection and then run GenomicRanges on the result
system.time(tmpC <- intersectBedFiles.foverlaps(bedC,bedD))
user system elapsed
1.83 0.05 1.89
#This solution uses GenomicRanges
system.time(tmpD <- intersectBedFiles.GR(bedC,bedD))
user system elapsed
12.95 0.04 12.99
identical(tmpC,tmpD)
[1] TRUE
The winner: foverlaps!
FUNCTIONS USED
This is the function based upon foverlaps, and will only call the GenomicRanges function (reduceBed.GenomicRanges) if there are overlapping ranges (which are checked for using the rowShift function).
intersectBedFiles.foverlaps <- function(bed1,bed2) {
require(data.table)
bedKey <- c("CHROM","START","STOP")
if(nrow(bed1)>nrow(bed2)) {
bed <- foverlaps(bed1, bed2, nomatch = 0)
} else {
bed <- foverlaps(bed2, bed1, nomatch = 0)
}
bed[, START := pmax(START, i.START)]
bed[, STOP := pmin(STOP, i.STOP)]
bed[, `:=`(i.START = NULL, i.STOP = NULL)]
if(!identical(key(bed),bedKey)) setkeyv(bed,bedKey)
if(any(bed[, STOP+1 >= rowShift(START), by=CHROM][,V1], na.rm = T)) {
bed <- reduceBed.GenomicRanges(bed)
}
return(bed)
}
rowShift <- function(x, shiftLen = 1L) {
#Note this function was described in this thread:
#http://stackoverflow.com/questions/14689424/use-a-value-from-the-previous-row-in-an-r-data-table-calculation
r <- (1L + shiftLen):(length(x) + shiftLen)
r[r<1] <- NA
return(x[r])
}
reduceBed.GenomicRanges <- function(bed) {
setnames(bed,colnames(bed),bedKey)
if(!identical(key(bed),bedKey)) setkeyv(bed,bedKey)
grBed <- makeGRangesFromDataFrame(bed,
seqnames.field = "CHROM",start.field="START",end.field="STOP")
grBed <- reduce(grBed)
grBed <- data.table(
CHROM=as.character(seqnames(grBed)),
START=start(grBed),
STOP=end(grBed),
key = c("CHROM","START","STOP"))
return(grBed)
}
This function strictly used the GenomicRanges package, produces the same result, but is about 10 fold slower that the foverlaps funciton.
intersectBedFiles.GR <- function(bed1,bed2) {
require(data.table)
require(GenomicRanges)
bed1 <- makeGRangesFromDataFrame(bed1,
seqnames.field = "CHROM",start.field="START",end.field="STOP")
bed2 <- makeGRangesFromDataFrame(bed2,
seqnames.field = "CHROM",start.field="START",end.field="STOP")
grMerge <- suppressWarnings(intersect(bed1,bed2))
resultTable <- data.table(
CHROM=as.character(seqnames(grMerge)),
START=start(grMerge),
STOP=end(grMerge),
key = c("CHROM","START","STOP"))
return(resultTable)
}
An additional comparison using IRanges
I found a solution to collapse overlapping regions using IRanges but it is more than 10 fold slower than GenomicRanges.
reduceBed.IRanges <- function(bed) {
bed.tmp <- bed
bed.tmp[,group := {
ir <- IRanges(START, STOP);
subjectHits(findOverlaps(ir, reduce(ir)))
}, by=CHROM]
bed.tmp <- bed.tmp[, list(CHROM=unique(CHROM),
START=min(START),
STOP=max(STOP)),
by=list(group,CHROM)]
setkeyv(bed.tmp,bedKey)
bed[,group := NULL]
return(bed.tmp[, -(1:2)])
}
system.time(bedC.reduced <- reduceBed.GenomicRanges(bedC))
user system elapsed
10.86 0.01 10.89
system.time(bedD.reduced <- reduceBed.IRanges(bedC))
user system elapsed
137.12 0.14 137.58
identical(bedC.reduced,bedD.reduced)
[1] TRUE
foverlaps() will do nicely.
First set the keys for both of the tables:
setkey(table1, CHROMOSOME, START, STOP)
setkey(table2, CHROMOSOME, START, STOP)
Now join them using foverlaps() with nomatch = 0 to drop unmatched rows in table2.
resultTable <- foverlaps(table1, table2, nomatch = 0)
Next choose the appropriate values for START and STOP, and drop the extra columns.
resultTable[, START := pmax(START, i.START)]
resultTable[, STOP := pmin(STOP, i.STOP)]
resultTable[, `:=`(i.START = NULL, i.STOP = NULL)]
The overlapping STOP to a future START should be a different question. It's actually one that I have, so maybe I'll ask it and come back to it here when I have a good answer.
In case you're not stuck on a data.table solution, GenomicRanges
source("http://bioconductor.org/biocLite.R")
biocLite("GenomicRanges")
gives
> library(GenomicRanges)
> intersect(makeGRangesFromDataFrame(table1), makeGRangesFromDataFrame(table2))
GRanges object with 5 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] 1 [ 5, 10] *
[2] 1 [ 20, 50] *
[3] 1 [100, 110] *
[4] 1 [130, 130] *
[5] Y [ 5, 50] *
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
In most overlapping ranges problems in genomics, we have one large data set x (usually sequenced reads) and another smaller data set y (usually the gene model, exons, introns etc.). We are tasked with finding which intervals in x overlap with which intervals in y or how many intervals in x overlap for each y interval.
In foverlaps(), we don't have to setkey() on the larger data set x - it's quite an expensive operation. But y needs to have it's key set. For your case, from this example it seems like table2 is larger = x, and table1 = y.
require(data.table)
setkey(table1) # key columns = chr, start, end
ans = foverlaps(table2, table1, type="any", nomatch=0L)
ans[, `:=`(i.START = pmax(START, i.START),
i.STOP = pmin(STOP, i.STOP))]
ans = ans[, .(i.START[1L], i.STOP[.N]), by=.(CHROMOSOME, START, STOP)]
# CHROMOSOME START STOP V1 V2
# 1: 1 1 10 5 10
# 2: 1 20 50 20 50
# 3: 1 70 130 100 130
# 4: Y 5 200 5 50
But I agree it'd be great to be able to do this in one step. Not sure how yet, but maybe using additional values reduce and intersect for mult= argument.
Here's a solution entirely in data.table based on Pete's answer. It's actually slower than his solution that uses GenomicRanges and data.table, but still faster than the solution that uses only GenomicRanges.
intersectBedFiles.foverlaps2 <- function(bed1,bed2) {
require(data.table)
bedKey <- c("CHROM","START","STOP")
if(nrow(bed1)>nrow(bed2)) {
if(!identical(key(bed2),bedKey)) setkeyv(bed2,bedKey)
bed <- foverlaps(bed1, bed2, nomatch = 0)
} else {
if(!identical(key(bed1),bedKey)) setkeyv(bed1,bedKey)
bed <- foverlaps(bed2, bed1, nomatch = 0)
}
bed[,row_id:=1:nrow(bed)]
bed[, START := pmax(START, i.START)]
bed[, STOP := pmin(STOP, i.STOP)]
bed[, `:=`(i.START = NULL, i.STOP = NULL)]
setkeyv(bed,bedKey)
temp <- foverlaps(bed,bed)
temp[, `:=`(c("START","STOP"),list(min(START,i.START),max(STOP,i.STOP))),by=row_id]
temp[, `:=`(c("START","STOP"),list(min(START,i.START),max(STOP,i.STOP))),by=i.row_id]
out <- unique(temp[,.(CHROM,START,STOP)])
setkeyv(out,bedKey)
out
}

Comparing Groups in data.table Columns

I have a dataset that I need to both split by one variable (Day) and then compare between groups of another variable (Group), performing per-group statistics (e.g. mean) and also tests.
Here's an example of what I devised:
require(data.table)
data = data.table(Day = rep(1:10, each = 10),
Group = rep(1:2, times = 50),
V = rnorm(100))
data[, .(g1_mean = mean(.SD[Group == 1]$V),
g2_mean = mean(.SD[Group == 2]$V),
p.value = t.test(V ~ Group, .SD, alternative = "two.sided")$p.value),
by = list(Day)]
Which produces:
Day g1_mean g2_mean p.value
1: 1 0.883406048 0.67177271 0.6674138
2: 2 0.007544956 -0.55609722 0.3948459
3: 3 0.409248637 0.28717183 0.8753213
4: 4 -0.540075365 0.23181458 0.1785854
5: 5 -0.632543900 -1.09965990 0.6457325
6: 6 -0.083221671 -0.96286343 0.2011136
7: 7 -0.044674252 -0.27666473 0.7079499
8: 8 0.260795244 -0.15159164 0.4663712
9: 9 -0.134164758 0.01136245 0.7992453
10: 10 0.496144329 0.76168408 0.1821123
I'm hoping that there's a less roundabout manner of arriving at this result.
A possible compact alternative which can also apply more functions to each group:
DTnew <- dcast(DT[, pval := t.test(V ~ Group, .SD, alternative = "two.sided")$p.value, Day],
Day + pval ~ paste0("g",Group), fun = list(mean,sd), value.var = "V")
which gives:
> DTnew
Day pval V_mean_g1 V_mean_g2 V_sd_g1 V_sd_g2
1: 1 0.4763594 -0.11630634 0.178240714 0.7462975 0.4516087
2: 2 0.5715001 -0.29689807 0.082970631 1.3614177 0.2745783
3: 3 0.2295251 -0.48792449 -0.031328749 0.3723247 0.6703694
4: 4 0.5565573 0.33982242 0.080169698 0.5635136 0.7560959
5: 5 0.5498684 -0.07554433 0.308661427 0.9343230 1.0100788
6: 6 0.4814518 0.57694034 0.885968245 0.6457926 0.6773873
7: 7 0.8053066 0.29845913 0.116217727 0.9541060 1.2782210
8: 8 0.3549573 0.14827289 -0.319017581 0.5328734 0.9036501
9: 9 0.7290625 -0.21589411 -0.005785092 0.9639758 0.8859461
10: 10 0.9899833 0.84034529 0.850429982 0.6645952 1.5809149
A decomposition of the code:
First, a pval variable is added to the dataset with DT[, pval := t.test(V ~ Group, .SD, alternative = "two.sided")$p.value, Day]
Because DT is updated in place and by reference by the previous step, the dcast function can be applied to that directly.
In the casting formula, you specify the variables that need to stay in the current form on the RHS and the variable that needs to be spread over columns on the LHS.
With the fun argument you can specify which aggregation function has to be used on the value.var (here V). If multiple aggregation functions are needed, you can specify them in a list (e.g. list(mean,sd)). This can be any type of function. So, also cumstom made functions can be used.
If you want to remove the V_ from the column names, you can do:
names(DTnew) <- gsub("V_","",names(DTnew))
NOTE: I renamed the data.table to DT as it is often not wise to name your dataset after a function (check ?data)
While not a one-liner, you might consider doing your two processes separate and then merging the results. This prevents you from having to hardcode the group-names.
First, we calculate the means:
my_means <- dcast(data[,mean(V), by = .(Day, Group)],
Day~ paste0("Mean_Group", Group),value.var="V1")
Or in the less-convoluted way #Akrun mentioned in the comments, with some added formatting.
my_means <- dcast(Day~paste0("Mean_Group", Group), data=data,
fun.agg=mean, value.var="V")
Then the t-tests:
t_tests <- data[,.(p_value=t.test(V~Group)$p.value), by = Day]
And then merge:
output <- merge(my_means, t_tests)

Find the intersection of overlapping ranges in two tables using data.table function foverlaps

I would like to use foverlaps to find the intersecting ranges of two bed files, and collapse any rows containing overlapping ranges into a single row. In the example below I have two tables with genomic ranges. The tables are called "bed" files that have zero-based start coordinates and one-based ending positions of features in chromosomes. For example, START=9, STOP=20 is interpreted to span bases 10 through 20, inclusive. These bed files can contain millions of rows. The solution would need to give the same result, regardless of the order in which the two files to be intersected are provided.
First Table
> table1
CHROMOSOME START STOP
1: 1 1 10
2: 1 20 50
3: 1 70 130
4: X 1 20
5: Y 5 200
Second Table
> table2
CHROMOSOME START STOP
1: 1 5 12
2: 1 15 55
3: 1 60 65
4: 1 100 110
5: 1 130 131
6: X 60 80
7: Y 1 15
8: Y 10 50
I was thinking that the new foverlaps function could be a very fast way to find the intersecting ranges in these two table to produce a table that would look like:
Result Table:
> resultTable
CHROMOSOME START STOP
1: 1 5 10
2: 1 20 50
3: 1 100 110
4: Y 5 50
Is that possible, or is there a better way to do that in data.table?
I'd also like to first confirm that in one table, for any given CHROMOSOME, the STOP coordinate does not overlap with the start coordinate of the next row. For example, CHROMOSOME Y:1-15 and CHROMOSOME Y:10-50 would need to be collapsed to CHROMOSOME Y:1-50 (see Second Table Rows 7 and 8). This should not be the case, but the function should probably check for that. A real life example of how potential overlaps should be collapsed is below:
CHROM START STOP
1: 1 721281 721619
2: 1 721430 721906
3: 1 721751 722042
Desired output:
CHROM START STOP
1: 1 721281 722042
Functions to create example tables are as follows:
table1 <- data.table(
CHROMOSOME = as.character(c("1","1","1","X","Y")) ,
START = c(1,20,70,1,5) ,
STOP = c(10,50,130,20,200)
)
table2 <- data.table(
CHROMOSOME = as.character(c("1","1","1","1","1","X","Y","Y")) ,
START = c(5,15,60,100,130,60,1,10) ,
STOP = c(12,55,65,110,131,80,15,50)
)
#Seth provided the fastest way to solve the problem of intersection overlaps using the data.table foverlaps function. However, this solution did not take into account the fact that the input bed files may have overlapping ranges that needed to be reduced into single regions. #Martin Morgan solved that with his solution using the GenomicRanges package, that did both the intersecting and range reducing. However, Martin's solution didn't use the foverlaps function. #Arun pointed out that the overlapping ranges in different rows within a table was not currently possible using foverlaps. Thanks to the answers provided, and some additional research on stackoverflow, I came up with this hybrid solution.
Create example BED files without overlapping regions within each file.
chr <- c(1:22,"X","Y","MT")
#bedA contains 5 million rows
bedA <- data.table(
CHROM = as.vector(sapply(chr, function(x) rep(x,200000))),
START = rep(as.integer(seq(1,200000000,1000)),25),
STOP = rep(as.integer(seq(500,200000000,1000)),25),
key = c("CHROM","START","STOP")
)
#bedB contains 500 thousand rows
bedB <- data.table(
CHROM = as.vector(sapply(chr, function(x) rep(x,20000))),
START = rep(as.integer(seq(200,200000000,10000)),25),
STOP = rep(as.integer(seq(600,200000000,10000)),25),
key = c("CHROM","START","STOP")
)
Now create a new bed file containing the intersecting regions in bedA and bedB.
#This solution uses foverlaps
system.time(tmpA <- intersectBedFiles.foverlaps(bedA,bedB))
user system elapsed
1.25 0.02 1.37
#This solution uses GenomicRanges
system.time(tmpB <- intersectBedFiles.GR(bedA,bedB))
user system elapsed
12.95 0.06 13.04
identical(tmpA,tmpB)
[1] TRUE
Now, modify bedA and bedB such that they contain overlapping regions:
#Create overlapping ranges
makeOverlaps <- as.integer(c(0,0,600,0,0,0,600,0,0,0))
bedC <- bedA[, STOP := STOP + makeOverlaps, by=CHROM]
bedD <- bedB[, STOP := STOP + makeOverlaps, by=CHROM]
Test time to intersect bed files with overlapping ranges using either the foverlaps or GenomicRanges fucntions.
#This solution uses foverlaps to find the intersection and then run GenomicRanges on the result
system.time(tmpC <- intersectBedFiles.foverlaps(bedC,bedD))
user system elapsed
1.83 0.05 1.89
#This solution uses GenomicRanges
system.time(tmpD <- intersectBedFiles.GR(bedC,bedD))
user system elapsed
12.95 0.04 12.99
identical(tmpC,tmpD)
[1] TRUE
The winner: foverlaps!
FUNCTIONS USED
This is the function based upon foverlaps, and will only call the GenomicRanges function (reduceBed.GenomicRanges) if there are overlapping ranges (which are checked for using the rowShift function).
intersectBedFiles.foverlaps <- function(bed1,bed2) {
require(data.table)
bedKey <- c("CHROM","START","STOP")
if(nrow(bed1)>nrow(bed2)) {
bed <- foverlaps(bed1, bed2, nomatch = 0)
} else {
bed <- foverlaps(bed2, bed1, nomatch = 0)
}
bed[, START := pmax(START, i.START)]
bed[, STOP := pmin(STOP, i.STOP)]
bed[, `:=`(i.START = NULL, i.STOP = NULL)]
if(!identical(key(bed),bedKey)) setkeyv(bed,bedKey)
if(any(bed[, STOP+1 >= rowShift(START), by=CHROM][,V1], na.rm = T)) {
bed <- reduceBed.GenomicRanges(bed)
}
return(bed)
}
rowShift <- function(x, shiftLen = 1L) {
#Note this function was described in this thread:
#http://stackoverflow.com/questions/14689424/use-a-value-from-the-previous-row-in-an-r-data-table-calculation
r <- (1L + shiftLen):(length(x) + shiftLen)
r[r<1] <- NA
return(x[r])
}
reduceBed.GenomicRanges <- function(bed) {
setnames(bed,colnames(bed),bedKey)
if(!identical(key(bed),bedKey)) setkeyv(bed,bedKey)
grBed <- makeGRangesFromDataFrame(bed,
seqnames.field = "CHROM",start.field="START",end.field="STOP")
grBed <- reduce(grBed)
grBed <- data.table(
CHROM=as.character(seqnames(grBed)),
START=start(grBed),
STOP=end(grBed),
key = c("CHROM","START","STOP"))
return(grBed)
}
This function strictly used the GenomicRanges package, produces the same result, but is about 10 fold slower that the foverlaps funciton.
intersectBedFiles.GR <- function(bed1,bed2) {
require(data.table)
require(GenomicRanges)
bed1 <- makeGRangesFromDataFrame(bed1,
seqnames.field = "CHROM",start.field="START",end.field="STOP")
bed2 <- makeGRangesFromDataFrame(bed2,
seqnames.field = "CHROM",start.field="START",end.field="STOP")
grMerge <- suppressWarnings(intersect(bed1,bed2))
resultTable <- data.table(
CHROM=as.character(seqnames(grMerge)),
START=start(grMerge),
STOP=end(grMerge),
key = c("CHROM","START","STOP"))
return(resultTable)
}
An additional comparison using IRanges
I found a solution to collapse overlapping regions using IRanges but it is more than 10 fold slower than GenomicRanges.
reduceBed.IRanges <- function(bed) {
bed.tmp <- bed
bed.tmp[,group := {
ir <- IRanges(START, STOP);
subjectHits(findOverlaps(ir, reduce(ir)))
}, by=CHROM]
bed.tmp <- bed.tmp[, list(CHROM=unique(CHROM),
START=min(START),
STOP=max(STOP)),
by=list(group,CHROM)]
setkeyv(bed.tmp,bedKey)
bed[,group := NULL]
return(bed.tmp[, -(1:2)])
}
system.time(bedC.reduced <- reduceBed.GenomicRanges(bedC))
user system elapsed
10.86 0.01 10.89
system.time(bedD.reduced <- reduceBed.IRanges(bedC))
user system elapsed
137.12 0.14 137.58
identical(bedC.reduced,bedD.reduced)
[1] TRUE
foverlaps() will do nicely.
First set the keys for both of the tables:
setkey(table1, CHROMOSOME, START, STOP)
setkey(table2, CHROMOSOME, START, STOP)
Now join them using foverlaps() with nomatch = 0 to drop unmatched rows in table2.
resultTable <- foverlaps(table1, table2, nomatch = 0)
Next choose the appropriate values for START and STOP, and drop the extra columns.
resultTable[, START := pmax(START, i.START)]
resultTable[, STOP := pmin(STOP, i.STOP)]
resultTable[, `:=`(i.START = NULL, i.STOP = NULL)]
The overlapping STOP to a future START should be a different question. It's actually one that I have, so maybe I'll ask it and come back to it here when I have a good answer.
In case you're not stuck on a data.table solution, GenomicRanges
source("http://bioconductor.org/biocLite.R")
biocLite("GenomicRanges")
gives
> library(GenomicRanges)
> intersect(makeGRangesFromDataFrame(table1), makeGRangesFromDataFrame(table2))
GRanges object with 5 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] 1 [ 5, 10] *
[2] 1 [ 20, 50] *
[3] 1 [100, 110] *
[4] 1 [130, 130] *
[5] Y [ 5, 50] *
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
In most overlapping ranges problems in genomics, we have one large data set x (usually sequenced reads) and another smaller data set y (usually the gene model, exons, introns etc.). We are tasked with finding which intervals in x overlap with which intervals in y or how many intervals in x overlap for each y interval.
In foverlaps(), we don't have to setkey() on the larger data set x - it's quite an expensive operation. But y needs to have it's key set. For your case, from this example it seems like table2 is larger = x, and table1 = y.
require(data.table)
setkey(table1) # key columns = chr, start, end
ans = foverlaps(table2, table1, type="any", nomatch=0L)
ans[, `:=`(i.START = pmax(START, i.START),
i.STOP = pmin(STOP, i.STOP))]
ans = ans[, .(i.START[1L], i.STOP[.N]), by=.(CHROMOSOME, START, STOP)]
# CHROMOSOME START STOP V1 V2
# 1: 1 1 10 5 10
# 2: 1 20 50 20 50
# 3: 1 70 130 100 130
# 4: Y 5 200 5 50
But I agree it'd be great to be able to do this in one step. Not sure how yet, but maybe using additional values reduce and intersect for mult= argument.
Here's a solution entirely in data.table based on Pete's answer. It's actually slower than his solution that uses GenomicRanges and data.table, but still faster than the solution that uses only GenomicRanges.
intersectBedFiles.foverlaps2 <- function(bed1,bed2) {
require(data.table)
bedKey <- c("CHROM","START","STOP")
if(nrow(bed1)>nrow(bed2)) {
if(!identical(key(bed2),bedKey)) setkeyv(bed2,bedKey)
bed <- foverlaps(bed1, bed2, nomatch = 0)
} else {
if(!identical(key(bed1),bedKey)) setkeyv(bed1,bedKey)
bed <- foverlaps(bed2, bed1, nomatch = 0)
}
bed[,row_id:=1:nrow(bed)]
bed[, START := pmax(START, i.START)]
bed[, STOP := pmin(STOP, i.STOP)]
bed[, `:=`(i.START = NULL, i.STOP = NULL)]
setkeyv(bed,bedKey)
temp <- foverlaps(bed,bed)
temp[, `:=`(c("START","STOP"),list(min(START,i.START),max(STOP,i.STOP))),by=row_id]
temp[, `:=`(c("START","STOP"),list(min(START,i.START),max(STOP,i.STOP))),by=i.row_id]
out <- unique(temp[,.(CHROM,START,STOP)])
setkeyv(out,bedKey)
out
}

Using R to remove data which is below a quartile threshold

I am creating correlations using R, with the following code:
Values<-read.csv(inputFile, header = TRUE)
O<-Values$Abundance_O
S<-Values$Abundance_S
cor(O,S)
pear_cor<-round(cor(O,S),4)
outfile<-paste(inputFile, ".jpg", sep = "")
jpeg(filename = outfile, width = 15, height = 10, units = "in", pointsize = 10, quality = 75, bg = "white", res = 300, restoreConsole = TRUE)
rx<-range(0,20000000)
ry<-range(0,200000)
plot(rx,ry, ylab="S", xlab="O", main="O vs S", type="n")
points(O,S, col="black", pch=3, lwd=1)
mtext(sprintf("%s %.4f", "pearson: ", pear_cor), adj=1, padj=0, side = 1, line = 4)
dev.off()
pear_cor
I now need to find the lower quartile for each set of data and exclude data that is within the lower quartile. I would then like to rewrite the data without those values and use the new column of data in the correlation analysis (because I want to threshold the data by the lower quartile). If there is a way I can write this so that it is easy to change the threshold by applying arguments from Java (as I have with the input file name) that's even better!
Thank you so much.
I have now implicated the answer below and that is working, however I need to keep the pairs of data together for the correlation. Here is an example of my data (from csv):
Abundance_O Abundance_S
3635900.752 1390.883073
463299.4622 1470.92626
359101.0482 989.1609251
284966.6421 3248.832403
415283.663 2492.231265
2076456.856 10175.48946
620286.6206 5074.268802
3709754.717 269.6856808
803321.0892 118.2935093
411553.0203 4772.499758
50626.83554 17.29893001
337428.8939 203.3536852
42046.61549 152.1321255
1372013.047 5436.783169
939106.3275 7080.770535
96618.01393 1967.834701
229045.6983 948.3087208
4419414.018 23735.19352
So I need to exclude both values in the row if one does not meet my quartile threshold (0.25 quartile). So if the quartile for O was 45000 then the row "42046.61549,152.1321255" would be removed. Is this possible? If I read in both columns as a dataframe can I search each column separately? Or find the quartiles and then input that value into code to remove the appropriate rows?
Thanks again, and sorry for the evolution of the question!
Please try to provide a reproducible example, but if you have data in a data.frame, you can subset it using the quantile function as the logical test. For instance, in the following data we want to select only rows from the dataframe where the value of the measured variable 'Val' is above the bottom quartile:
# set.seed so you can reproduce these values exactly on your system
set.seed(39856)
df <- data.frame( ID = 1:10 , Val = runif(10) )
df
ID Val
1 1 0.76487516
2 2 0.59755578
3 3 0.94584374
4 4 0.72179297
5 5 0.04513418
6 6 0.95772248
7 7 0.14566118
8 8 0.84898704
9 9 0.07246594
10 10 0.14136138
# Now to select only rows where the value of our measured variable 'Val' is above the bottom 25% quartile
df[ df$Val > quantile(df$Val , 0.25 ) , ]
ID Val
1 1 0.7648752
2 2 0.5975558
3 3 0.9458437
4 4 0.7217930
6 6 0.9577225
7 7 0.1456612
8 8 0.8489870
# And check the value of the bottom 25% quantile...
quantile(df$Val , 0.25 )
25%
0.1424363
Although this is an old question, I came across it during research of my own and I arrived at a solution that someone may be interested in.
I first defined a function which will convert a numerical vector into its quantile groups. Parameter n determines the quantile length (n = 4 for quartiles, n = 10 for deciles).
qgroup = function(numvec, n = 4){
qtile = quantile(numvec, probs = seq(0, 1, 1/n))
out = sapply(numvec, function(x) sum(x >= qtile[-(n+1)]))
return(out)
}
Function example:
v = rep(1:20)
> qgroup(v)
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4
Consider now the following data:
dt = data.table(
A0 = runif(100),
A1 = runif(100)
)
We apply qgroup() across the data to obtain two quartile group columns:
cols = colnames(dt)
qcols = c('Q0', 'Q1')
dt[, (qcols) := lapply(.SD, qgroup), .SDcols = cols]
head(dt)
> A0 A1 Q0 Q1
1: 0.72121846 0.1908863 3 1
2: 0.70373594 0.4389152 3 2
3: 0.04604934 0.5301261 1 3
4: 0.10476643 0.1108709 1 1
5: 0.76907762 0.4913463 4 2
6: 0.38265848 0.9291649 2 4
Lastly, we only include rows for which both quartile groups are above the first quartile:
dt = dt[Q0 + Q1 > 2]

Resources