Non-equi joins - comparing two data frames in R

I would like to filter a data frame based on the values present in a second data frame.
For example, match the rows of the first data frame whose "BP" value falls between the "start_pos" and "end_pos" values of a row in the second data frame (greater than or equal to "start_pos" and less than or equal to "end_pos").
I need to repeat this check for all the rows in the second data frame. Currently, I am doing this with a for loop, but I would like to do it in a single command.
Data frame 1
CHR BP
29 836019
29 4417047
29 7589996
29 11052921
29 14009294
29 33174196
Data frame 2
start_pos end_pos gene_id
19774 19899 ENSBTAG00000046619
34627 35558 ENSBTAG00000006858
69695 71121 ENSBTAG00000039257
83323 84281 ENSBTAG00000035349
124849 179713 ENSBTAG00000001753
264298 264843 ENSBTAG00000005540
for(j in 1:nrow(tmp_markers)){
  temp_out_markers <- tmp_markers[j, ]
  tmp_search <- tmp_gene[which(
    (tmp_markers[j, "BP"] >= tmp_gene[, "start_pos"] & tmp_markers[j, "BP"] <= tmp_gene[, "end_pos"]) |
    (tmp_markers[j, "BP"] + interval >= tmp_gene[, "start_pos"] & tmp_markers[j, "BP"] + interval <= tmp_gene[, "end_pos"]) |
    (tmp_markers[j, "BP"] + interval >= tmp_gene[, "start_pos"] & tmp_markers[j, "BP"] + interval <= tmp_gene[, "end_pos"]) |
    (tmp_markers[j, "BP"] + interval >= tmp_gene[, "start_pos"] & tmp_markers[j, "BP"] + interval >= tmp_gene[, "end_pos"] & tmp_markers[j, "BP"] <= tmp_gene[, "start_pos"]) |
    (tmp_markers[j, "BP"] - interval <= tmp_gene[, "end_pos"] & tmp_markers[j, "BP"] - interval >= tmp_gene[, "start_pos"]) |
    (tmp_markers[j, "BP"] - interval <= tmp_gene[, "end_pos"] & tmp_markers[j, "BP"] - interval <= tmp_gene[, "start_pos"] & tmp_markers[j, "BP"] >= tmp_gene[, "end_pos"])
  ), ]
  if(nrow(tmp_search) > 0){
    temp_out <- cbind(temp_out_markers[rep(seq_len(nrow(tmp_search))), ], tmp_search)
    temp_out[, "Distance_from_gene_start"] <- temp_out[, "BP"] - temp_out[, "start_pos"]
    temp_out[, "Distance_from_gene_end"] <- temp_out[, "BP"] - temp_out[, "end_pos"]
    output_genes <- rbind(temp_out, output_genes)
  }
}
At the end, I want a data frame with all the rows that are within my tested intervals.

As I stated in a comment, your mock data won't result in a match, as the smallest BP value (836019) is larger than the largest end_pos (264843).
It could also be that I misunderstood your problem altogether!
I understand that you want to match the rows in df1 to those in df2 for which BP >= start_pos and BP <= end_pos. If it's so, we can achieve that using the non-equi joins provided by package data.table.
library(data.table)
result <- dt1[dt2,
.(BP, CHR, gene_id),
on = .(BP >= start_pos, BP <= end_pos),
nomatch = NULL,
by = .EACHI]
setnames(result, 1:2, names(dt2)[1:2])
result
start_pos end_pos BP CHR gene_id
1: 0.000000 2.000000 0 29 ABCD01
2: 4.571429 6.571429 6 30 ABCD03
3: 11.428571 13.428571 12 31 ABCD06
4: 16.000000 18.000000 18 32 ABCD08
5: 22.857143 24.857143 24 33 ABCD011
6: 29.714286 31.714286 30 34 ABCD014
In case you need the full 15 rows of dt2, simply omit the nomatch = NULL part.
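For example, the same call without nomatch (a sketch based on the code above) should keep unmatched dt2 rows, with NA in the dt1 columns:
result_all <- dt1[dt2,
                  .(BP, CHR, gene_id),
                  on = .(BP >= start_pos, BP <= end_pos),
                  by = .EACHI]
setnames(result_all, 1:2, names(dt2)[1:2])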
DATA USED:
dt1 <- data.table(CHR = 29:34,
BP = seq(0, 30, length.out = 6),
key = "BP")
dt2 <- data.table(start_pos = seq(0, 32, length.out = 15),
gene_id = paste0("ABCD", rep(0, 3), 1:15))
dt2[, end_pos := start_pos + 2]
setcolorder(dt2, c(1, 3, 2))
Alternative with foverlaps
As #r2evans mentioned in a comment, data.table has another function, foverlaps, that can be useful here. It checks whether a range overlaps with one in another table, so we need to do a small trick to create a zero-width range in dt1:
dt1[, BP2 := BP]
We also need to have keyed data.tables:
setkey(dt1, "BP", "BP2")
setkey(dt2, "start_pos", "end_pos")
And then pass everything to the workhorse:
foverlaps(dt1, dt2)
start_pos end_pos gene_id CHR BP BP2
1: 0.000000 2.000000 ABCD01 29 0 0
2: 4.571429 6.571429 ABCD03 30 6 6
3: 11.428571 13.428571 ABCD06 31 12 12
4: 16.000000 18.000000 ABCD08 32 18 18
5: 22.857143 24.857143 ABCD011 33 24 24
6: 29.714286 31.714286 ABCD014 34 30 30
Of course we can get rid of BP2 later on by BP2 := NULL.
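For instance:
dt1[, BP2 := NULL]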
If we want the full 15 rows of dt2, then it's just a matter of inverting the order of the objects in the call:
foverlaps(dt2, dt1)

Thank you very much!
I ended with this solution and it is working very well.
foverlaps(tmp_gene, tmp_markers,
          by.x = c("start_pos", "end_pos"),
          by.y = key(tmp_markers),
          nomatch = 0)
Cheers.
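For the call above to work, tmp_markers presumably has to be keyed on a two-column search window built from BP and interval; a hypothetical sketch (win_start and win_end are illustrative names, not shown in the thread):
tmp_markers[, `:=`(win_start = BP - interval, win_end = BP + interval)]
setkey(tmp_markers, win_start, win_end)  # key(tmp_markers) then returns these two columns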

Related

How to use mice for multiple imputation of missing values in longitudinal data?

I have a dataset with a repeatedly measured continuous outcome and some covariates of different classes, like in the example below.
Id y Date Soda Team
1 -0.4521 1999-02-07 Coke Eagles
1 0.2863 1999-04-15 Pepsi Raiders
2 0.7956 1999-07-07 Coke Raiders
2 -0.8248 1999-07-26 NA Raiders
3 0.8830 1999-05-29 Pepsi Eagles
4 0.1303 2005-03-04 NA Cowboys
5 0.1375 2013-11-02 Coke Cowboys
5 0.2851 2015-06-23 Coke Eagles
5 -0.3538 2015-07-29 Pepsi NA
6 0.3349 2002-10-11 NA NA
7 -0.1756 2005-01-11 Pepsi Eagles
7 0.5507 2007-10-16 Pepsi Cowboys
7 0.5132 2012-07-13 NA Cowboys
7 -0.5776 2017-11-25 Coke Cowboys
8 0.5486 2009-02-08 Coke Cowboys
I am trying to multiply impute the missing values in Soda and Team using the mice package. As I understand it, because MI is not a causal model, there is no concept of dependent and independent variables. I am not sure how to set up this MI process using mice. I would like suggestions or advice from others who have encountered missing data in a repeated-measures setting like this on how they used mice to tackle the problem. Thanks in advance.
Edit
This is what I have tried so far, but this does not capture the repeated measure part of the dataset.
library(mice)
init = mice(dat, maxit = 0)
methd = init$method
predM = init$predictorMatrix
methd[c("Soda")] = "logreg"
methd[c("Team")] = "logreg"
imputed = mice(dat, method = methd, predictorMatrix = predM, m = 5)
There are several options to accomplish what you are asking for. I have decided to impute missing values in covariates in the so-called 'wide' format. I will illustrate this with the following worked example, which you can easily apply to your own data.
Let's first make a reprex. Here, I use the longitudinal Mayo Clinic Primary Biliary Cirrhosis Data (pbc2), which comes with the JM package. This data is organized in the so-called 'long' format, meaning that each patient i has multiple rows and each row contains a measurement of variable x measured on time j. Your dataset is also in the long format. In this example, I assume that pbc2$serBilir is our outcome variable.
# install.packages('JM')
library(JM)
# note: use function(x) instead of \(x) if you use a version of R <4.1.0
# missing values per column
miss_abs <- \(x) sum(is.na(x))
miss_perc <- \(x) round(sum(is.na(x)) / length(x) * 100, 1L)
miss <- cbind('Number' = apply(pbc2, 2, miss_abs), '%' = apply(pbc2, 2, miss_perc))
# --------------------------------
> miss[which(miss[, 'Number'] > 0),]
Number %
ascites 60 3.1
hepatomegaly 61 3.1
spiders 58 3.0
serChol 821 42.2
alkaline 60 3.1
platelets 73 3.8
According to this output, 6 variables in pbc2 contain at least one missing value. Let's pick alkaline from these. We also need patient id and the time variable years.
# subset
pbc_long <- subset(pbc2, select = c('id', 'years', 'alkaline', 'serBilir'))
# sort ascending based on id and, within each id, years
pbc_long <- with(pbc_long, pbc_long[order(id, years), ])
# ------------------------------------------------------
> head(pbc_long, 5)
id years alkaline serBilir
1 1 1.09517 1718 14.5
2 1 1.09517 1612 21.3
3 2 14.15234 7395 1.1
4 2 14.15234 2107 0.8
5 2 14.15234 1711 1.0
Just by quickly eyeballing, we observe that years do not seem to differ within subjects, even though variables were repeatedly measured. For the sake of this example, let's add a little bit of time to all rows of years but the first measurement.
set.seed(1)
# add little bit of time to each row of 'years' but the first row
new_years <- lapply(split(pbc_long, pbc_long$id), \(x) {
add_time <- 1:(length(x$years) - 1L) + rnorm(length(x$years) - 1L, sd = 0.25)
c(x$years[1L], x$years[-1L] + add_time)
})
# replace the original 'years' variable
pbc_long$years <- unlist(new_years)
# integer time variable needed to store repeated measurements as separate columns
pbc_long$measurement_number <- unlist(sapply(split(pbc_long, pbc_long$id), \(x) 1:nrow(x)))
# only keep the first 4 repeated measurements per patient
pbc_long <- subset(pbc_long, measurement_number %in% 1:4)
Since we will perform our multiple imputation in wide format (meaning that each participant i has one row and repeated measurements on x are stored in j different columns, so xj columns in total), we have to convert the data from long to wide. Now that we have prepared our data, we can use reshape to do this for us.
# convert long format into wide format
v_names <- c('years', 'alkaline', 'serBilir')
pbc_wide <- reshape(pbc_long,
idvar = 'id',
timevar = "measurement_number",
v.names = v_names, direction = "wide")
# -----------------------------------------------------------------
> head(pbc_wide, 4)[, 1:9]
id years.1 alkaline.1 serBilir.1 years.2 alkaline.2 serBilir.2 years.3 alkaline.3
1 1 1.095170 1718 14.5 1.938557 1612 21.3 NA NA
3 2 14.152338 7395 1.1 15.198249 2107 0.8 15.943431 1711
12 3 2.770781 516 1.4 3.694434 353 1.1 5.148726 218
16 4 5.270507 6122 1.8 6.115197 1175 1.6 6.716832 1157
Now let's multiply impute the missing values in our covariates.
library(mice)
# Setup-run
ini <- mice(pbc_wide, maxit = 0)
meth <- ini$method
pred <- ini$predictorMatrix
visSeq <- ini$visitSequence
# avoid collinearity issues by letting only variables measured
# at the same point in time predict each other
pred[grep("1", rownames(pred), value = TRUE),
grep("2|3|4", colnames(pred), value = TRUE)] <- 0
pred[grep("2", rownames(pred), value = TRUE),
grep("1|3|4", colnames(pred), value = TRUE)] <- 0
pred[grep("3", rownames(pred), value = TRUE),
grep("1|2|4", colnames(pred), value = TRUE)] <- 0
pred[grep("4", rownames(pred), value = TRUE),
grep("1|2|3", colnames(pred), value = TRUE)] <- 0
# variables that should not be imputed
pred[c("id", grep('^year', names(pbc_wide), value = TRUE)), ] <- 0
# variables should not serve as predictors
pred[, c("id", grep('^year', names(pbc_wide), value = TRUE))] <- 0
# multiply imputed missing values ------------------------------
imp <- mice(pbc_wide, pred = pred, m = 10, maxit = 20, seed = 1)
# Time difference of 2.899244 secs
As can be seen in the traceplots (which can be obtained with plot(imp)), the algorithm has converged nicely. Refer to this section of Stef van Buuren's book for more info on convergence.
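For instance, the standard mice diagnostics can be inspected with the calls below (shown as a sketch, since the original figures are not reproduced here):
plot(imp)         # traceplots of chain means and standard deviations per imputed variable
densityplot(imp)  # compare densities of observed vs. imputed values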
Now we need to convert back the multiply imputed data (which is in wide format) to long format, so that we can use it for analyses. We also need to make sure that we exclude all rows that had missing values for our outcome variable serBilir, because we do not want to use imputed values of the outcome.
# need unlisted data
implong <- complete(imp, 'long', include = FALSE)
# 'smart' way of getting all the names of the repeated variables in a usable format
v_names <- as.data.frame(matrix(apply(
expand.grid(grep('ye|alk|ser', names(implong), value = TRUE)),
1, paste0, collapse = ''), nrow = 4, byrow = TRUE), stringsAsFactors = FALSE)
names(v_names) <- names(pbc_long)[2:4]
# convert back to long format
longlist <- lapply(split(implong, implong$.imp),
reshape, direction = 'long',
varying = as.list(v_names),
v.names = names(v_names),
idvar = 'id', times = 1:4)
# logical that is TRUE if our outcome was not observed
# which should be based on the original, unimputed data
orig_data <- reshape(imp$data, direction = 'long',
varying = as.list(v_names),
v.names = names(v_names),
idvar = 'id', times = 1:4)
orig_data$logical <- is.na(orig_data$serBilir)
# merge into the list of imputed long-format datasets:
longlist <- lapply(longlist, merge, y = subset(orig_data, select = c(id, time, logical)))
# exclude rows for which logical == TRUE
longlist <- lapply(longlist, \(x) subset(x, !logical))
Finally, convert longlist back into a mids using datalist2mids from the miceadds package.
imp <- miceadds::datalist2mids(longlist)
# ----------------
> imp$loggedEvents
NULL
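From here, the mids object can be analysed and pooled in the usual mice workflow; a minimal sketch (the model formula below is purely illustrative and not part of the original answer):
fit <- with(imp, lm(serBilir ~ alkaline + time))  # fit the model on each imputed dataset
summary(pool(fit))                                # pool results across imputations (Rubin's rules)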

identify consecutively overlapping segments in R

I need to aggregate overlapping segments into a single segment ranging all connected segments.
Note that a simple foverlaps cannot detect connections between non-overlapping but connected segments; see the example for clarification. If it rained on the segments in my plot, I would be looking for the stretches of dry ground.
So far I have solved this problem with an iterative algorithm, but I'm wondering if there is a more elegant and straightforward way; I'm surely not the first one to face this problem.
I was thinking about a non-equi rolling join, but failed to implement it.
library(data.table)
(x <- data.table(start = c(41,43,43,47,47,48,51,52,54,55,57,59),
end = c(42,44,45,53,48,50,52,55,57,56,58,60)))
# start end
# 1: 41 42
# 2: 43 44
# 3: 43 45
# 4: 47 53
# 5: 47 48
# 6: 48 50
# 7: 51 52
# 8: 52 55
# 9: 54 57
# 10: 55 56
# 11: 57 58
# 12: 59 60
setorder(x, start)[, i := .I] # i is just a helper for plotting segments
plot(NA, xlim = range(x[,.(start,end)]), ylim = rev(range(x$i)))
do.call(segments, list(x$start, x$i, x$end, x$i))
x$grp <- c(1,3,3,2,2,2,2,2,2,2,2,4) # the grouping I am looking for
do.call(segments, list(x$start, x$i, x$end, x$i, col = x$grp))
(y <- x[, .(start = min(start), end = max(end)), k=grp])
# grp start end
# 1: 1 41 42
# 2: 2 47 58
# 3: 3 43 45
# 4: 4 59 60
do.call(segments, list(y$start, 12.2, y$end, 12.2, col = 1:4, lwd = 3))
EDIT:
That's brilliant, thanks. cummax & cumsum do the job; Uwe's answer is slightly better than David's comment.
end[.N] can give wrong results; try the example data x below.
max(end) is correct in all cases, and faster.
x <- data.table(start = c(11866, 12696, 13813, 14011, 14041),
end = c(13140, 14045, 14051, 14039, 14045))
min(start) and start[1L] give the same result (as x is ordered by start); the latter is faster.
Computing grp on the fly is significantly faster; unfortunately, I need grp assigned.
cumsum(cummax(shift(end, fill = 0)) < start) is significantly faster than cumsum(c(0, start[-1L] > cummax(head(end, -1L)))).
I did not test the package GenomicRanges solution.
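The timing comparison above can be reproduced with something along these lines (a sketch assuming the microbenchmark package is installed; absolute timings will of course differ):
library(microbenchmark)
microbenchmark(
  cummax_shift = x[order(start, end), cumsum(cummax(shift(end, fill = 0)) < start)],
  cummax_head  = x[order(start, end), cumsum(c(0, start[-1L] > cummax(head(end, -1L))))]
)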
The OP has requested to aggregate overlapping segments into a single segment ranging all connected segments.
Here is another solution which uses cummax() and cumsum() to identify groups of overlapping or adjacent segments:
x[order(start, end), grp := cumsum(cummax(shift(end, fill = 0)) < start)][
, .(start = min(start), end = max(end)), by = grp]
grp start end
1: 1 41 42
2: 2 43 45
3: 3 47 58
4: 4 59 60
Disclaimer: I have seen that clever approach somewhere else on SO but I cannot remember exactly where.
Edit:
As David Arenburg has pointed out, it is not necessary to create the grp variable separately. This can be done on-the-fly in the by = parameter:
x[order(start, end), .(start = min(start), end = max(end)),
by = .(grp = cumsum(cummax(shift(end, fill = 0)) < start))]
Visualisation
OP's plot can be amended to show also the aggregated segments (quick and dirty):
plot(NA, xlim = range(x[,.(start,end)]), ylim = rev(range(x$i)))
do.call(segments, list(x$start, x$i, x$end, x$i))
x[order(start, end), .(start = min(start), end = max(end)),
by = .(grp = cumsum(cummax(shift(end, fill = 0)) < start))][
, segments(start, grp + 0.5, end, grp + 0.5, "red", , 4)]
You can try a GenomicRanges approach. In the output each row is a group.
library(GenomicRanges)
x_gr <- with(x, GRanges(1, IRanges(start, end)))
as.data.table(reduce(x_gr, min.gapwidth=0))[,2:3]
start end
1: 41 42
2: 43 45
3: 47 58
4: 59 60
A visual inspection can be done using Gviz. One has to know that the package has been built for biologists and genomic data; the underlying units are DNA bases. Hence, one has to subtract 1 from the segment ends to get the correct plot.
library(Gviz)
ga <- Gviz::GenomeAxisTrack()
xgr <- with(x, GRanges(1, IRanges(start, end = end - 1)))
xgr_red <- reduce(xgr, min.gapwidth=1)
ga <- GenomeAxisTrack()
GT <- lapply(xgr, GeneRegionTrack)
GT_red <- lapply(xgr_red, GeneRegionTrack, fill = "lightblue")
plotTracks(c(ga, GT, GT_red),from = min(x$start), to = max(x$start)+2)

R: Using different DFs to get third DF with specific info from first 2

I have two data frames, df1 has information about a publication's year, outlet name, total articles in this publication in a year, and a cumulative sum of articles over the period of time I'm studying. df2 has a random sample of article IDs, with potential values ranging from 1 to the total number of articles given by df1$cumsum.
What I need to do is to grab each article ID in df2 and identify in which publication and year it falls under, using the information contained in df1.
Here's a minimally reproducible example:
set.seed(890)
df1 <- NULL
df1$year <- c(2000:2009, 2000:2009)
df1$outlet <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,2,2,2,2,2,2,2,2,2)
df1$article_total <- sample(1:200, 20, replace = T)
df1$cumsum <- cumsum(df1$article_total)
df1 <- as.data.frame(df1)
df2 <- NULL
df2$art_num <- sample(1:2102, 100, replace = T) # get random sample of article IDs for the total number of articles I have in this db
df2 <- as.data.frame(df2)
Ideally, I would also like to calculate an article's ID in each year. For example, in the data above, outlet 1 has 14 articles in the year 2000 and 168 in 2001 (cumsum = 183). If I have an article ID of 156, I would like to know that it is the 142nd article in the year 2001 of publication 1. And so on and so forth for every article ID I have in this database.
I was thinking I should do this with a for loop, but I'm 100% lost in writing it. Here's what I began writing, but I have a feeling I'm not on the right track with it:
for (i in 1:nrow(df2)) {
  article_number <- df2$art_num[i]
  if (article_number %in% df1$cumsum) { # note: cumsum should be an interval before doing this?
    # get article number, year, publication in new df
    # also calculate article ID in each year/publication
  }
}
Thanks in advance for any help! I'm still lost with writing loops in R...
#######################
EDITED EXAMPLE as per Frank's suggestion
set.seed(890)
df1 <- NULL
df1$year <- c(2000:2002, 2000:2002)
df1$outlet <- c(1, 1, 1, 2,2,2)
df1$article_total <- sample(1:50, 6, replace = T)
df1$cumsum <- cumsum(df1$article_total)
df1 <- as.data.frame(df1)
df2 <- NULL
df2$art_id <- c(66, 120, 77, 156, 24)
df2 <- as.data.frame(df2)
Here's the output I'm looking for:
art_id outlet year article_number
1 66 1 2002 19
2 120 2 2000 35
3 77 1 2002 30
4 156 2 2001 35
5 24 1 2000 20
This example shows my ideal output in df3, which I calculated/built by hand. It has one column with the article's ID, the appropriate outlet, the year, and a new variable art_number. This is different from the article ID in that I calculated it from df1$cumsum and df3$art_id. In this example, the first row shows that the first article in my database has an ID of 66. I obtain an art_number value of 19 because this article (id = 66) is the 19th article published in the year 2002 by outlet 1. I calculated this value by looking at the article ID, locating the year and outlet based on df1$cumsum, and then subtracting the art_id value from the df1$cumsum value for the previous year. So for this specific article, I calculated df3$art_number = df3$art_id[1,1] - df1$cumsum[2,4]
I need to do this calculation for every article in my data base so I don't do this process by hand forever.
I think your data structure makes sense, though it would be easier with one additional column, for the first article in a year and outlet:
library(data.table)
setDT(df1); setDT(df2)
df1[, art_cstart := shift(cumsum(article_total), fill=0L) + 1L]
year outlet article_total cumsum art_cstart
1: 2000 1 4 4 1
2: 2001 1 43 47 5
3: 2002 1 38 85 48
4: 2000 2 36 121 86
5: 2001 2 39 160 122
6: 2002 2 8 168 161
Now, we can do a rolling update join, "rolling" each art_id forward to the first cumsum at or above it and computing each desired column:
df2[, c("outlet", "year", "art_num") := df1[df2, on=.(cumsum = art_id), roll=-Inf, .(
x.year,
x.outlet,
i.art_id - x.art_cstart + 1L
)]]
art_id outlet year art_num
1: 66 1 2002 19
2: 120 2 2000 35
3: 77 1 2002 30
4: 156 2 2001 35
5: 24 1 2001 20
How it works
x[i, on=, roll=, j] is the syntax for a join, looking up each row of i in x.
In this join, j evaluates to a list of columns; .(...) is shorthand for list(...).
Column assignment is done with (colnames) := .(...).
The assignment is to the existing table df2 instead of unnecessarily creating a new table.
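As a toy illustration of the roll = -Inf behaviour (a sketch with made-up values, not part of the original answer): each art_id without an exact match is matched to the next cumsum at or above it.
library(data.table)
lookup <- data.table(cumsum = c(4L, 47L, 85L), label = c("a", "b", "c"))
queries <- data.table(art_id = c(24L, 66L))
lookup[queries, on = .(cumsum = art_id), roll = -Inf]
# 24 rolls forward to cumsum 47 -> "b"; 66 rolls forward to cumsum 85 -> "c"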
For details on how data.table syntax works, see the startup messages...
> library(data.table)
data.table 1.10.4
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
Release notes, videos and slides: http://r-datatable.com
I think this is the code you need:
df3 <- data.frame(matrix(ncol = 3, nrow = 0))
colnames(df3) <- c("articleNumber", "year", "publication")
for (i in 1:nrow(df2)) {
  for (j in 1:(nrow(df1) - 1)) {
    if ((df2$art_num[i] >= df1$cumsum[j]) && (df2$art_num[i] <= df1$cumsum[j + 1])) {
      # note: cumsum should be an interval before doing this? NOT REALLY SURE
      # WHAT YOU NEED HERE
      # get article number, year, publication in new df
      df3[i, 1] <- df2$art_num[i]
      df3[i, 2] <- df1$year[j]
      df3[i, 3] <- df1$outlet[j]
      # also calculate article ID in each year/publication ISN'T THIS
      # art_num?
    }
  }
}

R function to calculate nearest neighbor distance given [inconsistent] constraint?

I have data consisting of tree growth measurements (diameter and height) for trees at known X & Y coordinates. I'd like to determine the distance to each tree's nearest neighbor of equal or greater size.
I've seen other SE questions asking about nearest neighbor calculations (e.g., see here, here, here, here, etc.), but none specify constraints on the nearest neighbor to be searched.
Is there a function (or other work around) that would allow me to determine the distance of a point's nearest neighbor given that nearest point meets some criteria (e.g., must be equal to or greater in size than the point of interest)?
[An even more complex set of constraints would be even more helpful...]
For my example: specifying that a tree must also be in the same plot as the tree of interest or is the same species as the tree of interest
I'd do it with non-equi joins and data.table.
EDIT: (fyi, this requires data.table 1.9.7, which you can get from github)
EDIT2: did it with a copy of the data.table, since it seems like it was joining on its own thresholds. I'll fix that in the future, but this works for now.
library(data.table)
dtree <- data.table(id = 1:1000,
x = runif(1000),
y = runif(1000),
height = rnorm(1000,mean = 100,sd = 10),
species = sample(LETTERS[1:3],1000,replace = TRUE),
plot = sample(1:3,1000, replace = TRUE))
dtree_self <- copy(dtree)
dtree_self[,thresh1 := height + 10]
dtree_self[,thresh2 := height - 10]
# Join on a range, must be a cartesian join, since there are many candidates
test <- dtree[dtree_self, on = .(height >= thresh2,
height <= thresh1),
allow.cartesian = TRUE]
# Calculate the distance
test[, dist := (x - i.x)**2 + (y - i.y)**2]
# Exclude identical matches and
# Take the minimum distance grouped by id
final <- test[id != i.id, .SD[which.min(dist)],by = id]
The final dataset contains, for each id, its nearest neighbour within the given height thresholds (note that dist is the squared Euclidean distance; take sqrt() if actual distances are needed).
EDIT:
With Additional variables:
If you want to join on additional parameters, this allows you to do it. (It's probably even faster if you additionally join on things like plot or species, since the cartesian join will be smaller.)
Here's an example joining on two additional categorical variables, species and plot:
library(data.table)
dtree <- data.table(id = 1:1000,
x = runif(1000),
y = runif(1000),
height = rnorm(1000,mean = 100,sd = 10),
species = sample(LETTERS[1:3],1000,replace = TRUE),
plot = sample(1:3,1000, replace = TRUE))
dtree_self <- copy(dtree)
dtree_self[,thresh1 := height + 10]
dtree_self[,thresh2 := height - 10]
# Join on a range, must be a cartesian join, since there are many candidates
test <- dtree[dtree_self, on = .(height >= thresh2,
height <= thresh1,
species == species,
plot == plot),
nomatch = NA,
allow.cartesian = TRUE]
# Calculate the distance
test[, dist := (x - i.x)**2 + (y - i.y)**2]
# Exclude identical matches and
# Take the minimum distance grouped by id
final <- test[id != i.id, .SD[which.min(dist)],by = id]
final
> final
id x y height species plot height.1 i.id i.x i.y i.height dist
1: 3 0.4837348 0.4325731 91.53387 C 2 111.53387 486 0.5549221 0.4395687 101.53387 0.005116568
2: 13 0.8267298 0.3137061 94.58949 C 2 114.58949 754 0.8408547 0.2305702 104.58949 0.007111079
3: 29 0.2905729 0.4952757 89.52128 C 2 109.52128 333 0.2536760 0.5707272 99.52128 0.007054301
4: 37 0.4534841 0.5249862 89.95493 C 2 109.95493 72 0.4807242 0.6056771 99.95493 0.007253044
5: 63 0.1678515 0.8814829 84.77450 C 2 104.77450 289 0.1151764 0.9728488 94.77450 0.011122404
---
994: 137 0.8696393 0.2226888 66.57792 C 2 86.57792 473 0.4467795 0.6881008 76.57792 0.395418724
995: 348 0.3606249 0.1245749 110.14466 A 2 130.14466 338 0.1394011 0.1200064 120.14466 0.048960849
996: 572 0.6562758 0.1387882 113.61821 A 2 133.61821 348 0.3606249 0.1245749 123.61821 0.087611511
997: 143 0.9170504 0.1171652 71.39953 C 3 91.39953 904 0.6954973 0.3690599 81.39953 0.112536771
998: 172 0.6834473 0.6221259 65.52187 A 2 85.52187 783 0.4400028 0.9526355 75.52187 0.168501816
NOTE: in the final answer there are columns height and height.1; these result from data.table's non-equi join and represent the lower and upper height boundaries of the search window, respectively.
A memory-efficient solution
One of the issues here for #theforestecologist was that this approach requires a lot of memory
(in that case, there were an additional 42 columns being multiplied by the cartesian join, which caused memory issues).
However, we can do this in a more memory-efficient way by using .EACHI (I believe), since we then do not load the full join result into memory. That solution follows:
library(data.table)
dtree <- data.table(id = 1:1000,
x = runif(1000),
y = runif(1000),
height = rnorm(1000,mean = 100,sd = 10),
species = sample(LETTERS[1:3],1000,replace = TRUE),
plot = sample(1:3,1000, replace = TRUE))
dtree_self <- copy(dtree)
dtree_self[,thresh1 := height + 10]
dtree_self[,thresh2 := height - 10]
# In order to navigate the sometimes unusual nature of scoping inside a
# data.table join, I set the second table to have its own uniquely named id
dtree_self[,id2 := id]
dtree_self[,id := NULL]
# for clarity inside the brackets,
# I define the squared euclid distance
eucdist <- function(x,xx,y,yy) (x - xx)**2 + (y - yy)**2
# Join on a range, must be a cartesian join, since there are many candidates
# Return a table of matches, using .EACHI to keep from loading too much into mem
test <- dtree[dtree_self, on = .(height >= thresh2,
height <= thresh1,
species,
plot),
.(id2, id[{z = eucdist(x,i.x,y,i.y); mz <- min(z[id2 != id]); mz == z}]),
by = .EACHI,
nomatch = NA,
allow.cartesian = TRUE]
# join the metadata back onto each id
test <- dtree[test, on = .(id = V2), nomatch = NA]
test <- dtree[test, on = .(id = id2), nomatch = NA]
> test
id x y height species plot i.id i.x i.y i.height i.species i.plot i.height.2 i.height.1 i.species.1 i.plot.1
1: 1 0.17622235 0.66547312 84.68450 B 2 965 0.17410840 0.63219350 93.60226 B 2 74.68450 94.68450 B 2
2: 2 0.04523011 0.33813054 89.46288 B 2 457 0.07267547 0.35725229 88.42827 B 2 79.46288 99.46288 B 2
3: 3 0.24096368 0.32649256 103.85870 C 3 202 0.20782303 0.38422814 94.35898 C 3 93.85870 113.85870 C 3
4: 4 0.53160655 0.06636979 101.50614 B 1 248 0.47382417 0.01535036 103.74101 B 1 91.50614 111.50614 B 1
5: 5 0.83426727 0.55380451 101.93408 C 3 861 0.78210747 0.52812487 96.71422 C 3 91.93408 111.93408 C 3
This way we should keep total memory usage low.

Combine information from two data frames with dplyr

I need some help with dplyr.
I have two data frames - one huge, with several time series A,B,... in there (LargeDF), and a second one (Categories) with time intervals (left and right boundaries).
I would like to add another column to LargeDF, labeled leftBoundary, containing the appropriate boundary value, like so:
LargeDF
ts timestamp signal # left_boundary
1 A 0.3209338 10.43279 # 0
2 A 1.4791524 10.34295 # 1
3 A 2.6007494 10.71601 # 2
and
Categories
ts left right
1 A 0 1
2 A 1 2
3 A 2 3
The code I came up with is
LargeDF %>%
group_by(ts) %>%
do(myFUN(., Categories))
# calls this ...
myFUN <- function(Large, Categ) {
CategTS <- Categ %>%
filter(ts == Large[1, "ts"][[1]])
Large %>%
group_by(timestamp) %>% # this is bothering me...
mutate(left_boundary = CategTS$left[CategTS$left < timestamp
& timestamp < CategTS$right])
}
but it is super slow for large time series. I would really like to lose the group_by(timestamp), as timestamps are unique within each ts anyway.
Does someone see a better solution? That would be much appreciated.
# Code for making the example data frames ...
library("dplyr")
n <- 10; series <- c("A", "B", "C")
LargeDF <- data.frame(
ts = rep(series, each = n)
, timestamp = runif(n*length(series), max = 4)
, signal = runif(n*length(series), min = 10, max = 11)
) %>% group_by(ts) %>% arrange(timestamp)
m <- 7
Categories <- data.frame(
ts = rep(series, each = m)
, left = rep(seq(1 : m) - 1, length(series))
, right = rep(seq(1 : m), length(series))
)
Update (data.table and my slightly modified mockup)
So, I tried the suggestions from #DavidArenburg on a quick/dirty mockup-example first, but had the problem that some timestamps were binned twice (into successive categories/intervals).
> foverlaps(d, c, type="any", by.x = c("timestamp", "timestamp2"))
left right value timestamp timestamp2
1: 0.9 1.9 0.1885459 1 1
2: 0.9 1.9 0.0542375 2 2 # binned here
3: 1.9 2.9 0.0542375 2 2 # and here as well
13: 19.9 25.9 0.4579986 20 20
I then read about minoverlap = 1L as a default and realized that a normal timestamp is >> 1.
> as.numeric(Sys.time())
[1] 1429022267
Therefore, if I shifted everything to larger values (e.g. n <- 10 in the example below), everything went fine.
left right value timestamp timestamp2
1: 9 19 0.64971126 10 10
2: 19 29 0.75994751 20 20
3: 29 99 0.98276462 30 30
9: 199 259 0.89816165 200 200
With my real data, everything went smoothly, so thanks again.
## Code for my data.table example -----
n <- 1
d <- data.table( value = runif(9),
timestamp = c(1, 2, 3, 5, 7, 10, 15, 18, 20)*n,
timestamp2 = c(1, 2, 3, 5, 7, 10, 15, 18, 20)*n)
c <- data.table(left = c(0.9, 1.9, 2.9, 9.9, 19.9, 25.9)*n,
right = c(1.9, 2.9, 9.9, 19.9, 25.9, 33.9)*n)
setkey(c, left, right)
foverlaps(d, c, type="any", by.x = c("timestamp", "timestamp2"))
Update 2 (JOIN, then FILTER, within dplyr)
I tested the suggestion from #aosmith to use the dplyr function left_join() to create one (very) large DF, then filter() this again. Very quickly, I ran into memory issues:
Error: std::bad_alloc
Probably, this approach would be a good idea for smaller tables - as the syntax is very nice (but this, again, is personal preference). I'll go for the data.table solution in this case. Thanks again for all suggestions.
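For reference, the join-then-filter idea looks roughly like this (a sketch of the suggested approach; as noted above, it ran out of memory on the real data):
LargeDF %>%
  left_join(Categories, by = "ts") %>%
  filter(left < timestamp, timestamp < right) %>%
  rename(left_boundary = left) %>%
  select(-right)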
dplyr isn't suitable for such operations; try data.table's foverlaps function instead:
library(data.table)
class(LargeDF) <- "data.frame" ## Removing all the dplyr classes
setDT(LargeDF)[, `:=`(left = timestamp, right = timestamp)] # creating min and max boundaries in the large table
setkey(setDT(Categories)) # keying by all columns (necessary for `foverlaps` to work)
LargeDF[, left_boundary := foverlaps(LargeDF, Categories)$left][] # Creating left_boundary
# ts timestamp signal left right left_boundary
# 1: A 0.46771516 10.72175 0.46771516 0.46771516 0
# 2: A 0.58841492 10.35459 0.58841492 0.58841492 0
# 3: A 1.14494484 10.50301 1.14494484 1.14494484 1
# 4: A 1.18298225 10.82431 1.18298225 1.18298225 1
# 5: A 1.69822678 10.04780 1.69822678 1.69822678 1
# 6: A 1.83189609 10.75001 1.83189609 1.83189609 1
# 7: A 1.90947475 10.94715 1.90947475 1.90947475 1
# 8: A 2.73305266 10.14449 2.73305266 2.73305266 2
# 9: A 3.02371968 10.17724 3.02371968 3.02371968 3
# ...
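The helper left/right columns that were added to LargeDF can be dropped afterwards if they are no longer needed:
LargeDF[, c("left", "right") := NULL]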
