identify consecutively overlapping segments in R

I need to aggregate overlapping segments into a single segment spanning all connected segments.
Note that a simple foverlaps cannot detect connections between non-overlapping but connected segments; see the example for clarification. If it were to rain on the segments in my plot, I am looking for the stretches of dry ground.
So far I have solved this problem with an iterative algorithm, but I'm wondering whether there is a more elegant and straightforward way. I'm surely not the first one to face this.
I was thinking about a non-equi rolling join, but failed to implement that.
library(data.table)
(x <- data.table(start = c(41,43,43,47,47,48,51,52,54,55,57,59),
end = c(42,44,45,53,48,50,52,55,57,56,58,60)))
# start end
# 1: 41 42
# 2: 43 44
# 3: 43 45
# 4: 47 53
# 5: 47 48
# 6: 48 50
# 7: 51 52
# 8: 52 55
# 9: 54 57
# 10: 55 56
# 11: 57 58
# 12: 59 60
setorder(x, start)[, i := .I] # i is just a helper for plotting segments
plot(NA, xlim = range(x[,.(start,end)]), ylim = rev(range(x$i)))
do.call(segments, list(x$start, x$i, x$end, x$i))
x$grp <- c(1,3,3,2,2,2,2,2,2,2,2,4) # the grouping I am looking for
do.call(segments, list(x$start, x$i, x$end, x$i, col = x$grp))
(y <- x[, .(start = min(start), end = max(end)), k=grp])
# grp start end
# 1: 1 41 42
# 2: 2 47 58
# 3: 3 43 45
# 4: 4 59 60
do.call(segments, list(y$start, 12.2, y$end, 12.2, col = 1:4, lwd = 3))
EDIT:
That's brilliant, thanks. cummax & cumsum do the job; Uwe's answer is slightly better than David's comment.
end[.N] can give wrong results; try the example data x below.
max(end) is correct in all cases, and faster.
x <- data.table(start = c(11866, 12696, 13813, 14011, 14041),
end = c(13140, 14045, 14051, 14039, 14045))
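For instance, a minimal check on that data (a sketch, assuming data.table is loaded) shows the difference between end[.N] and max(end) within the single group:
setorder(x, start, end)
x[, grp := cumsum(cummax(shift(end, fill = 0)) < start)]
x[, .(start = start[1L], end_last = end[.N], end_max = max(end)), by = grp]
# all five rows fall into one group: end_last is 14045, while the true group end (end_max) is 14051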
min(start) and start[1L] give the same result (as x is ordered by start); the latter is faster.
Computing grp on the fly is significantly faster; unfortunately I need grp assigned.
cumsum(cummax(shift(end, fill = 0)) < start) is significantly faster than cumsum(c(0, start[-1L] > cummax(head(end, -1L)))).
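A sketch of how such timings can be compared (assuming the microbenchmark package is installed; on data this small the differences are noise, so a larger x would be used in practice):
library(microbenchmark)
microbenchmark(
  shift_variant = x[order(start, end), cumsum(cummax(shift(end, fill = 0)) < start)],
  head_variant  = x[order(start, end), cumsum(c(0, start[-1L] > cummax(head(end, -1L))))],
  times = 100L
)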
I did not test the package GenomicRanges solution.

The OP has requested to aggregate overlapping segments into a single segment spanning all connected segments.
Here is another solution which uses cummax() and cumsum() to identify groups of overlapping or adjacent segments:
x[order(start, end), grp := cumsum(cummax(shift(end, fill = 0)) < start)][
, .(start = min(start), end = max(end)), by = grp]
grp start end
1: 1 41 42
2: 2 43 45
3: 3 47 58
4: 4 59 60
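To see why this works, here is a sketch of the intermediate quantities on the OP's (sorted) data: the running maximum of all previous ends is compared with the current start, and a new group begins whenever that maximum lies strictly below the start.
x[order(start, end),
  .(start, end,
    prev_end_cummax = cummax(shift(end, fill = 0)),
    new_group = cummax(shift(end, fill = 0)) < start,
    grp = cumsum(cummax(shift(end, fill = 0)) < start))]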
Disclaimer: I have seen that clever approach somewhere else on SO but I cannot remember exactly where.
Edit:
As David Arenburg has pointed out, it is not necessary to create the grp variable separately. This can be done on-the-fly in the by = parameter:
x[order(start, end), .(start = min(start), end = max(end)),
by = .(grp = cumsum(cummax(shift(end, fill = 0)) < start))]
Visualisation
The OP's plot can be amended to also show the aggregated segments (quick and dirty):
plot(NA, xlim = range(x[,.(start,end)]), ylim = rev(range(x$i)))
do.call(segments, list(x$start, x$i, x$end, x$i))
x[order(start, end), .(start = min(start), end = max(end)),
by = .(grp = cumsum(cummax(shift(end, fill = 0)) < start))][
, segments(start, grp + 0.5, end, grp + 0.5, col = "red", lwd = 4)]

You can try a GenomicRanges approach. In the output each row is a group.
library(GenomicRanges)
x_gr <- with(x, GRanges(1, IRanges(start, end)))
as.data.table(reduce(x_gr, min.gapwidth=0))[,2:3]
start end
1: 41 42
2: 43 45
3: 47 58
4: 59 60
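If the group id needs to be written back onto x (the OP mentioned needing grp assigned), one sketch is to match the original ranges against the reduced ranges with findOverlaps(); x_red and grp_gr are names introduced here, and since each segment overlaps exactly one reduced range, subjectHits() gives the group index:
x_red <- reduce(x_gr, min.gapwidth = 0)
x[, grp_gr := subjectHits(findOverlaps(x_gr, x_red))]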
A visual inspection can be done using Gviz. One has to know that the package was built for biologists and genomic data; the underlying coordinates are DNA bases. Hence, one has to subtract 1 from the segment ends to get the correct plot.
library(Gviz)
ga <- Gviz::GenomeAxisTrack()
xgr <- with(x, GRanges(1, IRanges(start, end = end - 1)))
xgr_red <- reduce(xgr, min.gapwidth=1)
GT <- lapply(xgr, GeneRegionTrack)
GT_red <- lapply(xgr_red, GeneRegionTrack, fill = "lightblue")
plotTracks(c(ga, GT, GT_red),from = min(x$start), to = max(x$start)+2)

Related

R: Creating and Labelling Groups by Increments

I am working with the R programming language.
I have the following data:
set.seed(123)
my_data = data.frame(var1 = rnorm(100,100,100))
min = min(my_data$var1)
max = max(my_data$var1)
Here is what I am trying to do:
Starting from the smallest value of var1, I would like to create a variable that groups values of var1 by some "fixed increment" (e.g. by 10) until the maximum value of var1 is reached
Then, I would like to create another variable which labels each of these groups by the min/max value of that group
Here is my attempt to do this:
# create a vector of increments
breaks <- seq(min(my_data$var1), max(my_data$var1), by = 10)
# initialize new variables
my_data$class <- NA
my_data$label <- NA
# get the number of breaks
n <- length(breaks)
# Loop
for (i in 1:(n - 1)) {
  # find which "class" (i.e. break interval) each value of var1 is located within
  indices <- which(my_data$var1 > breaks[i] & my_data$var1 <= breaks[i + 1])
  # make the assignment
  my_data$class[indices] <- i
  # create labels
  my_data$label[indices] <- paste(breaks[i], breaks[i + 1])
}
The code seems to have run, but I am not sure if this is correct (I don't think I have done this correctly because I see some NA's).
Can someone please show me how to do this correctly?
Thanks!
This could be done with a non-equi join
library(data.table)
my_data1 <- copy(my_data)
setDT(my_data1)[data.table(start = breaks, end = shift(breaks,
type = "lead", fill = last(breaks))), c("indices", "label") := .(.GRP, paste(start, end)),
on = .(var1 > start, var1 <= end), by = .EACHI]
-output
> head(my_data1)
var1 indices label
1: 43.95244 18 39.0831124359188 49.0831124359188
2: 76.98225 21 69.0831124359188 79.0831124359188
3: 255.87083 39 249.083112435919 259.083112435919
4: 107.05084 24 99.0831124359188 109.083112435919
5: 112.92877 25 109.083112435919 119.083112435919
6: 271.50650 41 269.083112435919 279.083112435919
compare it with OP's for loop
> head(my_data)
var1 class label
1 43.95244 18 39.0831124359188 49.0831124359188
2 76.98225 21 69.0831124359188 79.0831124359188
3 255.87083 39 249.083112435919 259.083112435919
4 107.05084 24 99.0831124359188 109.083112435919
5 112.92877 25 109.083112435919 119.083112435919
6 271.50650 41 269.083112435919 279.083112435919
Regarding the NAs in the output, they are a result of the breaks produced by seq:
> breaks
[1] -130.9168876 -120.9168876 -110.9168876 -100.9168876 -90.9168876 -80.9168876 -70.9168876 -60.9168876 -50.9168876 -40.9168876 -30.9168876
[12] -20.9168876 -10.9168876 -0.9168876 9.0831124 19.0831124 29.0831124 39.0831124 49.0831124 59.0831124 69.0831124 79.0831124
[23] 89.0831124 99.0831124 109.0831124 119.0831124 129.0831124 139.0831124 149.0831124 159.0831124 169.0831124 179.0831124 189.0831124
[34] 199.0831124 209.0831124 219.0831124 229.0831124 239.0831124 249.0831124 259.0831124 269.0831124 279.0831124 289.0831124 299.0831124
[45] 309.0831124
Note that the maximum break is 309.083, and var1 > -130.9168876 returns FALSE for the value that is exactly equal to the minimum; it should be var1 >= -130.9168876 instead. Also, seq stops short of the true maximum of var1, so we concatenate max at the end and then take the unique values (in case there are duplicates):
breaks <- unique(c(seq(min, max, by = 10), max))
Now, we do the same
> setDT(my_data1)[data.table(start = breaks, end = shift(breaks,
+ type = "lead", fill = last(breaks))), c("indices", "label") := .(.GRP, paste(start, end)),
+ on = .(var1 >= start, var1 <= end), by = .EACHI]
>
> head(my_data1)
var1 indices label
1: 43.95244 18 39.0831124359188 49.0831124359188
2: 76.98225 21 69.0831124359188 79.0831124359188
3: 255.87083 39 249.083112435919 259.083112435919
4: 107.05084 24 99.0831124359188 109.083112435919
5: 112.92877 25 109.083112435919 119.083112435919
6: 271.50650 41 269.083112435919 279.083112435919
> head(my_data)
var1 class label
1 43.95244 18 39.0831124359188 49.0831124359188
2 76.98225 21 69.0831124359188 79.0831124359188
3 255.87083 39 249.083112435919 259.083112435919
4 107.05084 24 99.0831124359188 109.083112435919
5 112.92877 25 109.083112435919 119.083112435919
6 271.50650 41 269.083112435919 279.083112435919
> my_data1[is.na(indices)]
Empty data.table (0 rows and 3 cols): var1,indices,label
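For reference, the same binning can also be sketched in base R with findInterval() on the corrected breaks (class2 and label2 are just illustrative names; note these bins are left-closed, whereas the OP's loop used right-closed intervals):
my_data$class2 <- findInterval(my_data$var1, breaks, rightmost.closed = TRUE)
my_data$label2 <- paste(breaks[my_data$class2], breaks[my_data$class2 + 1])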

Non-equi joins - comparing two data frames in R

I would like to filter a data frame based on the values present in a second data frame.
For example, match the rows from the first data frame whose value in column "BP" is higher than the "start_pos" value and smaller than the "end_pos" value (or just smaller than "end_pos") of a row in the second data frame.
I need to repeat this procedure for all the rows in the second data frame. Currently, I am doing this with a for loop, but I would like to do it in a single command.
Data frame 1
CHR BP
29 836019
29 4417047
29 7589996
29 11052921
29 14009294
29 33174196
Data frame 2
start_pos end_pos gene_id
19774 19899 ENSBTAG00000046619
34627 35558 ENSBTAG00000006858
69695 71121 ENSBTAG00000039257
83323 84281 ENSBTAG00000035349
124849 179713 ENSBTAG00000001753
264298 264843 ENSBTAG00000005540
for(j in 1:nrow(tmp_markers)){
  temp_out_markers <- tmp_markers[j, ]
  tmp_search <- tmp_gene[which(
    (tmp_markers[j, "BP"] >= tmp_gene[, "start_pos"] & tmp_markers[j, "BP"] <= tmp_gene[, "end_pos"]) |
      (tmp_markers[j, "BP"] + interval >= tmp_gene[, "start_pos"] & tmp_markers[j, "BP"] + interval <= tmp_gene[, "end_pos"]) |
      (tmp_markers[j, "BP"] + interval >= tmp_gene[, "start_pos"] & tmp_markers[j, "BP"] + interval <= tmp_gene[, "end_pos"]) |
      (tmp_markers[j, "BP"] + interval >= tmp_gene[, "start_pos"] & tmp_markers[j, "BP"] + interval >= tmp_gene[, "end_pos"] & tmp_markers[j, "BP"] <= tmp_gene[, "start_pos"]) |
      (tmp_markers[j, "BP"] - interval <= tmp_gene[, "end_pos"] & tmp_markers[j, "BP"] - interval >= tmp_gene[, "start_pos"]) |
      (tmp_markers[j, "BP"] - interval <= tmp_gene[, "end_pos"] & tmp_markers[j, "BP"] - interval <= tmp_gene[, "start_pos"] & tmp_markers[j, "BP"] >= tmp_gene[, "end_pos"])
  ), ]
  if(nrow(tmp_search) > 0){
    temp_out <- cbind(temp_out_markers[rep(seq_len(nrow(tmp_search))), ], tmp_search)
    temp_out[, "Distance_from_gene_start"] <- temp_out[, "BP"] - temp_out[, "start_pos"]
    temp_out[, "Distance_from_gene_end"] <- temp_out[, "BP"] - temp_out[, "end_pos"]
    output_genes <- rbind(temp_out, output_genes)
  }
}
At the end, I want a data frame with all the rows that are within my tested intervals.
As I stated in a comment, your mock data won't result in a match, as the smallest BP value (836019) is larger than the largest end_pos (264843).
It could also be that I misunderstood your problem altogether!
I understand that you want to match the rows in df1 to those in df2 for which BP >= start_pos and BP <= end_pos. If it's so, we can achieve that using the non-equi joins provided by package data.table.
library(data.table)
result <- dt1[dt2,
.(BP, CHR, gene_id),
on = .(BP >= start_pos, BP <= end_pos),
nomatch = NULL,
by = .EACHI]
setnames(result, 1:2, names(dt2)[1:2])
result
start_pos end_pos BP CHR gene_id
1: 0.000000 2.000000 0 29 ABCD01
2: 4.571429 6.571429 6 30 ABCD03
3: 11.428571 13.428571 12 31 ABCD06
4: 16.000000 18.000000 18 32 ABCD08
5: 22.857143 24.857143 24 33 ABCD011
6: 29.714286 31.714286 30 34 ABCD014
In case you need the full 15 rows of dt2, simply omit the nomatch = NULL part.
DATA USED:
dt1 <- data.table(CHR = 29:34,
BP = seq(0, 30, length.out = 6),
key = "BP")
dt2 <- data.table(start_pos = seq(0, 32, length.out = 15),
gene_id = paste0("ABCD", rep(0, 3), 1:15))
dt2[, end_pos := start_pos + 2]
setcolorder(dt2, c(1, 3, 2))
Alternative with foverlaps
As #r2evans mentioned in a comment, data.table has another function, foverlaps, that can be useful here. It checks whether a range overlaps with one in another table, so we need a small trick to create a 0-width range in dt1:
dt1[, BP2 := BP]
We also need to have keyed data.tables:
setkey(dt1, "BP", "BP2")
setkey(dt2, "start_pos", "end_pos")
And then pass it to the workhorse:
foverlaps(dt1, dt2)
start_pos end_pos gene_id CHR BP BP2
1: 0.000000 2.000000 ABCD01 29 0 0
2: 4.571429 6.571429 ABCD03 30 6 6
3: 11.428571 13.428571 ABCD06 31 12 12
4: 16.000000 18.000000 ABCD08 32 18 18
5: 22.857143 24.857143 ABCD011 33 24 24
6: 29.714286 31.714286 ABCD014 34 30 30
Of course we can get rid of BP2 later on by BP2 := NULL.
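For instance (a sketch, storing the result in a hypothetical res first):
res <- foverlaps(dt1, dt2)
res[, BP2 := NULL][]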
If we want the full 15 rows of dt2, then it's just a matter of inverting the order of the objects in the call:
foverlaps(dt2, dt1)
Thank you very much!
I ended up with this solution and it is working very well.
foverlaps(tmp_gene, tmp_markers, by.x = c("start_pos", "end_pos"),
          by.y = key(tmp_markers), nomatch = 0)
Cheers.

ifelse didn't work in dataframe in R

I have a question about ifelse with a data.frame in R. I checked several SO posts about it, and unfortunately none of those solutions fitted my case.
My case is a conditional calculation in a data frame, but it returns the warning "the condition has length > 1 and only the first element will be used" even after I used the ifelse function in R, which should work perfectly according to the SO posts I checked.
Here is my sample code:
library(scales)
head(temp[, 2:3])
previous current
1 0 10
2 50 57
3 92 177
4 84 153
5 30 68
6 162 341
temp$change = ifelse(temp$previous > 0, rate(temp$previous, temp$current), temp$current)
rate = function(yest, tod){
value = tod/yest
if(value>1){
return(paste("+", percent(value-1), sep = ""))
}
else{
return(paste("-", percent(1-value), sep = ""))
}
}
So if I run the ifelse one, I get the following result:
head(temp[, 2:4])
previous current change
1 0 10 10
2 50 57 +NaN%
3 92 177 +NaN%
4 84 153 +NaN%
5 30 68 +NaN%
6 162 341 +NaN%
So my question is, how should I deal with this? I tried assigning 0 to the last column before running ifelse, but it still failed.
Many thanks in advance!
Try the following two segments; both should do what you want. Maybe it is the second one you are looking for.
library(scales)
set.seed(1)
temp <- data.frame(previous = rnorm(5), current = rnorm(5))
rate <- function(i) {
yest <- temp$previous[i]
tod <- temp$current[i]
if (yest <= 0)
return(tod)
value = tod/yest
if (value>1) {
return(paste("+", percent(value-1), sep = ""))
} else {
return(paste("-", percent(1-value), sep = ""))
}
}
temp$change <- unlist(lapply(1:dim(temp)[1], rate))
Second:
ind <- which(temp$previous > 0)
temp$change <- temp$current
temp$change[ind] <- unlist(lapply(ind,
function(i) rate(temp$previous[i], temp$current[i])))
In the second segment, the function rate is the same as you've coded it.
Here's another way to do the same
# 1: load dplyr
#if needed install.packages("dplyr")
library(dplyr)
# 2: I recreate your data
your_dataframe = as_tibble(cbind(c(0,50,92,84,30,162),
c(10,57,177,153,68,341))) %>%
rename(previous = V1, current = V2)
# 3: obtain the change using your conditions
your_dataframe %>%
mutate(change = ifelse(previous > 0,
ifelse(current/previous > 1,
paste0("+%", (current/previous-1)*100),
paste0("-%", (current/previous-1)*100)),
current))
Result:
# A tibble: 6 x 3
previous current change
<dbl> <dbl> <chr>
1 0 10 10
2 50 57 +%14
3 92 177 +%92.3913043478261
4 84 153 +%82.1428571428571
5 30 68 +%126.666666666667
6 162 341 +%110.493827160494
Only the first element in value is evaluated, so the output of rate solely depends on the first row of temp.
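To make that concrete, here is one vectorized rewrite of rate() (a sketch; rate_vec is a hypothetical name and the scales package is assumed to be loaded). All branches are built element-wise, so no scalar if() is involved:
rate_vec <- function(yest, tod) {
  out <- as.character(tod)   # default: rows with no previous value keep the current value
  pos <- yest > 0
  value <- tod[pos] / yest[pos]
  out[pos] <- ifelse(value > 1,
                     paste0("+", percent(value - 1)),
                     paste0("-", percent(1 - value)))
  out
}
temp$change <- rate_vec(temp$previous, temp$current)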
Adopting the advice I received from warm-hearted SO users, I vectorized some of my functions and it worked! Raise a glass to the SO community!
Here is the solution:
temp$rate = ifelse(temp$previous > 0, ifelse(temp$current/temp$previous > 1,
temp$current/temp$previous - 1,
1 - temp$current/temp$previous),
temp$current)
This will return rate in scientific notation. If "regular" notation is needed, here is an update:
temp$rate = format(temp$rate, scientific = F)

R function to calculate nearest neighbor distance given [inconsistent] constraint?

I have data consisting of tree growth measurements (diameter and height) for trees at known X & Y coordinates. I'd like to determine the distance to each tree's nearest neighbor of equal or greater size.
I've seen other SE questions asking about nearest neighbor calculations (e.g., see here, here, here, here, etc.), but none specify constraints on the nearest neighbor to be searched.
Is there a function (or other work around) that would allow me to determine the distance of a point's nearest neighbor given that nearest point meets some criteria (e.g., must be equal to or greater in size than the point of interest)?
[An even more complex set of constraints would be even more helpful...]
For my example: specifying that a tree must also be in the same plot as the tree of interest or is the same species as the tree of interest
I'd do it with non-equi joins and data.table.
EDIT: (fyi, this requires data.table 1.9.7, which you can get from github)
EDIT2: did it with a copy of the data.table, since it seems like it was joining on its own thresholds. I'll fix that in the future, but this works for now.
library(data.table)
dtree <- data.table(id = 1:1000,
x = runif(1000),
y = runif(1000),
height = rnorm(1000,mean = 100,sd = 10),
species = sample(LETTERS[1:3],1000,replace = TRUE),
plot = sample(1:3,1000, replace = TRUE))
dtree_self <- copy(dtree)
dtree_self[,thresh1 := height + 10]
dtree_self[,thresh2 := height - 10]
# Join on a range, must be a cartesian join, since there are many candidates
test <- dtree[dtree_self, on = .(height >= thresh2,
height <= thresh1),
allow.cartesian = TRUE]
# Calculate the distance
test[, dist := (x - i.x)**2 + (y - i.y)**2]
# Exclude identical matches and
# Take the minimum distance grouped by id
final <- test[id != i.id, .SD[which.min(dist)],by = id]
The final dataset contains each pair, according to the given thresholds.
EDIT:
With Additional variables:
If you want to join on additional parameters, this allows you to do it. (It's probably even faster if you additionally join on things like plot or species, since the cartesian join will be smaller.)
Here's an example joining on two additional categorical variables, species and plot:
library(data.table)
dtree <- data.table(id = 1:1000,
x = runif(1000),
y = runif(1000),
height = rnorm(1000,mean = 100,sd = 10),
species = sample(LETTERS[1:3],1000,replace = TRUE),
plot = sample(1:3,1000, replace = TRUE))
dtree_self <- copy(dtree)
dtree_self[,thresh1 := height + 10]
dtree_self[,thresh2 := height - 10]
# Join on a range, must be a cartesian join, since there are many candidates
test <- dtree[dtree_self, on = .(height >= thresh2,
height <= thresh1,
species == species,
plot == plot),
nomatch = NA,
allow.cartesian = TRUE]
# Calculate the distance
test[, dist := (x - i.x)**2 + (y - i.y)**2]
# Exclude identical matches and
# Take the minimum distance grouped by id
final <- test[id != i.id, .SD[which.min(dist)],by = id]
final
> final
id x y height species plot height.1 i.id i.x i.y i.height dist
1: 3 0.4837348 0.4325731 91.53387 C 2 111.53387 486 0.5549221 0.4395687 101.53387 0.005116568
2: 13 0.8267298 0.3137061 94.58949 C 2 114.58949 754 0.8408547 0.2305702 104.58949 0.007111079
3: 29 0.2905729 0.4952757 89.52128 C 2 109.52128 333 0.2536760 0.5707272 99.52128 0.007054301
4: 37 0.4534841 0.5249862 89.95493 C 2 109.95493 72 0.4807242 0.6056771 99.95493 0.007253044
5: 63 0.1678515 0.8814829 84.77450 C 2 104.77450 289 0.1151764 0.9728488 94.77450 0.011122404
---
994: 137 0.8696393 0.2226888 66.57792 C 2 86.57792 473 0.4467795 0.6881008 76.57792 0.395418724
995: 348 0.3606249 0.1245749 110.14466 A 2 130.14466 338 0.1394011 0.1200064 120.14466 0.048960849
996: 572 0.6562758 0.1387882 113.61821 A 2 133.61821 348 0.3606249 0.1245749 123.61821 0.087611511
997: 143 0.9170504 0.1171652 71.39953 C 3 91.39953 904 0.6954973 0.3690599 81.39953 0.112536771
998: 172 0.6834473 0.6221259 65.52187 A 2 85.52187 783 0.4400028 0.9526355 75.52187 0.168501816
>
NOTE: in the final answer there are columns height and height.1; these result from data.table's non-equi join and appear to represent the lower and upper boundary of the join range, respectively.
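If those boundary columns are not wanted in the result, they can simply be dropped (a sketch, using the names as printed above):
final[, c("height", "height.1") := NULL]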
A memory-efficient solution
One of the issues here for #theforestecologist was that this requires a lot of memory (in that case, there were an additional 42 columns being multiplied by the cartesian join, which caused memory issues).
However, we can do this in a more memory-efficient way by using .EACHI (I believe), since we then do not load the full join result into memory. That solution follows:
library(data.table)
dtree <- data.table(id = 1:1000,
x = runif(1000),
y = runif(1000),
height = rnorm(1000,mean = 100,sd = 10),
species = sample(LETTERS[1:3],1000,replace = TRUE),
plot = sample(1:3,1000, replace = TRUE))
dtree_self <- copy(dtree)
dtree_self[,thresh1 := height + 10]
dtree_self[,thresh2 := height - 10]
# In order to navigate the sometimes unusual nature of scoping inside a
# data.table join, I set the second table to have its own uniquely named id
dtree_self[,id2 := id]
dtree_self[,id := NULL]
# for clarity inside the brackets,
# I define the squared euclid distance
eucdist <- function(x,xx,y,yy) (x - xx)**2 + (y - yy)**2
# Join on a range, must be a cartesian join, since there are many candidates
# Return a table of matches, using .EACHI to keep from loading too much into mem
test <- dtree[dtree_self, on = .(height >= thresh2,
height <= thresh1,
species,
plot),
.(id2, id[{z = eucdist(x,i.x,y,i.y); mz <- min(z[id2 != id]); mz == z}]),
by = .EACHI,
nomatch = NA,
allow.cartesian = TRUE]
# join the metadata back onto each id
test <- dtree[test, on = .(id = V2), nomatch = NA]
test <- dtree[test, on = .(id = id2), nomatch = NA]
> test
id x y height species plot i.id i.x i.y i.height i.species i.plot i.height.2 i.height.1 i.species.1 i.plot.1
1: 1 0.17622235 0.66547312 84.68450 B 2 965 0.17410840 0.63219350 93.60226 B 2 74.68450 94.68450 B 2
2: 2 0.04523011 0.33813054 89.46288 B 2 457 0.07267547 0.35725229 88.42827 B 2 79.46288 99.46288 B 2
3: 3 0.24096368 0.32649256 103.85870 C 3 202 0.20782303 0.38422814 94.35898 C 3 93.85870 113.85870 C 3
4: 4 0.53160655 0.06636979 101.50614 B 1 248 0.47382417 0.01535036 103.74101 B 1 91.50614 111.50614 B 1
5: 5 0.83426727 0.55380451 101.93408 C 3 861 0.78210747 0.52812487 96.71422 C 3 91.93408 111.93408 C 3
This way we should keep total memory usage low.

Combine information from two data frames with dplyr

I need some help with dplyr.
I have two data frames - one huge, with several time series A,B,... in there (LargeDF), and a second one (Categories) with time intervals (left and right boundaries).
I would like to add another column to LargeDF, labeled left_boundary, containing the appropriate boundary value, like so:
LargeDF
ts timestamp signal # left_boundary
1 A 0.3209338 10.43279 # 0
2 A 1.4791524 10.34295 # 1
3 A 2.6007494 10.71601 # 2
and
Categories
ts left right
1 A 0 1
2 A 1 2
3 A 2 3
The code I came up with is
LargeDF %>%
group_by(ts) %>%
do(myFUN(., Categories))
# calls this ...
myFUN <- function(Large, Categ) {
CategTS <- Categ %>%
filter(ts == Large[1, "ts"][[1]])
Large %>%
group_by(timestamp) %>% # this is bothering me...
mutate(left_boundary = CategTS$left[CategTS$left < timestamp
& timestamp < CategTS$right])
}
but it is super slow for large time series. I would really like to lose the group_by(timestamp), as the timestamps are unique within each ts anyway.
Does someone see a better solution? That would be much appreciated.
# Code for making the example data frames ...
library("dplyr")
n <- 10; series <- c("A", "B", "C")
LargeDF <- data.frame(
ts = rep(series, each = n)
, timestamp = runif(n*length(series), max = 4)
, signal = runif(n*length(series), min = 10, max = 11)
) %>% group_by(ts) %>% arrange(timestamp)
m <- 7
Categories <- data.frame(
ts = rep(series, each = m)
, left = rep(seq(1 : m) - 1, length(series))
, right = rep(seq(1 : m), length(series))
)
Update (data.table and my slightly modified mockup)
So, I tried the suggestions from #DavidArenburg on a quick/dirty mockup-example first, but had the problem that some timestamps were binned twice (into successive categories/intervals).
> foverlaps(d, c, type="any", by.x = c("timestamp", "timestamp2"))
left right value timestamp timestamp2
1: 0.9 1.9 0.1885459 1 1
2: 0.9 1.9 0.0542375 2 2 # binned here
3: 1.9 2.9 0.0542375 2 2 # and here as well
13: 19.9 25.9 0.4579986 20 20
I then read about minoverlap = 1L as a default and realized that a normal timestamp is >> 1.
> as.numeric(Sys.time())
[1] 1429022267
Therefore, if I shifted everything to larger values (e.g. n <- 10 in the example below), everything went fine.
left right value timestamp timestamp2
1: 9 19 0.64971126 10 10
2: 19 29 0.75994751 20 20
3: 29 99 0.98276462 30 30
9: 199 259 0.89816165 200 200
With my real data, everything went smoothly, so thanks again.
## Code for my data.table example -----
n <- 1
d <- data.table( value = runif(9),
timestamp = c(1, 2, 3, 5, 7, 10, 15, 18, 20)*n,
timestamp2 = c(1, 2, 3, 5, 7, 10, 15, 18, 20)*n)
c <- data.table(left = c(0.9, 1.9, 2.9, 9.9, 19.9, 25.9)*n,
right = c(1.9, 2.9, 9.9, 19.9, 25.9, 33.9)*n)
setkey(c, left, right)
foverlaps(d, c, type="any", by.x = c("timestamp", "timestamp2"))
Update 2 (JOIN, then FILTER, within dplyr)
I tested the suggestion from #aosmith to use the dplyr function left_join() to create one (very) large DF, then filter() this again. Very quickly, I ran into memory issues:
Error: std::bad_alloc
Probably, this approach would be a good idea for smaller tables - as the syntax is very nice (but this, again, is personal preference). I'll go for the data.table solution in this case. Thanks again for all suggestions.
dplyr isn't suitable for such operations; try data.table's foverlaps function instead:
library(data.table)
class(LargeDF) <- "data.frame" ## Removing all the dplyr classes
setDT(LargeDF)[, `:=`(left = timestamp, right = timestamp)] # creating min and max boundaries in the large table
setkey(setDT(Categories)) # keying by all columns (necessary for `foverlaps` to work)
LargeDF[, left_boundary := foverlaps(LargeDF, Categories)$left][] # Creating left_boundary
# ts timestamp signal left right left_boundary
# 1: A 0.46771516 10.72175 0.46771516 0.46771516 0
# 2: A 0.58841492 10.35459 0.58841492 0.58841492 0
# 3: A 1.14494484 10.50301 1.14494484 1.14494484 1
# 4: A 1.18298225 10.82431 1.18298225 1.18298225 1
# 5: A 1.69822678 10.04780 1.69822678 1.69822678 1
# 6: A 1.83189609 10.75001 1.83189609 1.83189609 1
# 7: A 1.90947475 10.94715 1.90947475 1.90947475 1
# 8: A 2.73305266 10.14449 2.73305266 2.73305266 2
# 9: A 3.02371968 10.17724 3.02371968 3.02371968 3
# ...
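The helper boundary columns are no longer needed afterwards and can be dropped (a sketch):
LargeDF[, c("left", "right") := NULL]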
