I would like to subset rows of my data
library(data.table); set.seed(333); n <- 100
dat <- data.table(id=1:n, group=rep(1:2,each=n/2), x=runif(n,100,120), y=runif(n,200,220), z=runif(n,300,320))
> head(dat)
id group x y z
1: 1 1 109.3400 208.6732 308.7595
2: 2 1 101.6920 201.0989 310.1080
3: 3 1 119.4697 217.8550 313.9384
4: 4 1 111.4261 205.2945 317.3651
5: 5 1 100.4024 212.2826 305.1375
6: 6 1 114.4711 203.6988 319.4913
in several stages, unless a stage results in an empty subset; in that case, I would like to skip that specific subsetting step. In an earlier question, Frank found a great solution for this:
f = function(x, ..., verbose=FALSE){
  L = substitute(list(...))[-1]
  mon = data.table(cond = as.character(L))[, skip := FALSE]
  for (i in seq_along(L)){
    d = eval( substitute(x[cond, verbose=v], list(cond = L[[i]], v = verbose)) )
    if (nrow(d)){
      x = d
    } else {
      mon[i, skip := TRUE]
    }
  }
  print(mon)
  return(x)
}
where I can enter the data, and the cut-offs for each variable manually.
> f(dat, x > 119, y > 219, y > 1e6)
cond skip
1: x > 119 FALSE
2: y > 219 FALSE
3: y > 1e+06 TRUE
id group x y z
1: 55 2 119.2634 219.0044 315.6556
I now wonder how this (or something even better!) could be applied to a case where the cut-offs are in a second data.table
c <- data.table(group=1:2, x=c(110,119), y=c(210,219), z=c(310,319))
> c
group x y z
1: 1 110 210 310
2: 2 119 219 319
and specified for each group separately.
If I were to use f(.), I thought of joining c into dat first, but I can't figure it out. Perhaps there is a smarter way entirely.
First, I would change how c is constructed. You currently have it set up with one column per filter, but a long format would allow you to use multiple filters on the same column similar to your initial example (i.e. two filters on y):
c <- data.table(group=c(1,2,1,2,1,2,1),variable = c("x","x","y","y","z","z","y"), c_val = c(110,119,210,219,310,319,1e6))
c[, c_id := 1:.N]
c
group variable c_val c_id
1: 1 x 110 1
2: 2 x 119 2
3: 1 y 210 3
4: 2 y 219 4
5: 1 z 310 5
6: 2 z 319 6
7: 1 y 1000000 7
You can then merge your filters onto your data:
dat_cut <- melt(dat, id.vars = c("id", "group"), value.name = "dat_val")
output <- merge(dat_cut, c, by = c("group","variable"), allow.cartesian = TRUE)
This line then tests the filters; you can expand it if you want richer filter logic (greater than / less than, equal, etc.), and you can encode that logic back into c:
output <- output[dat_val > c_val]
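For instance, a minimal sketch of one way to carry the operator along in c (the op column and the match.fun() call are illustrative assumptions, not part of the original answer):
# hypothetical: store each filter's operator as text in the long filter table
c[, op := ">"]   # could equally be "<", ">=", "==", ... per filter row
output <- merge(dat_cut, c, by = c("group","variable"), allow.cartesian = TRUE)
# apply each filter row's own operator instead of a hard-coded ">"
output <- output[mapply(function(f, a, b) match.fun(f)(a, b), op, dat_val, c_val)]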
You then want to find every id whose number of satisfied filters equals the number of distinct filters that were satisfied by at least one row in its group:
output[,req_match := uniqueN(c_id), by = .(group)] # number of filters where a condition was met.
selection <- output[,.N,by = .(id, group, req_match)][N == req_match, id]
If a filter did not match any rows, it will be excluded here.
Then you can filter your initial dataset for the solution:
dat[id %in% selection]
id group x y z
1: 3 1 119.4697 217.8550 313.9384
2: 18 1 117.2930 216.5670 310.4617
3: 35 1 110.4283 218.6130 312.0904
4: 50 1 119.2519 214.2517 318.8567
When grouping by an expression involving a column (e.g. DT[...,.SD[c(1,.N)],by=expression(col)]), I want to keep the value of col in .SD.
For example, in the following I am grouping by the remainder of a divided by 3, and keeping the first and last observation in each group. However, a is no longer present in .SD:
f <- function(x) x %% 3
Q <- data.table(a = 1:20, x = rnorm(20), y = rnorm(20))
Q[, .SD[c(1., .N)], by = f(a)]
f x y
1: 1 0.2597929 1.0256259
2: 1 2.1106619 -1.4375193
3: 2 1.2862501 0.7918292
4: 2 0.6600591 -0.5827745
5: 0 1.3758503 1.3122561
6: 0 2.6501140 1.9394756
The desired output is as if I had done the following
Q[, f := f(a)]
tmp <- Q[, .SD[c(1, .N)], by=f]
Q[, f := NULL]
tmp[, f := NULL]
tmp
a x y
1: 1 0.2597929 1.0256259
2: 19 2.1106619 -1.4375193
3: 2 1.2862501 0.7918292
4: 20 0.6600591 -0.5827745
5: 3 1.3758503 1.3122561
6: 18 2.6501140 1.9394756
Is there a way to do this directly, without creating a new variable and creating a new intermediate data.table?
Instead of .SD, use .I to get the row indices, extract that column ($V1), and subset the original dataset:
library(data.table)
Q[Q[, .I[c(1., .N)], by = f(a)]$V1]
# a x y
#1: 1 0.7265238 0.5631753
#2: 19 1.7110611 -0.3141118
#3: 2 0.1643566 -0.4704501
#4: 20 0.5182394 -0.1309016
#5: 3 -0.6039137 0.1349981
#6: 18 0.3094155 -1.1892190
NOTE: The values in columns 'x' and 'y' differ from the question's because no seed was set with set.seed().
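For completeness, a small hedged sketch of the same .I approach with a seed set so the numbers are reproducible (the seed value is arbitrary and my own addition):
library(data.table)
set.seed(42)   # any seed; the question did not set one
f <- function(x) x %% 3
Q <- data.table(a = 1:20, x = rnorm(20), y = rnorm(20))
Q[Q[, .I[c(1, .N)], by = f(a)]$V1]   # same rows of 'a' as above, now reproducible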
I would like to subset rows of my data
library(data.table); set.seed(333); n <- 100
dat <- data.table(id=1:n, group=rep(1:2,each=n/2), x=runif(n,100,120), y=runif(n,200,220), z=runif(n,300,320))
> head(dat)
id group x y z
1: 1 1 109.3400 208.6732 308.7595
2: 2 1 101.6920 201.0989 310.1080
3: 3 1 119.4697 217.8550 313.9384
4: 4 1 111.4261 205.2945 317.3651
5: 5 1 100.4024 212.2826 305.1375
6: 6 1 114.4711 203.6988 319.4913
in several stages within each group. I need to automate this and it might happen that the subset is empty. For example, focusing only on group 1,
dat1 <- dat[1:50]
> s <- subset(dat1, x > 119)
> s
id group x y z
1: 3 1 119.4697 217.8550 313.9384
2: 50 1 119.2519 214.2517 318.8567
the second step subset(s, y > 219) would come up empty, but I would still want to apply the third step subset(s, z > 315). If I were to set the thresholds manually, Frank has provided an excellent solution here that outputs
> f(dat1, x>119, y>219, z>315)
cond skip
1: x > 119 FALSE
2: y > 219 TRUE
3: z > 315 FALSE
id group x y z
1: 50 1 119.2519 214.2517 318.8567
and reports which parts were skipped.
My problem is that I need to apply this to different groups simultaneously, where the thresholds for each group are given in a separate data.table. The goal is to have at least one id per group. For example, if my thresholds were
c <- data.table(group=1:2, x=c(119,119), y=c(219,219), z=c(315,319))
> c
group x y z
1: 1 119 219 315
2: 2 119 219 319
I would like to end up with
> res
id group x y z
1: 50 1 119.2519 214.2517 318.8567
2: 55 2 119.2634 219.0044 315.6556
I could apply Frank's function repeatedly within a for-loop but I am sure there are cleverer ways that save time. I wonder, for instance, whether the function can be applied to each group within data.table. Or perhaps there is a way within the tidyverse, which I am not really familiar with yet.
Another possible approach using standard evaluation:
#convert conditions into long format, storing operator in data.table as well
cond <- data.table(group=1:2, bop=c(`>`, `>`), x=c(119,119), y=c(219,219), z=c(315,319))
thres <- melt(cond, id.vars=c("group","bop"))
#convert data into long format and lookup filter and thresholds
mdat <- melt(dat, id.vars=c("id", "group"))[
  thres, on=.(group, variable), c("bop","thres") := mget(c("bop","i.value"))]
#apply filtering
ss <- mdat[mapply(function(f, x, y) f(x, y), bop, value, thres)]
#apply sequential subsetting
dat[id %in% ss[, {
  idx <- id
  ans <- .SD[, {
    x <- intersect(idx, id)
    if(length(x) > 0) {
      idx <- x
    }
    idx
  }, .(variable)]
  ans[variable==last(variable), V1]
}, .(group)]$V1
]
output:
id group x y z
1: 50 1 119.2519 214.2517 318.8567
2: 55 2 119.2634 219.0044 315.6556
3: 58 2 119.2211 214.0305 319.3097
4: 72 2 114.0802 217.7402 313.3655
5: 90 2 116.8115 215.1576 317.0261
6: 99 2 119.2964 212.9973 308.9360
data:
library(data.table)
set.seed(333)
n <- 100
dat <- data.table(id=1:n, group=rep(1:2,each=n/2),
                  x=runif(n,100,120), y=runif(n,200,220), z=runif(n,300,320))
This is related to this question. I have data like this:
x t
1: 1 1
2: 1 2
3: 1 3
4: 2 1
5: 2 2
6: 2 3
I'd like to flag the last observation in every group (and keep the other observations), defined by x, where the "last" observation is defined by t. I tried this:
dt[order(x, t), flag_last := 1, by = "x", mult = "last"]
but that returns
x t flag_last
1: 1 1 1
2: 1 2 1
3: 1 3 1
4: 2 1 1
5: 2 2 1
6: 2 3 1
The desired output is
x t flag_last
1: 1 1 0
2: 1 2 0
3: 1 3 1
4: 2 1 0
5: 2 2 0
6: 2 3 1
Am I going about this the wrong way?
A couple of caveats:
The actual dataset is roughly 61 GB and there are only a couple of observations per x group, so if possible I'd like to avoid creating another copy with the unique values or creating another copy with dplyr. If that's unavoidable, I'll make do.
Obviously this is simple data. The number of observations within each group is not necessarily the same, and the values for t differ too, so simply picking out t == 3 will not work.
Use the built-in .I like this:
DT[, is.end := .I == last(.I), by = "x"]
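A minimal self-contained variant that produces the 0/1 flag asked for in the question (assuming, as in the example, that rows are already ordered by t within each x; otherwise run setorder(dt, x, t) first):
library(data.table)
dt <- data.table(x = rep(1:2, each = 3), t = rep(1:3, 2))
# flag the last row of each x group
dt[, flag_last := as.integer(.I == last(.I)), by = "x"]
dt
#    x t flag_last
# 1: 1 1         0
# 2: 1 2         0
# 3: 1 3         1
# 4: 2 1         0
# 5: 2 2         0
# 6: 2 3         1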
dt[, flag_last := replace(vector(mode = "integer", length = .N), which.max(t), 1L), x]
# x t flag_last
# 1: 1 1 0
# 2: 1 2 0
# 3: 1 3 1
# 4: 2 1 0
# 5: 2 2 0
# 6: 2 3 1
One option is to use .N and which.max to check for equality between the row index and the row index at which t is maximized:
df[, flag := as.integer(1:.N == which.max(t)), x]
But benchmarking shows replace is faster on my machine for this dataset, and if you don't mind NAs instead of 0s, David Arenburg's suggested method using .I is fastest.
df <- data.table(x = rep(1:1e4, each = 1e4), t = sample(1e4*1e4))
library(microbenchmark)
microbenchmark(
replace = df[, flag_last := replace(vector(mode = "integer", length = .N), which.max(t), 1L), x],
use.N = df[, flag := as.integer(1:.N == which.max(t)), x],
use.max = df[, flag := as.integer(t==max(t)), x],
use.I = {
df[, flag := 0L]
df[df[, .I[which.max(t)], by = x]$V1, flag := 1L]
},
use.I.no0 = df[df[, .I[which.max(t)], by = x]$V1, flag := 1L],
times = 20)
# Unit: seconds
# expr min lq mean median uq max neval cld
# replace 1.228490 1.292348 1.442919 1.443021 1.578300 1.659990 20 b
# use.N 1.439939 1.522782 1.617104 1.574932 1.696046 1.923207 20 c
# use.max 1.405741 1.436817 1.596363 1.502337 1.663895 2.743942 20 c
# use.I 1.497599 1.547276 1.574657 1.564789 1.589066 1.686353 20 bc
# use.I.no0 1.080715 1.115329 1.162752 1.145145 1.182280 1.383989 20 a
This would do the trick, if you create an id variable that you can then use to merge the two datasets together:
library(dplyr)
x <- c(1,1,1,2,2,2)
t <- c(1,2,3,1,2,3)
id <- as.character(c(1,2,3,4,5,6))
data <- data.frame(x,t, id)
You create a sliced dataset with the max value of each group, and then you merge it back to the initial dataframe.
sliced <- data %>%
group_by(x) %>%
slice(which.max(t)) %>%
ungroup() %>%
select(id) %>%
mutate(max = "max_group")
tot <- left_join(data, sliced, c("id"))
The sliced df has only two variables, so it might not be too bad to work with. This is the easiest way that came to my mind.
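If the 0/1 flag from the question is wanted, one hedged follow-up on top of tot (the flag_last conversion is my addition, not part of this answer):
library(dplyr)
# turn the join marker into the 0/1 flag and drop the helper column
tot <- tot %>%
  mutate(flag_last = as.integer(!is.na(max))) %>%
  select(-max)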
I want to select specific row values (here TARGET) of a data.table (DT1) where the filter criteria are in another data.table (DT2).
It is not an exact filter: for a value in DT2, DT1 has a minimum and a maximum column that define a range, and a string column that must contain a specific pattern.
E.g.: A = 3 and C = "Mon" in DT2 match the line in DT1 with minA = 3, maxA = 6 and C = "Mon,Tue", because 3 lies between 3 and 6 and "Mon" is contained in "Mon,Tue".
DT1
INDEX1 minA maxA C TARGET
9 : 9 3 6 Mon,Tue 109
DT2
A C INDEX2
1: 3 Mon 1
I am looking for the line whose range contains the value and which, among all such matches, has the maximum TARGET value.
I have the following simplified example:
# version 1.9.6
library(data.table)
DT1 <- data.table(INDEX1 = 1:12,
                  minA = c(1,1,1,2,2,2,3,3,3,4,4,4),
                  maxA = c(4,5,6),
                  C = c("Mon,Tue", "Mon,Wed", "Tue,Thu", "Wed,Thu"),
                  TARGET = c(101:112))
size <- 2
DT2 <- data.table(A = rep(c(3,4), size),
                  C = rep(c("Mon", "Thu"), size),
                  INDEX2 = 1:(2*size))
which looks like this:
DT1
INDEX1 minA maxA C TARGET
1 : 1 1 4 Mon,Tue 101
2 : 2 1 5 Mon,Wed 102
3 : 3 1 6 Tue,Thu 103
4 : 4 2 4 Wed,Thu 104
5 : 5 2 5 Mon,Tue 105
6 : 6 2 6 Mon,Wed 106
7 : 7 3 4 Tue,Thu 107
8 : 8 3 5 Wed,Thu 108
9 : 9 3 6 Mon,Tue 109
10: 10 4 4 Mon,Wed 110
11: 11 4 5 Tue,Thu 111
12: 12 4 6 Wed,Thu 112
DT2
A C INDEX2
1: 3 Mon 1
2: 4 Thu 2
I included size just for scaling and testing.
My solution so far is the following:
I wrote a function foo() which takes the filter-input values and returns the matching index (or some other more useful variable) of DT1.
foo <- function(i.A, i.C){
  DT1[INDEX1 %in% grep(i.C,C) & minA <= i.A & maxA >= i.A,][TARGET == max(TARGET),]
}
I call this function for each row of DT2
DT2[, foo(i.A = A, i.C = C), by = INDEX2]
with the output:
INDEX2 INDEX1 minA maxA C TARGET
1: 1 9 3 6 Mon,Tue 109
2: 2 12 4 6 Wed,Thu 112
And here is my problem:
This works fine for small data.tables, but I have a lot more rows in DT2. The function then takes much longer, and I was wondering if there is a better / faster way for this kind of filtering?
Maybe it is possible to "upgrade" foo() so that it can process whole columns instead of a single row at a time?
If possible, I would like to avoid expanding my DT1 as done here:
R: select specific rows in data.table
and I think I have a more complex filter than in these questions:
How to filter cases in a data.table by multiple conditions defined in another data.table
R: efficiently select specified rows from a data.table according to another data.table?
Thanks in advance for any help.
New solution
I realised that going through each line of the bigger data.table takes too much time, so I built a new function foo_new which works the other way around:
foo_new <- function(data, i.A, i.C){
  data[C %in% i.C & A %between% i.A, INDEX2]
}
Instead of matching every row of DT2 against DT1 one at a time, I select all rows of DT2 which match the values of one row of DT1.
DT1 is ordered by decreasing TARGET because I need the row with the highest TARGET value. Also, once a row in DT2 has been selected, it is excluded from the next iteration.
The whole process is sped up a lot:
function user system elapsed
foo 61.511 0.327 62.052
foo_new 0.045 0.003 0.047
This is probably only the case when DT1 is smaller than DT2, which is my situation.
Here is my whole simulation code:
rm(list = ls())
library(data.table)
DT1 <- data.table(INDEX1 = 1:12,
                  minA = c(1,1,1,2,2,2,3,3,3,4,4,4),
                  maxA = c(4,5,6),
                  C = c("Mon,Tue", "Mon,Wed", "Tue,Thu", "Wed,Thu"),
                  TARGET = c(101:112))
size <- 20000
DT2 <- data.table(A = rep(c(3,4), size),
                  C = rep(c("Mon", "Thu"), size),
                  INDEX2 = 1:(2*size))
foo <- function(i.A, i.C){
  DT1[INDEX1 %in% grep(i.C, C) &
        minA <= i.A &
        maxA >= i.A,
      ][TARGET == max(TARGET),]
}
foo_new <- function(data, i.A, i.C){
  data[C %in% i.C & A %between% i.A, INDEX2]
}
# with foo
DT2[, foo(i.A = A, i.C = C), by = INDEX2]
# with foo_new
DT1.ordered <- copy(DT1[order(TARGET, decreasing = TRUE)])
tmp.index <- list()
DT2[, TARGET := as.numeric(NA)]
for (i in c(1:dim(DT1.ordered)[1])) {
  # i <- 1
  restdata <- copy(DT2[is.na(TARGET),])
  tmp.index <- foo_new(data = restdata,
                       i.A = unlist(DT1.ordered[i, list(minA, maxA)]),
                       i.C = DT1.ordered[i, strsplit(C, ",")[[1]]])
  DT2[INDEX2 %in% tmp.index, TARGET := DT1.ordered[i, TARGET]]
}
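As an aside, here is a hedged sketch of a fully vectorised alternative, assuming data.table >= 1.9.8 (non-equi joins and the x./i. column prefixes). This is not the author's solution, just one way to avoid the row-by-row loop:
# join each DT2 row to all DT1 rows whose [minA, maxA] range contains A
m <- DT1[DT2, on = .(minA <= A, maxA >= A), allow.cartesian = TRUE,
         .(INDEX2 = i.INDEX2, INDEX1 = x.INDEX1,
           C1 = x.C, C2 = i.C, TARGET = x.TARGET)]
# keep only matches where DT2's pattern (C2) occurs in DT1's string (C1)
m <- m[mapply(grepl, C2, C1)]
# per DT2 row, keep the match with the highest TARGET
m[, .SD[which.max(TARGET)], by = INDEX2]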
I'm trying to add columns to my data.table that essentially append a cumulative frequency table for each group that is aggregated. Unfortunately, my current solution is about ten times slower than I had hoped.
Here is what I'm using (apologies for the ugly one-liner):
DT[, c("bin1","bin2","bin3","bin4") := as.list(cumsum(hist(colx,c(lbound,bound1,bound2, bound3,ubound),plot=FALSE)$counts)), by=category]
If the bin boundaries are set at 0,25,50,75,100, I would like my table to look like:
id category colx bin1 bin2 bin3 bin4
1 a 5 1 2 2 3
2 a 30 1 2 2 3
3 b 21 1 2 3 4
4 c 62 0 1 3 3
5 b 36 1 2 3 4
6 a 92 1 2 2 3
7 c 60 0 1 3 3
8 b 79 1 2 3 4
9 b 54 1 2 3 4
10 c 27 0 1 3 3
In the actual dataset I'm grouping using 4 different columns and there are millions of rows and unique groups. When I try a simpler function, such as sum, it takes an acceptable amount of time to do the calculation. Is there any way to significantly speed up the counting process?
Okay, here's one way (here I use data.table v1.9.3). Remove the by=.EACHI if you're using versions <= 1.9.2.
dt[, ival := findInterval(colx, seq(0, 100, by=25), rightmost.closed=TRUE)]
setkey(dt, category, ival)
ans <- dt[CJ(unique(category), unique(ival)), .N, allow.cartesian=TRUE, by=.EACHI]
ans[, N := cumsum(N), by="category"][, bin := "bin"]
ans <- dcast.data.table(ans, category ~ bin+ival, value.var="N")
ans <- dt[ans][, ival := NULL]
id category colx bin_1 bin_2 bin_3 bin_4
1: 1 a 5 1 2 2 3
2: 2 a 30 1 2 2 3
3: 6 a 92 1 2 2 3
4: 3 b 21 1 2 3 4
5: 5 b 36 1 2 3 4
6: 9 b 54 1 2 3 4
7: 8 b 79 1 2 3 4
8: 10 c 27 0 1 3 3
9: 4 c 62 0 1 3 3
10: 7 c 60 0 1 3 3
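For intuition, the findInterval() call in the first line just maps each colx value to a bin index; a tiny standalone illustration using values from the example table above:
findInterval(c(5, 30, 21, 62), seq(0, 100, by=25), rightmost.closed=TRUE)
# [1] 1 2 1 3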
Benchmark on simulated large data:
Here I generate a data.table with 20 million rows and a total of 1 million groups, using 2 grouping columns (instead of 4 as you state in your question).
K = 1e3L
N = 20e6L
sim_data <- function(K, N) {
  set.seed(1L)
  ff <- function(K, N) sample(paste0("V", 1:K), N, TRUE)
  data.table(x=ff(K,N), y=ff(K,N), val=sample(1:100, N, TRUE))
}
dt <- sim_data(K, N)
method1 <- function(x) {
  dt[, ival := findInterval(val, seq(0, 100, by=25), rightmost.closed=TRUE)]
  setkey(dt, x, y, ival)
  ans <- dt[CJ(unique(x), unique(y), unique(ival)), .N, allow.cartesian=TRUE, by=.EACHI]
  ans[, N := cumsum(N), by="x,y"][, bin := "bin"]
  ans <- dcast.data.table(ans, x+y ~ bin+ival, value.var="N")
  ans <- dt[ans][, ival := NULL]
}
system.time(ans1 <- method1(dt))
# user system elapsed
# 13.148 2.778 16.209
I hope this is faster than your original solution and scales well for your real data dimensions.
Update: Here's another version using data.table's rolling joins instead of findInterval from base. We have to modify the intervals slightly so that the rolling join finds the right match.
dt <- sim_data(K, N)
method2 <- function(x) {
  ivals = seq(24L, 100L, by=25L)
  ivals[length(ivals)] = 100L
  setkey(dt, x,y,val)
  dt[, ival := seq_len(.N), by="x,y"]
  ans <- dt[CJ(unique(x), unique(y), ivals), roll=TRUE, mult="last"][is.na(ival), ival := 0L][, bin := "bin"]
  ans <- dcast.data.table(ans, x+y~bin+val, value.var="ival")
  dt[, ival := NULL]
  ans2 <- dt[ans]
}
system.time(ans2 <- method2(dt))
# user system elapsed
# 12.538 2.649 16.079
## check if both methods give identical results:
setkey(ans1, x,y,val)
setnames(ans2, copy(names(ans1)))
setkey(ans2, x,y,val)
identical(ans1, ans2) # [1] TRUE
Edit: Some explanation on why the OP's approach is very time consuming:
A huge reason, I suspect, for the difference in runtime between these solutions and hist is that both answers here are vectorised (written entirely in C and operating on the whole data set directly), whereas hist is an S3 generic: dispatch to the .default method takes time, and the method itself is written in R. So you are essentially executing hist, an R-level function, about a million times (once per group), whereas the two vectorised solutions call into C once, with no per-group call overhead.
And since that's the most complex part of your question, it obviously slows things down.
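A rough way to feel that per-call overhead (a hedged micro-illustration, not a like-for-like comparison: the two expressions do not compute the same result, the point is only ~1e4 R-level hist() calls versus one vectorised findInterval() pass):
library(microbenchmark)
set.seed(1L)
v <- sample(1:100, 1e5, TRUE)
g <- rep(1:1e4, each = 10)
microbenchmark(
  hist_per_group   = tapply(v, g, function(z)
    cumsum(hist(z, c(0, 25, 50, 75, 100), plot = FALSE)$counts)),
  findInterval_all = findInterval(v, seq(0, 100, by = 25), rightmost.closed = TRUE),
  times = 5
)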