For these data
library(data.table)
set.seed(42)
dat <- data.table(id=1:12, group=rep(1:3, each=4), x=rnorm(12))
> dat
id group x
1: 1 1 1.37095845
2: 2 1 -0.56469817
3: 3 1 0.36312841
4: 4 1 0.63286260
5: 5 2 0.40426832
6: 6 2 -0.10612452
7: 7 2 1.51152200
8: 8 2 -0.09465904
9: 9 3 2.01842371
10: 10 3 -0.06271410
11: 11 3 1.30486965
12: 12 3 2.28664539
My goal is to get, from each group, the first id for which x is larger than some threshold, say x>1.5.
> dat[x>1.5, .SD[1], by=group]
group id x
1: 2 7 1.511522
2: 3 9 2.018424
is indeed correct, but I am unhappy about the fact that it silently yields no result for group 1. Instead, for each group in which no row fulfills the condition, I would like it to yield that group's last id. I see that I could achieve this in two steps
> tmp <- dat[x>1.5, .SD[1], by=group]
> rbind(tmp,dat[!group%in%tmp$group,.SD[.N], by=group])
group id x
1: 2 7 1.5115220
2: 3 9 2.0184237
3: 1 4 0.6328626
but I am sure I am not making full use of the data.table capabilities here, which must permit a more elegant solution.
Using data.table, we can check for a condition and subset rows by group.
library(data.table)
dat[dat[, if(any(x>1.5)) .I[which.max(x > 1.5)] else .I[.N], by=group]$V1]
# id group x
#1: 4 1 0.6328626
#2: 7 2 1.5115220
#3: 9 3 2.0184237
The dplyr translation of that would be
library(dplyr)
dat %>%
group_by(group) %>%
slice(if(any(x > 1.5)) which.max(x > 1.5) else n())
Or more efficiently
dat[, .SD[{temp = x > 1.5; if (any(temp)) which.max(temp) else .N}], by = group]
Thanks to @IceCreamTouCan, @sindri_baldur and @jangorecki for their valuable suggestions to improve this answer.
Another option is:
dat[x>1.5 | group!=shift(group, -1L), .SD[1L], .(group)]
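To see why this works, look at the intermediate subset (this assumes, as here, that dat is ordered by group): the condition keeps every row with x > 1.5 plus every row whose group differs from the next row's group, i.e. the last row of each group, and .SD[1L] then takes the first of those per group.
dat[x > 1.5 | group != shift(group, -1L)]
#    id group           x
# 1:  4     1  0.63286260   # last row of group 1, kept by the shift condition
# 2:  7     2  1.51152200
# 3:  8     2 -0.09465904   # last row of group 2
# 4:  9     3  2.01842371
# 5: 12     3  2.28664539   # last row of group 3; shift() is NA there, so it is
#                           # kept here only because x > 1.5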
You could subset both ways (which are optimized by GForce) and then combine them:
D1 = dat[x>1.5, lapply(.SD, first), by=group]
D2 = dat[, lapply(.SD, last), by=group]
rbind(D1, D2[!D1, on=.(group)])
group id x
1: 2 7 1.5115220
2: 3 9 2.0184237
3: 1 4 0.6328626
There is some inefficiency here since we are grouping by group three times. I'm not sure if that will be outweighed by more efficient calculations in j thanks to GForce or not. @jangorecki points out that the inefficiency of grouping three times might be mitigated by setting the key first.
Comment: I used lapply(.SD, last) since .SD[.N] is not yet optimized and last(.SD) throws an error. I changed the OP's code to lapply(.SD, first) for the sake of symmetry.
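A minimal sketch of the key-setting suggestion (my reading of it, not benchmarked here): set the key on group once up front so the three grouped operations run on already-sorted data.
setkey(dat, group)
D1 = dat[x > 1.5, lapply(.SD, first), by = group]
D2 = dat[, lapply(.SD, last), by = group]
rbind(D1, D2[!D1, on = .(group)])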
I have two data.tables: an experiment data table x and a category lookup table dict.
library(data.table)
set.seed(123)
x = data.table(samp=c(1,1,2,3,3,3,4,5,5,5,6,7,7,7,8,9,9,10,10), y=rnorm(19))
x
# samp y
#1: 1 -0.56047565
#2: 1 -0.23017749
#3: 2 1.55870831
#4: 3 0.07050839
#5: 3 0.12928774
#6: 3 1.71506499
#7: 4 0.46091621
#8: 5 -1.26506123
#9: 5 -0.68685285
#10: 5 -0.44566197
#11: 6 1.22408180
#12: 7 0.35981383
#13: 7 0.40077145
#14: 7 0.11068272
#15: 8 -0.55584113
#16: 9 1.78691314
#17: 9 0.49785048
#18: 10 -1.96661716
#19: 10 0.70135590
dict = data.table(samp=c(1:5, 4:8, 7:10), cat=c(rep(1,length(1:5)), rep(2,length(4:8)), rep(3,length(7:10))))
dict
# samp cat
# 1: 1 1
# 2: 2 1
# 3: 3 1
# 4: 4 1
# 5: 5 1
# 6: 4 2
# 7: 5 2
# 8: 6 2
# 9: 7 2
# 10: 8 2
# 11: 7 3
# 12: 8 3
# 13: 9 3
# 14: 10 3
For each samp, I first need to compute the product of all the y values associated with it. I then need to sum these products within each sample category specified in dict$cat. Note that a samp can map to more than one dict$cat.
One way of doing this is to merge x and dict right away, allowing row duplication (allow.cartesian=T):
setkey(dict, samp)
setkey(x, samp)
step0 = dict[x, allow.cartesian=T]
setkey(step0, samp, cat)
step1 = step0[, list(prodY=prod(y)), by=c("samp", "cat")]
resMet1 = step1[, sum(prodY), by="cat"]
I wonder, however, whether this joining step can be avoided. There are a few reasons for this: for example, if x is enormous, the duplication will use extra memory (am I right?). Also, these summary tables with duplicated rows are quite confusing, making the analysis more error-prone.
So instead I was thinking of using samples in each dict$cat for a binary search in x. I know how to do it for a single category, so an ugly way of doing it for all of them would be with a loop:
setkey(x, samp)
setkey(dict,samp)
pool = vector("list")
for(n in unique(dict$cat)){
thisCat = x[J(dict[cat==n])]
setkey(thisCat, samp)
step1 = thisCat[, list(prodY=prod(y)[1], cat=cat[1]), by="samp"]
pool[[n]] = step1[, sum(prodY), by="cat"]
}
resMet2 = rbindlist(pool)
But of course such loops are to be avoided. So I'm wondering if there's any way to somehow get data.table to iterate over the key values inside of J()?
IIUC, I'd formulate your question as follows: for each dict$cat, I'd like to get prod(y) corresponding to each sample for that cat, and then sum them all up.
Let's construct this step by step now:
For each dict$cat - sounds like you need to group by cat:
dict[, ,by=cat]
All that's left is to fill up j properly.
You need to get prod(y) from x for each sample in this group:
x[samp %in% .SD$samp, prod(y), by=samp]
extracts those rows from x corresponding to this group's samp (using .SD which stands for subset of data) and computes prod(y) on them, grouped by samp. Great!
We still need to sum them.
sum(x[samp %in% .SD$samp, prod(y), by=samp]$V1)
We now have the complete j expression. Let's plug it all in:
dict[, sum(x[samp %in% .SD$samp, prod(y), by=samp]$V1), by=cat]
# cat V1
# 1: 1 1.7770272
# 2: 2 0.7578771
# 3: 3 -1.0295633
Hope this helps.
Note 1: There's some redundant computation of prod(y) here, but the upside is that we don't materialise much intermediate data, so it's memory efficient. If you have too many groups this might get slower, though, and you might want to compute prod(y) up front in a separate table, like so:
x_p = x[, .(p = prod(y)), by=samp]
With this, we can simplify j as follows:
dict[, x_p[samp %in% .SD$samp, sum(p)], by=cat]
Note 2: The %in% expression creates an auto index on x's samp column on the first run, so that binary-search-based subsetting is used from then on. Therefore there's no need to worry about performance due to vector scans.
You might as well collapse x to the samp level first.
xprod = x[, .(py = prod(y)), by=samp]
Merge
res2 <- xprod[dict, on = "samp"][, sum(py), by=cat]
identical(res2, resMet2) # test passed
Or subset
If samp is the row number in xprod (as here), you can subset instead of merging:
res3 <- xprod[(dict$samp), sum(py), by=.(cat=dict$cat)]
identical(res3, resMet2) # test passed
It's very simple to relabel sample IDs so that this is true.
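For example, one way to relabel (a sketch; the same mapping must be applied to both tables so that samp equals the row number in xprod):
ids = sort(unique(x$samp))
x[,    samp := match(samp, ids)]
dict[, samp := match(samp, ids)]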
I'm trying to add columns to my data.table that essentially append a cumulative frequency table for each group that is aggregated. Unfortunately, my current solution is about ten times slower than I had hoped.
Here is what I'm using (apologies for the ugly one-liner):
DT[, c("bin1","bin2","bin3","bin4") := as.list(cumsum(hist(colx,c(lbound,bound1,bound2, bound3,ubound),plot=FALSE)$counts)), by=category]
If the bin boundaries are set at 0,25,50,75,100, I would like my table to look like:
id category colx bin1 bin2 bin3 bin4
1 a 5 1 2 2 3
2 a 30 1 2 2 3
3 b 21 1 2 3 4
4 c 62 0 1 3 3
5 b 36 1 2 3 4
6 a 92 1 2 2 3
7 c 60 0 1 3 3
8 b 79 1 2 3 4
9 b 54 1 2 3 4
10 c 27 0 1 3 3
In the actual dataset I'm grouping using 4 different columns and there are millions of rows and unique groups. When I try a simpler function, such as sum, it takes an acceptable amount of time to do the calculation. Is there any way to significantly speed up the counting process?
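For reproducibility, here is a small data.table matching the example table above (reconstructed from the desired output, with the bin boundaries 0, 25, 50, 75, 100 given in the question):
library(data.table)
dt <- data.table(
  id       = 1:10,
  category = c("a", "a", "b", "c", "b", "a", "c", "b", "b", "c"),
  colx     = c(5, 30, 21, 62, 36, 92, 60, 79, 54, 27)
)
lbound <- 0; bound1 <- 25; bound2 <- 50; bound3 <- 75; ubound <- 100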
Okay, here's one way (here I use data.table v1.9.3). Remove the by=.EACHI if you're using versions <= 1.9.2.
dt[, ival := findInterval(colx, seq(0, 100, by=25), rightmost.closed=TRUE)]
setkey(dt, category, ival)
ans <- dt[CJ(unique(category), unique(ival)), .N, allow.cartesian=TRUE, by=.EACHI]
ans[, N := cumsum(N), by="category"][, bin := "bin"]
ans <- dcast.data.table(ans, category ~ bin+ival, value.var="N")
ans <- dt[ans][, ival := NULL]
id category colx bin_1 bin_2 bin_3 bin_4
1: 1 a 5 1 2 2 3
2: 2 a 30 1 2 2 3
3: 6 a 92 1 2 2 3
4: 3 b 21 1 2 3 4
5: 5 b 36 1 2 3 4
6: 9 b 54 1 2 3 4
7: 8 b 79 1 2 3 4
8: 10 c 27 0 1 3 3
9: 4 c 62 0 1 3 3
10: 7 c 60 0 1 3 3
Benchmark on simulated large data:
Here I generate a data.table with 20 million rows and a total of 1 million groups, using 2 grouping columns (instead of 4, as you state in your question).
K = 1e3L
N = 20e6L
sim_data <- function(K, N) {
set.seed(1L)
ff <- function(K, N) sample(paste0("V", 1:K), N, TRUE)
data.table(x=ff(K,N), y=ff(K,N), val=sample(1:100, N, TRUE))
}
dt <- sim_data(K, N)
method1 <- function(x) {
dt[, ival := findInterval(val, seq(0, 100, by=25), rightmost.closed=TRUE)]
setkey(dt, x, y, ival)
ans <- dt[CJ(unique(x), unique(y), unique(ival)), .N, allow.cartesian=TRUE, by=.EACHI]
ans[, N := cumsum(N), by="x,y"][, bin := "bin"]
ans <- dcast.data.table(ans, x+y ~ bin+ival, value.var="N")
ans <- dt[ans][, ival := NULL]
}
system.time(ans1 <- method1(dt))
# user system elapsed
# 13.148 2.778 16.209
I hope this is faster than your original solution and scales well for your real data dimensions.
Update: Here's another version using data.table's rolling joins instead of findInterval from base. We have to modify the intervals slightly so that the rolling join finds the right match.
dt <- sim_data(K, N)
method2 <- function(x) {
ivals = seq(24L, 100L, by=25L)
ivals[length(ivals)] = 100L
setkey(dt, x,y,val)
dt[, ival := seq_len(.N), by="x,y"]
ans <- dt[CJ(unique(x), unique(y), ivals), roll=TRUE, mult="last"][is.na(ival), ival := 0L][, bin := "bin"]
ans <- dcast.data.table(ans, x+y~bin+val, value.var="ival")
dt[, ival := NULL]
ans2 <- dt[ans]
}
system.time(ans2 <- method2(dt))
# user system elapsed
# 12.538 2.649 16.079
## check if both methods give identical results:
setkey(ans1, x,y,val)
setnames(ans2, copy(names(ans1)))
setkey(ans2, x,y,val)
identical(ans1, ans2) # [1] TRUE
Edit: Some explanation of why the OP's approach is so time consuming:
A huge reason, I suspect, for the difference in runtime between these solutions and hist is that both answers here are vectorised (written entirely in C and operating on the whole data set directly), whereas hist is an S3 method (which takes time to dispatch to the .default method) and, on top of that, is written in R. So you're essentially executing hist, an R function, about a million times, whereas the two vectorised solutions call into C once (with no need to call anything for every group).
And since that's the most complex part of your question, it obviously slows things down.
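A rough sketch of the effect (the group count and sizes below are made up, not the OP's data): calling hist() once per group pays R-level overhead for every single group, while the findInterval() route is one vectorised call followed by grouped counting in C.
library(data.table)
d    <- data.table(g = rep(1:1e4, each = 10), v = runif(1e5, 0, 100))
brks <- seq(0, 100, by = 25)
# ~1e4 calls into hist(), an R-level S3 generic:
system.time(d[, as.list(cumsum(hist(v, brks, plot = FALSE)$counts)), by = g])
# one vectorised findInterval() call, then counting by group in C:
system.time(d[, .N, keyby = .(g, bin = findInterval(v, brks, rightmost.closed = TRUE))])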
I need to put a number on the first (or a random) item in each group.
I do the following:
item<-sample(c("a","b", "c"), 30,replace=T)
week<-rep(c("1","2","3"),10)
volume<-c(1:30)
DT<-data.table(item, week,volume)
setkeyv(DT, c("item", "week"))
sampleDT <- DT[,.SD[1], by= list(item,week)]
item week volume newCol
1: a 1 1 5
2: a 2 14 5
3: a 3 6 5
4: b 1 13 5
5: b 2 2 5
6: b 3 9 5
7: c 1 7 5
8: c 2 5 5
9: c 3 3 5
DT[DT[,.SD[1], by= list(item,week)], newCol:=5]
The sampleDT comes out correct, but the last line puts 5 on all rows instead of only the intended ones.
What am I doing wrong?
I think you want to do this instead:
DT[DT[, .I[1], by = list(item, week)]$V1, newCol := 5]
Your version doesn't work because the join that you have results in the full data.table.
Also there is a pending FR to make the syntax simpler:
# won't work now, but maybe in the future
DT[, newCol[1] := 5, by = list(item, week)]
The problem with your command is that it is finding rows in the original data.table that have combinations of the keys [item, week] that you found in sampleDT. Since sampleDT includes all combinations of [item, week], you get the whole data.table back.
A simpler solution (I think) would be using !duplicated() to retrieve the first instance of each [item, week] combination:
DT[!duplicated(DT, by = c("item", "week")), newCol := 5]
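A quick sanity check against the DT built above (newCol is reset first so the flagged rows are easy to count):
DT[, newCol := NA_integer_]
DT[DT[, .I[1], by = list(item, week)]$V1, newCol := 5L]
DT[!is.na(newCol), .N]   # 9 -- one flagged row per (item, week) combination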
I'd like to create a variable in dt according to a lookup table k. I'm getting some unexpected results depending on how I extract the variable of interest in k.
dt <- data.table(x=c(1:10))
setkey(dt, x)
k <- data.table(x=c(1:5,10), b=c(letters[1:5], "d"))
setkey(k, x)
dt[,b:=k[.BY, list(b)],by=x]
dt #unexpected results
# x b
# 1: 1 1
# 2: 2 2
# 3: 3 3
# 4: 4 4
# 5: 5 5
# 6: 6 6
# 7: 7 7
# 8: 8 8
# 9: 9 9
# 10: 10 10
dt <- data.table(x=c(1:10))
setkey(dt, x)
dt[,b:=k[.BY]$b,by=x]
dt #expected results
# x b
# 1: 1 a
# 2: 2 b
# 3: 3 c
# 4: 4 d
# 5: 5 e
# 6: 6 NA
# 7: 7 NA
# 8: 8 NA
# 9: 9 NA
# 10: 10 d
Can anyone explain why this is happening?
You don't have to use by=. here at all.
First solution:
Set appropriate keys and use X[Y] syntax from data.table:
require(data.table)
dt <- data.table(x=c(1:10))
setkey(dt, "x")
k <- data.table(x=c(1:5,10), b=c(letters[1:5], "d"))
setkey(k, "x")
k[dt]
# x b
# 1: 1 a
# 2: 2 b
# 3: 3 c
# 4: 4 d
# 5: 5 e
# 6: 6 NA
# 7: 7 NA
# 8: 8 NA
# 9: 9 NA
# 10: 10 d
The OP said that this creates a new data.table, which is undesirable for him.
Second solution
Again, without by:
dt <- data.table(x=c(1:10))
setkey(dt, "x")
k <- data.table(x=c(1:5,10), b=c(letters[1:5], "d"))
setkey(k, "x")
# solution
dt[k, b := i.b]
This does not create a new data.table and gives the solution you're expecting.
To explain why the unexpected result happens:
For the first case, you do dt[, b := k[.BY, list(b)], by=x]. Here, k[.BY, list(b)] itself returns a data.table. For example:
k[list(x=1), list(b)]
# x b
# 1: 1 a
So, basically, if you would do:
k[list(x=dt$x), list(b)]
That would give you the desired solution as well. To answer why you get what you get when you do b := k[.BY, list(b)]: since the RHS returns a data.table and you're assigning it to a single column, only its first element (here, the x column) is used and the rest is dropped. For example, do this:
dt[, c := dt[1], by=x]
# you'll get the whole column to be 1
For the second case, to understand why it works, you'll have to know the subtle difference between accessing a data.table as k[6] and as k[list(6)]. For example:
In the first case, k[6], you are accessing the 6th row of k, which is 10 d. But in the second case, you're asking for a join (J). So it searches for x = 6 in the key column, and since there isn't one in k, it returns 6 NA. In your case, since k[.BY] passes a list, it is a join operation, which fetches the right value.
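To make that concrete with the k defined above:
k[6]         # positional indexing: the 6th row of k
#     x b
# 1: 10 d
k[list(6)]   # a join on the key column x; there is no x == 6 in k, so b is NA
#    x    b
# 1: 6 <NA>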
I hope this helps.
I have a dataset
dtf<-data.frame(id=c("A","A","A","A","B","B","B","B"), value=c(2,4,6,8,4,6,8,10))
For every id the values are sorted in ascending order.
I want to reduce dtf so that it contains only one row per id: the first row whose value exceeds a specified limit.
For this example, with a limit of 5, dtf should reduce to:
A 6
B 6
Is there a nice way to do this?
Thanks a lot
It could be done with aggregate:
dtf<-data.frame(id=c("A","A","A","A","B","B","B","B"), value=c(2,4,6,8,4,6,8,10))
limit <- 5
aggregate(value ~ id, dtf, function(x) x[x > limit][1])
The result:
id value
1 A 6
2 B 6
Update: A solution for multiple columns:
An example data frame, dtf2:
dtf2 <- data.frame(id=c("A","A","A","A","B","B","B","B"),
value=c(2,4,6,8,4,6,8,10),
col3 = letters[1:8],
col4 = 1:8)
A solution including ave:
with(dtf2, dtf2[ave(value, id, FUN = function(x) cumsum(x > limit)) == 1, ])
The result:
id value col3 col4
3 A 6 c 3
6 B 6 f 6
Here is a "nice" option using data.table:
library(data.table)
DT <- data.table(dtf, key = "id")
DT[value > 5, head(.SD, 1), by = key(DT)]
# id value
# 1: A 6
# 2: B 6
And, in the spirit of sharing, an option using sqldf which might be nice depending on whether you feel more comfortable with SQL.
sqldf("select id, min(value) as value from dtf where value > 5 group by id")
# id value
# 1 A 6
# 2 B 6
Update: Unordered source data, and a data.frame with multiple columns
Based on your comments on some of the answers, it seems that your "value" column might not be ordered as it is in your example, and that there are other columns present in your data.frame.
Here are two alternatives for those scenarios, one with data.table, which I find easiest to read and is most likely the fastest, and one with a typical "split-apply-combine" approach that is commonly needed for such tasks.
First, some sample data:
dtf2 <- data.frame(id = c("A","A","A","A","B","B","B","B"),
value = c(6,4,2,8,4,10,8,6),
col3 = letters[1:8],
col4 = 1:8)
dtf2 # Notice that the value column is not ordered
# id value col3 col4
# 1 A 6 a 1
# 2 A 4 b 2
# 3 A 2 c 3
# 4 A 8 d 4
# 5 B 4 e 5
# 6 B 10 f 6
# 7 B 8 g 7
# 8 B 6 h 8
Second, the data.table approach:
library(data.table)
DT <- data.table(dtf2)
DT # Verify that the data are not ordered
# id value col3 col4
# 1: A 6 a 1
# 2: A 4 b 2
# 3: A 2 c 3
# 4: A 8 d 4
# 5: B 4 e 5
# 6: B 10 f 6
# 7: B 8 g 7
# 8: B 6 h 8
DT[order(value)][value > 5, head(.SD, 1), by = "id"]
# id value col3 col4
# 1: A 6 a 1
# 2: B 6 h 8
Third, base R's common "split-apply-combine" approach:
do.call(rbind,
lapply(split(dtf2, dtf2$id),
function(x) x[x$value > 5, ][which.min(x$value[x$value > 5]), ]))
# id value col3 col4
# A A 6 a 1
# B B 6 h 8
Another approach with aggregate:
> aggregate(value~id, dtf[dtf[,'value'] > 5,], min)
id value
1 A 6
2 B 6
This does depend on the values being sorted in ascending order, since min then returns the first value that exceeds the limit.
Might as well: an alternative with plyr and head:
library(plyr)
dtf<-data.frame(id=c("A","A","A","A","B","B","B","B"), value=c(2,4,6,8,4,6,8,10))
limit <- 5
result <- ddply(dtf, "id", function(x) head(x[x$value > limit ,],1) )
> result
id value
1 A 6
2 B 6
This depends on your data.frame being sorted:
threshold <- 5
foo <- dtf[dtf$value>=threshold,]
foo[c(1, which(diff(as.numeric(as.factor(foo$id))) > 0) + 1), ]
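To see what that last line is doing (using dtf and threshold as above), the pieces break down as follows:
foo <- dtf[dtf$value >= threshold, ]
as.numeric(as.factor(foo$id))                        # 1 1 2 2 2  (ids as integer codes)
diff(as.numeric(as.factor(foo$id)))                  # 0 1 0 0    (1 where the id changes)
which(diff(as.numeric(as.factor(foo$id))) > 0) + 1   # 3          (first row of each new id)
# Prepending 1 also picks up the first row of the first id, giving the rows A 6 and B 6.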