R data.table: how to refer to the current object in a chain?

In R's data.table, one can chain multiple operations by appending square brackets, each of which can use non-standard evaluation (e.g. of column names) on whatever the current table in the chain is - for example:
dt[,
.(agg1=mean(var1), agg2=mean(var2)),
by=.(col1, col2)
][,
.(agg2=ceiling(agg2), agg3=agg2^2)
]
Suppose I want to perform an operation that involves computing some function taking a data.frame as input, and I want to place it in a data.table chain so that it receives the current object in the chain, while using the results in e.g. a by clause. In magrittr/dplyr chains, I can use a . to refer to the object that is passed to a pipe and do arbitrary operations with it, but data.table chains work quite differently.
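For reference, the dplyr/magrittr pattern alluded to here would look like the sketch below, where . refers to the whole table entering the pipe stage (dplyr is assumed to be available; the data is the same percentage-of-total example used just below):

```r
library(data.table)
library(dplyr)

dt <- data.table(col1 = c(1, 1, 1, 2, 2, 3), col2 = 10:15)

# `.` is the whole object entering this pipe stage, so the grand
# total sum(.$col2) stays available inside the grouped summary
dt %>%
  group_by(col1) %>%
  summarise(agg1 = sum(col2) / sum(.$col2))
#> agg1: 0.44, 0.36, 0.20
```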
For example, suppose I have this table and I want to compute a percentage of total across groups:
library(data.table)
dt = data.table(col1 = c(1,1,1,2,2,3), col2=10:15)
dt[, .(agg1 = sum(col2)/sum(dt$col2)), by=.(col1)]
col1 agg1
1: 1 0.44
2: 2 0.36
3: 3 0.20
In this snippet, I referred to the full object name inside [ to compute the column total as sum(dt$col2), which bypasses the by grouping. But if this were part of a chain where these columns were produced by earlier operations, I couldn't simply use dt like that, since dt is an external variable (that is, it's evaluated in an environment outside the [ scope).
How can I refer to the current object/table inside a chain? To be clear, I am not asking about how to do this specific grouping operation, but about a general way of accessing the current object so as to use it in arbitrary functions.

We could use .SDcols or directly .SD and select the column as a vector with [[
dt[, .(agg1 = sum(col2)/sum(dt$col2)), by=.(col1)][,
.(agg1 = ceiling(.SD[["agg1"]]))]

Interesting question!
If I understand correctly, the OP wants to create complex chains of data.table expressions. One sample use case is to compute a percentage of total across groups in a chain like in
dt[, .(agg1 = sum(col2) / sum(dt$col2)), by = col1]
col1 agg1
1: 1 0.44
2: 2 0.36
3: 3 0.20
Here col2 appears in two places. First, to compute the group totals, and second to compute the grand total.
data.table has a special symbol .SD which stands for Subset, Selfsame, or Self-reference of the Data (please, see the vignette Using .SD for Data Analysis for details and examples).
When used in grouping (by = ), .SD refers to the subset of the data within each group and thus cannot be used to refer to the whole (ungrouped) dataset. So,
data.table(col1 = c(1, 1, 1, 2, 2, 3), col2 = 10:15)[
, .(agg1 = sum(col2) / sum(.SD$col2)), by = col1]
returns the trivial result 1 for each group.
Chaining within a chain
However, I believe I have found a way to accomplish OP's goal by chaining within a chain:
data.table(col1 = c(1, 1, 1, 2, 2, 3), col2 = 10:15)[
, .SD[, .(agg1 = sum(col2)), by = col1][, .(col1, agg1 = agg1 / sum(col2))]]
col1 agg1
1: 1 0.44
2: 2 0.36
3: 3 0.20
Here,
data.table(col1 = c(1, 1, 1, 2, 2, 3), col2 = 10:15)[, j]
is a data.table chain with a data.table created on the fly, i.e., without assigning it to a variable name, and with the j expression being another data.table chain:
.SD[, .(agg1 = sum(col2)), by = col1][, .(col1, agg1 = agg1 / sum(col2))]
The inner chain consists of three parts:
.SD refers to the whole dataset.
The next expression computes the aggregates by group, returning a data.table with two columns, col1 and agg1.
The last expression divides agg1 by the grand total sum(col2). As col2 is not included in the result of step 2, it is taken from the whole dataset.
However, if you do not want to rely on scoping rules, we can achieve the same result by explicitly computing the grand total beforehand:
data.table(col1 = c(1, 1, 1, 2, 2, 3), col2 = 10:15)[
, {tmp <- sum(col2); .SD[, .(agg1 = sum(col2) / tmp), by = col1]}]
Here, we use the fact that in data.table j can be an arbitrary expression (enclosed in curly braces).

Related

In base or data.table for R, use a function, evaluated on a column, to select rows?

Given a data.table DT with a column Col1, select the rows of DT where the values x in Col1 satisfy some boolean expression, for example f(x) == TRUE or f(x) <= 4, and then do more data.table operations.
For example, I tried something like
DT[f(Col1) == TRUE, Col2 := 2]
which does not work because f() acts on single values, not vectors. Using lapply() seems to work, but it takes a long time to run with a very large DT.
A workaround would be to create a column and use it to select the rows:
DT[, fvalues := f(Col1)][fvalues == TRUE, Col2 := 2]
but it would be better not to increase the size of DT.
EDIT: Here is an example.
map1<-data.table(k1=c("A","B","C"), v=c(-1,2,3))
map2<-data.table(k2=c("A","B","A","A","C","B"), k3=c("A","B","C","B","C","B"))
f <- function(x) map1[k1 == x, v]
To find the rows in map2 using the corresponding values in map1, these do not work (returning an error about a length mismatch around ==):
map2[f(k2) == 2, flag1 := TRUE]
map2[f(k2) + f(k3) == 2, flag2 := TRUE]
but using lapply() the first one works, though it is somehow slower (for a large data table) than adding a column to map2 with the values of f and selecting based on that new column:
map2[lapply(k2,f) == 2, flag1 := TRUE]
and the second one
map2[lapply(k2,f) + lapply(k3,f) == 2, flag2 := TRUE]
returns an error (non-numeric argument).
The question would be how to do this most efficiently, particularly without making the data table larger.
I think the problem is the parts that aim to modify the data.table in place (i.e. your := parts). I don't think you can filter in place, since filtering inherently requires writing the result to a new memory location.
This works for filtering, whilst creating a new object:
library(data.table)
f <- function(x) x > 0.5
DT <- data.table(Col1 = runif(10))
DT[f(Col1),]
#> Col1
#> 1: 0.7916055
#> 2: 0.5391773
#> 3: 0.6855657
#> 4: 0.5250881
#> 5: 0.9089948
#> 6: 0.6639571
To do more data.table operations on a filtered table, assign to a new object and work with that one:
DT2 <- DT[f(Col1),]
DT2[, Col2 := 2]
Perhaps I've misunderstood your problem though - what function are you using? Could you post more code so we can replicate your problem more precisely?
If f is a function working only on scalar values, you could Vectorize it:
DT[Vectorize(f)(Col1)]
Not sure this fully answers your question, because Vectorize just wraps mapply, so it is still an R-level loop under the hood.
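Applied to the example data from the question, both expressions then run without the length-mismatch and non-numeric errors (a sketch):

```r
library(data.table)

map1 <- data.table(k1 = c("A", "B", "C"), v = c(-1, 2, 3))
map2 <- data.table(k2 = c("A", "B", "A", "A", "C", "B"),
                   k3 = c("A", "B", "C", "B", "C", "B"))
f <- function(x) map1[k1 == x, v]

# Vectorize(f) applies f element-wise, returning a vector,
# so the comparisons are now row-wise logicals
map2[Vectorize(f)(k2) == 2, flag1 := TRUE]                    # rows 2 and 6
map2[Vectorize(f)(k2) + Vectorize(f)(k3) == 2, flag2 := TRUE] # row 3
```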

Does unique.data.table with by behave like dplyr::distinct with .keep_all = TRUE?

I want to get unique rows in a data frame based on one variable, while still choosing which rows (based on other variables) are included.
Example:
dt <- as.data.table(list(group = c("A", "A", "B", "B", "C", "C"), number = c(1, 2, 1, 2, 2, 1)))
I would normally do this, as it allows me to always keep the row where number == 1.
dt %>%
arrange(group, number) %>%
distinct(group, .keep_all = TRUE)
This is now too slow, and I'm hoping the data.table equivalent will be faster.
This seems to work:
dt <- dt[order(group, number)]
unique(dt, by = c("group"))
But I couldn't find anything in the unique.data.table documentation which says that the first row per group is the one which is kept. Is it safe to assume it is?
According to the documentation:
unique returns a data.table with duplicated rows removed, by columns specified in by argument. When no by then duplicated rows by all columns are removed.
From this we can reason that it returns the first row of each unique group.
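unique.data.table also has a fromLast argument (default FALSE), meaning the first duplicate encountered is the one kept. A quick check with the question's data:

```r
library(data.table)

dt <- data.table(group = c("A", "A", "B", "B", "C", "C"),
                 number = c(1, 2, 1, 2, 2, 1))

setorder(dt, group, number)   # row to keep comes first within each group
unique(dt, by = "group")      # keeps the first row per group (fromLast = FALSE)
#> group A, B, C with number 1, 1, 1
```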
To complement the options provided by @Ian, here is another one, which will probably be the fastest:
setkeyv(dt, c("group","number"))
unique(dt, by="group")
At least as of now, because improvements are possible: an example of reducing the time from 3.544s to 0.075s (it requires an index rather than a key) can be found in "unique can be optimized on keyed data.tables" (data.table#2947).
How about subsetting .SD in j?
library(data.table)
dt[order(group,number),.SD[1],by=group]
# group number
#1: A 1
#2: B 1
#3: C 1
You might also find using .I faster because it avoids assembling .SD:
In this version, we first assemble the row indices using the .I special symbol, keep those where number == 1, and take the first one ([1]) per group. We then extract the indices with $V1 and subset the original dt by them.
dt[,.I[number == 1][1], by=group]
group V1
1: A 1
2: B 3
3: C 6
dt[dt[,.I[number == 1][1], by=group]$V1]
group number
1: A 1
2: B 1
3: C 1
Edit:
As #IceCreamToucan points out in the comments, another, easier to read option is with head.data.table:
dt[order(group,number), head(.SD, 1), by=group]
group number
1: A 1
2: B 1
3: C 1

Update data.table by reference but populate only certain rows when duplicates are present using a prioritized vector

I didn't quite know how to word the title, but here is what I'm trying to do. I'd like to grow the data.table dt1 using columns from dt2. In dt1, there are duplicated data in the column I'm updating/merging by. My goal is to populate the new columns in dt1 at duplicates only if a condition, specified by another variable, is met. Let me demonstrate what I mean:
library(data.table)
dt1 <- data.table(common_var = c(rep("a", 3), rep("b", 2)),
condition_var = c("update1", rep(c("update2", "update3"), 2)),
other_var = 1:5)
dt2 <- data.table(common_var = c("a", "b", "C", "d"),
new_var1 = 11:14,
new_var2 = 21:24)
# What I want to obtain is the following
dt_goal <- data.table(common_var = dt1$common_var,
condition_var = dt1$condition_var,
other_var = dt1$other_var,
new_var1 = c(11, NA, NA, 12, NA),
new_var2 = c(21, NA, NA, 22, NA))
dt_goal
Updating by reference or merging populates all the matching rows (as expected), but this is not what I want:
# Updating by reference populates all the duplicate rows as expected
# (doesn't work for my purpose)
dt1[, names(dt2) := as.list(dt2[match(dt1$common_var, dt2$common_var),])]
# merging also populates duplicate rows as expected.
# dt3 <- merge(dt1, dt2, by="common_var")
I tried overriding the rows of merged dt3 (or updated dt1) with NAs where I don't want to have data:
dt3 <- dt3[which(alldup(dt3$common_var) & dt3$condition_var %in% c("update2", "update3")), names(dt2)[2:3] := NA]
dt3
The logic in the code above finds duplicates and the unwanted conditional cases, and replaces the selected columns with NA. This partially works, with two problems:
1) If the value to keep (update1) isn't present in other duplicate rows (b in my example), they get erased too
2) This approach requires hard-coding the case I want to keep. In my real-world application, I will loop this type of data prep and the conditional values will change. I know the priority for updating the data table though:
order_to_populate_dups <- c("update1", "update2", "update3")
In other words, I want a code to grow the data table as follows:
1) When no duplicates, add columns by reference (or merge) normally
2) When duplicates are present under the id variable, look at condition_var
2a) If you see update1 add data, if not, next
2b) If you see update2 add data, if not, next
2c) If you see update3 add data, if not, next, ...
I couldn't locate a solution for this problem on SO. Please let me know if this is somehow a duplicate.
Thanks!
Are you looking for something like:
cols <- paste0("new_var", 1:2)
remap <- c(update1=1, update2=2, update3=3)
dt1[, rp := remap[condition_var]]
setkey(dt1, common_var, rp)
dt1[rowid(common_var)==1L, (cols) :=
dt2[.SD, on=.(common_var), mget(paste0("i.",cols))]]
Explanation:
You can use a factor or a named vector to remap your character vector into something that sorts in the desired priority order. Then use setkey to sort the data before performing an update join on the first row of each group of common_var.
Please let me know if i understood your example correctly or not. I can change the solution if needed.
# order dt1 by the common variable and condition
setorder(dt1, common_var, condition_var)
# calculate row_id for each group (grouped by common_var)
dt1[, row_index := rowid(common_var)]
# assume dt2 has only one row per common_var
dt2[, row_index := 1]
# left join on common_var and row_index, reorder columns.
dt3 <- dt2[dt1, on = c('common_var', 'row_index')][, list(common_var, condition_var, other_var, new_var1, new_var2)]

number of rows of .SD

If I have the following simple datatable:
DT <- data.table(VAL = sample(c(1, 2, 3), 10, replace = TRUE),Group = c(rep("A",5),rep("B",5)))
I can calc the mean via:
DT[,lapply(.SD,function(x){mean(x)}),by=Group]
I could also use:
DT[,lapply(.SD,function(x){sum(x)/.N}),by=Group]
But my question is, why does the following NOT work:
DT[,lapply(.SD,function(x){sum(x)/nrow(x)}),by=Group]
From my understanding, .SD is a sub-data.table of the full data.table, so via function(x) I should be able to refer to the number of rows of x - or in other words, why can I calculate sum(x), but not nrow(x), within .SD? I did not find anything in the documentation in this regard.
.SD is a data.table, but when you lapply over it, each x value is a column vector, for which nrow does not work. If you were to do length instead, it returns the number of rows.
DT[,lapply(.SD,function(x){sum(x)/length(x)}),by=Group]
# Group VAL
# 1: A 2.0
# 2: B 1.6

Setting key while chaining in R data.table

Imagine I have a data.table DT that has columns a, b, c. I want to filter rows based on a (say, select only those with value "A"), then compute the sum of b by c. I can do this efficiently, using binary search for the filtering, by
setkey(DT, a)
DT[.("A"), .(sum.b = sum(b)), by = .(c)]
What if I then want to filter rows based on the value of the newly obtained sum.b? If I want to keep rows where sum.b equals one of c(3, 4, 5), I can do that by saying
DT[.("A"), .(sum.b = sum(b)), by = .(c)][sum.b %in% c(3, 4, 5)]
but the latter operation uses vector scan which is slow. Is there a way to set keys "on the fly" while chaining? Ideally I would have
DT[.("A"), .(sum.b = sum(b)), by = .(c)][??set sum.b as key??][.(c(3, 4, 5))]
where I don't know the middle step.
The middle step you are asking about would be one of the following (writing DT for the intermediate result of the first part of the chain):
# unnamed args
DT[,.SD,,sum.b]
# named args
DT[j = .SD, keyby = sum.b]
# semi named
DT[, .SD, keyby = sum.b]
Yet you should benchmark it on your data, as it may turn out slower than the vector scan because of the cost of setting the key.
It looks like eddi already provided that solution in a comment. The FR he mentioned is data.table#1105.
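Put into the question's full chain, the whole thing looks like this sketch (made-up data; nomatch = NULL drops the requested sums that are absent):

```r
library(data.table)

DT <- data.table(a = c("A", "A", "A", "B"),
                 b = c(1, 2, 3, 10),
                 c = c("x", "x", "y", "y"))
setkey(DT, a)

# the middle step keys the intermediate result by sum.b, so the
# final subset is a binary search join, not a vector scan
DT[.("A"), .(sum.b = sum(b)), by = c][
  , .SD, keyby = sum.b][
  .(c(3, 4, 5)), nomatch = NULL]
```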
