Setting key while chaining in R data.table

Imagine I have a data.table DT with columns a, b, c. I want to filter rows based on a (say, select only those with value "A") and then compute the sum of b by c. I can do this efficiently, using binary search for the filtering, by
setkey(DT, a)
DT[.("A"), .(sum.B = sum(B)), by = .(C)]
What if I then want to filter rows based on the value of the newly obtained sum.b? If I want to keep rows where sum.b equals one of c(3, 4, 5), I can do that by saying
DT[.("A"), .(sum.B = sum(B)), by = .(C)][sum.b %in% c(3, 4, 5)]
but the latter operation uses vector scan which is slow. Is there a way to set keys "on the fly" while chaining? Ideally I would have
DT[.("A"), .(sum.B = sum(B)), by = .(C)][??set sum.b as key??][.(c(3, 4, 5))]
where I don't know the middle step.

The middle step you are asking about would be the following:
# unnamed args
DT[,.SD,,sum.b]
# named args
DT[j = .SD, keyby = sum.b]
# semi named
DT[, .SD, keyby = sum.b]
Still, you should benchmark it on your data, as it may turn out slower than a vector scan because setting the key requires sorting.
It looks like eddi already provided that solution in a comment. The feature request he mentioned is data.table#1105.
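Putting the pieces together, the whole pipeline from the question would then read as follows (a sketch, assuming the lowercase column names a, b, c from the question):
setkey(DT, a)
# aggregate, key the intermediate result on sum.b, then binary-search it;
# nomatch = 0L drops unmatched sums, mimicking the %in% behaviour
DT[.("A"), .(sum.b = sum(b)), by = .(c)][, .SD, keyby = sum.b][.(c(3, 4, 5)), nomatch = 0L]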

Related

In base or data.table for R, use a function, evaluated on a column, to select rows?

Given a data.table DT with a column Col1, select the rows of DT where the values x in Col1 satisfy some boolean expression, for example f(x) == TRUE or f(x) <= 4, and then perform further data.table operations.
For example, I tried something like
DT[f(Col1) == TRUE, Col2 := 2]
which does not work because f() acts on single values, not vectors. Using lapply() seems to work, but it takes a long time to run with a very large DT.
A workaround would be to create a column and use it to select the rows
DT[, fvalues := f(Col1)][fvalues == TRUE, Col2 := 2]
but it would be better not to increase the size of DT.
EDIT: Here is an example.
map1 <- data.table(k1 = c("A", "B", "C"), v = c(-1, 2, 3))
map2 <- data.table(k2 = c("A", "B", "A", "A", "C", "B"), k3 = c("A", "B", "C", "B", "C", "B"))
f <- function(x) map1[k1 == x, v]
To flag the rows in map2 using the corresponding values in map1, these do not work (returning an error about a length mismatch around ==):
map2[f(k2) == 2, flag1 := TRUE]
map2[f(k2) + f(k3) == 2, flag2 := TRUE]
Using lapply(), the first one works, but it is somehow slower (for a large data table) than adding a column to map2 with the values of f and selecting on that new column:
map2[lapply(k2,f) == 2, flag1 := TRUE]
and the second one
map2[lapply(k2,f) + lapply(k3,f) == 2, flag2 := TRUE]
returns an error (non-numeric argument).
The question would be how to do this most efficiently, particularly without making the data table larger.
I think the problem is perhaps the parts that aim to modify the data.table in place (i.e. your := parts). I don't think you can filter in place, as filtering really requires writing to a new memory location.
This works for filtering, whilst creating a new object:
library(data.table)
f <- function(x) x > 0.5
DT <- data.table(Col1 = runif(10))
DT[f(Col1),]
#> Col1
#> 1: 0.7916055
#> 2: 0.5391773
#> 3: 0.6855657
#> 4: 0.5250881
#> 5: 0.9089948
#> 6: 0.6639571
To do more data.table operations on a filtered table, assign to a new object and work with that one:
DT2 <- DT[f(Col1),]
DT2[, Col2 := 2]
Perhaps I've misunderstood your problem though - what function are you using? Could you post more code so we can replicate your problem more precisely?
If f is a function working only on scalar values, you could Vectorize it:
DT[Vectorize(f)(Col1)]
Not sure this fully answers your question, though, because Vectorize is built on mapply, so it may not be faster than the lapply() approach.
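Applied to the example from the question, this might look like the following sketch (f_vec is just an illustrative name):
f_vec <- Vectorize(f)
# the comparisons now operate on whole columns
map2[f_vec(k2) == 2, flag1 := TRUE]
map2[f_vec(k2) + f_vec(k3) == 2, flag2 := TRUE]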

R data.table: how to refer to current object in a chain?

In R's data.table, one can chain multiple operations by appending square brackets, each of which can use non-standard evaluation of e.g. column names for whatever the current transformation in the chain is - for example:
dt[,
.(agg1=mean(var1), agg2=mean(var2)),
by=.(col1, col2)
][,
.(agg2=ceiling(agg2), agg3=agg2^2)
]
Suppose I want to perform an operation that involves computing some function taking a whole data.frame as input, applied to the current object in the chain, and use the results in e.g. a by clause. In magrittr/dplyr chains, I can use . to refer to the object that is passed to a pipe and do arbitrary operations with it, but data.table chains work quite differently.
For example, suppose I have this table and I want to compute a percentage of total across groups:
library(data.table)
dt = data.table(col1 = c(1,1,1,2,2,3), col2=10:15)
dt[, .(agg1 = sum(col2)/sum(dt$col2)), by=.(col1)]
col1 agg1
1: 1 0.44
2: 2 0.36
3: 3 0.20
In this snippet, I referred to the full object name inside [ to compute the total of the column as sum(dt$col2), which bypasses the by part. But if this were in a chain and the columns were calculated through other operations, I wouldn't be able to simply use dt like that, as it is an external variable (that is, it's evaluated in an environment outside the [ scope).
How can I refer to the current object/table inside a chain? To be clear, I am not asking about how to do this specific grouping operation, but about a general way of accessing the current object so as to use it in arbitrary functions.
We could use .SDcols, or directly use .SD and select the column as a vector with [[:
dt[, .(agg1 = sum(col2)/sum(dt$col2)), by=.(col1)][,
.(agg1 = ceiling(.SD[["agg1"]]))]
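For the .SDcols route, a sketch of the same second step might be:
# apply ceiling to the agg1 column selected via .SDcols
dt[, .(agg1 = sum(col2)/sum(dt$col2)), by = .(col1)][
, lapply(.SD, ceiling), .SDcols = "agg1"]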
Interesting question!
If I understand correctly, the OP wants to create complex chains of data.table expressions. One sample use case is to compute a percentage of total across groups in a chain like in
dt[, .(agg1 = sum(col2) / sum(dt$col2)), by = col1]
col1 agg1
1: 1 0.44
2: 2 0.36
3: 3 0.20
Here col2 appears in two places. First, to compute the group totals, and second to compute the grand total.
data.table has a special symbol .SD which stands for Subset, Selfsame, or Self-reference of the Data (please, see the vignette Using .SD for Data Analysis for details and examples).
When used in grouping (by = ), .SD refers to the subset of the data within each group and thus cannot be used to refer to the whole (ungrouped) dataset. So,
data.table(col1 = c(1, 1, 1, 2, 2, 3), col2 = 10:15)[
, .(agg1 = sum(col2) / sum(.SD$col2)), by = col1]
returns the trivial result 1 for each group.
Chaining within a chain
However, I believe I have found a way to accomplish OP's goal by chaining within a chain:
data.table(col1 = c(1, 1, 1, 2, 2, 3), col2 = 10:15)[
, .SD[, .(agg1 = sum(col2)), by = col1][, .(col1, agg1 = agg1 / sum(col2))]]
col1 agg1
1: 1 0.44
2: 2 0.36
3: 3 0.20
Here,
data.table(col1 = c(1, 1, 1, 2, 2, 3), col2 = 10:15)[, j]
is a data.table chain with a data.table created on-the-fly, i.e., without assigning to a variable name, and with the j expression being another data.table chain
.SD[, .(agg1 = sum(col2)), by = col1][, .(col1, agg1 = agg1 / sum(col2))]
The inner chain consists of three parts:
.SD refers to the whole dataset.
The next expression computes the aggregates by group, returning a data.table with two columns, col1 and agg1.
The last expression divides agg1 by the grand total sum(col2). As col2 is not included in the result of step 2, it is taken from the whole dataset.
However, if you do not want to rely on scoping rules, we can achieve the same result by explicitly computing the grand total beforehand:
data.table(col1 = c(1, 1, 1, 2, 2, 3), col2 = 10:15)[
, {tmp <- sum(col2); .SD[, .(agg1 = sum(col2) / tmp), by = col1]}]
Here, we use the fact that in data.table j can be an arbitrary expression (enclosed in curly braces).

Update data.table by reference but populate only certain rows when duplicates are present using a prioritized vector

I didn't quite know how to word the title, but here is what I'm trying to do. I'd like to grow the data.table dt1 using columns from dt2. In dt1, there are duplicates in the column I'm updating/merging by. My goal is to populate the new columns at duplicates only if a condition specified by another variable is met. Let me demonstrate what I mean:
library(data.table)
dt1 <- data.table(common_var = c(rep("a", 3), rep("b", 2)),
condition_var = c("update1", rep(c("update2", "update3"), 2)),
other_var = 1:5)
dt2 <- data.table(common_var = c("a", "b", "C", "d"),
new_var1 = 11:14,
new_var2 = 21:24)
# What I want to obtain is the following
dt_goal <- data.table(common_var = dt1$common_var,
condition_var = dt1$condition_var,
other_var = dt1$other_var,
new_var1 = c(11, NA, NA, 12, NA),
new_var2 = c(21, NA, NA, 22, NA))
dt_goal
Updating by reference or merging populates all the matching rows (as expected), but this is not what I want:
# Updating by reference populates all the duplicate rows as expected
# (doesn't work for my purpose)
dt1[, names(dt2) := as.list(dt2[match(dt1$common_var, dt2$common_var),])]
# merging also populates duplicate rows as expected.
# dt3 <- merge(dt1, dt2, by="common_var")
I tried overriding the rows of merged dt3 (or updated dt1) with NAs where I don't want to have data:
dt3 <- dt3[which(alldup(dt3$common_var) & dt3$condition_var %in% c("update2", "update3")), names(dt2)[2:3] := NA]
dt3
The logic in the code above finds duplicates and the unwanted conditional cases, and replaces the selected columns with NA. This partially works, with two problems:
1) If the value to keep (update1) isn't present in other duplicate rows (b in my example), they get erased too
2) This approach requires hard-coding the case I want to keep. In my real-world application, I will loop this type of data prep and the conditional values will change. I know the priority for updating the data table though:
order_to_populate_dups <- c("update1", "update2", "update3")
In other words, I want the code to grow the data.table as follows:
1) When no duplicates, add columns by reference (or merge) normally
2) When duplicates are present under the id variable, look at condition_var
2a) If you see update1 add data, if not, next
2b) If you see update2 add data, if not, next
2c) If you see update3 add data, if not, next, ...
I couldn't locate a solution for this problem in SO. Please let me know if this is somehow duplicate.
Thanks!
Are you looking for something like:
cols <- paste0("new_var", 1:2)
remap <- c(update1 = 1, update2 = 2, update3 = 3)
dt1[, rp := remap[condition_var]]
setkey(dt1, common_var, rp)
dt1[rowid(common_var) == 1L, (cols) :=
    dt2[.SD, on = .(common_var), mget(paste0("x.", cols))]]
Explanation:
You can use a factor or a named vector to remap your character vector onto something that can be ordered accordingly. Then use setkey to sort the data before performing an update join on the first row of each group of common_var.
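If you don't want the helper column in the final result, it can be dropped by reference afterwards:
# remove the helper priority column once the update join is done
dt1[, rp := NULL]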
Please let me know if I understood your example correctly or not. I can change the solution if needed.
# order dt1 by the common variable and the condition variable
setorder(dt1, common_var, condition_var)
# calculate row_id for each group (grouped by common_var)
dt1[, row_index := rowid(common_var)]
# assume dt2 has only one row per common_var
dt2[, row_index := 1]
# left join on common_var and row_index, reorder columns.
dt3 <- dt2[dt1, on = c('common_var', 'row_index')][, list(common_var, condition_var, other_var, new_var1, new_var2)]
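Note that this adds a row_index helper column to dt2 by reference; if that is unwanted, it can be removed afterwards:
# drop the helper column from dt2 once the join is done
dt2[, row_index := NULL]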

How to remove duplicated (by name) column in data.tables in R?

While reading a data set using fread, I've noticed that sometimes I'm getting duplicated column names, for example (fread doesn't have a check.names argument):
> data.table( x = 1, x = 2)
x x
1: 1 2
The question is: is there any way to remove 1 of 2 columns if they have the same name?
How about
dt[, .SD, .SDcols = unique(names(dt))]
This selects the first occurrence of each name (I'm not sure how you want to handle this).
As @DavidArenburg suggests in the comments above, you could use check.names=TRUE in data.table() or fread().
The .SDcols approaches would return a copy of the columns you're selecting. Instead, just remove those duplicated columns by reference using :=.
dt[, which(duplicated(names(dt))) := NULL]
# x
# 1: 1
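If you'd rather keep the last occurrence of each name instead of the first, a sketch of the same idea:
# fromLast = TRUE marks all but the last occurrence as duplicates
dt[, which(duplicated(names(dt), fromLast = TRUE)) := NULL]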
Different approaches:
1. Indexing
my.data.table <- my.data.table[, -2]
2. Subsetting
my.data.table <- subset(my.data.table, select = -2)
3. Making unique names, if 1. and 2. are not ideal (when having hundreds of columns, for instance)
setnames(my.data.table, make.names(names = names(my.data.table), unique = TRUE))
4. Optionally, systematize the deletion of variables whose names meet some criterion (here, we get rid of all variables having a name ending with ".X", X being a number starting at 2 when using make.names)
my.data.table <- subset(my.data.table,
select = !grepl(pattern = "\\.\\d$", x = names(my.data.table)))

assigning a subset of data.table rows and columns by join

I'm trying to do something similar but different enough from what's described here:
Update subset of data.table based on join
Specifically, I'd like to assign column values from table control to matching key values (person_id is a key in both tables). ci is the column index. The statement below complains that 'with=F' was not used; when I delete those parts, it also doesn't work as expected. Any suggestions?
To rephrase: I'd like to set the subset of flatData that corresponds to control FROM control.
flatData[J(eval(control$person_id)), ci, with=F] = control[, ci, with=F]
To give a reproducible example using classic R:
x = data.frame(a = 1:3, b = 1:3, key = c('a', 'b', 'c'))
y = data.frame(a = c(2, 5), b = c(11, 2), key = c('a', 'b'))
colidx = match(c('a', 'b'), colnames(y))
x[x$key %in% y$key, colidx] = y[, colidx]
As an aside, someone please explain how to easily assign SETS of columns without using indices! Indices and data.table are a marriage made in hell.
You can use the := operator along with the join simultaneously as follows:
First prepare data:
require(data.table) ## >= 1.9.0
setDT(x) ## converts DF to DT by reference
setDT(y)
setkey(x, key) ## set key column
setkey(y, key)
Now the one-liner:
x[y, c("a", "b") := list(i.a, i.b)]
:= modifies by reference (in-place). The rows to modify are provided by the indices computed from the join in i.
i.a and i.b are the column names data.table internally generates for easy access to i's columns when both x and i have identical column names, when performing a join of the form x[i].
HTH
PS: In your example, y's columns a and b are of type numeric while x's are of type integer, so when this is run on your data you'll get a warning that the types didn't match and a coercion had to take place.
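As for the aside about assigning sets of columns without hard-coding each name, one sketch (assuming the keyed x and y from above) uses mget on the i.-prefixed names:
cols <- c("a", "b")
# build the i.-prefixed names and fetch them as a list for :=
x[y, (cols) := mget(paste0("i.", cols))]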
