data.table lapply dynamic column names [duplicate] - r

I have a data.table in which I'd like to perform the same operation on certain columns. The names of these columns are given in a character vector. In this particular example, I'd like to multiply all of these columns by -1.
Some toy data and a vector specifying relevant columns:
library(data.table)
dt <- data.table(a = 1:3, b = 1:3, d = 1:3)
cols <- c("a", "b")
Right now I'm doing it this way, looping over the character vector:
for (col in 1:length(cols)) {
dt[ , eval(parse(text = paste0(cols[col], ":=-1*", cols[col])))]
}
Is there a way to do this directly without the for loop?

This seems to work:
dt[ , (cols) := lapply(.SD, "*", -1), .SDcols = cols]
The result is
a b d
1: -1 -1 1
2: -2 -2 2
3: -3 -3 3
There are a few tricks here:
Because there are parentheses in (cols) :=, the result is assigned to the columns specified in cols, instead of to some new variable named "cols".
.SDcols tells the call that we're only looking at those columns, and allows us to use .SD, the Subset of the Data associated with those columns.
lapply(.SD, ...) operates on .SD, which is a list of columns (like all data.frames and data.tables). lapply returns a list, so in the end j looks like cols := list(...).
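To make that last point concrete, here is a minimal sketch of the spelled-out equivalent (same toy data as above, shown for illustration only):
library(data.table)
dt <- data.table(a = 1:3, b = 1:3, d = 1:3)
cols <- c("a", "b")
# the lapply() call above builds exactly this: a list with one element per target column,
# assigned back to the columns named in cols
dt[, (cols) := list(a * -1, b * -1)]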
EDIT: Here's another way that is probably faster, as @Arun mentioned:
for (j in cols) set(dt, j = j, value = -dt[[j]])

I would like to add an answer for when you also want to change the names of the columns. This comes in handy if you want to calculate the logarithm of multiple columns, which is often the case in empirical work.
cols <- c("a", "b")
out_cols = paste("log", cols, sep = ".")
dt[, c(out_cols) := lapply(.SD, function(x){log(x = x, base = exp(1))}), .SDcols = cols]
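After this call, dt keeps the original columns and gains the log columns; a quick check (assuming the toy data from above):
names(dt)
# e.g. "a" "b" "d" "log.a" "log.b"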

UPDATE: The following is a neat way to do it without a for loop
dt[,(cols):= - dt[,..cols]]
It is a neat way for easy code readability, but in terms of performance it lags behind Frank's solution, according to the microbenchmark results below
library(microbenchmark)
mbm = microbenchmark(
  base_solution = for (col in 1:length(cols)) {
    dt[ , eval(parse(text = paste0(cols[col], ":=-1*", cols[col])))]
  },
  franks_solution1 = dt[ , (cols) := lapply(.SD, "*", -1), .SDcols = cols],
  franks_solution2 = for (j in cols) set(dt, j = j, value = -dt[[j]]),
  hannes_solution = dt[, c(out_cols) := lapply(.SD, function(x){log(x = x, base = exp(1))}), .SDcols = cols],
  orhans_solution = for (j in cols) dt[,(j):= -1 * dt[, ..j]],
  orhans_solution2 = dt[,(cols):= - dt[,..cols]],
  times = 1000
)
mbm
Unit: microseconds
expr min lq mean median uq max neval
base_solution 3874.048 4184.4070 5205.8782 4452.5090 5127.586 69641.789 1000
franks_solution1 313.846 349.1285 448.4770 379.8970 447.384 5654.149 1000
franks_solution2 1500.306 1667.6910 2041.6134 1774.3580 1961.229 9723.070 1000
hannes_solution 326.154 405.5385 561.8263 495.1795 576.000 12432.400 1000
orhans_solution 3747.690 4008.8175 5029.8333 4299.4840 4933.739 35025.202 1000
orhans_solution2 752.000 831.5900 1061.6974 897.6405 1026.872 9913.018 1000
(A chart of these timings accompanies the original answer.)
My Previous Answer:
The following also works
for (j in cols)
dt[,(j):= -1 * dt[, ..j]]

None of the above solutions seems to work with calculation by group. The following is the best I've got:
for (col in cols) {
  DT[, (col) := scale(.SD[[col]], center = TRUE, scale = TRUE), by = g]
}
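For comparison, a hedged sketch of the same by-group scaling without the loop, assuming DT has the target columns in cols and a grouping column g as above (as.vector() drops the matrix attributes that scale() returns):
DT[, (cols) := lapply(.SD, function(x) as.vector(scale(x))), by = g, .SDcols = cols]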

dplyr functions work on data.tables, so here's a dplyr solution that also "avoids the for-loop" :)
dt %>% mutate(across(all_of(cols), ~ -1 * .))
I benchmarked it using orhan's code (adding rows and columns); you'll see that dplyr::mutate with across generally executes faster than most of the other solutions, but slower than the data.table solution using lapply.
library(data.table); library(dplyr)
dt <- data.table(a = 1:100000, b = 1:100000, d = 1:100000) %>%
mutate(a2 = a, a3 = a, a4 = a, a5 = a, a6 = a)
cols <- c("a", "b", "a2", "a3", "a4", "a5", "a6")
dt %>% mutate(across(all_of(cols), ~ -1 * .))
#> a b d a2 a3 a4 a5 a6
#> 1: -1 -1 1 -1 -1 -1 -1 -1
#> 2: -2 -2 2 -2 -2 -2 -2 -2
#> 3: -3 -3 3 -3 -3 -3 -3 -3
#> 4: -4 -4 4 -4 -4 -4 -4 -4
#> 5: -5 -5 5 -5 -5 -5 -5 -5
#> ---
#> 99996: -99996 -99996 99996 -99996 -99996 -99996 -99996 -99996
#> 99997: -99997 -99997 99997 -99997 -99997 -99997 -99997 -99997
#> 99998: -99998 -99998 99998 -99998 -99998 -99998 -99998 -99998
#> 99999: -99999 -99999 99999 -99999 -99999 -99999 -99999 -99999
#> 100000: -100000 -100000 100000 -100000 -100000 -100000 -100000 -100000
library(microbenchmark)
mbm = microbenchmark(
base_with_forloop = for (col in 1:length(cols)) {
dt[ , eval(parse(text = paste0(cols[col], ":=-1*", cols[col])))]
},
franks_soln1_w_lapply = dt[ , (cols) := lapply(.SD, "*", -1), .SDcols = cols],
franks_soln2_w_forloop = for (j in cols) set(dt, j = j, value = -dt[[j]]),
orhans_soln_w_forloop = for (j in cols) dt[,(j):= -1 * dt[, ..j]],
orhans_soln2 = dt[,(cols):= - dt[,..cols]],
dplyr_soln = (dt %>% mutate(across(all_of(cols), ~ -1 * .))),
times=1000
)
library(ggplot2)
ggplot(mbm) +
geom_violin(aes(x = expr, y = time)) +
coord_flip()
Created on 2020-10-16 by the reprex package (v0.3.0)

To add an example of creating new columns based on a string vector of columns, building on Jfly's answer:
dt <- data.table(a = rnorm(1:100), b = rnorm(1:100), c = rnorm(1:100), g = c(rep(1:10, 10)))
col0 <- c("a", "b", "c")
col1 <- paste0("max.", col0)
for(i in seq_along(col0)) {
dt[, (col1[i]) := max(get(col0[i])), g]
}
dt[,.N, c("g", col1)]

library(data.table)
(dt <- data.table(a = 1:3, b = 1:3, d = 1:3))
which prints:
a b d
1: 1 1 1
2: 2 2 2
3: 3 3 3
Whereas (dt*(-1)) yields:
a b d
1: -1 -1 -1
2: -2 -2 -2
3: -3 -3 -3

Related

filter data.table using vector

I'm having a hard time solving this issue: for a given data.table, can I filter all rows that pass a criterion on all columns?
example:
dt <-data.table(col_a = c(1,1,0,0,1),
col_b = c(50,0,0,1,0),
col_c = c(0,0,0,0,0),
col_d = c(0,0,0,0,0),
col_e = c(1,0,0,0,10))
I want to return the rows that pass filter <- c(T,F,F,F,T), so row number 5.
I've tried dt[, filter], which tells me that 'filter' is not found.
I tried dt[, c(T,F,F,F,T)], but this just returns the logical vector [1] TRUE FALSE FALSE FALSE TRUE.
Can I solve this by only using data.table?
It is unclear from the description in the post; based on the comments, the OP wants to select the rows that match the values in filter. To do that, first convert the columns to logical, replicate filter so the dimensions match before comparing with ==, take rowSums, and check whether the result equals ncol of the original dataset to subset the rows.
dt[rowSums(dt[, lapply(.SD, as.logical)] == filter[col(dt)])== ncol(dt)]
# col_a col_b col_c col_d col_e
#1: 1 0 0 0 10
Another option is to paste each row into a single string and then compare:
dt[dt[, do.call(paste0, lapply(.SD, function(x) +(as.logical(x))))]
== paste(+(filter), collapse = "")]
Another approach is to loop through the columns, store the boolean comparison output as a list of vectors, and Reduce:
lst1 <- vector('list', ncol(dt))
for(j in seq_along(dt)) lst1[[j]] <- as.logical(dt[[j]]) == filter[j]
dt[Reduce(`&`, lst1)]
Or a similar approach with Map/Reduce
dt[dt[, Reduce(`&`, Map(`==`, lapply(.SD, as.logical), filter))]]
Considering the size of your actual dataset, you might be better off converting it into long format and then performing the filtering:
ans <- melt(DT[, rn := .I], id.vars="rn")[,
value := as.logical(value)][,
if (all(value==filter)) rn, rn]$V1
one timing:
library(data.table)
set.seed(0L)
nc <- 392
nr <- 2e6
filter <- sample(c(1,0), nc, TRUE)
loc <- which(filter>0L)
M <- matrix(sample(c(1,0), nc*nr, TRUE), nrow=nr)
DT <- as.data.table(M)
system.time({
ans <- melt(DT[, rn := .I], id.vars="rn")[,
value := as.logical(value)][,
if (all(value==filter)) rn, rn]$V1
})
# user system elapsed
# 2.20 0.84 1.72
Some other options, though not as fast as converting into long format:
library(Matrix)
library(data.table)
library(microbenchmark)
set.seed(0L)
nc <- 392
nr <- 1e5
filter <- sample(c(1,0), nc, TRUE)
loc <- which(filter>0L)
M <- matrix(sample(c(1,0), nc*nr, TRUE), nrow=nr)
DT <- as.data.table(M)
# filter <- c(T,F,F,F,T)
# DT <- data.table(c(1,1,0,0,1), c(50,0,0,1,0), c(0,0,0,0,0), c(0,0,0,0,0), c(1,0,0,0,10))
# M <- as.matrix(DT)
loc <- which(filter>0L)
sumF <- sum(filter)
DTo_f <- copy(DT)
DTj_f <- copy(DT)
#Sparse matrix
sm_f <- function() {
sM <- as(M, "dgTMatrix")
ixDT <- data.table(R=sM@i+1L, C=sM@j+1L, I=1L)
univ <- data.table(R=rep(1:nr, each=length(loc)), C=rep(loc, nr), U=1L)
mgDT <- merge(univ, ixDT, by=c("R", "C"), all=TRUE)
mgDT[, if(!(anyNA(U) | anyNA(I))) R, R]$V1
}
#melt
m_f <- function() {
melt(DT[, rn := .I], id.vars="rn")[,
value := as.logical(value)][,
if (all(value==filter)) rn, rn]$V1
}
#order
o_f <- function() {
non0 <- DTo_f[, {
m <- as.matrix(.SD)
ri <- replace(col(.SD), .SD==0L, NA_integer_)
as.data.table(matrix(ri[order(row(.SD), ri, na.last=TRUE)], nrow=.N, byrow=TRUE))
}]
non0[setNames(as.list(c(loc, rep(NA_integer_, nc - length(loc)))), names(DTo_f)),
on=.NATURAL, which=TRUE]
}
#join
j_f <- function() {
setindexv(DTj_f, names(DTj_f))
DTj_f[, names(DTj_f) := lapply(DTj_f, as.logical)]
DTj_f[as.list(as.logical(filter)), on=names(DTj_f), which=TRUE]
}
microbenchmark(sm_f(), m_f(), o_f(), j_f(), times=1L)
timings:
Unit: seconds
expr min lq mean median uq max neval
sm_f() 9.134432 9.134432 9.134432 9.134432 9.134432 9.134432 1
m_f() 2.020081 2.020081 2.020081 2.020081 2.020081 2.020081 1
o_f() 3.413685 3.413685 3.413685 3.413685 3.413685 3.413685 1
j_f() 7.149763 7.149763 7.149763 7.149763 7.149763 7.149763 1
You can use which(colSums((df > 0) == filter) == nrow(df)) to get the index
> which(colSums((df>0)==filter)==nrow(df))
col_e
5
such that
> df[which(colSums((df>0)==filter)==nrow(df))]
col_a col_b col_c col_d col_e
1: 1 0 0 0 10
If I understand the question correctly, this should answer the question.
Reproduce your data:
library(data.table)
dt <-data.table(col_a = c(1,1,0,0,1),
col_b = c(50,0,0,1,0),
col_c = c(0,0,0,0,0),
col_d = c(0,0,0,0,0),
col_e = c(1,0,0,0,10))
filter<-c(T,F,F,F,T)
Now create a variable that checks for non-zero values in each row and subset accordingly:
to_subset = apply(dt, 1, function(x) {
all((x > 0) == filter)
})
# the output you are looking for
dt[to_subset]
# col_a col_b col_c col_d col_e
# 1: 1 0 0 0 10
The code can be collapsed to be more concise.
dt[apply(dt, 1, function(x) all((x > 0) == filter))]
# col_a col_b col_c col_d col_e
# 1: 1 0 0 0 10

Removing rows where multiple columns equal an exact number R

I'd like to subset rows where x1 and x2 == 9. My real data set has over 200 columns whose names start with the same string. The dummy code below creates a smaller sample of the data. Ideally I'd like to do this with the R data.table package, if possible.
df <- data.frame('id'=c(1,2,3), 'x1'=c(9,9,4), 'x2'=c(9,9,4))
head(df)
# does not work, but thought perhaps I could have defined the columns via a paste and then subset where columns were equal to 9.
df[which(paste0("x", 1:2)==9), ]
Update: sorry if I wasn't clear. I am aware of simply adding a filter for x1 and x2. The issue is that the real data consists of over 200 columns: x1:x200. I am in search of a cleaner solution than what is proposed below.
If you want an efficient base R solution I would simply use rowSums, e.g.
cols <- paste0("x", 1:2)
df[rowSums(df[cols] == 9) == length(cols), ]
# id x1 x2
# 1 1 9 9
# 2 2 9 9
If you want a data.table solution, I would use a binary join, e.g.
library(data.table)
setDT(df)[as.list(rep(9, length(cols))), on = cols]
# id x1 x2
# 1: 1 9 9
# 2: 2 9 9
Data
df <- data.frame(id = 1:3, x1 = c(9, 9, 4), x2 = c(9, 9, 4))
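For the real 200-column case the binary join is unchanged; a hedged sketch (nomatch = NULL drops rows that do not match on every column):
cols <- paste0("x", 1:200)
setDT(df)[as.list(rep(9, length(cols))), on = cols, nomatch = NULL]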
Something like this, perhaps?
df[apply(df[, paste0("x", 1:200)] == 9, 1, all), ]
A melt allows you to avoid writing out every column (for your >2 column case):
> aTbl = as.data.table(df)
> aTbl[, all9sP := F]
> aTbl[, .SD
][, !'all9sP'
][, melt(.SD, id.vars=c('id'))
][, NVars := uniqueN(variable)
][value == 9
][, .(N9s=.N), .(id, NVars)
][, all9sP := N9s == NVars
][, aTbl[.SD, all9sP := i.all9sP, on=.(id)]
][all9sP == T
][, all9sP := NULL
][, .SD
]
id x1 x2
1: 1 9 9
2: 2 9 9
>
Try:
df[df$x1 == 9 & df$x2 == 9,]
EDIT (misunderstood, now it should do the trick):
for (i in 2:200) {df = df[df[,i] == 9,]}
You could also use grep with apply
# Select all columns whose names contain "x"
col.names <- grep("x", colnames(df), value = TRUE)
# Keep rows that contain a 9 in any of those columns
sel <- apply(df[, col.names], 1, function(row) 9 %in% row)
df[sel, ]
And the output
id x1 x2
1 1 9 9
2 2 9 9
Solution using data.table
Create dataset
ncols <- 5
cnms <- paste0("x", 1:ncols)
X <- data.table(ID = 1:1e6)
X[, (cnms) := NA_integer_]
X[, (cnms) := lapply(X = 1:ncols, sample, size = .N, x = 1:10)]
Find rows where sum equals 9
X1 <- X[, s := rowSums(.SD), .SDcols = cnms][s == 9, ][, s:= NULL][]
X1
Find rows where all columns are equal to 9
X[, s := NULL]
ind <- rowSums(X[, lapply(.SD, is.element, set = 9), .SDcols = cnms])
X2 <- X[ind == length(cnms)][]
X2
Edit
This is actually a lot faster:
X[, s := NULL]
ind <- rowSums(X[, .SD , .SDcols = cnms] == 9)
X2 <- X[ind == length(cnms)][]
X2
Edit2
See answer from https://stackoverflow.com/users/3001626/david-arenburg. A lot faster.
In the tidyverse, try rowwise and use filter as usual
df %>%
rowwise() %>%
filter(x1 %in% 9 & x2 %in% 9 )
Source: local data frame [2 x 3]
Groups: <by row>
# A tibble: 2 x 3
id x1 x2
<dbl> <dbl> <dbl>
1 1 9 9
2 2 9 9
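With many columns, a hedged dplyr variant avoids listing each one (assumes dplyr >= 1.0.4 for if_all(); rowwise() is not needed since the comparison is already vectorised):
df %>% filter(if_all(starts_with("x"), ~ . == 9))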

Sum over rows by group (many columns at once)

I need to take column sums over a large range of select columns. For example:
library(data.table)
set.seed(123)
DT = data.table(grp = c("A", "B", "C"),
x1 = sample(1:10, 3),
x2 = sample(1:10, 3),
x3 = sample(1:10, 3),
x4 = sample(1:10, 3))
> DT
grp x1 x2 x3 x4
1: A 3 9 6 5
2: B 8 10 9 9
3: C 4 1 5 4
Say, I want to sum over x2 and x3. I would normally do this using:
> DT[, .(total = sum(x2, x3)), by=grp]
grp total
1: A 15
2: B 19
3: C 6
However, if the range of columns is very large, say 100, how can this be coded elegantly, without spelling out each column by name?
What I tried (and what didn't work):
my_cols <- paste0("x", 2:3)
DT[, .(total = sum(get(my_cols))), by=grp]
grp total
1: A 9
2: B 10
3: C 1
Appears to use only the first column (x2) and disregard the rest.
I didn't find an exact dupe (that deals with sums by row and by group), so here are 5 different possibilities I could think of.
The main thing to remember here is that you are working with a data.table per group; hence, some functions won't work without unlist
## Create an example data
library(data.table)
set.seed(123)
DT <- data.table(grp = c("A", "B", "C"),
matrix(sample(1:10, 30 * 4, replace = TRUE), ncol = 4))
my_cols <- paste0("V", 2:3)
## 1- This won't work with `NA`s. It will work without `unlist`,
## but won't return correct results.
DT[, Reduce(`+`, unlist(.SD)), .SDcols = my_cols, by = grp]
## 2 - Convert to long format first and then aggregate
melt(DT, "grp", measure = my_cols)[, sum(value), by = grp]
## 3 - Using `base::sum` which can handle data.frames,
## see `?S4groupGeneric` (a data.table is also a data.frame)
DT[, base::sum(.SD), .SDcols = my_cols, by = grp]
## 4 - This will use data.tables enhanced `gsum` function,
## but it can't handle data.frames/data.tables
## Hence, requires unlist first. Will be interesting to measure the tradeoff
DT[, sum(unlist(.SD)), .SDcols = my_cols, by = grp]
## 5 - This is a modification to your original attempt that both handles multiple columns
## (`mget` instead of `get`) and adds `unlist`
## (no point trying with `base::sum` instead, because it will also require `unlist`)
DT[, sum(unlist(mget(my_cols))), by = grp]
All of these will return the same result
# grp V1
# 1: A 115
# 2: B 105
# 3: C 96
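A related hedged variant (not benchmarked below): sum each column separately, then add the per-column totals, which avoids building the unlisted vector:
DT[, Reduce(`+`, lapply(.SD, sum)), .SDcols = my_cols, by = grp]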
Some benchmarks
library(data.table)
library(microbenchmark)
library(stringi)
set.seed(123)
N <- 1e5
cols <- 50
DT <- data.table(grp = stri_rand_strings(N / 1e4, 2),
matrix(sample(1:10, N * cols, replace = TRUE),
ncol = cols))
my_cols <- paste0("V", 1:20)
mbench <- microbenchmark(
"Reduce/unlist: " = DT[, Reduce(`+`, unlist(.SD)), .SDcols = my_cols, by = grp],
"melt: " = melt(DT, "grp", measure = my_cols)[, sum(value), by = grp],
"base::sum: " = DT[, base::sum(.SD), .SDcols = my_cols, by = grp],
"gsum/unlist: " = DT[, sum(unlist(.SD)), .SDcols = my_cols, by = grp],
"gsum/mget/unlist: " = DT[, sum(unlist(mget(my_cols))), by = grp]
)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# Reduce/unlist: 1968.93628 2185.45706 2332.66770 2301.10293 2440.43138 3161.15522 100 c
# melt: 33.91844 58.18254 66.70419 64.52190 74.29494 132.62978 100 a
# base::sum: 18.00297 22.44860 27.21083 25.14174 29.20080 77.62018 100 a
# gsum/unlist: 780.53878 852.16508 929.65818 894.73892 968.28680 1430.91928 100 b
# gsum/mget/unlist: 797.99854 876.09773 963.70562 928.27375 1003.04632 1578.76408 100 b
library(ggplot2)
autoplot(mbench)

Specific Ordering in R

I am making all possible combinations for a specific input, but they have to be ordered according to the order of the input as well. Since the combinations are of different sizes, I'm struggling to apply the answers previously posted.
I would like to know if this is possible.
Input:
D N A 3
This means I need to output all combinations up to 3-character strings:
D
DD
DDD
DDN
DDA
DND
DNA
.
.
Which is basically ascending order if we consider D<N<A
So far my output looks like this:
A
AA
AAA
AAD
AAN
AD
ADA
ADD
ADN
AN
.
.
I have tried converting the input to a factor, c("D","N","A"), and sorting my output, but then any string longer than 1 character disappears.
Here's one possible solution:
generateCombs <- function(x, n){
if (n == 1) return(x[1]) # Base case
# Create a grid with all possible permutations of 0:n. 0 == "", and 1:n correspond to elements of x
permutations = expand.grid(replicate(n, 0:n, simplify = F))
# Order permutations
orderedPermutations = permutations[do.call(order, as.list(permutations)),]
# Map permutations now such that 0 == "", and 1:n correspond to elements of x
mappedPermutations = sapply(orderedPermutations, function(y) c("", x)[y + 1])
# Collapse each row into a single string
collapsedPermutations = apply(mappedPermutations, 1, function(x) paste0(x, collapse = ""))
# Due to the 0's, there will be duplicates. We remove the duplicates in reverse order
collapsedPermutations = rev(unique(rev(collapsedPermutations)))[-1] # -1 removes blank
# Return as data frame
return (as.data.frame(collapsedPermutations))
}
x = c("D", "N", "A")
n = 3
generateCombs(x, n)
The output is:
collapsedPermutations
1 D
2 DD
3 DDD
4 DDN
5 DDA
6 DN
7 DND
8 DNN
9 DNA
10 DA
11 DAD
...
A solution using a random library I just found (so I might be using it wrong) called iterpc.
Generate all the combinations, factor the elements, sort, then hack into a string.
ordered_combn = function(elems) {
require(data.table)
require(iterpc)
I = lapply(seq_along(elems), function(i) iterpc::iterpc(table(elems), i, replace=TRUE, ordered=TRUE))
I = lapply(I, iterpc::getall)
I = lapply(I, as.data.table)
dt = rbindlist(I, fill = TRUE)
dt[is.na(dt)] = ""
cols = paste0("V", 1:length(elems))
dt[, (cols) := lapply(.SD, factor, levels = c("", elems)), .SDcols = cols]
setkey(dt)
dt[, ID := 1:.N]
dt[, (cols) := lapply(.SD, as.character), .SDcols = cols]
dt[, ord := paste0(.SD, collapse = ""), ID, .SDcols = cols]
# return dt[, ord] as an ordered factor for neatness
dt
}
elems = c("D", "N", "A")
combs = ordered_combn(elems)
combs
Output
V1 V2 V3 ID ord
1: D 1 D
2: D D 2 DD
3: D D D 3 DDD
4: D D N 4 DDN
5: D D A 5 DDA
6: D N 6 DN
7: D N D 7 DND
8: D N N 8 DNN
...

Understanding .I in data.table in R

I was playing around with data.table and I came across a distinction that I'm not sure I quite understand. Given the following dataset:
library(data.table)
set.seed(400)
DT <- data.table(x = sample(LETTERS[1:5], 20, TRUE), key = "x"); DT
Can you please explain to me the difference between the following expressions?
1) DT[J("E"), .I]
2) DT[ , .I[x == "E"] ]
3) DT[x == "E", .I]
set.seed(400)
library(data.table)
DT <- data.table(x = sample(LETTERS[1:5], 20, TRUE), key = "x"); DT
1)
DT[ , .I[x == "E"] ] # [1] 18 19 20
returns an integer vector giving the row numbers of the "E" rows in the ORIGINAL dataset DT
2)
DT[J("E") , .I] # [1] 1 2 3
DT["E" , .I] # [1] 1 2 3
DT[x == "E", .I] # [1] 1 2 3
are all the same, producing a vector of the row numbers of the "E" rows within the NEW, subsetted data
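As a hedged aside, if you want the original row numbers when subsetting by key, which = TRUE returns them directly:
DT[J("E"), which = TRUE] # 18 19 20 for this seed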
