I have a data.table and I'd like to perform the same operation on certain of its columns. The names of these columns are given in a character vector. In this particular example, I'd like to multiply all of these columns by -1.
Some toy data and a vector specifying relevant columns:
library(data.table)
dt <- data.table(a = 1:3, b = 1:3, d = 1:3)
cols <- c("a", "b")
Right now I'm doing it this way, looping over the character vector:
for (col in 1:length(cols)) {
dt[ , eval(parse(text = paste0(cols[col], ":=-1*", cols[col])))]
}
Is there a way to do this directly without the for loop?
This seems to work:
dt[ , (cols) := lapply(.SD, "*", -1), .SDcols = cols]
The result is
a b d
1: -1 -1 1
2: -2 -2 2
3: -3 -3 3
There are a few tricks here:
Because there are parentheses in (cols) :=, the result is assigned to the columns specified in cols, instead of to some new variable named "cols".
.SDcols tells the call that we're only looking at those columns, and allows us to use .SD, the Subset of the Data associated with those columns.
lapply(.SD, ...) operates on .SD, which is a list of columns (like all data.frames and data.tables). lapply returns a list, so in the end j looks like cols := list(...).
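For concreteness, here is a minimal sketch (reusing the toy data above) of what that j expression effectively builds; the spelled-out assignment is shown only for illustration:
library(data.table)
dt <- data.table(a = 1:3, b = 1:3, d = 1:3)
cols <- c("a", "b")
# lapply(.SD, "*", -1) returns a list of the negated columns, so the call above
# behaves roughly like writing the list out by hand:
dt[ , c("a", "b") := list(a * -1, b * -1)]
dt
#     a  b d
# 1: -1 -1 1
# 2: -2 -2 2
# 3: -3 -3 3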
EDIT: Here's another way that is probably faster, as @Arun mentioned:
for (j in cols) set(dt, j = j, value = -dt[[j]])
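If you need this often, the set() loop can be wrapped in a small helper; here is a minimal sketch (negate_cols is a made-up name, not a data.table function):
library(data.table)
negate_cols <- function(DT, cols) {
  for (j in cols) set(DT, j = j, value = -DT[[j]])  # modifies DT by reference
  invisible(DT)
}
dt <- data.table(a = 1:3, b = 1:3, d = 1:3)
negate_cols(dt, c("a", "b"))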
I would like to add an answer for the case when you also want to change the names of the columns. This comes in quite handy if you want to calculate the logarithm of multiple columns, which is often the case in empirical work.
cols <- c("a", "b")
out_cols = paste("log", cols, sep = ".")
dt[, c(out_cols) := lapply(.SD, function(x){log(x = x, base = exp(1))}), .SDcols = cols]
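To illustrate what this produces, here is a minimal sketch with the toy data from above (the original columns are kept and the log columns are added alongside them):
library(data.table)
dt <- data.table(a = 1:3, b = 1:3, d = 1:3)
cols <- c("a", "b")
out_cols <- paste("log", cols, sep = ".")
dt[, c(out_cols) := lapply(.SD, log), .SDcols = cols]  # natural log, as above
dt
#    a b d     log.a     log.b
# 1: 1 1 1 0.0000000 0.0000000
# 2: 2 2 2 0.6931472 0.6931472
# 3: 3 3 3 1.0986123 1.0986123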
UPDATE: The following is a neat way to do it without a for loop
dt[,(cols):= - dt[,..cols]]
It is a neat way in terms of code readability, but performance-wise it stays behind Frank's solution, according to the microbenchmark results below
mbm = microbenchmark(
base_solution = for (col in 1:length(cols)) {
dt[ , eval(parse(text = paste0(cols[col], ":=-1*", cols[col])))]
},
franks_solution1 = dt[ , (cols) := lapply(.SD, "*", -1), .SDcols = cols],
franks_solution2 = for (j in cols) set(dt, j = j, value = -dt[[j]]),
hannes_solution = dt[, c(out_cols) := lapply(.SD, function(x){log(x = x, base = exp(1))}), .SDcols = cols],
orhans_solution = for (j in cols) dt[,(j):= -1 * dt[, ..j]],
orhans_solution2 = dt[,(cols):= - dt[,..cols]],
times=1000
)
mbm
Unit: microseconds
expr min lq mean median uq max neval
base_solution 3874.048 4184.4070 5205.8782 4452.5090 5127.586 69641.789 1000
franks_solution1 313.846 349.1285 448.4770 379.8970 447.384 5654.149 1000
franks_solution2 1500.306 1667.6910 2041.6134 1774.3580 1961.229 9723.070 1000
hannes_solution 326.154 405.5385 561.8263 495.1795 576.000 12432.400 1000
orhans_solution 3747.690 4008.8175 5029.8333 4299.4840 4933.739 35025.202 1000
orhans_solution2 752.000 831.5900 1061.6974 897.6405 1026.872 9913.018 1000
as shown in the chart below.
My Previous Answer:
The following also works
for (j in cols)
dt[,(j):= -1 * dt[, ..j]]
None of the above solutions seems to work with calculation by group. The following is the best I got:
for(col in cols)
{
DT[, (col) := scale(.SD[[col]], center = TRUE, scale = TRUE), g]
}
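A hedged sketch of the same per-group scaling without the explicit loop, wrapping scale() so each column stays a plain vector (the data, column names, and group column here are invented for the example):
library(data.table)
set.seed(1)
DT <- data.table(a = rnorm(6), b = rnorm(6), g = rep(1:2, each = 3))
cols <- c("a", "b")
# lapply over .SD within each group; as.vector() drops the 1-column matrix that scale() returns
DT[, (cols) := lapply(.SD, function(x) as.vector(scale(x))), by = g, .SDcols = cols]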
dplyr functions work on data.tables, so here's a dplyr solution that also "avoids the for-loop" :)
dt %>% mutate(across(all_of(cols), ~ -1 * .))
I benchmarked it using orhan's code (adding rows and columns) and you'll see that dplyr::mutate with across executes faster than most of the other solutions, though slower than the data.table solution using lapply.
library(data.table); library(dplyr)
dt <- data.table(a = 1:100000, b = 1:100000, d = 1:100000) %>%
mutate(a2 = a, a3 = a, a4 = a, a5 = a, a6 = a)
cols <- c("a", "b", "a2", "a3", "a4", "a5", "a6")
dt %>% mutate(across(all_of(cols), ~ -1 * .))
#> a b d a2 a3 a4 a5 a6
#> 1: -1 -1 1 -1 -1 -1 -1 -1
#> 2: -2 -2 2 -2 -2 -2 -2 -2
#> 3: -3 -3 3 -3 -3 -3 -3 -3
#> 4: -4 -4 4 -4 -4 -4 -4 -4
#> 5: -5 -5 5 -5 -5 -5 -5 -5
#> ---
#> 99996: -99996 -99996 99996 -99996 -99996 -99996 -99996 -99996
#> 99997: -99997 -99997 99997 -99997 -99997 -99997 -99997 -99997
#> 99998: -99998 -99998 99998 -99998 -99998 -99998 -99998 -99998
#> 99999: -99999 -99999 99999 -99999 -99999 -99999 -99999 -99999
#> 100000: -100000 -100000 100000 -100000 -100000 -100000 -100000 -100000
library(microbenchmark)
mbm = microbenchmark(
base_with_forloop = for (col in 1:length(cols)) {
dt[ , eval(parse(text = paste0(cols[col], ":=-1*", cols[col])))]
},
franks_soln1_w_lapply = dt[ , (cols) := lapply(.SD, "*", -1), .SDcols = cols],
franks_soln2_w_forloop = for (j in cols) set(dt, j = j, value = -dt[[j]]),
orhans_soln_w_forloop = for (j in cols) dt[,(j):= -1 * dt[, ..j]],
orhans_soln2 = dt[,(cols):= - dt[,..cols]],
dplyr_soln = (dt %>% mutate(across(all_of(cols), ~ -1 * .))),
times=1000
)
library(ggplot2)
ggplot(mbm) +
geom_violin(aes(x = expr, y = time)) +
coord_flip()
Created on 2020-10-16 by the reprex package (v0.3.0)
To add an example of creating new columns based on a string vector of columns, building on Jfly's answer:
dt <- data.table(a = rnorm(1:100), b = rnorm(1:100), c = rnorm(1:100), g = c(rep(1:10, 10)))
col0 <- c("a", "b", "c")
col1 <- paste0("max.", col0)
for(i in seq_along(col0)) {
dt[, (col1[i]) := max(get(col0[i])), g]
}
dt[,.N, c("g", col1)]
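The same result can usually be obtained without the loop by combining .SDcols with lapply; a hedged sketch reusing the dt, col0, col1, and g defined above:
# one call instead of the loop over col0
dt[, (col1) := lapply(.SD, max), by = g, .SDcols = col0]
dt[, .N, by = c("g", col1)]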
library(data.table)
(dt <- data.table(a = 1:3, b = 1:3, d = 1:3))
which prints:
a b d
1: 1 1 1
2: 2 2 2
3: 3 3 3
Whereas (dt*(-1)) yields (note that this negates every column, including d):
a b d
1: -1 -1 -1
2: -2 -2 -2
3: -3 -3 -3
I'm having a hard time solving this issue: for a given data.table, can I filter all rows that pass a criterion on all columns?
example:
dt <-data.table(col_a = c(1,1,0,0,1),
col_b = c(50,0,0,1,0),
col_c = c(0,0,0,0,0),
col_d = c(0,0,0,0,0),
col_e = c(1,0,0,0,10))
I want to return the rows that match the pattern filter <- c(T, F, F, F, T) - so row number 5
I've tried dt[, filter] - it tells me that 'filter' is not found
I tried dt[, c(T,F,F,F,T)], but this just returns the logical vector [1] TRUE FALSE FALSE FALSE TRUE
Can I solve this by only using data.table?
It is unclear from the description of the post; based on the comments, the OP wants to select the rows that match the values in filter. To do that, first convert the columns to logical, replicate filter so the dimensions match before doing the comparison with ==, take rowSums, and check whether the result equals ncol of the original dataset to subset the rows:
dt[rowSums(dt[, lapply(.SD, as.logical)] == filter[col(dt)])== ncol(dt)]
# col_a col_b col_c col_d col_e
#1: 1 0 0 0 10
Another option is to paste each row into a single string and then compare
dt[dt[, do.call(paste0, lapply(.SD, function(x) +(as.logical(x))))]
== paste(+(filter), collapse = "")]
Another approach is to loop through the columns, store the boolean comparison output as a list of vectors, and Reduce
lst1 <- vector('list', ncol(dt))
for(j in seq_along(dt)) lst1[[j]] <- as.logical(dt[[j]]) == filter[j]
dt[Reduce(`&`, lst1)]
Or a similar approach with Map/Reduce
dt[dt[, Reduce(`&`, Map(`==`, lapply(.SD, as.logical), filter))]]
Considering the size of your actual dataset, you might be better off converting it into long format and then performing the filtering:
ans <- melt(DT[, rn := .I], id.vars="rn")[,
value := as.logical(value)][,
if (all(value==filter)) rn, rn]$V1
one timing:
library(data.table)
set.seed(0L)
nc <- 392
nr <- 2e6
filter <- sample(c(1,0), nc, TRUE)
loc <- which(filter>0L)
M <- matrix(sample(c(1,0), nc*nr, TRUE), nrow=nr)
DT <- as.data.table(M)
system.time({
ans <- melt(DT[, rn := .I], id.vars="rn")[,
value := as.logical(value)][,
if (all(value==filter)) rn, rn]$V1
})
# user system elapsed
# 2.20 0.84 1.72
Some other options, though not as fast as converting into a long format:
library(Matrix)
library(data.table)
library(microbenchmark)
set.seed(0L)
nc <- 392
nr <- 1e5
filter <- sample(c(1,0), nc, TRUE)
loc <- which(filter>0L)
M <- matrix(sample(c(1,0), nc*nr, TRUE), nrow=nr)
DT <- as.data.table(M)
# filter <- c(T,F,F,F,T)
# DT <- data.table(c(1,1,0,0,1), c(50,0,0,1,0), c(0,0,0,0,0), c(0,0,0,0,0), c(1,0,0,0,10))
# M <- as.matrix(DT)
loc <- which(filter>0L)
sumF <- sum(filter)
DTo_f <- copy(DT)
DTj_f <- copy(DT)
# Sparse matrix
sm_f <- function() {
sM <- as(M, "dgTMatrix")
ixDT <- data.table(R=sM@i+1L, C=sM@j+1L, I=1L)
univ <- data.table(R=rep(1:nr, each=length(loc)), C=rep(loc, nr), U=1L)
mgDT <- merge(univ, ixDT, by=c("R", "C"), all=TRUE)
mgDT[, if(!(anyNA(U) | anyNA(I))) R, R]$V1
}
#melt
m_f <- function() {
melt(DT[, rn := .I], id.vars="rn")[,
value := as.logical(value)][,
if (all(value==filter)) rn, rn]$V1
}
#order
o_f <- function() {
non0 <- DTo_f[, {
m <- as.matrix(.SD)
ri <- replace(col(.SD), .SD==0L, NA_integer_)
as.data.table(matrix(ri[order(row(.SD), ri, na.last=TRUE)], nrow=.N, byrow=TRUE))
}]
non0[setNames(as.list(c(loc, rep(NA_integer_, nc - length(loc)))), names(DTo_f)),
on=.NATURAL, which=TRUE]
}
#join
j_f <- function() {
setindexv(DTj_f, names(DTj_f))
DTj_f[, names(DTj_f) := lapply(DTj_f, as.logical)]
DTj_f[as.list(as.logical(filter)), on=names(DTj_f), which=TRUE]
}
microbenchmark(sm_f(), m_f(), o_f(), j_f(), times=1L)
timings:
Unit: seconds
expr min lq mean median uq max neval
sm_f() 9.134432 9.134432 9.134432 9.134432 9.134432 9.134432 1
m_f() 2.020081 2.020081 2.020081 2.020081 2.020081 2.020081 1
o_f() 3.413685 3.413685 3.413685 3.413685 3.413685 3.413685 1
j_f() 7.149763 7.149763 7.149763 7.149763 7.149763 7.149763 1
You can use which(colSums((df>0)==filter)==nrow(df)) to get the index
> which(colSums((df>0)==filter)==nrow(df))
col_e
5
such that
> df[which(colSums((df>0)==filter)==nrow(df))]
col_a col_b col_c col_d col_e
1: 1 0 0 0 10
If I understand the question correctly, this should answer it.
Reproduce your data:
library(data.table)
dt <-data.table(col_a = c(1,1,0,0,1),
col_b = c(50,0,0,1,0),
col_c = c(0,0,0,0,0),
col_d = c(0,0,0,0,0),
col_e = c(1,0,0,0,10))
filter<-c(T,F,F,F,T)
Now create a variable that checks, for each row, whether the pattern of non-zero values matches filter, and subset accordingly:
to_subset = apply(dt, 1, function(x) {
all((x > 0) == filter)
})
# the output you are looking for
dt[to_subset]
# col_a col_b col_c col_d col_e
# 1: 1 0 0 0 10
The code can be collapsed to be more concise.
dt[apply(dt, 1, function(x) all((x > 0) == filter))]
# col_a col_b col_c col_d col_e
# 1: 1 0 0 0 10
I need to take column sums over a large range of select columns. For example:
library(data.table)
set.seed(123)
DT = data.table(grp = c("A", "B", "C"),
x1 = sample(1:10, 3),
x2 = sample(1:10, 3),
x3 = sample(1:10, 3),
x4 = sample(1:10, 3))
> DT
grp x1 x2 x3 x4
1: A 3 9 6 5
2: B 8 10 9 9
3: C 4 1 5 4
Say, I want to sum over x2 and x3. I would normally do this using:
> DT[, .(total = sum(x2, x3)), by=grp]
grp total
1: A 15
2: B 19
3: C 6
However, if the range of columns is very large, say 100, how can this be coded elegantly, without spelling out each column by name?
What I tried (and what didn't work):
my_cols <- paste0("x", 2:3)
DT[, .(total = sum(get(my_cols))), by=grp]
grp total
1: A 9
2: B 10
3: C 1
Appears to use only the first column (x2) and disregard the rest.
I didn't find an exact dupe (that deals with summing by row by group), so here are 5 different possibilities I could think of.
The main thing to remember here is that you are working with a data.table per group, hence some functions won't work without unlist
## Create an example data
library(data.table)
set.seed(123)
DT <- data.table(grp = c("A", "B", "C"),
matrix(sample(1:10, 30 * 4, replace = TRUE), ncol = 4))
my_cols <- paste0("V", 2:3)
## 1- This won't work with `NA`s. It will work without `unlist`,
## but won't return correct results.
DT[, Reduce(`+`, unlist(.SD)), .SDcols = my_cols, by = grp]
## 2 - Convert to long format first and then aggregate
melt(DT, "grp", measure = my_cols)[, sum(value), by = grp]
## 3 - Using `base::sum` which can handle data.frames,
## see `?S4groupGeneric` (a data.table is also a data.frame)
DT[, base::sum(.SD), .SDcols = my_cols, by = grp]
## 4 - This will use data.tables enhanced `gsum` function,
## but it can't handle data.frames/data.tables
## Hence, requires unlist first. Will be interesting to measure the tradeoff
DT[, sum(unlist(.SD)), .SDcols = my_cols, by = grp]
## 5 - This is a modification to your original attempt that both handles multiple columns
## (`mget` instead of `get`) and adds `unlist`
## (no point trying wuth `base::sum` instead, because it will also require `unlist`)
DT[, sum(unlist(mget(my_cols))), by = grp]
All of these will return the same result
# grp V1
# 1: A 115
# 2: B 105
# 3: C 96
Some benchmarks
library(data.table)
library(microbenchmark)
library(stringi)
set.seed(123)
N <- 1e5
cols <- 50
DT <- data.table(grp = stri_rand_strings(N / 1e4, 2),
matrix(sample(1:10, N * cols, replace = TRUE),
ncol = cols))
my_cols <- paste0("V", 1:20)
mbench <- microbenchmark(
"Reduce/unlist: " = DT[, Reduce(`+`, unlist(.SD)), .SDcols = my_cols, by = grp],
"melt: " = melt(DT, "grp", measure = my_cols)[, sum(value), by = grp],
"base::sum: " = DT[, base::sum(.SD), .SDcols = my_cols, by = grp],
"gsum/unlist: " = DT[, sum(unlist(.SD)), .SDcols = my_cols, by = grp],
"gsum/mget/unlist: " = DT[, sum(unlist(mget(my_cols))), by = grp]
)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# Reduce/unlist: 1968.93628 2185.45706 2332.66770 2301.10293 2440.43138 3161.15522 100 c
# melt: 33.91844 58.18254 66.70419 64.52190 74.29494 132.62978 100 a
# base::sum: 18.00297 22.44860 27.21083 25.14174 29.20080 77.62018 100 a
# gsum/unlist: 780.53878 852.16508 929.65818 894.73892 968.28680 1430.91928 100 b
# gsum/mget/unlist: 797.99854 876.09773 963.70562 928.27375 1003.04632 1578.76408 100 b
library(ggplot2)
autoplot(mbench)
I have a data frame that looks like this:
Tree Species
5 rops_002
6 tico_001
8 tico_004
I need to add a column with fewer characters, like this:
Tree Species Species1
5 rops_002 rops
6 tico_001 tico
8 tico_004 tico
Does somebody know how to do this?
Thank you very much!
dt <- data.frame(a = 1:2)
dt$Species <- c("assa_12", "bssa_12")
dt
# a Species
# 1 1 assa_12
# 2 2 bssa_12
One way:
dt$Species1 <- substr(dt$Species, 1, 4)
dt
# a Species Species1
# 1 1 assa_12 assa
# 2 2 bssa_12 bssa
Second option:
dt$Species1 <- sapply(strsplit(dt$Species, "_"), function(x) x[1])
dt
# a Species Species1
# 1 1 assa_12 assa
# 2 2 bssa_12 bssa
More functions and benchmarks:
minem1 <- function(x) substr(x, 1, 4) # takes the first 4 characters
minem2 <- function(x) sapply(strsplit(x, "_"), function(x) x[1]) # splits by "_" and takes first part
minem3 <- function(x) sapply(strsplit(x, "_", fixed = T), function(x) x[1]) # the same
andrewGustar <- function(x) gsub("_\\d+", "", x) # replaces anything after "_" with ""
koenV <- function(x) sub(x, pattern = "_.+", replacement = "") #changed a little
require(data.table)
setDT(dt)
minem4 <- function(x) data.table::tstrsplit(x, "_", fixed = T)[[1]]
# also splits and takes first part
# create a large test case:
n <- 100000
dt <- data.frame(a = 1:n,
Species = sample(c("aaaa", "abda", "asdf", "dads"), n, replace = T))
dt$Species <- paste(dt$Species, dt$a, sep = "_")
require(microbenchmark)
bench <- microbenchmark(minem1(dt$Species),
minem2(dt$Species),
andrewGustar(dt$Species),
koenV(dt$Species),
minem3(dt$Species),
minem4(dt$Species))
bench
Unit: milliseconds
# expr min lq mean median uq max neval cld
# minem1(dt$Species) 5.12257 5.465827 5.655002 5.620615 5.818871 6.94633 100 a
# minem2(dt$Species) 126.19138 133.780757 167.598675 176.696708 186.330236 627.31002 100 d
# andrewGustar(dt$Species) 40.24816 41.988833 42.591255 42.549435 42.942418 48.48893 100 b
# koenV(dt$Species) 37.91208 39.528120 40.369007 40.412091 40.885594 46.52658 100 b
# minem3(dt$Species) 80.40778 86.622198 112.163038 90.496686 137.788859 575.97141 100 c
# minem4(dt$Species) 15.28590 16.111006 17.737274 16.552911 17.054645 69.07255 100 a
library(ggplot2)
autoplot(bench)
Conclusions: if you are sure that Species1 is always a 4-character string, then use substr; if not, try tstrsplit from data.table. You could also look at the stringr and stringi packages for faster character subsetting.
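As a follow-up to the stringi suggestion, a hedged sketch (stri_sub() is stringi's substring function; it assumes the same dt$Species column as above and gives the same result as substr):
library(stringi)
dt$Species1 <- stri_sub(dt$Species, 1, 4)  # first 4 characters, like minem1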
Or df$Species1 <- gsub("_\\d+","",df$Species)
This will remove the _nnn part, whereas minem's answer just keeps the first four characters. It depends what you want! If they are always in the AAAA_nnn format, then both are equivalent.
one very simple way may be:
df$Species1 <- sub(x = df$Species, pattern = "_00.", replacement = "")
if your pattern to remove is always _00x, where x is one digit
Consider a data set consisting of a grouping variable (here id) and an ordered variable (here date)
(df <- data.frame(
id = rep(1:2,2),
date = 4:1
))
# id date
# 1 1 4
# 2 2 3
# 3 1 2
# 4 2 1
I'm wondering what the easiest way is in data.table to do the equivalent of this dplyr code:
library(dplyr)
df %>%
group_by(id) %>%
filter(min_rank(date)==1)
# Source: local data frame [2 x 2]
# Groups: id
#
# id date
# 1 1 2
# 2 2 1
i.e. for each id, get the first row according to date.
Based on a similar stackoverflow question (Create an "index" for each element of a group with data.table), I came up with this
library(data.table)
dt <- data.table(df)
setkey(dt, id, date)
for(k in unique(dt$id)){
dt[id==k, index := 1:.N]
}
dt[index==1,]
But it seems like there should be a one-liner for this. Being unfamiliar with data.table I thought something like this
dt[,,mult="first", by=id]
should work, but alas! The last bit of code seems like it should group by id and then take the first (which within id would be determined by date since I've set the keys in this way.)
EDIT
Thanks to Ananda Mahto, this one-liner will now be in my data.table repertoire
dt[,.SD[1], by=id]
# id date
# 1: 1 2
# 2: 2 1
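Note that this one-liner relies on the table already being keyed (or ordered) by id and then date, as was done earlier; a minimal self-contained sketch:
library(data.table)
df <- data.frame(id = rep(1:2, 2), date = 4:1)
dt <- data.table(df)
setkey(dt, id, date)   # within each id, rows are now ordered by date
dt[, .SD[1], by = id]
#    id date
# 1:  1    2
# 2:  2    1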
Working directly with your source data.frame, you can try:
setkey(as.data.table(df), id, date)[, .SD[1], by = id]
# id date
# 1: 1 2
# 2: 2 1
Extending your original idea, you can just do:
dt <- data.table(df)
setkey(dt, id, date)
dt[, index := sequence(.N), by = id][index == 1]
# id date index
# 1: 1 2 1
# 2: 2 1 1
It might be that at a certain scale, David is correct about head vs [1], but I'm not sure what scale that would be.
set.seed(1)
nrow <- 10000
ncol <- 20
df <- data.frame(matrix(sample(10, nrow * ncol, TRUE), nrow = nrow, ncol = ncol))
fun1 <- function() setkey(as.data.table(df), X1, X2)[, head(.SD, 1), by = X1]
fun2 <- function() setkey(as.data.table(df), X1, X2)[, .SD[1], by = X1]
library(microbenchmark)
microbenchmark(fun1(), fun2())
# Unit: milliseconds
# expr min lq mean median uq max neval
# fun1() 12.178189 12.496777 13.400905 12.808523 13.483545 30.28425 100
# fun2() 4.474345 4.554527 4.948255 4.620596 4.965912 8.17852 100
Here's another option using data.tables binary search
setkey(dt[, indx := seq_len(.N), by = id], indx)[J(1)]
# id date indx
# 1: 1 2 1
# 2: 2 1 1
Some benchmarks:
It seems that all the methods perform more or less the same, but on a huge data set (1e+06 rows by 1e+2 columns) binary search wins
set.seed(1)
nrow <- 1e6
ncol <- 1e2
df <- data.frame(matrix(sample(10, nrow * ncol, TRUE), nrow = nrow, ncol = ncol))
library(data.table)
funAM1 <- function() setkey(as.data.table(df), X1, X2)[, .SD[1], by = X1]
funAM2 <- function() setkey(as.data.table(df), X1, X2)[, index := sequence(.N), by = X1][index == 1]
funDA1 <- function() setkey(as.data.table(df), X1, X2)[, head(.SD, 1), by = X1]
funDA2 <- function() setkey(as.data.table(df)[, indx := seq_len(.N), by = X1], X1)[J(1)]
library(microbenchmark)
Res <- microbenchmark(funAM1(), funAM2(), funDA1(), funDA2())
Res
# Unit: milliseconds
# expr min lq median uq max neval
# funAM1() 737.5690 758.3015 771.9344 794.1417 910.1019 100
# funAM2() 631.7822 693.8286 704.6912 729.6960 806.5556 100
# funDA1() 757.0327 772.4353 784.3107 810.0759 938.6344 100
# funDA2() 564.7291 578.1089 587.6470 611.7269 740.4077 100
boxplot(Res)