data.table and .SDcols with paste0 to create a character vector

Given the data.table, DF below, I would like to select all except the first rows of the groups numbered 6 and 8. I was told that I should use paste0( ). I have a solution that gives the expected result but without paste0( ).
DF <- data.table(grp=c(6,6,8,8,8), Q1=c(2,2,3,5,2), Q2=c(5,5,4,4,1), Q3=c(2,1,4,2,4), H1=c(3,4,5,2,4), H2=c(5,2,4,1,2) )
Desired result:
desired_result <- data.table(grp=c(6,8,8), Q1=c(2,5,2), Q2=c(5,4,1), Q3=c(1,2,4) )
One method that achieves this result:
DF[ , .SD[-1], .SDcols = c("Q1", "Q2", "Q3"), by = grp]
How can I use paste0( ) rather than c( )? Is there any advantage to one of these or an example where only one would work?

This method seems to work:
DF[ , .SD, .SDcols = paste0("Q", 1:3), by = grp]
grp Q1 Q2 Q3
1: 6 2 5 2
2: 6 2 5 1
3: 8 3 4 4
4: 8 5 4 2
5: 8 2 1 4
Comparing one method to another.
all.equal(DF[ , .SD, .SDcols = c("Q1", "Q2", "Q3"), by = grp],
DF[ , .SD, .SDcols = paste0("Q", 1:3), by = grp])
[1] TRUE
Note that .SDcols selects columns and has nothing to do with dropping the first rows of each group. .SDcols can take a character vector, and paste0 produces character vectors, so selecting the columns can work either way.
One way to drop the first row of each group is tail(.SD, -1). The version below also uses paste0, although only because the question asks for it:
DF[ , tail(.SD, -1), .SDcols = paste0("Q", 1:3), by = grp]
grp Q1 Q2 Q3
1: 6 2 5 1
2: 8 5 4 2
3: 8 2 1 4
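As for whether paste0( ) has any advantage: both forms hand .SDcols the same character vector, so the choice mostly matters when there are many similarly named columns to type. A minimal sketch of that equivalence, where q_cols and res are names introduced here for illustration:

```r
library(data.table)

DF <- data.table(grp = c(6, 6, 8, 8, 8),
                 Q1 = c(2, 2, 3, 5, 2), Q2 = c(5, 5, 4, 4, 1),
                 Q3 = c(2, 1, 4, 2, 4), H1 = c(3, 4, 5, 2, 4),
                 H2 = c(5, 2, 4, 1, 2))

# paste0() builds the same character vector as writing the names out
q_cols <- paste0("Q", 1:3)
identical(q_cols, c("Q1", "Q2", "Q3"))

# Drop the first row of each group, keeping only the Q columns
res <- DF[, .SD[-1], by = grp, .SDcols = q_cols]
res
```

With 3 columns there is little to gain, but paste0("Q", 1:50) is much shorter than spelling out fifty names.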

Related

R data.table: New data table with named columns and drop the rest

I want to do something very simple, but so far I have failed to do it in one command. I want to create a new data table by applying a function to some columns of an existing one, giving the results new names and dropping the rest.
Let's see a minimal example:
library(data.table)
dt = data.table(A = c('a', 'a', 'a', 'b', 'b'),
B = c(1 , 2 , 3 , 4 , 5 ),
C = c(10 , 20 , 30 , 40 , 50))
dt
A B C
a 1 10
a 2 20
a 3 30
b 4 40
b 5 50
For a single column, we can do:
dt1 = dt[, .(totalB = sum(B)), by=A]
dt1
A totalB
a 6
b 9
For more than one column, we can do:
dt2 = dt[, .(totalB = sum(B), totalC = sum(C)), by=A]
dt2
A totalB totalC
a 6 60
b 9 90
But with many columns, writing each one out is not practical. So I guess we should go with lapply, like this:
dt3 = dt[, lapply(.SD, sum), by = A]
dt3
A B C
a 6 60
b 9 90
That creates the table but without the names. So we can add them:
names = c("totalB", "totalC")
dt4 = dt[, c("totalB", "totalC") := lapply(.SD, sum), by = A ]
dt4
A B C totalB totalC
a 1 10 6 60
a 2 20 6 60
a 3 30 6 60
b 4 40 9 90
b 5 50 9 90
But now the original columns remain. How can we prevent that? Also note that in my actual problem I use a subset of the columns, via .SDcols, which I didn't include here for simplicity.
EDIT: My desired output is the same as dt2 but I don't want to write down all columns.
Do you mean something like below?
dt[, setNames(lapply(.SD, sum), paste0("total", names(.SD))), A]
Output
A totalB totalC
1: a 6 60
2: b 9 90
Another option is setnames. Create a vector ('nm1') of the column names to apply the function to, that is, every column except the grouping variable. Then compute the sums grouped by 'A' and call setnames with the old and new names specified:
nm1 <- setdiff(names(dt), "A")
setnames(dt[, lapply(.SD, sum), A], nm1, paste0('total', nm1))[]
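Since the question mentions that the real problem uses a subset of columns via .SDcols, here is a minimal sketch combining the setNames approach with .SDcols. The vector cols and the result name res are assumptions made for illustration:

```r
library(data.table)

dt <- data.table(A = c("a", "a", "a", "b", "b"),
                 B = c(1, 2, 3, 4, 5),
                 C = c(10, 20, 30, 40, 50))

# Hypothetical subset of columns to aggregate
cols <- c("B", "C")

# Sum the chosen columns per group and rename the results in the same call;
# the original B and C columns are not carried over because no := is used
res <- dt[, setNames(lapply(.SD, sum), paste0("total", cols)),
          by = A, .SDcols = cols]
res
```

Because the new names are derived from cols, adding a column to the subset needs no other change.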

Rename grouping variable in data.table [duplicate]

I want to group a data.table but use a different name for the grouping variable in the final output.
Data
library(data.table)
set.seed(1)
d <- data.table(grp = sample(4, 100, TRUE))
Options
I can use chaining like this:
d[, .(Frequency = .N), keyby = grp][
, .("My Fancy Group Name" = grp, Frequency)]
# My Fancy Group Name Frequency
# 1: 1 27
# 2: 2 31
# 3: 3 22
# 4: 4 20
or rename the column before:
d[, c("My Fancy Group Name" = list(grp), .SD)][
, .(Frequency = .N), keyby = "My Fancy Group Name"]
# My Fancy Group Name Frequency
# 1: 1 27
# 2: 2 31
# 3: 3 22
# 4: 4 20
or define an alias for the grouping variable and remove the grouping variable afterwards:
d[, .("My Fancy Group Name" = grp, Frequency = .N), keyby = grp][
, grp := NULL][]
# My Fancy Group Name Frequency
# 1: 1 27
# 2: 2 31
# 3: 3 22
# 4: 4 20
but all forms use a chain.
I can avoid the chaining with the following approach, which is not recommended (it is not only a hack, but also very inefficient):
d[, .("My Fancy Group Name" = .SD[, .N, keyby = grp]$grp,
Frequency = .SD[, .N, keyby = grp]$N)]
# My Fancy Group Name Frequency
# 1: 1 27
# 2: 2 31
# 3: 3 22
# 4: 4 20
Questions
Conceptually I would like to use something like this
# d[, .(Frequency = .N), keyby = c("My Fancy Group Name" = grp)]
Is it possible to achieve this without chaining and without the hack I showed?
Which option performs "best" in terms of memory/time if we have a huge data.table?
You can actually do similar to your attempt but use list instead of c :
library(data.table)
d[, .(Frequency = .N), keyby = list(`My Fancy Group Name` = grp)]
#Also works with quotes
#d[, .(Frequency = .N), keyby = list("My Fancy Group Name" = grp)]
# My Fancy Group Name Frequency
#1: 1 27
#2: 2 31
#3: 3 22
#4: 4 20
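Shorter version (this passes the grouping to by rather than keyby, so the groups appear in order of first occurrence instead of sorted):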
d[, .(Frequency = .N), .("My Fancy Group Name" = grp)]
Using setnames() should also be efficient:
setnames(d[, .N, keyby = grp], c("My Fancy Group Name", "Frequency"))
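To make the renaming and the by/keyby difference easy to check by eye, here is a small sketch on a fixed six-row table instead of the random sample. The data and the names res_keyby and res_by are made up for illustration:

```r
library(data.table)

# Small deterministic example so the group order is easy to follow
d <- data.table(grp = c(3, 1, 2, 1, 3, 3))

# keyby: renamed grouping column, result sorted by group
res_keyby <- d[, .(Frequency = .N), keyby = .(`My Fancy Group Name` = grp)]

# by: same renaming, but groups stay in order of first appearance
res_by <- d[, .(Frequency = .N), by = .(`My Fancy Group Name` = grp)]

res_keyby
res_by
```

Both calls rename the grouping column in a single step without chaining; they differ only in row order (and keyby also sets a key on the result).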

Accessing column name within the SD construct

I have a data table in R that looks like this
DT = data.table(a = c(1,2,3,4,5), a_mean = c(1,1,2,2,2), b = c(6,7,8,9,10), b_mean = c(3,2,1,1,2))
I want to create two more columns a_final and b_final defined as a_final = (a - a_mean) and b_final = (b - b_mean). In my real life use case, there can be a large number of such column pairs and I want a scalable solution in the spirit of R's data tables.
I tried something along the lines of
DT[,paste0(c('a','b'),'_final') := lapply(.SD, function(x) ((x-get(paste0(colnames(.SD),'_mean'))))), .SDcols = c('a','b')]
but this doesn't quite work. Any idea of how I can access the column name of the column being processed within the lapply statement?
We can create a character vector with columns names, subset it from the original data.table, get their corresponding "mean" columns, subtract and add as new columns.
library(data.table)
cols <- unique(sub('_.*', '', names(DT))) #Thanks to @Sotos
#OR just
#cols <- c('a', 'b')
DT[,paste0(cols, '_final')] <- DT[,cols, with = FALSE] -
DT[,paste0(cols, "_mean"), with = FALSE]
DT
# a a_mean b b_mean a_final b_final
#1: 1 1 6 3 0 3
#2: 2 1 7 2 1 5
#3: 3 2 8 1 1 7
#4: 4 2 9 1 2 8
#5: 5 2 10 2 3 8
Another option is using mget with Map:
cols <- c('a', 'b')
DT[, paste0(cols,'_final') := Map(`-`, mget(cols), mget(paste0(cols,"_mean")))]
Relying on the .SD construct you could do something along the lines of:
cols <- c('a', 'b')
DT[, paste0(cols, "_final") :=
DT[, .SD, .SDcols = cols] -
DT[, .SD, .SDcols = paste0(cols, "_mean")]]
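If the goal is specifically to have the column name available while each pair is processed, one alternative is to loop over the names rather than over .SD. The sketch below uses data.table's set() in a plain for loop; it is only one possible way to do this, and the loop variable nm is a name chosen here:

```r
library(data.table)

DT <- data.table(a = c(1, 2, 3, 4, 5), a_mean = c(1, 1, 2, 2, 2),
                 b = c(6, 7, 8, 9, 10), b_mean = c(3, 2, 1, 1, 2))

cols <- c("a", "b")

# Iterate over the base names; each name gives access to both
# the value column and its matching "_mean" column
for (nm in cols) {
  set(DT, j = paste0(nm, "_final"),
      value = DT[[nm]] - DT[[paste0(nm, "_mean")]])
}
DT
```

set() adds the columns by reference, so this scales to many column pairs without copying the table.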

How to combine data.tables by= with its shift() without having to create new variables?

I'm trying to generate row sums of a variable and its lag(s). Say I have:
library(data.table)
data <- data.table(id = rep(c("AT","DE"), each = 3),
time = rep(2001:2003, 2), var1 = c(1:6), var2 = c(NA, 1:3, NA, 8))
And I want to create a variable which adds 'var1' and the first lag of 'var2' by 'id'. If I create the lag first and then the sum, I know how to do it:
data[ , lag := shift(var2, 1), by = id]
data[ , goalmessy := sum(var1, lag, na.rm = TRUE), by = 1:NROW(data)]
But is there a way to use shift inside sum or something similar (such as an apply-style sum)? My understanding is that by is evaluated first, so with by = 1:NROW(data) each group is a single row, which makes shifting impossible. Any hints?
I think this will do what you want in one line:
data[, myVals := rowSums(cbind(var1, shift(var2)), na.rm=TRUE), by=id]
data
id time var1 var2 myVals
1: AT 2001 1 NA 1
2: AT 2002 2 1 2
3: AT 2003 3 2 4
4: DE 2001 4 3 4
5: DE 2002 5 NA 8
6: DE 2003 6 8 6
The two variables of interest are put into cbind which is used to feed rowSums and NAs are dropped as in your code.
We can use rowSums
data[, goalmessy := rowSums(setDT(.(var1, shift(var2))), na.rm = TRUE), by = id]
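The key point in both answers is that grouping by id, rather than by row, lets shift() see the whole group. A further sketch of the same idea replaces the missing lags with 0 before adding. It assumes a data.table version that provides fcoalesce() (1.12.4 or later, as far as I know), and the column name goal is made up here:

```r
library(data.table)

data <- data.table(id = rep(c("AT", "DE"), each = 3),
                   time = rep(2001:2003, 2),
                   var1 = 1:6,
                   var2 = c(NA, 1:3, NA, 8))

# shift() runs within each id; fcoalesce() turns missing lags into 0
data[, goal := var1 + fcoalesce(shift(var2), 0), by = id]
data
```

Unlike sum(..., na.rm = TRUE), this variant does not guard against NA in var1, so it only matches the rowSums answers when var1 is complete.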

calculate row sum and product in data.frame

I would like to append columns to my data.frame in R that contain row sums and products.
Consider following data frame
x y z
1 2 3
2 3 4
5 1 2
I want to get the following
x y z sum prod
1 2 3 6 6
2 3 4 9 24
5 1 2 8 10
I have tried
sum = apply(ages,1,add)
but it gives me a row vector. Can someone please show me an efficient command to compute the sum and product and append them to the original data frame as shown above?
Try
transform(df, sum=rowSums(df), prod=x*y*z)
# x y z sum prod
#1 1 2 3 6 6
#2 2 3 4 9 24
#3 5 1 2 8 10
Or
transform(df, sum=rowSums(df), prod=Reduce(`*`, df))
# x y z sum prod
#1 1 2 3 6 6
#2 2 3 4 9 24
#3 5 1 2 8 10
Another option would be to use rowProds from matrixStats
library(matrixStats)
transform(df, sum=rowSums(df), prod=rowProds(as.matrix(df)))
If you are using apply
df[,c('sum', 'prod')] <- t(apply(df, 1, FUN=function(x) c(sum(x), prod(x))))
df
# x y z sum prod
#1 1 2 3 6 6
#2 2 3 4 9 24
#3 5 1 2 8 10
Another approach.
require(data.table)
# Create data
dt <- data.table(x = c(1,2,5), y = c(2,3,1), z = c(3,4,2))
# Create index
dt[, i := .I]
# Compute sum and prod
dt[, sum := sum(x, y, z), by = i]
dt[, prod := prod(x, y, z), by = i]
dt
# Compute sum and prod using .SD
dt[, c("sum", "prod") := NULL]
dt
dt[, sum := sum(.SD), by = i, .SDcols = c("x", "y", "z")]
dt[, prod := prod(.SD), by = i, .SDcols = c("x", "y", "z")]
dt
# Compute sum and prod using .SD and list
dt[, c("sum", "prod") := NULL]
dt
dt[, c("sum", "prod") := list(sum(.SD), prod(.SD)), by = i,
.SDcols = c("x", "y", "z")]
dt
# Compute sum and prod using .SD and lapply
dt[, c("sum", "prod") := NULL]
dt
dt[, c("sum", "prod") := lapply(list(sum, prod), do.call, .SD), by = i,
.SDcols = c("x", "y", "z")]
dt
The following also works, but the column names have to be typed out:
ddf$sum = with(ddf, x+y+z)
ddf$prod = with(ddf, x*y*z)
ddf
x y z sum prod
1 1 2 3 6 6
2 2 3 4 9 24
3 5 1 2 8 10
With data.table, another form can be:
library(data.table)
cbind(dt, dt[,list(sum=x+y+z, product=x*y*z),])
x y z sum product
1: 1 2 3 6 6
2: 2 3 4 9 24
3: 5 1 2 8 10
A simpler version is suggested by @David Arenberg in comments:
dt[, ":="(sum = x+y+z, product = x*y*z)]
Only a partial answer, but if all values are greater than or equal to 0, rowSums/rowsum can be used to calculate products:
df <- data.frame(x = c(1, 2, 5), y = c(2, 3, 1), z = c(3, 4, 2))
# custom row-product-function
my_rowprod <- function(x) exp(rowSums(log(x)))
df$prod <- my_rowprod(df)
df
The generic version is (including negatives):
my_rowprod_2 <- function(x) {
sign <- ifelse((rowSums(x < 0) %% 2) == 1, -1, 1)
prod <- exp(rowSums(log(abs(x)))) * sign
prod
}
df$prod <- my_rowprod_2(df)
df
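As a quick sanity check of the sign handling, the log-based version can be compared against apply(..., 1, prod) on a small table that contains negative values. The data below are made up for illustration, and the function body is the one from the answer with a different local variable name:

```r
# Same logic as my_rowprod_2 above: the sign comes from the parity of
# the number of negatives, the magnitude from the sum of logs
my_rowprod_2 <- function(x) {
  sgn <- ifelse((rowSums(x < 0) %% 2) == 1, -1, 1)
  exp(rowSums(log(abs(x)))) * sgn
}

# Hypothetical test data: row 1 has one negative, row 2 has two
neg <- data.frame(a = c(-1, 2), b = c(3, -4), c = c(2, -1))

via_logs  <- as.numeric(my_rowprod_2(neg))
via_apply <- as.numeric(apply(neg, 1, prod))

# Compare with a tolerance, since exp(log(.)) is floating-point arithmetic
all.equal(via_logs, via_apply)
```

Because the result goes through exp() and log(), it can differ from the exact product by rounding error, which is why the comparison uses all.equal() rather than ==.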

Resources