Rename grouping variable in data.table [duplicate] - r

This question already has an answer here:
Is it possible to rename a "by" grouping variable in data.table in R en passant?
I want to group a data.table but use a different name for the grouping variable in the final output.
Data
library(data.table)
set.seed(1)
d <- data.table(grp = sample(4, 100, TRUE))
Options
I can use chaining like this:
d[, .(Frequency = .N), keyby = grp][
, .("My Fancy Group Name" = grp, Frequency)]
# My Fancy Group Name Frequency
# 1: 1 27
# 2: 2 31
# 3: 3 22
# 4: 4 20
or rename the column before:
d[, c("My Fancy Group Name" = list(grp), .SD)][
, .(Frequency = .N), keyby = "My Fancy Group Name"]
# My Fancy Group Name Frequency
# 1: 1 27
# 2: 2 31
# 3: 3 22
# 4: 4 20
or define an alias for the grouping variable and remove the grouping variable afterwards:
d[, .("My Fancy Group Name" = grp, Frequency = .N), keyby = grp][
, grp := NULL][]
# My Fancy Group Name Frequency
# 1: 1 27
# 2: 2 31
# 3: 3 22
# 4: 4 20
but all forms use a chain.
I can avoid the chaining with the not recommended approach from here (which is not only a hack, but also very inefficient):
d[, .("My Fancy Group Name" = .SD[, .N, keyby = grp]$grp,
Frequency = .SD[, .N, keyby = grp]$N)]
# My Fancy Group Name Frequency
# 1: 1 27
# 2: 2 31
# 3: 3 22
# 4: 4 20
Questions
Conceptually I would like to use something like this
# d[, .(Frequency = .N), keyby = c("My Fancy Group Name" = grp)]
Is it possible to achieve this chain-free, without using the hack I showed?
Which option performs "best" in terms of memory/time if we have a huge data.table?

You can actually do something similar to your attempt, but use list instead of c:
library(data.table)
d[, .(Frequency = .N), keyby = list(`My Fancy Group Name` = grp)]
#Also works with quotes
#d[, .(Frequency = .N), keyby = list("My Fancy Group Name" = grp)]
# My Fancy Group Name Frequency
#1: 1 27
#2: 2 31
#3: 3 22
#4: 4 20
A shorter version (the third positional argument of [.data.table is by, so .() can be used there directly):
d[, .(Frequency = .N), .("My Fancy Group Name" = grp)]

Using setnames() should also be efficient:
setnames(d[, .N, keyby = grp], c("My Fancy Group Name", "Frequency"))
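As for which option performs best on a huge table: that is best checked empirically. A minimal benchmark sketch, assuming the bench package is available (microbenchmark would work the same way), using a larger sample; exact timings and memory will depend on your data:
library(bench)
set.seed(1)
big <- data.table(grp = sample(4, 1e7, TRUE))
bench::mark(
  chain      = big[, .(Frequency = .N), keyby = grp][
                 , .("My Fancy Group Name" = grp, Frequency)],
  keyby_list = big[, .(Frequency = .N), keyby = .("My Fancy Group Name" = grp)],
  setnames   = setnames(big[, .N, keyby = grp], c("My Fancy Group Name", "Frequency")),
  check = FALSE  # results differ in key/attribute details, not in content
)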

Related

R data.table: New data table with named columns and drop the rest

I want to do something very simple, but so far I have failed to do it in one command. I want to create a new data.table by applying a function to some columns of an existing one, while giving them a name and dropping the rest.
Let's see a minimal example:
library(data.table)
dt = data.table(A = c('a', 'a', 'a', 'b', 'b'),
B = c(1 , 2 , 3 , 4 , 5 ),
C = c(10 , 20 , 30 , 40 , 50))
dt
A B C
a 1 10
a 2 20
a 3 30
b 4 40
b 5 50
For a single column, we can do:
dt1 = dt[, .(totalB = sum(B)), by=A]
dt1
A totalB
a 6
b 9
For more than 1 columns, we can do:
dt2 = dt[, .(totalB = sum(B), totalC = sum(C)), by=A]
dt2
A totalB totalC
a 6 60
b 9 90
But if there are many columns, that's not really practical. So I guess we should go with lapply, like this:
dt3 = dt[, lapply(.SD, sum), by = A]
dt3
A B C
a 6 60
b 9 90
That creates the table, but without the new names. So we can add them:
names = c("totalA", "totalB")
dt4 = dt[, c("totalA", "totalB") := lapply(.SD, sum), by = A ]
dt4
A B C totalA totalB
a 1 10 6 60
a 2 20 6 60
a 3 30 6 60
b 4 40 9 90
b 5 50 9 90
But now the original columns remain. How can we prevent that? Also note that in my actual problem I use a subset of the columns via .SDcols, which I didn't include here for simplicity.
EDIT: My desired output is the same as dt2, but I don't want to write down all the columns.
Do you mean something like this?
dt[, setNames(lapply(.SD, sum), paste0("total", names(.SD))), A]
Output
A totalB totalC
1: a 6 60
2: b 9 90
Another option is setnames(). Create a vector of the column names we want to apply the function to, i.e. everything other than the grouping variable ('nm1'); then, grouped by 'A', get the sum and use setnames() with the old and new names specified:
nm1 <- setdiff(names(dt), "A")
setnames(dt[, lapply(.SD, sum), A], nm1, paste0('total', nm1))[]
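Since the question mentions restricting the aggregation to a subset of columns via .SDcols (left out of the minimal example), here is a sketch of how that combines with the setNames() approach; the choice of B as the only aggregated column is just for illustration:
cols <- c("B")  # hypothetical subset of columns to aggregate
dt[, setNames(lapply(.SD, sum), paste0("total", names(.SD))), by = A, .SDcols = cols]
#    A totalB
# 1: a      6
# 2: b      9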

Accessing column name within the SD construct

I have a data table in R that looks like this
DT = data.table(a = c(1,2,3,4,5), a_mean = c(1,1,2,2,2), b = c(6,7,8,9,10), b_mean = c(3,2,1,1,2))
I want to create two more columns a_final and b_final defined as a_final = (a - a_mean) and b_final = (b - b_mean). In my real life use case, there can be a large number of such column pairs and I want a scalable solution in the spirit of R's data tables.
I tried something along the lines of
DT[,paste0(c('a','b'),'_final') := lapply(.SD, function(x) ((x-get(paste0(colnames(.SD),'_mean'))))), .SDcols = c('a','b')]
but this doesn't quite work. Any idea of how I can access the column name of the column being processed within the lapply statement?
We can create a character vector of column names, subset those columns from the original data.table, get their corresponding "_mean" columns, subtract, and add the results as new columns.
library(data.table)
cols <- unique(sub('_.*', '', names(DT))) #Thanks to #Sotos
#OR just
#cols <- c('a', 'b')
DT[,paste0(cols, '_final')] <- DT[,cols, with = FALSE] -
DT[,paste0(cols, "_mean"), with = FALSE]
DT
# a a_mean b b_mean a_final b_final
#1: 1 1 6 3 0 3
#2: 2 1 7 2 1 5
#3: 3 2 8 1 1 7
#4: 4 2 9 1 2 8
#5: 5 2 10 2 3 8
Another option is using mget with Map:
cols <- c('a', 'b')
DT[, paste0(cols,'_final') := Map(`-`, mget(cols), mget(paste0(cols,"_mean")))]
Relying on the .SD construct you could do something along the lines of:
cols <- c('a', 'b')
DT[, paste0(cols, "_final") :=
DT[, .SD, .SDcols = cols] -
DT[, .SD, .SDcols = paste0(cols, "_mean")]]
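To answer the literal question of accessing the column name while looping, one sketch is to iterate over the character vector of names instead of over .SD, and look the columns up with get(); data.table makes the columns available in j when it sees get():
cols <- c('a', 'b')
# loop over the *names*, so the matching "_mean" column can be built and fetched
DT[, paste0(cols, '_final') := lapply(cols, function(nm)
     get(nm) - get(paste0(nm, '_mean')))]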

How to combine data.tables by= with its shift() without having to create new variables?

I'm trying to generate row sums of a variable and its lag(s). Say I have:
library(data.table)
data <- data.table(id = rep(c("AT","DE"), each = 3),
time = rep(2001:2003, 2), var1 = c(1:6), var2 = c(NA, 1:3, NA, 8))
And I want to create a variable which adds 'var1' and the first lag of 'var2', by 'id'. If I create the lag first and then the sum, I know how to do it:
data[ , lag := shift(var2, 1), by = id]
data[ , goalmessy := sum(var1, lag, na.rm = TRUE), by = 1:NROW(data)]
But is there a way to use shift() inside sum() or something similar (like apply-ing sum or so)? The intuitive problem I have is that, as far as I know, the by argument is evaluated first, so we end up in a single row, which makes the shifting infeasible. Any hints?
I think this will do what you want in one line:
data[, myVals := rowSums(cbind(var1, shift(var2)), na.rm=TRUE), by=id]
data
id time var1 var2 myVals
1: AT 2001 1 NA 1
2: AT 2002 2 1 2
3: AT 2003 3 2 4
4: DE 2001 4 3 4
5: DE 2002 5 NA 8
6: DE 2003 6 8 6
The two variables of interest are combined with cbind(), which feeds rowSums(), and NAs are dropped as in your code.
We can use rowSums
data[, goalmessy := rowSums(setDT(.(var1, shift(var2))), na.rm = TRUE), by = id]
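If var1 itself never contains NAs (true for this example, but an assumption in general), another option is to replace the NAs coming out of shift() with fcoalesce() and add directly; goalmessy2 is just an illustrative name for the new column:
data[, goalmessy2 := var1 + fcoalesce(shift(var2), 0), by = id]
# goalmessy2 gives the same values as myVals/goalmessy above: 1 2 4 4 8 6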

data.table and .SDcols with paste0 to create a character vector

Given the data.table DF below, I would like to select all except the first row of each of the groups numbered 6 and 8. I was told that I should use paste0(). I have a solution that gives the expected result, but without paste0().
DF <- data.table(grp=c(6,6,8,8,8), Q1=c(2,2,3,5,2), Q2=c(5,5,4,4,1), Q3=c(2,1,4,2,4), H1=c(3,4,5,2,4), H2=c(5,2,4,1,2) )
Desired result:
desired_result <- data.table(grp=c(6,8,8), Q1=c(2,5,2), Q2=c(5,4,1), Q3=c(1,2,4) )
One method that achieves this result:
DF[ , .SD[-1], .SDcols = c("Q1", "Q2", "Q3"), by = grp]
How can I use paste0() rather than c()? Is there any advantage to either, or an example where only one would work?
Using paste0() for .SDcols does work for selecting the columns:
DF[ , .SD, .SDcols = paste0("Q", 1:3), by = grp]
grp Q1 Q2 Q3
1: 6 2 5 2
2: 6 2 5 1
3: 8 3 4 4
4: 8 5 4 2
5: 8 2 1 4
Comparing the two ways of specifying the columns:
all.equal(DF[ , .SD, .SDcols = c("Q1", "Q2", "Q3"), by = grp],
DF[ , .SD, .SDcols = paste0("Q", 1:3), by = grp])
[1] TRUE
Note that .SDcols selects columns and has nothing to do with dropping the first rows of each group. .SDcols can take a character vector, and paste0 produces character vectors, so selecting the columns can work either way.
One method that drops the first row of each group, using tail() (and, somewhat gratuitously, paste0()), is:
DF[ , tail(.SD, -1), .SDcols = paste0("Q", 1:3), by = grp]
grp Q1 Q2 Q3
1: 6 2 5 1
2: 8 5 4 2
3: 8 2 1 4
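On the advantage question: paste0() only helps you build the character vector; with a reasonably recent data.table you can also skip both c() and paste0() and let .SDcols match the columns by a regular expression via patterns(), which gives the same result as the tail() call above:
DF[ , tail(.SD, -1), .SDcols = patterns("^Q"), by = grp]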

Concise R data.table syntax for modal value (most frequent) by group

What is efficient and elegant data.table syntax for finding the most common category for each id? I keep a boolean vector indicating the NA positions (for other purposes).
dt = data.table(id=rep(1:2,7), category=c("x","y",NA))
print(dt)
In this toy example, ignoring NA, x is the most common category for id==1 and y for id==2.
If you want to ignore NAs, you have to exclude them first with !is.na(category), group by id and category (by = .(id, category)), and create a frequency variable with .N:
dt[!is.na(category), .N, by = .(id, category)]
which gives:
id category N
1: 1 x 3
2: 2 y 3
3: 2 x 2
4: 1 y 2
Ordering this by id will give you a clearer picture:
dt[!is.na(category), .N, by = .(id, category)][order(id)]
which results in:
id category N
1: 1 x 3
2: 1 y 2
3: 2 y 3
4: 2 x 2
If you just want the rows which indicate the top results:
dt[!is.na(category), .N, by = .(id, category)][order(id, -N), head(.SD,1), by = id]
or:
dt[!is.na(category), .N, by = .(id, category)][, .SD[which.max(N)], by = id]
which both give:
id category N
1: 1 x 3
2: 2 y 3
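A more compact sketch for the mode itself, dropping the counts; it assumes ties can be broken arbitrarily by which.max(), and mode_category is just an illustrative column name:
dt[!is.na(category), .(mode_category = names(which.max(table(category)))), by = id]
#    id mode_category
# 1:  1             x
# 2:  2             y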
