Summarizing multiple columns with data.table

I'm trying to use data.table to speed up processing of a large data.frame (300k x 60) made of several smaller merged data.frames. I'm new to data.table. The code so far is as follows:
library(data.table)
a = data.table(index=1:5,a=rnorm(5,10),b=rnorm(5,10),z=rnorm(5,10))
b = data.table(index=6:10,a=rnorm(5,10),b=rnorm(5,10),c=rnorm(5,10),d=rnorm(5,10))
dt = merge(a,b,by=intersect(names(a),names(b)),all=T)
dt$category = sample(letters[1:3],10,replace=T)
and I wondered if there is a more efficient way than the following to summarize the data:
summ = dt[i=T, j=list(a=sum(a,na.rm=T), b=sum(b,na.rm=T), c=sum(c,na.rm=T),
                      d=sum(d,na.rm=T), z=sum(z,na.rm=T)), by=category]
I don't really want to type all 50 column calculations by hand, and an eval(paste(...)) seems clunky somehow.
I had a look at the example below, but it seems a bit complicated for my needs. Thanks
how to summarize a data.table across multiple columns

You can use a simple lapply call with .SD:
dt[, lapply(.SD, sum, na.rm=TRUE), by=category ]
   category index        a        b        z         c        d
1:        c    19 51.13289 48.49994 42.50884  9.535588 11.53253
2:        b     9 17.34860 20.35022 10.32514 11.764105 10.53127
3:        a    27 25.91616 31.12624  0.00000 29.197343 31.71285
If you only want to summarize over certain columns, you can add the .SDcols argument
# note that .SDcols also allows reordering of the columns
dt[, lapply(.SD, sum, na.rm=TRUE), by=category, .SDcols=c("a", "c", "z") ]
   category        a         c        z
1:        c 51.13289  9.535588 42.50884
2:        b 17.34860 11.764105 10.32514
3:        a 25.91616 29.197343  0.00000
This, of course, is not limited to sum; you can use any function with lapply, including anonymous functions (i.e., it's a regular lapply call).
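For instance, a minimal sketch with an anonymous function (a trimmed mean here, purely for illustration):
dt[, lapply(.SD, function(x) mean(x, trim = 0.1, na.rm = TRUE)), by = category]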
Lastly, there is no need to write i=T and j=<...>. Personally, I think that style makes the code less readable, but it is just a preference.
Documentation
See ?.SD, ?data.table and its .SDcols argument, and the vignette Using .SD for Data Analysis.
Also have a look at data.table FAQ 2.1.

Related

Flexible mixing of multiple aggregations in data.table for different column combinations

The following problem has so far prevented me from really flexible use of data.table aggregations.
Example:
library(data.table)
set.seed(1)
DT <- data.table(C1=c("a","b","b"),
                 C2=round(rnorm(4),4),
                 C3=1:12,
                 C4=9:12)
sum_cols <- c("C2","C3")
#I want to apply a custom aggregation over multiple columns
DT[,lapply(.SD,sum),by=C1,.SDcols=sum_cols]
### Part 1 of question ###
#but what if I want to add another aggregation, e.g. count
DT[,.N,by=C1]
#this is not working as intended (it creates 4 rows instead of 2 and doesn't contain sum_cols)
DT[,.(.N,lapply(.SD,sum)),by=C1,.SDcols=sum_cols]
### Part 2 of question ###
# or another function for another set of columns, adding a prefix to keep them apart?
mean_cols <- c("C3","C4")
#intended table structure (with 2 rows again)
# C1 sum_C2 sum_C3 mean_C3 mean_C4
I know I can always merge various single-aggregation results by some key, but I'm sure there must be a correct, flexible and easy way to do what I would like to do (especially Part 2).
The first thing to notice is that data.table's j argument expects a list output, which can be built with c, as mentioned in @akrun's answer. Here are two ways to do it:
set.seed(1)
DT <- data.table(C1=c("a","b","b"), C2=round(rnorm(4),4), C3=1:12, C4=9:12)
sum_cols <- c("C2","C3")
mean_cols <- c("C3","C4")
# with the development version, 1.10.1+
DT[, c(
     .N,
     sum  = lapply(.SD[, ..sum_cols],  sum),
     mean = lapply(.SD[, ..mean_cols], mean)
   ), by=C1]
# in earlier versions
DT[, c(
     .N,
     sum  = lapply(.SD[, sum_cols, with=FALSE], sum),
     mean = lapply(.SD[, mean_cols, with=FALSE], mean)
   ), by=C1]
Each lapply call returns a list, and c connects the elements together into a single list.
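If you want the exact sum_/mean_ prefixes from the intended structure above, here is one possible sketch (assuming 1.10.1+ for the .. syntax; the dots in names such as sum.C2 come from c()'s name handling, and sub() swaps them for underscores):
res <- DT[, c(
         .N,
         sum  = lapply(.SD[, ..sum_cols],  sum),
         mean = lapply(.SD[, ..mean_cols], mean)
       ), by=C1]
setnames(res, sub(".", "_", names(res), fixed=TRUE))  # e.g. sum.C2 -> sum_C2
res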
Comments
If you turn on the verbose data.table option for these calls, you'll see a message:
The result of j is a named list. It's very inefficient to create the same names over and over again for each group. When j=list(...), any names are detected, removed and put back after grouping has completed, for efficiency. Using j=transform(), for example, prevents that speedup (consider changing to :=). This message may be upgraded to warning in future.
Also, you'll see that the optimized group mean and sum are not being used (see ?GForce for details). We can perhaps get around this by following FAQ 1.6, but I couldn't figure out how.
The outputs are lists, so we use c to concatenate them:
DT[, c(.N, lapply(.SD, sum)), by=C1, .SDcols=sum_cols]
#    C1 N    C2 C3
# 1:  a 4 0.288 22
# 2:  b 8 0.576 56

data.table: transforming subset of columns with a function, row by row

How can one, having a data.table with mostly numeric values, transform just a subset of columns and put them back to the original data table? Generally, I don't want to add any summary statistic as a separate column, just exchange the transformed ones.
Assume we have a DT. It has 1 column with names and 10 columns with numeric values. I am interested in using the "scale" function of base R on each row of that data table, but only applied to those 10 numeric columns.
And to expand on this: what if I have a data table with more columns and I need to use column names to tell the scale function which data points to apply the function to?
With regular data.frame I would just do:
df[,grep("keyword",colnames(df))] <- t(apply(df[,grep("keyword",colnames(df))],1,scale))
I know this looks cumbersome but always worked for me. However, I can't figure out a simple way to do it in data.tables.
I would imagine something like this to work for data.tables:
dt[,grep("keyword",colnames(dt)) := scale(grep("keyword",colnames(dt)),center=F)]
But it doesn't.
EDIT:
Another example of updating the columns with their per-row-scaled version:
# dt is a data.table
dt[, grep("keyword", colnames(dt), value=T) :=
     as.data.table(t(apply(dt[, grep("keyword", colnames(dt)), with=F], 1, scale)))]
Too bad it needs the "as.data.table" part inside, as the transposed value from the apply function is a matrix. Maybe data.table should automatically coerce matrices into data.tables upon updating of columns?
If what you need is really to scale by row, you can try doing it in 2 steps:
# compute mean/sd:
mean_sd <- DT[, .(mean(unlist(.SD)), sd(unlist(.SD))),
              by=1:nrow(DT), .SDcols=grep("keyword", colnames(DT))]
# scale
DT[, grep("keyword", colnames(DT), value=TRUE) :=
     lapply(.SD, function(x) (x - mean_sd$V1) / mean_sd$V2),
   .SDcols=grep("keyword", colnames(DT))]
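A minimal, hypothetical example to try the two steps on (the keyword_ column names are made up for illustration):
library(data.table)
DT <- data.table(id        = 1:3,
                 keyword_x = c(1, 4, 7),
                 keyword_y = c(2, 5, 8),
                 keyword_z = c(3, 6, 9))
# after running the two steps above, every keyword_ row is (-1, 0, 1):
# each row now has mean 0 and sd 1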
PART 1: The one-line solution you requested:
# First, let's take a look at the data in the columns:
DT[, .SD, .SDcols = grep("keyword", colnames(DT))]
One-line Solution, Version 1: use magrittr and the pipe operator:
library(magrittr)
DT[, (grep("keyword", colnames(DT))) := (lapply(.SD, . %>% scale(., center = F))),
   .SDcols = grep("keyword", colnames(DT))]
One-line Solution, Version 2: explicitly define the function for the lapply:
DT[, (grep("keyword", colnames(DT))) :=
     (lapply(.SD, function(x){scale(x, center = F)})),
   .SDcols = grep("keyword", colnames(DT))]
Modification: if you want to do it by group, just add by =:
DT[, (grep("keyword", colnames(DT))) :=
     (lapply(.SD, function(x){scale(x, center = F)})),
   .SDcols = grep("keyword", colnames(DT)),
   by = Grouping.Variable]
You can verify:
# Verify that the columns have updated values:
DT[, .SD, .SDcols = grep("keyword", colnames(DT))]
PART 2: A step-by-step solution (more general and easier to follow)
The above solution clearly works for the narrow example given.
As a public service, I am posting this for anyone who is still searching for a way that:
feels a bit less condensed;
is easier to understand;
is more general, in the sense that you can apply any function you wish without having to compute the values in a separate data table first (which, n.b., does work perfectly well here).
Here's the step-by-step way of doing the same:
Get the data into data.table format:
# You get a data.table called DT
DT <- as.data.table(df)
Then handle the column names:
# Get the column names (value = TRUE returns names rather than indices)
Reference.Cols <- grep("keyword", colnames(df), value = TRUE)
# FOR PEOPLE who want to store both transformed and untransformed values:
# create new column names
Reference.Cols.normalized <- paste0(Reference.Cols, ".normalized")
Define the function you want to apply
#Define the function you wish to apply
# Where, normalize is just a function as defined in the question:
normalize <- function(X,
                      X.mean = mean(X, na.rm = TRUE),
                      X.sd   = sd(X, na.rm = TRUE)) {
  X <- (X - X.mean) / X.sd
  return(X)
}
After that, it is trivial in Data.Table syntax:
# Voila, a newly created set of columns that contain the transformed values:
DT[, (Reference.Cols.normalized) := lapply(.SD, normalize), .SDcols = Reference.Cols]
Verify:
New values are stored in the columns named in Reference.Cols.normalized:
DT[, .SD, .SDcols = Reference.Cols.normalized]
The untransformed values are left unharmed:
DT[, .SD, .SDcols = Reference.Cols]
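As an extra sanity check (a sketch), each normalized column should come back with mean near 0 and sd near 1:
DT[, lapply(.SD, function(x) c(mean(x, na.rm = TRUE), sd(x, na.rm = TRUE))),
   .SDcols = Reference.Cols.normalized]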
Hopefully, for those of you who return to look at code after some interval, this more step-by-step / general approach can be helpful.

Using data.table to calculate a function which depends on many columns

There are many posts which discuss applying a function over many columns when using data.table. However I need to calculate a function which depends on many columns. As an example:
# Create a data table with 26 columns. Variable names are var1, ..., var26
data.mat = matrix(sample(letters, 26*26, replace=TRUE),ncol=26)
colnames(data.mat) = paste("var",1:26,sep="")
data.dt <- data.table(data.mat)
Now, say I would like to count the number of 'a's in columns 5,6,7 and 8.
I cannot see how to do this with .SDcols and end up doing:
data.dt[, numberOfAs := (var5=='a') + (var6=='a') + (var7=='a') + (var8=='a')]
This is very tedious. Is there a more sensible way to do it?
Thanks
I really suggest going through the vignettes linked here. Section 2e from the Introduction to data.table vignette explains .SD and .SDcols.
.SD is just a data.table containing the data for the current group, and .SDcols specifies which columns .SD should contain. A useful trick is to use print to see the contents.
# .SD contains cols 5:8
data.dt[, print(.SD), .SDcols=5:8]
Since there is no by here, .SD contains all the rows of data.dt, corresponding to the columns specified in .SDcols.
Once you understand this, the task reduces to your knowledge of base R really. You can accomplish this in more than one way.
data.dt[, numberOfAs := rowSums(.SD == "a"), .SDcols=5:8]
Comparing all the columns in .SD to "a" returns a logical matrix; rowSums then adds up each row.
Another way using Reduce:
data.dt[, numberOfAs := Reduce(`+`, lapply(.SD, function(x) x == "a")), .SDcols=5:8]
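Note that .SDcols also accepts column names, which can be safer than positions if the column order ever changes; the same computation as a sketch:
data.dt[, numberOfAs := rowSums(.SD == "a"), .SDcols = paste0("var", 5:8)]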

Getting NA when summarizing by columns in data.table

I'm trying to summarize (take the mean) of various columns based on a single column within a data.table.
Here's a toy example of my data and the code I used that shows the problem I'm having:
library(data.table)
a <- data.table(
  a=c(1213.1,NA,113.41,133.4,121.1,45.34),
  b=c(14.131,NA,1.122,113.11,45.123,344.3),
  c=c(101.2,NA,232.1,194.3,12.12,7645.3),
  d=c(11.32,NA,32.121,94.3213,1223.1,34.1),
  e=c(1311.32,NA,12.781,13.2,2.1,623.2),
  f=c("A", "B", "B", "A", "B", "X"))
a
setkey(a,f) # column "f" is what I want to summarize columns by
a[, lapply(.SD, mean), by=f, .SDcols=c(1:4)] # I just want to summarize first 4 columns
The output of the last line:
> a[, lapply(.SD, mean), by=f, .SDcols=c(1:4)]
   f      a        b       c        d
1: A 673.25  63.6205  147.75 52.82065
2: B     NA       NA      NA       NA
3: X  45.34 344.3000 7645.30 34.10000
Why are B entries NA? Shouldn't NA be ignored in the calculation of the mean? I think I found a similar issue here, but perhaps this is different and/or I've got the syntax messed up.
If this isn't possible in data.table, I'm open to other suggestions.
In R, the default behavior of the mean() function is to return NA if there are missing values. To ignore NAs in the mean calculation, you need to set the argument na.rm=TRUE. lapply passes additional arguments through to the function it applies, so for your problem you can try:
a[, lapply(.SD, mean, na.rm=TRUE), by=f, .SDcols=c(1:4)]
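To see the default behavior in isolation:
mean(c(1, NA, 3))                # NA
mean(c(1, NA, 3), na.rm = TRUE)  # 2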

Summing many columns with data.table in R, remove NA [duplicate]

This question already has an answer here:
Summarizing multiple columns with data.table
This is really two questions, I guess. I'm trying to use the data.table package to summarize a large dataset. Say my original large dataset is df1, and unfortunately df1 has 50 columns (y0...y49) that I want summed by 3 fields (segmentfield1, segmentfield2, segmentfield3). Is there a simpler way to do this than typing out every column y0...y49? Related to this, is there a generic na.rm=T for the data.table instead of typing it with each sum too?
dt1 <- data.table(df1)
setkey(dt1, segmentfield1, segmentfield2, segmentfield3)
dt2 <- dt1[, list( y0=sum(y0,na.rm=T), y1=sum(y1,na.rm=T), y2=sum(y2,na.rm=T), ...
                   y49=sum(y49,na.rm=T) ),
           by=list(segmentfield1, segmentfield2, segmentfield3)]
First, create variables for the names in use:
colsToSum <- setdiff(names(dt1), key(dt1))  # the y0...y49 columns, or whatever you need
summedNms <- paste0("sum_", colsToSum)      # new names for the summed columns, if you want them
If you'd like to copy the result to a new data.table:
dt2 <- dt1[, lapply(.SD, sum, na.rm=TRUE), by=key(dt1), .SDcols=colsToSum]
setnames(dt2, colsToSum, summedNms)
If, alternatively, you'd like to append the columns to the original:
dt1[, c(summedNms) := lapply(.SD, sum, na.rm=TRUE), by=key(dt1), .SDcols=colsToSum]
As far as a general na.rm process goes, there is not one specific to data.table, but have a look at ?na.omit and ?na.exclude.
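Note the difference, though: na.omit() drops whole rows up front, which is not the same as per-column na.rm=TRUE (a row is lost for every column if any one of its values is NA). A hedged sketch of that approach:
dt2 <- na.omit(dt1)[, lapply(.SD, sum), by=key(dt1), .SDcols=colsToSum]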
