Getting NA when summarizing by columns in data.table

I'm trying to summarize (take the mean) of various columns based on a single column within a data.table.
Here's a toy example of my data and the code I used that shows the problem I'm having:
library(data.table)
a <- data.table(
  a = c(1213.1, NA, 113.41, 133.4, 121.1, 45.34),
  b = c(14.131, NA, 1.122, 113.11, 45.123, 344.3),
  c = c(101.2, NA, 232.1, 194.3, 12.12, 7645.3),
  d = c(11.32, NA, 32.121, 94.3213, 1223.1, 34.1),
  e = c(1311.32, NA, 12.781, 13.2, 2.1, 623.2),
  f = c("A", "B", "B", "A", "B", "X"))
a
setkey(a,f) # column "f" is what I want to summarize columns by
a[, lapply(.SD, mean), by=f, .SDcols=c(1:4)] # I just want to summarize first 4 columns
The output of the last line:
> a[, lapply(.SD, mean), by=f, .SDcols=c(1:4)]
   f      a        b       c        d
1: A 673.25  63.6205  147.75 52.82065
2: B     NA       NA      NA       NA
3: X  45.34 344.3000 7645.30 34.10000
Why are B entries NA? Shouldn't NA be ignored in the calculation of the mean? I think I found a similar issue here, but perhaps this is different and/or I've got the syntax messed up.
If this isn't possible in data.table, I'm open to other suggestions.

In R, the mean() function returns NA by default if any of the input values are missing. To ignore NAs in the mean calculation, you need to set the argument na.rm=TRUE. lapply passes additional arguments through to the function it applies, so for your problem you can try
a[, lapply(.SD, mean, na.rm=TRUE), by=f, .SDcols=c(1:4)]
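For reference, the expected output (the B-row values are simply the means of the non-NA entries in the toy data above; exact print formatting may differ):
> a[, lapply(.SD, mean, na.rm=TRUE), by=f, .SDcols=c(1:4)]
   f        a        b       c         d
1: A 673.2500  63.6205  147.75  52.82065
2: B 117.2550  23.1225  122.11 627.61050
3: X  45.3400 344.3000 7645.30  34.10000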

Related

Flexible mixing of multiple aggregations in data.table for different column combinations

The following problem has so far prevented me from making really flexible use of data.table aggregations.
Example:
library(data.table)
set.seed(1)
DT <- data.table(C1=c("a","b","b"),
                 C2=round(rnorm(4),4),
                 C3=1:12,
                 C4=9:12)
sum_cols <- c("C2","C3")
#I want to apply a custom aggregation over multiple columns
DT[,lapply(.SD,sum),by=C1,.SDcols=sum_cols]
### Part 1 of question ###
#but what if I want to add another aggregation, e.g. count
DT[,.N,by=C1]
#this is not working as intended (creates 4 rows instead of 2 and doesn't contain sum_cols)
DT[,.(.N,lapply(.SD,sum)),by=C1,.SDcols=sum_cols]
### Part 2 of question ###
# or another function for another set of columns, adding a prefix to keep them apart?
mean_cols <- c("C3","C4")
#intended table structure (with 2 rows again)
C1 sum_C2 sum_C3 mean_C3 mean_C4
I know I can always merge various single-aggregation results by some key, but I'm sure there must be a correct, flexible and easy way to do what I would like to do (especially Part 2).
The first thing to notice is that data.table's j argument expects a list output, which can be built with c, as mentioned in @akrun's answer. Here are two ways to do it:
set.seed(1)
DT <- data.table(C1=c("a","b","b"), C2=round(rnorm(4),4), C3=1:12, C4=9:12)
sum_cols <- c("C2","C3")
mean_cols <- c("C3","C4")
# with the development version, 1.10.1+
DT[, c(
.N,
sum = lapply(.SD[, ..sum_cols], sum),
mean = lapply(.SD[, ..mean_cols], mean)
), by=C1]
# in earlier versions
DT[, c(
.N,
sum = lapply(.SD[, sum_cols, with=FALSE], sum),
mean = lapply(.SD[, mean_cols, with=FALSE], mean)
), by=C1]
Each lapply call returns a list, and c connects the elements together to make a single list for j.
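Note that c prefixes the names with a dot, so the columns come out as sum.C2, sum.C3, mean.C3, mean.C4. A minimal sketch to recover the underscore names from the question (the renaming step is my addition, not part of the original answer):
res <- DT[, c(
  N = .N,
  sum = lapply(.SD[, ..sum_cols], sum),
  mean = lapply(.SD[, ..mean_cols], mean)
), by=C1]
setnames(res, gsub(".", "_", names(res), fixed=TRUE))
# columns are now: C1, N, sum_C2, sum_C3, mean_C3, mean_C4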
Comments
If you turn on the verbose data.table option for these calls, you'll see a message:
The result of j is a named list. It's very inefficient to create the same names over and over again for each group. When j=list(...), any names are detected, removed and put back after grouping has completed, for efficiency. Using j=transform(), for example, prevents that speedup (consider changing to :=). This message may be upgraded to warning in future.
Also, you'll see that the optimized group mean and sum are not being used (see ?GForce for details). We can get around this by following FAQ 1.6 perhaps, but I couldn't figure out how.
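One workaround, echoing the merge approach the OP already mentioned (a sketch of mine, not part of the original answer): run each aggregation separately, so GForce can optimize each call, then join on the grouping column:
sums <- DT[, lapply(.SD, sum), by=C1, .SDcols=sum_cols]
means <- DT[, lapply(.SD, mean), by=C1, .SDcols=mean_cols]
setnames(sums, sum_cols, paste0("sum_", sum_cols))
setnames(means, mean_cols, paste0("mean_", mean_cols))
sums[means, on="C1"]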
The outputs are lists, so we use c to concatenate both list outputs:
DT[,c(.N,lapply(.SD,sum)),by=C1,.SDcols=sum_cols]
# C1 N C2 C3
# 1: a 4 0.288 22
# 2: b 8 0.576 56

Simultaneous order, row-filter and column-select with data.table

I am trying to do multiple steps in one line in R to select a value from a data.table (dt) with multiple criteria.
For example:
set.seed(123)
dt <- data.table(id = rep(letters[1:2], 2),
                 time = rnorm(4),
                 value = rnorm(4)*100)
# id time value
# 1: a -0.56047565 12.92877
# 2: b -0.23017749 171.50650
# 3: a 1.55870831 46.09162
# 4: b 0.07050839 -126.50612
# Now I want to select the last (maximum time) value from id == "a"
# My pseudo data.table code looks like this
dt[order(time) & id == "a" & .N, value]
# [1] 12.92877 46.09162
Instead of getting the two values I want to get only the last value (which has the higher time-value).
If I do it step-by-step it works:
dt <- dt[order(time) & id == "a"]
dt[.N, value]
# [1] 46.09162
Bonus:
How can I order a data.table without copying it, i.e.
dt <- dt[order(time)]
without the <-. Similar to the :=-operator such as in dt[, new_val := value*2] which creates the new variable without copying the whole data.table.
Thank you, any idea is greatly appreciated!
For your first question, try
dt[id == "a", value[which.max(time)]]
## [1] 46.09162
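If you want the whole row rather than just the value, a small variation (my sketch, not part of the original answer) keeps all columns:
dt[id == "a", .SD[which.max(time)]]
##    id       time    value
## 1:  a 1.55870831 46.09162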
For the bonus question, try the setorder function, which orders your data in place (you can also sort in descending order by adding - in front of time)
setorder(dt, time)
dt
# id time value
# 1: a -0.56047565 12.92877
# 2: b -0.23017749 171.50650
# 3: b 0.07050839 -126.50612
# 4: a 1.55870831 46.09162
Also, if you are already ordering your data by time, you can do both - order by reference and select value by condition - in a single line
setorder(dt, time)[id == "a", value[.N]]
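Following the note about descending order, an equivalent sketch of mine: sort descending and take the first matching value:
setorder(dt, -time)[id == "a", value[1]]
## [1] 46.09162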
I know this is an older question, but I'd like to add something. Having a similar problem, I stumbled on this question, and although David Arenburg's answer does provide a solution to this exact question, I had trouble with it when trying to replace/overwrite values in that filtered and ordered data.table. So here is an alternative way which also lets you apply <- calls directly to the filtered and ordered data.table.
The key is that data.table lets you chain several [] calls together.
Example:
dt[id=="a", ][order(time), ][length(value), "value"] <- 0
This also works for more than one entry; simply provide a suitable vector as the replacement value.
Note, however, that .N is not available in this replacement form and needs to be replaced with e.g. the length of the column, because data.table expects an integer at this position in i, and the column you want to select in j needs to be wrapped in "".
I found this to be the more intuitive way, and it lets you not only filter the data.table but also manipulate its values without needing to worry about temporary tables.
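For completeness, a by-reference alternative (my sketch, not part of the original answer) that avoids the chained <- entirely:
# overwrite the max-time value within the id == "a" subset, by reference
dt[id == "a", value := replace(value, which.max(time), 0)]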

aggregating categories in R

Hi, I'm new to R and I'm trying to aggregate a list and count the totals per category, but I'm not sure how to do it.
myList =c("A", "B", "A", "A", "B")
I can create a function that loops through the list, groups each category and counts them. But I'm sure there must be an easier way to group this so that I can get each category and the count for each category. That is, A would be 3 and B would be 2.
I tried using the function below but I believe I don't have the proper syntax.
aggr <-aggregate(myList, count)
Thanks for your help in advance.
I'm guessing that you're just looking for table and not actually aggregate:
myList =c("A", "B", "A", "A", "B")
table(myList)
# myList
# A B
# 3 2
tapply can also be handy here:
tapply(myList, myList, length)
# A B
# 3 2
And, I suppose you could "trick" aggregate in the following way:
aggregate(ind ~ myList, data.frame(myList, ind = 1), length)
# myList ind
# 1 A 3
# 2 B 2
If you're looking to understand why as well, aggregate generally takes a data.frame as its input, and you specify one or more columns to be aggregated grouped by one or more other columns (or vectors in your workspace of the same length as the number of rows).
In the example above, I converted your vector into a data.frame, adding a dummy column where all the values were "1". I then used the formula ind ~ myList (where ~ is sort of like "is grouped by") and set the aggregation function to length (there is no count in base R, though that function can be found in different packages).
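Since the rest of this page is about data.table, it is worth noting (my addition, not part of the original answer) that the same count can be written with the .N idiom used in the earlier answers:
library(data.table)
data.table(myList)[, .N, by=myList]
#    myList N
# 1:      A 3
# 2:      B 2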

How to avoid listing out function arguments but still subset too?

I have a function myFun(a,b,c,d,e,f,g,h) which contains a vectorised expression of its parameters within it.
I'd like to add a new column: data$result <- with(data, myFun(A,B,C,D,E,F,G,H)) where A,B,C,D,E,F,G,H are column names of data. I'm using data.table but data.frame answers are appreciated too.
So far the parameter list (column names) can be tedious to type out, and I'd like to improve readability. Is there a better way?
> myFun <- function(a,b,c) a+b+c
> dt <- data.table(a=1:5,b=1:5,c=1:5)
> with(dt,myFun(a,b,c))
[1] 3 6 9 12 15
The ultimate thing I would like to do is:
dt[isFlag, newCol:=myFun(A,B,C,D,E,F,G,H)]
However:
> dt[a==1,do.call(myFun,dt)]
[1] 3 6 9 12 15
Notice that the j expression seems to ignore the subset. The result should be just 3.
Ignoring the subset aspect for now: df$result <- do.call("myFun", df). But that copies the whole df whereas data.table allows you to add the column by reference: df[,result:=myFun(A,B,C,D,E,F,G,H)].
To include the comment from @eddi (and I'm not sure how to combine these operations so easily in data.frame):
dt[isFlag, newCol := do.call(myFun, .SD)]
Note that .SD can be used even when you aren't grouping, just subsetting.
Or if your function literally just adds its arguments together element-wise:
dt[isFlag, newCol := Reduce(`+`, .SD)]
(Note that do.call(sum, .SD) would collapse .SD to a single grand total rather than element-wise sums.)
This automatically places NA into newCol where isFlag is FALSE.
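Coming back to the failing example from the question, the same .SD trick gives the subset-respecting result (a quick sketch of mine):
dt[a == 1, do.call(myFun, .SD)]
## [1] 3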
You can use
df$result <- do.call(myFun, df)

Summarizing multiple columns with data.table

I'm trying to use data.table to speed up processing of a large data.frame (300k x 60) made of several smaller merged data.frames. I'm new to data.table. The code so far is as follows
library(data.table)
a = data.table(index=1:5,a=rnorm(5,10),b=rnorm(5,10),z=rnorm(5,10))
b = data.table(index=6:10,a=rnorm(5,10),b=rnorm(5,10),c=rnorm(5,10),d=rnorm(5,10))
dt = merge(a,b,by=intersect(names(a),names(b)),all=T)
dt$category = sample(letters[1:3],10,replace=T)
and I wondered if there was a more efficient way than the following to summarize the data.
summ = dt[i=T,j=list(a=sum(a,na.rm=T),b=sum(b,na.rm=T),c=sum(c,na.rm=T),
d=sum(d,na.rm=T),z=sum(z,na.rm=T)),by=category]
I don't really want to type all 50 column calculations by hand, and an eval(paste(...)) approach seems clunky somehow.
I had a look at the example below, but it seems a bit complicated for my needs. Thanks.
how to summarize a data.table across multiple columns
You can use a simple lapply statement with .SD
dt[, lapply(.SD, sum, na.rm=TRUE), by=category ]
   category index        a        b        z         c        d
1:        c    19 51.13289 48.49994 42.50884  9.535588 11.53253
2:        b     9 17.34860 20.35022 10.32514 11.764105 10.53127
3:        a    27 25.91616 31.12624  0.00000 29.197343 31.71285
If you only want to summarize over certain columns, you can add the .SDcols argument
# note that .SDcols also allows reordering of the columns
dt[, lapply(.SD, sum, na.rm=TRUE), by=category, .SDcols=c("a", "c", "z") ]
   category        a         c        z
1:        c 51.13289  9.535588 42.50884
2:        b 17.34860 11.764105 10.32514
3:        a 25.91616 29.197343  0.00000
This, of course, is not limited to sum; you can use any function with lapply, including anonymous functions (i.e., it's a regular lapply statement).
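For instance, an anonymous function works the same way (a sketch of mine; output omitted since the example data is generated without a seed):
dt[, lapply(.SD, function(x) mean(x, na.rm=TRUE)), by=category, .SDcols=c("a","b","z")]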
Lastly, there is no need to use i=T or to name the j argument explicitly. Personally, I think that makes the code less readable, but it is just a style preference.
Documentation
See ?.SD, ?data.table and its .SDcols argument, and the vignette Using .SD for Data Analysis.
Also have a look at data.table FAQ 2.1.
