data.table aggregations that return vectors, such as scale()

I have recently been working with much larger datasets and have started learning and migrating to data.table to improve performance of aggregation/grouping. I have been unable to get certain expressions or functions to group as expected. Here is an example of a basic group-by operation that I am having trouble with:
library(data.table)
category <- rep(1:10, 10)
value <- rnorm(100)
df <- data.frame(category, value)
dt <- data.table(df)
If I simply want to calculate the mean for each category group, this works easily enough:
dt[,mean(value),by="category"]
category V1
1: 1 -0.67555478
2: 2 -0.50438413
3: 3 0.29093723
4: 4 -0.41684790
5: 5 0.33921764
6: 6 0.01970997
7: 7 -0.23684245
8: 8 -0.04280998
9: 9 0.01838804
10: 10 0.44295978
I run into problems if I try to use the scale function, or even a simple expression subtracting the mean from the value. The grouping appears to be ignored and I get the function/expression applied to each row instead. The following returns all 100 rows instead of 10 grouped rows:
dt[,scale(value),by="category"]
dt[,value-mean(value),by="category"]
I thought recreating scale as a function that returns a numeric vector instead of a matrix might help:
zScore <- function(x) {
  z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
  return(z)
}
dt[,zScore(value),by="category"]
category V1
1: 1 -1.45114132
2: 1 -0.35304528
3: 1 -0.94075418
4: 1 1.44454416
5: 1 1.39448268
6: 1 0.55366652
....
97: 10 -0.43190602
98: 10 -0.25409244
99: 10 0.35496694
100: 10 0.57323480
category V1
This also returns zScore applied to all rows (N=100), ignoring the grouping. What am I missing in order to get scale() or a custom function to use the grouping, as mean() did above?

You've clarified in the comments that you'd like the same behaviour as:
ddply(df,"category",transform, zscorebycategory=zScore(value))
which gives:
category value zscorebycategory
1 1 0.28860691 0.31565682
2 1 1.17473759 1.33282374
3 1 0.06395503 0.05778463
4 1 1.37825487 1.56643607
etc
The data.table option you gave produces:
category V1
1: 1 0.31565682
2: 1 1.33282374
3: 1 0.05778463
4: 1 1.56643607
etc
This is exactly the same data. However, you'd also like to repeat the value column in your result, and rename the V1 variable to something more descriptive. data.table gives you the grouping variable in the result, along with the result of the expression you provide. So let's modify that to give the rows you'd like:
Your
dt[,zScore(value),by="category"]
becomes:
dt[,list(value=value, zscorebycategory=zScore(value)),by="category"]
The named items in the list become columns in the result.
plyr = data.table(ddply(df,"category",transform, zscorebycategory=zScore(value)))
dt = dt[,list(value=value, zscorebycategory=zScore(value)),by="category"]
identical(plyr, dt)
[1] TRUE
(note I converted your ddply data.frame result into a data.table, to allow the identical command to work).
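As an aside (not part of the original answer): if you would rather add the z-score to the original dt from the question in place, instead of building a new table, a grouped := does that by reference:
# A minimal sketch, assuming dt is still the original table from the question:
# := adds zscorebycategory as a new column by reference, keeping the existing columns
dt[, zscorebycategory := zScore(value), by = category]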

Your claim that data.table does not group is wrong:
library(data.table)
category <- rep(1:2, each=4)
value <- c(rep(c(1:2),each=2),rep(c(4,10),each=2))
dt <- data.table(category, value)
category value
1: 1 1
2: 1 1
3: 1 2
4: 1 2
5: 2 4
6: 2 4
7: 2 10
8: 2 10
dt[,value-mean(value),by=category]
category V1
1: 1 -0.5
2: 1 -0.5
3: 1 0.5
4: 1 0.5
5: 2 -3.0
6: 2 -3.0
7: 2 3.0
8: 2 3.0
If you want to scale/transform, this is exactly the behavior you want, because these operations by definition return an object of the same size as their input.
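One wrinkle worth noting (an aside, not from the original answers): scale() returns a one-column matrix rather than a plain vector, so if you want an ordinary numeric column you can flatten it:
# Sketch: as.vector() drops the matrix shape and attributes that scale() attaches
dt[, scaled := as.vector(scale(value)), by = category]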

How to crosstabulate the missing values with data.table

Say we have this toy example:
prueba <- data.table(aa = 1:7, bb = c(1, 2, NA, NA, 3, 1, 1),
                     cc = c(1, 2, NA, NA, 3, 1, 1), YEAR = c(1, 1, 1, 2, 2, 2, 2))
aa bb cc YEAR
1: 1 1 1 1
2: 2 2 2 1
3: 3 NA NA 1
4: 4 NA NA 2
5: 5 3 3 2
6: 6 1 1 2
7: 7 1 1 2
I want to create a table of summary values by YEAR.
In this simple example I will just ask for a table of how many missing and non-missing values I have.
This is an ugly way to do it, specifying everything by hand:
prueba[,.(sum(is.na(.SD)),sum(!is.na(.SD))), by=YEAR]
Though it doesn't automatically label the new columns, we can see it says I have 2 missing and 7 non-missing values for year 1, and ...
YEAR V1 V2
1: 1 2 7
2: 2 2 10
It works, but what I would really like is to use table() or some data.table equivalent instead of specifying every term by hand. That would be much more efficient if I have many columns, or if I don't know them beforehand.
I've tried with:
prueba[,table(is.na(.SD)), by=YEAR]
but it doesn't work, I get this:
YEAR V1
1: 1 7
2: 1 2
3: 2 10
4: 2 2
How can I get the same format as above?
I've tried unsuccessfully using as.data.table, unlist, lapply, and other things. I think some people use dcast, but I don't know how to use it here.
Is there a simple way to do it?
My real table is very large.
Is it better to use the names of the columns instead of .SD?
You can convert the table to a list if you want it as two separate columns:
prueba[, as.list(table(is.na(.SD))), by=YEAR]
# YEAR FALSE TRUE
# 1: 1 7 2
# 2: 2 10 2
I suggest not using TRUE and FALSE as column names though.
prueba[, setNames(as.list(table(is.na(.SD))), c('notNA', 'isNA'))
, by = YEAR]
# YEAR notNA isNA
# 1: 1 7 2
# 2: 2 10 2
Another option is to add a new column and then dcast:
na_summ <- prueba[, table(is.na(.SD)), by = YEAR]
na_summ[, vname := c('notNA', 'isNA'), YEAR]
dcast(na_summ, YEAR ~ vname, value.var = 'V1')
# YEAR isNA notNA
# 1: 1 2 7
# 2: 2 2 10
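If you want the NA counts per column rather than pooled across all of .SD, a minimal sketch (not part of the original answer) using lapply over .SD:
prueba[, lapply(.SD, function(x) sum(is.na(x))), by = YEAR]
#    YEAR aa bb cc
# 1:    1  0  1  1
# 2:    2  0  1  1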

R data.table filtering on group size

I am trying to find all the records in my data.table for which there is more than one row with value v in field f.
For instance, we can use this data:
dt <- data.table(f1=c(1,2,3,4,5), f2=c(1,1,2,3,3))
If we look for that property in field f2, we'd get this (note the absence of the (3,2) tuple):
f1 f2
1: 1 1
2: 2 1
3: 4 3
4: 5 3
My first guess was dt[.N>2,list(.N),by=f2], but that actually keeps entries with .N==1.
dt[.N>2,list(.N),by=f2]
f2 N
1: 1 2
2: 2 1
3: 3 2
The other easy guess, dt[duplicated(dt$f2)], doesn't do the trick, as it keeps one of the 'duplicates' out of the results.
dt[duplicated(dt$f2)]
f1 f2
1: 2 1
2: 5 3
So how can I get this done?
Edited to add example
The question is not clear. Based on the title, it looks like we want to extract all groups with number of rows (.N) greater than 1.
DT[, if(.N>1) .SD, by=f]
But the value v in field f is making it confusing.
If I understand what you're after correctly, you'll need to do some compound queries:
library(data.table)
DT <- data.table(v1 = 1:10, f = c(rep(1:3, 3), 4))
DT[, N := .N, f][N > 2][, N := NULL][]
# v1 f
# 1: 1 1
# 2: 2 2
# 3: 3 3
# 4: 4 1
# 5: 5 2
# 6: 6 3
# 7: 7 1
# 8: 8 2
# 9: 9 3
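And a hedged sketch applying the same idea to the asker's original dt (count the group sizes, keep the f2 values seen more than once, then filter with %in%):
dt <- data.table(f1 = c(1, 2, 3, 4, 5), f2 = c(1, 1, 2, 3, 3))
dt[f2 %in% dt[, .N, by = f2][N > 1, f2]]
#    f1 f2
# 1:  1  1
# 2:  2  1
# 3:  4  3
# 4:  5  3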

Create aggregate output data.table from function returning multiple outputs

I am struggling to solve a particular issue and have searched Stack Overflow, finding examples that are close but not quite what I want.
The example that comes closest is here
This post (here) also comes close, but I can't get my multiple-output function to work with list().
What I want to do, is to create table with aggregated values (min, max, mean, MyFunc) grouped by a key.
I have also have some complex functions that returns multiple outputs. I could return single outputs but that would mean running the complex function many times and would take too long.
Using Matt Dowle's example from this post, with some changes …
x <- data.table(a=1:3,b=1:6)[]
a b
1: 1 1
2: 2 2
3: 3 3
4: 1 4
5: 2 5
6: 3 6
This is the type of output I want: an aggregate table (here only with mean and sum).
agg.dt <- x[ , list(mean=mean(b), sum=sum(b)), by=a][]
a mean sum
1: 1 2.5 5
2: 2 3.5 7
3: 3 4.5 9
This example function f returns 3 outputs. My real function is much more complex, and the constituents can't be split out like this.
f <- function(x) {list(length(x), min(x), max(x))}
Matt Dowle's suggestion in the previous post works great, but doesn't produce an aggregate table; instead, the aggregates are added to the main table (which is also very useful in other circumstances):
x[, c("length","min", "max"):= f(b), by=a][]
a b length min max
1: 1 1 2 1 4
2: 2 2 2 2 5
3: 3 3 2 3 6
4: 1 4 2 1 4
5: 2 5 2 2 5
6: 3 6 2 3 6
What I really want to do (if possible) is something along these lines …
agg.dt <- x[ , list(mean=mean(b)
, sum=sum(b)
, c("length","min", "max") = f(b)
), by=a]
and return an aggregate table looking something like this …
a mean sum length min max
1: 1 2.5 5 2 1 4
2: 2 3.5 7 2 2 5
3: 3 4.5 9 2 3 6
Is the only solution a two-stage process of merging/joining tables together?
library(data.table)
x <- data.table(a=1:3,b=1:6)
#have the function return a named list
f <- function(x) {list(length=length(x),
min=min(x),
max=max(x))}
# c can combine lists
# c(vector, vector, 3-list) is a 5-list
agg.dt <- x[ , c(mean=mean(b),
sum=sum(b),
f(b)),
by=a]
# a mean sum length min max
#1: 1 2.5 5 2 1 4
#2: 2 3.5 7 2 2 5
#3: 3 4.5 9 2 3 6
Alternatively, drop names from f() to save the time and cost of creating the same names for each group:
f <- function(x) {list(length(x),
min(x),
max(x))}
agg.dt <- x[ , c(mean(b),
sum(b),
f(b)),
by=a]
setnames(agg.dt, c("a", "mean","sum","length", "min", "max"))
This drop-names-and-put-them-back-afterwards trick (for speed when you have lots of groups) doesn't reach inside f(). f() could return anything, so that's harder for data.table to optimize automatically.
Just to mention as well that base::list() no longer copies named inputs, as of R 3.1. So the common R idiom of a function f() doing some complex steps and then returning a list() of local variables at the end should be faster now.
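For completeness, a minimal sketch (the two f() variants are renamed here so they can coexist) checking that the named and unnamed routes agree:
f_named   <- function(x) list(length = length(x), min = min(x), max = max(x))
f_unnamed <- function(x) list(length(x), min(x), max(x))
a1 <- x[, c(mean = mean(b), sum = sum(b), f_named(b)), by = a]
a2 <- setnames(x[, c(mean(b), sum(b), f_unnamed(b)), by = a],
               c("a", "mean", "sum", "length", "min", "max"))
identical(a1, a2)  # should be TRUE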

R Data.table for computing summary stats across multiple columns

I have a similar question to R: data.table : searching on multiple columns AND setting data type, but that question did not get fully answered. I have a pairwise table that looks conceptually like the one below. The table is the result of converting a very large distance matrix into a data.table (> 100,000,000 rows), such that the comparison a,b is the same as b,a; however, a and b may appear in either column V1 or V2. I want to compute simple summary statistics using data.table's querying style, but I haven't quite figured out how to select keys in either column. Is this possible?
I've tried setting keys in either direction, but this returns just the data for that column. I also tried using list(), but that returns the intersection (understandably); I hoped for a by=key1|key2, but no such luck.
> set.seed(123)
>
> #create pairwise data
> a<-data.table(t(combn(3,2)))
> #create column that is equal both ways, 1*2 == 2*1
> dat<-a[,data:=V1*V2]
> dat
V1 V2 data
1: 1 2 2
2: 1 3 3
3: 2 3 6
#The id == 2 is the problem here; the mean should be 4 ((2+6)/2)
> #set keys
> setkey(dat,V1,V2)
>
> #One way data
> dat[,c("MEAN","VAR"):=list(mean(data),var(data)),by=V1]
> dat
V1 V2 data MEAN VAR
1: 1 2 2 2.5 0.5
2: 1 3 3 2.5 0.5
3: 2 3 6 6.0 NA
> #The other way
> dat[,c("MEAN","VAR"):=list(mean(data),var(data)),by=V2]
> dat
V1 V2 data MEAN VAR
1: 1 2 2 2.0 NA
2: 1 3 3 4.5 4.5
3: 2 3 6 4.5 4.5
>
> #The intersect just produces the original data
> dat[,c("MEAN","VAR"):=list(mean(data),var(data)),by=list(V1,V2)]
> dat
V1 V2 data MEAN VAR
1: 1 2 2 2 NA
2: 1 3 3 3 NA
3: 2 3 6 6 NA
>
> #Meaningless but hopeful attempt.
> dat[,c("MEAN","VAR"):=list(mean(data),var(data)),by=V1|V2]
> dat
V1 V2 data MEAN VAR
1: 1 2 2 3.666667 4.333333
2: 1 3 3 3.666667 4.333333
3: 2 3 6 3.666667 4.333333
#The goal is to create a table that would look like this (using mean as an example)
ID MEAN
1 2.5
2 4.0
3 4.5
My default idea would be to loop through a dat[V1==x|V2==x] statement, but I don't think I'm harnessing the full power of data.table to return a single column of ids with the mean and var as summary columns.
Thank you!
It'll be easiest to rearrange your data a little to achieve what you want (I'm relying on recycling of data below, to avoid typing c(data, data) in the first part):
dat[, list(c(V1, V2), data)][, list(MEAN = mean(data)), by = V1]
# V1 MEAN
#1: 1 2.5
#2: 2 4.0
#3: 3 4.5
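The same rearrangement extends to the variance; a sketch (the ID column name is mine, chosen to match the goal table):
dat[, list(ID = c(V1, V2), data = data)][, list(MEAN = mean(data), VAR = var(data)), by = ID]
#    ID MEAN VAR
# 1:  1  2.5 0.5
# 2:  2  4.0 8.0
# 3:  3  4.5 4.5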

R data.table subsetting on multiple conditions

With the below data set, how do I write a data.table call that subsets this table and returns all customer IDs and associated orders for each customer, IF that customer has ever purchased SKU 1?
The expected result should exclude cids 3 and 5 (who never purchased SKU 1) and return every row for the customers who do have a row with sku == 1.
I am getting stuck because I don't know how to write a "contains" statement; a literal == returns only the rows whose sku matches the condition... I am sure there is a better way.
library("data.table")
df <- data.frame(cid = c(1,1,1,1,1,2,2,2,2,2,3,4,5,5,6,6),
                 order = c(1,1,1,2,3,4,4,4,5,5,6,7,8,8,9,9),
                 sku = c(1,2,3,2,3,1,2,3,1,3,2,1,2,3,1,2))
dt <- as.data.table(df)
This is similar to a previous answer, but here the subsetting works in a more data.table-like manner.
First, let's take the cids that meet our condition:
matching_cids = dt[sku==1, cid]
The %in% operator allows us to filter to just those items that are contained in the list. So, using the above:
dt[cid %in% matching_cids]
or on one line:
> dt[cid %in% dt[sku==1, cid]]
cid order sku
1: 1 1 1
2: 1 1 2
3: 1 1 3
4: 1 2 2
5: 1 3 3
6: 2 4 1
7: 2 4 2
8: 2 4 3
9: 2 5 1
10: 2 5 3
11: 4 7 1
12: 6 9 1
13: 6 9 2
I would have thought it was more (?!) data.table to use keys. I couldn't quite work out how to put the whole lot on a single line, but I think this would be a bit quicker on large data, because, as I understand it (and I may very well be mistaken), this is the only solution presented thus far that avoids vector scanning (which is slow compared to binary search):
# Set initial key
setkey(dt,sku)
# Select only rows with 1 in the sku and return first example of each, setting key to customer id
dts <- dt[ J(1) , .SD[1] , keyby = cid ]
# change key of dt to cid to match customer id
setkey(dt,cid)
# join based on common key
dt[dts,.SD]
# cid order sku
# 1: 1 1 1
# 2: 1 1 2
# 3: 1 2 2
# 4: 1 1 3
# 5: 1 3 3
# 6: 2 4 1
# 7: 2 5 1
# 8: 2 4 2
# 9: 2 4 3
#10: 2 5 3
#11: 4 7 1
#12: 6 9 1
#13: 6 9 2
An alternative that you can do on one line is to use a data.table merge like so...
setkey(dt,sku)
merge( dt[ J(1) , .SD[1] , keyby = cid ] , dt , by = "cid" )
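On newer versions of data.table (1.9.6 or later; an aside beyond the original answer) you can also write the join without setting keys at all, using on=:
# Sketch: build the set of matching cids, then join back on cid
# (selects the same rows as the %in% approach above; order may differ)
dt[dt[sku == 1, .(cid = unique(cid))], on = "cid"]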
