R data.table for computing summary stats across multiple columns

I have a similar question to R: data.table: searching on multiple columns AND setting data type, but that question did not get fully answered. I have a pairwise table that looks conceptually like the one below. The table is the result of converting a very large distance matrix into a data.table (> 100,000,000 rows), such that the comparison a,b is the same as b,a. However, a and b may appear in either column V1 or V2. I want to compute simple summary statistics using data.table's querying style, but I haven't quite figured out how to select keys in either column. Is this possible?
I've tried setting keys in either direction, but this returns just the data for that column. I also tried using list(), but that returns the intersection (understandably); I hoped for a by=key1|key2, but no such luck.
> library(data.table)
> set.seed(123)
>
> #create pairwise data
> a<-data.table(t(combn(3,2)))
> #create column that is equal both ways, 1*2 == 2*1
> dat<-a[,data:=V1*V2]
> dat
V1 V2 data
1: 1 2 2
2: 1 3 3
3: 2 3 6
#The id ==2 is the problem here, the mean should be 4 ((2+6)/2)
> #set keys
> setkey(dat,V1,V2)
>
> #One way data
> dat[,c("MEAN","VAR"):=list(mean(data),var(data)),by=V1]
> dat
V1 V2 data MEAN VAR
1: 1 2 2 2.5 0.5
2: 1 3 3 2.5 0.5
3: 2 3 6 6.0 NA
> #The other way
> dat[,c("MEAN","VAR"):=list(mean(data),var(data)),by=V2]
> dat
V1 V2 data MEAN VAR
1: 1 2 2 2.0 NA
2: 1 3 3 4.5 4.5
3: 2 3 6 4.5 4.5
>
> #The intersect just produces the original data
> dat[,c("MEAN","VAR"):=list(mean(data),var(data)),by=list(V1,V2)]
> dat
V1 V2 data MEAN VAR
1: 1 2 2 2 NA
2: 1 3 3 3 NA
3: 2 3 6 6 NA
>
> #Meaningless but hopeful attempt.
> dat[,c("MEAN","VAR"):=list(mean(data),var(data)),by=V1|V2]
> dat
V1 V2 data MEAN VAR
1: 1 2 2 3.666667 4.333333
2: 1 3 3 3.666667 4.333333
3: 2 3 6 3.666667 4.333333
#The goal is to create a table that would look like this (using mean as an example)
ID MEAN
1 2.5
2 4.0
3 4.5
My default idea would be to loop through a dat[V1==x|V2==x] statement, but I don't think I'm harnessing the full power of data.table to return a single column of IDs with the mean and var as summary columns.
Thank you!

It'll be easiest to rearrange your data a little to achieve what you want (I'm relying on recycling of data below so I don't have to type c(data, data) in the first part):
dat[, list(c(V1, V2), data)][, list(MEAN = mean(data)), by = V1]
# V1 MEAN
#1: 1 2.5
#2: 2 4.0
#3: 3 4.5
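For completeness, the same rearrangement can feed both requested statistics at once. This is a sketch using the dat built in the question; the ID column name is my own choice:
# Stack V1 and V2 into one ID column (data recycles to match), then aggregate both stats
long <- dat[, list(ID = c(V1, V2), data)]
long[, list(MEAN = mean(data), VAR = var(data)), by = ID]
#    ID MEAN VAR
# 1:  1  2.5 0.5
# 2:  2  4.0 8.0
# 3:  3  4.5 4.5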

Related

How to crosstabulate the missings with data.table

Say we have this toy example:
library(data.table)
prueba <- data.table(aa=1:7, bb=c(1,2,NA,NA,3,1,1),
                     cc=c(1,2,NA,NA,3,1,1), YEAR=c(1,1,1,2,2,2,2))
aa bb cc YEAR
1: 1 1 1 1
2: 2 2 2 1
3: 3 NA NA 1
4: 4 NA NA 2
5: 5 3 3 2
6: 6 1 1 2
7: 7 1 1 2
I want to create a table with the values of something by YEAR.
In this simple example I will just ask for the table that says how many missing and non-missing values I have.
This is an ugly way to do it, specifying everything by hand:
prueba[,.(sum(is.na(.SD)),sum(!is.na(.SD))), by=YEAR]
Though it doesn't automatically label the new columns, we can see it says I have 2 missing and 7 non-missing values for year 1, and so on:
YEAR V1 V2
1: 1 2 7
2: 2 2 10
It works, but what I would really like is to be able to use table() or some equivalent data.table command instead of specifying every term by hand. That would be much more efficient if I have many of them, or if we don't know them beforehand.
I've tried with:
prueba[,table(is.na(.SD)), by=YEAR]
but it doesn't work, I get this:
YEAR V1
1: 1 7
2: 1 2
3: 2 10
4: 2 2
How can I get the same format as above?
I've tried, without luck, using as.data.table, unlist, lapply, and other things. I think some people use dcast but I don't know how to use it here.
Is there a simple way to do it?
My real table is very large.
Is it better to use the names of the columns instead of .SD?
You can convert the table to a list if you want it as two separate columns
prueba[, as.list(table(is.na(.SD))), by=YEAR]
# YEAR FALSE TRUE
# 1: 1 7 2
# 2: 2 10 2
I suggest not using TRUE and FALSE as column names though.
prueba[, setNames(as.list(table(is.na(.SD))), c('notNA', 'isNA'))
, by = YEAR]
# YEAR notNA isNA
# 1: 1 7 2
# 2: 2 10 2
Another option is to add a new column and then dcast
na_summ <- prueba[, table(is.na(.SD)), by = YEAR]
na_summ[, vname := c('notNA', 'isNA'), YEAR]
dcast(na_summ, YEAR ~ vname, value.var = 'V1')
# YEAR isNA notNA
# 1: 1 2 7
# 2: 2 2 10
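If a per-column breakdown is wanted rather than the overall total (touching on the question about using column names instead of .SD), a sketch along these lines should also work; restricting .SDcols to bb and cc is my own assumption:
# Count NAs per column by YEAR (assumes only bb and cc can contain NAs)
prueba[, lapply(.SD, function(x) sum(is.na(x))), by = YEAR, .SDcols = c('bb', 'cc')]
#    YEAR bb cc
# 1:    1  1  1
# 2:    2  1  1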

R - efficient comparison of subsets of rows between data frames

Thank you for any help.
I need to check the total number of matches between the elements of each row of one data frame (df1) and the rows of another data frame (df2).
The data frames have different numbers of columns (5 in the first one versus 6 in the second one, for instance), and there is no exact formation rule for the rows (so I cannot find a way of doing this through combinatorial analysis).
This routine must check all the rows of the first data frame against all the rows of the second data frame, returning the total number of occurrences for a given number of hits.
Not all the possible sums are of interest. Actually I am looking for a specific total (which I call "hits" in this text).
In other words: how many times can a subset of size "hits" of each row of df2 be found in the rows of df1?
Here is an example:
> ### Example
> ### df1 and df2 here are regularly formed just for illustration purposes
>
> require(combinat)
>
> df1 <- as.data.frame(t(combn(6,5)))
> df2 <- as.data.frame(t(combn(7,6)))
>
> df1
V1 V2 V3 V4 V5
1 1 2 3 4 5
2 1 2 3 4 6
3 1 2 3 5 6
4 1 2 4 5 6
5 1 3 4 5 6
6 2 3 4 5 6
>
> df2
V1 V2 V3 V4 V5 V6
1 1 2 3 4 5 6
2 1 2 3 4 5 7
3 1 2 3 4 6 7
4 1 2 3 5 6 7
5 1 2 4 5 6 7
6 1 3 4 5 6 7
7 2 3 4 5 6 7
>
In this example, note, for instance, that subsets of size 5 from row #1 of df2 can be found 6 times in the rows of df1, and so on.
I tried something like this:
> ### Check how many times subsets of size "hits" from rows from df2 are found in rows of df1
>
> myfn <- function(dfa,dfb,hits) {
+ sapply(c(1:dim(dfb)[1]),function(y) { sum(c(apply(dfa,1,function(x,i) { sum(x %in% dfb[i,]) },i=y))==hits) })
+ }
>
> r1 <- myfn(df1,df2,5)
>
> cbind(df2,"hits.eq.5" = r1)
V1 V2 V3 V4 V5 V6 hits.eq.5
1 1 2 3 4 5 6 6
2 1 2 3 4 5 7 1
3 1 2 3 4 6 7 1
4 1 2 3 5 6 7 1
5 1 2 4 5 6 7 1
6 1 3 4 5 6 7 1
7 2 3 4 5 6 7 1
This seems to do what I need, but it is too slow! I need to use this routine on large data frames (about 200 K rows).
I am currently using R 3.1.2 GUI 1.65 Mavericks build (6833)
Can anyone provide a faster or more clever way of doing this? Thank you again.
Best regards,
Vaccaro
Using apply(...) on data frames is very inefficient. This is because apply(...) takes a matrix as argument, so if you pass a data frame it will coerce that to a matrix. In your example you convert df1 to a matrix every time you call apply(...), which is nrow(df2) times.
Also, by using sapply(1:nrow(df2),...) and dfb[i,] you are using data frame row indexing, which is also very inefficient. You are much better off converting everything to matrix class at the beginning and then using apply(...) twice.
Finally, there is no reason to use a call to c(...). apply(...) already returns a vector (in this case), so you are just incurring the overhead of another function call to no effect.
Doing these things alone speeds up your code by about a factor of 20.
set.seed(1)
nrows <- 100
df1 <- data.frame(matrix(sample(1:5,5*nrows,replace=TRUE),nc=5))
df2 <- data.frame(matrix(sample(1:6,6*nrows,replace=TRUE),nc=6))
myfn <- function(dfa,dfb,hits) {
sapply(c(1:dim(dfb)[1]),function(y) { sum(c(apply(dfa,1,function(x,i) { sum(x %in% dfb[i,]) },i=y))==hits) })
}
myfn.2 <- function(dfa,dfb,hits) {
ma <- as.matrix(dfa)
mb <- as.matrix(dfb)
apply(mb,1,function(y) { sum(apply(ma,1,function(x) { sum(x %in% y) })==hits) })
}
system.time(r1<-myfn(df1,df2,3))
# user system elapsed
# 1.99 0.00 2.00
system.time(r2<-myfn.2(df1,df2,3))
# user system elapsed
# 0.09 0.00 0.10
identical(r1,r2)
# [1] TRUE
There is another approach which takes advantage of the fact that R is extremely efficient at manipulating lists. Since a data frame is just a list of vectors, we can improve performance by putting your rows into data frame columns and then using sapply(...) on that. This is faster than myfn.2(...) above, but only by about 20%.
myfn.3 <-function(dfa,dfb,hits) {
df1.t <- data.frame(t(dfa)) # rows into columns
df2.t <- data.frame(t(dfb))
sapply(df2.t,function(col2)sum(sapply(df1.t,function(col1)sum(col1 %in% col2)==hits)))
}
library(microbenchmark)
microbenchmark(myfn.2(df1,df2,5),myfn.3(df1,df2,5),times=10)
# Unit: milliseconds
# expr min lq median uq max neval
# myfn.2(df1, df2, 5) 92.84713 94.06418 96.41835 98.44738 99.88179 10
# myfn.3(df1, df2, 5) 75.53468 77.44348 79.24123 82.28033 84.12457 10
If you really have a dataset with 55MM rows, then I think you need to rethink this problem. I have no idea what you are trying to accomplish, but this seems like a brute force approach.
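As an illustration of a less brute-force route (this is my own sketch, not part of the original answer, and it assumes the values are small positive integers), the pairwise overlap counts can be obtained from a single matrix product instead of nested apply() loops:
# Sketch (assumes all values are positive integers)
myfn.4 <- function(dfa, dfb, hits) {
  ma <- as.matrix(dfa)
  mb <- as.matrix(dfb)
  vals <- max(ma, mb)
  cnt_a <- t(apply(ma, 1, tabulate, nbins = vals))                          # value counts per df1 row
  in_b  <- t(apply(mb, 1, function(y) tabulate(unique(y), nbins = vals)))   # value presence per df2 row
  overlap <- cnt_a %*% t(in_b)       # overlap[i, j] equals sum(dfa[i, ] %in% dfb[j, ])
  colSums(overlap == hits)
}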

Create aggregate output data.table from function returning multiple output

I am struggling with solving a particular issue I have and I have searched stackoverflow and found examples that are close but not quite what I want.
The example that comes closest is here
This post (here) also comes close but I can't get my multiple output function to work with list()
What I want to do is to create a table with aggregated values (min, max, mean, MyFunc) grouped by a key.
I also have some complex functions that return multiple outputs. I could return single outputs, but that would mean running the complex function many times and would take too long.
Using Matt Dowle's example from this post with some changes …
x <- data.table(a=1:3,b=1:6)[]
a b
1: 1 1
2: 2 2
3: 3 3
4: 1 4
5: 2 5
6: 3 6
This is the type of output I want: an aggregate table (here only with mean and sum).
agg.dt <- x[ , list(mean=mean(b), sum=sum(b)), by=a][]
a mean sum
1: 1 2.5 5
2: 2 3.5 7
3: 3 4.5 9
This example function f returns 3 outputs. My real function is much more complex, and the constituents can't be split out like this.
f <- function(x) {list(length(x), min(x), max(x))}
Matt Dowle's suggestion from the previous post works great, but it doesn't produce an aggregate table; instead the aggregates are added to the main table (which is also very useful in other circumstances).
x[, c("length","min", "max"):= f(b), by=a][]
a b length min max
1: 1 1 2 1 4
2: 2 2 2 2 5
3: 3 3 2 3 6
4: 1 4 2 1 4
5: 2 5 2 2 5
6: 3 6 2 3 6
What I really want to do (if possible) is something along these lines …
agg.dt <- x[ , list(mean=mean(b)
, sum=sum(b)
, c("length","min", "max") = f(b)
), by=a]
and return an aggregate table looking something like this …
a mean sum length min max
1: 1 2.5 5 2 1 4
2: 2 3.5 7 2 2 5
3: 3 4.5 9 2 3 6
I can only really see a solution where this is a two-stage process of merging/joining tables together?
library(data.table)
x <- data.table(a=1:3,b=1:6)
#have the function return a named list
f <- function(x) {list(length=length(x),
min=min(x),
max=max(x))}
# c can combine lists
# c(vector, vector, 3-list) is a 5-list
agg.dt <- x[ , c(mean=mean(b),
sum=sum(b),
f(b)),
by=a]
# a mean sum length min max
#1: 1 2.5 5 2 1 4
#2: 2 3.5 7 2 2 5
#3: 3 4.5 9 2 3 6
Alternatively, drop names from f() to save the time and cost of creating the same names for each group:
f <- function(x) {list(length(x),
min(x),
max(x))}
agg.dt <- x[ , c(mean(b),
sum(b),
f(b)),
by=a]
setnames(agg.dt, c("a", "mean","sum","length", "min", "max"))
This drop-names-and-put-them-back-afterwards trick (for speed when you have lots of groups) doesn't reach inside f(). f() could return anything, so that's harder for data.table to optimize automatically.
Just to mention as well that base::list() no longer copies named inputs, as of R 3.1. So the common R idiom of a function f() doing some complex steps and then returning a list() of local variables at the end should be faster now.
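For illustration, the idiom referred to above looks like this (the helper g() is hypothetical, not from the post):
# Hypothetical example: compute intermediates, then return a named list of them
g <- function(x) {
  n   <- length(x)
  rng <- range(x)
  list(n = n, min = rng[1], max = rng[2])   # as of R 3.1, naming these no longer copies them
}
x[, c(mean = mean(b), sum = sum(b), g(b)), by = a]
#    a mean sum n min max
# 1: 1  2.5   5 2   1   4
# 2: 2  3.5   7 2   2   5
# 3: 3  4.5   9 2   3   6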

Compute mean of vectors in data.table

I am implementing k-means. These are my main data structures:
dt1 is a data.table with {Filename, featureVector, GroupItBelongsTo}
dt1<- data.table(Filename=files[1:limit],Vector=list(),G=-1)
setkey(dt1,Filename)
featureVector is a list. It has words associated with their occurrence counts; I am adding the occurrence count for each word using this line:
featureVector[[item]] <- emaildt[email==item]$N
A typical excerpt from my console when I call dt1 is:
Filename Vector G
1: 000057219a473629b53d33cfedef590f.txt 1,1,1,1,1,1, 3
2: 00007afb5a5e467a39e517ae87e7fad5.txt 0,0,0,0,0,0, 3
3: 000946d248fdb1d5d05c59a91b00e8f2.txt 0,0,0,0,0,0, 3
4: 000bea8dc6f716a2cac6f25bdbe09073.txt 0,0,0,0,0,0, 3
I now want to compute new centroids for each group number. That is, I want to sum the vectors element-wise (all values at position 1, all values at position 2, and so on until the end) and then average them.
Example: v1=[1,1,1], v2=[2,2,2]; I would expect the centroid to be c1=[1.5, 1.5, 1.5].
I tried sapply(dt1[tt]$Vector, mean) (and also with "sum"), and it sums and "means" row-wise (inside each vector), not column-wise (each n-th component) like I would like it to do.
How to do it?
====Update, answering a question in comments====
> head(dt1)
Filename Vector G
1: 000057219a473629b53d33cfedef590f.txt 1,1,1,1,1,1, 1
2: 00007afb5a5e467a39e517ae87e7fad5.txt 0,0,0,0,0,0, 1
3: 000946d248fdb1d5d05c59a91b00e8f2.txt 0,0,0,0,0,0, 3
4: 000bea8dc6f716a2cac6f25bdbe09073.txt 0,0,0,0,0,0, 4
5: 000fcfac9e0a468a27b5e2ad0f78d842.txt 0,0,0,0,0,0, 1
6: 00166a4964d6c939f8f62280b85e706d.txt 0,0,0,1,0,0, 1
> class(dt1)
[1] "data.table" "data.frame"
>
Typing dt1$Vector gives (I only copied a small sample; it has many more words but they all look the same):
[[1]]
homosexuality articles church people interest
1 1 1 1 1
thread email send warning worth
1 1 1 1 1
And here is the class() output
> class(dt1$Vector)
[1] "list"
Screenshots when typing:
A<-as.matrix(t(as.data.frame(dt1$Vector)))
Result of class(dt1$Vector[[1]]):
[1] "numeric"
First (the obligatory suggestion): you might consider using the built-in R function kmeans to do your k-means clustering. If you prefer to roll your own, you can easily compute centroids of a data.table as follows. To start, I'll build some random data that looks like yours:
> set.seed(123)
> dt<-data.table(name=LETTERS[1:20],replicate(5,sample(0:4,20,T)),G=sample(3,20,T))
> head(dt)
name V1 V2 V3 V4 V5 G
1: A 1 4 0 3 1 2
2: B 3 3 2 0 3 1
3: C 2 3 2 1 2 2
4: D 4 4 1 1 3 3
5: E 4 3 0 4 0 2
6: F 0 3 0 2 2 3
The centroids can be computed in one line:
> dt[,lapply(.SD[,-1],mean),by=G]
G V1 V2 V3 V4 V5
1: 2 2.375000 2.250000 1.25 2.125000 2.250000
2: 1 2.800000 2.400000 2.40 1.800000 1.400000
3: 3 1.714286 2.428571 1.00 2.142857 1.857143
If you're going to do this, you might want to drop the names from the data table (temporarily), in which case you can just do:
> dt2<-copy(dt)
> dt2$name<-NULL
> dt2[,lapply(.SD,mean),by=G]
G V1 V2 V3 V4 V5
1: 2 2.375000 2.250000 1.25 2.125000 2.250000
2: 1 2.800000 2.400000 2.40 1.800000 1.400000
3: 3 1.714286 2.428571 1.00 2.142857 1.857143
Edit: a better way to do this, suggested by @Roland, is to use .SDcols:
dt[,lapply(.SD,mean),by=G,.SDcols=2:6]
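Coming back to the list-column layout in the original question, a sketch along these lines should also work (it assumes every Vector element has the same length and word order): bind each group's vectors into a matrix and take column means.
# Element-wise mean of the vectors within each group G (assumes equal-length, aligned vectors)
centroids <- dt1[, as.list(colMeans(do.call(rbind, Vector))), by = G]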

data.table aggregations that return vectors, such as scale()

I have recently been working with much larger datasets and have started learning and migrating to data.table to improve the performance of aggregation/grouping. I have been unable to get certain expressions or functions to group as expected. Here is an example of a basic group-by operation that I am having trouble with.
library(data.table)
category <- rep(1:10, 10)
value <- rnorm(100)
df <- data.frame(category, value)
dt <- data.table(df)
If I want to simply calculate the mean for each group by category. This works easily enough.
dt[,mean(value),by="category"]
category V1
1: 1 -0.67555478
2: 2 -0.50438413
3: 3 0.29093723
4: 4 -0.41684790
5: 5 0.33921764
6: 6 0.01970997
7: 7 -0.23684245
8: 8 -0.04280998
9: 9 0.01838804
10: 10 0.44295978
I run into problems if I try to use the scale function, or even a simple expression subtracting the group mean from the value. The grouping appears to be ignored and the function/expression is applied to each row instead. The following returns all 100 rows instead of 10 grouped rows (one per category).
dt[,scale(value),by="category"]
dt[,value-mean(value),by="category"]
I thought recreating scale as a function that returns a numeric vector instead of a matrix might help.
zScore <- function(x) {
z=(x-mean(x,na.rm=TRUE))/sd(x,na.rm = TRUE)
return(z)
}
dt[,zScore(value),by="category"]
category V1
1: 1 -1.45114132
2: 1 -0.35304528
3: 1 -0.94075418
4: 1 1.44454416
5: 1 1.39448268
6: 1 0.55366652
....
97: 10 -0.43190602
98: 10 -0.25409244
99: 10 0.35496694
100: 10 0.57323480
category V1
This also returns the zScore function applied to all rows (N=100), ignoring the grouping. What am I missing in order to get scale() or a custom function to respect the grouping like mean() did above?
You've clarified in the comments that you'd like the same behaviour as:
ddply(df,"category",transform, zscorebycategory=zScore(value))
which gives:
category value zscorebycategory
1 1 0.28860691 0.31565682
2 1 1.17473759 1.33282374
3 1 0.06395503 0.05778463
4 1 1.37825487 1.56643607
etc
The data table option you gave gives:
category V1
1: 1 0.31565682
2: 1 1.33282374
3: 1 0.05778463
4: 1 1.56643607
etc
This is exactly the same data. However, you'd also like to repeat the value column in your result and rename the V1 variable to something more descriptive. data.table gives you the grouping variable in the result, along with the result of the expression you provide. So let's modify that to give the columns you'd like:
Your
dt[,zScore(value),by="category"]
becomes:
dt[,list(value=value, zscorebycategory=zScore(value)),by="category"]
Where the named items in the list become columns in the result.
plyr = data.table(ddply(df,"category",transform, zscorebycategory=zScore(value)))
dt = dt[,list(value=value, zscorebycategory=zScore(value)),by="category"]
identical(plyr, dt)
[1] TRUE
(note I converted your ddply data.frame result into a data.table, to allow the identical command to work).
Your claim that data.table does not group is wrong:
library(data.table)
category <- rep(1:2, each=4)
value <- c(rep(c(1:2),each=2),rep(c(4,10),each=2))
dt <- data.table(category, value)
category value
1: 1 1
2: 1 1
3: 1 2
4: 1 2
5: 2 4
6: 2 4
7: 2 10
8: 2 10
dt[,value-mean(value),by=category]
category V1
1: 1 -0.5
2: 1 -0.5
3: 1 0.5
4: 1 0.5
5: 2 -3.0
6: 2 -3.0
7: 2 3.0
8: 2 3.0
If you want to scale/transform, this is exactly the behavior you want, because these operations by definition return an object of the same size as their input.
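If the goal is instead to keep every row and simply attach the within-group z-score as a new column, a := assignment by group also works; this sketch reuses the zScore() helper defined in the question, and the column name is my own choice:
# Add the within-group z-score as a column, keeping all rows
dt[, zscorebycategory := zScore(value), by = category]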
