Create aggregate output data.table from function returning multiple outputs - r

I am struggling with a particular issue, and I have searched Stack Overflow and found examples that are close but not quite what I want.
The example that comes closest is here
This post (here) also comes close, but I can't get my multiple-output function to work with list().
What I want to do is create a table of aggregated values (min, max, mean, MyFunc) grouped by a key.
I also have some complex functions that return multiple outputs. I could return single outputs, but that would mean running the complex function many times, and that would take too long.
Using Matt Dowle's example from this post, with some changes …
x <- data.table(a=1:3,b=1:6)[]
a b
1: 1 1
2: 2 2
3: 3 3
4: 1 4
5: 2 5
6: 3 6
This is the type of output I want. An aggregate table (here only with mean and sum)
agg.dt <- x[ , list(mean=mean(b), sum=sum(b)), by=a][]
a mean sum
1: 1 2.5 5
2: 2 3.5 7
3: 3 4.5 9
This example function f returns 3 outputs. My real function is much more complex, and the constituents can't be split out like this.
f <- function(x) {list(length(x), min(x), max(x))}
Matt Dowle's suggestion on the previous post works great, but doesn't produce an aggregate table; instead, the aggregates are added to the main table (which is also very useful in other circumstances):
x[, c("length","min", "max"):= f(b), by=a][]
a b length min max
1: 1 1 2 1 4
2: 2 2 2 2 5
3: 3 3 2 3 6
4: 1 4 2 1 4
5: 2 5 2 2 5
6: 3 6 2 3 6
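I could collapse that augmented table back down to one row per group afterwards, e.g. (a sketch):
agg.dt <- unique(x[, list(a, length, min, max)])
# a length min max
# 1: 1 2 1 4
# 2: 2 2 2 5
# 3: 3 2 3 6
But a single aggregation step would be cleaner.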
What I really want to do (if possible) is something along these lines …
agg.dt <- x[ , list(mean = mean(b),
                    sum = sum(b),
                    c("length", "min", "max") = f(b)   # not valid syntax
                    ), by = a]
and return an aggregate table looking something like this …
a mean sum length min max
1: 1 2.5 5 2 1 4
2: 2 3.5 7 2 2 5
3: 3 4.5 9 2 3 6
I can only really see a solution where this is a two-stage process, merging/joining tables together?
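For example, a two-step sketch (aggregate twice, then join on the key) might be:
agg1 <- x[, list(mean = mean(b), sum = sum(b)), by = a]
agg2 <- x[, f(b), by = a]                    # unnamed f() gives columns V1, V2, V3
setnames(agg2, c("a", "length", "min", "max"))
agg.dt <- merge(agg1, agg2, by = "a")
Is there a way to avoid the join?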

library(data.table)
x <- data.table(a=1:3,b=1:6)
# have the function return a named list
f <- function(x) { list(length = length(x),
                        min    = min(x),
                        max    = max(x)) }
# c() can combine lists:
# c(vector, vector, 3-list) is a 5-list
agg.dt <- x[ , c(mean = mean(b),
                 sum  = sum(b),
                 f(b)),
             by = a]
# a mean sum length min max
#1: 1 2.5 5 2 1 4
#2: 2 3.5 7 2 2 5
#3: 3 4.5 9 2 3 6
Alternatively, drop names from f() to save the time and cost of creating the same names for each group:
f <- function(x) { list(length(x),
                        min(x),
                        max(x)) }
agg.dt <- x[ , c(mean(b),
                 sum(b),
                 f(b)),
             by = a]
setnames(agg.dt, c("a", "mean","sum","length", "min", "max"))
This drop-names-and-put-them-back-afterwards trick (for speed when you have lots of groups) doesn't reach inside f(). f() could return anything, so that's harder for data.table to optimize automatically.
Just to mention as well that base::list() no longer copies named inputs, as from R 3.1. So the common R idiom of a function f() doing some complex steps then returning a list() of local variables at the end should be faster now.
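For instance, that idiom looks like this (an illustrative sketch only):
# Do the expensive work once, then return a named list of the local
# results; as of R 3.1, list() no longer copies these named inputs.
f <- function(x) {
  n   <- length(x)    # imagine several expensive intermediate steps here
  rng <- range(x)
  list(length = n, min = rng[1], max = rng[2])
}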

Related

Extract data from data.frame based on coordinates in another data.frame

So here is my problem. I have a really big data.frame with two columns: the first one represents x coordinates (rows) and the other one y coordinates (columns). For example:
x y
1 1
2 3
3 1
4 2
3 4
In another frame I have some data (numbers actually):
a b c d
8 7 8 1
1 2 3 4
5 4 7 8
7 8 9 7
1 5 2 3
I would like to add a third column in first data.frame with data from second data.frame based on coordinates from first data.frame. So the result should look like this:
x y z
1 1 8
2 3 3
3 1 5
4 2 8
3 4 8
Since my data.frames are really big, for loops are too slow. I think there is a way to do this with the apply family, but I can't find how. Thanks in advance (and sorry for the ugly message layout; this is my first post here and I don't know how to produce the nice layout with code and proper data.frames seen in other questions).
This is a simple indexing question. No need for external packages or *apply loops; just do
df1$z <- df2[as.matrix(df1)]
df1
# x y z
# 1 1 1 8
# 2 2 3 3
# 3 3 1 5
# 4 4 2 8
# 5 3 4 8
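A self-contained version, reconstructing df1 and df2 from the question (a sketch):
# coordinates (x = row, y = column) and the value table from the question
df1 <- data.frame(x = c(1, 2, 3, 4, 3), y = c(1, 3, 1, 2, 4))
df2 <- data.frame(a = c(8, 1, 5, 7, 1), b = c(7, 2, 4, 8, 5),
                  c = c(8, 3, 7, 9, 2), d = c(1, 4, 8, 7, 3))
# indexing a data.frame with a two-column matrix picks one
# (row, column) cell per row of the matrix
df1$z <- df2[as.matrix(df1)]
df1$z
# [1] 8 3 5 8 8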
A base R solution (df1 and df2 are the coordinates and the numbers as data frames):
df1$z <- mapply(function(x,y) df2[x,y], df1$x, df1$y )
It works if the last y in the first data frame is corrected from 5 to 4.
I guess it was a typo, since you don't have 5 columns in the second data frame.
Here's how I would do this.
First, use data.table for fast merging; convert your data frames (I'll call the coordinates dt1 and the values vals) to data.tables:
dt1 <- data.table(df1)
vals <- data.table(vals)
Second, put vals into a new data.table in long form, one row per (x, y) coordinate pair:
vals_dt <- data.table(x = rep(1:dim(vals)[1], dim(vals)[2]),
                      y = rep(1:dim(vals)[2], each = dim(vals)[1]),
                      z = unlist(vals, use.names = FALSE),
                      key = c("x", "y"))
Now merge (note that setkey() reorders dt1):
setkey(dt1, x, y)
dt1[vals_dt, z := i.z]
You can also try the data.table package and update df1 by reference:
library(data.table)
setDT(df1)[, z := df2[cbind(x, y)]][]
# x y z
# 1: 1 1 8
# 2: 2 3 3
# 3: 3 1 5
# 4: 4 2 8
# 5: 3 4 8

R Data.table for computing summary stats across multiple columns

I have a similar question to: R: data.table : searching on multiple columns AND setting data type, but that question did not get fully answered. I have a pairwise table that looks conceptually like the one below. The table is the result of converting a very large distance matrix into a data.table (> 100,000,000 rows), such that the comparison a,b is the same as b,a. However, a and b may appear in either column V1 or V2. I want to compute simple summary statistics using data.table's querying style, but I haven't quite figured out how to select keys in either column. Is this possible?
I've tried setting keys in either direction, but this returns just the data for that column. I also tried using list(), but that returns the intersection (understandably). I hoped for a by=key1|key2, but no such luck.
> set.seed(123)
>
> #create pairwise data
> a<-data.table(t(combn(3,2)))
> #create column that is equal both ways, 1*2 == 2*1
> dat<-a[,data:=V1*V2]
> dat
V1 V2 data
1: 1 2 2
2: 1 3 3
3: 2 3 6
# The id == 2 is the problem here; the mean should be 4 ((2+6)/2)
> #set keys
> setkey(dat,V1,V2)
>
> #One way data
> dat[,c("MEAN","VAR"):=list(mean(data),var(data)),by=V1]
> dat
V1 V2 data MEAN VAR
1: 1 2 2 2.5 0.5
2: 1 3 3 2.5 0.5
3: 2 3 6 6.0 NA
> #The other way
> dat[,c("MEAN","VAR"):=list(mean(data),var(data)),by=V2]
> dat
V1 V2 data MEAN VAR
1: 1 2 2 2.0 NA
2: 1 3 3 4.5 4.5
3: 2 3 6 4.5 4.5
>
> #The intersect just produces the original data
> dat[,c("MEAN","VAR"):=list(mean(data),var(data)),by=list(V1,V2)]
> dat
V1 V2 data MEAN VAR
1: 1 2 2 2 NA
2: 1 3 3 3 NA
3: 2 3 6 6 NA
>
> # Meaningless but hopeful attempt.
> dat[,c("MEAN","VAR"):=list(mean(data),var(data)),by=V1|V2]
> dat
V1 V2 data MEAN VAR
1: 1 2 2 3.666667 4.333333
2: 1 3 3 3.666667 4.333333
3: 2 3 6 3.666667 4.333333
# The goal is to create a table that would look like this (using mean as an example)
ID MEAN
1 2.5
2 4.0
3 4.5
My default idea would be to loop through a dat[V1==x|V2==x] statement, but I don't think I'm harnessing the full power of data.table to return a single column of ids with the mean and var as summary columns.
Thank you!
It'll be easiest to rearrange your data a little to achieve what you want (I'm relying on recycling of data below so as not to have to type c(data, data) in the first part):
dat[, list(c(V1, V2), data)][, list(MEAN = mean(data)), by = V1]
# V1 MEAN
#1: 1 2.5
#2: 2 4.0
#3: 3 4.5
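The same trick extends to the variance the question also computes (a sketch):
dat[, list(ID = c(V1, V2), data)][, list(MEAN = mean(data), VAR = var(data)), by = ID]
# ID MEAN VAR
#1: 1 2.5 0.5
#2: 2 4.0 8.0
#3: 3 4.5 4.5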

data.table aggregations that return vectors, such as scale()

I have recently been working with much larger datasets and have started learning and migrating to data.table to improve performance of aggregation/grouping. I have been unable to get certain expressions or functions to group as expected. Here is an example of a basic group-by operation that I am having trouble with.
library(data.table)
category <- rep(1:10, 10)
value <- rnorm(100)
df <- data.frame(category, value)
dt <- data.table(df)
If I simply want to calculate the mean for each group by category, this works easily enough:
dt[,mean(value),by="category"]
category V1
1: 1 -0.67555478
2: 2 -0.50438413
3: 3 0.29093723
4: 4 -0.41684790
5: 5 0.33921764
6: 6 0.01970997
7: 7 -0.23684245
8: 8 -0.04280998
9: 9 0.01838804
10: 10 0.44295978
I run into problems if I try to use the scale function, or even a simple expression subtracting the group mean from each value. The grouping seems to be ignored and I get the function/expression applied to each row instead. The following both return all 100 rows instead of 10 rows, one per category.
dt[,scale(value),by="category"]
dt[,value-mean(value),by="category"]
I thought recreating scale as a function that returns a numeric vector instead of a matrix might help.
zScore <- function(x) {
  z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
  return(z)
}
dt[,zScore(value),by="category"]
category V1
1: 1 -1.45114132
2: 1 -0.35304528
3: 1 -0.94075418
4: 1 1.44454416
5: 1 1.39448268
6: 1 0.55366652
....
97: 10 -0.43190602
98: 10 -0.25409244
99: 10 0.35496694
100: 10 0.57323480
category V1
This also returns the zScore function applied to all rows (N=100), apparently ignoring the grouping. What am I missing in order to get scale() or a custom function to respect the grouping, like mean() did above?
You've clarified in the comments that you'd like the same behaviour as:
ddply(df,"category",transform, zscorebycategory=zScore(value))
which gives:
category value zscorebycategory
1 1 0.28860691 0.31565682
2 1 1.17473759 1.33282374
3 1 0.06395503 0.05778463
4 1 1.37825487 1.56643607
etc
The data.table option you gave produces:
category V1
1: 1 0.31565682
2: 1 1.33282374
3: 1 0.05778463
4: 1 1.56643607
etc
Which is exactly the same data. However, you'd also like to repeat the value column in your result and rename the V1 variable to something more descriptive. data.table gives you the grouping variable in the result, along with the result of the expression you provide. So let's modify that to give the rows you'd like:
Your
dt[,zScore(value),by="category"]
becomes:
dt[,list(value=value, zscorebycategory=zScore(value)),by="category"]
Where the named items in the list become columns in the result.
plyr = data.table(ddply(df,"category",transform, zscorebycategory=zScore(value)))
dt = dt[,list(value=value, zscorebycategory=zScore(value)),by="category"]
identical(plyr, dt)
# [1] TRUE
(note I converted your ddply data.frame result into a data.table, to allow the identical command to work).
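A more idiomatic route today (a sketch, starting from the original dt rather than the reassigned one above) is to add the column by reference, which keeps the existing columns automatically:
dt[, zscorebycategory := zScore(value), by = category]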
Your claim that data.table does not group is wrong:
library(data.table)
category <- rep(1:2, each=4)
value <- c(rep(c(1:2),each=2),rep(c(4,10),each=2))
dt <- data.table(category, value)
category value
1: 1 1
2: 1 1
3: 1 2
4: 1 2
5: 2 4
6: 2 4
7: 2 10
8: 2 10
dt[,value-mean(value),by=category]
category V1
1: 1 -0.5
2: 1 -0.5
3: 1 0.5
4: 1 0.5
5: 2 -3.0
6: 2 -3.0
7: 2 3.0
8: 2 3.0
If you want to scale/transform, this is exactly the behavior you want, because these operations by definition return an object of the same size as the input.
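So to attach the per-group z-scores as a column while keeping every row, one option (a sketch, using the dt defined just above) is:
# as.vector() drops the one-column-matrix shape that scale() returns
dt[, value_scaled := as.vector(scale(value)), by = category]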

sort and number within levels of a factor in r

If I have the following data frame G:
z type x
1 a 4
2 a 5
3 a 6
4 b 1
5 b 0.9
6 c 4
I am trying to get:
z type x y
3 a 6 3
2 a 5 2
1 a 4 1
4 b 1 2
5 b 0.9 1
6 c 4 1
I.e., I want to sort the whole data frame within the levels of factor type based on vector x, get the length of each level (a = 3, b = 2, c = 1), and then number in a decreasing fashion in a new vector y.
My starting place is currently with sort()
tapply(y, x, sort)
Would it be best to try using sapply to split everything first?
There are many ways to skin this cat. Here is one solution using base R and vectorized code, in two steps (without any apply):
1. Sort the data using order and xtfrm.
2. Use rle and sequence to generate the sequence.
Replicate your data:
dat <- read.table(text="
z type x
1 a 4
2 a 5
3 a 6
4 b 1
5 b 0.9
6 c 4
", header=TRUE, stringsAsFactors=FALSE)
Two lines of code:
r <- dat[order(dat$type, -xtfrm(dat$x)), ]
r$y <- sequence(rle(r$type)$lengths)
Results in:
r
z type x y
3 3 a 6.0 1
2 2 a 5.0 2
1 1 a 4.0 3
4 4 b 1.0 1
5 5 b 0.9 2
6 6 c 4.0 1
The call to order is slightly complicated. Since you are sorting one column in ascending order and a second in descending order, use the helper function xtfrm. See ?xtfrm for details, but it is also described in ?order.
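An ave()-based alternative to the rle()/sequence() line, using the same sorted r (a sketch, not part of the answer above):
# seq_along ignores the values and just numbers the rows 1..n within each type
r$y <- ave(r$x, r$type, FUN = seq_along)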
I like Andrie's better:
dat <- read.table(text="z type x
1 a 4
2 a 5
3 a 6
4 b 1
5 b 0.9
6 c 4", header=T)
Three lines of code:
dat <- dat[order(dat$type), ]
x <- by(dat, dat$type, nrow)
dat$y <- unlist(sapply(x, function(z) z:1))
I edited my response to adapt to the comments Andrie mentioned. This works, but if you went this route instead of Andrie's, you're crazy.
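For completeness, a data.table sketch of the same task that numbers in the decreasing fashion the question asked for (my own addition, not from either answer, starting from dat as originally read in):
library(data.table)
setDT(dat)
# within each type, take rows in descending-x order and number .N down to 1
dat[order(type, -x), y := .N:1, by = type]
dat[order(type, -x)]
# z type x y
# 1: 3 a 6.0 3
# 2: 2 a 5.0 2
# 3: 1 a 4.0 1
# 4: 4 b 1.0 2
# 5: 5 b 0.9 1
# 6: 6 c 4.0 1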

Create a vector listing run length of original vector with same length as original vector

This problem seems trivial, but I'm at my wits' end after hours of reading.
I need to generate a vector of the same length as the input vector that lists, for each value of the input vector, the total count for that value. So, by way of example, I would want to generate the last column of this dataframe:
> df
customer.id transaction.count total.transactions
1 1 1 4
2 1 2 4
3 1 3 4
4 1 4 4
5 2 1 2
6 2 2 2
7 3 1 3
8 3 2 3
9 3 3 3
10 4 1 1
I realise this could be done two ways, either by using run lengths of the first column, or grouping the second column using the first and applying a maximum.
I've tried both tapply:
> tapply(df$transaction.count, df$customer.id, max)
And rle:
> rle(df$customer.id)
But both return a vector of shorter length than the original:
[1] 4 2 3 1
Any help gratefully accepted!
You can do it without creating a transaction counter with:
df$total.transactions <- with(df,
  ave(transaction.count, customer.id, FUN = length))
You can use rle with rep to get what you want:
x <- rep(1:4, 4:1)
x
# [1] 1 1 1 1 2 2 2 3 3 4
rep(rle(x)$lengths, rle(x)$lengths)
# [1] 4 4 4 4 3 3 3 2 2 1
For performance purposes, you could store the rle object separately so it is only called once.
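For example:
r <- rle(x)
rep(r$lengths, r$lengths)
# [1] 4 4 4 4 3 3 3 2 2 1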
Or as Karsten suggested with ddply from plyr:
require(plyr)
#Expects data.frame
dat <- data.frame(x = rep(1:4, 4:1))
ddply(dat, "x", transform, total = length(x))
You are probably looking for a split-apply-combine approach; have a look at ddply in the plyr package or the split function in base R.
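In the same spirit, since much of this page uses data.table, a one-line equivalent there (a sketch, not part of this answer) is:
library(data.table)
# .N is the number of rows in the current by-group
setDT(df)[, total.transactions := .N, by = customer.id]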
