merge a data.table with itself after a reference lookup - r

If I have the data.tables DT and neighbors:
set.seed(1)
library(data.table)
DT <- data.table(idx=rep(1:10, each=5), x=rnorm(50), y=letters[1:5], ok=rbinom(50, 1, 0.90))
n <- data.table(y=letters[1:5], y1=letters[c(2:5,1)])
n is a lookup table. Whenever ok == 0, I want to look up the corresponding y1 in n and replace x with the x value from the row that has that y1 as its y and the same idx. By way of example, row 4 of DT:
> DT
idx x y ok
1: 1 -0.6264538 a 1
2: 1 0.1836433 b 1
3: 1 -0.8356286 c 1
4: 1 1.5952808 d 0
5: 1 0.3295078 e 1
6: 2 -0.8204684 a 1
The y1 from n for d is e:
> n[y == 'd']
y y1
1: d e
and idx for row 4 is 1. So I would use:
> DT[idx == 1 & y == 'e', x]
[1] 0.3295078
I want my output to be a data.table just like DT[ok == 0] with all the x values replaced by their appropriate n['y1'] x value:
> output
idx x y ok
1: 1 0.3295078 d 0
2: 2 -0.3053884 d 0
3: 3 0.3898432 a 0
4: 5 0.7821363 a 0
5: 7 1.3586800 e 0
6: 8 0.7631757 d 0
I can think of a few ways of doing this with base R or with plyr... and maybe it's late on Friday... but whatever sequence of merges this would require in data.table is beyond me!

Great question. Using the functions in the other answers and wrapping Blue's answer into a function blue, how about the following. The benchmarks include the time to setkey in all cases.
red = function() {
ans = DT[ok==0]
# Faster than setkey(DT,ok)[J(0)] if the vector scan is just once
# If lots of lookups to "ok" need to be done, then setkey may be worth it
# If DT[,ok:=as.integer(ok)] can be done first, then ok==0L slightly faster
# After extracting ans in the original order of DT, we can now set the key :
setkey(DT,idx,y)
setkey(n,y)
# Now working with the reduced ans ...
ans[,y1:=n[y,y1,mult="first"]]
# Add a new column y1 by reference containing the lookup in n
# mult="first" because we know n's key is unique, for speed (to save looking
# for groups of matches in n). Future version of data.table won't need this.
# Also, mult="first" has the advantage of dropping group columns (so we don't
# need [[2L]]). mult="first"|"last" turns off by-without-by of mult="all".
ans[,x:=DT[ans[,list(idx,y1)],x,mult="first"]]
# Changes the contents of ans$x by reference. The ans[,list(idx,y1)] part is
# how to pick the columns of ans to join to DT's key when they are not the key
# columns of ans and not the first 1:n columns of ans. There is no need to key
# ans, especially since that would change ans's order and not strictly answer
# the question. If idx and y1 were columns 1 and 2 of (unkeyed) ans then we
# wouldn't need that part, just
# ans[,x:=DT[ans,x,mult="first"]]
# would do (relying on DT having 2 columns in its key). That has the advantage
# of not copying the idx and y1 columns into a new data.table to pass as the i
# DT. To save that copy y1 could be moved to column 2 using setcolorder first.
redans <<- ans
}
crdt(1e5)
origDT = copy(DT)
benchmark(blue={DT=copy(origDT); system.time(blue())},
red={DT=copy(origDT); system.time(red())},
fun={DT=copy(origDT); system.time(fun(DT,n))},
replications=3, order="relative")
test replications elapsed relative user.self sys.self user.child sys.child
red 3 1.107 1.000 1.100 0.004 0 0
blue 3 5.797 5.237 5.660 0.120 0 0
fun 3 8.255 7.457 8.041 0.184 0 0
crdt(1e6)
[ .. snip .. ]
test replications elapsed relative user.self sys.self user.child sys.child
red 3 14.647 1.000 14.613 0.000 0 0
blue 3 87.589 5.980 87.197 0.124 0 0
fun 3 197.243 13.466 195.240 0.644 0 0
identical(blueans[,list(idx,x,y,ok,y1)],redans[order(idx,y1)])
# [1] TRUE
The order is needed in the identical because red returns the result in the same order as DT[ok==0] whereas blue appears to be ordered by y1 in the case of ties in idx.
If y1 is unwanted in the result it can be removed instantly (regardless of table size) using ans[,y1:=NULL]; i.e., this can be included above to produce the exact result requested in question, without affecting the timings at all.
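For readers on a newer data.table that supports the `on=` argument, the same two lookups can be sketched without setting any keys. This is only an illustrative variant, not the code that was benchmarked above, and it assumes that each (idx, y) pair occurs exactly once in DT, as it does in the question's data:

```r
library(data.table)

set.seed(1)
DT <- data.table(idx = rep(1:10, each = 5), x = rnorm(50),
                 y = letters[1:5], ok = rbinom(50, 1, 0.90))
n  <- data.table(y = letters[1:5], y1 = letters[c(2:5, 1)])

# Keep the rows of interest in their original order
ans <- DT[ok == 0]

# Update join: add y1 to ans by matching ans$y against n$y
ans[n, y1 := i.y1, on = "y"]

# For each row of ans, fetch DT's x where DT$idx == ans$idx and DT$y == ans$y1.
# In j, the bare name x refers to DT's column, not the one in ans.
ans[, x := DT[ans, x, on = c(idx = "idx", y = "y1")]]

# Drop the helper column by reference
ans[, y1 := NULL]
ans
```

Because nothing is keyed, neither DT nor ans is reordered, so the result stays in the order of DT[ok == 0].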

library(data.table)
crdt <- function(i=10){
set.seed(1)
DT <<- data.table(idx=rep(1:i, each=5), x=rnorm(5*i),
y=letters[1:5], ok=rbinom(5*i, 1, 0.90))
n <<- data.table(y=letters[1:5], y1=letters[c(2:5,1)])
}
fun <- function(DT,n){
setkey(DT,ok)
n1 <- merge(n,DT[J(0),list(y,idx)],by="y")
DT[J(0),x:=DT[paste0(y,idx) %in% paste0(n1[,y1],n1[,idx]),x]]
}
crdt(10)
fun(DT,n)[J(0)]
ok idx x y
[1,] 0 1 0.3295078 d
[2,] 0 2 -0.3053884 d
[3,] 0 3 0.3898432 a
[4,] 0 5 0.7821363 a
[5,] 0 7 1.3586796 e
[6,] 0 8 0.7631757 d
But it is still pretty slow for bigger data.tables:
crdt(1e6)
system.time(fun(DT,n)[J(0)])
User System elapsed
4.213 0.162 4.374
crdt(1e7)
system.time(fun(DT,n)[J(0)])
User System elapsed
195.685 3.949 199.592
I'm interested to learn a faster solution.

Super convoluted answer:
setkey(
setkey(
setkey(DT,y)[setkey(n,y),nomatch=0] #inner joins DT to n
#matches the new x value by idx and y, and assigns it
,idx,y1)[setkey(J(idx,y,new.x=x),idx,y),x:=new.x]
,ok)[list(0)] #pulls things where ok == 0
It looks like Roland's answer is better for smaller tables, but mine eventually catches up at larger sizes. I haven't done a lot of checking, though.
> library(rbenchmark)
> benchmark(fun(DT,n)[J(0)],setkey(setkey(setkey(DT,y)[setkey(n,y),nomatch=0],idx,y1)[setkey(J(idx,y,new.x=x),idx,y),x:=new.x],ok)[list(0)])
test
1 fun(DT, n)[J(0)]
2 setkey(setkey(setkey(DT, y)[setkey(n, y), nomatch = 0], idx, y1)[setkey(J(idx, y, new.x = x), idx, y), `:=`(x, new.x)], ok)[list(0)]
replications elapsed relative user.self sys.self user.child sys.child
1 100 13.21 1.000000 13.08 0.02 NA NA
2 100 15.08 1.141559 14.76 0.06 NA NA
> crdt(1e5)
> benchmark(fun(DT,n)[J(0)],setkey(setkey(setkey(DT,y)[setkey(n,y),nomatch=0],idx,y1)[setkey(J(idx,y,new.x=x),idx,y),x:=new.x],ok)[list(0)])
test
1 fun(DT, n)[J(0)]
2 setkey(setkey(setkey(DT, y)[setkey(n, y), nomatch = 0], idx, y1)[setkey(J(idx, y, new.x = x), idx, y), `:=`(x, new.x)], ok)[list(0)]
replications elapsed relative user.self sys.self user.child sys.child
1 100 150.49 1.000000 148.98 0.89 NA NA
2 100 155.33 1.032162 151.04 2.25 NA NA
>

Related

Efficiently change elements in data based on neighbouring elements

Let me delve right in. Imagine you have data that looks like this:
df <- data.frame(one = c(1, 1, NA, 13),
two = c(2, NA,10, 14),
three = c(NA,NA,11, NA),
four = c(4, 9, 12, NA))
This gives us:
df
# one two three four
# 1 1 2 NA 4
# 2 1 NA NA 9
# 3 NA 10 11 12
# 4 13 14 NA NA
Each row holds the measurements for weeks 1, 2, 3 and 4, respectively. Suppose the numbers represent some accumulated measure since the last time a measurement happened. For example, in row 1, the "4" in column "four" represents a cumulative value of week 3 and 4.
Now I want to "even out" these numbers (feel free to correct my terminology here) by evenly spreading out the measurements to all weeks before the measurement if no measurement took place in the preceding weeks. For instance, row 1 should read
1 2 2 2
since the 4 in the original data represents the cumulative value of 2 weeks (week "three" and "four"), and 4/2 is 2.
The final end result should look like this:
df
# one two three four
# 1 1 2 2 2
# 2 1 3 3 3
# 3 5 5 11 12
# 4 13 14 NA NA
I struggle a bit with how to best approach this. One candidate solution would be to get the indices of all missing values, then to count the length of runs (NAs occurring multiple times), and use that to fill up the values somehow. However, my real data is large, and I think such a strategy might be time consuming. Is there an easier and more efficient way?
A base R solution would be to first identify the indices that need to be replaced, then determine groupings of those indices, finally assigning grouped values with the ave function:
clean <- function(x) {
to.rep <- which(is.na(x) | c(FALSE, head(is.na(x), -1)))
groups <- cumsum(c(TRUE, head(!is.na(x[to.rep]), -1)))
x[to.rep] <- ave(x[to.rep], groups, FUN=function(y) {
rep(tail(y, 1) / length(y), length(y))
})
return(x)
}
t(apply(df, 1, clean))
# one two three four
# [1,] 1 2 2 2
# [2,] 1 3 3 3
# [3,] 5 5 11 12
# [4,] 13 14 NA NA
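To make the three steps inside clean concrete, here is a small trace of the intermediate values for the second row of df. It reuses the variable names from clean and adds nothing new:

```r
x <- c(1, NA, NA, 9)

# Positions to rewrite: every NA, plus the value right after an NA
to.rep <- which(is.na(x) | c(FALSE, head(is.na(x), -1)))
to.rep
# [1] 2 3 4

# A new group starts after each non-NA value among those positions
groups <- cumsum(c(TRUE, head(!is.na(x[to.rep]), -1)))
groups
# [1] 1 1 1

# Within each group, spread the last (non-NA) value evenly: 9 / 3 = 3
x[to.rep] <- ave(x[to.rep], groups, FUN = function(y) {
  rep(tail(y, 1) / length(y), length(y))
})
x
# [1] 1 3 3 3
```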
If efficiency is important (your question implies it is), then an Rcpp solution could be a good option:
library(Rcpp)
cppFunction(
"NumericVector cleanRcpp(NumericVector x) {
const int n = x.size();
NumericVector y(x);
int consecNA = 0;
for (int i=0; i < n; ++i) {
if (R_IsNA(x[i])) {
++consecNA;
} else if (consecNA > 0) {
const double replacement = x[i] / (consecNA + 1);
for (int j=i-consecNA; j <= i; ++j) {
y[j] = replacement;
}
consecNA = 0;
} else {
consecNA = 0;
}
}
return y;
}")
t(apply(df, 1, cleanRcpp))
# one two three four
# [1,] 1 2 2 2
# [2,] 1 3 3 3
# [3,] 5 5 11 12
# [4,] 13 14 NA NA
We can compare performance on a larger instance (10000 x 100 matrix):
set.seed(144)
mat <- matrix(sample(c(1:3, NA), 1000000, replace=TRUE), nrow=10000)
all.equal(apply(mat, 1, clean), apply(mat, 1, cleanRcpp))
# [1] TRUE
system.time(apply(mat, 1, clean))
# user system elapsed
# 4.918 0.035 4.992
system.time(apply(mat, 1, cleanRcpp))
# user system elapsed
# 0.093 0.016 0.120
In this case the Rcpp solution provides roughly a 40x speedup compared to the base R implementation.
Here's a base R solution that's nearly as fast as josilber's Rcpp function:
spread_left <- function(df) {
nc <- ncol(df)
x <- rev(as.vector(t(as.matrix(cbind(df, -Inf)))))
ii <- cumsum(!is.na(x))
f <- tabulate(ii)
v <- x[!duplicated(ii)]
xx <- v[ii]/f[ii]
xx[xx == -Inf] <- NA
m <- matrix(rev(xx), ncol=nc+1, byrow=TRUE)[,seq_len(nc)]
as.data.frame(m)
}
spread_left(df)
# one two three four
# 1 1 2 2 2
# 2 1 3 3 3
# 3 5 5 11 12
# 4 13 14 NA NA
It manages to be relatively fast by vectorizing everything and completely avoiding time-expensive calls to apply(). (The downside is that it's also relatively obfuscated; to see how it works, do debug(spread_left) and then apply it to the small data.frame df in the OP.)
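As an alternative to stepping through with the debugger, the same steps can be applied to a single row. The helper name spread_row is hypothetical and only for illustration; as far as I can tell, the appended -Inf acts as a sentinel that separates rows and marks trailing NAs that have no later measurement:

```r
# Hypothetical helper: the spread_left() steps applied to one row
spread_row <- function(row) {
  x  <- rev(c(row, -Inf))    # append the -Inf sentinel, then reverse
  ii <- cumsum(!is.na(x))    # run id: each non-NA value starts a new run
  f  <- tabulate(ii)         # length of each run
  v  <- x[!duplicated(ii)]   # the non-NA value that heads each run
  xx <- v[ii] / f[ii]        # spread each value evenly over its run
  xx[xx == -Inf] <- NA       # runs headed by the sentinel stay missing
  rev(xx)[seq_along(row)]    # restore the order and drop the sentinel
}

spread_row(c(1, NA, NA, 9))
# [1] 1 3 3 3
spread_row(c(13, 14, NA, NA))
# [1] 13 14 NA NA
```

Working on the reversed vector is what lets a single cumsum attach each run of NAs to the measurement that follows it in the original order.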
Here are benchmarks for all currently posted solutions:
library(rbenchmark)
set.seed(144)
mat <- matrix(sample(c(1:3, NA), 1000000, replace=TRUE), nrow=10000)
df <- as.data.frame(mat)
## First confirm that it produces the same results
identical(spread_left(df), as.data.frame(t(apply(mat, 1, clean))))
# [1] TRUE
## Then compare its speed
benchmark(josilberR = t(apply(mat, 1, clean)),
josilberRcpp = t(apply(mat, 1, cleanRcpp)),
Josh = spread_left(df),
Henrik = t(apply(df, 1, fn)),
replications = 10)
# test replications elapsed relative user.self sys.self
# 4 Henrik 10 38.81 25.201 38.74 0.08
# 3 Josh 10 2.07 1.344 1.67 0.41
# 1 josilberR 10 57.42 37.286 57.37 0.05
# 2 josilberRcpp 10 1.54 1.000 1.44 0.11
Another base possibility. I first create a grouping variable (grp), over which the 'spread' is then made with ave.
fn <- function(x){
grp <- rev(cumsum(!is.na(rev(x))))
res <- ave(x, grp, FUN = function(y) sum(y, na.rm = TRUE) / length(y))
res[grp == 0] <- NA
res
}
t(apply(df, 1, fn))
# one two three four
# [1,] 1 2 2 2
# [2,] 1 3 3 3
# [3,] 5 5 11 12
# [4,] 13 14 NA NA
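For illustration, this is what the grouping variable looks like for the third and fourth rows of df. Group 0 only ever contains trailing NAs, which is why it is reset to NA at the end of fn:

```r
x3 <- c(NA, 10, 11, 12)
grp3 <- rev(cumsum(!is.na(rev(x3))))
grp3
# [1] 3 3 2 1   (the NA shares group 3 with the 10 that follows it)

x4 <- c(13, 14, NA, NA)
grp4 <- rev(cumsum(!is.na(rev(x4))))
grp4
# [1] 2 1 0 0   (trailing NAs fall into group 0)
```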
I was thinking that if NAs are relatively rare, it might be better to make the edits by reference. (I'm guessing this is how the Rcpp approach works.) Here's how it can be done in data.table, borrowing @Henrik's function almost verbatim and converting to long format:
require(data.table) # 1.9.5
fill_naseq <- function(df){
# switch to long format
DT <- data.table(id=(1:nrow(df))*ncol(df),df)
mDT <- setkey(melt(DT,id.vars="id"),id)
mDT[,value := as.numeric(value)]
mDT[,badv := is.na(value)]
mDT[
# subset to rows that need modification
badv|shift(badv),
# apply @Henrik's function, more or less
value:={
g = ave(!badv,id,FUN=function(x)rev(cumsum(rev(x))))+id
ave(value,g,FUN=function(x){n = length(x); x[n]/n})
}]
# revert to wide format
(setDF(dcast(mDT,id~variable)[,id:=NULL]))
}
identical(fill_naseq(df),spread_left(df)) # TRUE
To show the best-case scenario for this approach, I simulated so that NAs are very infrequent:
nr = 1e4
nc = 100
nafreq = 1/1e4
mat <- matrix(sample(
c(NA,1:3),
nr*nc,
replace=TRUE,
prob=c(nafreq,rep((1-nafreq)/3,3))
),nrow=nr)
df <- as.data.frame(mat)
benchmark(F=fill_naseq(df),Josh=spread_left(df),replications=10)[1:5]
# test replications elapsed relative user.self
# 1 F 10 3.82 1.394 3.72
# 2 Josh 10 2.74 1.000 2.70
# I don't have Rcpp installed and so left off josilber's even faster approach
So, it's still slower. However, with data kept in a long format, reshaping wouldn't be necessary:
DT <- data.table(id=(1:nrow(df))*ncol(df),df)
mDT <- setkey(melt(DT,id.vars="id"),id)
mDT[,value := as.numeric(value)]
fill_naseq_long <- function(mDT){
mDT[,badv := is.na(value)]
mDT[badv|shift(badv),value:={
g = ave(!badv,id,FUN=function(x)rev(cumsum(rev(x))))+id
ave(value,g,FUN=function(x){n = length(x); x[n]/n})
}]
mDT
}
benchmark(
F2=fill_naseq_long(mDT),F=fill_naseq(df),Josh=spread_left(df),replications=10)[1:5]
# test replications elapsed relative user.self
# 2 F 10 3.98 8.468 3.81
# 1 F2 10 0.47 1.000 0.45
# 3 Josh 10 2.72 5.787 2.69
Now it's a little faster. And who doesn't like keeping their data in long format? This also has the advantage of working even if we don't have the same number of observations per "id".

Select one row from each group in a large data.table based on a condition [duplicate]

I have a table where the key is repeated a number of times, and I want to select just one row for each key, using the largest value of another column.
This example demonstrates the solution I have at the moment:
N = 10
k = 2
DT = data.table(X = rep(1:N, each = k), Y = rnorm(k*N))
X Y
1: 1 -1.37925206
2: 1 -0.53837461
3: 2 0.26516340
4: 2 -0.04643483
5: 3 0.40331424
6: 3 0.28667275
7: 4 -0.30342327
8: 4 -2.13143267
9: 5 2.11178673
10: 5 -0.98047230
11: 6 -0.27230783
12: 6 -0.79540934
13: 7 1.54264549
14: 7 0.40079650
15: 8 -0.98474297
16: 8 0.73179201
17: 9 -0.34590491
18: 9 -0.55897393
19: 10 0.97523187
20: 10 1.16924293
> DT[, .SD[Y == max(Y)], by = X]
X Y
1: 1 -0.5383746
2: 2 0.2651634
3: 3 0.4033142
4: 4 -0.3034233
5: 5 2.1117867
6: 6 -0.2723078
7: 7 1.5426455
8: 8 0.7317920
9: 9 -0.3459049
10: 10 1.1692429
The problem is that for larger data.tables this takes a very long time:
N = 10000
k = 25
DT = data.table(X = rep(1:N, each = k), Y = rnorm(k*N))
system.time(DT[, .SD[Y == max(Y)], by = X])
user system elapsed
9.69 0.00 9.69
My actual table has about 100 million rows...
Can anyone suggest a more efficient solution?
Edit - importance of set key
The solution proposed works well, but you must use setkey, or have the DT ordered for it to work:
See Example without "each" in rep:
N = 10
k = 2
DT = data.table(X = rep(1:N, k), Y = rnorm(k*N))
DT[DT[, Y == max(Y), by = X]$V1,]
X Y
1: 1 1.26925708
2: 4 -0.66625732
3: 5 0.41498548
4: 8 0.03531185
5: 9 0.30608380
6: 1 0.50308578
7: 4 0.19848227
8: 6 0.86458423
9: 8 0.69825500
10: 10 -0.38160503
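The misalignment can be reproduced on a tiny unsorted table. The sketch below (made-up data, not from the question) shows that the logical vector comes back in group order rather than row order, while the .I variant returns row numbers and so does not depend on the ordering:

```r
library(data.table)

# Unsorted: the rows of X = 1 and X = 2 are interleaved
DT <- data.table(X = c(1, 2, 1, 2), Y = c(3, 5, 1, 2))

# Grouped result is ordered by group: rows 1, 3 (X = 1), then rows 2, 4 (X = 2)
flag <- DT[, Y == max(Y), by = X]$V1
flag
# [1]  TRUE FALSE  TRUE FALSE

wrong <- DT[flag]   # picks rows 1 and 3, but row 3 is not a group maximum
right <- DT[DT[, .I[Y == max(Y)], by = X]$V1]   # picks rows 1 and 2
```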
This would be faster compared to .SD
system.time({setkey(DT, X)
DT[DT[,Y==max(Y), by=X]$V1,]})
# user system elapsed
#0.016 0.000 0.016
Or
system.time(DT[DT[, .I[Y==max(Y)], by=X]$V1])
# user system elapsed
# 0.023 0.000 0.023
If there are only two columns,
system.time(DT[,list(Y=max(Y)), by=X])
# user system elapsed
# 0.006 0.000 0.007
Compared to,
system.time(DT[, .SD[Y == max(Y)], by = X] )
# user system elapsed
# 2.946 0.006 2.962
Based on comments from @Khashaa and @AnandaMahto, the CRAN version (1.9.4) gives a different result for the .SD method compared to the devel version (1.9.5), which I used. You could get the same result for the "CRAN" version (from @Arun's comments) by setting the options
options(datatable.auto.index=FALSE)
NOTE: In case of "ties", the solutions described here will return multiple rows for each group (as mentioned by @docendo discimus). My solutions are based on the "code" posted by the OP.
If there are "ties", then you could use unique with the by option (in case the number of columns is > 2)
setkey(DT,X)
unique(DT[DT[,Y==max(Y), by=X]$V1,], by=c("X", "Y"))
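If exactly one row per group is wanted even when the tied rows differ in other columns, one possible variant of the .I idiom uses which.max, which returns only the first maximum in each group. This is a sketch on made-up data with a hypothetical extra column Z:

```r
library(data.table)

DT <- data.table(X = c(1, 1, 2, 2, 2),
                 Y = c(5, 5, 1, 3, 3),
                 Z = letters[1:5])

# which.max() gives the position of the first maximum within each group,
# and .I turns that into a row number of the full table
res <- DT[DT[, .I[which.max(Y)], by = X]$V1]
res$Z
# [1] "a" "d"
```

Note that which.max skips NA values, so groups where Y is entirely NA contribute no row.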
microbenchmarks
library(microbenchmark)
f1 <- function(){setkey(DT,X)[DT[, Y==max(Y), by=X]$V1,]}
f2 <- function(){DT[DT[, .I[Y==max(Y)], by=X]$V1]}
f3 <- function(){DT[, list(Y=max(Y)), by=X]}
f4 <- function(){DT[, .SD[Y==max(Y)], by=X]}
microbenchmark(f1(), f2(), f3(), f4(), unit='relative', times=20L)
#Unit: relative
# expr min lq mean median uq max neval
# f1() 2.794435 2.733706 3.024097 2.756398 2.832654 6.697893 20
# f2() 4.302534 4.291715 4.535051 4.271834 4.342437 8.114811 20
# f3() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 20
# f4() 533.119480 522.069189 504.739719 507.494095 493.641512 466.862691 20
# cld
# a
# a
# a
# b
data
N = 10000
k = 25
set.seed(25)
DT = data.table(X = rep(1:N, each = k), Y = rnorm(k*N))

Combinatorial partial sum

I am looking in R for a function partial.sum() which takes a vector of numbers and returns an ascending sorted vector of all partial sums:
test=c(2,5,10)
partial.sum(test)
# 2 5 7 10 12 15 17
## 2 is the sum of element 2
## 5 is the sum of element 5
## 7 is the sum of elements 2 & 5
## 10 is the sum of element 10
## 12 is the sum of elements 2 & 10
## 15 is the sum of elements 5 & 10
## 17 is the sum of elements 2 & 5 & 10
Here is one using recursion. (Not making claims about it being efficient either)
partial.sum <- function(x) {
slave <- function(x) {
if (length(x)) {
y <- Recall(x[-1])
c(y + 0, y + x[1])
} else 0
}
sort(unique(slave(x)[-1]))
}
partial.sum(c(2,5,10))
# [1] 2 5 7 10 12 15 17
Edit: well, turns out it is a little faster than I thought:
x <- 1:20
microbenchmark(flodel(x), dason(x), matthew(x), times = 10)
# Unit: milliseconds
# expr min lq median uq max neval
# flodel(x) 86.31128 86.9966 94.12023 125.1013 163.5824 10
# dason(x) 2407.27062 2461.2022 2633.73003 2846.2639 3031.7250 10
# matthew(x) 3084.59227 3191.7810 3304.36064 3693.8595 3883.2767 10
(I added sort and/or unique to dason and matthew's functions where appropriate for fair comparison.)
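The recursion works because each element is either left out of a subset (y + 0) or included (y + x[1]), so the set of sums doubles with every element. The same idea can be written as a loop; the name partial_sums_iter is hypothetical and this version was not part of the timings above:

```r
# Hypothetical loop version of the same doubling idea
partial_sums_iter <- function(x) {
  sums <- 0                        # sum of the empty subset
  for (v in x) {
    sums <- c(sums, sums + v)      # every existing sum, without and with v
  }
  sort(unique(sums[-1]))           # drop the empty subset, then tidy up
}

partial_sums_iter(c(2, 5, 10))
# [1]  2  5  7 10 12 15 17
```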
This probably doesn't scale too well and doesn't account for possible duplicates in the input vector or duplicates in the answer. You can use unique later if that is a concern for you.
partial.sum <- function(x){
n <- length(x)
# Something that will help get us every possible subset
# of the original vector
out <- do.call(expand.grid, replicate(n, c(T,F), simplify = F))
# Don't include the case where we don't grab any elements
out <- head(out, -1)
# ans <- apply(out, 1, function(row){sum(x[row])})
# As flodel points out the following will be faster than
# the previous line
ans <- data.matrix(out) %*% x
# If you want only unique value then add a call to unique here
ans <- sort(unname(ans))
ans
}
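To see what the expand.grid step produces, here is the subset table for a two-element vector. Each row is one subset, encoded as TRUE/FALSE flags, and the matrix product adds up the selected elements:

```r
x <- c(2, 5)

# One logical column per element; rows enumerate all 2^n combinations
out <- do.call(expand.grid,
               replicate(length(x), c(TRUE, FALSE), simplify = FALSE))
nrow(out)
# [1] 4

# The last row is all FALSE (the empty subset), so it is dropped
out <- head(out, -1)

# TRUE/FALSE become 1/0, so each row of the product is a subset sum
sums <- sort(as.vector(data.matrix(out) %*% x))
sums
# [1] 2 5 7
```

The table has 2^n rows, which is why this approach becomes memory hungry for longer vectors.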
Here's an iterative approach using combn to produce combinations to sum. It works for vectors of length greater than 1.
partial.sum <- function(x) {
sort(unique(unlist(sapply(seq_along(x), function(i) colSums(combn(x,i))))))
}
## [1] 2 5 7 10 12 15 17
To handle lengths less than 2, test for the length:
partial.sum <- function(x) {
if (length(x) > 1) {
sort(unique(unlist(sapply(seq_along(x), function(i) colSums(combn(x,i))))))
} else {
x
}
}
Some timings, out of rbenchmark, which don't entirely agree with flodel's results. I modified Dason's code, removing the comments and adding a call to unique. The version of my code is the first, without the if. flodel's code is a clear winner here.
> test <- 1:10
> benchmark(matthew(test), flodel(test), dason(test), replications=100)
test replications elapsed relative user.self sys.self user.child sys.child
3 dason(test) 100 0.180 12.857 0.175 0.004 0 0
2 flodel(test) 100 0.014 1.000 0.015 0.000 0 0
1 matthew(test) 100 0.244 17.429 0.242 0.001 0 0
> test <- 1:20
> benchmark(matthew(test), flodel(test), dason(test), replications=1)
test replications elapsed relative user.self sys.self user.child sys.child
3 dason(test) 1 5.231 98.698 5.158 0.058 0 0
2 flodel(test) 1 0.053 1.000 0.053 0.000 0 0
1 matthew(test) 1 2.184 41.208 2.180 0.000 0 0
> test <- 1:25
> benchmark(matthew(test), flodel(test), dason(test), replications=1)
test replications elapsed relative user.self sys.self user.child sys.child
3 dason(test) 1 288.957 163.345 264.068 23.859 0 0
2 flodel(test) 1 1.769 1.000 1.727 0.038 0 0
1 matthew(test) 1 75.712 42.799 74.745 0.847 0 0

Complicated reshaping

I want to reshape my dataframe from long to wide format, but I lose some data that I'd like to keep.
For the following example:
df <- data.frame(Par1 = unlist(strsplit("AABBCCC","")),
Par2 = unlist(strsplit("DDEEFFF","")),
ParD = unlist(strsplit("foo,bar,baz,qux,bla,xyz,meh",",")),
Type = unlist(strsplit("pre,post,pre,post,pre,post,post",",")),
Val = c(10,20,30,40,50,60,70))
# Par1 Par2 ParD Type Val
# 1 A D foo pre 10
# 2 A D bar post 20
# 3 B E baz pre 30
# 4 B E qux post 40
# 5 C F bla pre 50
# 6 C F xyz post 60
# 7 C F meh post 70
dfw <- dcast(df,
formula = Par1 + Par2 ~ Type,
value.var = "Val",
fun.aggregate = mean)
# Par1 Par2 post pre
# 1 A D 20 10
# 2 B E 40 30
# 3 C F 65 50
This is almost what I need, but I would also like to have:
1. some field keeping the data from the ParD field (for example, as a single merged string),
2. the number of observations used for each aggregation.
i.e. I would like the resulting data.frame to be as follows:
# Par1 Par2 post pre Num.pre Num.post ParD
# 1 A D 20 10 1 1 foo_bar
# 2 B E 40 30 1 1 baz_qux
# 3 C F 65 50 1 2 bla_xyz_meh
I would be grateful for any ideas. For example, I tried to solve the second task by writing in dcast: fun.aggregate=function(x) c(Val=mean(x),Num=length(x)) - but this causes an error.
Late to the party, but here's another alternative using data.table:
require(data.table)
dt <- data.table(df, key=c("Par1", "Par2"))
dt[, list(pre=mean(Val[Type == "pre"]),
post=mean(Val[Type == "post"]),
pre.num=length(Val[Type == "pre"]),
post.num=length(Val[Type == "post"]),
ParD = paste(ParD, collapse="_")),
by=list(Par1, Par2)]
# Par1 Par2 pre post pre.num post.num ParD
# 1: A D 10 20 1 1 foo_bar
# 2: B E 30 40 1 1 baz_qux
# 3: C F 50 65 1 2 bla_xyz_meh
[from Matthew] +1 Some minor improvements to save repeating the same ==, and to demonstrate local variables inside j.
dt[, list(pre=mean(Val[.pre <- Type=="pre"]), # save .pre
post=mean(Val[.post <- Type=="post"]), # save .post
pre.num=sum(.pre), # reuse .pre
post.num=sum(.post), # reuse .post
ParD = paste(ParD, collapse="_")),
by=list(Par1, Par2)]
# Par1 Par2 pre post pre.num post.num ParD
# 1: A D 10 20 1 1 foo_bar
# 2: B E 30 40 1 1 baz_qux
# 3: C F 50 65 1 2 bla_xyz_meh
dt[, { .pre <- Type=="pre" # or save .pre and .post up front
.post <- Type=="post"
list(pre=mean(Val[.pre]),
post=mean(Val[.post]),
pre.num=sum(.pre),
post.num=sum(.post),
ParD = paste(ParD, collapse="_")) }
, by=list(Par1, Par2)]
# Par1 Par2 pre post pre.num post.num ParD
# 1: A D 10 20 1 1 foo_bar
# 2: B E 30 40 1 1 baz_qux
# 3: C F 50 65 1 2 bla_xyz_meh
And if a list column is ok rather than a paste, then this should be faster :
dt[, { .pre <- Type=="pre"
.post <- Type=="post"
list(pre=mean(Val[.pre]),
post=mean(Val[.post]),
pre.num=sum(.pre),
post.num=sum(.post),
ParD = list(ParD)) } # list() faster than paste()
, by=list(Par1, Par2)]
# Par1 Par2 pre post pre.num post.num ParD
# 1: A D 10 20 1 1 foo,bar
# 2: B E 30 40 1 1 baz,qux
# 3: C F 50 65 1 2 bla,xyz,meh
A solution in 2 steps using ddply (I am not happy with it, but I get the result):
dat <- ddply(df,.(Par1,Par2),function(x){
data.frame(ParD=paste(paste(x$ParD),collapse='_'),
Num.pre =length(x$Type[x$Type =='pre']),
Num.post = length(x$Type[x$Type =='post']))
})
merge(dfw,dat)
Par1 Par2 post pre ParD Num.pre Num.post
1 A D 2.0 1 foo_bar 1 1
2 B E 4.0 3 baz_qux 1 1
3 C F 6.5 5 bla_xyz_meh 1 2
You could do a merge of two dcasts and an aggregate, here all wrapped into one large expression mostly to avoid having intermediate objects hanging around afterwards:
Reduce(merge, list(
dcast(df, formula = Par1+Par2~Type, value.var="Val",
fun.aggregate=mean),
setNames(dcast(df, formula = Par1+Par2~Type, value.var="Val",
fun.aggregate=length), c("Par1", "Par2", "Num.post",
"Num.pre")),
aggregate(df["ParD"], df[c("Par1", "Par2")], paste, collapse="_")
))
I'll post but agstudy's puts me to shame:
step1 <- with(df, split(df, list(Par1, Par2)))
step2 <- step1[sapply(step1, nrow) > 0]
step3 <- lapply(step2, function(x) {
piece1 <- tapply(x$Val, x$Type, mean)
piece2 <- tapply(x$Type, x$Type, length)
names(piece2) <- paste0("Num.", names(piece2))
out <- x[1, 1:2]
out[, 3:6] <- c(piece1, piece2)
names(out)[3:6] <- names(c(piece1, piece2))
out$ParD <- paste(unique(x$ParD), collapse="_")
out
})
data.frame(do.call(rbind, step3), row.names=NULL)
Yielding:
Par1 Par2 post pre Num.post Num.pre ParD
1 A D 2.0 1 1 1 foo_bar
2 B E 4.0 3 1 1 baz_qux
3 C F 6.5 5 2 1 bla_xyz_meh
What a great opportunity to benchmark!
Below are some runs of the plyr method (as suggested by @agstudy) compared with the data.table method (as suggested by @Arun)
using different sample sizes (N = 900, 2700, 10800, 17568)
Summary:
The data.table method outperforms the plyr method by a factor of roughly 5 to 7 in the main benchmark runs below
#-------------------#
# M E T H O D S #
#-------------------#
# additional methods below, in the updates
# Method 1 -- suggested by @agstudy
plyrMethod <- quote({
dfw<-dcast(df,
formula = Par1+Par2~Type,
value.var="Val",
fun.aggregate=mean)
dat <- ddply(df,.(Par1,Par2),function(x){
data.frame(ParD=paste(paste(x$ParD),collapse='_'),
Num.pre =length(x$Type[x$Type =='pre']),
Num.post = length(x$Type[x$Type =='post']))
})
merge(dfw,dat)
})
# Method 2 -- suggested by @Arun
dtMethod <- quote(
dt[, list(pre=mean(Val[Type == "pre"]),
post=mean(Val[Type == "post"]),
Num.pre=length(Val[Type == "pre"]),
Num.post=length(Val[Type == "post"]),
ParD = paste(ParD, collapse="_")),
by=list(Par1, Par2)]
)
# Method 3 -- suggested by @regetz
reduceMethod <- quote(
Reduce(merge, list(
dcast(df, formula = Par1+Par2~Type, value.var="Val",
fun.aggregate=mean),
setNames(dcast(df, formula = Par1+Par2~Type, value.var="Val",
fun.aggregate=length), c("Par1", "Par2", "Num.post",
"Num.pre")),
aggregate(df["ParD"], df[c("Par1", "Par2")], paste, collapse="_")
))
)
# Method 4 -- suggested by @Ramnath
castddplyMethod <- quote(
reshape::cast(Par1 + Par2 + ParD ~ Type,
data = ddply(df, .(Par1, Par2), transform,
ParD = paste(ParD, collapse = "_")),
fun = c(mean, length)
)
)
# SAMPLE DATA #
#-------------#
library(data.table)
library(plyr)
library(reshape2)
library(rbenchmark)
# for Par1, ParD
LLL <- apply(expand.grid(LETTERS, LETTERS, LETTERS, stringsAsFactors=FALSE), 1, paste0, collapse="")
lll <- apply(expand.grid(letters, letters, letters, stringsAsFactors=FALSE), 1, paste0, collapse="")
# max size is 17568 with current sample data setup, ie: floor(length(LLL) / 18) * 18
size <- 17568
size <- 10800
size <- 900
set.seed(1)
df<-data.frame(Par1=rep(LLL[1:(size/2)], times=rep(c(2,2,3), size)[1:(size/2)])[1:(size)]
, Par2=rep(lll[1:(size/2)], times=rep(c(2,2,3), size)[1:(size/2)])[1:(size)]
, ParD=sample(unlist(lapply(c("f", "b"), paste0, lll)), size, FALSE)
, Type=rep(c("pre","post"), size/2)
, Val =sample(seq(10,100,10), size, TRUE)
)
dt <- data.table(df, key=c("Par1", "Par2"))
# Confirming Same Results #
#-------------------------#
# Evaluate
DF1 <- eval(plyrMethod)
DF2 <- eval(dtMethod)
# Convert to DF and sort columns and sort ParD levels, for use in identical
colOrder <- sort(names(DF1))
DF1 <- DF1[, colOrder]
DF2 <- as.data.frame(DF2)[, colOrder]
DF2$ParD <- factor(DF2$ParD, levels=levels(DF1$ParD))
identical((DF1), (DF2))
# [1] TRUE
#-------------------------#
RESULTS
#--------------------#
# BENCHMARK #
#--------------------#
benchmark(plyr=eval(plyrMethod), dt=eval(dtMethod), reduce=eval(reduceMethod), castddply=eval(castddplyMethod),
replications=5, columns=c("relative", "test", "elapsed", "user.self", "sys.self", "replications"),
order="relative")
# SAMPLE SIZE = 900
relative test elapsed user.self sys.self replications
1.000 reduce 0.392 0.375 0.018 5
1.003 dt 0.393 0.377 0.016 5
7.064 plyr 2.769 2.721 0.047 5
8.003 castddply 3.137 3.030 0.106 5
# SAMPLE SIZE = 2,700
relative test elapsed user.self sys.self replications
1.000 dt 1.371 1.327 0.090 5
2.205 reduce 3.023 2.927 0.102 5
7.291 plyr 9.996 9.644 0.377 5
# SAMPLE SIZE = 10,800
relative test elapsed user.self sys.self replications
1.000 dt 8.678 7.168 1.507 5
2.769 reduce 24.029 23.231 0.786 5
6.946 plyr 60.277 52.298 7.947 5
13.796 castddply 119.719 113.333 10.816 5
# SAMPLE SIZE = 17,568
relative test elapsed user.self sys.self replications
1.000 dt 27.421 13.042 14.470 5
4.030 reduce 110.498 75.853 34.922 5
5.414 plyr 148.452 105.776 43.156 5
Update : Added results for baseMethod1
# Used only sample size of 90, as it was taking long
relative test elapsed user.self sys.self replications
1.000 dt 0.044 0.043 0.001 5
7.773 plyr 0.342 0.339 0.003 5
65.614 base1 2.887 2.866 0.028 5
Where
baseMethod1 <- quote({
step1 <- with(df, split(df, list(Par1, Par2)))
step2 <- step1[sapply(step1, nrow) > 0]
step3 <- lapply(step2, function(x) {
piece1 <- tapply(x$Val, x$Type, mean)
piece2 <- tapply(x$Type, x$Type, length)
names(piece2) <- paste0("Num.", names(piece2))
out <- x[1, 1:2]
out[, 3:6] <- c(piece1, piece2)
names(out)[3:6] <- names(c(piece1, piece2))
out$ParD <- paste(unique(x$ParD), collapse="_")
out
})
data.frame(do.call(rbind, step3), row.names=NULL)
})
Update 2: Added keying the DT as part of the metric
Adding the indexing step to the benchmark for fairness, as per @MatthewDowle's comment.
However, presumably, if data.table is used, it will be in place of the data.frame and
hence the indexing will occur once and not simply for this procedure
dtMethod.withkey <- quote({
dt <- data.table(df, key=c("Par1", "Par2"))
dt[, list(pre=mean(Val[Type == "pre"]),
post=mean(Val[Type == "post"]),
Num.pre=length(Val[Type == "pre"]),
Num.post=length(Val[Type == "post"]),
ParD = paste(ParD, collapse="_")),
by=list(Par1, Par2)]
})
# SAMPLE SIZE = 10,800
relative test elapsed user.self sys.self replications
1.000 dt 9.155 7.055 2.137 5
1.043 dt.withkey 9.553 7.245 2.353 5
3.567 reduce 32.659 31.196 1.586 5
6.703 plyr 61.364 54.080 7.600 5
Update 3: Benchmarking @MD's edits to @Arun's original answer
dtMethod.MD1 <- quote(
dt[, list(pre=mean(Val[.pre <- Type=="pre"]), # save .pre
post=mean(Val[.post <- Type=="post"]), # save .post
pre.num=sum(.pre), # reuse .pre
post.num=sum(.post), # reuse .post
ParD = paste(ParD, collapse="_")),
by=list(Par1, Par2)]
)
dtMethod.MD2 <- quote(
dt[, { .pre <- Type=="pre" # or save .pre and .post up front
.post <- Type=="post"
list(pre=mean(Val[.pre]),
post=mean(Val[.post]),
pre.num=sum(.pre),
post.num=sum(.post),
ParD = paste(ParD, collapse="_")) }
, by=list(Par1, Par2)]
)
dtMethod.MD3 <- quote(
dt[, { .pre <- Type=="pre"
.post <- Type=="post"
list(pre=mean(Val[.pre]),
post=mean(Val[.post]),
pre.num=sum(.pre),
post.num=sum(.post),
ParD = list(ParD)) } # list() faster than paste()
, by=list(Par1, Par2)]
)
benchmark(dt.M1=eval(dtMethod.MD1), dt.M2=eval(dtMethod.MD2), dt.M3=eval(dtMethod.MD3), dt=eval(dtMethod),
replications=5, columns=c("relative", "test", "elapsed", "user.self", "sys.self", "replications"),
order="relative")
#--------------------#
Comparing the different data.table methods amongst themselves
# SAMPLE SIZE = 900
relative test elapsed user.self sys.self replications
1.000 dt.M3 0.198 0.197 0.001 5 <~~~ "list()" Method
1.242 dt.M1 0.246 0.243 0.004 5
1.253 dt.M2 0.248 0.242 0.007 5
1.884 dt 0.373 0.367 0.007 5
# SAMPLE SIZE = 17,568
relative test elapsed user.self sys.self replications
1.000 dt.M3 33.492 24.487 9.122 5 <~~~ "list()" Method
1.086 dt.M1 36.388 11.442 25.086 5
1.086 dt.M2 36.388 10.845 25.660 5
1.126 dt 37.701 13.256 24.535 5
Comparing MD3 ("list" method) with MD1 (best of DT non-list methods)
Using a clean session (ie, removing string cache)
_Note: Ran the following twice, fresh session each time, with practically identical results
Then re-ran in the *same* session, with reps=5. Results very different._
benchmark(dt.M1=eval(dtMethod.MD1), dt.M3=eval(dtMethod.MD3), replications=1, columns=c("relative", "test", "elapsed", "user.self", "sys.self", "replications"), order="relative")
# SAMPLE SIZE=17,568; CLEAN SESSION
relative test elapsed user.self sys.self replications
1.000 dt.M1 8.885 4.260 4.617 1
1.633 dt.M3 14.506 12.821 1.677 1
# SAMPLE SIZE=17,568; *SAME* SESSION
relative test elapsed user.self sys.self replications
1.000 dt.M1 33.443 10.200 23.226 5
1.048 dt.M3 35.060 26.127 8.915 5
#--------------------#
New benchmarks against previous methods
_Note: Not using the "list method" as results are not the same as other methods_
# SAMPLE SIZE = 900
relative test elapsed user.self sys.self replications
1.000 dt.M1 0.254 0.247 0.008 5
1.705 reduce 0.433 0.425 0.010 5
11.280 plyr 2.865 2.842 0.031 5
# SAMPLE SIZE = 17,568
relative test elapsed user.self sys.self replications
1.000 dt.M1 24.826 10.427 14.458 5
4.348 reduce 107.935 70.107 38.314 5
5.942 plyr 147.508 106.958 41.083 5
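To illustrate why the "list method" was left out of the comparison above, here is a minimal sketch on hypothetical toy data (the values are made up; only the column names follow the benchmark code). It shows that list() and paste() produce different kinds of columns, so the results are not directly comparable.

```r
library(data.table)

# Hypothetical toy data; column names mirror those used in the benchmarks
dt <- data.table(Par1 = c("A", "A", "B"),
                 Par2 = c(1, 1, 2),
                 ParD = c("foo", "bar", "baz"))

# paste() collapses each group's values into a single string
pasted <- dt[, list(ParD = paste(ParD, collapse = "_")), by = list(Par1, Par2)]

# list() keeps each group's values as a character vector inside a list column
listed <- dt[, list(ParD = list(ParD)), by = list(Par1, Par2)]

class(pasted$ParD)   # "character"
class(listed$ParD)   # "list"
listed$ParD[[1]]     # the two values for group A/1, still separate
```

So the list() version skips the string concatenation but returns a list column rather than a character column, which may or may not be acceptable downstream.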
One Step solution combining reshape::cast with plyr::ddply
cast(Par1 + Par2 + ParD ~ Type, data = ddply(df, .(Par1, Par2), transform,
ParD = paste(ParD, collapse = "_")), fun = c(mean, length))
NOTE that the dcast function in reshape2 does not allow multiple aggregate functions to be passed, while the cast function in reshape does.
I believe this base R solution is comparable with @Arun's data.table solution. (Which isn't to say I would prefer it; that code is much simpler!)
baseMethod2 <- quote({
is <- unname(split(1:nrow(df), with(df, paste(Par1, Par2, sep="\b"))))
i1 <- sapply(is, `[`, 1)
out <- with(df, data.frame(Par1=Par1[i1], Par2=Par2[i1]))
js <- lapply(is, function(i) split(i, df$Type[i]))
out$post <- sapply(js, function(j) mean(df$Val[j$post]))
out$pre <- sapply(js, function(j) mean(df$Val[j$pre]))
out$Num.pre <- sapply(js, function(j) length(j$pre))
out$Num.post <- sapply(js, function(j) length(j$post))
out$ParD <- sapply(is, function(x) paste(df$ParD[x], collapse="_"))
out
})
Using @RicardoSaporta's timing code with 900, 2700, and 10,800, respectively:
> relative test elapsed user.self sys.self replications
3 1.000 baseMethod2 0.230 0.229 0 5
1 1.130 dt 0.260 0.257 0 5
2 8.752 plyr 2.013 2.006 0 5
> relative test elapsed user.self sys.self replications
3 1.000 baseMethod2 0.877 0.872 0 5
1 1.068 dt 0.937 0.934 0 5
2 8.060 plyr 7.069 7.043 0 5
> relative test elapsed user.self sys.self replications
1 1.000 dt 6.232 6.178 0.031 5
3 1.085 baseMethod2 6.763 6.683 0.054 5
2 7.263 plyr 45.261 44.983 0.104 5
Trying to wrap different aggregation expressions into a self-contained function (expressions should yield atomic values)...
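multi.by <- function(X, INDEX, ...) {
  # Capture the aggregation expressions without evaluating them
  expressions <- substitute(...())
  # A new group starts at each first-seen INDEX value, so the rows of
  # each group are assumed to be contiguous in X
  duplicates <- duplicated(INDEX)
  # Evaluate every expression within each group, then stack the results
  res <- do.call(rbind, sapply(split(X, cumsum(!duplicates), drop=TRUE), function(part)
    sapply(expressions, eval, part, simplify=FALSE), simplify=FALSE))
  if (is.data.frame(INDEX)) res <- cbind(INDEX[!duplicates,], res)
  else rownames(res) <- INDEX[!duplicates]
  res
}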
multi.by(df,df[,1:2],
pre=mean(Val[Type=="pre"]),
post=mean(Val[Type=="post"]),
Num.pre=sum(Type=="pre"),
Num.post=sum(Type=="post"),
ParD=paste(ParD, collapse="_"))
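The core of the function above is capturing the expressions unevaluated and then evaluating them against a chunk of the data. Here is a minimal sketch of just that step. Note that it swaps in the more conventional as.list(substitute(list(...)))[-1] idiom rather than substitute(...()), and that capture_exprs and the toy data are hypothetical names used only for illustration.

```r
# Capture the arguments passed through ... without evaluating them.
# as.list(substitute(list(...)))[-1] is used here in place of substitute(...()).
capture_exprs <- function(...) as.list(substitute(list(...)))[-1]

exprs <- capture_exprs(pre = mean(Val[Type == "pre"]),
                       n.pre = sum(Type == "pre"))

# Hypothetical one-group chunk of data
part <- data.frame(Val = c(1, 3, 10), Type = c("pre", "pre", "post"))

# Evaluate each captured expression with the data frame's columns in scope
res <- lapply(exprs, eval, part)
res$pre     # 2
res$n.pre   # 2
```

Because the expressions are only evaluated inside eval(), they can refer to column names such as Val and Type directly.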

Replacing column names using a data frame in R

I have the matrix
m <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE,dimnames = list(c("s1", "s2", "s3"),c("tom", "dick","bob")))
tom dick bob
s1 1 2 3
s2 4 5 6
s3 7 8 9
#and the data frame
current<-c("tom", "dick","harry","bob")
replacement<-c("x","y","z","b")
df<-data.frame(current,replacement)
current replacement
1 tom x
2 dick y
3 harry z
4 bob b
#I need to replace the existing names i.e. df$current with df$replacement if
#colnames(m) are equal to df$current thereby producing the following matrix
m <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE,dimnames = list(c("s1", "s2", "s3"),c("x", "y","b")))
x y b
s1 1 2 3
s2 4 5 6
s3 7 8 9
Any advice? Should I use an 'if' loop? Thanks.
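You can use which with %in% to find the rows of df$current that match the colnames of m. Once you have those indices, you can use them to subset the replacement names from df$replacement. Note that the indices come back in the order of df$current, so this relies on the columns of m appearing in the same order as the corresponding rows of df.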
colnames(m) = df$replacement[which(df$current %in% colnames(m))]
In the above:
%in% returns TRUE or FALSE for each element of df$current, according to whether it appears in colnames(m).
which(df$current %in% colnames(m)) identifies the indexes (in this case, the row numbers) of the matched names.
df$replacement[...] is the basic way to subset the column df$replacement, returning only the rows identified in the previous step.
A slightly more direct way to find the indices is to use match:
> id <- match(colnames(m), df$current)
> id
[1] 1 2 4
> colnames(m) <- df$replacement[id]
> m
x y b
s1 1 2 3
s2 4 5 6
s3 7 8 9
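One case the example above does not cover is a column name with no entry in df$current: match returns NA for it, and the column would end up named NA. Here is a minimal sketch of one way to handle that, using a hypothetical matrix m2 whose columns are in a different order and include a name ("sue") that is not in the lookup table. The data frame is rebuilt with stringsAsFactors = FALSE so the replacements are plain character strings.

```r
df <- data.frame(current = c("tom", "dick", "harry", "bob"),
                 replacement = c("x", "y", "z", "b"),
                 stringsAsFactors = FALSE)

# Hypothetical matrix: columns in a different order, plus "sue", which is not in df
m2 <- matrix(1:9, nrow = 3, byrow = TRUE,
             dimnames = list(c("s1", "s2", "s3"), c("bob", "sue", "tom")))

# One index per column name, in column order; NA where there is no match
id <- match(colnames(m2), df$current)   # 4 NA 1

# Keep the original name wherever match() found nothing
colnames(m2) <- ifelse(is.na(id), colnames(m2), df$replacement[id])
colnames(m2)   # "b" "sue" "x"
```

Since match works through colnames(m2) element by element, the replacements line up with the columns regardless of the row order in df.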
As discussed below, %in% is generally more intuitive to use, and the difference in efficiency is marginal unless the sets are relatively large, e.g.
> n <- 50000 # size of full vector
> m <- 10000 # size of subset
> query <- paste("A", sort(sample(1:n, m)))
> names <- paste("A", 1:n)
> all.equal(which(names %in% query), match(query, names))
[1] TRUE
> library(rbenchmark)
> benchmark(which(names %in% query))
test replications elapsed relative user.self sys.self user.child sys.child
1 which(names %in% query) 100 0.267 1 0.268 0 0 0
> benchmark(match(query, names))
test replications elapsed relative user.self sys.self user.child sys.child
1 match(query, names) 100 0.172 1 0.172 0 0 0
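If rbenchmark is not installed, a rough version of the same comparison can be made with base R's system.time. This is only a sketch: the variable names (n_full, n_subset, all_names) are hypothetical, chosen so that the matrix m is not overwritten and the base function names is not shadowed, and the timings will vary by machine.

```r
set.seed(1)
n_full   <- 50000   # size of full vector
n_subset <- 10000   # size of subset
all_names <- paste("A", 1:n_full)
query     <- paste("A", sort(sample(1:n_full, n_subset)))

# Both approaches return the same indices here because query is sorted
same <- identical(which(all_names %in% query), match(query, all_names))

# Time 100 repetitions of each approach
t_in    <- system.time(for (i in 1:100) which(all_names %in% query))
t_match <- system.time(for (i in 1:100) match(query, all_names))

# Elapsed seconds for each approach
c(`%in%` = t_in[["elapsed"]], match = t_match[["elapsed"]])
```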

Resources