paste grid -- expand.grid for string concatenation - r

If we want to get all combinations of two vectors, we can use rep/recycling rules:
x <- 1:4
y <- 1:2
cbind(rep(x, each = length(y)), rep(y, length(x)))
# [,1] [,2]
# [1,] 1 1
# [2,] 1 2
# [3,] 2 1
# [4,] 2 2
# [5,] 3 1
# [6,] 3 2
# [7,] 4 1
# [8,] 4 2
But expand.grid is much nicer -- it handles all the repetition for us.
expand.grid(x, y)
# Var1 Var2
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 1
# 5 1 2
# 6 2 2
# 7 3 2
# 8 4 2
Is there a simple version of this for concatenating strings? Like paste.grid? I have a named object where a lot of the objects have names like x_y_z where x, y, and z vary like x and y above.
For example, suppose x can be "avg" or "median", y can be "male" or "female", and z can be "height" or "weight". How can we concisely get all 8 combinations of the three?
Using rep is a pain:
x <- c("avg", "median")
y <- c("male", "female")
z <- c("height", "weight")
paste(rep(x, each = length(y) * length(z)),
rep(rep(y, each = length(z)), length(x)),
rep(z, length(x) * length(y)), sep = "_")
And repurposing expand.grid is a bit clunky (and probably inefficient):
apply(expand.grid(x, y, z), 1, paste, collapse = "_")
Am I missing something? Is there a better way to do this?

Yes, this is what interaction does
levels(interaction(x,y,z,sep='_'))
The implementation is pretty much the same as your rep code.
Outputs:
[1] "avg_female_height" "median_female_height" "avg_male_height" "median_male_height" "avg_female_weight"
[6] "median_female_weight" "avg_male_weight" "median_male_weight"

Using data.table's CJ cross-joining function:
library(data.table)
CJ(x,y,z)[, paste(V1,V2,V3, sep = "_")]
#[1] "avg_female_height" "avg_female_weight" "avg_male_height" "avg_male_weight"
#[5] "median_female_height" "median_female_weight" "median_male_height" "median_male_weight"
Or a variation of your apply approach would be:
do.call(paste, c(expand.grid(x, y, z), sep = "_"))
#[1] "avg_male_height" "median_male_height" "avg_female_height" "median_female_height"
#[5] "avg_male_weight" "median_male_weight" "avg_female_weight" "median_female_weight"

Rudimentary (microbenchmark::microbenchmark) benchmarking shows a pretty significant speed-up by using:
library(tidyr)
library(magrittr)
df <- data.frame(x, y, z)
df %>%
complete(x, y, z) %>%
unite("combo", x, y, z, sep = "_")
A bit slower, but perhaps more straight forward and vectorized variant the apply technique:
df <- expand.grid(x, y, z)
df$combo <- paste(df$Var1, df$Var1, df$Var3, sep = "_")
Someone should chime in with a data.table approach...
Benchmarking: Small Grid (256 elements)
set.seed(21034)
x <- sample(letters, 4, TRUE)
y <- sample(letters, 4, TRUE)
z <- sample(letters, 4, TRUE)
a <- sample(letters, 4, TRUE)
library(data.table)
library(microbenchmark)
library(magrittr)
library(tidyr)
microbenchmark(times = 25L,
DT1 = CJ(x, y, z, a)[ , paste(V1, V2, V3, V4, sep = "_")],
DT2 = CJ(x, y, z, a)[ , do.call(paste, c(.SD, sep = "_"))],
app1 = do.call(paste, c(expand.grid(x, y, z, a), sep = "_")),
app2 = paste((df <- expand.grid(x, y, z, a))$Var1,
df$Var2, df$Var3, sep = "_"),
magg_outer = outer(x, y, paste, sep = "_") %>%
outer(z, paste, sep = "_") %>%
outer(a, paste, sep = "_") %>% as.vector,
magg_tidy = data.frame(x, y, z, a) %>%
complete(x, y, z, a) %>%
unite("combo", x, y, z, a, sep = "_"),
interaction = levels(interaction(x, y, z, a, sep = "_")),
original = apply(expand.grid(x, y, z, a), 1, paste, collapse = "_"),
rep = paste(rep(x, each = (ny <- length(y)) * (nz <- length(z)) *
(na <- length(a))),
rep(rep(y, each = nz * na), (nx <- length(x))),
rep(rep(z, each = na), nx * ny), sep = "_"),
Reduce = Reduce(function(x, y) paste(rep(x, each = length(y)),
rep(y, length(x)), sep = "_"),
list(x, y, z, a)))
# Unit: microseconds
# expr min lq mean median uq max neval cld
# DT1 529.578 576.6400 624.00002 589.8270 604.9845 5449.287 1000 d
# DT2 561.028 606.4220 639.94659 620.4335 636.2735 5484.514 1000 d
# app1 201.043 225.4475 240.36960 233.4795 243.7090 4244.687 1000 b
# app2 196.692 225.6130 244.33543 234.0455 243.7925 4110.605 1000 b
# magg_outer 164.352 194.1395 205.30300 204.4220 211.1990 456.122 1000 b
# magg_tidy 1872.228 2038.1560 2150.98234 2067.8770 2126.1025 21891.884 1000 f
# interaction 254.885 295.1935 313.54392 306.6680 316.8095 4196.465 1000 c
# original 852.018 935.4960 976.24388 954.5115 972.5550 4973.724 1000 e
# rep 50.737 54.1515 60.22671 55.3660 56.9220 3823.655 1000 a
# Reduce 58.395 65.3860 68.46049 66.8920 68.5640 158.184 1000 a
Benchmarking: Large Grid (1,000,000 elements)
set.seed(21034)
x <- sprintf("%03d", sample(100))
y <- sprintf("%03d", sample(100))
z <- sprintf("%02d", sample(10))
a <- sprintf("%02d", sample(10))
library(data.table)
library(microbenchmark)
library(magrittr)
library(tidyr)
microbenchmark(times = 25L,
DT1 = CJ(x, y, z, a)[ , paste(V1, V2, V3, V4, sep = "_")],
DT2 = CJ(x, y, z, a)[ , do.call(paste, c(.SD, sep = "_"))],
app1 = do.call(paste, c(expand.grid(x, y, z, a), sep = "_")),
app2 = paste((df <- expand.grid(x, y, z, a))$Var1,
df$Var2, df$Var3, sep = "_"),
magg_outer = outer(x, y, paste, sep = "_") %>%
outer(z, paste, sep = "_") %>%
outer(a, paste, sep = "_") %>% as.vector,
magg_tidy = data.frame(x, y, z, a) %>%
complete(x, y, z, a) %>%
unite("combo", x, y, z, a, sep = "_"),
interaction = levels(interaction(x, y, z, a, sep = "_")),
original = apply(expand.grid(x, y, z, a), 1, paste, collapse = "_"),
rep = paste(rep(x, each = (ny <- length(y)) * (nz <- length(z)) *
(na <- length(a))),
rep(rep(y, each = nz * na), (nx <- length(x))),
rep(rep(z, each = na), nx * ny), sep = "_"),
Reduce = Reduce(function(x, y) paste(rep(x, each = length(y)),
rep(y, length(x)), sep = "_"),
list(x, y, z, a)))
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# DT1 360.6528 467.8408 517.4579 520.1484 549.1756 861.1567 25 ab
# DT2 355.0438 504.9642 572.0732 551.9106 615.6621 927.3210 25 b
# app1 727.4513 766.3053 926.1888 910.3998 957.7610 1690.1540 25 c
# app2 472.5724 567.1121 633.5304 600.3779 634.3158 1135.7535 25 b
# magg_outer 384.0112 475.5070 600.6317 525.8936 676.7134 927.6736 25 b
# magg_tidy 520.6428 602.5028 695.5500 680.8821 748.8746 1180.1107 25 bc
# interaction 353.7317 481.4732 531.0035 518.7084 585.0872 693.5171 25 ab
# original 4965.1156 5358.8704 5914.3560 5780.6609 6074.7470 9024.6476 25 d
# rep 206.0964 236.5811 273.1093 252.8179 285.0910 455.1776 25 a
# Reduce 322.0695 390.2595 446.3948 424.9185 508.5235 621.1878 25 ab

What about using outer()? Your two examples become
x <- 1:4
y <- 1:2
as.vector(outer(x, y, paste, sep = "_"))
## [1] "1_1" "2_1" "3_1" "4_1" "1_2" "2_2" "3_2" "4_2"
library(magrittr)
x <- c("avg", "median")
y <- c("male", "female")
z <- c("height", "weight")
outer(x, y, paste, sep = "_") %>% outer(z, paste, sep = "_") %>% as.vector
## [1] "avg_male_height" "median_male_height" "avg_female_height" "median_female_height" "avg_male_weight"
## [6] "median_male_weight" "avg_female_weight" "median_female_weight"
The second example can be simplified a little with Reduce():
Reduce(function(a, b) outer(a, b, paste, sep = "_"), list(x, y, z)) %>% as.vector
It's not efficient, however. Using microbenchmark, I find that your solution using rep() is about 10 times faster.

Related

How to apply one more for loop in R?

I would like to write a for loop in R style (lapply?) to avoid the following repetitive code.
df1$fusion <- apply(df1[, cols], 1, paste, collapse = "-" )
df2$fusion <- apply(df2[, cols], 1, paste, collapse = "-" )
df3$fusion <- apply(df3[, cols], 1, paste, collapse = "-" )
df4$fusion <- apply(df4[, cols], 1, paste, collapse = "-" )
df5$fusion <- apply(df5[, cols], 1, paste, collapse = "-" )
df6$fusion <- apply(df6[, cols], 1, paste, collapse = "-" )
df7$fusion <- apply(df7[, cols], 1, paste, collapse = "-" )
df8$fusion <- apply(df8[, cols], 1, paste, collapse = "-" )
df9$fusion <- apply(df9[, cols], 1, paste, collapse = "-" )
df10$fusion <- apply(df10[, cols], 1, paste, collapse = "-" )
df11$fusion <- apply(df11[, cols], 1, paste, collapse = "-" )
df12$fusion <- apply(df12[, cols], 1, paste, collapse = "-" )
How do I do it?
Something like in shell script style?
df_ls=("df1 df2 df3 df4 df5 df6 df7 df8 df9 df10 df11 df12")
for i in $df_ls
do
${i}$fusion <- apply(${i}[, cols], 1, paste, collapse = "-" )
done
With lists. For example in your case.
list_df_general <- list(df, df2,...,df12) # load your date frames
lista_new <- list() # list empty
for(i in 12){
lista_new[[i]] <- apply(lista_df_general[[i]][, cols], 1, paste, collapse = "-" )
}
For browser inside list you should do
list_new[[1]] or list_new[[2]] with 3,4,...,12.
You can use something like
my_Fun <- function(df, cols)
{
df$fusion <- apply(df[, cols], 1, paste, collapse = "-" )
}
for(i in 1 : 12)
{
variable_Name <- paste0("df", i)
assign(x = variable_Name, value = my_Fun(df = get(x = variable_Name), cols = cols))
}
You can use a combination of lapply and apply on the list of dataframes. This will return a list of dataframes.
df_ls <- mget(paste0("df", 1:12))
lapply(df_ls, function(x) {x$fusion <- apply(x[, cols], 1, paste, collapse = "-" ); x})
Multiple data frames in a list are always handier, find them in the .GlobalEnv and put them into one using mget. Now we may use lapply to apply any function to this list; we will define an anonymous function that uses Reduce which conveniently does the looping job to paste0 over the rows and columns for us, and is fast.
df_lst <- mget(paste0('df', 1:n)) ## put DFs in list
cols <- paste0('X', 2:5) ## define columns, say 2 to 5
res <- lapply(df_lst, function(df) {df$fusion <- Reduce(\(x, y) paste0(x, '-', y), df[cols]); df})
res
# $df1
# X1 X2 X3 X4 X5 fusion
# 1 A A T G C A-T-G-C
# 2 A T T C G T-T-C-G
# 3 A G A G A G-A-G-A
#
# $df2
# X1 X2 X3 X4 X5 fusion
# 1 A T C C C T-C-C-C
# 2 T T A G A T-A-G-A
# 3 G C A A A C-A-A-A
#
# $df3
# X1 X2 X3 X4 X5 fusion
# 1 G G A T T G-A-T-T
# 2 T C T C T C-T-C-T
# 3 G T G C G T-G-C-G
#
# $df4
# X1 X2 X3 X4 X5 fusion
# 1 G T A G G T-A-G-G
# 2 G A G T C A-G-T-C
# 3 T G T C A G-T-C-A
If for some reason, you need the data frames in the .GlobalEnv you may use list2env(res, .GlobalEnv) thereafter (similar as I did with the example data below).
Data:
set.seed(42)
n <- 4
replicate(n, data.frame(matrix(sample(LETTERS[c(1, 20, 3, 7)], 15, replace=TRUE), 3, 5)), simplify=FALSE) |>
setNames(paste0('df', seq_len(n))) |> list2env(.GlobalEnv)

Find the union and intersection of grouped variables

I have two vectors that are binned.
Basically, I want a function to find the union and intersection of these two vectors (output).
It seems there is no function that supports this feature. Any idea of how i can carry out the desired output vector?
example1 <- c("18--25", "26--30", "31--50", "51+")
example2 <- c("18--23", "24--30", "31--65", "66+")
output <- c("18--23", "24--25", "26--30", "31--50", "51--65", "66+")
We can remove duplicates and combine a sorted vector every 2 elements like this (R version 4.0 and later for pipe |>):
f <- function(x, y, sep, max){
m <- paste0("\\", max)
gsub(m, "", c(x, y)) |>
strsplit(sep, fixed = T) |>
unlist(use.names = F) |>
sort() |>
unique() |>
as.numeric() |>
(\(.) tapply(., gl(length(.), 2, length(.)), paste, collapse = sep, simplify = T))() |>
(\(.) .[!is.na(.)])() |>
as.character() |>
(\(.) {.[length(.)] <- paste0(.[length(.)], max) ; .})()
}
# for older R versions
f <- function(x, y, sep, max){
x <- gsub(paste0("\\", max), "", c(x, y))
x <- as.numeric(unique(sort(unlist(strsplit(x, sep, T), use.names = F))))
x <- tapply(x, gl(length(x), 2L, length(x)), paste, collapse = sep, simplify = T)
x <- as.character(x[!is.na(x)])
x[length(x)] <- paste0(x[length(x)], max)
x
}
f(example1, example2, "--", "+")
[1] "18--23" "24--25" "26--30" "31--50" "51--65" "66+"

Find variables that occur only in ONE row in R

Using BASE R, I wonder how to answer the following question:
Are there any value on X or Y that occurs only in one row but not others? If yes, produce my desired output below.
f <- data.frame(id = c(rep("AA",4), rep("BB",2), rep("CC",2)), X = c(1,2,2,3,1,4,3,3),
Y = c(99,7,8,7,6,7,7,7))
Desired output:
list(BB = c(X = 4, Y = 6), AA = c(Y = c(99, 8)))
# $BB
# X Y
# 4 6
# $AA
# Y1 Y2 # Would be a plus if shows `Y Y` instead of `Y1 Y2`
# 99 8
There are two big ideas with this base approach:
Since we need to compare all the values, we should just recombine everything into one data.frame.
Making the unsplit data.frame long will save us some extra steps.
#https://stackoverflow.com/questions/58786052/find-variables-that-occur-only-once-across-a-split-data-frame-in-r/58788854#58788854
f <- data.frame(id = c(rep("AA",4), rep("BB",2), rep("CC",2)), X = c(1,2,2,3,1,4,3,3),
Y = c(99,7,8,7,6,7,7,7))
m <- split(f, f$id) # Here is `m`
unsplit <- do.call(rbind, c(m, make.row.names = F))
molten <- data.frame(unsplit[, 1, drop = F], stack(unsplit[, -1]))
# res <- subset(molten, !duplicated(values) & !duplicated(values, fromLast = T))
res <- molten[as.logical(ave(molten[['values']], molten[['ind']], FUN = function(x) !duplicated(x) & !duplicated(x, fromLast = T))), ]
#I would stop here
res
#> id values ind
#> 6 BB 4 X
#> 9 AA 99 Y
#> 11 AA 8 Y
#> 13 BB 6 Y
#to get exact output
res_vector <- res$values
names(res_vector) <- res$ind
split(res_vector, as.character(res$id))
#> $AA
#> Y Y
#> 99 8
#>
#> $BB
#> X Y
#> 4 6
Created on 2019-11-10 by the reprex package (v0.3.0)
Here's another base approach that may be less complicated:
####Way 1 with rapply
vec <- rapply(lapply(m, '[', mods), I)
unique_vec <- vec[!duplicated(vec) & !duplicated(vec, fromLast = T)]
vec_names <- do.call(rbind, strsplit(names(unique_vec), '.', fixed = T))
names(unique_vec) <- substr(vec_names[, 2], 1, 1) #turns Y1 into Y
split(unique_vec, vec_names[, 1])
###Way 2 with data.frame already do.call(rbind, m)
vec <- unlist(
lapply(f[, -1],
function(x){
ind <- !duplicated(x) & !duplicated(x, fromLast = T)
ret <- x[ind]
names(ret) <- f[ind, 1]
ret
}
)
)
#this is likely overly simplified:
split(vec, sub('.*\\.', '', names(vec)))
#this leads to exact result
vec_names <- do.call(rbind, strsplit(names(vec), '.', fixed = T))
names(vec) <- vec_names[, 1]
split(vec, vec_names[, 2])
$AA
Y Y
99 8
$BB
X Y
4 6
OP brings up using table() in a hint. duplicated() is very performant:
unlist(lapply(f[mods], function(y) names(which(table(y) == 1))))
# X Y1 Y2 Y3
# "4" "6" "8" "99"
vec
#X.BB Y.AA Y.AA Y.BB
# 4 99 8 6
# A tibble: 2 x 13
expression min median `itr/sec` mem_alloc
<bch:expr> <bch> <bch:> <dbl> <bch:byt>
1 table_meth 321us 336us 2794. 10.3KB
2 dup_meth 132us 136us 7105. 31.7KB
bench::mark(
table_meth = {unlist(lapply(f[mods], function(y) names(which(table(y) == 1))))},
dup_meth = {
#could get slight performance boost with
#f_id <- f[['id']]
unlist(
lapply(f[, -1],
function(x){
ind <- !duplicated(x) & !duplicated(x, fromLast = T)
ret <- x[ind]
names(ret) <- f[ind, 1]
#names(ret) <- f_id[ind]
ret
}
)
)}
, check = F
)
And similar idea in data.table:
library(data.table)
molten_dt <- melt(rbindlist(m), id.vars = 'id')
molten_dt[!duplicated(value, by = variable) &
!duplicated(value, by = variable, fromLast = T)]
And similar idea in dplyr:
library(dplyr)
library(tidyr)
m%>%
bind_rows()%>%
pivot_longer(cols = -id)%>%
group_by(name)%>%
filter(!duplicated(value) & !duplicated(value, fromLast = T))%>%
group_by(id)%>%
group_split()
It's not pure functional programming but it is base R:
lapply(split(df, df$id), function(z){
X <- z$X[which(!(z$X %in% df$X[duplicated(df$X)]))]
Y <- z$Y[which(!(z$Y %in% df$Y[duplicated(df$Y)]))]
cbind(X, Y)
}
)
Data:
df <-
structure(list(
id = structure(
c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L),
.Label = c("AA", "BB", "CC"),
class = "factor"
),
X = c(1,
2, 2, 3, 1, 4, 3, 3),
Y = c(99, 7, 8, 7, 6, 7, 7, 7)
),
class = "data.frame",
row.names = c(NA,-8L))

Efficient way to paste multiple column pairs in R data.table

I'm looking for an efficient way to paste/combine multiple pairs of adjacent columns at once using data.table. My feeble attempt is slow and not so elegant:
library(data.table)
dt <- data.table(ids = 1:3,
x1 = c("A","B","C"),
x2 = 1:3,
y1 = c("D", "E", "F"),
y2 = 4:6,
z1 = c("G", "H", "I"),
z3 = 7:9)
paste.pairs <- function(x, sep = "-"){
xx <- unlist(x)
x.len <- length(x)
r <- rep(NA, x.len/2)
s <- seq(1, x.len, by = 2)
for(i in 1:(x.len/2)) {
r[i] <- paste(xx[i], xx[i+1], sep = sep)
}
return(as.list(r))
}
dt[, paste.pairs(.SD), by = "ids"]
Is there a better way?
An option with Map by creating column index with seq
i1 <- seq(1, length(dt)-1, 2)
i2 <- seq(2, length(dt)-1, 2)
dt[, Map(paste,
.SD[, i1, with = FALSE], .SD[, i2, with = FALSE],
MoreArgs = list(sep="-")),
by = "ids"]
Another option would be to split by the names of the dataset and then paste
data.frame(lapply(split.default(dt[, -1, with = FALSE],
sub("\\d+$", "", names(dt)[-1])), function(x) do.call(paste, c(x, sep="-"))))
# x y z
#1 A-1 D-4 G-7
#2 B-2 E-5 H-8
#3 C-3 F-6 I-9
Or another option is with melt/dcast
dcast(melt(dt, id.var = 'ids')[, paste(value, collapse = "-"),
.(grp = sub("\\d+", "", variable), ids)], ids ~ grp, value.var = 'V1')
a solution using matrices
#create matrices
#use the columns you want to paste together...
m1 <- as.matrix( dt[,c(2,4,6)] )
m2 <- as.matrix( dt[, c(3,5,7)] )
#paste the matrices element-by-element, and convert result back to data.table
as.data.table( matrix( paste( m1, m2, sep="-"), nrow=nrow(m1), dimnames=dimnames(m1) ) )
Should run pretty fast, and is very readable and easy to adapt.
output
# x1 y1 z1
# 1: A-1 D-4 G-7
# 2: B-2 E-5 H-8
# 3: C-3 F-6 I-9
benchmarks
microbenchmark::microbenchmark(
wimpel = {
#create matrices
m1 <- as.matrix( dt[,c(2,4,6)] )
m2 <- as.matrix( dt[, c(3,5,7)] )
#paste the matrices element-by-element, and comvert to data.table
as.data.table( matrix( paste( m1, m2, sep="-"), nrow=nrow(m1), dimnames=dimnames(m1) ) )
},
akrun_df = {
data.frame(lapply(split.default(dt[, -1, with = FALSE],
sub("\\d+$", "", names(dt)[-1])), function(x) do.call(paste, c(x, sep="-"))))
},
akrun_map = {
i1 <- seq(2, length(dt), 2)
i2 <- seq(3, length(dt), 2)
dt[, Map(paste, .SD[, i1, with = FALSE], .SD[, i2, with = FALSE], MoreArgs = list(sep="-"))]
},
akrun_dcast = {
dcast(melt(dt, id.var = 'ids')[, paste(value, collapse = "-"),.(grp = sub("\\d+", "", variable), ids)], ids ~ grp, value.var = 'V1')
},
times = 10 )
# Unit: microseconds
# expr min lq mean median uq max neval
# wimpel 303.072 315.122 341.2417 319.1895 327.775 531.429 10
# akrun_df 1022.790 1028.515 1251.7812 1069.1850 1172.519 2779.460 10
# akrun_map 742.013 751.051 785.6059 778.1650 799.855 884.812 10
# akrun_dcast 4104.719 4175.215 4414.6596 4348.7430 4650.911 4939.221 10

How to summarize over multiple columns programatically using ddply?

Is it possible to specify which columns to aggregate over with ddply from the arguments of a function, without using eval + parse? Heres what I have so far:
x <- c(2,4,3,1,5,7)
y <- c(3,2,6,3,4,6)
group1 <- c("A","A","A","A","B","B")
group2 <- c("X","X","Y","Y","Z","X")
data <- data.frame(group1, group2, x, y)
Heres what I want the output to be:
aggFunction <- function(dataframe, toAverage, toGroup) {
out <- ddply(dataframe, toGroup, summarise,
x = mean(x),
y = mean(y))
return(out)
}
aggFunction(data, c("x", "y"), c("group1", "group2"))
# group1 group2 x y
# 1 A X 3 2.5
# 2 A Y 2 4.5
# 3 B X 7 6.0
# 4 B Z 5 4.0
Here's my solution using parse(eval):
aggFunction <- function(dataframe, toAverage, toGroup) {
toAverageArgs <- paste(toAverage, " = mean(", toAverage, ")", sep = "", collapse = ", ")
out <- eval(parse(text = paste("ddply(dataframe, toGroup, here(summarize),", toAverageArgs, ")")))
return(out)
}
This gives me the output that I want.
I'm wondering if there's a better way to do this. I'm aware of using do.call() and get(), but none of my attempts with these have worked.
Heres an attempt;
get(string) didn't work, but here(summarize) let me get the string value. Unfortunately, this means ddply treats them as strings:
aggFunction <- function(dataframe, toAverage, toGroup) {
string <- paste(toAverage, " = mean(", toAverage, ")", sep = "", collapse = ", ")
out <- ddply(dataframe, toGroup, here(summarise), string)
return(out)
}
aggFunction(data, c("x", "y"), c("group1", "group2"))
# group1 group2 ..2
# 1 A X x = mean(x), y = mean(y)
# 2 A Y x = mean(x), y = mean(y)
# 3 B X x = mean(x), y = mean(y)
# 4 B Z x = mean(x), y = mean(y)
Also tried do.call, but they were still treated as strings:
aggFunction <- function(dataframe, toAverage, toGroup) {
string <- paste(toAverage, " = mean(", toAverage, ")", sep = "", collapse = ", ")
print(string)
args <- list(dataframe, toGroup, here(summarise), string)
out <- do.call(ddply, args)
return(out)
}
aggFunction(data, c("x", "y"), c("group1", "group2"))
# group1 group2 "x = mean(x), y = mean(y)"
# 1 A X x = mean(x), y = mean(y)
# 2 A Y x = mean(x), y = mean(y)
# 3 B X x = mean(x), y = mean(y)
# 4 B Z x = mean(x), y = mean(y)
Lastly I tried hardcoding in mean(), but then I couldnt set the column name. If I use get(testVar) = mean(get(testVar)) I get unexpected =.
aggFunction <- function(dataframe, toAverage, toGroup) {
testVar <- "x"
out <- ddply(dataframe, toGroup, here(summarise),
get(testVar) = mean(get(testVar))
##
return(out)
}
Using aggregate in base R
aggFunction <- function(dataframe, toAverage, toGroup) {
aggregate(dataframe[, toAverage], dataframe[, toGroup], mean)
}
aggFunction(data, c("x", "y"), c("group1", "group2"))
group1 group2 x y
1 A X 3 2.5
2 B X 7 6.0
3 A Y 2 4.5
4 B Z 5 4.0
You can consider dplyr package - generally it's much faster than plyr and also has pretty syntax.
library(dplyr)
x <- c(2,4,3,1,5,7)
y <- c(3,2,6,3,4,6)
group1 <- c("A","A","A","A","B","B")
group2 <- c("X","X","Y","Y","Z","X")
aggFunction <- function(dataframe, toAverage, toGroup) {
dataframe %>%
group_by_(.dots = toGroup) %>%
summarise_(.dots = setNames(sprintf("mean(%s)", toAverage), toAverage))
}
data <- data.frame(group1, group2, x, y)
aggFunction(data, c("x", "y"), c("group1", "group2"))
It gives:
group1 group2 x y
1 A X 3 2.5
2 A Y 2 4.5
3 B X 7 6.0
4 B Z 5 4.0
This is a lot easier if you melt the data frame first, do the calculations when in long format, and then cast back.
library(reshape2)
library(plyr)
aggFunction <- function(d1, toAverage, toGroup) {
d2 <- melt(d1, id.vars=toGroup, measure.vars=toAverage)
d3 <- ddply(d2, ~group1 + group2 + variable, summarize, mean=mean(value))
dcast(d3, group1 + group2 ~ variable, value.var="mean")
}
aggFunction(data, c("x", "y"), c("group1", "group2"))
## group1 group2 x y
## 1 A X 3 2.5
## 2 A Y 2 4.5
## 3 B X 7 6.0
## 4 B Z 5 4.0

Resources