If we want to get all combinations of two vectors, we can use rep/recycling rules:
x <- 1:4
y <- 1:2
cbind(rep(x, each = length(y)), rep(y, length(x)))
# [,1] [,2]
# [1,] 1 1
# [2,] 1 2
# [3,] 2 1
# [4,] 2 2
# [5,] 3 1
# [6,] 3 2
# [7,] 4 1
# [8,] 4 2
But expand.grid is much nicer -- it handles all the repetition for us.
expand.grid(x, y)
# Var1 Var2
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 1
# 5 1 2
# 6 2 2
# 7 3 2
# 8 4 2
Is there a simple version of this for concatenating strings? Like paste.grid? I have a named object where a lot of the objects have names like x_y_z where x, y, and z vary like x and y above.
For example, suppose x can be "avg" or "median", y can be "male" or "female", and z can be "height" or "weight". How can we concisely get all 8 combinations of the three?
Using rep is a pain:
x <- c("avg", "median")
y <- c("male", "female")
z <- c("height", "weight")
paste(rep(x, each = length(y) * length(z)),
rep(rep(y, each = length(z)), length(x)),
rep(z, length(x) * length(y)), sep = "_")
And repurposing expand.grid is a bit clunky (and probably inefficient):
apply(expand.grid(x, y, z), 1, paste, collapse = "_")
Am I missing something? Is there a better way to do this?
Yes, this is what interaction does
levels(interaction(x,y,z,sep='_'))
The implementation is pretty much the same as your rep code.
Outputs:
[1] "avg_female_height" "median_female_height" "avg_male_height" "median_male_height" "avg_female_weight"
[6] "median_female_weight" "avg_male_weight" "median_male_weight"
Using data.table's CJ cross-joining function:
library(data.table)
CJ(x,y,z)[, paste(V1,V2,V3, sep = "_")]
#[1] "avg_female_height" "avg_female_weight" "avg_male_height" "avg_male_weight"
#[5] "median_female_height" "median_female_weight" "median_male_height" "median_male_weight"
Or a variation of your apply approach would be:
do.call(paste, c(expand.grid(x, y, z), sep = "_"))
#[1] "avg_male_height" "median_male_height" "avg_female_height" "median_female_height"
#[5] "avg_male_weight" "median_male_weight" "avg_female_weight" "median_female_weight"
Rudimentary (microbenchmark::microbenchmark) benchmarking shows a pretty significant speed-up by using:
library(tidyr)
library(magrittr)
df <- data.frame(x, y, z)
df %>%
complete(x, y, z) %>%
unite("combo", x, y, z, sep = "_")
A bit slower, but perhaps more straight forward and vectorized variant the apply technique:
df <- expand.grid(x, y, z)
df$combo <- paste(df$Var1, df$Var1, df$Var3, sep = "_")
Someone should chime in with a data.table approach...
Benchmarking: Small Grid (256 elements)
set.seed(21034)
x <- sample(letters, 4, TRUE)
y <- sample(letters, 4, TRUE)
z <- sample(letters, 4, TRUE)
a <- sample(letters, 4, TRUE)
library(data.table)
library(microbenchmark)
library(magrittr)
library(tidyr)
microbenchmark(times = 25L,
DT1 = CJ(x, y, z, a)[ , paste(V1, V2, V3, V4, sep = "_")],
DT2 = CJ(x, y, z, a)[ , do.call(paste, c(.SD, sep = "_"))],
app1 = do.call(paste, c(expand.grid(x, y, z, a), sep = "_")),
app2 = paste((df <- expand.grid(x, y, z, a))$Var1,
df$Var2, df$Var3, sep = "_"),
magg_outer = outer(x, y, paste, sep = "_") %>%
outer(z, paste, sep = "_") %>%
outer(a, paste, sep = "_") %>% as.vector,
magg_tidy = data.frame(x, y, z, a) %>%
complete(x, y, z, a) %>%
unite("combo", x, y, z, a, sep = "_"),
interaction = levels(interaction(x, y, z, a, sep = "_")),
original = apply(expand.grid(x, y, z, a), 1, paste, collapse = "_"),
rep = paste(rep(x, each = (ny <- length(y)) * (nz <- length(z)) *
(na <- length(a))),
rep(rep(y, each = nz * na), (nx <- length(x))),
rep(rep(z, each = na), nx * ny), sep = "_"),
Reduce = Reduce(function(x, y) paste(rep(x, each = length(y)),
rep(y, length(x)), sep = "_"),
list(x, y, z, a)))
# Unit: microseconds
# expr min lq mean median uq max neval cld
# DT1 529.578 576.6400 624.00002 589.8270 604.9845 5449.287 1000 d
# DT2 561.028 606.4220 639.94659 620.4335 636.2735 5484.514 1000 d
# app1 201.043 225.4475 240.36960 233.4795 243.7090 4244.687 1000 b
# app2 196.692 225.6130 244.33543 234.0455 243.7925 4110.605 1000 b
# magg_outer 164.352 194.1395 205.30300 204.4220 211.1990 456.122 1000 b
# magg_tidy 1872.228 2038.1560 2150.98234 2067.8770 2126.1025 21891.884 1000 f
# interaction 254.885 295.1935 313.54392 306.6680 316.8095 4196.465 1000 c
# original 852.018 935.4960 976.24388 954.5115 972.5550 4973.724 1000 e
# rep 50.737 54.1515 60.22671 55.3660 56.9220 3823.655 1000 a
# Reduce 58.395 65.3860 68.46049 66.8920 68.5640 158.184 1000 a
Benchmarking: Large Grid (1,000,000 elements)
set.seed(21034)
x <- sprintf("%03d", sample(100))
y <- sprintf("%03d", sample(100))
z <- sprintf("%02d", sample(10))
a <- sprintf("%02d", sample(10))
library(data.table)
library(microbenchmark)
library(magrittr)
library(tidyr)
microbenchmark(times = 25L,
DT1 = CJ(x, y, z, a)[ , paste(V1, V2, V3, V4, sep = "_")],
DT2 = CJ(x, y, z, a)[ , do.call(paste, c(.SD, sep = "_"))],
app1 = do.call(paste, c(expand.grid(x, y, z, a), sep = "_")),
app2 = paste((df <- expand.grid(x, y, z, a))$Var1,
df$Var2, df$Var3, sep = "_"),
magg_outer = outer(x, y, paste, sep = "_") %>%
outer(z, paste, sep = "_") %>%
outer(a, paste, sep = "_") %>% as.vector,
magg_tidy = data.frame(x, y, z, a) %>%
complete(x, y, z, a) %>%
unite("combo", x, y, z, a, sep = "_"),
interaction = levels(interaction(x, y, z, a, sep = "_")),
original = apply(expand.grid(x, y, z, a), 1, paste, collapse = "_"),
rep = paste(rep(x, each = (ny <- length(y)) * (nz <- length(z)) *
(na <- length(a))),
rep(rep(y, each = nz * na), (nx <- length(x))),
rep(rep(z, each = na), nx * ny), sep = "_"),
Reduce = Reduce(function(x, y) paste(rep(x, each = length(y)),
rep(y, length(x)), sep = "_"),
list(x, y, z, a)))
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# DT1 360.6528 467.8408 517.4579 520.1484 549.1756 861.1567 25 ab
# DT2 355.0438 504.9642 572.0732 551.9106 615.6621 927.3210 25 b
# app1 727.4513 766.3053 926.1888 910.3998 957.7610 1690.1540 25 c
# app2 472.5724 567.1121 633.5304 600.3779 634.3158 1135.7535 25 b
# magg_outer 384.0112 475.5070 600.6317 525.8936 676.7134 927.6736 25 b
# magg_tidy 520.6428 602.5028 695.5500 680.8821 748.8746 1180.1107 25 bc
# interaction 353.7317 481.4732 531.0035 518.7084 585.0872 693.5171 25 ab
# original 4965.1156 5358.8704 5914.3560 5780.6609 6074.7470 9024.6476 25 d
# rep 206.0964 236.5811 273.1093 252.8179 285.0910 455.1776 25 a
# Reduce 322.0695 390.2595 446.3948 424.9185 508.5235 621.1878 25 ab
What about using outer()? Your two examples become
x <- 1:4
y <- 1:2
as.vector(outer(x, y, paste, sep = "_"))
## [1] "1_1" "2_1" "3_1" "4_1" "1_2" "2_2" "3_2" "4_2"
library(magrittr)
x <- c("avg", "median")
y <- c("male", "female")
z <- c("height", "weight")
outer(x, y, paste, sep = "_") %>% outer(z, paste, sep = "_") %>% as.vector
## [1] "avg_male_height" "median_male_height" "avg_female_height" "median_female_height" "avg_male_weight"
## [6] "median_male_weight" "avg_female_weight" "median_female_weight"
The second example can be simplified a little with Reduce():
Reduce(function(a, b) outer(a, b, paste, sep = "_"), list(x, y, z)) %>% as.vector
It's not efficient, however. Using microbenchmark, I find that your solution using rep() is about 10 times faster.
Related
I would like to write a for loop in R style (lapply?) to avoid the following repetitive code.
df1$fusion <- apply(df1[, cols], 1, paste, collapse = "-" )
df2$fusion <- apply(df2[, cols], 1, paste, collapse = "-" )
df3$fusion <- apply(df3[, cols], 1, paste, collapse = "-" )
df4$fusion <- apply(df4[, cols], 1, paste, collapse = "-" )
df5$fusion <- apply(df5[, cols], 1, paste, collapse = "-" )
df6$fusion <- apply(df6[, cols], 1, paste, collapse = "-" )
df7$fusion <- apply(df7[, cols], 1, paste, collapse = "-" )
df8$fusion <- apply(df8[, cols], 1, paste, collapse = "-" )
df9$fusion <- apply(df9[, cols], 1, paste, collapse = "-" )
df10$fusion <- apply(df10[, cols], 1, paste, collapse = "-" )
df11$fusion <- apply(df11[, cols], 1, paste, collapse = "-" )
df12$fusion <- apply(df12[, cols], 1, paste, collapse = "-" )
How do I do it?
Something like in shell script style?
df_ls=("df1 df2 df3 df4 df5 df6 df7 df8 df9 df10 df11 df12")
for i in $df_ls
do
${i}$fusion <- apply(${i}[, cols], 1, paste, collapse = "-" )
done
With lists. For example in your case.
list_df_general <- list(df, df2,...,df12) # load your date frames
lista_new <- list() # list empty
for(i in 12){
lista_new[[i]] <- apply(lista_df_general[[i]][, cols], 1, paste, collapse = "-" )
}
For browser inside list you should do
list_new[[1]] or list_new[[2]] with 3,4,...,12.
You can use something like
my_Fun <- function(df, cols)
{
df$fusion <- apply(df[, cols], 1, paste, collapse = "-" )
}
for(i in 1 : 12)
{
variable_Name <- paste0("df", i)
assign(x = variable_Name, value = my_Fun(df = get(x = variable_Name), cols = cols))
}
You can use a combination of lapply and apply on the list of dataframes. This will return a list of dataframes.
df_ls <- mget(paste0("df", 1:12))
lapply(df_ls, function(x) {x$fusion <- apply(x[, cols], 1, paste, collapse = "-" ); x})
Multiple data frames in a list are always handier, find them in the .GlobalEnv and put them into one using mget. Now we may use lapply to apply any function to this list; we will define an anonymous function that uses Reduce which conveniently does the looping job to paste0 over the rows and columns for us, and is fast.
df_lst <- mget(paste0('df', 1:n)) ## put DFs in list
cols <- paste0('X', 2:5) ## define columns, say 2 to 5
res <- lapply(df_lst, function(df) {df$fusion <- Reduce(\(x, y) paste0(x, '-', y), df[cols]); df})
res
# $df1
# X1 X2 X3 X4 X5 fusion
# 1 A A T G C A-T-G-C
# 2 A T T C G T-T-C-G
# 3 A G A G A G-A-G-A
#
# $df2
# X1 X2 X3 X4 X5 fusion
# 1 A T C C C T-C-C-C
# 2 T T A G A T-A-G-A
# 3 G C A A A C-A-A-A
#
# $df3
# X1 X2 X3 X4 X5 fusion
# 1 G G A T T G-A-T-T
# 2 T C T C T C-T-C-T
# 3 G T G C G T-G-C-G
#
# $df4
# X1 X2 X3 X4 X5 fusion
# 1 G T A G G T-A-G-G
# 2 G A G T C A-G-T-C
# 3 T G T C A G-T-C-A
If for some reason, you need the data frames in the .GlobalEnv you may use list2env(res, .GlobalEnv) thereafter (similar as I did with the example data below).
Data:
set.seed(42)
n <- 4
replicate(n, data.frame(matrix(sample(LETTERS[c(1, 20, 3, 7)], 15, replace=TRUE), 3, 5)), simplify=FALSE) |>
setNames(paste0('df', seq_len(n))) |> list2env(.GlobalEnv)
I have two vectors that are binned.
Basically, I want a function to find the union and intersection of these two vectors (output).
It seems there is no function that supports this feature. Any idea of how i can carry out the desired output vector?
example1 <- c("18--25", "26--30", "31--50", "51+")
example2 <- c("18--23", "24--30", "31--65", "66+")
output <- c("18--23", "24--25", "26--30", "31--50", "51--65", "66+")
We can remove duplicates and combine a sorted vector every 2 elements like this (R version 4.0 and later for pipe |>):
f <- function(x, y, sep, max){
m <- paste0("\\", max)
gsub(m, "", c(x, y)) |>
strsplit(sep, fixed = T) |>
unlist(use.names = F) |>
sort() |>
unique() |>
as.numeric() |>
(\(.) tapply(., gl(length(.), 2, length(.)), paste, collapse = sep, simplify = T))() |>
(\(.) .[!is.na(.)])() |>
as.character() |>
(\(.) {.[length(.)] <- paste0(.[length(.)], max) ; .})()
}
# for older R versions
f <- function(x, y, sep, max){
x <- gsub(paste0("\\", max), "", c(x, y))
x <- as.numeric(unique(sort(unlist(strsplit(x, sep, T), use.names = F))))
x <- tapply(x, gl(length(x), 2L, length(x)), paste, collapse = sep, simplify = T)
x <- as.character(x[!is.na(x)])
x[length(x)] <- paste0(x[length(x)], max)
x
}
f(example1, example2, "--", "+")
[1] "18--23" "24--25" "26--30" "31--50" "51--65" "66+"
Using BASE R, I wonder how to answer the following question:
Are there any value on X or Y that occurs only in one row but not others? If yes, produce my desired output below.
f <- data.frame(id = c(rep("AA",4), rep("BB",2), rep("CC",2)), X = c(1,2,2,3,1,4,3,3),
Y = c(99,7,8,7,6,7,7,7))
Desired output:
list(BB = c(X = 4, Y = 6), AA = c(Y = c(99, 8)))
# $BB
# X Y
# 4 6
# $AA
# Y1 Y2 # Would be a plus if shows `Y Y` instead of `Y1 Y2`
# 99 8
There are two big ideas with this base approach:
Since we need to compare all the values, we should just recombine everything into one data.frame.
Making the unsplit data.frame long will save us some extra steps.
#https://stackoverflow.com/questions/58786052/find-variables-that-occur-only-once-across-a-split-data-frame-in-r/58788854#58788854
f <- data.frame(id = c(rep("AA",4), rep("BB",2), rep("CC",2)), X = c(1,2,2,3,1,4,3,3),
Y = c(99,7,8,7,6,7,7,7))
m <- split(f, f$id) # Here is `m`
unsplit <- do.call(rbind, c(m, make.row.names = F))
molten <- data.frame(unsplit[, 1, drop = F], stack(unsplit[, -1]))
# res <- subset(molten, !duplicated(values) & !duplicated(values, fromLast = T))
res <- molten[as.logical(ave(molten[['values']], molten[['ind']], FUN = function(x) !duplicated(x) & !duplicated(x, fromLast = T))), ]
#I would stop here
res
#> id values ind
#> 6 BB 4 X
#> 9 AA 99 Y
#> 11 AA 8 Y
#> 13 BB 6 Y
#to get exact output
res_vector <- res$values
names(res_vector) <- res$ind
split(res_vector, as.character(res$id))
#> $AA
#> Y Y
#> 99 8
#>
#> $BB
#> X Y
#> 4 6
Created on 2019-11-10 by the reprex package (v0.3.0)
Here's another base approach that may be less complicated:
####Way 1 with rapply
vec <- rapply(lapply(m, '[', mods), I)
unique_vec <- vec[!duplicated(vec) & !duplicated(vec, fromLast = T)]
vec_names <- do.call(rbind, strsplit(names(unique_vec), '.', fixed = T))
names(unique_vec) <- substr(vec_names[, 2], 1, 1) #turns Y1 into Y
split(unique_vec, vec_names[, 1])
###Way 2 with data.frame already do.call(rbind, m)
vec <- unlist(
lapply(f[, -1],
function(x){
ind <- !duplicated(x) & !duplicated(x, fromLast = T)
ret <- x[ind]
names(ret) <- f[ind, 1]
ret
}
)
)
#this is likely overly simplified:
split(vec, sub('.*\\.', '', names(vec)))
#this leads to exact result
vec_names <- do.call(rbind, strsplit(names(vec), '.', fixed = T))
names(vec) <- vec_names[, 1]
split(vec, vec_names[, 2])
$AA
Y Y
99 8
$BB
X Y
4 6
OP brings up using table() in a hint. duplicated() is very performant:
unlist(lapply(f[mods], function(y) names(which(table(y) == 1))))
# X Y1 Y2 Y3
# "4" "6" "8" "99"
vec
#X.BB Y.AA Y.AA Y.BB
# 4 99 8 6
# A tibble: 2 x 13
expression min median `itr/sec` mem_alloc
<bch:expr> <bch> <bch:> <dbl> <bch:byt>
1 table_meth 321us 336us 2794. 10.3KB
2 dup_meth 132us 136us 7105. 31.7KB
bench::mark(
table_meth = {unlist(lapply(f[mods], function(y) names(which(table(y) == 1))))},
dup_meth = {
#could get slight performance boost with
#f_id <- f[['id']]
unlist(
lapply(f[, -1],
function(x){
ind <- !duplicated(x) & !duplicated(x, fromLast = T)
ret <- x[ind]
names(ret) <- f[ind, 1]
#names(ret) <- f_id[ind]
ret
}
)
)}
, check = F
)
And similar idea in data.table:
library(data.table)
molten_dt <- melt(rbindlist(m), id.vars = 'id')
molten_dt[!duplicated(value, by = variable) &
!duplicated(value, by = variable, fromLast = T)]
And similar idea in dplyr:
library(dplyr)
library(tidyr)
m%>%
bind_rows()%>%
pivot_longer(cols = -id)%>%
group_by(name)%>%
filter(!duplicated(value) & !duplicated(value, fromLast = T))%>%
group_by(id)%>%
group_split()
It's not pure functional programming but it is base R:
lapply(split(df, df$id), function(z){
X <- z$X[which(!(z$X %in% df$X[duplicated(df$X)]))]
Y <- z$Y[which(!(z$Y %in% df$Y[duplicated(df$Y)]))]
cbind(X, Y)
}
)
Data:
df <-
structure(list(
id = structure(
c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L),
.Label = c("AA", "BB", "CC"),
class = "factor"
),
X = c(1,
2, 2, 3, 1, 4, 3, 3),
Y = c(99, 7, 8, 7, 6, 7, 7, 7)
),
class = "data.frame",
row.names = c(NA,-8L))
I'm looking for an efficient way to paste/combine multiple pairs of adjacent columns at once using data.table. My feeble attempt is slow and not so elegant:
library(data.table)
dt <- data.table(ids = 1:3,
x1 = c("A","B","C"),
x2 = 1:3,
y1 = c("D", "E", "F"),
y2 = 4:6,
z1 = c("G", "H", "I"),
z3 = 7:9)
paste.pairs <- function(x, sep = "-"){
xx <- unlist(x)
x.len <- length(x)
r <- rep(NA, x.len/2)
s <- seq(1, x.len, by = 2)
for(i in 1:(x.len/2)) {
r[i] <- paste(xx[i], xx[i+1], sep = sep)
}
return(as.list(r))
}
dt[, paste.pairs(.SD), by = "ids"]
Is there a better way?
An option with Map by creating column index with seq
i1 <- seq(1, length(dt)-1, 2)
i2 <- seq(2, length(dt)-1, 2)
dt[, Map(paste,
.SD[, i1, with = FALSE], .SD[, i2, with = FALSE],
MoreArgs = list(sep="-")),
by = "ids"]
Another option would be to split by the names of the dataset and then paste
data.frame(lapply(split.default(dt[, -1, with = FALSE],
sub("\\d+$", "", names(dt)[-1])), function(x) do.call(paste, c(x, sep="-"))))
# x y z
#1 A-1 D-4 G-7
#2 B-2 E-5 H-8
#3 C-3 F-6 I-9
Or another option is with melt/dcast
dcast(melt(dt, id.var = 'ids')[, paste(value, collapse = "-"),
.(grp = sub("\\d+", "", variable), ids)], ids ~ grp, value.var = 'V1')
a solution using matrices
#create matrices
#use the columns you want to paste together...
m1 <- as.matrix( dt[,c(2,4,6)] )
m2 <- as.matrix( dt[, c(3,5,7)] )
#paste the matrices element-by-element, and convert result back to data.table
as.data.table( matrix( paste( m1, m2, sep="-"), nrow=nrow(m1), dimnames=dimnames(m1) ) )
Should run pretty fast, and is very readable and easy to adapt.
output
# x1 y1 z1
# 1: A-1 D-4 G-7
# 2: B-2 E-5 H-8
# 3: C-3 F-6 I-9
benchmarks
microbenchmark::microbenchmark(
wimpel = {
#create matrices
m1 <- as.matrix( dt[,c(2,4,6)] )
m2 <- as.matrix( dt[, c(3,5,7)] )
#paste the matrices element-by-element, and comvert to data.table
as.data.table( matrix( paste( m1, m2, sep="-"), nrow=nrow(m1), dimnames=dimnames(m1) ) )
},
akrun_df = {
data.frame(lapply(split.default(dt[, -1, with = FALSE],
sub("\\d+$", "", names(dt)[-1])), function(x) do.call(paste, c(x, sep="-"))))
},
akrun_map = {
i1 <- seq(2, length(dt), 2)
i2 <- seq(3, length(dt), 2)
dt[, Map(paste, .SD[, i1, with = FALSE], .SD[, i2, with = FALSE], MoreArgs = list(sep="-"))]
},
akrun_dcast = {
dcast(melt(dt, id.var = 'ids')[, paste(value, collapse = "-"),.(grp = sub("\\d+", "", variable), ids)], ids ~ grp, value.var = 'V1')
},
times = 10 )
# Unit: microseconds
# expr min lq mean median uq max neval
# wimpel 303.072 315.122 341.2417 319.1895 327.775 531.429 10
# akrun_df 1022.790 1028.515 1251.7812 1069.1850 1172.519 2779.460 10
# akrun_map 742.013 751.051 785.6059 778.1650 799.855 884.812 10
# akrun_dcast 4104.719 4175.215 4414.6596 4348.7430 4650.911 4939.221 10
Is it possible to specify which columns to aggregate over with ddply from the arguments of a function, without using eval + parse? Heres what I have so far:
x <- c(2,4,3,1,5,7)
y <- c(3,2,6,3,4,6)
group1 <- c("A","A","A","A","B","B")
group2 <- c("X","X","Y","Y","Z","X")
data <- data.frame(group1, group2, x, y)
Heres what I want the output to be:
aggFunction <- function(dataframe, toAverage, toGroup) {
out <- ddply(dataframe, toGroup, summarise,
x = mean(x),
y = mean(y))
return(out)
}
aggFunction(data, c("x", "y"), c("group1", "group2"))
# group1 group2 x y
# 1 A X 3 2.5
# 2 A Y 2 4.5
# 3 B X 7 6.0
# 4 B Z 5 4.0
Here's my solution using parse(eval):
aggFunction <- function(dataframe, toAverage, toGroup) {
toAverageArgs <- paste(toAverage, " = mean(", toAverage, ")", sep = "", collapse = ", ")
out <- eval(parse(text = paste("ddply(dataframe, toGroup, here(summarize),", toAverageArgs, ")")))
return(out)
}
This gives me the output that I want.
I'm wondering if there's a better way to do this. I'm aware of using do.call() and get(), but none of my attempts with these have worked.
Heres an attempt;
get(string) didn't work, but here(summarize) let me get the string value. Unfortunately, this means ddply treats them as strings:
aggFunction <- function(dataframe, toAverage, toGroup) {
string <- paste(toAverage, " = mean(", toAverage, ")", sep = "", collapse = ", ")
out <- ddply(dataframe, toGroup, here(summarise), string)
return(out)
}
aggFunction(data, c("x", "y"), c("group1", "group2"))
# group1 group2 ..2
# 1 A X x = mean(x), y = mean(y)
# 2 A Y x = mean(x), y = mean(y)
# 3 B X x = mean(x), y = mean(y)
# 4 B Z x = mean(x), y = mean(y)
Also tried do.call, but they were still treated as strings:
aggFunction <- function(dataframe, toAverage, toGroup) {
string <- paste(toAverage, " = mean(", toAverage, ")", sep = "", collapse = ", ")
print(string)
args <- list(dataframe, toGroup, here(summarise), string)
out <- do.call(ddply, args)
return(out)
}
aggFunction(data, c("x", "y"), c("group1", "group2"))
# group1 group2 "x = mean(x), y = mean(y)"
# 1 A X x = mean(x), y = mean(y)
# 2 A Y x = mean(x), y = mean(y)
# 3 B X x = mean(x), y = mean(y)
# 4 B Z x = mean(x), y = mean(y)
Lastly I tried hardcoding in mean(), but then I couldnt set the column name. If I use get(testVar) = mean(get(testVar)) I get unexpected =.
aggFunction <- function(dataframe, toAverage, toGroup) {
testVar <- "x"
out <- ddply(dataframe, toGroup, here(summarise),
get(testVar) = mean(get(testVar))
##
return(out)
}
Using aggregate in base R
aggFunction <- function(dataframe, toAverage, toGroup) {
aggregate(dataframe[, toAverage], dataframe[, toGroup], mean)
}
aggFunction(data, c("x", "y"), c("group1", "group2"))
group1 group2 x y
1 A X 3 2.5
2 B X 7 6.0
3 A Y 2 4.5
4 B Z 5 4.0
You can consider dplyr package - generally it's much faster than plyr and also has pretty syntax.
library(dplyr)
x <- c(2,4,3,1,5,7)
y <- c(3,2,6,3,4,6)
group1 <- c("A","A","A","A","B","B")
group2 <- c("X","X","Y","Y","Z","X")
aggFunction <- function(dataframe, toAverage, toGroup) {
dataframe %>%
group_by_(.dots = toGroup) %>%
summarise_(.dots = setNames(sprintf("mean(%s)", toAverage), toAverage))
}
data <- data.frame(group1, group2, x, y)
aggFunction(data, c("x", "y"), c("group1", "group2"))
It gives:
group1 group2 x y
1 A X 3 2.5
2 A Y 2 4.5
3 B X 7 6.0
4 B Z 5 4.0
This is a lot easier if you melt the data frame first, do the calculations when in long format, and then cast back.
library(reshape2)
library(plyr)
aggFunction <- function(d1, toAverage, toGroup) {
d2 <- melt(d1, id.vars=toGroup, measure.vars=toAverage)
d3 <- ddply(d2, ~group1 + group2 + variable, summarize, mean=mean(value))
dcast(d3, group1 + group2 ~ variable, value.var="mean")
}
aggFunction(data, c("x", "y"), c("group1", "group2"))
## group1 group2 x y
## 1 A X 3 2.5
## 2 A Y 2 4.5
## 3 B X 7 6.0
## 4 B Z 5 4.0