Average columns in a dataframe by groups

Average columns in a dataframe by groups - r

My data.frame
df <- data.frame(ID=rep(1:3, 3), Obs_1=rnorm(9), Obs_2=rnorm(9), Obs_3=rnorm(9))
I want to calculate the mean of each colum by the ID. I have tried two approaches:
1)
ag <- aggregate(. ~ ID, df, function(x) c(mean = mean(x)))
returns
ID Obs_1 Obs_2 Obs_3
1 1 0.41220831 -0.9999704 -0.7234958
2 2 0.03564336 0.5014259 0.4847635
3 3 0.05647885 0.2067311 -0.0542368
Why does it does not indicate "mean" in the column name and how can make this reported? something like this:
Obs_1.mean Obs_2.mean Obs_3.mean
2) the second approach
df[, c(mean = lapply(.SD, mean), sd = lapply(.SD, sd)), by = ID]
it gives
unused argument (by = ID)
Thank you

Here is one approach:
df <- data.frame(ID=rep(1:3, 3),
Obs_1=rnorm(9),
Obs_2=rnorm(9),
Obs_3=rnorm(9))
ag <- aggregate(. ~ ID, df, FUN = mean)
colnames(ag) <- sapply(colnames(ag),
FUN = function(x) { paste0(x, ifelse(x=="ID", "", ".mean")) })
It yields:
ID Obs_1.mean Obs_2.mean Obs_3.mean
1 1 0.2675131 -0.3494175 -0.3128264
2 2 1.0988356 -0.6645116 -0.2192402
3 3 -1.1569562 -0.4040192 -0.1265475
About your question why the trick
FUN = function(x) { c(mean = mean(x)) })
to stick the name "mean" to the result of FUN does not work:
I suppose that is due to the grouping that aggregate() performs. Each group might in theory return a different name. Which of these potentially conflicting names should aggregate() use then to name the complete column of the aggregated dataframe...?
If no grouping happens, like here,
sapply(df, FUN = function(x) { c(mean = mean(x)) })
we get the expected names:
ID.mean Obs_1.mean Obs_2.mean Obs_3.mean
2.0000000 0.4265256 -0.2046452 -0.4496054

Related

pivot lapply data.table

given the following dt:
Category <- c('A','A','A','B','B','B','B','A','B','B')
Amount <- c(10,20,30,15,20,40, 50, 80,20,10)
ID <- c('x01','x01','x02','x03','x03','x04','x05','x06','x07','x08')
dt_1 <- data.table(Category, Amount, ID)
dt_1
I would like to get the following output:
Category <- c('A','B')
NumRecords <- c(4,6)
TotalAmount <- c(140,155)
CountUniqueID <- c(3,5)
dt_2 <- data.table(Category, NumRecords, TotalAmount, CountUniqueID)
dt_2
possibly extending adjusting the following code that uses lapply:
ColsBy <- c("Category")
ColsSummary <- c("Amount")
dt_2 <- dt_1[, lapply(.SD, sum, na.rm=TRUE), by = ColsBy, .SDcols = ColsSummary ]

dt_1[, .(NumRecords = .N,
TotalAmount = sum(Amount),
CountUniqueId = uniqueN(ID)),
by = .(Category)]
Category NumRecords TotalAmount CountUniqueId
1: A 4 140 3
2: B 6 155 5

How to adapt the following code so that any arbitrary self-defined function can work inside?

The data and first piece look like this. I am applying function over pairs of columns (a,b,c) etc.
library(data.table)
d = data.table(time = c(1,1,2,2), a = c(1,2,3,4), b =c(4,3,2,1), c = c(1,1,1,1))
pairs = d[, data.table(t(combn(names(.SD), 2))), by = time]
pairs$i = 1:nrow(pairs) ## index column for reshaping in terms of pairs...
pairs = melt(pairs, id.vars = c('time', 'i'), value.name = 'firm')
d = melt(d, id.vars = 'time', variable.name = 'firm')
d = merge(pairs, d)
The piece I would like to adjust is the following (sum function applied here). Basically, this piece applies function to pair of columns (a-b), (a-c), (b-c) within each group (time 1 and 2).
result = dcast(d, time + i ~ .,
list(pair = \(x) paste(unique(x), collapse = '_'), sum),
value.var = list('firm', 'value'))
Let's say I have an arbitrary function
fun1<- function(x,y, na.rm = FALSE) 1 - 0.5*sum(abs(x-y))
I would apply this fun1 instead of sum in the above piece.

By making fun.aggregate = list we can preserve all the data, and then calculate with it later.
This still might not be what you want, but I think it's progress.
result = dcast(d, time + i ~ .,
list(pair = \(x) paste(unique(x), collapse = '_'), list),
value.var = list('firm', 'value'))
fun1 <- function(x,y, na.rm = FALSE) 1 - 0.5*sum(abs(x-y))
result[, new := sapply(value_list, \(x) fun1(x[1], x[2]) + fun1(x[3], x[4]))]
result
time i firm_pair value_list new
1: 1 1 a_b 1,2,4,3 1.0
2: 1 2 a_c 1,2,1,1 1.5
3: 1 3 b_c 4,3,1,1 1.5
4: 2 4 a_b 3,4,2,1 1.0
5: 2 5 a_c 3,4,1,1 1.5
6: 2 6 b_c 2,1,1,1 1.5

Multiple functions on multiple columns by group, and create informative column names

How to adjust a data table manipulation so that, besides sum per category of several colums,
it would also calculate other functions at the same time such as mean and counts (.N) and automatically create column names: "sum c1" , "sum c2", "sum c4" , "mean c1", " mean c2", "mean c4" and preferably also 1 column "counts"?
My old solution was to write out
mean col1 = ....
mean col2 = ....
Etc, Inside the data.table command
Which worked, but horribly inefficient I think, and it won't work anymore to precode it if in the new app version, the calculations depend on user choices in an R Shiny app what to calculate for which columns.
I've read my way through a bunch of posts and blog articles but haven't quite figured out how to best do this. I read that in some cases the manipulation can become quite slow on large data tables depending on what approach you use (.sdcols, get, lapply, and or by =). Therefore I added a 'sizeable' dummy data set
My real data is around 100k rows by 100 columns and 1-100 groups roughly.
library(data.table)
n = 100000
dt = data.table(index=1:100000,
category = sample(letters[1:25], n, replace = T),
c1=rnorm(n,10000),
c2=rnorm(n,1000),
c3=rnorm(n,100),
c4 = rnorm(n,10)
)
# add more columns to test for big data tables
lapply(c(paste('c', 5:100, sep ='')),
function(addcol) dt[[addcol]] <<- rnorm(n,1000) )
# Simulate columns selected by shiny app user
Colchoice <- c("c1", "c4")
FunChoice <- c(".N", "mean", "sum")
# attempt which now does just one function and doesn't add names
dt[, lapply(.SD, sum, na.rm=TRUE), by=category, .SDcols=Colchoice ]
Expected output is a row per group and a column for each function per each selected column.
Category Mean c1 Sum c1 Mean c4 ...
A
B
C
D
E
......
Possibly a duplicate but I haven't found the exact answer that I need

If I understand correctly, this question consists of two parts:
How to group and aggregate with multiple functions over a list of columns and generate new column names automatically.
How to pass the names of the functions as a character vector.
For part 1, this is nearly a duplicate of Apply multiple functions to multiple columns in data.table but with the additional requirement that the results should be grouped using by =.
Therefore, eddi's answer has to be modified by adding the parameter recursive = FALSE in the call to unlist():
my.summary = function(x) list(N = length(x), mean = mean(x), median = median(x))
dt[, unlist(lapply(.SD, my.summary), recursive = FALSE),
.SDcols = ColChoice, by = category]
category c1.N c1.mean c1.median c4.N c4.mean c4.median
1: f 3974 9999.987 9999.989 3974 9.994220 9.974125
2: w 4033 10000.008 9999.991 4033 10.004261 9.986771
3: n 4025 9999.981 10000.000 4025 10.003686 9.998259
4: x 3975 10000.035 10000.019 3975 10.010448 9.995268
5: k 3957 10000.019 10000.017 3957 9.991886 10.007873
6: j 4027 10000.026 10000.023 4027 10.015663 9.998103
...
For part 2, we need to create my.summary() from a character vector of function names. This can be achieved by "programming on the language", i.e, by assembling an expression as character string and finally parsing and evaluating it:
my.summary <-
sapply(FunChoice, function(f) paste0(f, "(x)")) %>%
paste(collapse = ", ") %>%
sprintf("function(x) setNames(list(%s), FunChoice)", .) %>%
parse(text = .) %>%
eval()
my.summary
function(x) setNames(list(length(x), mean(x), sum(x)), FunChoice)
<environment: 0xe376640>
Alternatively, we can loop over the categories and rbind() the results afterwards:
library(magrittr) # used only to improve readability
lapply(dt[, unique(category)],
function(x) dt[category == x,
c(.(category = x), unlist(lapply(.SD, my.summary))),
.SDcols = ColChoice]) %>%
rbindlist()
Benchmark
So far, 4 data.table and one dplyr solutions have been posted. At least one of the answers claims to be "superfast". So, I wanted to verify by a benchmark with varying number of rows:
library(data.table)
library(magrittr)
bm <- bench::press(
n = 10L^(2:6),
{
set.seed(12212018)
dt <- data.table(
index = 1:n,
category = sample(letters[1:25], n, replace = T),
c1 = rnorm(n, 10000),
c2 = rnorm(n, 1000),
c3 = rnorm(n, 100),
c4 = rnorm(n, 10)
)
# use set() instead of <<- for appending additional columns
for (i in 5:100) set(dt, , paste0("c", i), rnorm(n, 1000))
tables()
ColChoice <- c("c1", "c4")
FunChoice <- c("length", "mean", "sum")
my.summary <- function(x) list(length = length(x), mean = mean(x), sum = sum(x))
bench::mark(
unlist = {
dt[, unlist(lapply(.SD, my.summary), recursive = FALSE),
.SDcols = ColChoice, by = category]
},
loop_category = {
lapply(dt[, unique(category)],
function(x) dt[category == x,
c(.(category = x), unlist(lapply(.SD, my.summary))),
.SDcols = ColChoice]) %>%
rbindlist()
},
dcast = {
dcast(dt, category ~ 1, fun = list(length, mean, sum), value.var = ColChoice)
},
loop_col = {
lapply(ColChoice, function(col)
dt[, setNames(lapply(FunChoice, function(f) get(f)(get(col))),
paste0(col, "_", FunChoice)),
by=category]
) %>%
Reduce(function(x, y) merge(x, y, by="category"), .)
},
dplyr = {
dt %>%
dplyr::group_by(category) %>%
dplyr::summarise_at(dplyr::vars(ColChoice), .funs = setNames(FunChoice, FunChoice))
},
check = function(x, y)
all.equal(setDT(x)[order(category)],
setDT(y)[order(category)] %>%
setnames(stringr::str_replace(names(.), "_", ".")),
ignore.col.order = TRUE,
check.attributes = FALSE
)
)
}
)
The results are easier to compare when plotted:
library(ggplot2)
autoplot(bm)
Please, note the logarithmic time scale.
For this test case, the unlist approach is always the fastest method, followed by dcast. dplyr is catching up for larger problem sizes n. Both lapply/loop approaches are less performant. In particular, Parfait's approach to loop over the columns and merge subresults afterwards seems to be rather sensitive to problem sizes n.
Edit: 2nd benchmark
As suggested by jangorecki, I have repeated the benchmark with much more rows and also with a varying number of groups.
Due to memory limitations, the largest problem size is 10 M rows times 102 columns which takes 7.7 GBytes of memory.
So, the first part of the benchmark code is modified to
bm <- bench::press(
n_grp = 10^(1:3),
n_row = 10L^seq(3, 7, by = 2),
{
set.seed(12212018)
dt <- data.table(
index = 1:n_row,
category = sample(n_grp, n_row, replace = TRUE),
c1 = rnorm(n_row),
c2 = rnorm(n_row),
c3 = rnorm(n_row),
c4 = rnorm(n_row, 10)
)
for (i in 5:100) set(dt, , paste0("c", i), rnorm(n_row, 1000))
tables()
...
As expected by jangorecki, some solutions are more sensitive to the number of groups than others. In particular, performance of loop_category is degrading much stronger with the number of groups while dcast seems to be less affected. For fewer groups, the unlist approach is always faster than dcast while for many groups dcast is faster. However, for larger problem sizes unlist seems to be ahead of dcast.
Edit 2019-03-12: Computing on the language, 3rd benchmark
Inspired by this follow-up question, I have have added a computing on the language approach where the whole expression is created as character string, parsed and evaluated.
The expression is created by
library(magrittr)
ColChoice <- c("c1", "c4")
FunChoice <- c("length", "mean", "sum")
my.expression <- CJ(ColChoice, FunChoice, sorted = FALSE)[
, sprintf("%s.%s = %s(%s)", V1, V2, V2, V1)] %>%
paste(collapse = ", ") %>%
sprintf("dt[, .(%s), by = category]", .) %>%
parse(text = .)
my.expression
expression(dt[, .(c1.length = length(c1), c1.mean = mean(c1), c1.sum = sum(c1),
c4.length = length(c4), c4.mean = mean(c4), c4.sum = sum(c4)), by = category])
This is then evaluated by
eval(my.expression)
which yields
category c1.length c1.mean c1.sum c4.length c4.mean c4.sum
1: f 3974 9999.987 39739947 3974 9.994220 39717.03
2: w 4033 10000.008 40330032 4033 10.004261 40347.19
3: n 4025 9999.981 40249924 4025 10.003686 40264.84
4: x 3975 10000.035 39750141 3975 10.010448 39791.53
5: k 3957 10000.019 39570074 3957 9.991886 39537.89
6: j 4027 10000.026 40270106 4027 10.015663 40333.07
...
I have modified the code of the 2nd benchmark to include this approach but had to reduce the additional columns from 100 to 25 in order to cope with the memory limitations of a much smaller PC. The chart shows that the "eval" approach is almost always the fastest or second:

Here's a data.table answer:
funs_list <- lapply(FunChoice, as.symbol)
dcast(dt, category~1, fun=eval(funs_list), value.var = Colchoice)
It's super fast and does what you want.

Consider building a list of data tables where you iterate through every ColChoice and apply each function of FuncChoice (setting names accordingly). Then, to merge all data tables together, run merge in a Reduce call. Also, use get to retrieve environment objects (functions/columns).
Note: ColChoice was renamed for camel case and length function replaces .N for functional form for count:
set.seed(12212018) # RUN BEFORE data.table() BUILD TO REPRODUCE OUTPUT
...
ColChoice <- c("c1", "c4")
FunChoice <- c("length", "mean", "sum")
output <- lapply(ColChoice, function(col)
dt[, setNames(lapply(FunChoice, function(f) get(f)(get(col))),
paste0(col, "_", FunChoice)),
by=category]
)
final_dt <- Reduce(function(x, y) merge(x, y, by="category"), output)
head(final_dt)
# category c1_length c1_mean c1_sum c4_length c4_mean c4_sum
# 1: a 3893 10000.001 38930003 3893 9.990517 38893.08
# 2: b 4021 10000.028 40210113 4021 9.977178 40118.23
# 3: c 3931 10000.008 39310030 3931 9.996538 39296.39
# 4: d 3954 10000.010 39540038 3954 10.004578 39558.10
# 5: e 4016 9999.998 40159992 4016 10.002131 40168.56
# 6: f 3974 9999.987 39739947 3974 9.994220 39717.03

It seems that there's not a straightforward answer using data.table since noone has answered this yet. So I'll propose a dplyr-based answer that should do what you want. I use the built-in iris data set for the example:
library(dplyr)
iris %>%
group_by(Species) %>%
summarise_at(vars(Sepal.Length, Sepal.Width), .funs = c(sum=sum,mean= mean), na.rm=TRUE)
## A tibble: 3 x 5
# Species Sepal.Length_sum Sepal.Width_sum Sepal.Length_mean Sepal.Width_mean
# <fct> <dbl> <dbl> <dbl> <dbl>
#1 setosa 245. 171. 5.00 3.43
#2 versicolor 297. 138. 5.94 2.77
#3 virginica 323. 149. 6.60 2.97
or using character vector input for the columns and functions:
Colchoice <- c("Sepal.Length", "Sepal.Width")
FunChoice <- c("mean", "sum")
iris %>%
group_by(Species) %>%
summarise_at(vars(Colchoice), .funs = setNames(FunChoice, FunChoice), na.rm=TRUE)
## A tibble: 3 x 5
# Species Sepal.Length_mean Sepal.Width_mean Sepal.Length_sum Sepal.Width_sum
# <fct> <dbl> <dbl> <dbl> <dbl>
#1 setosa 5.00 3.43 245. 171.
#2 versicolor 5.94 2.77 297. 138.
#3 virginica 6.60 2.97 323. 149.

If the summary statistics you need to compute are things like mean, .N, and (perhaps) median, which data.table optimizes into c code across the by, you may have faster performance if you convert the table into long form so that you can do the computations in a way that data table can optimize them:
> library(data.table)
> n = 100000
> dt = data.table(index=1:100000,
category = sample(letters[1:25], n, replace = T),
c1=rnorm(n,10000),
c2=rnorm(n,1000),
c3=rnorm(n,100),
c4 = rnorm(n,10)
)
> {lapply(c(paste('c', 5:100, sep ='')), function(addcol) dt[[addcol]] <<- rnorm(n,1000) ); dt}
> Colchoice <- c("c1", "c4")
> dt[, .SD
][, c('index', 'category', Colchoice), with=F
][, melt(.SD, id.vars=c('index', 'category'))
][, mean := mean(value), .(category, variable)
][, median := median(value), .(category, variable)
][, N := .N, .(category, variable)
][, value := NULL
][, index := NULL
][, unique(.SD)
][, dcast(.SD, category ~ variable, value.var=c('mean', 'median', 'N')
]
category mean_c1 mean_c4 median_c1 median_c4 N_c1 N_c4
1: a 10000 10.021 10000 10.041 4128 4128
2: b 10000 10.012 10000 10.003 3942 3942
3: c 10000 10.005 10000 9.999 3926 3926
4: d 10000 10.002 10000 10.007 4046 4046
5: e 10000 9.974 10000 9.993 4037 4037
6: f 10000 10.025 10000 10.015 4009 4009
7: g 10000 9.994 10000 9.998 4012 4012
8: h 10000 10.007 10000 9.986 3950 3950
...

R aggregation with sum function gives the value 0 when it sums NA

I have a problem with several aggregation function in R.
I have this dataset :
df = data.table(x = rep(letters[1:3], each = 4), y = seq(15,26))
# I put NA value for "b" levels
df[5:8,2] = NA
For the "b" level of x, I only have NA values for y column and the result of aggregation replaces the sum of (NA+NA+...+NA) by 0 instead of NA (what I want...)
agg1 = ddply(df, .(x), summarize, agg = sum(y, na.rm = TRUE))
agg2 = aggregate(y ~ x, data = df, sum, na.action=na.pass, na.rm=TRUE)
agg3 = df[, agg := sum(y, na.rm = TRUE), by = list(x)]
I can't remove the NA value of my dataset to overpass this problem because of the complexity of my real dataset.

all you need is to remove the na.rm=T
agg2<- aggregate(y ~ x, data = df, sum, na.action=na.pass)
and this is the result :
> agg2
x y
1 a 66
2 b NA
3 c 98

Mean row by imbricated levels of factors

I have the following dataframe:
df = data.frame(id=c("A","A","A","A","B","B","B","B","C","C","C","C","D","D","D","D"),
sub=rep(c(1:4),4),
acc1=runif(16,0,3),
acc2=runif(16,0,3),
acc3=runif(16,0,3),
acc4=runif(16,0,3))
What I want is to obtain the mean rows for each ID, which is to say I want to obtain the mean acc1, acc2, acc3 and acc4 for each level A, B, C and D by averaging the values for each sub (4 levels for each id), which would give something like this in the end (with the NAs replaced by the means I want of course):
dfavg = data.frame(id=c("A","B","C","D"),meanacc1=NA,meanacc2=NA,meanacc3=NA,meanacc4=NA)
Thanks in advance!

Try:
You can use any of the specialized packages dplyr or data.table or using base R. Because you have a lot of columns that starts with acc to get the mean of, I choose dplyr. Here, the idea is to first group the variable by id and then use summarise_each to get the mean of each column by id that starts_with acc
library(dplyr)
df1 <- df %>%
group_by(id) %>%
summarise_each(funs(mean=mean(., na.rm=TRUE)), starts_with("acc")) %>%
rename(meanacc1=acc1, meanacc2=acc2, meanacc3=acc3, meanacc4=acc4) #this works but it requires more typing.
I would rename using paste
# colnames(df1)[-1] <- paste0("mean", colnames(df1)[-1])
gives the result
# id meanacc1 meanacc2 meanacc3 meanacc4
#1 A 1.7061929 2.401601 2.057538 1.643627
#2 B 1.7172095 1.405389 2.132378 1.769410
#3 C 1.4424233 1.737187 1.998414 1.137112
#4 D 0.5468509 1.281781 1.790294 1.429353
Or using data.table
library(data.table)
nm1 <- paste0("acc", 1:4) #names of columns to do the `means`
dt1 <- setDT(df)[, lapply(.SD, mean, na.rm=TRUE), by=id, .SDcols=nm1]
Here.SD implies Subset of Data.table, .SDcols are the columns to which we apply the mean operation.
setnames(dt1, 2:5, paste0("mean", nm1)) #change the names of the concerned columns in the result
dt1

(This must have been asked at least 20 times.) The `aggregate function applies the same function (given as the third argument) to all the columns of its first argument within groups defined by its second argument:
aggregate(df[-(1:2)], df[1],mean)
If you want to append the letters "mean" to the column names:
names(df2) <- paste0("mean", names(df2)
If you had wanted to do the column selection automatically then grep or grepl would work:
aggregate(df[ grepl("acc", names(df) )], df[1], mean)

Here are a couple of other base R options:
split + vapply (since we know vapply would simplify to a matrix whenever possible)
t(vapply(split(df[-c(1, 2)], df[, 1]), colMeans, numeric(4L)))
by (with a do.call(rbind, ...) to get the final structure)
do.call(rbind, by(data = df[-c(1, 2)], INDICES = df[[1]], FUN = colMeans))
Both will give you something like this as your result:
# acc1 acc2 acc3 acc4
# A 1.337496 2.091926 1.978835 1.799669
# B 1.287303 1.447884 1.297933 1.312325
# C 1.870008 1.145385 1.768011 1.252027
# D 1.682446 1.413716 1.582506 1.274925
The sample data used here was (with set.seed, for reproducibility):
set.seed(1)
df = data.frame(id = rep(LETTERS[1:4], 4),
sub = rep(c(1:4), 4),
acc1 = runif(16, 0, 3),
acc2 = runif(16, 0, 3),
acc3 = runif(16, 0, 3),
acc4 = runif(16, 0, 3))
Scaling up to 1M rows, these both perform quite well (though obviously not as fast as "dplyr" or "data.table").

You can do this in base package itself using this:
a <- list();
for (i in 1:nlevels(df$id))
{
a[[i]] = colMeans(subset(df, id==levels(df$id)[i])[,c(3,4,5,6)]) ##select columns of df of which you want to compute the means. In your example, 3, 4, 5 and 6 are the columns
}
meanDF <- cbind(data.frame(levels(df$id)), data.frame(matrix(unlist(a), nrow=4, ncol=4, byrow=T)))
colnames(meanDF) = c("id", "meanacc1", "meanacc2", "meanacc3", "meanacc4")
meanDF
id meanacc1 meanacc2 meanacc3 meanacc4
A 1.464635 1.645898 1.7461862 1.026917
B 1.807555 1.097313 1.7135346 1.517892
C 1.350708 1.922609 0.8068907 1.607274
D 1.458911 0.726527 2.4643733 2.141865

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Average columns in a dataframe by groups - r

Related

pivot lapply data.table

How to adapt the following code so that any arbitrary self-defined function can work inside?

Multiple functions on multiple columns by group, and create informative column names

R aggregation with sum function gives the value 0 when it sums NA

Mean row by imbricated levels of factors

Categories

Resources