Using aggregate functions on multiple columns at once in R

Let's say I have a data set that has multiple rows and columns, and I want to record the min, max and mean for each column and store this data in its own table. How do I loop through the data frame in such a way that I can find this data for each column?
Edit: My initial data is stored in a tbl that looks like this Initial Data and I want the output to look like this Output Data

Take a look at package dplyr, which will make this task more straightforward!
Here's an approach that just uses dplyr. The format isn't exactly what's in Output Data...
> df <- data.frame(A=c(7,2,4), B=c(5,4,6), C=c(7,9,1)) # Your Initial Data
> library(dplyr)
> df %>% summarise_all(.funs=funs(mean, min, max)) ## Approach 1: just dplyr
A_mean B_mean C_mean A_min B_min C_min A_max B_max C_max
1 4.333333 5 5.666667 2 4 1 7 6 9
Alternatively, if you also use package tidyr, you can get exactly the format you wanted for your output data:
> library(tidyr)
> df %>%
+ gather(Column, Value) %>% ## Converts dataframe from wide to long format
+ group_by(Column) %>% ## Groups by the new column containing old column names
+ summarise(Max=max(Value), Min=min(Value), Mean=mean(Value)) ## The summary functions
# A tibble: 3 x 4
Column Max Min Mean
<chr> <dbl> <dbl> <dbl>
1 A 7.00 2.00 4.33
2 B 6.00 4.00 5.00
3 C 9.00 1.00 5.67
One advantage of using these packages is that they may be more efficient than an explicit loop, especially if df is large.
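In newer releases, funs() is deprecated, summarise_all() is superseded by across(), and gather() is superseded by pivot_longer(). A minimal sketch of the same two approaches, assuming dplyr >= 1.0.0 and tidyr >= 1.0.0 (the names wide_summary and long_summary are only illustrative):

```r
library(dplyr)
library(tidyr)

df <- data.frame(A = c(7, 2, 4), B = c(5, 4, 6), C = c(7, 9, 1))

# Wide result: one column per (variable, statistic) pair, named e.g. A_mean
wide_summary <- df %>%
  summarise(across(everything(), list(mean = mean, min = min, max = max)))

# Long result: one row per original column
long_summary <- df %>%
  pivot_longer(everything(), names_to = "Column", values_to = "Value") %>%
  group_by(Column) %>%
  summarise(Max = max(Value), Min = min(Value), Mean = mean(Value))
```

Note that across() orders the wide columns by variable (A_mean, A_min, A_max, B_mean, ...), not by statistic as in the output shown above.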

I suggest you work with long tables instead of wide ones. While the latter are easier on the human eye, the former are easier to manipulate for data analysis. That said, I think you could use the data.table package to achieve this:
# create a data frame
df <- data.frame(A=c(7,2,4), B=c(5,4,6), C=c(7,9,1))
# load data.table package
library(data.table)
# convert df to a data.table
setDT(df)
#Explanation of the following code:
# melt: turns your wide table into a long one
# .(val_mean ...) calculate and give names to calculated variables
# by = ... : group by variable. See data.table vignette
melt(df, measure.vars = names(df))[, .(val_mean = mean(value),
val_min = min(value),
val_max = max(value)),
by = variable]
which produces:
variable val_mean val_min val_max
1: A 4.333333 2 7
2: B 5.000000 4 6
3: C 5.666667 1 9
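For comparison, the same per-column table can be built without any extra packages. This is only a sketch using sapply(); the names stats and stats_df are introduced here for illustration:

```r
df <- data.frame(A = c(7, 2, 4), B = c(5, 4, 6), C = c(7, 9, 1))

# sapply() returns a matrix with one column per variable and one row per
# statistic; t() flips it so that each variable becomes a row
stats <- t(sapply(df, function(x) c(mean = mean(x), min = min(x), max = max(x))))

# Optional: move the row names into a regular column
stats_df <- data.frame(variable = rownames(stats), stats)
rownames(stats_df) <- NULL
```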

Related

Going from dplyr to base: create a data frame of the first and last index for each level of a variable

Asking how to go from dplyr to base may be a weird ask, especially since I love the tidyverse. But because I learned the tidyverse first, my grasp of base is far from masterful, and I need a base solution because the package I'm helping to develop doesn't want any tidyverse dependencies.
Data (there are many more columns, but abbreviated for the sake of a reprex):
sample.df <- tibble(batch = rep(c(1,2,3), c(4,5,6)))
Desired base equivalent of:
sample.df %>%
mutate(rowid = row_number()) %>%
group_by(batch) %>%
summarize(idx_b = min(rowid),
idx_e = max(rowid))
# A tibble: 3 x 3
# Groups: batch [3]
batch idx_b idx_e
<dbl> <int> <int>
1 1 1 4
2 2 5 9
3 3 10 15
We create a sequence column in the data, use aggregate to get the range (or min/max), and convert the resulting matrix column to regular data.frame columns with do.call:
out <- do.call(data.frame, aggregate(rowid ~ batch,
transform(sample.df, rowid = seq_len(nrow(sample.df))),
FUN = function(x) c(b = min(x), e = max(x))))
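To see why the do.call step is needed, here is a small sketch that inspects the intermediate result. It uses a plain data.frame so that no tidyverse package is required; the names with_id and agg and the final renaming are only illustrative:

```r
sample.df <- data.frame(batch = rep(c(1, 2, 3), c(4, 5, 6)))
with_id <- transform(sample.df, rowid = seq_len(nrow(sample.df)))

agg <- aggregate(rowid ~ batch, with_id,
                 FUN = function(x) c(b = min(x), e = max(x)))

# aggregate() stores both statistics in ONE matrix column called rowid,
# so agg has only two columns at this point
is.matrix(agg$rowid)
ncol(agg)

# do.call(data.frame, ...) splits the matrix column into separate columns
out <- do.call(data.frame, agg)

# Rename to match the column names requested in the question
names(out) <- c("batch", "idx_b", "idx_e")
```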
Another base R option using unique + ave
unique(
transform(
sample.df,
idx_b = ave(1:nrow(sample.df), batch, FUN = min),
idx_e = ave(1:nrow(sample.df), batch, FUN = max)
)
)
gives
batch idx_b idx_e
1 1 1 4
5 2 5 9
10 3 10 15

using mutate_at from dplyr

I have a data frame with 5 columns and I want to produce 4 additional columns giving me the difference between the last 4 columns and the first column.
I tried the following, but that doesn't work:
library(tidyverse)
df <- as.tibble(data.frame(A = c(1,2), B = c(3,4), C = c(4,5), D = c(2,3), E = c(4,5)))
r_diff <- function(x,y){
z = y - x
return(z)
}
vars_to_process <- c("B","C","D","E")
df %>% mutate_at(.cols=vars_to_process, .funs =r_diff(.,df[,1])) %>% head()
Thanks
Renger
Here's the simplest way to do it.
df %>%
mutate_at(.vars = vars(B:E),
.funs = list(~ . - A))
The .vars argument lets you specify columns in the same way that you would specify columns in select(), provided you put that specification inside the function vars().
The .funs argument accepts an anonymous function defined on the fly inside a call to list(). You can also reference a column in the data frame (in this case A) when defining this anonymous function (see this Stack Overflow question).
In addition, with the release of dplyr 1.0.0, you can now simply do the following:
df %>%
mutate(across(B:E, ~ . - A))
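Both versions above overwrite B to E in place. If the differences should go into additional columns, as the question asks, one possible sketch uses the .names argument of across() (assuming dplyr >= 1.0.0; the "_diff" suffix and the name with_diffs are arbitrary choices made here):

```r
library(dplyr)

df <- tibble(A = c(1, 2), B = c(3, 4), C = c(4, 5), D = c(2, 3), E = c(4, 5))

# .names controls the output column names; "{.col}" is the input column name,
# so B:E are kept and B_diff ... E_diff are appended
with_diffs <- df %>%
  mutate(across(B:E, ~ .x - A, .names = "{.col}_diff"))
```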
Here's a faster solution using base R code. The strategy is to convert to a matrix, take the difference between column one and the required columns, and build the result back into a data frame. Note that this returns only A and the modified columns: any columns not in vars_to_process will not appear in the output, but you didn't have any of those in your test set, so I'll assume they don't exist.
So, always write things in functions whenever possible:
bsr = function(df,vars_to_process){
m = as.matrix(df)
data.frame(
A = m[, 1],
m[, 1] - m[, vars_to_process])}
Make some test data:
> df = data.frame(matrix(runif(5*1000), ncol=5))
> names(df)=LETTERS[1:5]
> dft = as.tibble(df)
> head(dft)
# A tibble: 6 x 5
A B C D E
<dbl> <dbl> <dbl> <dbl> <dbl>
1 0.2609174 0.07857624 0.2727817 0.8498004 0.3403234
2 0.3644744 0.95810657 0.8183856 0.2958133 0.4752349
3 0.6042914 0.98793218 0.7547003 0.9596591 0.5354045
4 0.4000441 0.61403331 0.9018804 0.3838347 0.3266855
5 0.6767012 0.11984219 0.9181570 0.5988404 0.6058629
Compare with the tidyverse version:
akr = function(df,vars_to_process){
df %>% mutate_at(vars_to_process, funs(r_diff(.,df[[1]])))
}
Check bsr and akr agree:
> head(bsr(dft, vars_to_process))
A B C D E
1 0.2609174 0.1823412 -0.01186432 -0.58888295 -0.07940594
2 0.3644744 -0.5936322 -0.45391119 0.06866108 -0.11076050
3 0.6042914 -0.3836408 -0.15040892 -0.35536765 0.06888696
4 0.4000441 -0.2139892 -0.50183635 0.01620939 0.07335861
> head(akr(dft, vars_to_process))
# A tibble: 6 x 5
A B C D E
<dbl> <dbl> <dbl> <dbl> <dbl>
1 0.2609174 0.1823412 -0.01186432 -0.58888295 -0.07940594
2 0.3644744 -0.5936322 -0.45391119 0.06866108 -0.11076050
3 0.6042914 -0.3836408 -0.15040892 -0.35536765 0.06888696
4 0.4000441 -0.2139892 -0.50183635 0.01620939 0.07335861
These agree, except that akr returns a tibble, but never mind. Benchmark:
> microbenchmark(bsr(dft, vars_to_process),akr(dft, vars_to_process))
Unit: microseconds
expr min lq mean median uq
bsr(dft, vars_to_process) 362.117 388.7215 488.9309 446.123 521.776
akr(dft, vars_to_process) 8070.391 8365.4230 9853.5239 8673.692 9335.613
The base R version is roughly 20 times faster (comparing medians). I'd also argue that subtracting a column from another set of columns is tidier than applying a mutator function, but as long as you wrap what you're doing in a function it doesn't matter how messy the guts are.
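Comparing head() output by eye only covers the first few rows. A sketch of a check over every value follows. It assumes dplyr >= 1.0.0, and akr_across is a hypothetical stand-in for akr written with across() rather than the deprecated funs():

```r
library(dplyr)

set.seed(1)
df <- data.frame(matrix(runif(5 * 1000), ncol = 5))
names(df) <- LETTERS[1:5]
vars_to_process <- c("B", "C", "D", "E")

# Base R version from the answer: first column minus each processed column
bsr <- function(df, vars_to_process) {
  m <- as.matrix(df)
  data.frame(A = m[, 1], m[, 1] - m[, vars_to_process])
}

# Hypothetical dplyr counterpart computing the same difference
akr_across <- function(df, vars_to_process) {
  df %>% mutate(across(all_of(vars_to_process), ~ df[[1]] - .x))
}

# Compare all values, ignoring the data.frame vs tibble class difference
same <- isTRUE(all.equal(as.matrix(bsr(df, vars_to_process)),
                         as.matrix(akr_across(df, vars_to_process)),
                         check.attributes = FALSE))
```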
We need to subset the column with [[, as [ still returns a data.frame:
df %>%
mutate_at(vars_to_process, funs(r_diff(.,df[[1]])))
# A tibble: 2 x 5
# A B C D E
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 -2 -3 -1 -3
#2 2 -2 -3 -1 -3

Divide column of data by mean of the group

If I have a data frame, such as:
group=rep(1:4,each=10)
data=c(seq(1,10,1),seq(5,50,5),seq(20,11,-1),seq(0.3,3,0.3))
DF=data.frame(group,data)
Now, I would like to divide each data element by the mean of its group. For example:
group=rep(1:4,each=10)
data=c(seq(1,10,1),seq(5,50,5),seq(20,11,-1),seq(0.3,3,0.3))
DF=data.frame(group,data)
aggregate(DF,by=list(DF$group),FUN=mean)
#Group.1 group data
#1 1 1 5.50
#2 2 2 27.50
#3 3 3 15.50
#4 4 4 1.65
data1=c(seq(1,10,1)/5.5,seq(5,50,5)/27.5,seq(20,11,-1)/15.5,seq(0.3,3,0.3)/1.65)
DF1=data.frame(group, data1)
However, this is a bit convoluted, and would not work easily in a large dataset. I feel like there is an apply application which could be used here, but I cannot find a nice way to do it.
Here's the usual set of options (thanks to @G.Grothendieck for simplification of ave):
# base R
DF$newdata = ave(DF$data, DF$group, FUN = function(x) x/mean(x))
# or...
DF$newdata = DF$data / ave(DF$data, DF$group)
# dplyr
library(dplyr)
DF %>% group_by(group) %>% mutate(newdata = data/mean(data))
# data.table
library(data.table)
setDT(DF)[, newdata := data/mean(data), by=group]
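As a quick sanity check on any of these options, the rescaled values should average to 1 within every group. A base R sketch using the ave() variant (ave() defaults to FUN = mean and returns the group mean aligned with each row; group_means is a name introduced here):

```r
group <- rep(1:4, each = 10)
data <- c(seq(1, 10, 1), seq(5, 50, 5), seq(20, 11, -1), seq(0.3, 3, 0.3))
DF <- data.frame(group, data)

# Divide each value by the mean of its own group
DF$newdata <- DF$data / ave(DF$data, DF$group)

# Mean of the rescaled values per group; each should be 1 up to rounding
group_means <- tapply(DF$newdata, DF$group, mean)
```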

R sum of rows for different group of columns that start with similar string

I'm quite new to R and this is the first time I dare to ask a question here.
I'm working with a dataset with Likert scales, and I want to compute row sums over different groups of columns that share the first part of their names.
Below I constructed a data frame of only 2 rows to illustrate the approach I followed, though I would like to receive feedback on how I can write a more efficient way of doing it.
df <- as.data.frame(rbind(rep(sample(1:5),4),rep(sample(1:5),4)))
var.names <- c("emp_1","emp_2","emp_3","emp_4","sat_1","sat_2"
,"sat_3","res_1","res_2","res_3","res_4","com_1",
"com_2","com_3","com_4","com_5","cap_1","cap_2",
"cap_3","cap_4")
names(df) <- var.names
So, what I did, was to use the grep function in order to be able to sum the rows of the specified variables that started with certain strings and store them in a new variable. But I have to write a new line of code for each variable.
df$emp_t <- rowSums(df[, grep("\\bemp.", names(df))])
df$sat_t <- rowSums(df[, grep("\\bsat.", names(df))])
df$res_t <- rowSums(df[, grep("\\bres.", names(df))])
df$com_t <- rowSums(df[, grep("\\bcom.", names(df))])
df$cap_t <- rowSums(df[, grep("\\bcap.", names(df))])
But there are many more variables in the dataset, and I would like to know if there is a way to do this with only one line of code. For example, some way to group the variables that start with the same strings together and then apply the row function.
Thanks in advance!
One possible solution is to transpose df and calculate the sums for the correct columns using the base R rowsum function (using set.seed(123)):
cbind(df, t(rowsum(t(df), sub("_.*", "_t", names(df)))))
# emp_1 emp_2 emp_3 emp_4 sat_1 sat_2 sat_3 res_1 res_2 res_3 res_4 com_1 com_2 com_3 com_4 com_5 cap_1 cap_2 cap_3 cap_4 cap_t
# 1 2 4 5 3 1 2 4 5 3 1 2 4 5 3 1 2 4 5 3 1 13
# 2 1 3 4 2 5 1 3 4 2 5 1 3 4 2 5 1 3 4 2 5 14
# com_t emp_t res_t sat_t
# 1 15 14 11 7
# 2 15 10 12 9
Agree with MrFlick that you may want to put your data in long format (see reshape2, tidyr), but to answer your question:
cbind(
df,
sapply(split.default(df, sub("_.*$", "_t", names(df))), rowSums)
)
will do the trick.
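To make the intermediate steps visible, here is the same idea unrolled into named pieces (a sketch; groups, pieces and totals are names introduced here). split.default() splits the data frame by column rather than by row, returning one sub-data-frame per prefix:

```r
set.seed(123)
df <- as.data.frame(rbind(rep(sample(1:5), 4), rep(sample(1:5), 4)))
names(df) <- c(paste0("emp_", 1:4), paste0("sat_", 1:3), paste0("res_", 1:4),
               paste0("com_", 1:5), paste0("cap_", 1:4))

# Map every column name to its group label, e.g. "emp_3" becomes "emp_t"
groups <- sub("_.*$", "_t", names(df))

# List of data frames, one per label, each holding that group's columns
pieces <- split.default(df, groups)

# Row sums within each piece; the result is a matrix with one column per label
totals <- sapply(pieces, rowSums)

result <- cbind(df, totals)
```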
You'll be better off in the long run if you put your data into tidy format. The problem is that the data is in a wide rather than a long format. And the variable names, e.g., emp_1, are actually two separate pieces of data: the class of the person, and the person's ID number (or something like that). Here is a solution to your problem with dplyr and tidyr.
library(dplyr)
library(tidyr)
df %>%
gather(key, value) %>%
extract(key, c("class", "id"), "([[:alnum:]]+)_([[:alnum:]]+)") %>%
group_by(class) %>%
summarize(class_sum = sum(value))
First we convert the data frame from wide to long format with gather(). Then we split the values emp_1 into separate columns class and id with extract(). Finally we group by the class and sum the values in each class. Result:
Source: local data frame [5 x 2]
class class_sum
1 cap 26
2 com 30
3 emp 23
4 res 22
5 sat 19
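Note that the pipeline above sums over all rows, so it gives one total per class rather than one per respondent. If per-row totals are wanted, one possible sketch (assuming dplyr >= 1.0.0 and tidyr >= 1.0.0; the row column and the name row_totals are introduced here) keeps a row identifier through the reshape:

```r
library(dplyr)
library(tidyr)

set.seed(123)
df <- as.data.frame(rbind(rep(sample(1:5), 4), rep(sample(1:5), 4)))
names(df) <- c(paste0("emp_", 1:4), paste0("sat_", 1:3), paste0("res_", 1:4),
               paste0("com_", 1:5), paste0("cap_", 1:4))

row_totals <- df %>%
  mutate(row = row_number()) %>%            # remember which row each value came from
  pivot_longer(-row,
               names_to = c("class", "id"), # split "emp_1" into class and id
               names_sep = "_") %>%
  group_by(row, class) %>%
  summarise(class_sum = sum(value), .groups = "drop")
```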
Another potential solution is to use dplyr's rowwise() function (see https://www.tidyverse.org/blog/2020/04/dplyr-1-0-0-rowwise/):
df %>%
rowwise() %>%
mutate(emp_sum = sum(c_across(starts_with("emp"))),
sat_sum = sum(c_across(starts_with("sat"))),
res_sum = sum(c_across(starts_with("res"))),
com_sum = sum(c_across(starts_with("com"))),
cap_sum = sum(c_across(starts_with("cap"))))

Applying multiple functions to each column in a data frame using aggregate

When I need to apply multiple functions to multiple columns sequentially, aggregate by multiple columns, and have the results bound into a data frame, I usually use aggregate() in the following manner:
# bogus functions
foo1 <- function(x){mean(x)*var(x)}
foo2 <- function(x){mean(x)/var(x)}
# for illustration purposes only
npk$block <- as.numeric(npk$block)
subdf <- aggregate(npk[,c("yield", "block")],
by = list(N = npk$N, P = npk$P),
FUN = function(x){c(col1 = foo1(x), col2 = foo2(x))})
Having the results in a nicely ordered data frame is achieved by using:
df <- do.call(data.frame, subdf)
Can I avoid the call to do.call() by somehow using aggregate() smarter in this scenario or shorten the whole process by using another base R solution from the start?
As @akrun suggested, dplyr's summarise_each is well-suited to the task.
library(dplyr)
npk %>%
group_by(N, P) %>%
summarise_each(funs(foo1, foo2), yield, block)
# Source: local data frame [4 x 6]
# Groups: N
#
# N P yield_foo2 block_foo2 yield_foo1 block_foo1
# 1 0 0 2.432390 1 1099.583 12.25
# 2 0 1 1.245831 1 2205.361 12.25
# 3 1 0 1.399998 1 2504.727 12.25
# 4 1 1 2.172399 1 1451.309 12.25
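summarise_each() and funs() are deprecated in current dplyr. A sketch of the same summary with across(), assuming dplyr >= 1.0.0 (npk2 and res are names introduced here so that the built-in npk data set is not modified):

```r
library(dplyr)

# bogus functions from the question
foo1 <- function(x) mean(x) * var(x)
foo2 <- function(x) mean(x) / var(x)

npk2 <- npk                               # npk ships with base R (datasets)
npk2$block <- as.numeric(npk2$block)      # for illustration purposes only

res <- npk2 %>%
  group_by(N, P) %>%
  summarise(across(c(yield, block), list(foo1 = foo1, foo2 = foo2)),
            .groups = "drop")
```

The result is already a flat data frame with columns named yield_foo1, yield_foo2, block_foo1 and block_foo2, so no do.call() step is needed.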
You can use
df=data.frame(as.list(aggregate(...
