df<-data.frame(gender = c('A', 'B', 'B','B','A'),q01 = c(1, 6, 3,8,5),q02 = c(5, 3, 6,5,2))
gender q01 q02
1 A 1 5
2 B 6 3
3 B 3 6
4 B 8 5
5 A 5 2
I want to calculate q01*2+q02 and then get the mean by gender group. The expected result is as follows:
A 9.5
B 16
I tried but failed:
df %>% aggregate(c(q01,q02)~gender,mean(q01*2+q02))
Error in mean(q01 * 2 + q02) : object 'q01' not found
df %>% group_by(gender) %>% mean(.$q01*2+.$q02)
[1] NA
Warning message:
In mean.default(., .$q01 * 2 + .$q02) :
argument is not numeric or logical: returning NA
What's the problem?
In the OP's code that pipes df into aggregate, the data argument is never specified, and the two columns are concatenated with c(). Even with the data supplied, the call still fails:
aggregate(c(q01,q02)~gender,df, mean(q01*2+q02))
Error in model.frame.default(formula = c(q01, q02) ~ gender, data =
df) : variable lengths differ (found for 'gender')
Here, c(q01, q02) concatenates the two columns, much like c(1:5, 6:10), so the result is twice as long as gender. On top of that, the expression passed as FUN is evaluated outside the data frame, so it would not find 'q01' or 'q02'.
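To see the length mismatch concretely, here is a minimal base R sketch on the question's data. The names `stacked` and `err` are only illustrative and do not appear in the original post.

```r
df <- data.frame(gender = c("A", "B", "B", "B", "A"),
                 q01 = c(1, 6, 3, 8, 5),
                 q02 = c(5, 3, 6, 5, 2))

# c() joins the two columns end to end, giving 10 values for 5 rows of gender
stacked <- c(df$q01, df$q02)
length(stacked)
nrow(df)

# The formula interface then refuses to build a model frame from
# a 10-element response and a 5-element grouping variable
err <- tryCatch(
  aggregate(c(q01, q02) ~ gender, data = df, FUN = mean),
  error = function(e) e
)
conditionMessage(err)
```

Any left-hand side that yields one value per row avoids this problem.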
Instead, we can cbind to create new column with the formula method of aggregate and then get the mean
library(dplyr)
df %>%
aggregate(cbind(q = q01 * 2 + q02) ~ gender, data = ., mean)
# gender q
#1 A 9.5
#2 B 16.0
NOTE: In dplyr, the data from the lhs of %>% can be referred to explicitly with a dot (.).
NOTE2: Here, we assume that the question is to understand how the aggregate can be made to work in the %>%. If it is just to get the mean, the whole process can be done with dplyr
f1 <- function(x, y, val) mean(x * val + y)
df %>%
group_by(gender) %>%
summarise(q = f1(q01, q02, 2))
Or using data.table methods
library(data.table)
setDT(df)[, .(q = mean(q01 * 2 + q02)), .(gender)]
# gender q
#1: A 9.5
#2: B 16.0
Or using base R with by
stack(by(df[-1], df[1], FUN = function(x) mean(x[,1] * 2 + x[,2])))
Or with aggregate
aggregate(cbind(q = q01 * 2 + q02) ~ gender, df, mean)
Better to keep dplyr and base approaches separate. Each of them has its own way of handling data. With dplyr you can do
library(dplyr)
df %>%
mutate(q = q01 * 2 + q02) %>%
group_by(gender) %>%
summarise(q = mean(q))
# gender q
# <fct> <dbl>
#1 A 9.5
#2 B 16
and using base R aggregate
aggregate(q~gender, transform(df, q = q01*2+q02), mean)
Sticking with the same logic:
df %>%
do(aggregate(I(q01*2)+q02~gender,
data=.,mean)) %>%
setNames(.,nm=c("gender","q"))
gender q
1 A 9.5
2 B 16.0
NOTE:
I do note that do's lifecycle is marked as questioning.
Say I have a data frame:
df <- data.frame(a = 1:10,
b = 1:10,
c = 1:10)
I'd like to apply several summary functions to each column, so I use dplyr::summarise_all
library(dplyr)
df %>% summarise_all(.funs = c(mean, sum))
# a_fn1 b_fn1 c_fn1 a_fn2 b_fn2 c_fn2
# 1 5.5 5.5 5.5 55 55 55
This works great! Now, say I have a function that takes an extra parameter. For example, this function calculates the number of elements in a column above a threshold. (Note: this is a toy example and not the real function.)
n_above_threshold <- function(x, threshold) sum(x > threshold)
So, the function works like this:
n_above_threshold(1:10, 5)
#[1] 5
I can apply it to all columns like before, but this time passing the additional parameter, like so:
df %>% summarise_all(.funs = c(mean, n_above_threshold), threshold = 5)
# a_fn1 b_fn1 c_fn1 a_fn2 b_fn2 c_fn2
# 1 5.5 5.5 5.5 5 5 5
But, say I have a vector of thresholds where each element corresponds to a column. Say, c(1, 5, 7) for my example above. Of course, I can't simply do this, as it doesn't make any sense:
df %>% summarise_all(.funs = c(mean, n_above_threshold), threshold = c(1, 5, 7))
If I was using base R, I might do this:
mapply(n_above_threshold, df, c(1, 5, 7))
# a b c
# 9 5 3
Is there a way of getting this result as part of a dplyr piped workflow like I was using for the simpler cases?
dplyr provides a bunch of context-dependent functions. One is cur_column(). You can use it in summarise to look up the threshold for a given column.
library("tidyverse")
df <- data.frame(
a = 1:10,
b = 1:10,
c = 1:10
)
n_above_threshold <- function(x, threshold) sum(x > threshold)
# Pair the parameters with the columns
thresholds <- c(1, 5, 7)
names(thresholds) <- colnames(df)
df %>%
summarise(
across(
everything(),
# Use `cur_column()` to access each column name in turn
list(count = ~ n_above_threshold(.x, thresholds[cur_column()]),
mean = mean)
)
)
#> a_count a_mean b_count b_mean c_count c_mean
#> 1 9 5.5 5 5.5 3 5.5
This returns NA silently if the current column name doesn't have a known threshold. This is something that you might or might not want to happen.
df %>%
# Add extra column to show what happens if we don't know the threshold for a column
mutate(
x = 1:10
) %>%
summarise(
across(
everything(),
# Use `cur_column()` to access each column name in turn
list(count = ~ n_above_threshold(.x, thresholds[cur_column()]),
mean = mean)
)
)
#> a_count a_mean b_count b_mean c_count c_mean x_count x_mean
#> 1 9 5.5 5 5.5 3 5.5 NA 5.5
Created on 2022-03-11 by the reprex package (v2.0.1)
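If the silent NA is not wanted, one possible variation (a sketch, not part of the answer above) is to restrict across() to the columns that actually have a threshold and to look the threshold up with `[[`, which raises an error for an unknown name instead of returning NA. The result name `counts` is only illustrative.

```r
library(dplyr)

df <- data.frame(a = 1:10, b = 1:10, c = 1:10, x = 1:10)
n_above_threshold <- function(x, threshold) sum(x > threshold)

# Named thresholds; there is deliberately no entry for column `x`
thresholds <- c(a = 1, b = 5, c = 7)

counts <- df %>%
  summarise(
    across(
      # Only visit columns that have a known threshold
      all_of(names(thresholds)),
      # `[[` fails loudly on a missing name, unlike `[`
      ~ n_above_threshold(.x, thresholds[[cur_column()]]),
      .names = "{.col}_count"
    )
  )
counts
```

Column `x` is simply left out of the summary here; whether skipping it or failing is preferable depends on the use case.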
I would like to get the mean of a variable according to the group it belongs to.
Here is a reproducible example.
gender <- c("M","F","M","F")
vec1 <- c(1:4)
vec2 <- c(10:13)
df <- data.frame(vec1,vec2,gender)
variables <- names(df)
variables <- variables[-3]
#Wished result
mean1 <- c(mean(c(1,3)),mean(c(2,4)))
mean2 <- c(mean(c(10,12)),mean(c(11,13)))
gender <- c("M","F")
result <- data.frame(gender,mean1,mean2)
How can I achieve such a result? I would like to use the vector variables, which contains the names of the variables to be summarized, instead of writing out each variable, as my dataset is quite big.
A dplyr solution
library(dplyr)
df %>% group_by(gender) %>% summarise(across(all_of(variables), list(mean = mean), .names = "{.fn}_{.col}"))
Output
# A tibble: 2 x 3
gender mean_vec1 mean_vec2
<chr> <dbl> <dbl>
1 F 3 12
2 M 2 11
Use library dplyr
library(dplyr)
gender <- c("M","F","M","F")
df <- data.frame(1:4,gender)
df %>%
group_by(gender) %>%
summarise(mean = X1.4 %>% mean())
Using aggregate.
## formula notation
aggregate(cbind(vec1, vec2) ~ gender, df, FUN=mean)
# gender vec1 vec2
# 1 F 3 12
# 2 M 2 11
## list notation
with(df, aggregate(list(mean=cbind(vec1, vec2)), list(gender=gender), mean))
# gender mean.vec1 mean.vec2
# 1 F 3 12
# 2 M 2 11
If you get an error in the formula notation, it is because you have named another object mean. Use rm(mean) in this case.
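The aggregate calls above spell out vec1 and vec2 by hand. Since the question asks to drive the summary from the character vector variables, here is a base R sketch that selects the columns by name instead. The data frame is rebuilt inline so the example is self-contained, and `res` is only an illustrative name.

```r
df <- data.frame(vec1 = 1:4, vec2 = 10:13, gender = c("M", "F", "M", "F"))

# Names of the columns to summarise, as in the question
variables <- c("vec1", "vec2")

# Select the columns by name and group by gender
res <- aggregate(df[variables], by = list(gender = df$gender), FUN = mean)
res
```

The groups come back sorted, so F is listed before M.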
I have a data frame with as follows
lower <- c(1,5,15)
upper <-c(5,15,30)
df<-data.frame(lower,upper)
I would like to use dplyr's mutate to create a new variable of the area under the curve of a defined function. The function is as follows.
my_fun <- function(x){y = 1.205016 + 0.03796243 * log(x)}
I am using the integral() function from the pracma package to find the area under the curve. When I use this function on a single pair of lower and upper values, it runs with no errors:
integral(my_fun, 1,5)
[1] 4.973705
However, when I try to run this same function using dplyr's mutate, I get the following.
new_df <- df %>%
mutate(new_variable = integral(my_fun, lower, upper))
Error in integral(my_fun, lower, upper) : length(xmin) == 1 is not
TRUE
It seems that the integral function is reading the whole vectors df$lower and df$upper rather than individual pairs of values such as 1, 5. Is there a solution to this using dplyr's mutate, or should I be looking for other solutions?
I did some looking around and the only instances of this error related to mutate did not seem to address the issue I have here.
We could use rowwise
library(dplyr)
library(pracma)
df %>%
rowwise %>%
mutate(new_variable = integral(my_fun, lower, upper))
-output
# A tibble: 3 x 3
# Rowwise:
# lower upper new_variable
# <dbl> <dbl> <dbl>
#1 1 5 4.97
#2 5 15 12.9
#3 15 30 19.8
Or with map2
library(purrr)
df %>%
mutate(new_variable = map2_dbl(lower, upper, ~integral(my_fun, .x, .y)))
-output
# lower upper new_variable
#1 1 5 4.973705
#2 5 15 12.907107
#3 15 30 19.837273
Or using pmap
df %>%
mutate(new_variable = pmap_dbl(cur_data(), ~ integral(my_fun, ..1, ..2)))
# lower upper new_variable
#1 1 5 4.973705
#2 5 15 12.907107
#3 15 30 19.837273
Or using base R
df$new_variable <- unlist(Map(function(x, y)
integral(my_fun, x, y), df$lower, df$upper))
Or using apply from base R
apply(df, 1, function(x) integral(my_fun, x[1], x[2]))
#[1] 4.973705 12.907107 19.837273
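The same pairwise idea can be sketched without any extra packages. Note the assumption: this swaps in base R's integrate() in place of pracma::integral(), and `area_one` is a hypothetical helper name. The integrand is also written so that it returns its value directly.

```r
my_fun <- function(x) 1.205016 + 0.03796243 * log(x)
df <- data.frame(lower = c(1, 5, 15), upper = c(5, 15, 30))

# integrate() also expects scalar limits, so wrap a single call ...
area_one <- function(lo, hi) integrate(my_fun, lo, hi)$value

# ... and let mapply() walk over the (lower, upper) pairs
df$new_variable <- mapply(area_one, df$lower, df$upper)
df
```

Because the two routines use different quadrature rules, the results should agree with the pracma output only up to numerical tolerance.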
I have results from a within-participants design with timeseries info about each trial. I want to reshuffle the conditions for some permutation testing. I need to write a function though and this is where I run into problems.
My data looks something like this:
library(tidyverse)
sampel <- expand.grid(s0 = 1:5, r0 = 1:12)
sampel <- sampel %>% mutate(c0 = rep(c('A', 'B'), 30))
sampel <- sampel %>%
group_by(s0, c0, r0) %>%
nest() %>%
mutate(t0 = map(data, function(t) seq(1:8)), v0 = map(data, function(v) seq(from = 0, by = runif(1), length.out = 8))) %>%
unnest(cols = c(data, t0, v0)) %>%
ungroup() %>%
mutate(s0 = paste('s', s0, sep = ''))
head(sampel, n = 12)
(if you have any pointers how I could go about displaying this example in a better way I would much appreciate it too)
So to add some context, it's results of a within-subjects study. s0 stands for participant, c0 for condition, r0 for trial number (run). t0 is a timepoint and v0 a value of interest at this timepoint.
I am trying to reshuffle conditions within-participants
resampledSampel <-
sampel %>%
group_by(s0, r0, c0) %>%
nest() %>%
group_by(s0) %>%
mutate(c1 = c0[sample(row_number())])
resampledSampel %>%
head(n = 12)
This works as hoped, but when I try to make a function:
resample_within <- function(df, subject, trial, condition) {
subject <- enquo(subject)
trial <- enquo(trial)
condition <- enquo(condition)
resampled <-
df %>%
group_by(!!subject, !!trial, !!condition) %>%
nest() %>%
group_by(!!subject) %>%
mutate(condition = !!condition[sample(row_number())]) %>%
unnest(data)
return(resampled)
}
resample_within(sampel, s0, r0, c0)
throws an error:
Error: row_number() should only be called in a data context
Run `rlang::last_error()` to see where the error occurred.
In addition: Warning message:
Subsetting quosures with `[` is deprecated as of rlang 0.4.0
Please use `quo_get_expr()` instead.
This warning is displayed once per session.
Any idea how I can use mutate(condition = !!condition[sample(row_number())]) in the function? Or how I could do all this without dplyr (it makes me realise that I probably rely on dplyr a bit too much...)
Thank you in advance. And, also in advance, I apologise if the way I presented the question is not ideal (I will gladly take any pointers on how to better formulate questions on stack exchange too. For instance I can't seem to figure out how to display the data structure)
Very nice first question!
This is actually just a matter of operator precedence. When you call !!condition[sample(row_number())], it is interpreted as !!(condition[sample(row_number())]) i.e. you are trying to subset the quosure then apply double bang, but you mean (!!condition)[sample(row_number())], that is, you want to subset the result of the double-bang. So just apply brackets to fix the order of evaluation and it works as expected:
resample_within <- function(df, subject, trial, condition) {
subject <- enquo(subject)
trial <- enquo(trial)
condition <- enquo(condition)
resampled <-
df %>%
group_by(!!subject, !!trial, !!condition) %>%
nest() %>%
group_by(!!subject) %>%
mutate(condition = (!!condition)[sample(row_number())]) %>%
unnest(data)
return(resampled)
}
Now:
resample_within(sampel, s0, r0, c0)
#> # A tibble: 480 x 6
#> # Groups: s0 [5]
#> s0 r0 c0 t0 v0 condition
#> <chr> <int> <chr> <int> <dbl> <chr>
#> 1 s1 1 A 1 0 B
#> 2 s1 1 A 2 0.981 B
#> 3 s1 1 A 3 1.96 B
#> 4 s1 1 A 4 2.94 B
#> 5 s1 1 A 5 3.93 B
#> 6 s1 1 A 6 4.91 B
#> 7 s1 1 A 7 5.89 B
#> 8 s1 1 A 8 6.87 B
#> 9 s2 1 B 1 0 A
#> 10 s2 1 B 2 0.976 A
#> # ... with 470 more rows
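The precedence point can be checked without dplyr or rlang by asking R how it parses the two spellings. This is only a sketch: `condition` and `i` are placeholder symbols that are never evaluated.

```r
# Parse, but do not evaluate, both versions
unbracketed <- quote(!!condition[i])
bracketed   <- quote((!!condition)[i])

# Without brackets the outermost call is `!` and the subsetting is innermost,
# so `[` is applied to the quosure before it is unquoted
outer_unbracketed <- as.character(unbracketed[[1]])
inner_unbracketed <- as.character(unbracketed[[2]][[2]][[1]])

# With brackets, `[` is the outermost call and is applied after unquoting
outer_bracketed <- as.character(bracketed[[1]])

c(outer_unbracketed, inner_unbracketed, outer_bracketed)
```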
We can use the curly-curly ({{}}) operator
library(dplyr)
library(tidyr)
resample_within <- function(df, subject, trial, condition) {
df %>%
group_by({{subject}}, {{trial}}, {{condition}}) %>%
nest() %>%
group_by({{subject}}) %>%
mutate(condition = ({{condition}})[sample(row_number())]) %>%
unnest(data)
}
resample_within(sampel, s0, r0, c0)
I am trying to use a data.frame twice in a dplyr chain. Here is a simple example that gives an error
df <- data.frame(Value=1:10,Type=rep(c("A","B"),5))
df %>%
group_by(Type) %>%
summarize(X=n()) %>%
mutate(df %>%filter(Value>2) %>%
group_by(Type) %>%
summarize(Y=sum(Value)))
Error: cannot handle
So the idea is that first a data.frame is created with two columns Value which is just some data and Type which indicates which group the value is from.
I then try to use summarize to get the number of objects in each group, and then mutate, using the object again to get the sum of the values, after the data has been filtered. However I get the Error: cannot handle. Any ideas what is happening here?
Desired Output:
Type X Y
A 5 24
B 5 28
You could try the following
df %>%
group_by(Type) %>%
summarise(X = n(), Y = sum(Value[Value > 2]))
# Source: local data frame [2 x 3]
#
# Type X Y
# 1 A 5 24
# 2 B 5 28
The idea is to filter only Value by the desired condition, instead of the whole data set.
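A small base R sketch of why this distinction matters: filtering the rows first would change the group sizes that X is meant to count, while subsetting only the Value vector leaves them alone. The names `n_filtered`, `n_all` and `y` are illustrative.

```r
df <- data.frame(Value = 1:10, Type = rep(c("A", "B"), 5))

# Filtering the whole data set drops rows, so each group shrinks
filtered <- df[df$Value > 2, ]
n_filtered <- table(filtered$Type)

# Counting on the full data keeps all rows per group ...
n_all <- table(df$Type)

# ... while the condition is applied only inside the sum
y <- tapply(df$Value, df$Type, function(v) sum(v[v > 2]))

n_filtered
n_all
y
```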
And a bonus solution
library(data.table)
setDT(df)[, .(X = .N, Y = sum(Value[Value > 2])), by = Type]
# Type X Y
# 1: A 5 24
# 2: B 5 28
Was going to suggest that to @nongkrong, but he deleted. With base R we could also do
aggregate(Value ~ Type, df, function(x) c(length(x), sum(x[x>2])))
# Type Value.1 Value.2
# 1 A 5 24
# 2 B 5 28
This is also pretty easy to do with ifelse()
df %>% group_by(Type) %>% summarize(X=n(),y=sum( ifelse(Value>2, Value, 0 )))
outputs:
Source: local data frame [2 x 3]
Type X y
1 A 5 24
2 B 5 28