Say I have a data frame:
df <- data.frame(a = 1:10,
b = 1:10,
c = 1:10)
I'd like to apply several summary functions to each column, so I use dplyr::summarise_all:
library(dplyr)
df %>% summarise_all(.funs = c(mean, sum))
# a_fn1 b_fn1 c_fn1 a_fn2 b_fn2 c_fn2
# 1 5.5 5.5 5.5 55 55 55
This works great! Now, say I have a function that takes an extra parameter. For example, this function calculates the number of elements in a column above a threshold. (Note: this is a toy example and not the real function.)
n_above_threshold <- function(x, threshold) sum(x > threshold)
So, the function works like this:
n_above_threshold(1:10, 5)
#[1] 5
I can apply it to all columns like before, but this time passing the additional parameter, like so:
df %>% summarise_all(.funs = c(mean, n_above_threshold), threshold = 5)
# a_fn1 b_fn1 c_fn1 a_fn2 b_fn2 c_fn2
# 1 5.5 5.5 5.5 5 5 5
But, say I have a vector of thresholds where each element corresponds to a column. Say, c(1, 5, 7) for my example above. Of course, I can't simply do this, as it doesn't make any sense:
df %>% summarise_all(.funs = c(mean, n_above_threshold), threshold = c(1, 5, 7))
If I was using base R, I might do this:
> mapply(n_above_threshold, df, c(1, 5, 7))
# a b c
# 9 5 3
Is there a way of getting this result as part of a dplyr piped workflow like I was using for the simpler cases?
dplyr provides a number of context-dependent functions. One is cur_column(), which returns the name of the column currently being processed. You can use it inside across() in summarise to look up the threshold for a given column.
library("tidyverse")
df <- data.frame(
a = 1:10,
b = 1:10,
c = 1:10
)
n_above_threshold <- function(x, threshold) sum(x > threshold)
# Pair the parameters with the columns
thresholds <- c(1, 5, 7)
names(thresholds) <- colnames(df)
df %>%
summarise(
across(
everything(),
# Use `cur_column()` to access each column name in turn
list(count = ~ n_above_threshold(.x, thresholds[cur_column()]),
mean = mean)
)
)
#> a_count a_mean b_count b_mean c_count c_mean
#> 1 9 5.5 5 5.5 3 5.5
This returns NA silently if the current column name doesn't have a known threshold. This is something that you might or might not want to happen.
df %>%
# Add extra column to show what happens if we don't know the threshold for a column
mutate(
x = 1:10
) %>%
summarise(
across(
everything(),
# Use `cur_column()` to access each column name in turn
list(count = ~ n_above_threshold(.x, thresholds[cur_column()]),
mean = mean)
)
)
#> a_count a_mean b_count b_mean c_count c_mean x_count x_mean
#> 1 9 5.5 5 5.5 3 5.5 NA 5.5
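If a silent NA is not what you want, one option is to make the lookup fail loudly. The following is only a sketch: `lookup_threshold()` is a hypothetical helper name (not part of dplyr), and the `.names` pattern is just one way to label the output columns.

```r
library(dplyr)

df <- data.frame(a = 1:10, b = 1:10, c = 1:10)
n_above_threshold <- function(x, threshold) sum(x > threshold)
thresholds <- c(a = 1, b = 5, c = 7)

# Hypothetical helper: fetch the threshold for a column by name,
# stopping with an error instead of returning NA when it is missing
lookup_threshold <- function(col, thresholds) {
  if (!col %in% names(thresholds)) {
    stop("No threshold defined for column '", col, "'", call. = FALSE)
  }
  thresholds[[col]]
}

counts <- df %>%
  summarise(
    across(
      everything(),
      ~ n_above_threshold(.x, lookup_threshold(cur_column(), thresholds)),
      .names = "{.col}_count"
    )
  )
counts
```

With this helper, adding a column that has no entry in `thresholds` makes the `summarise()` call stop with an error rather than produce NA.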
Created on 2022-03-11 by the reprex package (v2.0.1)
Let's say I make a dummy dataframe with 6 columns with 10 observations:
X <- data.frame(a=1:10, b=11:20, c=21:30, d=31:40, e=41:50, f=51:60)
I need to create a loop that evaluates 3 columns at a time, adding the summed second and third columns and dividing this by the sum of the first column:
(sum(b)+sum(c))/sum(a) ... (sum(e)+sum(f))/sum(d) ...
I then need to construct a final dataframe from these values. For example using the dummy dataframe above, it would look like:
value
1. 7.454545
2. 2.84507
I imagine I need to use the next function to iterate within the loop, but I'm fairly lost! Thank you for any help.
You can split your data frame into groups of 3 by creating a vector with rep where each element repeats 3 times. Then with this list of sub data frames, (s)apply the function of summing the second and third columns, adding them, and dividing by the sum of the first column.
out_vec <-
sapply(
split.default(X, rep(1:ncol(X), each = 3, length.out = ncol(X)))
, function(x) (sum(x[2]) + sum(x[3]))/sum(x[1]))
data.frame(value = out_vec)
# value
# 1 7.454545
# 2 2.845070
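To see what the split is doing, it can help to look at the grouping vector and the resulting pieces on their own. This is a small illustration using the same dummy data; the names `grp` and `pieces` are introduced here only for the example.

```r
X <- data.frame(a = 1:10, b = 11:20, c = 21:30, d = 31:40, e = 41:50, f = 51:60)

# One group label per column: the first three columns get 1, the next three get 2
grp <- rep(seq_len(ncol(X)), each = 3, length.out = ncol(X))
grp

# split.default() splits the data frame by columns (as a list), not by rows
pieces <- split.default(X, grp)
lapply(pieces, names)
```

Each element of `pieces` is a three-column data frame, which is what the anonymous function passed to sapply receives as `x`.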
You could also sum all the columns up front before the sapply with colSums, which will be more efficient.
out_vec <-
sapply(
split(colSums(X), rep(1:ncol(X), each = 3, length.out = ncol(X)))
, function(x) (x[2] + x[3])/x[1])
data.frame(value = out_vec, row.names = NULL)
# value
# 1 7.454545
# 2 2.845070
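Once the column sums are available, the grouping can also be expressed by reshaping them into a matrix with three rows, so that each matrix column holds one group of three. This is a sketch that assumes the number of columns in X is a multiple of 3; the name `m` is just for illustration.

```r
X <- data.frame(a = 1:10, b = 11:20, c = 21:30, d = 31:40, e = 41:50, f = 51:60)

# matrix() fills column by column, so column 1 holds the sums of a, b, c
# and column 2 holds the sums of d, e, f
m <- matrix(colSums(X), nrow = 3)

# (second + third) / first, computed for every group at once
value <- (m[2, ] + m[3, ]) / m[1, ]
data.frame(value = value)
```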
You could use tapply:
tapply(colSums(X), gl(ncol(X)/3, 3), function(x)sum(x[-1])/x[1])
1 2
7.454545 2.845070
Here is an option with tidyverse
library(dplyr) # 1.0.0
library(tidyr)
X %>%
summarise(across(everything(), sum)) %>%
pivot_longer(everything()) %>%
group_by(grp = as.integer(gl(n(), 3, n()))) %>%
summarise(value = sum(lead(value)/first(value), na.rm = TRUE)) %>%
select(value)
# A tibble: 2 x 1
# value
# <dbl>
#1 7.45
#2 2.85
I have a data frame with N vars, M categorical and 2 numeric. I would like to create M data frames, one for each categorical variable.
Eg.,
data %>%
group_by(var1) %>%
summarise(sumVar5 = sum(var5),
meanVar6 = mean(var6))
data %>%
group_by(varM) %>%
summarise(sumVar5 = sum(var5),
meanVar6 = mean(var6))
etc...
Is there a way to iterate through the categorical variables and generate each of the summary tables? That is, without needing to repeat the above chunks M times.
Alternatively, these summary tables don't have to be individual objects, as long as I can easily reference / pull the summaries for each of the M variables.
Here is one solution. It creates a list of data frames, one per grouping variable, using the summary you specified:
library(tidyverse)
# Create sample data frame
data <- data.frame(var1 = sample(1:2, 5, replace = TRUE),
var2 = sample(1:2, 5, replace = TRUE),
var3 = sample(1:2, 5, replace = TRUE),
varM = sample(1:2, 5, replace = TRUE),
var5 = rnorm(5, 3, 6),
var6 = rnorm(5, 3, 6))
# Vars to be grouped (var1 until varM in this example)
vars_to_be_used <- names(select(data, var1:varM))
# Function to be used
group_fun <- function(x, .df = data) {
  .df %>%
    # `x` is a column name held as a string, so select it with all_of()
    group_by(across(all_of(x))) %>%
    summarise(sumVar5 = sum(var5),
              meanVar6 = mean(var6))
}
# Loop over vars
results <- map(vars_to_be_used, group_fun)
# Nice list names
names(results) <- vars_to_be_used
print(results)
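If the summaries do not need to be separate objects, another possibility is to reshape the categorical columns into long form and summarise once. The sketch below uses small made-up data and assumes all categorical columns share a type; the column names `variable` and `level` are chosen here for illustration.

```r
library(dplyr)
library(tidyr)

# Small deterministic example data (made up for this sketch)
data <- data.frame(var1 = c(1, 1, 2, 2, 2),
                   var2 = c(1, 2, 1, 2, 1),
                   var3 = c(2, 2, 1, 1, 2),
                   varM = c(1, 2, 2, 1, 1),
                   var5 = c(10, 20, 30, 40, 50),
                   var6 = c(1, 2, 3, 4, 5))

summary_tbl <- data %>%
  # Stack the categorical columns: one row per (original row, variable)
  pivot_longer(var1:varM, names_to = "variable", values_to = "level") %>%
  group_by(variable, level) %>%
  summarise(sumVar5 = sum(var5),
            meanVar6 = mean(var6),
            .groups = "drop")

# Pull out the summary for a single categorical variable
filter(summary_tbl, variable == "var1")
```

This gives one table that can be filtered by `variable`, at the cost of the grouping columns no longer keeping their original names.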
You didn't supply a sample data set, so I created a small example to show how it works.
data <- tibble::tibble(var1 = rep(letters[1:5], 2),
var2 = rep(LETTERS[11:15], 2),
var3 = 1:10,
var4 = 11:20)
A combination of tidyverse packages can get you where you need to be.
Steps used: First we gather all the columns we want to group by into a cols column and keep the numeric vars separate. Next we split the data.frame into a list of data.frames so that every column we want to group by has its own table with the 2 numeric vars. Now that everything is in a list, we need the map functionality from the purrr package. Using map, we spread each data.frame again so the column names are as we expect them to be. Finally, again with map, we use group_by_if to group by the character column and summarise the rest. All the outcomes are stored in a list where you can access what you need.
Run the code in pieces to see what every step does.
library(dplyr)
library(purrr)
library(tidyr)
outcomes <- data %>%
gather(cols, value, -c(var3, var4)) %>%
split(.$cols) %>%
map(~ spread(.x, cols, value)) %>%
map(~ group_by_if(.x, is.character) %>%
summarise(sumvar3 = sum(var3),
meanvar4 = mean(var4)))
outcomes
$`var1`
# A tibble: 5 x 3
var1 sumvar3 meanvar4
<chr> <int> <dbl>
1 a 7 13.5
2 b 9 14.5
3 c 11 15.5
4 d 13 16.5
5 e 15 17.5
$var2
# A tibble: 5 x 3
var2 sumvar3 meanvar4
<chr> <int> <dbl>
1 K 7 13.5
2 L 9 14.5
3 M 11 15.5
4 N 13 16.5
5 O 15 17.5
I am trying to use dplyr's new NSE language approach to create a conditional mutate, using a vector input. Where I am having trouble is setting the column equal to itself; see the MWE below:
df <- data.frame("Name" = c(rep("A", 3), rep("B", 3), rep("C", 4)),
"X" = runif(1:10),
"Y" = runif(1:10)) %>%
tbl_df() %>%
mutate_if(is.factor, as.character)
ColToChange <- "Name"
ToChangeTo <- "Big"
Now, using the following:
df %>% mutate( !!ColToChange := ifelse(X >= 0.5 & Y >= 0.5, ToChangeTo, !!ColToChange))
Sets the ColToChange value to Name, not back to its original value. I am thus trying to use the syntax above to achieve this:
df %>% mutate( !!ColToChange := ifelse(X >= 0.5 & Y >= 0.5, ToChangeTo, Name))
But instead of Name, have it be the vector.
You need to use rlang::sym to turn the string in ColToChange into the symbol Name first, then unquote it with !! so it is evaluated as a column:
library(rlang); library(dplyr);
df %>% mutate(!!ColToChange := ifelse(X >= 0.5 & Y >= 0.5, ToChangeTo, !!sym(ColToChange)))
# A tibble: 10 x 3
# Name X Y
# <chr> <dbl> <dbl>
# 1 A 0.05593119 0.3586310
# 2 A 0.70024660 0.4258297
# 3 Big 0.95444388 0.7152358
# 4 B 0.45809482 0.5256475
# 5 Big 0.71348123 0.5114379
# 6 B 0.80382633 0.2665391
# 7 Big 0.99618062 0.5788778
# 8 Big 0.76520307 0.6558515
# 9 C 0.63928001 0.1972674
#10 C 0.29963517 0.5855646
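An alternative that avoids building a symbol is the `.data` pronoun, which lets you index a column by a string. This is a sketch with small made-up data rather than the random data from the question.

```r
library(dplyr)

# Made-up data so the result is deterministic
df <- tibble(Name = c("A", "A", "B"),
             X = c(0.9, 0.2, 0.7),
             Y = c(0.8, 0.9, 0.6))

ColToChange <- "Name"
ToChangeTo <- "Big"

out <- df %>%
  mutate(!!ColToChange := ifelse(X >= 0.5 & Y >= 0.5,
                                 ToChangeTo,
                                 .data[[ColToChange]]))  # look the column up by name
out
```

Rows 1 and 3 meet both conditions and become "Big"; row 2 keeps its original value.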
I would like to implement a function that has the same interface as dplyr's filter but, instead of removing the rows that do not match a condition, would for instance return an array with an indicator variable, or attach such a column to the returned tibble.
I would find it very useful since it would allow me to compute summaries of some columns after and before filtering as well as summaries of the rows which would have been removed on a single tibble.
I find the dplyr::filter interface very convenient and therefore would like to emulate it.
I think group_by will help you here.
You might normally filter, then summarise, like so:
library(dplyr)
mtcars %>%
filter(cyl==4) %>%
summarise(mean=mean(gear))
# mean
# 1 4.090909
You can group_by, summarise, then filter
mtcars %>%
group_by(cyl) %>%
summarise(mean=mean(gear))
# optional filter here
# # A tibble: 3 x 2
# cyl mean
# <dbl> <dbl>
# 1 4 4.090909
# 2 6 3.857143
# 3 8 3.285714
You can group by conditionals as well, like so
mtcars %>%
group_by(cyl > 4) %>%
summarise(mean=mean(gear))
# # A tibble: 2 x 2
# `cyl > 4` mean
# <lgl> <dbl>
# 1 FALSE 4.090909
# 2 TRUE 3.476190
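If you do want a filter-like interface that adds an indicator column, one way is to forward the conditions to filter() and record which rows survive. This is only a sketch: `flag_rows()`, its `.name` argument and the temporary `.row_id` column are hypothetical names, and it assumes an ungrouped data frame that has no column called `.row_id`.

```r
library(dplyr)

# Hypothetical helper: same conditions as filter(), but instead of dropping
# rows it adds a logical column saying whether filter() would have kept them
flag_rows <- function(df, ..., .name = "keep") {
  kept <- df %>%
    mutate(.row_id = row_number()) %>%  # remember the original row positions
    filter(...) %>%                     # conditions are forwarded unchanged
    pull(.row_id)
  df %>% mutate(!!.name := row_number() %in% kept)
}

flagged <- mtcars %>% flag_rows(cyl == 4)

# Summaries for kept and removed rows in a single table
flagged %>%
  group_by(keep) %>%
  summarise(n = n(), mean = mean(gear))
```

Because the conditions go straight to filter(), multiple conditions are combined with AND and rows where a condition is NA are flagged FALSE, as filter() would drop them.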
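You need enquo() and !! (or the older UQ()). See the following example:
library(dplyr)
df <- tibble(
  g1 = c(1, 1, 2, 2, 2),
  g2 = c(1, 2, 1, 2, 1),
  a = sample(5),
  b = sample(5)
)
my_summarise <- function(df, group_by) {
  # enquo() captures the expression the caller supplied (here `g1`);
  # quo() would capture the literal symbol `group_by` instead
  quo_group_by <- enquo(group_by)
  print(quo_group_by)
  df %>%
    group_by(!!quo_group_by) %>%
    summarise(a = mean(a))
}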
my_summarise(df, g1)
For more examples and discussion see http://dplyr.tidyverse.org/articles/programming.html
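With more recent versions of rlang and dplyr (assuming rlang 0.4.0 or later), the capture-and-unquote pair can be written in one step with the embrace operator `{{ }}`. The function name `my_summarise2` and the fixed values in `a` and `b` are chosen here only for illustration.

```r
library(dplyr)

df <- tibble(
  g1 = c(1, 1, 2, 2, 2),
  g2 = c(1, 2, 1, 2, 1),
  a = c(1, 2, 3, 4, 5),
  b = c(5, 4, 3, 2, 1)
)

my_summarise2 <- function(df, group_var) {
  df %>%
    group_by({{ group_var }}) %>%  # captures and unquotes the argument in one step
    summarise(a = mean(a))
}

res <- my_summarise2(df, g1)
res
```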
I have created a function that takes a while to run (lots of crunching going on), and I need to return two distinct outputs from it. Both outputs are built from the same inputs, which is why I combined them in one function so that I don't have to crunch the inputs twice. However, the outputs are entirely different in content and based on entirely different calculations, so there is no way to combine them into a single statement. One object is computed tens of lines earlier than the other. I need to return both, so I think the approach has to be something like: store the two separate objects in a single list, lapply, then extract and rbind the two objects.
Any help on a solution to this would be appreciated - ideally not using a for loop or data.table. Dplyr solutions are fine.
Some dummy data:
df <- data.frame(ID = c(rep("A",10), rep("B", 10), rep("C", 10)),
subID = c(rep("U", 5),rep("V", 5),rep("W", 5),rep("X", 5),rep("Y", 5),rep("Z", 5)),
Val = c(1,6,3,8,6,5,2,4,7,20,4,2,3,5,7,3,2,5,7,12,5,3,7,1,6,1,34,9,5,3))
The function (again noting the function is much more complex than this, and I am calculating many more complex and unrelated things in each of the separate objects, not just the average!):
func <- function(x, df){
temp <- filter(df, ID == x)
average_id <- temp %>% group_by(ID) %>% summarise(avg = mean(Val))
average_subid <- temp %>% group_by(ID, subID) %>% summarise(avg = mean(Val))
df_list <- list(avgID=average_id, avgSubID=average_subid)
return(df_list)
}
Presently I have computed the results using this command, but am unsure whether this is correct or how to further extract the results after the objects are stored in this list (of lists) (i.e. I get stuck here):
result <- lapply(list("A","B","C"), func, df)
The result should look like:
> average_ID
ID avg
1 A 6.2
2 B 5.0
3 C 7.4
> average_subID
ID subID avg
1 A U 4.8
2 A V 7.6
3 B W 4.2
4 B X 5.8
5 C Y 4.4
6 C Z 10.4
I have previously used a for loop and stored the results in lists (i.e. avgListID[x] <- average_id), then bound them together. But I don't think this is ideal.
Thanks in advance!
I realize this is a bit old, but since neither provided answer seems to have done the trick, how about this? Split the function into two, and run each within your lapply, returning a list of lists?
library(dplyr)
df <- data.frame(ID = c(rep("A",10), rep("B", 10), rep("C", 10)),
subID = c(rep("U", 5),rep("V", 5),rep("W", 5),rep("X", 5),rep("Y", 5),rep("Z", 5)),
Val = c(1,6,3,8,6,5,2,4,7,20,4,2,3,5,7,3,2,5,7,12,5,3,7,1,6,1,34,9,5,3))
subfunc1 <- function(temp){
return(temp %>% group_by(ID) %>% summarise(avg = mean(Val)))
}
subfunc2 <- function(temp){
return(temp %>% group_by(ID, subID) %>% summarise(avg = mean(Val)))
}
func <- function(x, df){
temp <- filter(df, ID == x)
df_list <- list(avgID=subfunc1(temp), avgSubID=subfunc2(temp))
return(df_list)
}
result <- lapply(list("A","B","C"), func, df)
To get the structure/order you need, transpose the lists as explained here:
n <- length(result[[1]]) # assuming all lists in result have the same length
result <- lapply(1:n, function(i) lapply(result, "[[", i))
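To get from the list of lists to the two data frames shown in the question, one possibility is to pull out each named element and row-bind the pieces. The sketch below repeats the question's data and function so that it runs on its own, and adds `.groups = "drop"` (an assumption that dplyr 1.0.0 or later is available) so the pieces are ungrouped before binding.

```r
library(dplyr)

df <- data.frame(ID = c(rep("A", 10), rep("B", 10), rep("C", 10)),
                 subID = c(rep("U", 5), rep("V", 5), rep("W", 5),
                           rep("X", 5), rep("Y", 5), rep("Z", 5)),
                 Val = c(1,6,3,8,6,5,2,4,7,20,4,2,3,5,7,3,2,5,7,12,
                         5,3,7,1,6,1,34,9,5,3))

func <- function(x, df) {
  temp <- filter(df, ID == x)
  list(
    avgID = temp %>% group_by(ID) %>% summarise(avg = mean(Val), .groups = "drop"),
    avgSubID = temp %>% group_by(ID, subID) %>% summarise(avg = mean(Val), .groups = "drop")
  )
}

result <- lapply(c("A", "B", "C"), func, df)

# Extract the same-named element from every list entry, then stack the rows
average_ID <- bind_rows(lapply(result, `[[`, "avgID"))
average_subID <- bind_rows(lapply(result, `[[`, "avgSubID"))
```

Extracting by name rather than by position keeps this working even if the order of the elements returned by `func` changes.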
average_ID <- aggregate(df$Val, by = list(df$ID), FUN = mean)
average_ID
# Group.1 x
# 1 A 6.2
# 2 B 5.0
# 3 C 7.4
average_subID <- aggregate(df$Val, by = list(df$ID, df$subID), FUN = mean)
average_subID
# Group.1 Group.2 x
# 1 A U 4.8
# 2 A V 7.6
# 3 B W 4.2
# 4 B X 5.8
# 5 C Y 4.4
# 6 C Z 10.4
What about returning a list where each element represents the averages at a specific grouping level? For example:
library(tidyverse)
fnc = function(groups=NULL, data=df) {
  # Turn the column names into symbols; NULL gives an empty list, i.e. no grouping
  groups = rlang::syms(as.character(groups))
  data %>%
    group_by(!!!groups) %>%
    summarise(avg=mean(Val))
}
list(Avg_Overall=NULL, Avg_by_ID="ID", Avg_by_SubID=c("ID","subID")) %>%
map(~fnc(.x))
$Avg_Overall
# A tibble: 1 x 1
avg
<dbl>
1 6.2
$Avg_by_ID
# A tibble: 3 x 2
ID avg
<fctr> <dbl>
1 A 6.2
2 B 5.0
3 C 7.4
$Avg_by_SubID
# A tibble: 6 x 3
# Groups: ID [?]
ID subID avg
<fctr> <fctr> <dbl>
1 A U 4.8
2 A V 7.6
3 B W 4.2
4 B X 5.8
5 C Y 4.4
6 C Z 10.4
You could also just calculate the average by subID and then the average by ID can be calculated from that:
# Average by subID
avg = df %>% group_by(ID, subID) %>%
summarise(n = n(),
avg = mean(Val))
# Average by ID
avg %>%
group_by(ID) %>%
summarise(avg = sum(avg*n)/sum(n))
# Overall average
avg %>%
ungroup %>%
summarise(avg = sum(avg*n)/sum(n))
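The reason for carrying `n` along is that a plain mean of subgroup means is only correct when the subgroups have equal size. A small base R check with made-up, deliberately unequal groups (the names `x1` and `x2` are just for illustration):

```r
# Two subgroups of different sizes
x1 <- c(1, 6, 3)
x2 <- c(8, 6, 5, 2, 4, 7, 20)

sub_means <- c(mean(x1), mean(x2))
sub_n <- c(length(x1), length(x2))

# Weighting each subgroup mean by its size recovers the overall mean
weighted <- weighted.mean(sub_means, sub_n)
overall <- mean(c(x1, x2))

# The unweighted mean of means differs when group sizes differ
unweighted <- mean(sub_means)

c(weighted = weighted, overall = overall, unweighted = unweighted)
```

`weighted.mean(sub_means, sub_n)` is equivalent to the `sum(avg*n)/sum(n)` expression used above.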