Using n() at the same time as calculating other summary statistics - r

I am having trouble preparing a summary table using dplyr based on the data set below:
set.seed(1)
df <- data.frame(rep(sample(c(2012,2016),10, replace = T)),
sample(c('Treat','Control'),10,replace = T),
runif(10,0,1),
runif(10,0,1),
runif(10,0,1))
colnames(df) <- c('Year','Group','V1','V2','V3')
I want to calculate the mean, median and standard deviation, and count the number of observations, for each combination of Year and Group.
I have successfully used this code to get mean, median and sd:
summary.table = df %>%
group_by(Year, Group) %>%
summarise_all(funs(n(), sd, median, mean))
However, I do not know how to use n() properly inside funs(): it gives me a separate count column for each of V1, V2 and V3. This is redundant, since I only want the sample size once per group. I have tried adding
mutate(N = n()) %>%
before and after the group_by() line, but it did not give me what I wanted.
Any help?
EDIT: I had not made my question clear enough. The problem is that the code gives me columns I do not need: a single count of observations per group (for example, the one for V1) is sufficient.

Add the N column before summarizing as an extra grouping column:
library(dplyr)
set.seed(1)
df <- data.frame(Year = rep(sample(c(2012, 2016), 10, replace = TRUE)),
Group = sample(c('Treat', 'Control'), 10, replace = TRUE),
V1 = runif(10, 0, 1),
V2 = runif(10, 0, 1),
V3 = runif(10, 0, 1))
df2 <- df %>%
group_by(Year, Group) %>%
group_by(N = n(), add = TRUE) %>%
summarise_all(funs(sd, median, mean))
df2
#> # A tibble: 4 x 12
#> # Groups: Year, Group [?]
#> Year Group N V1_sd V2_sd V3_sd V1_median V2_median
#> <dbl> <fctr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2012 Control 2 0.05170954 0.29422635 0.1152669 0.3037848 0.6193239
#> 2 2012 Treat 2 0.51092899 0.08307494 0.1229560 0.5734239 0.5408230
#> 3 2016 Control 3 0.32043716 0.34402222 0.3822026 0.3823880 0.4935413
#> 4 2016 Treat 3 0.37759667 0.29566739 0.1233162 0.3861141 0.6684667
#> # ... with 4 more variables: V3_median <dbl>, V1_mean <dbl>,
#> # V2_mean <dbl>, V3_mean <dbl>
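In current dplyr releases, funs() and the add argument of group_by() are deprecated. A minimal sketch of the same idea with summarise() and across() follows, assuming dplyr 1.0 or later. The name summary_table is illustrative, and the random values depend on the R version's sampler, so no output is shown.

```r
library(dplyr)

set.seed(1)
df <- data.frame(Year = sample(c(2012, 2016), 10, replace = TRUE),
                 Group = sample(c("Treat", "Control"), 10, replace = TRUE),
                 V1 = runif(10), V2 = runif(10), V3 = runif(10))

# n() is computed once per group; across() applies the named
# functions to V1..V3 and names the results V1_sd, V1_median, ...
summary_table <- df %>%
  group_by(Year, Group) %>%
  summarise(N = n(),
            across(V1:V3, list(sd = sd, median = median, mean = mean)),
            .groups = "drop")
summary_table
```

Groups with a single observation get NA for the sd columns, which is the normal behaviour of sd().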

Are you getting the same error I am?
“Error in n(): function should not be called directly”
If so, there is a Stack Overflow question on it that might help:
dplyr: "Error in n(): function should not be called directly"
The resolution seems to be detaching plyr, which conflicts with dplyr, and then reloading the dplyr library.

Related

dplyr summarise across output as rows?

I would like to generate overview tables for the same statistics (e.g., n, mean, sd) across multiple variables.
I started by combining the dplyr summarise and across functions. See the following example:
df <- data.frame(
var1 = 1:10,
var2 = 11:20
)
VarSum <- df %>% summarise(across(c(var1, var2), list(n = length, mean = mean, sd = sd)))
The output is of course given as one row (1x6), with three columns for each variable in this example. What I would like to achieve is to get the output row-wise, one row per variable (2x3). Is that even possible with my approach? I would appreciate any suggestions.
You can pivot first:
library(dplyr)
library(tidyr)
df %>%
pivot_longer(everything()) %>%
summarise(across(value, list(n = length, mean = mean, sd = sd)), .by = name)
name value_n value_mean value_sd
<chr> <int> <dbl> <dbl>
1 var1 10 5.5 3.03
2 var2 10 15.5 3.03

Loop through specific columns of dataframe keeping some columns as fixed

I have a large dataset with the two first columns that serve as ID (one is an ID and the other one is a year variable). I would like to compute a count by group and to loop over each variable that is not an ID one. This code below shows what I want to achieve for one variable:
library(tidyverse)
df <- tibble(
ID1 = c(rep("a", 10), rep("b", 10)),
year = c(2001:2020),
var1 = rnorm(20),
var2 = rnorm(20))
df %>%
select(ID1, year, var1) %>%
filter(if_any(starts_with("var"), ~!is.na(.))) %>%
group_by(year) %>%
count() %>%
print(n = Inf)
I cannot use a loop that starts with for(i in names(df)) since I want to keep the variables "ID1" and "year". How can I run this piece of code for all the columns that start with "var"? I tried using quosures but it did not work: I receive the error "select() doesn't handle lists". I also tried to work with select(starts_with("var")), but with no success.
Many thanks!
Another possible solution:
library(tidyverse)
df %>%
group_by(ID1) %>%
summarise(across(starts_with("var"), ~ length(na.omit(.x))))
#> # A tibble: 2 × 3
#> ID1 var1 var2
#> <chr> <int> <int>
#> 1 a 10 10
#> 2 b 10 10
To keep a for loop, iterate only over the columns whose names contain 'var':
for (i in names(df)[grepl('var', names(df))])
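The loop header above has no body, so here is one hedged way to complete it: count the non-missing observations per year for each "var" column and collect the results in a list. The names var_cols and counts, the seed, and the two injected NA values are assumptions made for illustration only.

```r
library(dplyr)

set.seed(42)  # arbitrary seed, only to make the sketch reproducible
df <- data.frame(ID1 = rep(c("a", "b"), each = 10),
                 year = 2001:2020,
                 var1 = rnorm(20),
                 var2 = rnorm(20))
df$var2[c(3, 7)] <- NA  # hypothetical missing values, added for illustration

# Loop only over the columns whose names contain "var"
var_cols <- names(df)[grepl("var", names(df))]
counts <- list()
for (i in var_cols) {
  counts[[i]] <- df %>%
    filter(!is.na(.data[[i]])) %>%  # .data[[i]] looks the column up by its string name
    count(year)
}
```

Each element of counts is a data frame with one row per year that has at least one non-missing value for that variable.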

Running linear models for groups within dataframe and storing outputs in dataframe in R

I am trying to run multiple linear models for a very large dataset and store the outputs in a dataframe. I have managed to get estimates and p-values into dataframe (see below) but I also want to store the AIC for each model.
#example dataframe
dt = data.frame(x = rnorm(40, 5, 5),
y = rnorm(40, 3, 4),
group = rep(c("a","b"), 20))
library(dplyr)
library(broom)
# code that runs lm for each group and stores the output
dt_lm <- dt %>%
group_by(group) %>%
do(tidy(lm(y~x, data=.)))
Use glance instead of tidy:
dt_lm <- dt %>%
group_by(group) %>%
do(glance(lm(y~x, data=.))) %>%
select(AIC)
which gives:
Adding missing grouping variables: `group`
# A tibble: 2 x 2
# Groups: group [2]
group AIC
<chr> <dbl>
1 a 119.
2 b 114.
If you want to store other metrics as well as the AIC, just skip the select part.
In newer versions of dplyr (>= 1.0), we can also use nest_by:
library(dplyr)
library(tidyr)
library(broom)
dt %>%
nest_by(group) %>%
transmute(out = list(glance(lm(y ~ x, data = data)))) %>%
unnest(c(out)) %>%
select(AIC)
# A tibble: 2 x 2
# Groups: group [2]
# group AIC
# <chr> <dbl>
#1 a 115.
#2 b 100.
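Since the question asks for the AIC alongside the estimates and p-values, the two can also be combined in one table. This is a sketch only, assuming dplyr >= 0.8.1 (for group_modify) and broom are installed; the seed is arbitrary and the AIC value is simply repeated on every coefficient row of its group.

```r
library(dplyr)
library(broom)

set.seed(123)  # arbitrary seed, only to make the sketch reproducible
dt <- data.frame(x = rnorm(40, 5, 5),
                 y = rnorm(40, 3, 4),
                 group = rep(c("a", "b"), 20))

dt_lm <- dt %>%
  group_by(group) %>%
  group_modify(~ {
    fit <- lm(y ~ x, data = .x)                  # one model per group
    tidy(fit) %>% mutate(AIC = stats::AIC(fit))  # coefficients plus the model's AIC
  }) %>%
  ungroup()
dt_lm
```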

Why does the ``mean`` function not work properly with ``group_by %>% summarise`` in a function environment?

For example:
df <- data.frame("Treatment" = c(rep("A", 2), rep("B", 2)), "Price" = 1:4, "Cost" = 2:5)
I want to summarize the data by treatments for all the variables I have, and put them together, so I define a function to do this for each variable first, and then rbind them later on.
SummarizeFn <- function(x,y,z) {
df1 <- x %>% group_by(Treatment) %>%
summarize(n = n(), Mean = mean(y), SD = sd(y))
df1$Var = z # add a column to show which variable those statistics belong to.
df1 # return the summary table
}
SumPrice <- SummarizeFn(df, df$Price, "Price")
However, the results are:
Treatment n Mean SD Var
<fct> <int> <dbl> <dbl> <chr>
1 A 2 2.5 1.29 Price
2 B 2 2.5 1.29 Price
They are the mean and sd of all the observations, but not the grouped observations by Treatment. What is the problem here?
If I take the code out of the function environment, it works totally fine. Please help, thanks.
If you have a better way to achieve my purpose, that would be great! Thanks!
When you refer to a column with $ (for example df$Price) inside a dplyr pipe, the whole column of the original data frame is used, so the grouping is ignored and the statistic is computed over all rows. To pass an unquoted column name into a function instead, wrap the argument in {{ }}.
library(dplyr)
SummarizeFn <- function(x,y,z) {
x %>%
group_by(Treatment) %>%
summarize(n = n(), Mean = mean({{y}}), SD = sd({{y}}), Var = z)
}
SummarizeFn(df, Price, "Price")
# Treatment n Mean SD Var
# <fct> <int> <dbl> <dbl> <chr>
#1 A 2 1.5 0.707 Price
#2 B 2 3.5 0.707 Price
This is related to the question of non-standard evaluation. That's funny, I just wrote an article on the subject. It is quite hard to pass column names as strings with dplyr. If you need to do that, use rlang::sym (or rlang::syms) and !! (or !!!).
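As a hedged illustration of the rlang::sym and !! approach, the sketch below takes the column name as a string and then stacks the per-variable summaries. The function name SummarizeStr and the object result are made up for this example.

```r
library(dplyr)

df <- data.frame(Treatment = rep(c("A", "B"), each = 2),
                 Price = 1:4, Cost = 2:5)

SummarizeStr <- function(data, col) {
  col_sym <- rlang::sym(col)  # turn the string into a symbol
  data %>%
    group_by(Treatment) %>%
    summarize(n = n(),
              Mean = mean(!!col_sym),  # !! unquotes the symbol inside the data mask
              SD = sd(!!col_sym),
              Var = col)
}

# One summary per variable, bound together row-wise
result <- bind_rows(lapply(c("Price", "Cost"), function(v) SummarizeStr(df, v)))
result
```

Because the column is resolved inside summarize(), the statistics respect the grouping by Treatment.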
Regarding your problem, I think data.table offers you a concise solution:
library(data.table)
dt <- as.data.table(mtcars)
output <- dt[,lapply(.SD, function(d) return(list(.N,mean(d),sd(d)))),
.SDcols = c("mpg","qsec")]
output[,'stat' := c("observations","mean","sd")]
output
# output
# mpg qsec stat
# 1: 32 32 observations
# 2: 20.09062 17.84875 mean
# 3: 6.026948 1.786943 sd
I propose an anonymous function with lapply, but you could use a more sophisticated function defined before the summary step. Change .SDcols to include more variables if needed.

Collapse data frame, by group, using lists of variables for weighted average AND sum

I want to collapse the following data frame, using both summation and weighted averages, according to groups.
I have the following data frame
group_id = c(1,1,1,2,2,3,3,3,3,3)
var_1 = sample.int(20, 10)
var_2 = sample.int(20, 10)
var_percent_1 =rnorm(10,.5,.4)
var_percent_2 =rnorm(10,.5,.4)
weighting =sample.int(50, 10)
df_to_collapse = data.frame(group_id,var_1,var_2,var_percent_1,var_percent_2,weighting)
I want to collapse my data according to the groups identified by group_id. However, in my data, I have variables in absolute levels (var_1, var_2) and in percentage terms (var_percent_1, var_percent_2).
I create two lists for each type of variable (my real data is much bigger, making this necessary). I also have a weighting variable (weighting).
to_be_weighted =df_to_collapse[, 4:5]
to_be_summed = df_to_collapse[,2:3]
to_be_weighted_2=colnames(to_be_weighted)
to_be_summed_2=colnames(to_be_summed)
My goal is to simultaneously collapse my data using either sum or weighted average, according to the type of variable (i.e. if it's in percentage terms, I use the weighted average).
Here is my best attempt:
df_to_collapse %>% group_by(group_id) %>% summarise_at(.vars = c(to_be_summed_2,to_be_weighted_2), .funs=c(sum, mean))
But, as you can see, it is not a weighted average.
I have tried many different ways of using the weighted.mean function, but have had no luck. Here is an example of one such attempt:
df_to_collapse %>% group_by(group_id) %>% summarise_at(.vars = c(to_be_weighted_2,to_be_summed_2), .funs=c(weighted.mean(to_be_weighted_2, weighting), sum))
And the corresponding error:
Error in weighted.mean.default(to_be_weighted_2, weighting) :
'x' and 'w' must have the same length
Here's a way to do it by reshaping into long data, adding a dummy variable called type for whether it's a percentage (optional, but handy), applying a function in summarise based on whether it's a percentage, then spreading back to wide shape. If you can change column names, you could come up with a more elegant way of doing the type column, but that's really more for convenience.
The trick for me was the type[1] == "percent"; I had to use [1] because everything in each group has the same type, but otherwise == operates over every value in the vector and gives multiple logical values, when you really just need 1.
library(tidyverse)
set.seed(1234)
group_id = c(1,1,1,2,2,3,3,3,3,3)
var_1 = sample.int(20, 10)
var_2 = sample.int(20, 10)
var_percent_1 =rnorm(10,.5,.4)
var_percent_2 =rnorm(10,.5,.4)
weighting =sample.int(50, 10)
df_to_collapse <- data.frame(group_id,var_1,var_2,var_percent_1,var_percent_2,weighting)
df_to_collapse %>%
gather(key = var, value = value, -group_id, -weighting) %>%
mutate(type = ifelse(str_detect(var, "percent"), "percent", "int")) %>%
group_by(group_id, var) %>%
summarise(sum_or_avg = ifelse(type[1] == "percent", weighted.mean(value, weighting), sum(value))) %>%
ungroup() %>%
spread(key = var, value = sum_or_avg)
#> # A tibble: 3 x 5
#> group_id var_1 var_2 var_percent_1 var_percent_2
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 26 31 0.269 0.483
#> 2 2 32 21 0.854 0.261
#> 3 3 29 49 0.461 0.262
Created on 2018-05-04 by the reprex package (v0.2.0).
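With dplyr 1.0 or later, the same collapse can be sketched without reshaping, by using one across() call per variable type. This assumes the two character vectors of column names from the question (to_be_summed_2 and to_be_weighted_2); the result name collapsed is illustrative.

```r
library(dplyr)

set.seed(1234)
df_to_collapse <- data.frame(group_id = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3),
                             var_1 = sample.int(20, 10),
                             var_2 = sample.int(20, 10),
                             var_percent_1 = rnorm(10, .5, .4),
                             var_percent_2 = rnorm(10, .5, .4),
                             weighting = sample.int(50, 10))

to_be_summed_2 <- c("var_1", "var_2")
to_be_weighted_2 <- c("var_percent_1", "var_percent_2")

collapsed <- df_to_collapse %>%
  group_by(group_id) %>%
  summarise(
    # weighting is looked up within the current group, so lengths always match
    across(all_of(to_be_weighted_2), ~ weighted.mean(.x, weighting)),
    across(all_of(to_be_summed_2), sum),
    .groups = "drop"
  )
collapsed
```

The original attempt failed because weighted.mean() was called on the vector of column names rather than on the column values; the lambda inside across() receives the values of each column for the current group.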
