I would like to generate overview tables for the same statistics (e.g., n, mean, sd) across multiple variables.
I started by combining the dplyr summarise and across functions. See the following example:
library(dplyr)

df <- data.frame(
var1 = 1:10,
var2 = 11:20
)
VarSum <- df %>% summarise(across(c(var1, var2), list(n = length, mean = mean, sd = sd)))
The output is of course given as one row (1x6), with three columns for each variable in this example. What I would like to achieve is to get the output row-wise, one row per variable (2x3). Is that even possible with my approach? I would appreciate any suggestions.
You can pivot first:
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(everything()) %>%
  summarise(across(value, list(n = length, mean = mean, sd = sd)), .by = name)
# A tibble: 2 × 4
  name  value_n value_mean value_sd
  <chr>   <int>      <dbl>    <dbl>
1 var1       10        5.5     3.03
2 var2       10       15.5     3.03
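If your dplyr version is older than 1.1.0 (the .by argument was added in 1.1.0), an equivalent sketch with an explicit group_by() should give the same result (column names aside):
library(dplyr)
library(tidyr)
# group explicitly instead of using .by (dplyr < 1.1.0)
df %>%
  pivot_longer(everything()) %>%
  group_by(name) %>%
  summarise(n = length(value), mean = mean(value), sd = sd(value))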
Related
I have a dataframe in R, let's call it df, which I would like to analyse column-wise in terms of mean, median, standard deviation, IQR, etc. I have prepared succinct functions (where it's not just mean or sd) which take a vector as input and output, say, the IQR or coefficient of variation. Now, if I want to apply any one of these across the attributes (columns), I could use IQRs <- apply(df, 2, IQR), for example.
My question is: how can I apply several of these functions at once (ideally chaining them all together), so as to fill in a table with one column listing the attributes (the former columns of df, now the rows of this table) and one column per function (i.e. one column of means, one column of IQRs, and so on)?
Suppose your data looked like this:
set.seed(69)
df <- data.frame(A = rnorm(5), B = rnorm(5), C = rnorm(5))
And your function names were like this:
funcs <- c("mean", "median", "sd", "var", "min", "max")
Then you can use an apply inside an lapply like this:
as.data.frame(setNames(lapply(funcs, function(f) apply(df, 2, as.name(f))), funcs))
#> mean median sd var min max
#> A -0.3546864 -0.3348139 0.5948611 0.3538597 -0.949889 0.3743156
#> B -0.2016318 -0.9039467 1.4092795 1.9860687 -1.571073 1.4440935
#> C -0.3537707 -0.1691765 0.7955558 0.6329090 -1.311374 0.4149940
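Since apply() resolves character function names via match.fun(), the as.name() call is not strictly necessary; a slightly more compact sketch (same df and funcs as above) would be:
# apply() accepts the function name as a string, so sapply() over funcs suffices
as.data.frame(sapply(funcs, function(f) apply(df, 2, f)))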
You can use tidyr::gather and dplyr::summarize:
# Toy data
df <- data.frame(x = 1:10, y = 11:20)
# Libs
library(tidyverse)
# Code
df %>%
  gather(var, val) %>%
  group_by(var) %>%
  summarize(med = median(val), mean = mean(val), iqr = IQR(val))
Output:
# A tibble: 2 x 4
var med mean iqr
<chr> <dbl> <dbl> <dbl>
1 x 5.5 5.5 4.5
2 y 15.5 15.5 4.5
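Note that gather() has since been superseded by pivot_longer(); assuming a reasonably recent tidyr, an equivalent sketch is:
library(dplyr)
library(tidyr)
# same idea with pivot_longer() instead of the superseded gather()
df %>%
  pivot_longer(everything(), names_to = "var", values_to = "val") %>%
  group_by(var) %>%
  summarize(med = median(val), mean = mean(val), iqr = IQR(val))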
For example:
df <- data.frame("Treatment" = c(rep("A", 2), rep("B", 2)), "Price" = 1:4, "Cost" = 2:5)
I want to summarize the data by treatments for all the variables I have, and put them together, so I define a function to do this for each variable first, and then rbind them later on.
SummarizeFn <- function(x,y,z) {
  df1 <- x %>% group_by(Treatment) %>%
    summarize(n = n(), Mean = mean(y), SD = sd(y))
  df1$Var <- z # add a column to show which variable those statistics belong to
  df1
}
SumPrice <- SummarizeFn(df, df$Price, "Price")
However, the results are:
Treatment n Mean SD Var
<fct> <int> <dbl> <dbl> <chr>
1 A 2 2.5 1.29 Price
2 B 2 2.5 1.29 Price
They are the mean and sd of all the observations, not of the observations grouped by Treatment. What is the problem here?
If I take the code out of the function environment, it works totally fine. Please help, thanks.
If you have a better way to achieve my purpose, that would be great! Thanks!
When you refer to columns with $ inside dplyr pipes, grouping is not respected and the functions are applied to the entire dataframe. Instead, you can use {{}} to pass column names into the functions.
library(dplyr)
SummarizeFn <- function(x,y,z) {
x %>%
group_by(Treatment) %>%
summarize(n = n(), Mean = mean({{y}}), SD = sd({{y}}), Var = z)
}
SummarizeFn(df, Price, "Price")
# Treatment n Mean SD Var
# <fct> <int> <dbl> <dbl> <chr>
#1 A 2 1.5 0.707 Price
#2 B 2 3.5 0.707 Price
This is related to the question of non-standard evaluation. Funnily enough, I just wrote an article on the subject. Passing column names as strings to dplyr is quite hard. If you need to do that, use rlang::sym (or rlang::syms) and !! (or !!!).
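For example, a sketch of a string-taking variant of the SummarizeFn above (the name SummarizeFnChr is just an illustration):
library(dplyr)
library(rlang)
# hypothetical variant that accepts the column name as a string
SummarizeFnChr <- function(x, y, z) {
  x %>%
    group_by(Treatment) %>%
    summarize(n = n(), Mean = mean(!!sym(y)), SD = sd(!!sym(y)), Var = z)
}
SummarizeFnChr(df, "Price", "Price")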
Regarding your problem, I think data.table offers you a concise solution
library(data.table)

dt <- as.data.table(mtcars)
output <- dt[,lapply(.SD, function(d) return(list(.N,mean(d),sd(d)))),
.SDcols = c("mpg","qsec")]
output[,'stat' := c("observations","mean","sd")]
output
# output
# mpg qsec stat
# 1: 32 32 observations
# 2: 20.09062 17.84875 mean
# 3: 6.026948 1.786943 sd
Here I use an anonymous function inside lapply, but you could use a more sophisticated function defined before the summary step. Add more variables to .SDcols if needed.
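For example, a sketch with a helper defined up front (using length() rather than .N, since .N is only available inside the data.table call; summary_stats is just an illustrative name):
library(data.table)
# named helper returning the same three statistics
summary_stats <- function(d) list(length(d), mean(d), sd(d))
output <- dt[, lapply(.SD, summary_stats), .SDcols = c("mpg", "qsec")]
output[, stat := c("observations", "mean", "sd")]
output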
I want to collapse the following data frame, using both summation and weighted averages, according to groups.
I have the following data frame
group_id = c(1,1,1,2,2,3,3,3,3,3)
var_1 = sample.int(20, 10)
var_2 = sample.int(20, 10)
var_percent_1 =rnorm(10,.5,.4)
var_percent_2 =rnorm(10,.5,.4)
weighting =sample.int(50, 10)
df_to_collapse = data.frame(group_id,var_1,var_2,var_percent_1,var_percent_2,weighting)
I want to collapse my data according to the groups identified by group_id. However, in my data, I have variables in absolute levels (var_1, var_2) and in percentage terms (var_percent_1, var_percent_2).
I create two lists, one for each type of variable (my real data is much bigger, making this necessary). I also have a weighting variable (weighting).
to_be_weighted =df_to_collapse[, 4:5]
to_be_summed = df_to_collapse[,2:3]
to_be_weighted_2=colnames(to_be_weighted)
to_be_summed_2=colnames(to_be_summed)
And my goal is to simultaneously collapse my data using either sum or weighted average, according to the type of variable (i.e. if it is in percentage terms, I use the weighted average).
Here is my best attempt:
df_to_collapse %>% group_by(group_id) %>% summarise_at(.vars = c(to_be_summed_2,to_be_weighted_2), .funs=c(sum, mean))
But, as you can see, it is not a weighted average.
I have tried many different ways of using the weighted.mean function, but have had no luck. Here is an example of one such attempt:
df_to_collapse %>% group_by(group_id) %>% summarise_at(.vars = c(to_be_weighted_2,to_be_summed_2), .funs=c(weighted.mean(to_be_weighted_2, weighting), sum))
And the corresponding error:
Error in weighted.mean.default(to_be_weighted_2, weighting) :
'x' and 'w' must have the same length
Here's a way to do it by reshaping into long data, adding a dummy variable called type for whether it's a percentage (optional, but handy), applying a function in summarise based on whether it's a percentage, then spreading back to wide shape. If you can change column names, you could come up with a more elegant way of doing the type column, but that's really more for convenience.
The trick for me was the type[1] == "percent"; I had to use [1] because everything in each group has the same type, but otherwise == operates over every value in the vector and gives multiple logical values, when you really just need 1.
library(tidyverse)
set.seed(1234)
group_id = c(1,1,1,2,2,3,3,3,3,3)
var_1 = sample.int(20, 10)
var_2 = sample.int(20, 10)
var_percent_1 =rnorm(10,.5,.4)
var_percent_2 =rnorm(10,.5,.4)
weighting =sample.int(50, 10)
df_to_collapse <- data.frame(group_id,var_1,var_2,var_percent_1,var_percent_2,weighting)
df_to_collapse %>%
  gather(key = var, value = value, -group_id, -weighting) %>%
  mutate(type = ifelse(str_detect(var, "percent"), "percent", "int")) %>%
  group_by(group_id, var) %>%
  summarise(sum_or_avg = ifelse(type[1] == "percent", weighted.mean(value, weighting), sum(value))) %>%
  ungroup() %>%
  spread(key = var, value = sum_or_avg)
#> # A tibble: 3 x 5
#> group_id var_1 var_2 var_percent_1 var_percent_2
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 26 31 0.269 0.483
#> 2 2 32 21 0.854 0.261
#> 3 3 29 49 0.461 0.262
Created on 2018-05-04 by the reprex package (v0.2.0).
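With a more recent tidyr, gather()/spread() can be swapped for pivot_longer()/pivot_wider(); a sketch of the same pipeline:
library(tidyverse)
# same approach, reshaping with pivot_longer()/pivot_wider()
df_to_collapse %>%
  pivot_longer(-c(group_id, weighting), names_to = "var", values_to = "value") %>%
  mutate(type = ifelse(str_detect(var, "percent"), "percent", "int")) %>%
  group_by(group_id, var) %>%
  summarise(sum_or_avg = ifelse(type[1] == "percent",
                                weighted.mean(value, weighting),
                                sum(value)),
            .groups = "drop") %>%
  pivot_wider(names_from = var, values_from = sum_or_avg)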
I am having trouble preparing a summary table using dplyr, based on the data set below:
set.seed(1)
df <- data.frame(rep(sample(c(2012,2016),10, replace = T)),
sample(c('Treat','Control'),10,replace = T),
runif(10,0,1),
runif(10,0,1),
runif(10,0,1))
colnames(df) <- c('Year','Group','V1','V2','V3')
I want to calculate the mean, median, standard deviation and count the number of observations by each combination of Year and Group.
I have successfully used this code to get mean, median and sd:
summary.table = df %>%
  group_by(Year, Group) %>%
  summarise_all(funs(n(), sd, median, mean))
However, I do not know how to properly introduce the n() function inside the funs() call: it gave me a count for each of V1, V2 and V3. This is quite redundant, since I only want the sample size once. I have tried introducing
mutate(N = n()) %>%
before and after the group_by() line, but it did not give me what I wanted.
Any help?
EDIT: I had not made my question clear enough. The problem is that the code gives me count columns that I do not need, since the number of observations computed for V1 alone is sufficient for me.
Add the N column before summarizing as an extra grouping column:
library(dplyr)
set.seed(1)
df <- data.frame(Year = rep(sample(c(2012, 2016), 10, replace = TRUE)),
Group = sample(c('Treat', 'Control'), 10, replace = TRUE),
V1 = runif(10, 0, 1),
V2 = runif(10, 0, 1),
V3 = runif(10, 0, 1))
df2 <- df %>%
group_by(Year, Group) %>%
group_by(N = n(), add = TRUE) %>%
summarise_all(funs(sd, median, mean))
df2
#> # A tibble: 4 x 12
#> # Groups: Year, Group [?]
#> Year Group N V1_sd V2_sd V3_sd V1_median V2_median
#> <dbl> <fctr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2012 Control 2 0.05170954 0.29422635 0.1152669 0.3037848 0.6193239
#> 2 2012 Treat 2 0.51092899 0.08307494 0.1229560 0.5734239 0.5408230
#> 3 2016 Control 3 0.32043716 0.34402222 0.3822026 0.3823880 0.4935413
#> 4 2016 Treat 3 0.37759667 0.29566739 0.1233162 0.3861141 0.6684667
#> # ... with 4 more variables: V3_median <dbl>, V1_mean <dbl>,
#> # V2_mean <dbl>, V3_mean <dbl>
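In current dplyr (1.0+), funs() and the add argument are deprecated; a hedged sketch of the same idea using across() and .add = TRUE:
library(dplyr)
# across() instead of funs(), .add = TRUE instead of add = TRUE
df %>%
  group_by(Year, Group) %>%
  group_by(N = n(), .add = TRUE) %>%
  summarise(across(everything(), list(sd = sd, median = median, mean = mean)),
            .groups = "drop")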
Are you getting the same error I am:
“Error in n(): function should not be called directly”
If so, there's a stack question on that here that might help:
dplyr: "Error in n(): function should not be called directly"
The resolution seems to be detaching plyr, where there appears to be a conflict, and reloading the dplyr library.
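If that conflict is indeed the cause, a minimal sketch of the workaround:
# plyr loaded after dplyr masks summarise()/n(); detach it and reload dplyr
detach("package:plyr", unload = TRUE)
library(dplyr)
# or call the verbs explicitly, e.g. dplyr::summarise(), to avoid the masking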
I want to compute for all variables of a big data frame either the sum or the mean (or any other possible summary). This should be done, if possible, in only one pipe. As far as I know, you can use summarise() only in a way that the function for each variable is specified separately (e.g. summarise(., mean_var1 = mean(var1), sum_var2 = sum(var2), ...)). This would be far too much typing. On the other hand, I think summarise_each() can handle multiple columns, but it is not possible to say that I want the mean of column 1 and the sum of all other columns.
I'm looking for a way to combine the flexibility of summarise with the scale of summarise_each. Something like summarise(names(df)[1] = mean(.[ ,1]), names(df)[2:3] = sum(.[ ,2:3])). Is this possible with dplyr?
Some Toy data:
library(dplyr)
set.seed(1)
df <- data.frame(a = sample(0:1, 100, replace = TRUE),
b = rnorm(100),
c = rnorm(100))
The desired output:
df %>%
summarise(a = mean(a), b = sum(b), c = sum(c))
a b c
1 0.48 -1.757949 2.277879
We can do this a bit more easily in data.table
library(data.table)
setDT(df)[, c(a=mean(a), lapply(.SD, sum)), .SDcols = b:c]
# a b c
#1: 0.48 -1.757949 2.277879
One option with dplyr would be to get the mean of 'a' and then do the summarise_each
library(dplyr)
df %>%
  mutate(a = mean(a)) %>%
  group_by(a) %>%
  summarise_each(funs(sum))
# a b c
# <dbl> <dbl> <dbl>
#1 0.48 -1.757949 2.277879
Or combine with dmap
library(purrr)
dmap_at(df, "a", mean) %>%
  dmap_at(., names(.)[-1], sum) %>%
  distinct()
# a b c
#1 0.48 -1.757949 2.277879
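With current dplyr (1.0+), the same result can also be written in a single summarise() using across(); a sketch:
library(dplyr)
# mean for 'a', sum for every other column, in one summarise()
df %>%
  summarise(a = mean(a), across(-a, sum))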