Summarizing Multiple Columns of Data Using Pipes - r

I'm looking to report the min, max, and mean of certain columns (price, age, and dist)from the houses data set using pipes in a concise tibble. For now, I have the following code which produces a rather inelegant solution with a 1x9 tibble:
houses %>%
select(price, age, dist) %>%
summarize_each(list(min = min, max = max, mean = mean))
I was hoping to create a more organized solution using pipes with the selected data as rows and the summary stats (min, max, mean) as columns resulting in a 3x3 tibble. Any ideas?

You may first get the data in long format and then calculate summary statistics for each column. Here is an example with mtcars dataset.
library(dplyr)
library(tidyr)
mtcars %>%
select(mpg, disp, cyl) %>%
pivot_longer(cols = everything()) %>%
group_by(name) %>%
summarise(min = min(value, na.rm = TRUE),
max = max(value, na.rm = TRUE),
mean = mean(value, na.rm = TRUE))
# name min max mean
# <chr> <dbl> <dbl> <dbl>
#1 cyl 4 8 6.19
#2 disp 71.1 472 231.
#3 mpg 10.4 33.9 20.1

A possible solution to output a dataframe:
library(dplyr)
houses %>%
summarise(across(c(price,age,dist),c(max,min,mean))) %>%
matrix(ncol = 3, byrow = T) %>%
as.data.frame() %>%
rename(Max=V1, Min=V2, Mean=V3)
A possible solution to output a tibble:
library(dplyr)
houses %>%
summarise(across(c(price,age,dist),c(max,min,mean))) %>%
matrix(ncol = 3, byrow = T) %>%
tibble(Max=unlist(.[,1]),Min=unlist(.[,2]),Mean=unlist(.[,3])) %>%
select(Max,Min,Mean)
EDIT (2021-10-01)
A very short solution:
library(dplyr)
library(purrr)
map_dfc(c("Max","Min","Mean"),
~ tibble(!!sym(.x) := apply(select(houses, price, age, dist),2,tolower(.x))))

Related

Summarize information by group in data table in R

I'm trying to get multiple summary statistics in R grouped by Team. I used code like below, but output is not what I want.
please point me in a better direction. Thanks!
set.seed(77)
data <- data.frame(Team =sample(c("A","B"),30, replace=TRUE),
gender=sample(c("female","male"),30, replace=TRUE),
Age =sample(c(0:100),30, replace=T))
dat <- data %>%
group_by(Team, gender) %>%
dplyr::summarize_all(list(my_mean = mean,
my_sum = sum,
my_sd = sd)) %>%
as.data.frame()
df <- data %>%
group_by(Team) %>%
summarize(total = n(gender),
mean = mean(Age),
Max_Age = max(Age),
Min_Age = min(Age),
sd = sd(Age),
)
I want to get like this pic.
You may need to create the dataframe for the summary statistics of age per Team (age_summary in the example below) and that for the count of Team members per gender and Team (gender_summary in the example below), and then merge them into one dataframe (say summary_df).
library(tidyverse)
set.seed(77)
data <- data.frame(
Team = sample(c("A", "B"), 30, replace = TRUE),
gender = sample(c("female", "male"), 30, replace = TRUE),
Age = sample(c(0:100), 30, replace = T)
)
age_summary <- data %>%
group_by(Team) %>%
summarize(
mean = mean(Age),
Max = max(Age),
Min = min(Age),
sd = sd(Age)
) %>%
column_to_rownames("Team") %>%
t() %>%
as_tibble(
rownames = "age_summary"
)
gender_summary <- data %>%
group_by(Team) %>%
count(gender) %>%
ungroup() %>%
pivot_wider(names_from = Team, values_from = n)
summary_df <- full_join(
age_summary,
gender_summary
) %>%
mutate(
"item" = if_else(
is.na(gender),
"Age",
"Sex"
)
) %>%
unite("summary", c(age_summary, gender), na.rm = TRUE, remove = FALSE) %>%
relocate(item, .before = 1) %>%
select(-c(age_summary, gender))
# # A tibble: 6 × 4
# item summary A B
# <chr> <chr> <dbl> <dbl>
# 1 Age mean 45.6 57.8
# 2 Age Max 92 82
# 3 Age Min 5 14
# 4 Age sd 30.1 22.1
# 5 Sex female 8 9
# 6 Sex male 7 6

More efficient way of taking averages over multiple lists

I have some data where I use the rsample package to create rolling windows (I use the iris data set here). The rolling_iris dataset contains a number of lists.
I would like to compute the min, max, mean and sd of each of the lists. That is in split 1 compute the min across the first 4 columns etc. I originally do this by mapping over the splits and using pivot_longer to rearrange the data then computing the statistics, finally using pivot_wider to get the data back into the original form. This is quite slow.
library(dplyr)
library(purrr)
iris
rolling_iris <- rsample::rolling_origin(iris, initial = 10, assess = 1, cumulative = FALSE, skip = 0)
rolling_iris_statistics <- map(rolling_iris$splits, ~analysis(.x) %>%
pivot_longer(cols = 1:4) %>%
mutate(
min = min(value),
max = max(value),
mean = mean(value),
sd = sd(value)
) %>%
group_by(name) %>%
mutate(rowID = row_number()) %>%
pivot_wider(names_from = name, values_from = value)
)
I would like to map over each of the lists and compute the above statistics. Then once this is done scale the analysis by the following function.
Scale_Me <- function(x){
(x - min(x)) / (max(x) - min(x))
}
Additional:
rolling_iris_analysis <- map(rolling_iris$splits, ~analysis(.x))
rolling_iris_assessment <- map(rolling_iris$splits, ~assessment(.x))
EDIT:
I managed to compute the following (I am not sure if it is "faster")
analysis <- map(rolling_iris$splits, ~analysis(.x))
map(analysis, ~select(., c(1:4)) %>% as.matrix %>% mean())
The below code subsets into each sub data frame. So, rolling_iris_dfs is a list of data frames. Then, you can iterate over each data frame and compute statistics.
rolling_iris_dfs <- map(seq(1, length(rolling_iris[[1]])), ~rolling_iris[[1]][[.x]]$data)
rolling_iris_stats <- map(rolling_iris_dfs, ~analysis(.x) %>%
pivot_longer(cols = 1:4) %>%
mutate(
min = min(value),
max = max(value),
mean = mean(value),
sd = sd(value)
) %>%
group_by(name) %>%
mutate(rowID = row_number()) %>%
pivot_wider(names_from = name, values_from = value)
)

Extract data labels in R from attr and add as column to correspond to the variable/column name

I have a very large data set with variable names that are super abbreviated and it would help immensely if the label in the attr(*, "label") section was extracted and showed up in the column beside the corresponding variable.
label(mtcars[["mpg"]]) <- "Miles/(US) gallon"
label(mtcars[["hp"]]) <- "Gross horsepower"
label(mtcars[["wt"]]) <- "Weight (1000lbs)"
My current code just gets the mean/sd from the entire data set:
mtcars %>% select(mpg, hp, wt) %>% pivot_longer(everything()) %>% group_by(name) %>% summarise(mean=mean(value, na.rm = TRUE), sd=sd(value, na.rm=TRUE))
But I want a column with the label of the variables so it's easier to tell:
name mean sd label
hp 14.7. 68.6 Gross horsepower
mpg 20.1 6.03 Miles/(US) gallon
wt 3.22 0.978 Weight (1000lbs)
I found a thread that sort of gets to what I want, but if I add mutate(labels=label(mtcars)[name]) at the end of the code, I get a column with NA instead of the labels.
We can use imap
library(purrr)
library(dplyr)
library(Hmisc)
imap_dfr(mtcars[c('hp', 'mpg', 'wt')], ~
tibble(name = .y, mean = mean(.x[[1]]),
sd = sd(.x[[1]], na.rm = TRUE),
label = attr(.x, 'label')))
If we use the OP's method, we can also use summarise_all and then do the pivot_longer
library(tidyr)
mtcars %>%
dplyr::select(mpg, hp, wt) %>%
summarise_all(list(mean = ~mean(., na.rm = TRUE),
sd = ~sd(., na.rm = TRUE),
label = ~attr(., 'label'))) %>%
mutate(rn = 1) %>%
pivot_longer(cols = -rn, names_to = c('name', '.value'), names_sep="_") %>%
select(-rn)
# name mean sd label
#1 mpg 20.09062 6.0269481 Miles/(US) gallon
#2 hp 146.68750 68.5628685 Gross horsepower
#3 wt 3.21725 0.9784574 Weight (1000lbs)

reformatting dplyr summarise_at() output

I am using summarise_at() to obtain the mean and standard error of multiple variables by group.
The output has 1 row for each group, and 1 column for each calculated quantity, per group. I'd like to have a table with 1 row for each variable, and 1 column for each calculated quantity:
data <- mtcars
data$condition <- as.factor(c(rep("control", 16), rep("treat", 16)))
data %>%
group_by(condition) %>%
summarise_at(vars(mpg, cyl, wt),
funs(mean = mean, se=sd(.)/sqrt(n())))
# A tibble: 2 x 7
condition mpg_mean cyl_mean wt_mean mpg_se cyl_se wt_se
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 control 18.2 6.5 3.56 1.04 0.387 0.204
2 treat 22.0 5.88 2.87 1.77 0.499 0.257
Here's what I think would be more useful (the numbers are not meaningful):
# MEAN.control, MEAN.treat, SE.control, SE.treat
# mpg 1.5 2.4 .30 .45
# cyl 3.2 1.9 .20 .60
# disp 12.3 17.8 .20 .19
Any ideas? New to the tidyverse, so sorry if this is too basic.
The funs is getting deprecated in dplyr. Instead use list in summarise_at/mutate_at. After the summarise step, gather the data into 'long' format, separate the 'key' column into two by splitting at the delimiter _, then unite the 'cond' and 'key2' (after changing the case of 'key2'), spread it to 'wide' format and if needed, change the row names with the column 'key1'
library(tidyverse)
data %>%
group_by(condition) %>%
summarise_at(vars(mpg, cyl, wt), list(MEAN = ~ mean(.),
SE = ~sd(.)/sqrt(n()))) %>%
gather(key, val, -condition) %>%
separate(key, into = c("key1", "key2")) %>%
unite(cond, key2, condition, sep=".") %>%
spread(cond, val) %>%
column_to_rownames('key1')
# MEAN.control MEAN.treat SE.control SE.treat
#cyl 6.500000 5.875000 0.3872983 0.4989572
#mpg 18.200000 21.981250 1.0369024 1.7720332
#wt 3.560875 2.873625 0.2044885 0.2571034
A different possibility could be:
data %>%
group_by(condition) %>%
summarise_at(vars(mpg, cyl, wt), list(mean = ~ mean(.),
se = ~ sd(.)/sqrt(n()))) %>%
gather(var, val, -condition) %>%
separate(var, c("vars", "var2")) %>%
mutate(var2 = paste(toupper(var2), as.character(condition), sep = "_")) %>%
select(-condition) %>%
spread(var2, val)
vars MEAN_control MEAN_treat SE_control SE_treat
<chr> <dbl> <dbl> <dbl> <dbl>
1 cyl 6.5 5.88 0.387 0.499
2 mpg 18.2 22.0 1.04 1.77
3 wt 3.56 2.87 0.204 0.257
Here, after your initial steps, it performs a wide-to-long data transformation, excluding the "condition" column. Second, it separates the variable names into two columns. Third, it combines the metric and the condition, with the metric being upper case. Finally, it removes the redundant variable and returns it to the desired format.
Or you can avoid separate() by using some regex:
data %>%
group_by(condition) %>%
summarise_at(vars(mpg, cyl, wt), list(mean = ~ mean(.),
se = ~ sd(.)/sqrt(n()))) %>%
gather(var, val, -condition) %>%
mutate(vars = gsub("_.*$", "", var),
var2 = gsub(".*\\_", "", var)) %>%
mutate(var2 = paste(toupper(var2), as.character(condition), sep = "_")) %>%
select(-condition, -var) %>%
spread(var2, val)
Or with strsplit():
data %>%
group_by(condition) %>%
summarise_at(vars(mpg, cyl, wt), list(mean = ~ mean(.),
se = ~ sd(.)/sqrt(n()))) %>%
gather(var, val, -condition) %>%
mutate(vars = sapply(strsplit(var, "_"), function(x) x[1]),
var2 = sapply(strsplit(var, "_"), function(x) x[2])) %>%
mutate(var2 = paste(toupper(var2), as.character(condition), sep = "_")) %>%
select(-condition, -var) %>%
spread(var2, val)
Or you can completely rewrite it to:
data %>%
select(mpg, cyl, wt, condition) %>%
gather(vars, val, -condition) %>%
group_by(condition, vars) %>%
summarise(mean = mean(val),
se = sd(val)/sqrt(n())) %>%
ungroup() %>%
gather(var2, val, -c(condition, vars)) %>%
mutate(var2 = paste(toupper(var2), condition, sep = "_")) %>%
select(-condition) %>%
spread(var2, val)
In this case it, first, selects the variables of interest. Second, it performs a transformation from wide to long format, excluding the "condition" column. Third, it groups by conditions and variable names and calculates the metrics. In the forth step, it performs a second wide-to-long transformation, excluding the "condition" column and the column with initial variable names. Finally, it combines together the metric (upper case) and condition, removes the redundant variable and returns it to the desired format.

How to add totals as well as group_by statistics in R

When computing any statistic using summarise and group_by we only get the summary statistic per-category, and not the value for all the population (Total). How to get both?
I am looking for something clean and short. Until now I can only think of:
bind_rows(
iris %>% group_by(Species) %>% summarise(
"Mean" = mean(Sepal.Width),
"Median" = median(Sepal.Width),
"sd" = sd(Sepal.Width),
"p10" = quantile(Sepal.Width, probs = 0.1))
,
iris %>% summarise(
"Mean" = mean(Sepal.Width),
"Median" = median(Sepal.Width),
"sd" = sd(Sepal.Width),
"p10" = quantile(Sepal.Width, probs = 0.1)) %>%
mutate(Species = "Total")
)
But I would like something more compact. In particular, I don't want to type the code (for summarize) twice, once for each group and once for the total.
You can simplify it if you untangle what you're trying to do: you have iris data that has several species, and you want that summarized along with data for all species. You don't need to calculate those summary stats before you can bind. Instead, bind iris with a version of iris that's been set to Species = "Total", then group and summarize.
library(tidyverse)
bind_rows(
iris,
iris %>% mutate(Species = "Total")
) %>%
group_by(Species) %>%
summarise(Mean = mean(Sepal.Width),
Median = median(Sepal.Width),
sd = sd(Sepal.Width),
p10 = quantile(Sepal.Width, probs = 0.1))
#> # A tibble: 4 x 5
#> Species Mean Median sd p10
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa 3.43 3.4 0.379 3
#> 2 Total 3.06 3 0.436 2.5
#> 3 versicolor 2.77 2.8 0.314 2.3
#> 4 virginica 2.97 3 0.322 2.59
I like the caution in the comments above, though I have to do this sort of calculation for work enough that I have a similar shorthand function in a personal package. It perhaps makes less sense for things like standard deviations, but it's something I need to do a lot for adding up totals of demographic groups, etc. (If it's useful, that function is here).
bit shorter, though quite similar to bind_rows
q10 <- function(x){quantile(x , probs=0.1)}
iris %>%
select(Species,Sepal.Width)%>%
group_by(Species) %>%
summarise_all(c("mean", "sd", "q10")) %>%
t() %>%
cbind(c("total", iris %>% select(Sepal.Width) %>% summarise_all(c("mean", "sd", "q10")))) %>%
t()
more clean probably:
bind_rows(
iris %>%
group_by(Species) %>%
select(Sepal.Width)%>%
summarise_all(c("mean", "sd", "q10"))
,
iris %>%
select(Sepal.Width)%>%
summarise_all(c("mean", "sd", "q10")) %>%
mutate(Species = "Total")
)

Resources