I have a data frame with three numeric variables. I need to calculate some summary statistics for these variables, such as the mean, median, standard deviation and kurtosis, and then arrange the results in a data frame: the first column should contain the variable names, the second column the means, the third column the medians, and so on. How can this be achieved? I am familiar with the dplyr package, so any suggestions along those lines are welcome.
You can use summarise with across:
library(dplyr)
library(tidyr)
mtcars %>%
select(1:3) %>%
summarise(across(where(is.numeric), list(mean = mean, std = sd, med = median)))
# mpg_mean mpg_std mpg_med cyl_mean cyl_std cyl_med disp_mean disp_std disp_med
#1 20.09062 6.026948 19.2 6.1875 1.785922 6 230.7219 123.9387 196.3
In older versions of dplyr (before 1.0.0, which introduced across), you can use summarise_if:
mtcars %>%
select(1:3) %>%
summarise_if(is.numeric, list(mean = mean, std = sd, med = median))
You can add pivot_longer to the answer above to get the data in the required format. Note that names_sep = '_' assumes the original column names contain no underscores.
mtcars %>%
select(1:3) %>%
summarise(across(where(is.numeric),list(mean=mean,std=sd,med = median))) %>%
pivot_longer(cols = everything(),
names_to = c('col', '.value'),
names_sep = '_')
# A tibble: 3 x 4
# col mean std med
# <chr> <dbl> <dbl> <dbl>
#1 mpg 20.1 6.03 19.2
#2 cyl 6.19 1.79 6
#3 disp 231. 124. 196.
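If the column names themselves contain underscores, splitting on '_' no longer works cleanly. One possible workaround, sketched here on a small made-up data frame (the columns engine_size and fuel_use are hypothetical), is to have across() name its output with a double underscore via .names and to split on that instead:

```r
library(dplyr)
library(tidyr)

# Made-up data: both column names contain an underscore
df <- tibble(engine_size = c(1.6, 2.0, 3.5),
             fuel_use    = c(6.1, 7.4, 10.2))

res <- df %>%
  # Name the summary columns "<column>__<function>" instead of "<column>_<function>"
  summarise(across(where(is.numeric),
                   list(mean = mean, std = sd, med = median),
                   .names = "{.col}__{.fn}")) %>%
  # Split on the double underscore, so single underscores stay in the name
  pivot_longer(cols = everything(),
               names_to = c("col", ".value"),
               names_sep = "__")

res
```

Any separator that cannot occur in the column names would do; the double underscore is only a convention chosen for this sketch.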
Or you can pivot first and then do the calculation:
mtcars %>%
select(1:3) %>%
pivot_longer(cols = everything()) %>%
group_by(name) %>%
summarise(mean = mean(value), std = sd(value), med = median(value))
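The question also asks for kurtosis, which neither base R nor dplyr provides. A minimal sketch, assuming the plain moment-based definition (fourth central moment divided by the squared second central moment, not excess kurtosis) is what is wanted; kurt_moment is a hypothetical helper written for this example, not a function from any package:

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: moment-based kurtosis (a normal distribution gives about 3)
kurt_moment <- function(x, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^4) / mean((x - m)^2)^2
}

res <- mtcars %>%
  select(1:3) %>%
  pivot_longer(cols = everything(), names_to = "variable") %>%
  group_by(variable) %>%
  summarise(mean     = mean(value),
            median   = median(value),
            std      = sd(value),
            kurtosis = kurt_moment(value))

res
```

Packages such as moments or e1071 ship their own kurtosis functions with slightly different conventions, so it is worth checking which definition is needed before comparing numbers.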
I have three workflows to get Mean, Standard Deviation, and Variance. Would it be possible to simplify this by creating one function with one table with all the summaries as the result?
Mean
iris %>%
select(-Species) %>%
summarise_all(., mean, na.rm = TRUE) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("Name") %>%
rename(Mean = V1)
Standard Deviation
iris %>%
select(-Species) %>%
summarise_all(., sd, na.rm = TRUE) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("Name") %>%
rename(SD = V1)
Variance
iris %>%
select(-Species) %>%
summarise_all(., var, na.rm = TRUE) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("Name") %>%
rename(Variance = V1)
We could reshape to 'long' format and then do a group-by operation to create the three summary columns:
library(dplyr)
library(tidyr)
iris %>%
select(where(is.numeric)) %>%
pivot_longer(cols = everything(), names_to = "Name") %>%
group_by(Name) %>%
summarise(Mean = mean(value, na.rm = TRUE),
SD = sd(value, na.rm = TRUE),
Variance = var(value, na.rm = TRUE))
Output:
# A tibble: 4 × 4
Name Mean SD Variance
<chr> <dbl> <dbl> <dbl>
1 Petal.Length 3.76 1.77 3.12
2 Petal.Width 1.20 0.762 0.581
3 Sepal.Length 5.84 0.828 0.686
4 Sepal.Width 3.06 0.436 0.190
iris %>%
select(-Species) %>%
summarise_all(list(mean = mean,sd = sd, var = var), na.rm = TRUE)%>%
pivot_longer(everything(), names_sep = '_', names_to = c('Name','.value'))
# A tibble: 4 x 4
Name mean sd var
<chr> <dbl> <dbl> <dbl>
1 Sepal.Length 5.84 0.828 0.686
2 Sepal.Width 3.06 0.436 0.190
3 Petal.Length 3.76 1.77 3.12
4 Petal.Width 1.20 0.762 0.581
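Since the question asks for a single function, the long-format approach can be wrapped up as follows. This is only a sketch: summarise_numeric is a hypothetical name chosen here, and it assumes every numeric column should be summarised.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: one row per numeric column, one column per statistic
summarise_numeric <- function(data, na.rm = TRUE) {
  data %>%
    select(where(is.numeric)) %>%
    pivot_longer(cols = everything(), names_to = "Name") %>%
    group_by(Name) %>%
    summarise(Mean     = mean(value, na.rm = na.rm),
              SD       = sd(value, na.rm = na.rm),
              Variance = var(value, na.rm = na.rm))
}

res <- summarise_numeric(iris)
res
```

The same call should work on any data frame with numeric columns, for example summarise_numeric(mtcars).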
Given a vector of names of numeric variables in a dataframe, I need to calculate mean and sd for each variable. For example, given the mtcars dataset and the following vector of variable names:
vars_to_transform <- c("mpg", "disp")
I'd like to have the following as result:
The first solution that came into my mind is the following:
library(dplyr)
library(purrr)
data("mtcars")
vars_to_transform <- c("mpg", "disp")
vars_to_transform %>%
map_dfr( function(x) { c(variable = x, avg = mean(mtcars[[x]], na.rm = T), sd = sd(mtcars[[x]], na.rm = T)) } )
The result is the following:
As you can see, all the returned variables are characters, but I expected to have numbers for avg and sd.
Is there a way to fix this? Or is there any better solution than this?
P.S.
I'm using purrr 0.3.4
Seems like an overcomplicated way of doing select->pivot->group->summarise.
mtcars %>%
select(all_of(vars_to_transform)) %>%
pivot_longer(everything()) %>%
group_by(name) %>%
summarise(
mean = mean(value),
sd = sd(value)
)
# A tibble: 2 x 3
name mean sd
<chr> <dbl> <dbl>
1 disp 231. 124.
2 mpg 20.1 6.03
The following works (instead of using c() in your code, use tibble):
vars_to_transform %>%
map_dfr(~ tibble(variable = .x, avg = mean(mtcars[[.x]], na.rm = T),
sd = sd(mtcars[[.x]], na.rm = T)))
Explanation: c() builds an atomic vector, whose elements must all have the same type, so the numeric results are coerced to character (because variable is a character string). A tibble can hold a different type in each column.
As @Gwang-Jin Kim suggests in a comment below (thank you), one could also have used list instead of tibble.
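The coercion can be seen directly in base R. The values below are just rounded illustrations, not computed results:

```r
# c() must pick one type for all elements, so the numbers become strings
x <- c(variable = "mpg", avg = 20.09, sd = 6.03)
class(x)
is.character(x[["avg"]])

# A list keeps each element's own type
y <- list(variable = "mpg", avg = 20.09, sd = 6.03)
class(y$variable)
is.numeric(y$avg)
```

This is why returning a list or a tibble from the mapped function preserves the numeric columns, while returning c(...) does not.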
Or try adding type.convert:
library(dplyr)
library(purrr)
data("mtcars")
vars_to_transform <- c("mpg", "disp")
vars_to_transform %>%
map_dfr( function(x) { c(variable = x, avg = mean(mtcars[[x]], na.rm = T), sd = sd(mtcars[[x]], na.rm = T)) } ) %>%
type.convert(as.is=T)
#> # A tibble: 2 × 3
#> variable avg sd
#> <chr> <dbl> <dbl>
#> 1 mpg 20.1 6.03
#> 2 disp 231. 124.
Another option:
library(purrr)
library(dplyr)
vars_to_transform <- c("mpg", "disp")
funs <- lst(mean, sd)
mtcars %>%
select(all_of(vars_to_transform)) %>%
map_df(~ funs %>%
map(exec, .x), .id = "var")
# A tibble: 2 x 3
var mean sd
<chr> <dbl> <dbl>
1 mpg 20.1 6.03
2 disp 231. 124.
m <- mtcars[, vars_to_transform]
tibble(variable = names(m), avg = apply(m, 2, mean), sd = apply(m, 2, sd))
## A tibble: 2 × 3
# variable avg sd
# <chr> <dbl> <dbl>
#1 mpg 20.1 6.03
#2 disp 231. 124.
I try to generate yearwise summary statistics as follows:
data %>%
group_by(year) %>%
summarise(mean.abc = mean(abc), mean.def = mean(def), sd.abc = sd(abc), sd.def = sd(def))
This code returns a single row filled with NA in the respective columns:
mean.abc mean.def sd.abc sd.def
1 NA NA NA NA
So, I tried to work this out and replicated some examples
data(mtcars)
mtcars %>%
group_by(cyl) %>%
summarise(mean = mean(disp))
And this script returns
mean
1 230.7219
So, what am I doing wrong? I am loading the following packages:
loadpackage( c("foreign","haven", "tidyverse", "plyr", "stringr", "eeptools", "factoextra") )
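Thanks for your support!
Your issue is that summarise from the plyr package does not do what you expect: it ignores the grouping created by group_by and returns a single row. Because plyr appears to be loaded after tidyverse (and therefore after dplyr), the bare name summarise resolves to plyr::summarise instead of dplyr::summarise.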
See the difference between:
library(tidyverse)
mtcars %>%
group_by(cyl) %>%
plyr::summarise(mean = mean(disp))
#> mean
#> 1 230.7219
and
mtcars %>%
group_by(cyl) %>%
dplyr::summarise(mean = mean(disp))
#> # A tibble: 3 x 2
#> cyl mean
#> <dbl> <dbl>
#> 1 4 105.
#> 2 6 183.
#> 3 8 353.
Since your data seems to have missing values, this should do the trick:
data %>%
group_by(year) %>%
dplyr::summarise(across(all_of(c('abc', 'def')),
.fns = list(mean = ~mean(.,na.rm=T),
sd = ~sd(.,na.rm=T))))
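To check which package a bare summarise call resolves to, and to see how na.rm changes the result, something like the following can help. The data frame is made up for illustration (the columns year and abc mirror the names in the question, and the values are invented):

```r
library(dplyr)

# Which package does the bare name `summarise` resolve to right now?
# If plyr were attached after dplyr, this would report plyr instead.
environmentName(environment(summarise))

# Made-up data with one missing value
data <- tibble(year = c(2000, 2000, 2001, 2001),
               abc  = c(1, NA, 3, 5))

res <- data %>%
  group_by(year) %>%
  dplyr::summarise(mean_plain = mean(abc),               # NA if any value is missing
                   mean_narm  = mean(abc, na.rm = TRUE)) # missing values dropped

res
```

Calling dplyr::summarise explicitly sidesteps the masking problem regardless of the order in which packages were loaded.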
I want to use summarize_all on the following data and create my desired output, but I was curious how to do this the tidy way using some combination of mutate and summarize I think? Any help appreciated!!
dummy <- tibble(
a = 1:10,
b = 100:109,
c = 1000:1009
)
Desired Output
tibble(
Mean = colMeans(dummy[1:3]),
Variance = colVars(as.matrix(dummy[1:3])),
CV = Variance/Mean
)
Mean Variance CV
<dbl> <dbl> <dbl>
1 5.5 9.17 1.67
2 104. 9.17 0.0877
3 1004. 9.17 0.00913
It would be easier to reshape to 'long' format and then do it once after grouping by 'name'
library(dplyr)
library(tidyr)
pivot_longer(dummy, cols = everything()) %>%
group_by(name) %>%
summarise(Mean = mean(value), Variance = var(value), CV = Variance/Mean) %>%
select(-name)
# A tibble: 3 x 3
# Mean Variance CV
# <dbl> <dbl> <dbl>
#1 5.5 9.17 1.67
#2 104. 9.17 0.0877
#3 1004. 9.17 0.00913
Alternatively, use summarise_all or summarise with across; the output is then a single row, which needs reshaping afterwards:
dummy %>%
summarise(across(everything(), list(Mean = mean,
Variance = var, CV = ~ var(.)/mean(.)))) %>%
pivot_longer(everything()) %>%
separate(name, into = c('name', 'name2')) %>%
pivot_wider(names_from = name2, values_from = value)
I would like to do this:
data %>%
group_by(ID) %>%
summarize(maxVal = max(Val),
maxVal2 = max(Val2))
However, I have a lot of columns I would like to get the Max of. I would like to pass in a vector of columns like this:
cols <- c("Val", "Val2")
data %>%
group_by(ID) %>%
summarize(max(cols))
However, this does not work. How do I fix the syntax to do this easily?
If we want a prefix on the names after summarising multiple columns, use rename_at:
library(tidyverse)
data %>%
group_by(ID) %>%
summarise_at(vars(all_of(cols)), max) %>%
rename_at(-1, ~ paste0('max', .))
As a reproducible example, using the mtcars data:
mtcars %>%
group_by(gear) %>%
summarise_at(vars(mpg, disp), max) %>%
rename_at(-1, ~ paste0('max', .))
# A tibble: 3 x 3
# gear maxmpg maxdisp
# <dbl> <dbl> <dbl>
#1 3 21.5 472
#2 4 33.9 168.
#3 5 30.4 351
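In dplyr 1.0.0 and later, the scoped verbs summarise_at and rename_at are superseded by across. A sketch of the same result with a character vector of column names, using the .names argument to add the prefix in one step (the vector cols here holds mtcars column names for the sake of a runnable example):

```r
library(dplyr)

cols <- c("mpg", "disp")

res <- mtcars %>%
  group_by(gear) %>%
  # all_of() makes it explicit that `cols` is an external character vector
  summarise(across(all_of(cols), max, .names = "max{.col}"))

res
```

The .names template accepts {.col} and {.fn}, so the prefix could just as well be a suffix, for example "{.col}_max".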