As part of my exploratory work I have built a function that provides a variety of metrics for each field in my dataset. I want to apply it to each column of my dataset.
library(tidyverse)
mtcars %>% summarise_all(., .funs = funs(mean, median, sd, max, min, n_distinct))
However, this results in a dataset with 1 row and one column for each function/column combination. The names are also concatenated like 'column_function'.
DESIRED result would be a 'tidy' format like:
ORIGINAL_COLUMN_NAME | FUNCTION | RESULT
I'm guessing there has to be an easy way to do this?
Here is one option.
library(tidyverse)
mtcars %>%
gather(Original_Column, Value) %>%
group_by(Original_Column) %>%
summarise_all(., .funs = funs(mean, median, sd, max, min, n_distinct)) %>%
gather(Function, Result, -Original_Column)
# # A tibble: 66 x 3
# Original_Column Function Result
# <chr> <chr> <dbl>
# 1 am mean 0.406
# 2 carb mean 2.81
# 3 cyl mean 6.19
# 4 disp mean 231.
# 5 drat mean 3.60
# 6 gear mean 3.69
# 7 hp mean 147.
# 8 mpg mean 20.1
# 9 qsec mean 17.8
# 10 vs mean 0.438
# # ... with 56 more rows
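The funs() helper used above has since been deprecated in dplyr. Below is a minimal sketch of the same reshaping with pivot_longer() and across(), assuming dplyr 1.0 or later and tidyr 1.0 or later. The name tidy_summary is only for illustration.

```r
library(dplyr)
library(tidyr)

tidy_summary <- mtcars %>%
  # Stack every column into (Original_Column, Value) pairs
  pivot_longer(everything(), names_to = "Original_Column", values_to = "Value") %>%
  group_by(Original_Column) %>%
  # Apply each function to Value; .names = "{.fn}" names each result after its function
  summarise(
    across(
      Value,
      list(mean = mean, median = median, sd = sd,
           max = max, min = min, n_distinct = n_distinct),
      .names = "{.fn}"
    ),
    .groups = "drop"
  ) %>%
  # Stack the function columns into (Function, Result) pairs
  pivot_longer(-Original_Column, names_to = "Function", values_to = "Result")

tidy_summary
```

With the 11 columns of mtcars and 6 functions this should again give 66 rows, one per column/function pair.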
With dplyr and R, you can use group_by and summarize to aggregate data.
For instance:
mpg_cyl_carb <- mtcars %>%
group_by(cyl, carb) %>%
summarise(var1 = mean(mpg))
head(mpg_cyl_carb, 3)
# A tibble: 3 x 3
# Groups: cyl [2]
cyl carb var1
<dbl> <dbl> <dbl>
1 4 1 27.6
2 4 2 25.9
3 6 1 19.8
This means that when cyl = 4 and carb = 1, the mean of mpg is 27.6. When cyl = 6 and carb = 1, the mean is 19.8, and so on.
I would like to add those aggregate results to the original data frame. Currently I am joining two tables to do this:
> mtcars %>%
+ left_join(mpg_cyl_carb, by = c("cyl", "carb")) %>%
+ head(3) %>%
+ select(mpg, cyl, carb, var1)
mpg cyl carb var1
1 21.0 6 4 19.75
2 21.0 6 4 19.75
3 22.8 4 1 27.58
But is there an easier way? A single command for mutate, like:
> mtcars %>%
+ mutate(. . .)
Not a solution using if_else, as it would add complexity.
Use group_by before mutate to create the mean column by group, instead of creating a summarised dataset and then joining it to the original data.
library(dplyr)
mtcars %>%
group_by(cyl, carb) %>%
mutate(var1 = mean(mpg)) %>%
ungroup %>%
head
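For comparison, the same per-group mean can be attached without dplyr. This is a small sketch using base R's ave(), which returns one value per row of the input. The copy named cars is only there so mtcars itself is left unchanged.

```r
# Work on a copy so the built-in dataset is not modified
cars <- mtcars

# ave() computes FUN within each cyl/carb combination and
# returns a vector aligned with the original rows
cars$var1 <- ave(cars$mpg, cars$cyl, cars$carb, FUN = mean)

head(cars[, c("mpg", "cyl", "carb", "var1")], 3)
```

The first three rows should match the left_join result shown in the question (19.75, 19.75, and about 27.58).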
Bear with me... I am using R/RStudio with the mtcars data and dplyr's mutate and summarise commands. I also tried group_by.
I want to center the values of mtcars$mpg, then take that result and display a summary of the centered mtcars$mpg by number of cylinders.
So far...
mtcars %>% mutate(centered_mpg = mpg - mean(mpg, na.rm = TRUE)) %>% summarise(centered_mpg, cyl)
The above produces:
centered_mpg
cyl
0.909375
6
0.909375
6
2.709375
4
1.309375
6
...
...
INSTEAD, I WANT:
centered_mpg
cyl
x1
4
x2
6
x3
8
Are you looking for this?
with(mtcars, aggregate(list(centered_mpg=scale(mpg, scale=FALSE)), list(cyl=cyl), mean))
# cyl centered_mpg
# 1 4 6.5730114
# 2 6 -0.3477679
# 3 8 -4.9906250
It looks like you want to center each individual car's mpg by subtracting the global mean(mpg). This gives a centered_mpg for every car - and the code you have looks fine for this.
Then you want to calculate some sort of "summary" of the centered mpg values by cylinder group, so we need to group_by(cyl) and then define whatever summary function you want - here I use mean() but you can use median, sum, or whatever else you'd like.
mtcars %>%
mutate(centered_mpg = mpg - mean(mpg, na.rm = TRUE)) %>%
group_by(cyl) %>%
summarise(mean_centered_mpg = mean(centered_mpg))
# # A tibble: 3 x 2
# cyl mean_centered_mpg
# <dbl> <dbl>
# 1 4 6.57
# 2 6 -0.348
# 3 8 -4.99
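As a quick sanity check on those numbers: because the same global mean is subtracted from every car, the mean of the centered values in each group is just the group mean minus the global mean. A base R sketch (the variable names are illustrative):

```r
# Mean mpg within each cylinder group
group_means <- tapply(mtcars$mpg, mtcars$cyl, mean)

# Subtracting the global mean reproduces the mean of the centered values
centered_group_means <- group_means - mean(mtcars$mpg)

round(centered_group_means, 3)
```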
I am trying to create a vector of the correlations between variables for each cross-sectional unit. My dplyr approach returns an error saying the variables need to be numeric, and I don't know how to solve it.
What I need to end up with is a data frame that contains the correlation between the crmrte variable and all other explanatory variables, but at the cross-sectional level.
I need to generalize the code below:
cors <- crime %>%
group_by(county) %>%
summarize(cor = cor(crmrte, prbarr))
Update:
As suggested by Sotos, generalizing code above to be automatic I did this:
cors <- crime %>%
group_by(county) %>%
summarise_at(vars(names(crime)[4:ncol(crime)]), funs(cor(crmrte, .)))
But I am not sure if this is the right approach.
You can use summarise_at along with vars(), which automatically quotes the names:
library(dplyr)
mtcars %>%
group_by(cyl) %>%
summarise_at(vars(names(mtcars)[6:10]), funs(cor(mpg, .)))
which gives,
# A tibble: 3 x 6
cyl wt qsec vs am gear
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 4 -0.713 -0.236 0.0488 0.536 0.339
2 6 -0.682 -0.419 -0.530 0.530 -0.00949
3 8 -0.650 -0.104 NA 0.0496 0.0496
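In current dplyr, funs() and the summarise_at() family are superseded. A sketch of the same per-group correlation with across(), assuming dplyr 1.0 or later, might look like this. The range wt:gear selects the same columns as names(mtcars)[6:10].

```r
library(dplyr)

cors <- mtcars %>%
  group_by(cyl) %>%
  # Inside across(), mpg refers to the current group's mpg values
  # and .x to the column being summarised
  summarise(across(wt:gear, ~ cor(mpg, .x)))

cors
```

The NA for vs in the 8-cylinder group is expected: vs is constant (0) for every 8-cylinder car, so its standard deviation is zero and cor() returns NA with a warning.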
Let's use mtcars to explain the situation.
What I want to do is the same as below, but for multiple columns: get the mean of one column (qsec in the example) for the rows where another column has a specific value (4 and 6 in the example below). I'll compare the results later, so I may want to store them in a vector.
table(mtcars$cyl)
4 6 8
11 7 14
mean(mtcars$qsec[mtcars$cyl == 4], na.rm = T)
mean(mtcars$qsec[mtcars$gear == 4], na.rm = T)
I would like to check the means of qsec by cyl and, say, gear and carb, using the same pattern for each: the mean of the observations with value 4 and the mean of the observations with value 6. In the real dataset there are several columns that share the same set of values (2, 0 and 1), and I'll compare the means of one column (qsec in the example) between the observations with 2 and those with 0.
I've looked at functions like tapply, apply, and sapply, but I'm stuck on applying the conditional mean to every column at once.
Hope I made myself clear.
Thank you!
The function you are looking for is aggregate:
aggregate(. ~ cyl, FUN=mean, data=mtcars[,c("cyl", "qsec", "gear", "carb")],
subset=cyl %in% c(4, 6)
)
cyl qsec gear carb
1 4 19.13727 4.090909 1.545455
2 6 17.97714 3.857143 3.428571
In the function above data= is the data.frame. Here we only selected the wanted columns. And the subset= specifies which rows of the data to keep (in this case only cyl 4 and 6).
The formula . ~ cyl instructs to summarise all columns according to the cyl column.
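A data.table solution:
library(data.table)
# mtcars is a plain data.frame, so convert it before using data.table syntax
dt <- as.data.table(mtcars)
dt[cyl %in% c(4, 6), .(mn_qsec = mean(qsec),
                       mn_gear = mean(gear),
                       mn_carb = mean(carb)),
   by = cyl]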
What I understand you're looking for is the mean of qsec for each level of cyl, gear, and carb separately, not in combination. This code gets you that, but doesn't directly let you select specific levels of those factors. If you need to be able to do that second part, I think you should be able to tweak this to get there, but I'm not sure how...
apply(mtcars[,c("cyl","gear","carb")], 2, function(x) {
aggregate(mtcars[,"qsec"],list(x),mean)
})
Output:
$cyl
Group.1 x
1 4 19.13727
2 6 17.97714
3 8 16.77214
$gear
Group.1 x
1 3 17.692
2 4 18.965
3 5 15.640
$carb
Group.1 x
1 1 19.50714
2 2 18.18600
3 3 17.66667
4 4 16.96500
5 6 15.50000
6 8 14.60000
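One possible way to restrict this to specific levels is sketched below, under the assumption that the wanted levels (4 and 6 here) are the same for every grouping column. The names group_cols, levels_wanted, and means are illustrative.

```r
group_cols <- c("cyl", "gear", "carb")
levels_wanted <- c(4, 6)

# For each grouping column, take the mean of qsec over the rows
# where that column equals each wanted level
means <- sapply(group_cols, function(col) {
  sapply(levels_wanted, function(lv) {
    mean(mtcars$qsec[mtcars[[col]] == lv], na.rm = TRUE)
  })
})
rownames(means) <- levels_wanted

means
```

The result is a matrix with one row per level and one column per grouping variable, so means["4", "cyl"] is the mean of qsec for 4-cylinder cars. A level that never occurs in a column (gear has no value 6) comes back as NaN rather than raising an error.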
One option is to use dplyr::summarise_at, as the OP wants to apply the same function to multiple columns. The solution would be:
library(dplyr)
mtcars %>%
group_by(cyl) %>%
summarise_at(vars(c("qsec", "gear", "carb")), funs(mean), na.rm = TRUE) %>%
filter(cyl!=8)
# # A tibble: 2 x 4
# cyl qsec gear carb
# <dbl> <dbl> <dbl> <dbl>
# 1 4.00 19.1 4.09 1.55
# 2 6.00 18.0 3.86 3.43
What is the efficient/preferred way to do group mean centering with dplyr, that is, to take each element of a group (mutate) and perform an operation on it together with a summary statistic (summarize) for that group? Here's how one might do group mean centering on mtcars using base R:
do.call(rbind, lapply(split(mtcars, mtcars$cyl), function(x){
x[["cent"]] <- x$mpg - mean(x$mpg)
x
}))
You can try
library(dplyr)
mtcars %>%
add_rownames()%>% #if the rownames are needed as a column
group_by(cyl) %>%
mutate(cent= mpg-mean(mpg))
It appears that the above code uses the global mean to center mpg. What should I do if I want to center at the within-group mean, i.e. where the mean differs for each level of cyl?
> mtcars %>%
+ add_rownames()%>% #if the rownames are needed as a column
+ group_by(cyl) %>%
+ mutate(cent= mpg-mean(mpg))%>%
+ dplyr ::select(cent)
Adding missing grouping variables: `cyl`
# A tibble: 32 x 2
# Groups: cyl [3]
cyl cent
<dbl> <dbl>
1 6 0.909
2 6 0.909
3 4 2.71
4 6 1.31
5 8 -1.39
6 6 -1.99
7 8 -5.79
8 4 4.31
9 4 2.71
10 6 -0.891
# … with 22 more rows
Warning message:
Deprecated, use tibble::rownames_to_column() instead.
> mtcars$mpg[1:5]-mean(mtcars$mpg)
[1] 0.909375 0.909375 2.709375 1.309375 -1.390625
You can try this instead (although the name of the new variable displayed is different):
mtcars %>%
group_by(cyl) %>%
mutate(gpcent = scale(mpg, scale = F))
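If group_by() followed by mutate() still appears to center on the global mean, one common cause (an assumption here, since the session details are not shown) is that another package loaded after dplyr, such as plyr, masks mutate() with a version that ignores grouping. A sketch that calls dplyr's version explicitly and then checks the result:

```r
library(dplyr)

centered <- mtcars %>%
  group_by(cyl) %>%
  # Namespace-qualified so a masking mutate() cannot be picked up
  dplyr::mutate(cent = mpg - mean(mpg)) %>%
  ungroup()

# With within-group centering, the mean of cent is (numerically) zero in every group
centered %>%
  group_by(cyl) %>%
  dplyr::summarise(mean_cent = mean(cent))
```

With within-group centering the first row of cent is about 1.26 (21.0 minus the 6-cylinder mean), not the 0.909 obtained from the global mean.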