Let's use mtcars to explain the situation.
What I want is to do the following for multiple columns at once: compute the mean of one column (qsec in the example) over the rows where another column takes a specific value (4 or 6 in the example below). I'll compare the results later, so I would probably store them in a vector.
table(mtcars$cyl)
4 6 8
11 7 14
mean(mtcars$qsec[mtcars$cyl == 4], na.rm = TRUE)
mean(mtcars$qsec[mtcars$gear == 4], na.rm = TRUE)
I would like to get the means of qsec grouped by cyl and, say, gear and carb, using the same "pattern" each time, i.e. the mean over observations with value 4 and the mean over observations with value 6. In the real dataset there are several columns that share the same set of values (2, 0 and 1), and I'll compare the means of one column (qsec in the example) between observations with 2 and observations with 0.
I've looked at functions like tapply, apply and sapply, but I'm stuck on applying the conditional mean to every column at once.
Hope I made myself clear.
Thank you!
The function you are looking for is aggregate:
aggregate(. ~ cyl, FUN=mean, data=mtcars[,c("cyl", "qsec", "gear", "carb")],
subset=cyl %in% c(4, 6)
)
cyl qsec gear carb
1 4 19.13727 4.090909 1.545455
2 6 17.97714 3.857143 3.428571
In the call above, data= is the data frame, restricted here to the columns we want, and subset= specifies which rows of the data to keep (in this case only cyl 4 and 6).
The formula . ~ cyl tells aggregate to summarise all the other columns by the cyl column.
A data.table solution:
library(data.table)
# mtcars is a plain data.frame, so convert it to a data.table first
dt <- as.data.table(mtcars)
dt[cyl %in% c(4, 6), .(mn_qsec = mean(qsec),
mn_gear = mean(gear),
mn_carb = mean(carb)),
by = cyl]
What I understand you're looking for is the mean of qsec for each level of cyl, gear, and carb separately, not in combination. This code gets you that, but doesn't directly let you select specific levels of those factors. If you need to be able to do that second part, I think you should be able to tweak this to get there, but I'm not sure how...
apply(mtcars[,c("cyl","gear","carb")], 2, function(x) {
aggregate(mtcars[,"qsec"],list(x),mean)
})
Output:
$cyl
Group.1 x
1 4 19.13727
2 6 17.97714
3 8 16.77214
$gear
Group.1 x
1 3 17.692
2 4 18.965
3 5 15.640
$carb
Group.1 x
1 1 19.50714
2 2 18.18600
3 3 17.66667
4 4 16.96500
5 6 15.50000
6 8 14.60000
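To restrict the result to specific levels as well, one possible sketch in base R is to loop over the grouping columns and, within each, over the levels of interest. The names group_cols and levels_wanted are illustrative and not part of the answer above. A level that does not occur in a column (gear has no 6) comes back as NaN.

```r
# Columns to group by and the levels to compare (illustrative names)
group_cols <- c("cyl", "gear", "carb")
levels_wanted <- c(4, 6)

# For each grouping column, take the mean of qsec at each wanted level
means <- sapply(mtcars[group_cols], function(g) {
  sapply(levels_wanted, function(v) mean(mtcars$qsec[g == v], na.rm = TRUE))
})
rownames(means) <- levels_wanted

# Rows are the levels (4, 6); columns are cyl, gear, carb
means
```

The result is a plain numeric matrix, so a single value can be pulled out with, for example, means["4", "cyl"].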
One option is to use dplyr::summarise_at, since the OP wants to apply the same function to multiple columns. The solution is:
library(dplyr)
mtcars %>%
group_by(cyl) %>%
summarise_at(vars(c("qsec", "gear", "carb")), funs(mean), na.rm = TRUE) %>%
filter(cyl!=8)
# # A tibble: 2 x 4
# cyl qsec gear carb
# <dbl> <dbl> <dbl> <dbl>
# 1 4.00 19.1 4.09 1.55
# 2 6.00 18.0 3.86 3.43
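For reference, the same summary can be sketched with across(), assuming dplyr 1.0.0 or later is installed. This is an alternative spelling, not the code the answer above was run with. Filtering before grouping also avoids computing the cyl == 8 group at all.

```r
library(dplyr)

res <- mtcars %>%
  filter(cyl %in% c(4, 6)) %>%          # keep only the levels of interest
  group_by(cyl) %>%
  summarise(across(c(qsec, gear, carb), ~ mean(.x, na.rm = TRUE)))

res
```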
Bear with me... I am using R/RStudio with the mtcars data and dplyr's mutate and summarise commands. I also tried group_by.
I want to center the values of mtcars$mpg, then display a summary of the centered mpg by number of cylinders.
So far...
mtcars %>% mutate(centered_mpg = mpg - mean(mpg, na.rm = TRUE)) %>% summarise(centered_mpg, cyl)
The above produces:
centered_mpg cyl
0.909375 6
0.909375 6
2.709375 4
1.309375 6
... ...
INSTEAD, I WANT:
centered_mpg cyl
x1 4
x2 6
x3 8
Are you looking for this?
with(mtcars, aggregate(list(centered_mpg=scale(mpg, scale=FALSE)), list(cyl=cyl), mean))
# cyl centered_mpg
# 1 4 6.5730114
# 2 6 -0.3477679
# 3 8 -4.9906250
It looks like you want to center each individual car's mpg by subtracting the global mean(mpg). This gives a centered_mpg for every car - and the code you have looks fine for this.
Then you want to calculate some sort of "summary" of the centered mpg values by cylinder group, so we need to group_by(cyl) and then define whatever summary function you want - here I use mean() but you can use median, sum, or whatever else you'd like.
mtcars %>%
mutate(centered_mpg = mpg - mean(mpg, na.rm = TRUE)) %>%
group_by(cyl) %>%
summarise(mean_centered_mpg = mean(centered_mpg))
# # A tibble: 3 x 2
# cyl mean_centered_mpg
# <dbl> <dbl>
# 1 4 6.57
# 2 6 -0.348
# 3 8 -4.99
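Because the centering constant is the same for every car, the mean of the centered values in each group is just the group mean minus the global mean. A small base R sketch of that arithmetic (variable names are illustrative) should reproduce the numbers above:

```r
global_mean <- mean(mtcars$mpg, na.rm = TRUE)
group_means <- tapply(mtcars$mpg, mtcars$cyl, mean, na.rm = TRUE)

# Mean of the centered mpg per cylinder group
mean_centered <- group_means - global_mean
mean_centered
```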
As part of my exploratory work I have built a function that provides a variety of metrics for each field in my dataset. I want to apply it to each column of my dataset.
library(tidyverse)
mtcars %>% summarise_all(., .funs = funs(mean, median, sd, max, min, n_distinct))
However, this results in a dataset with 1 row and one column for each function/column combination. The names are also concatenated like 'column_function'.
DESIRED result would be a 'tidy' format like:
ORIGINAL_COLUMN_NAME | FUNCTION | RESULT
I'm guessing there has to be an easy way to do this?
Here is one option.
library(tidyverse)
mtcars %>%
gather(Original_Column, Value) %>%
group_by(Original_Column) %>%
summarise_all(., .funs = funs(mean, median, sd, max, min, n_distinct)) %>%
gather(Function, Result, -Original_Column)
# # A tibble: 66 x 3
# Original_Column Function Result
# <chr> <chr> <dbl>
# 1 am mean 0.406
# 2 carb mean 2.81
# 3 cyl mean 6.19
# 4 disp mean 231.
# 5 drat mean 3.60
# 6 gear mean 3.69
# 7 hp mean 147.
# 8 mpg mean 20.1
# 9 qsec mean 17.8
# 10 vs mean 0.438
# # ... with 56 more rows
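If the tidyverse is not available, roughly the same long table can be built in base R. The sketch below is an alternative to the answer above, not a reproduction of it: stats is an illustrative named list, and length(unique(x)) stands in for dplyr's n_distinct.

```r
# Named list of summary functions (illustrative; n_distinct emulated in base R)
stats <- list(mean = mean, median = median, sd = sd, max = max, min = min,
              n_distinct = function(x) length(unique(x)))

# One block of rows per function, stacked into a long data frame
res <- do.call(rbind, lapply(names(stats), function(f) {
  data.frame(Original_Column = names(mtcars),
             Function = f,
             Result = vapply(mtcars, stats[[f]], numeric(1)),
             row.names = NULL)
}))

head(res)
```

Rows come out in the column order of mtcars rather than alphabetically, so sort on Original_Column if the ordering matters.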
We can group mtcars by cylinder and summarize miles per gallon with some simple code.
library(dplyr)
mtcars %>%
group_by(cyl) %>%
summarise(avg = mean(mpg))
This provides the correct output shown below.
cyl avg
1 4 26.66364
2 6 19.74286
3 8 15.10000
If I kindly ask dplyr to exclude NA I get some weird results.
mtcars %>%
group_by(cyl) %>%
summarise(avg = mean(!is.na(mpg)))
Since there are no NA in this data set the results should be the same as above. But it averages all mpg to exactly "1". A problem with my code or a bug in dplyr?
cyl avg
1 4 1
2 6 1
3 8 1
My actual data set does have some NA that I need to exclude only for this summarization, but exhibits the same behavior.
You want this:
mtcars %>%
group_by(cyl) %>%
summarise(avg = mean(mpg, na.rm = TRUE))
# A tibble: 3 x 2
cyl avg
<dbl> <dbl>
1 4 26.66364
2 6 19.74286
3 8 15.10000
Right now, !is.na(mpg) returns a logical vector. When you take the mean() of a logical vector, TRUE is coerced to 1 and FALSE to 0, so you get the proportion of non-missing values (exactly 1 here, because mtcars has no NA) rather than the average mpg.
The way you have coded it, the input to the mean() function is a vector of TRUE and FALSE values. Use mean(mpg[!is.na(mpg)]) instead.
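A tiny example with a made-up vector x (not from the question) shows the difference between the three expressions:

```r
x <- c(10, NA, 30, 40)

prop_present <- mean(!is.na(x))        # proportion of non-missing values: 3 of 4
avg_na_rm    <- mean(x, na.rm = TRUE)  # average of the non-missing values
avg_subset   <- mean(x[!is.na(x)])     # same average, via explicit subsetting

c(prop_present, avg_na_rm, avg_subset)
```

The first value is 0.75, while the other two both equal 80 / 3.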
Consider using data.table, which I have used here for illustration. The following all produce the same result.
library(data.table)
MT <- as.data.table(mtcars)  # MT is a data.table copy of mtcars
MT[, mean(mpg), by = cyl]
cyl V1
1: 6 19.74286
2: 4 26.66364
3: 8 15.10000
MT[, mean(mpg, na.rm=TRUE), by = cyl]
cyl V1
1: 6 19.74286
2: 4 26.66364
3: 8 15.10000
MT[, mean(mpg[!is.na(mpg)]), by = cyl]
cyl V1
1: 6 19.74286
2: 4 26.66364
3: 8 15.10000
I am fitting an elastic net with cross-validation and I am looking at how big the coefficients are for each predictor:
lambda <- cv.glmnet(x = features_training, y = outcomes_training, alpha = 0)
elnet <- lambda$glmnet.fit
coefs <- coef(elnet, s = lambda$lambda.min, digits = 3)
The coefs variable contains a dgCMatrix:
1
(Intercept) -1.386936e-16
ret 4.652863e-02
ind30 -2.419878e-03
spyvol 1.570406e-02
Is there a quick way to turn this into a dataframe with 2 columns (one for the predictor name and the other for the coefficient value)? as.data.frame, as.matrix or chaining both did not work. I would notably like to sort the rows according to the second column.
broom::tidy has a nice method for coercing dgCMatrix objects to long-form data frames (a bit like as.data.frame.table), which works well here:
mod <- glmnet::cv.glmnet(model.matrix(~ ., mtcars[-1]), mtcars$mpg, alpha = 0)
broom::tidy(coef(mod$glmnet.fit, s = mod$lambda.min, digits = 3))
#> row column value
#> 1 (Intercept) 1 21.171285892
#> 2 cyl 1 -0.368057153
#> 3 disp 1 -0.005179902
#> 4 hp 1 -0.011713150
#> 5 drat 1 1.053216800
#> 6 wt 1 -1.264212476
#> 7 qsec 1 0.164975032
#> 8 vs 1 0.756163432
#> 9 am 1 1.655635460
#> 10 gear 1 0.546651086
#> 11 carb 1 -0.559817882
Another way, without hacks through the attributes() function, is to extract the row names and the matrix values directly. attributes(class(coefs)) shows that dgCMatrix is a sparse matrix class defined in the Matrix package.
data.frame( predict_names = rownames(coefs),
coef_vals = matrix(coefs))
# predict_names coef_vals
# 1 (Intercept) 21.117339411
# 2 (Intercept) 0.000000000
# 3 cyl -0.371338786
# 4 disp -0.005254534
# 5 hp -0.011613216
# 6 drat 1.054768651
# 7 wt -1.234201216
# 8 qsec 0.162451314
# 9 vs 0.771959823
# 10 am 1.623812912
# 11 gear 0.544171362
# 12 carb -0.547415029
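To get the two-column data frame sorted by coefficient, as the question asks, one possible sketch is below. It assumes the Matrix package is loaded so that as.matrix() has a method for sparse matrices. The coefs object here is an illustrative stand-in built from the values printed in the question, not a real glmnet fit, and the names coef_df, predictor and coefficient are made up.

```r
library(Matrix)

# Illustrative stand-in for the coef() result (values copied from the question)
m <- matrix(c(-1.386936e-16, 4.652863e-02, -2.419878e-03, 1.570406e-02),
            ncol = 1,
            dimnames = list(c("(Intercept)", "ret", "ind30", "spyvol"), "1"))
coefs <- Matrix(m, sparse = TRUE)

# Densify the single column, then pair it with the row names
coef_df <- data.frame(predictor = rownames(coefs),
                      coefficient = as.matrix(coefs)[, 1],
                      row.names = NULL)

# Sort by the second column, largest coefficient first
coef_df <- coef_df[order(coef_df$coefficient, decreasing = TRUE), ]
coef_df
```

To rank predictors by magnitude instead, order on abs(coef_df$coefficient).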
What is the efficient/preferred way to do group mean centering with dplyr, that is, to take each element of a group (mutate) and combine it with a summary statistic (summarize) for that group? Here's how one might do group mean centering on mtcars using base R:
do.call(rbind, lapply(split(mtcars, mtcars$cyl), function(x){
x[["cent"]] <- x$mpg - mean(x$mpg)
x
}))
You can try
library(dplyr)
mtcars %>%
add_rownames()%>% #if the rownames are needed as a column
group_by(cyl) %>%
mutate(cent= mpg-mean(mpg))
It appears that the above code uses the global mean to center mpg. What should I do if I want to center at the within-group mean, i.e. use a different mean for each level of cyl?
> mtcars %>%
+ add_rownames()%>% #if the rownames are needed as a column
+ group_by(cyl) %>%
+ mutate(cent= mpg-mean(mpg))%>%
+ dplyr ::select(cent)
Adding missing grouping variables: `cyl`
# A tibble: 32 x 2
# Groups: cyl [3]
cyl cent
<dbl> <dbl>
1 6 0.909
2 6 0.909
3 4 2.71
4 6 1.31
5 8 -1.39
6 6 -1.99
7 8 -5.79
8 4 4.31
9 4 2.71
10 6 -0.891
# … with 22 more rows
Warning message:
Deprecated, use tibble::rownames_to_column() instead.
> mtcars$mpg[1:5]-mean(mtcars$mpg)
[1] 0.909375 0.909375 2.709375 1.309375 -1.390625
You can try this instead (although the name of the new variable displayed is different):
mtcars %>%
group_by(cyl) %>%
mutate(gpcent = scale(mpg, scale = FALSE))
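A quick way to check which kind of centering you actually got, independent of dplyr, is to compute the within-group version in base R with ave() and compare. This is a sketch; cent and global_cent are illustrative names.

```r
# Within-group centering: subtract each car's own cyl-group mean
cent <- mtcars$mpg - ave(mtcars$mpg, mtcars$cyl, FUN = mean)

# Global centering, for comparison
global_cent <- mtcars$mpg - mean(mtcars$mpg)

# With group centering every group mean is (numerically) zero
group_check <- tapply(cent, mtcars$cyl, mean)
group_check

head(cbind(cyl = mtcars$cyl, cent, global_cent))
```

If a dplyr result matches global_cent rather than cent, the grouping was not applied when the mean was computed.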