I have this code
data_2012 %>%
group_by(job2) %>%
filter(!is.na(job2)) %>%
summarise(mean = mean(persinc2, na.rm = T),
sd = sd(persinc2, na.rm = T))
This gives me a little table for that specific variable, which is perfect. However, I have multiple variables that I want the mean and SD for, all in one table. How do I do that?
I am very new to R.
You can use across() and choose your columns using tidyselect syntax (for example c(col1, col2), starts_with("x"), or where(is.numeric)):
data_2012 %>%
group_by(job2) %>%
filter(!is.na(job2)) %>%
summarise(across(your_columns, list(mean = ~ mean(.x, na.rm = TRUE),
sd = ~ sd(.x, na.rm = TRUE))))
With a toy dataset
iris %>%
group_by(Species) %>%
summarise(across(everything(), list(mean = ~ mean(.x, na.rm = TRUE),
sd = ~ sd(.x, na.rm = TRUE))))
# A tibble: 3 x 9
Species Sepal.Length_mean Sepal.Length_sd Sepal.Width_mean Sepal.Width_sd
<fct> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.01 0.352 3.43 0.379
2 versicolor 5.94 0.516 2.77 0.314
3 virginica 6.59 0.636 2.97 0.322
# ... with 4 more variables: Petal.Length_mean <dbl>, Petal.Length_sd <dbl>,
# Petal.Width_mean <dbl>, Petal.Width_sd <dbl>
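As a minimal sketch of what replacing your_columns can look like, the example below picks two columns by name. It uses mtcars as stand-in data, and the choice of mpg and hp is arbitrary; the asker's real column names are not known beyond persinc2.

```r
library(dplyr)

# Summarise only the named columns, one row per group
res <- mtcars %>%
  group_by(cyl) %>%
  summarise(across(c(mpg, hp),
                   list(mean = ~ mean(.x, na.rm = TRUE),
                        sd   = ~ sd(.x, na.rm = TRUE))))

# Output columns are named <column>_<function> by default
names(res)
```

The same pattern should carry over to the original data by swapping in group_by(job2) and the income columns of interest.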
With base R, we may use split() to split the data by a factor variable. This returns a list with one element per level of that factor. We can then obtain the mean and sd (or any other statistic you like) per column per level using members of the *apply() family as follows:
# toy data
df <- mtcars[, 1:5]
# splitting by a factor variable
lapply(split(df, df$cyl), function(x) {
sapply(x, function(i) data.frame(Mean=mean(i), SD=sd(i)))
})
Output
$`4`
mpg cyl disp hp drat
Mean 26.66364 4 105.1364 82.63636 4.070909
SD 4.509828 0 26.87159 20.93453 0.3654711
$`6`
mpg cyl disp hp drat
Mean 19.74286 6 183.3143 122.2857 3.585714
SD 1.453567 0 41.56246 24.26049 0.4760552
$`8`
mpg cyl disp hp drat
Mean 15.1 8 353.1 209.2143 3.229286
SD 2.560048 0 67.77132 50.97689 0.3723618
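If a single table is preferred over a list of matrices, one possible base R sketch uses aggregate() with cbind() on the left-hand side of the formula. The column selection is arbitrary, and flattening with do.call(data.frame, ...) is just one way to expand the matrix columns that aggregate() returns.

```r
# toy data, same subset as above
df <- mtcars[, 1:5]

# one row per level of cyl; each summarised column holds a 2-column matrix
res <- aggregate(cbind(mpg, disp, hp, drat) ~ cyl, data = df,
                 FUN = function(v) c(Mean = mean(v), SD = sd(v)))

# expand the matrix columns into ordinary columns (mpg.Mean, mpg.SD, ...)
flat <- do.call(data.frame, res)
flat
```

Note that the formula interface of aggregate() drops rows with missing values by default, which can differ from calling mean(x, na.rm = TRUE) column by column.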
Related
I'm trying to run a simple single linear regression over a large number of variables, grouped according to another variable. Using the mtcars dataset as an example, I'd like to run a separate linear regression between mpg and each other variable (mpg ~ disp, mpg ~ hp, etc.), grouped by another variable (for example, cyl).
Running lm over each variable independently can easily be done using purrr::map (modified from this great tutorial - https://sebastiansauer.github.io/EDIT-multiple_lm_purrr_EDIT/):
library(dplyr)
library(tidyr)
library(purrr)
library(broom) # provides glance()
mtcars %>%
select(-mpg) %>% #exclude outcome, leave predictors
map(~ lm(mtcars$mpg ~ .x, data = mtcars)) %>%
map_df(glance, .id='variable') %>%
select(variable, r.squared, p.value)
# A tibble: 10 x 3
variable r.squared p.value
<chr> <dbl> <dbl>
1 cyl 0.726 6.11e-10
2 disp 0.718 9.38e-10
3 hp 0.602 1.79e- 7
4 drat 0.464 1.78e- 5
5 wt 0.753 1.29e-10
6 qsec 0.175 1.71e- 2
7 vs 0.441 3.42e- 5
8 am 0.360 2.85e- 4
9 gear 0.231 5.40e- 3
10 carb 0.304 1.08e- 3
And running a linear model over grouped variables is also easy using map:
mtcars %>%
split(.$cyl) %>% #split by grouping variable
map(~ lm(mpg ~ wt, data = .)) %>%
map_df(broom::glance, .id='cyl') %>%
select(cyl, r.squared, p.value)
# A tibble: 3 x 3
cyl r.squared p.value
<chr> <dbl> <dbl>
1 4 0.509 0.0137
2 6 0.465 0.0918
3 8 0.423 0.0118
So I can run by variable, or by group. However, I can't figure out how to combine these two (grouping everything by cyl, then running lm(mpg ~ each other variable) separately). I'd hoped to do something like this:
mtcars %>%
select(-mpg) %>% #exclude outcome, leave predictors
split(.$cyl) %>% # group by grouping variable
map(~ lm(mtcars$mpg ~ .x, data = mtcars)) %>% #run lm across all variables
map_df(glance, .id='cyl') %>%
select(cyl, variable, r.squared, p.value)
and get a result that gives me cyl(group), variable, r.squared, and p.value (a combination of 3 groups * 10 variables = 30 model outputs).
But split() turns the dataframe into a list, which the construction from part 1 [ map(~ lm(mtcars$mpg ~ .x, data = mtcars)) ] can't handle. I have tried to modify it so that it doesn't explicitly refer to the original data structure, but can't figure out a working solution. Any help is greatly appreciated!
If I understand correctly, you can use group_by and group_modify, with a map inside that iterates over predictors.
If you can isolate your predictor variables in advance, it'll make it easier, as with ivs in this solution.
library(tidyverse)
ivs <- colnames(mtcars)[3:ncol(mtcars)]
names(ivs) <- ivs
mtcars %>%
group_by(cyl) %>%
group_modify(function(data, key) {
map_df(ivs, function(iv) {
frml <- as.formula(paste("mpg", "~", iv))
lm(frml, data = data) %>% broom::glance()
}, .id = "iv")
}) %>%
select(cyl, iv, r.squared, p.value)
# A tibble: 27 × 4
# Groups: cyl [3]
cyl iv r.squared p.value
<dbl> <chr> <dbl> <dbl>
1 4 disp 0.648 0.00278
2 4 hp 0.274 0.0984
3 4 drat 0.180 0.193
4 4 wt 0.509 0.0137
5 4 qsec 0.0557 0.485
6 4 vs 0.00238 0.887
7 4 am 0.287 0.0892
8 4 gear 0.115 0.308
9 4 carb 0.0378 0.567
10 6 disp 0.0106 0.826
11 6 hp 0.0161 0.786
# ...
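Another way to get one model per group and predictor is to reshape the predictors into long format first, so that each (cyl, variable) pair becomes its own group. This is only a sketch: the column names variable and value are chosen here for illustration, and the predictors are limited to the continuous ones to avoid groups where a predictor is constant.

```r
library(dplyr)
library(tidyr)
library(broom)

res <- mtcars %>%
  select(mpg, cyl, disp, hp, drat, wt, qsec) %>%
  # one row per observation and predictor
  pivot_longer(-c(mpg, cyl), names_to = "variable", values_to = "value") %>%
  group_by(cyl, variable) %>%
  # fit mpg ~ predictor within each (cyl, variable) group
  group_modify(~ glance(lm(mpg ~ value, data = .x))) %>%
  ungroup() %>%
  select(cyl, variable, r.squared, p.value)

res
```

This gives 3 groups x 5 predictors = 15 rows, in the same cyl / variable / r.squared / p.value layout as the output above.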
The code below shows me extracting certain values for one parameter in my data frame (Calcium), but I want to be able to do this for all of the parameters/rows in my data frame. There are multiple rows for Calcium, which is why I took the median value.
How can I create a loop that does this for the other drug substance parameters?
Cal_limits=ag_limits_5 %>% filter(PARAMETER=="Drug Substance.Calcium")
lcl <- median(Cal_limits$LCL, na.rm = TRUE)
ucl <- median(Cal_limits$UCL, na.rm = TRUE)
lsl <- median(Cal_limits$LSL_1, na.rm = TRUE)
usl <- median(Cal_limits$USL_1, na.rm = TRUE)
cl <- median(Cal_limits$TARGET_MEAN, na.rm = TRUE)
stdev <- median(Cal_limits$TARGET_STDEV, na.rm = TRUE)
sigabove <- ucl + stdev #3.219 #(UCL + sd (3.11+0.107))
sigbelow <- lcl - stdev#2.363 #(LCL - sd (2.47-0.107))
Snapshot showing that there are multiple rows dedicated to one parameter, the columns not pictured have confidential information but include the values I am looking to extract
Edit: I am creating an RShiny app, so I am not sure if I will need to incorporate a reactive function
Using mtcars, you can do
aggregate(. ~ cyl, data = mtcars, FUN = median)
# cyl mpg disp hp drat wt qsec vs am gear carb
# 1 4 26.0 108.0 91.0 4.080 2.200 18.900 1 1 4 2.0
# 2 6 19.7 167.6 110.0 3.900 3.215 18.300 1 0 4 4.0
# 3 8 15.2 350.5 192.5 3.115 3.755 17.175 0 0 3 3.5
which provides the median for each of the variables (. means "all others") for each of the levels of cyl. I'm going to guess that this would apply to your data as
aggregate(. ~ PARAMETER, data = ag_limits_5, FUN = median)
If you have more columns than you want to reduce, then you can specify them manually with
# cbind() keeps each column separate; LCL + UCL + ... would add them into a single column
aggregate(cbind(LCL, UCL, LSL_1, USL_1, TARGET_MEAN, TARGET_STDEV) ~ PARAMETER,
data = ag_limits_5, FUN = median)
and I think you'll get output something like
# PARAMETER LCL UCL LSL_1 USL_1 TARGET_MEAN TARGET_STDEV
# 1 Drug Substance.Calcium 1.1 1.2 1.3 1.4 ...
# 2 Drug Substance.Copper ...
(with real numbers, I'm just showing structure there).
Since it appears that you're using dplyr, you can do it this way, too:
mtcars %>%
group_by(cyl) %>%
summarize(across(everything(), ~ median(., na.rm = TRUE)))
# # A tibble: 3 x 11
# cyl mpg disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 4 26 108 91 4.08 2.2 18.9 1 1 4 2
# 2 6 19.7 168. 110 3.9 3.22 18.3 1 0 4 4
# 3 8 15.2 350. 192. 3.12 3.76 17.2 0 0 3 3.5
which for you might be
ag_limits_5 %>%
group_by(PARAMETER) %>%
summarize(across(everything(), ~ median(., na.rm = TRUE)))
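If ag_limits_5 also contains non-numeric columns, everything() would pass them to median() as well. A hedged sketch that restricts the summary to numeric columns and then derives the two sigma limits from the question is shown below. The data frame toy_limits and all of its values are invented stand-ins, since the real data is confidential.

```r
library(dplyr)

# Hypothetical stand-in for ag_limits_5; every value here is made up
toy_limits <- data.frame(
  PARAMETER    = c("Drug Substance.Calcium", "Drug Substance.Calcium",
                   "Drug Substance.Copper", "Drug Substance.Copper"),
  LCL          = c(2.4, 2.5, 0.9, 1.1),
  UCL          = c(3.1, 3.2, 1.9, 2.1),
  TARGET_STDEV = c(0.1, 0.1, 0.2, NA),
  NOTE         = c("a", "b", "c", "d")  # non-numeric column, skipped below
)

limits <- toy_limits %>%
  group_by(PARAMETER) %>%
  # median of each numeric column per parameter, ignoring NAs
  summarize(across(where(is.numeric), ~ median(.x, na.rm = TRUE))) %>%
  # derived limits, computed once per parameter
  mutate(sigabove = UCL + TARGET_STDEV,
         sigbelow = LCL - TARGET_STDEV)

limits
```

The result has one row per parameter, so a single parameter's limits can be looked up with filter(PARAMETER == ...) instead of recomputing them in a loop.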
I came across something weird with dplyr and across, or at least something I do not understand.
If we use the across function to compute the mean and standard error of the mean across multiple columns, I am tempted to use the following command:
mtcars %>% group_by(gear) %>% select(mpg,cyl) %>%
summarize(across(everything(), ~mean(.x, na.rm = TRUE), .names = "{col}"),
across(everything(), ~sd(.x, na.rm=T)/sqrt(sum(!is.na(.x))), .names="se_{col}")) %>% head()
Which results in
gear mpg cyl se_mpg se_cyl
<dbl> <dbl> <dbl> <dbl> <dbl>
1 3 16.1 7.47 NA NA
2 4 24.5 4.67 NA NA
3 5 21.4 6 NA NA
However, if I switch the order of the individual across commands, I get the following:
mtcars %>% group_by(gear) %>% select(mpg,cyl) %>%
summarize(across(everything(), ~sd(.x, na.rm=T)/sqrt(sum(!is.na(.x))), .names="se_{col}"),
across(everything(), ~mean(.x, na.rm = TRUE), .names = "{col}")) %>% head()
# A tibble: 3 x 5
gear se_mpg se_cyl mpg cyl
<dbl> <dbl> <dbl> <dbl> <dbl>
1 3 0.871 0.307 16.1 7.47
2 4 1.52 0.284 24.5 4.67
3 5 2.98 0.894 21.4 6
Why is this the case? Does it have something to do with my usage of everything()? In my situation I'd like the mean and the standard error of the mean calculated across every variable in my dataset.
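This happens because summarize() evaluates its arguments in order, and later arguments see the columns created by earlier ones. In your first version, the first across() (with .names = "{col}") overwrites mpg and cyl with their group means, so the second across() computes sd() on a single value per group, which is NA. In the swapped order, the standard errors are computed from the original columns first, so nothing is lost. To avoid the problem, I suggest writing a single across() statement with a list of lambda functions, as shown in the across documentation.
That way it doesn't matter whether the mean or the standard error is specified first: you will get no NAs.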
mtcars %>%
group_by(gear) %>%
select(mpg, cyl) %>%
summarize(across(everything(), list(
mean = ~mean(.x, na.rm = TRUE),
se = ~sd(.x, na.rm = TRUE)/sqrt(sum(!is.na(.x)))
), .names = "{fn}_{col}"))
# A tibble: 3 x 5
# gear mean_mpg se_mpg mean_cyl se_cyl
# <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 3 16.1 0.871 7.47 0.307
# 2 4 24.5 1.52 4.67 0.284
# 3 5 21.4 2.98 6 0.894
mtcars %>%
group_by(gear) %>%
select(mpg, cyl) %>%
summarize(across(everything(), list(
se = ~sd(.x, na.rm = TRUE)/sqrt(sum(!is.na(.x))),
mean = ~mean(.x, na.rm = TRUE)
), .names = "{fn}_{col}"))
# A tibble: 3 x 5
# gear se_mpg mean_mpg se_cyl mean_cyl
# <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 3 0.871 16.1 0.307 7.47
# 2 4 1.52 24.5 0.284 4.67
# 3 5 2.98 21.4 0.894 6
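The NA columns in the question are consistent with summarize() evaluating its arguments in order, so that a column overwritten by an earlier argument is what later arguments see. A small sketch of that behaviour, without across(), is below; the column names n_seen and sd_after are made up for the demonstration.

```r
library(dplyr)

res <- mtcars %>%
  group_by(gear) %>%
  summarize(
    mpg      = mean(mpg),   # overwrites mpg with one value per group
    n_seen   = length(mpg), # now sees the length-1 summary, not the raw data
    sd_after = sd(mpg)      # sd() of a single value is NA
  )

res
```

Giving the summaries new names (for example via .names = "{fn}_{col}") keeps the original columns intact for every function.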
I want to calculate the pair-wise correlations between "mpg" and all other numeric variables of interest for each cyl in the mtcars dataset. I would like to adopt the tidy data principle.
It's rather easy with corrr::correlate().
library(dplyr)
library(tidyr)
library(purrr)
library(corrr)
data(mtcars)
mtcars2 <- mtcars[,1:7] %>%
group_nest(cyl) %>%
mutate(cors = map(data, corrr::correlate),
stretch = map(cors, corrr::stretch)) %>%
unnest(stretch)
mtcars2 %>%
filter(x == "mpg")
By using corrr::correlate(), all available pair-wise correlations have been calculated. I could use dplyr::filter() to select the correlations of interest.
However, when datasets are large, a lot of calculations go to the unwanted correlations, making this approach very time-consuming. So I tried to calculate only mpg vs. others. I'm not very familiar with purrr, and the following code doesn't work.
mtcars2 <- mtcars[,1:7] %>%
group_nest(cyl) %>%
mutate(comp = map(data, ~colnames),
corr = map(comp, ~cor.test(data[["mpg"]], data[[.]])))
If you need to use cor.test, below is an option using broom:
library(broom)
library(tidyr)
library(dplyr)
mtcars[,1:7] %>%
pivot_longer(-c(mpg,cyl)) %>%
group_by(cyl,name) %>%
do(tidy(cor.test(.$mpg,.$value)))
# A tibble: 15 x 10
# Groups: cyl, name [15]
cyl name estimate statistic p.value parameter conf.low conf.high method
<dbl> <chr> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <chr>
1 4 disp -0.805 -4.07 0.00278 9 -0.947 -0.397 Pears…
2 4 drat 0.424 1.41 0.193 9 -0.236 0.816 Pears…
3 4 hp -0.524 -1.84 0.0984 9 -0.855 0.111 Pears…
4 4 qsec -0.236 -0.728 0.485 9 -0.732 0.424 Pears…
5 4 wt -0.713 -3.05 0.0137 9 -0.920 -0.198 Pears…
6 6 disp 0.103 0.232 0.826 5 -0.705 0.794 Pears…
7 6 drat 0.115 0.258 0.807 5 -0.699 0.799 Pears…
If you just need the correlation, for big datasets, the nesting etc might be costly and unnecessary because you can simply do cor(,) and melt that:
library(purrr) # for map_dfr()
# define columns to correlate
cor_vars = setdiff(colnames(mtcars)[1:7],"cyl")
split(mtcars[,1:7],mtcars$cyl) %>%
map_dfr(~data.frame(x="mpg",y=cor_vars,
cyl=unique(.x$cyl),rho=as.numeric(cor(.x$mpg,.x[,cor_vars]))))
x y cyl rho
1 mpg mpg 4 1.00000000
2 mpg disp 4 -0.80523608
3 mpg hp 4 -0.52350342
4 mpg drat 4 0.42423947
5 mpg wt 4 -0.71318483
6 mpg qsec 4 -0.23595389
7 mpg mpg 6 1.00000000
8 mpg disp 6 0.10308269
9 mpg hp 6 -0.12706785
10 mpg drat 6 0.11471598
11 mpg wt 6 -0.68154982
12 mpg qsec 6 -0.41871779
13 mpg mpg 8 1.00000000
14 mpg disp 8 -0.51976704
15 mpg hp 8 -0.28363567
16 mpg drat 8 0.04793248
17 mpg wt 8 -0.65035801
18 mpg qsec 8 -0.10433602
Would this work for you? I have done this in the past, but only on smallish datasets, and I have not benchmarked it, so I am not sure about performance. I use pivot_longer to reshape the data prior to nesting. The variables you pass to pivot_longer essentially act as the filtering step:
mtcars2 <- mtcars[,1:7] %>%
pivot_longer(c(-mpg, -cyl), names_to = "y.var", values_to = "value" ) %>%
group_nest(cyl, y.var) %>%
mutate(x.var = "mpg", #just so you can see this in the output
cor = map_dbl(data, ~ {cor <- cor.test(.x$mpg, .x$value)
cor$estimate})) %>%
select(data, cyl, x.var , y.var, cor) %>%
arrange(cyl, y.var)
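If only the correlation estimates of mpg against the other variables are needed, a compact sketch is to call cor() inside across(), where the lambda can refer to mpg from the same group. The column selection and the output names y and rho are assumptions made here; this has not been benchmarked against the approaches above.

```r
library(dplyr)
library(tidyr)

res <- mtcars %>%
  group_by(cyl) %>%
  # correlate mpg with each selected column within each cyl group
  summarise(across(c(disp, hp, drat, wt, qsec), ~ cor(mpg, .x))) %>%
  # one row per (cyl, variable) pair
  pivot_longer(-cyl, names_to = "y", values_to = "rho")

res
```

Only the mpg-versus-other pairs are computed, so no filtering step is required afterwards.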
I am running multiple models on multiple sections of my data set, similar to (but with many more models)
library(tidyverse)
d1 <- mtcars %>%
group_by(cyl) %>%
do(mod_linear = lm(mpg ~ disp + hp, data = ., x = TRUE))
d1
# Source: local data frame [3 x 3]
# Groups: <by row>
#
# # A tibble: 3 x 3
# cyl mod_linear
# * <dbl> <list>
# 1 4. <S3: lm>
# 2 6. <S3: lm>
# 3 8. <S3: lm>
I then tidy this tibble and save my parameter estimates using tidy() in the broom package.
I also want to calculate the standard deviation of the predictors (stored in models above as I set x = TRUE) to create and then compare re-scaled parameters. I can do the former of these using
d1 %>%
# group_by(cyl) %>%
do(term = colnames(.$mod$x),
pred_sd = apply(X = .$mod$x, MARGIN = 2, FUN = sd)) %>%
unnest()
# # A tibble: 9 x 2
# term pred_sd
# <chr> <dbl>
# 1 (Intercept) 0.00000
# 2 disp 26.87159
# 3 hp 20.93453
# 4 (Intercept) 0.00000
# 5 disp 41.56246
# 6 hp 24.26049
# 7 (Intercept) 0.00000
# 8 disp 67.77132
# 9 hp 50.97689
However, the result is not a grouped tibble, so I end up losing the cyl column that tells me which terms belong to which model. How can I avoid this loss? Adding in group_by again seems to throw an error.
n.b. I want to avoid using purrr, at least for the first part (fitting the models), as I run different types of models and then need to reshape the results (d1), and I like the progress bar with do.
n.b. I want to work with the $x component of the models rather than the raw data, as they have the data on the correct scale (I am experimenting with different transformations of the predictors).
We can do this by nesting first and then unnesting:
mtcars %>%
group_by(cyl) %>%
nest(-cyl) %>%
mutate(mod_linear = map(data, ~ lm(mpg ~ disp + hp, data = .x, x = TRUE)),
term = map(mod_linear, ~ names(coef(.x))),
pred = map(mod_linear, ~ .x$x %>%
as_tibble %>%
summarise_all(sd) %>%
unlist )) %>%
select(-data, -mod_linear) %>%
unnest
# A tibble: 9 x 3
# cyl term pred
# <dbl> <chr> <dbl>
#1 6.00 (Intercept) 0
#2 6.00 disp 41.6
#3 6.00 hp 24.3
#4 4.00 (Intercept) 0
#5 4.00 disp 26.9
#6 4.00 hp 20.9
#7 8.00 (Intercept) 0
#8 8.00 disp 67.8
#9 8.00 hp 51.0
Or instead of calling the map multiple times, this can be further made compact with
mtcars %>%
group_by(cyl) %>%
nest(-cyl) %>%
mutate(mod_contents = map(data, ~ {
mod <- lm(mpg ~ disp + hp, data = .x, x = TRUE)
term <- names(coef(mod))
pred <- mod$x %>%
as_tibble %>%
summarise_all(sd) %>%
unlist
tibble(term, pred)
}
)) %>%
select(-data) %>%
unnest
# A tibble: 9 x 3
# cyl term pred
# <dbl> <chr> <dbl>
#1 6.00 (Intercept) 0
#2 6.00 disp 41.6
#3 6.00 hp 24.3
#4 4.00 (Intercept) 0
#5 4.00 disp 26.9
#6 4.00 hp 20.9
#7 8.00 (Intercept) 0
#8 8.00 disp 67.8
#9 8.00 hp 51.0
If we start from 'd1' (based on the OP's code)
d1 %>%
ungroup %>%
mutate(mod_contents = map(mod_linear, ~ {
pred <- .x$x %>%
as_tibble %>%
summarise_all(sd) %>%
unlist
term <- .x %>%
coef %>%
names
tibble(term, pred)
})) %>%
select(-mod_linear) %>%
unnest
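In recent tidyr versions, nest(-cyl) and a bare unnest typically emit deprecation warnings. A hedged sketch of the same idea with the newer argument style (nest(data = ...) and unnest(cols = ...)) follows; the column name pred_sd is taken from the question, and the rest mirrors the code above.

```r
library(dplyr)
library(tidyr)
library(purrr)

res <- mtcars %>%
  nest(data = -cyl) %>%                      # one row per cyl, data in a list-column
  mutate(mod_contents = map(data, ~ {
    mod <- lm(mpg ~ disp + hp, data = .x, x = TRUE)
    tibble(term    = colnames(mod$x),        # model-matrix column names
           pred_sd = apply(mod$x, 2, sd))    # sd of each model-matrix column
  })) %>%
  select(cyl, mod_contents) %>%
  unnest(cols = mod_contents)                # cyl is kept alongside each term

res
```

Because cyl stays in the tibble throughout, each term remains labelled with the model it belongs to.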