Can I speed up many regressions and outputs from sapply in R? - r

I created a function using sapply to run 153k linear regressions, and extract the estimates, se, pvalues, and the associated rowname from the first column. The code should be self-explanatory:
run_lms <-sapply(1:nrow(test_wide[,-1]), function(x) {
lm_output<-lm(unlist(test_wide[x,-1]) ~ survey_clean_for_lm$CR + survey_clean_for_lm$cbage + survey_clean_for_lm$sex + survey_clean_for_lm$bmistrat + survey_clean_for_lm$deidsite + survey_clean_for_lm$snppc1 + survey_clean_for_lm$snppc2 + survey_clean_for_lm$snppc3 + survey_clean_for_lm$methpc1 + survey_clean_for_lm$methpc2 + survey_clean_for_lm$methpc3 + survey_clean_for_lm$methpc4 + survey_clean_for_lm$methpc5 + survey_clean_for_lm$methpc6 + survey_clean_for_lm$methpc7 )
lm_summary <-summary(lm_output)
estimate <-lm_summary$coefficients[2,1]
se <-lm_summary$coefficients[2,2]
pval <-lm_summary$coefficients[2,4]
bind_cols(test_wide[x,1], estimate, se, pval)
}
)
It takes nearly 10 hours to run 153k regressions and store the output. I'm wondering if anyone has advice for speeding this up. I think the bind_cols() portion is part of the problem, but I'm not sure how else to structure and save the output. Ultimately, I want this format:
probe estimate se pval
<chr> <dbl> <dbl> <dbl>
1 cg20272595 0.00556 0.00135 0.0000600
2 cg13995374 0.00466 0.00114 0.0000654
3 cg05254132 0.00367 0.000897 0.0000658
4 cg10049251 -0.00727 0.00179 0.0000746
5 cg19695507 -0.0108 0.00274 0.000117
6 cg21590616 0.00687 0.00176 0.000136
7 cg04089674 -0.00718 0.00186 0.000158
8 cg16907093 -0.00506 0.00132 0.000184
9 cg04600792 -0.00593 0.00156 0.000193
10 cg10529757 0.0122 0.00322 0.000199
# … with 151,853 more rows

Related

How to use results from different regression models in a scatterplot built using group_by in R?

I would like to add 2 different regression curves, coming from different models, in a scatter plot.
Let's use the example below:
Weight=c(12.6,12.6,16.01,17.3,17.7,10.7,17,10.9,15,14,13.8,14.5,17.3,10.3,12.8,14.5,13.5,14.5,17,14.3,14.8,17.5,2.9,21.4,15.8,40.2,27.3,18.3,10.7,0.7,42.5,1.55,46.7,45.3,15.4,25.6,18.6,11.7,28,35,17,21,41,42,18,33,35,19,30,42,23,44,22)
Increment=c(0.55,0.53,16.53,55.47,80,0.08,41,0.1,6.7,2.2,1.73,3.53,64,0.05,0.71,3.88,1.37,3.8,40,3,26.3,29.7,10.7,35,27.5,60,43,31,21,7.85,63,9.01,67.8,65.8,27,40.1,31.2,22.3,35,21,74,75,12,19,4,20,65,46,9,68,74,57,57)
Id=c(rep("Aa",20),rep("Ga",18),rep("Za",15))
df=data.frame(Id,Weight,Increment)
The scatter plot looks like this:
plot_df <- ggplot(df, aes(x = Weight, y = Increment, color=Id)) + geom_point()
I tested a linear and an exponential regression model and could extract the results following loki's answer there:
linear_df <- df %>% group_by(Id) %>% do(model = glance(lm(Increment ~ Weight,data = .))) %>% unnest(model)
exp_df <- df %>% group_by(Id) %>% do(model = glance(lm(log(Increment) ~ Weight,data = .))) %>% unnest(model)
The linear model fits better for the Ga group, the exponential one for the Aa group, and nothing for the Za one:
> linear_df
# A tibble: 3 x 13
Id r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 Aa 0.656 0.637 15.1 34.4 1.50e- 5 1 -81.6 169. 172. 4106. 18 20
2 Ga 1.00 1.00 0.243 104113. 6.10e-32 1 1.01 3.98 6.65 0.942 16 18
3 Za 0.0471 -0.0262 26.7 0.642 4.37e- 1 1 -69.5 145. 147. 9283. 13 15
> exp_df
# A tibble: 3 x 13
Id r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 Aa 0.999 0.999 0.0624 24757. 1.05e-29 1 28.2 -50.3 -47.4 0.0700 18 20
2 Ga 0.892 0.885 0.219 132. 3.86e- 9 1 2.87 0.264 2.94 0.766 16 18
3 Za 0.00444 -0.0721 0.941 0.0580 8.14e- 1 1 -19.3 44.6 46.7 11.5 13 15
Now, how can I draw the linear regression line for the Aa group, the exponential regression curve for the Ga group, and no curve for the Za group? There is this, but it applies for different regressions built inside the same model type. How can I combine my different objects?
The formula shown below gives the same fitted values as does 3 separate fits for each Id so create the lm objects for each of the two models and then plot the points and the lines for each. The straight solid lines are the linear model and the curved dashed lines are the exponential model.
library(ggplot2)
fm.lin <- lm(Increment ~ Id/Weight + 0, df)
fm.exp <- lm(log(Increment) ~ Id/Weight + 0, df)
df %>%
ggplot(aes(Weight, Increment, color=Id)) +
geom_point() +
geom_line(aes(y = fitted(fm.lin))) +
geom_line(aes(y = exp(fitted(fm.exp))), lty = 2, lwd = 1)
To only show the Aa fitted lines for the linear model and Ga fitted lines for the exponential model NA out the portions not wanted. In this case we used solid lines for the fitted models.
df %>%
ggplot(aes(Weight, Increment, color=Id)) +
geom_point() +
geom_line(aes(y = ifelse(Id == "Aa", fitted(fm.lin), NA))) +
geom_line(aes(y = ifelse(Id == "Ga", exp(fitted(fm.exp)), NA)))
Added
Regarding the questions in the comments, the formula used above nests Weight within Id and effectively uses a model matrix which, modulo column order, is a block diagonal matrix whose blocks are the model matrices of the 3 individual models. Look at this to understand it.
model.matrix(fm.lin)
Since this is a single model rather than three models the summary statistics will be pooled. To get separate summary statistics use lmList from the nlme package (which comes with R so it does not have to be installed -- just issue a library statement). The statements below will give objects of class lmList that can be used in place of the ones above as they have a fitted method that will return the same fitted values.
library(nlme)
fm.lin2 <- lmList(Increment ~ Weight | Id, df, pool = FALSE)
fm.exp2 <- lmList(log(Increment) ~ Weight | Id, df, pool = FALSE)
In addition, they can be used to get individual summary statistics. Internally the lmList objects consist of a list of 3 lm objects with attributes in this case so we can extract the summary statistics by extracting the summary statistics from each component.
library(broom)
sapply(fm.lin2, glance)
sapply(fm.exp2, glance)
One caveat is that common statistical tests between models using different dependent variables, Increment vs. log(Increment), are invalid.
possible solution
Weight=c(12.6,12.6,16.01,17.3,17.7,10.7,17,10.9,15,14,13.8,14.5,17.3,10.3,12.8,14.5,13.5,14.5,17,14.3,14.8,17.5,2.9,21.4,15.8,40.2,27.3,18.3,10.7,0.7,42.5,1.55,46.7,45.3,15.4,25.6,18.6,11.7,28,35,17,21,41,42,18,33,35,19,30,42,23,44,22)
Increment=c(0.55,0.53,16.53,55.47,80,0.08,41,0.1,6.7,2.2,1.73,3.53,64,0.05,0.71,3.88,1.37,3.8,40,3,26.3,29.7,10.7,35,27.5,60,43,31,21,7.85,63,9.01,67.8,65.8,27,40.1,31.2,22.3,35,21,74,75,12,19,4,20,65,46,9,68,74,57,57)
Id=c(rep("Aa",20),rep("Ga",18),rep("Za",15))
df=data.frame(Id,Weight,Increment)
library(tidyverse)
df_model <- df %>%
group_nest(Id) %>%
mutate(
formula = c(
"lm(log(Increment) ~ Weight, data = .x)",
"lm(Increment ~ Weight,data = .x)",
"lm(Increment ~ 0,data = .x)"
),
transform = c("exp(fitted(.x))",
"fitted(.x)",
"fitted(.x)")
) %>%
mutate(model = map2(data, formula, .f = ~ eval(parse(text = .y)))) %>%
mutate(fit = map2(model, transform, ~ eval(parse(text = .y)))) %>%
select(Id, data, fit) %>%
unnest(c(data, fit))
ggplot(df_model) +
geom_point(aes(Weight, Increment, color = Id)) +
geom_line(aes(Weight, fit, color = Id))
Created on 2021-10-06 by the reprex package (v2.0.1)

rstatix package anova_test function gives partial eta squared despite setting effect.size = 'ges'

I am not able to obtain eta squared, only partial eta squared, when I use rstatix::anova_test.
Example from the iris dataset:
First using aov:
aov <- aov(Sepal.Length ~ Sepal.Width + Species, data = iris)
summary(aov)
Df Sum Sq Mean Sq F value Pr(>F)
Sepal.Width 1 1.41 1.41 7.363 0.00746 **
Species 2 72.75 36.38 189.651 < 2e-16 ***
Residuals 146 28.00 0.19
Then using sjstats::eta_sq, if I choose partial = TRUE or FALSE I get a different effect size, as I would expect.
eta_sq(aov, partial = FALSE)
term etasq
1 Sepal.Width 0.014
2 Species 0.712
eta_sq(aov, partial = TRUE)
term partial.etasq
1 Sepal.Width 0.048
2 Species 0.722
However, when I do the same in anova_test, I get the partial eta squared both times regardless of whether the effect size is pes or ges, both times it's the partial eta squared:
aov_pes <- iris %>% anova_test(Sepal.Length ~ Sepal.Width + Species,
detailed = T,
effect.size = "pes")
get_anova_table(aov_pes)
Effect SSn SSd DFn DFd F p p<.05
1 Sepal.Width 10.953 28.004 1 146 57.102 4.19e-12 *
2 Species 72.752 28.004 2 146 189.651 2.56e-41 *
pes
1 0.281
2 0.722
aov_ges <- iris %>% anova_test(Sepal.Length ~ Sepal.Width + Species,
detailed = T,
effect.size = "ges")
get_anova_table(aov_ges)
Effect SSn SSd DFn DFd F p p<.05
1 Sepal.Width 10.953 28.004 1 146 57.102 4.19e-12 *
2 Species 72.752 28.004 2 146 189.651 2.56e-41 *
ges
1 0.281
2 0.722
Does anyone know why this is? Thanks!
Answer
rstatix::anova_test seems to contain a mistake in the calculation! I would be very, very careful with this function.
Note that eta_sq is deprecated, and effectsize::eta_squared should be used.
Proper calculation
We have three SS values: 1.412238, 72.752431, and 28.003665. We can calculate the pes and ges:
pes: 1.412238 / (1.412238 + 28.003665)
ges: 1.412238 / (1.412238 + 72.752431 + 28.003665)
anova_test
Under the hood, anova_test calls two functions for pes and ges calculation:
pes: rstatix:::add_partial_eta_squared
ges: rstatix:::add_generalized_eta_squared
The pes calculation by anova_test
res.anova.summary$ANOVA %>% mutate(pes = .data$SSn/(.data$SSn + .data$SSd))
This indeed calculates the pes as we expect it to.
The ges calculation by anova_test
aov.table <- res.anova.summary$ANOVA
aov.table %>% mutate(ges = .data$SSn/(.data$SSn + sum(unique(.data$SSd)) + obs.SSn1 - obs.SSn2))
Here, we run into a problem. This code seems blatantly incorrect. It just divides each sum of square value by itself + the residual sum of squares (28.004). That is the pes, not the ges.
You could contact the maintainer of the package (maintainer("rstatix")) or create a new issue for the rstatix package here.

how to add pre-calculated standard errors for both X and Y axis into a scatterplot in r?

Here's how my data look like:
A tibble: 3 x 6
box
water(%)
VWC_tube_avg
st_error_tube
VWC_sensor_avg
st_error_sensor
1
0
0.0110
0.00275
0
0
1
10
0.127
0.00429
0.0390
2.31e-18
1
30
0.383
0.0118
0.313
1.86e- 3
A quick explanation of the data: The data concerns the calibration of volumetric water content (VWC) soil sensors. In the table above we have the water % used for the calibration, the VWC presented in the test tubes followed by the standard error of the VWC of the tubes, the VWC of the sensors, and the correspondent previously calculated standard error.
Now I want to do a correlation scatterplot and add to each point (correlation) the correspondent standard error in the x-axis (st_error_tube) and y-axis (st_error_sensor).
The scatterplot correlation is done but the problem is the standard error bars which I can´t add to the scatterplot.
This is what I try, but without success:
ggplot(box1_1st, aes(VWC_tube_avg, VWC_sensor_avg)) +
geom_point() +
geom_errorbarh(aes(xmin=st_error_tube, xmax=st_error_tube)) +
geom_errorbar(aes(ymin=st_error_sensor, ymax=st_error_sensor))
Any help will be more than welcome.
You can use the following code to achieve that
library(ggplot2)
box1_1st <- read.table(text = "box water VWC_tube_avg st_error_tube VWC_sensor_avg st_error_sensor
1 0 0.0110 0.00275 0 0
1 10 0.127 0.00429 0.0390 2.31e-18
1 30 0.383 0.0118 0.313 1.86e-3", header = T)
ggplot(box1_1st, aes(VWC_tube_avg, VWC_sensor_avg)) +
geom_point() +
geom_errorbarh(aes(xmin= VWC_tube_avg - st_error_tube, xmax= VWC_tube_avg + st_error_tube), height=0.01) +
geom_errorbar(aes(ymin= VWC_sensor_avg - st_error_sensor, ymax= VWC_sensor_avg + st_error_sensor), width=0.01)

pooled model summary not showing R squared or adjusted R Squared

I'm using imputed data (via r-MICE) to carry out some linear regressions.
eg:
fitimp2 <- with(impdatlong_mids,
lm(nat11 ~ sex + AGE +
I(fasbathroom + fasbedroom + fascomputers +
fasdishwash + fasfamcar + fasholidays)+fatherhome1 +
motherhome1 +talkfather +talkmother + I(famsup+famhelp)+
fmeal))
When I call a summary:
summary(pool(fitimp2))
I don't get the signif codes / asterisks, which isn't a huge deal, just kind of inconvenient, but more importantly, I don't get the R or Adjusted R squared like I would with a regular model summary.
My output looks like:
term estimate
1 (Intercept) 1.560567449
2 sex 0.219087438
3 AGE 0.005548590
4 I(fasbathroom + fasbedroom + fascomputers + fasdishwash + fasfamcar + fasholidays) -0.009028995
5 fatherhome1 -0.055150616
6 motherhome1 0.001564544
7 talkfather 0.115541883
8 talkmother 0.149495541
9 I(famsup + famhelp) -0.006991828
10 fmeal 0.081613347
std.error statistic df p.value
1 0.162643898 9.59499539 1118.93509 0.000000e+00
2 0.024588831 8.91003863 4984.09857 0.000000e+00
3 0.007672715 0.72315871 3456.13665 4.696313e-01
4 0.005495148 -1.64308498 804.41067 1.007561e-01
5 0.030861154 -1.78705617 574.98597 7.445506e-02
6 0.057226626 0.02733944 90.61856 9.782491e-01
7 0.012924577 8.93970310 757.72150 0.000000e+00
8 0.016306200 9.16801814 239.68789 0.000000e+00
9 0.003215294 -2.17455343 1139.07321 2.986886e-02
10 0.011343686 7.19460591 2677.98522 8.095746e-13
Any ideas how to get the Rsquared values to display? Thanks in advance.
Have you tried
for unadjusted r-sq:
pool.r.squared(fitimp2)
for adjusted r-sq:
pool.r.squared(fitimp2, adjusted = TRUE)
see pg. 51 of https://cran.r-project.org/web/packages/mice/mice.pdf

lm finding predicted values of a factor from a dataset that has been split

I have split a data set from the Column WithSTV, which is discrete, into two datasets. From there a run the lm for each dataset and again from there I need the predicted probabilities of a single discrete variable in these models. I must then plot this. I am getting the exact value for each which makes no sense. Please find the code and data below.
I have tried multiple packages and avenues including ggplot2, DAMisc, etc.
library(readstata13)
library(haven)
library(sjlabelled)
library(sjmisc)
library(sjstats)
library(ggeffects)
library(sjPlot)
dat <- read.dta13("STV.dta")
dat <- na.omit(dat)
zig<-split(dat, dat$WithSTV) ##split dataframe by WithSTV variable
##this == WithSTV=0
zig5<- zig[[1]] ##sperate by 1st level
blah55 <-lm(PercentRunoff1 ~ Statewide + Contested + nonpartisan + presidential_election + education_level, data=zig5)
summary(blah55)
##this == WithSTV=1
zig10<- zig[[2]] ##sperate by 2nd level
blah10 <-lm(PercentRunoff1 ~ Statewide + Contested + nonpartisan + presidential_election + education_level, data=zig10)
summary(blah10)
##WithSTV==0
d<-zig5
d$nonpartisan <- as.factor(d$nonpartisan)
fit<-lm(PercentRunoff1 ~ Statewide + Contested + nonpartisan + presidential_election + education_level, data=d)
d$predicted <- predict(fit) # Save the predicted values
d$residuals <- residuals(fit)
plot_model(fit, type = "pred", terms = c("nonpartisan"))
##WithSTV==1
d<-zig10
d$nonpartisan <- as.factor(d$nonpartisan)
fit2<-lm(PercentRunoff1 ~ Statewide + Contested + nonpartisan + presidential_election + education_level, data=d)
d$predicted <- predict(fit2) # Save the predicted values
d$residuals <- residuals(fit2)
plot_model(fit2, type = "pred", terms = c("nonpartisan"))
This is a link to the data. This is not a large file.
https://drive.google.com/file/d/1HBssOfb0QX6BTh6ipwlJCRf3ZOT5zKmE/view?usp=sharing
I am expecting the predicted values of nonpartisan for each model to not be identical or close to identical but more stacked/ stairs looking. So, for example, if I compare nonpartisan for zig5/ fit and zig10/fit they will say around 3 & 5 not both 8.
Thanks in advance.
So from not knowing your exact research question, I would include the WithSTV as a predictor in the model and not try to model them separately. I imagine that the intervention was on the entire sample so you're interested in the marginal difference between those exposed and those who were not exposed. If that is the case, leaving WithSTV in a single linear model is a good approach.
Here is the approach with that in mind that you use:
library(haven)
library(tidyverse)
# Read in the Data
dat <- haven::read_dta("STV.dta")
# Fit the Model including WithSTV as a fixed effect
blah10 <-lm(PercentRunoff1 ~ Statewide + Contested +
nonpartisan + presidential_election +
education_level + WithSTV, data=dat)
# Inspect the Coefficients
# install.packages("arm") # great utility package
arm::coefplot(blah10)
arm::display(blah10)
# Add the prediction intervals
marginal_combos <- model.matrix(PercentRunoff1 ~ Statewide + Contested +
nonpartisan + presidential_election +
education_level + WithSTV,
data = dat)
# Add the Predicted Values to Original Data with Pred Interval
dat_fitted <- dat %>%
bind_cols(pred = predict(blah10, newdata = ., interval="predict") %>%
as_tibble())
# Generate the Marginal Graph
# This basically will make the fitted lines and then the prediction intervals in gray
dat_fitted %>%
mutate(nonpartisan = as_factor(nonpartisan)) %>%
mutate(WithSTV = as_factor(WithSTV)) %>%
ggplot(aes(nonpartisan, fit, group = WithSTV))+
geom_line(aes(color = WithSTV))+
geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = .1)
This shows that the estimates are slightly different for those with and without for the point estimates, but that the prediction intervals are contained. So there isn't a huge, distinct, difference.
Hope this helps.
Based on your data, the differences between the two subsets of your data indeed seem to be rather small. This is true both for predicted values, adjusted by the covariates of your models, as well as the raw mean values of your outcome for each subgroup (see results at the very bottom).
If you would expect more differences, there might be an issue with your data (preparation)?
library(sjlabelled)
library(ggeffects)
library(sjPlot)
library(dplyr)
d <- read_stata("D:/Downloads/STV.dta")
dat <- na.omit(d)
zig <- split(dat, dat$WithSTV)
zig5 <- zig[[1]]
zig5$nonpartisan <- as.factor(zig5$nonpartisan)
fit <- lm(
PercentRunoff1 ~ Statewide + Contested + nonpartisan + presidential_election + education_level,
data = zig5
)
zig10 <- zig[[2]]
zig10$nonpartisan <- as.factor(zig10$nonpartisan)
fit2 <- lm(
PercentRunoff1 ~ Statewide + Contested + nonpartisan + presidential_election + education_level,
data = zig10
)
ggpredict(fit, "nonpartisan")
#>
#> # Predicted values of PercentRunoff1
#> # x = nonpartisan
#>
#> x predicted std.error conf.low conf.high
#> 0 0.095 0.003 0.090 0.100
#> 1 0.198 0.007 0.184 0.212
#>
#> Adjusted for:
#> * Statewide = 0.05
#> * Contested = 0.70
#> * presidential_election = 0.52
#> * education_level = 82.71
ggpredict(fit2, "nonpartisan")
#>
#> # Predicted values of PercentRunoff1
#> # x = nonpartisan
#>
#> x predicted std.error conf.low conf.high
#> 0 0.099 0.004 0.092 0.107
#> 1 0.268 0.007 0.255 0.282
#>
#> Adjusted for:
#> * Statewide = 0.05
#> * Contested = 0.77
#> * presidential_election = 0.43
#> * education_level = 82.56
dat %>%
group_by(WithSTV, nonpartisan) %>%
summarize(mean = mean(PercentRunoff1))
#> # A tibble: 4 x 3
#> # Groups: WithSTV [2]
#> WithSTV nonpartisan mean
#> <dbl> <dbl> <dbl>
#> 1 0 0 0.101
#> 2 0 1 0.165
#> 3 1 0 0.114
#> 4 1 1 0.223
Created on 2019-09-01 by the reprex package (v0.3.0)

Resources