I'm using imputed data (via R's mice package) to carry out some linear regressions, e.g.:
fitimp2 <- with(impdatlong_mids,
                lm(nat11 ~ sex + AGE +
                     I(fasbathroom + fasbedroom + fascomputers +
                         fasdishwash + fasfamcar + fasholidays) +
                     fatherhome1 + motherhome1 + talkfather + talkmother +
                     I(famsup + famhelp) + fmeal))
When I call a summary:
summary(pool(fitimp2))
I don't get the significance codes / asterisks, which isn't a huge deal, just inconvenient. More importantly, I don't get the R-squared or adjusted R-squared that I would get with a regular model summary.
My output looks like:
   term                                                                                   estimate   std.error   statistic         df      p.value
1  (Intercept)                                                                         1.560567449 0.162643898  9.59499539 1118.93509 0.000000e+00
2  sex                                                                                 0.219087438 0.024588831  8.91003863 4984.09857 0.000000e+00
3  AGE                                                                                 0.005548590 0.007672715  0.72315871 3456.13665 4.696313e-01
4  I(fasbathroom + fasbedroom + fascomputers + fasdishwash + fasfamcar + fasholidays) -0.009028995 0.005495148 -1.64308498  804.41067 1.007561e-01
5  fatherhome1                                                                        -0.055150616 0.030861154 -1.78705617  574.98597 7.445506e-02
6  motherhome1                                                                         0.001564544 0.057226626  0.02733944   90.61856 9.782491e-01
7  talkfather                                                                          0.115541883 0.012924577  8.93970310  757.72150 0.000000e+00
8  talkmother                                                                          0.149495541 0.016306200  9.16801814  239.68789 0.000000e+00
9  I(famsup + famhelp)                                                                -0.006991828 0.003215294 -2.17455343 1139.07321 2.986886e-02
10 fmeal                                                                               0.081613347 0.011343686  7.19460591 2677.98522 8.095746e-13
Any ideas how to get the R-squared values to display? Thanks in advance.
Have you tried
pool.r.squared(fitimp2)
for the unadjusted R-squared, and
pool.r.squared(fitimp2, adjusted = TRUE)
for the adjusted R-squared?
See p. 51 of https://cran.r-project.org/web/packages/mice/mice.pdf
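For reference, per the mice documentation, pool.r.squared() returns a small matrix with the pooled estimate, its 95% confidence interval, and the fraction of missing information, roughly along these lines (illustrative layout only; the values depend on your data):
pool.r.squared(fitimp2)
#          est   lo 95   hi 95   fmi
# R^2      ...     ...     ...   ...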
I created a function using sapply() to run 153k linear regressions and extract the estimates, standard errors, p-values, and the associated row name from the first column. The code should be self-explanatory:
library(dplyr)  # for bind_cols()

run_lms <- sapply(1:nrow(test_wide[, -1]), function(x) {
  lm_output <- lm(unlist(test_wide[x, -1]) ~ survey_clean_for_lm$CR +
                    survey_clean_for_lm$cbage + survey_clean_for_lm$sex +
                    survey_clean_for_lm$bmistrat + survey_clean_for_lm$deidsite +
                    survey_clean_for_lm$snppc1 + survey_clean_for_lm$snppc2 +
                    survey_clean_for_lm$snppc3 + survey_clean_for_lm$methpc1 +
                    survey_clean_for_lm$methpc2 + survey_clean_for_lm$methpc3 +
                    survey_clean_for_lm$methpc4 + survey_clean_for_lm$methpc5 +
                    survey_clean_for_lm$methpc6 + survey_clean_for_lm$methpc7)
  lm_summary <- summary(lm_output)
  estimate <- lm_summary$coefficients[2, 1]
  se       <- lm_summary$coefficients[2, 2]
  pval     <- lm_summary$coefficients[2, 4]
  bind_cols(test_wide[x, 1], estimate, se, pval)
})
It takes nearly 10 hours to run 153k regressions and store the output. I'm wondering if anyone has advice for speeding this up. I think the bind_cols() portion is part of the problem, but I'm not sure how else to structure and save the output. Ultimately, I want this format:
probe estimate se pval
<chr> <dbl> <dbl> <dbl>
1 cg20272595 0.00556 0.00135 0.0000600
2 cg13995374 0.00466 0.00114 0.0000654
3 cg05254132 0.00367 0.000897 0.0000658
4 cg10049251 -0.00727 0.00179 0.0000746
5 cg19695507 -0.0108 0.00274 0.000117
6 cg21590616 0.00687 0.00176 0.000136
7 cg04089674 -0.00718 0.00186 0.000158
8 cg16907093 -0.00506 0.00132 0.000184
9 cg04600792 -0.00593 0.00156 0.000193
10 cg10529757 0.0122 0.00322 0.000199
# … with 151,853 more rows
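One big speedup is available because every one of the 153k regressions uses the identical predictors: build the design matrix once and fit all outcomes in a single call. lm.fit() accepts a matrix response, so the shared design matrix is QR-factorized once instead of 153k times, and the output is assembled in one go rather than via repeated bind_cols(). A sketch, assuming test_wide[, -1] is all numeric with no NAs, its column (sample) order matches the rows of survey_clean_for_lm (as the original loop implicitly does), the design matrix is full rank (so no pivoting in the QR), and CR's coefficient is row 2 as in coefficients[2, 1] above:

# Shared design matrix, built once
X <- model.matrix(~ CR + cbage + sex + bmistrat + deidsite +
                    snppc1 + snppc2 + snppc3 +
                    methpc1 + methpc2 + methpc3 + methpc4 +
                    methpc5 + methpc6 + methpc7,
                  data = survey_clean_for_lm)
Y <- t(as.matrix(test_wide[, -1]))   # samples x probes: one column per probe

fit <- lm.fit(X, Y)                  # single QR factorization for all fits
df  <- fit$df.residual
rss <- colSums(fit$residuals^2)      # per-probe residual sum of squares

XtXi <- chol2inv(qr.R(fit$qr))       # (X'X)^{-1}, assuming no pivoting
est  <- fit$coefficients[2, ]        # CR estimate for every probe
se   <- sqrt(XtXi[2, 2] * rss / df)  # matching standard errors
pval <- 2 * pt(abs(est / se), df, lower.tail = FALSE)

out <- tibble::tibble(probe = test_wide[[1]],
                      estimate = est, se = se, pval = pval)

This replaces 153k separate lm()/summary() calls (and 153k bind_cols() allocations) with a handful of matrix operations, which is usually orders of magnitude faster.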
The poly() function in R is used to produce orthogonal polynomial vectors and can be helpful for interpreting coefficient significance. However, I don't see the point of using it for prediction. In my view, the following two models (model_1 and model_2) should produce the same predictions.
q=1:11
v=c(3,5,7,9.2,14,20,26,34,50,59,80)
model_1=lm(v~poly(q,2))
model_2=lm(v~1+q+q^2)
predict(model_1)
predict(model_2)
But they don't. Why?
Because they are not the same model. Your second one has only one covariate, while the first has two.
> model_2
Call:
lm(formula = v ~ 1 + q + q^2)
Coefficients:
(Intercept) q
-15.251 7.196
You should use the I() function to protect the term inside your formula so the regression treats it as a covariate (inside a formula, ^ denotes interaction crossing, so q^2 collapses to just q):
model_2=lm(v~1+q+I(q^2))
> model_2
Call:
lm(formula = v ~ 1 + q + I(q^2))
Coefficients:
(Intercept) q I(q^2)
7.5612 -3.3323 0.8774
This gives the same predictions:
> predict(model_1)
1 2 3 4 5 6 7 8 9 10 11
5.106294 4.406154 5.460793 8.270210 12.834406 19.153380 27.227133 37.055664 48.638974 61.977063 77.069930
> predict(model_2)
1 2 3 4 5 6 7 8 9 10 11
5.106294 4.406154 5.460793 8.270210 12.834406 19.153380 27.227133 37.055664 48.638974 61.977063 77.069930
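Incidentally, if you want poly() to reproduce model_2's coefficients as well, not just its fitted values, you can ask for the raw (non-orthogonal) basis. A quick check:

model_3 <- lm(v ~ poly(q, 2, raw = TRUE))  # raw powers q and q^2
coef(model_3)                              # same as lm(v ~ 1 + q + I(q^2))
all.equal(predict(model_1), predict(model_3))
# [1] TRUE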
I use the following sample code to run an AR(1) process on some data (just numbers I picked to check the function):
> data
[1] 3 7 4 6 2 8 5 4
> data_ts
Time Series:
Start = 1
End = 8
Frequency = 1
[1] 3 7 4 6 2 8 5 4
> arima(data_ts,order=c(1,0,0))
Call:
arima(x = data_ts, order = c(1, 0, 0))
Coefficients:
ar1 intercept
-0.6965 5.0323
s.e. 0.2334 0.2947
sigma^2 estimated as 1.769: log likelihood = -13.97, aic = 33.93
The residuals are:
> arima(data_ts,order=c(1,0,0))$resid
Time Series:
Start = 1
End = 8
Frequency = 1
[1] -1.4581973 0.5521706 0.3383218 0.2487084 -2.3582160 0.8556328 2.0348596
[8] -1.0547538
Now, the coefficient should be -0.6965 and the intercept 5.0323. I'd like to verify the result, so I'm plugging the estimated parameters into the AR(1) equation, i.e.:
data[8] = intercept + coefficient * data[7] + residual[8]
but it never comes out right. What am I doing wrong? BTW, trying the ar function produces different results:
ar(x = data_ts, aic = FALSE, order.max = 1, method = "ols")
Coefficients:
1
-0.6786
Intercept: 0.3527 (0.4951)
Order selected 1   sigma^2 estimated as 1.709
And still, when I plug the estimated parameters and errors into the equation, the result isn't correct. Any idea?
OK, found the answer at http://www.stat.pitt.edu/stoffer/tsa2/Rissues.htm: what arima() reports as the "intercept" is really the mean of the series, so the actual intercept is intercept * (1 - coefficient).
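With that correction, the recursion checks out against the output above. A quick sketch using the standard coef() and residuals() accessors:

fit <- arima(data_ts, order = c(1, 0, 0))
phi <- coef(fit)["ar1"]        # -0.6965
mu  <- coef(fit)["intercept"]  #  5.0323, really the process mean
res <- residuals(fit)

# AR(1): x_t = mu*(1 - phi) + phi*x_{t-1} + e_t
mu * (1 - phi) + phi * data_ts[7] + res[8]
# approx. 4, which matches data[8]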
I would like to extract the variance-covariance matrix from a simple plm fixed effects model. For example:
library(plm)
data("Grunfeld")
M1 <- plm(inv ~ lag(inv) + value + capital, index = 'firm',
data = Grunfeld)
The usual vcov function gives me:
vcov(M1)
lag(inv) value capital
lag(inv) 3.561238e-03 -7.461897e-05 -1.064497e-03
value -7.461897e-05 9.005814e-05 -1.806683e-05
capital -1.064497e-03 -1.806683e-05 4.957097e-04
plm's fixef function only gives:
fixef(M1)
1 2 3 4 5 6 7
-286.876375 -97.190009 -209.999074 -53.808241 -59.348086 -34.136422 -34.397967
8 9 10
-65.116699 -54.384488 -6.836448
Any help extracting the variance-covariance matrix that includes the fixed effects would be much appreciated.
Using names() is sometimes very useful:
names(M1)
[1] "coefficients" "vcov" "residuals" "df.residual"
[5] "formula" "model" "args" "call"
M1$vcov
lag(inv) value capital
lag(inv) 1.265321e-03 3.484274e-05 -3.395901e-04
value 3.484274e-05 1.336768e-04 -7.463365e-05
capital -3.395901e-04 -7.463365e-05 3.662395e-04
Picking up your example, do the following to get the standard errors (if that is what you are interested in; it is not the whole variance-covariance matrix):
library(plm)
data("Grunfeld")
M1 <- plm(inv ~ lag(inv) + value + capital, index = 'firm',
data = Grunfeld)
fix <- fixef(M1)
fix_se <- attr(fix, "se")
fix_se
1 2 3 4 5 6 7 8 9 10
43.453642 25.948160 20.294977 11.245009 12.472005 9.934159 10.554240 11.083221 10.642589 9.164694
You can also use the summary function for more info:
summary(fix)
Estimate Std. Error t-value Pr(>|t|)
1 -286.8764 43.4536 -6.6019 4.059e-11 ***
2 -97.1900 25.9482 -3.7455 0.0001800 ***
3 -209.9991 20.2950 -10.3473 < 2.2e-16 ***
4 -53.8082 11.2450 -4.7851 1.709e-06 ***
5 -59.3481 12.4720 -4.7585 1.950e-06 ***
6 -34.1364 9.9342 -3.4363 0.0005898 ***
7 -34.3980 10.5542 -3.2592 0.0011174 **
8 -65.1167 11.0832 -5.8753 4.222e-09 ***
9 -54.3845 10.6426 -5.1101 3.220e-07 ***
10 -6.8364 9.1647 -0.7460 0.4556947
Btw, the documentation explains the "se" attribute:
"Value: An object of class "fixef". It is a numeric vector containing the fixed effects with attribute se which contains the standard errors. [...]"
Note: You might need the latest development version for that because much has improved there about fixef: https://r-forge.r-project.org/R/?group_id=406
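If what you are after really is the joint variance-covariance matrix covering the fixed effects as well (not just their standard errors), one workaround, sketched here rather than taken from plm's API, is to refit the equivalent least-squares dummy-variable (LSDV) model with lm(); its vcov() then covers every firm intercept. Note the lag must be the panel-aware lag from plm, built per firm beforehand:

library(plm)
data("Grunfeld")
pG <- pdata.frame(Grunfeld, index = "firm")
pG$lag_inv <- lag(pG$inv)  # plm's panel-aware lag, not stats::lag

# LSDV: explicit firm dummies instead of the within transformation
LSDV <- lm(inv ~ lag_inv + value + capital + factor(firm) - 1, data = pG)
vcov(LSDV)  # covariance matrix for the slopes and all 10 firm intercepts

The slope estimates should reproduce the within estimates from plm, and the dummy coefficients should reproduce fixef(M1).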
I would like to regress the dependent variable of my logistic regression (in my data set it's dat$admit) on all available independent variables: singly, in pairs, and in trios (3 independent variables), each regression with a different combination of independent variables. The output I would like to get back is a list with one row per regression summary: coefficients, p-values, AUC, and 95% CI. Using the data set below, there should be 7 regressions:
dat$admit vs dat$female
dat$admit vs dat$apcalc
dat$admit vs dat$num
dat$admit vs dat$female + dat$apcalc
dat$admit vs dat$female + dat$num
dat$admit vs dat$apcalc + dat$num
dat$admit vs dat$female + dat$apcalc + dat$num
Here is a sample data set (where dat$admit is the logistic regression dependent variable) :
dat <- read.table(text = " female apcalc admit num
0 0 0 7
0 0 1 1
0 1 0 3
0 1 1 7
1 0 0 5
1 0 1 1
1 1 0 0
1 1 1 6",header = TRUE)
Per @marek's comment, the output should look like this (for female alone and for female & apcalc):
#                  Intercept     Estimate P-Value (Intercept) P-Value (Estimate) AUC
# female        0.000000e+00 0.000000e+00                   1                  1 0.5
# female+apcalc 0.000000e+00 0.000000e+00                   1                  1 0.5
@David Arenburg wrote good code that produces these stats, but without creating the models for the pairs and trios, so I would like to know how I can add the model creation. Here is David Arenburg's code:
library(caTools)
ResFunc <- function(x) {
temp <- glm(reformulate(x,response="admit"), data=dat,family=binomial)
c(summary(temp)$coefficients[,1],
summary(temp)$coefficients[,4],
colAUC(predict(temp, type = "response"), dat$admit))
}
temp <- as.data.frame(t(sapply(setdiff(names(dat),"admit"), ResFunc)))
colnames(temp) <- c("Intercept", "Estimate", "P-Value (Intercept)", "P-Value (Estimate)", "AUC")
temp
# Intercept Estimate P-Value (Intercept) P-Value (Estimate) AUC
# female 0.000000e+00 0.000000e+00 1 1 0.5
# apcalc 0.000000e+00 0.000000e+00 1 1 0.5
# num 5.177403e-16 -1.171295e-16 1 1 0.5
Any idea how to create this list? Thanks, Ron
A simple solution is to make the list of models by hand:
results <- list(
"female" = glm(admit~female , family=binomial, dat)
,"apcalc" = glm(admit~apcalc , family=binomial, dat)
,"num" = glm(admit~num , family=binomial, dat)
,"female + apcalc" = glm(admit~female + apcalc, family=binomial, dat)
,"female + num" = glm(admit~female + num , family=binomial, dat)
,"apcalc + num" = glm(admit~apcalc + num , family=binomial, dat)
,"all" = glm(admit~female + apcalc + num, family=binomial, dat)
)
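If writing the list out by hand gets tedious, here is a sketch of building the same seven models programmatically with combn() and reformulate(), reusing the dat from the question:

vars   <- setdiff(names(dat), "admit")
combos <- unlist(lapply(1:3, function(k) combn(vars, k, simplify = FALSE)),
                 recursive = FALSE)           # all singles, pairs, and trios
results <- lapply(combos, function(v)
  glm(reformulate(v, response = "admit"), family = binomial, data = dat))
names(results) <- sapply(combos, paste, collapse = " + ")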
Then you can inspect the models by lapply()ing over the list of models:
lapply(results, summary)
Or more advanced (coefficient statistics):
require(plyr)
ldply(results, function(m) {
name_rows(as.data.frame(summary(m)$coefficients))
})
In the same way you can extract any information you want. Just write a function that takes a glm model as its argument and extracts the statistics you need:
get_everything_i_want <- function(model) {
  # ... extract whatever you need, e.g.:
  list(AIC = AIC(model))
}
and then apply to each model:
lapply(results, get_everything_i_want)
# $female
# $female$AIC
# [1] 15.0904
# $apcalc
# $apcalc$AIC
# [1] 15.0904
# $num
# $num$AIC
# [1] 15.0904
# $`female + apcalc`
# $`female + apcalc`$AIC
# [1] 17.0904
# $`female + num`
# $`female + num`$AIC
# [1] 17.0904
# $`apcalc + num`
# $`apcalc + num`$AIC
# [1] 17.0904
# $all
# $all$AIC
# [1] 19.0904