Summary data frame from several multiple regression outputs in R

I am doing multiple OLS regressions. I have used the following lm call:
GroupNetReturnsStockPickers <- read.csv("GroupNetReturnsStockPickers.csv", header=TRUE, sep=",", dec=".")
ModelGroupNetReturnsStockPickers <- lm(StockPickersNet ~ Mkt.RF+SMB+HML+WML, data=GroupNetReturnsStockPickers)
names(GroupNetReturnsStockPickers)
summary(ModelGroupNetReturnsStockPickers)
Which gives me the summary output of:
Call:
lm(formula = StockPickersNet ~ Mkt.RF + SMB + HML + WML, data = GroupNetReturnsStockPickers)
Residuals:
Min 1Q Median 3Q Max
-0.029698 -0.005069 -0.000328 0.004546 0.041948
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.655e-05 5.981e-04 0.078 0.938
Mkt.RF -1.713e-03 1.202e-02 -0.142 0.887
SMB 3.006e-02 2.545e-02 1.181 0.239
HML 1.970e-02 2.350e-02 0.838 0.403
WML 1.107e-02 1.444e-02 0.766 0.444
Residual standard error: 0.009029 on 251 degrees of freedom
Multiple R-squared: 0.01033, Adjusted R-squared: -0.005445
F-statistic: 0.6548 on 4 and 251 DF, p-value: 0.624
This is perfect. However, I am running a total of 10 multiple OLS regressions, and I wish to create my own summary output in a data frame, extracting the intercept estimate, its t-value, and its p-value from each of the 10 analyses individually. Hence it would be a 3x10 data frame, where the column names would be Model1, Model2, ..., Model10, and the row names: Value, t-value and p-Value.
I appreciate any help.

There are a few packages that do this (stargazer and texreg), as well as this code for outreg.
In any case, if you are only interested in the intercept here is one approach:
# Estimate a bunch of different models, stored in a list
fits <- list() # Create empty list to store models
fits$model1 <- lm(Ozone ~ Solar.R, data = airquality)
fits$model2 <- lm(Ozone ~ Solar.R + Wind, data = airquality)
fits$model3 <- lm(Ozone ~ Solar.R + Wind + Temp, data = airquality)
# Combine the results for the intercept
do.call(cbind, lapply(fits, function(z) summary(z)$coefficients["(Intercept)", ]))
# RESULT:
# model1 model2 model3
# Estimate 18.598727772 7.724604e+01 -64.342078929
# Std. Error 6.747904163 9.067507e+00 23.054724347
# t value 2.756222869 8.518995e+00 -2.790841389
# Pr(>|t|) 0.006856021 1.052118e-13 0.006226638
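If you want exactly the layout described in the question (rows Value, t-value and p-Value, one column per model), you can subset and relabel that combined matrix. A minimal sketch, assuming your ten fits are stored in a list called fits as above:
# Keep only the Estimate, t value and p-value of each intercept
out <- do.call(cbind, lapply(fits, function(z)
  summary(z)$coefficients["(Intercept)", c("Estimate", "t value", "Pr(>|t|)")]))
rownames(out) <- c("Value", "t-value", "p-Value")
colnames(out) <- paste0("Model", seq_along(fits))
as.data.frame(out)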

Look at the broom package, which was created to do exactly what you are asking for. The only difference is that it puts the models into rows and the different statistics into columns, and I understand that you would prefer the opposite, but you can work around that afterwards if it is really necessary.
To give you an example, the function tidy() converts a model output into a dataframe.
model <- lm(mpg ~ cyl, data=mtcars)
summary(model)
Call:
lm(formula = mpg ~ cyl, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.9814 -2.1185 0.2217 1.0717 7.5186
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.8846 2.0738 18.27 < 2e-16 ***
cyl -2.8758 0.3224 -8.92 6.11e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.206 on 30 degrees of freedom
Multiple R-squared: 0.7262, Adjusted R-squared: 0.7171
F-statistic: 79.56 on 1 and 30 DF, p-value: 6.113e-10
And
library(broom)
tidy(model)
yields the following data frame:
term estimate std.error statistic p.value
1 (Intercept) 37.88458 2.0738436 18.267808 8.369155e-18
2 cyl -2.87579 0.3224089 -8.919699 6.112687e-10
Look at ?tidy.lm to see more options, for instance for confidence intervals, etc.
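For instance, a quick sketch of the confidence-interval option (conf.int and conf.level are arguments of tidy.lm):
# Same model as above; conf.int adds conf.low / conf.high columns
tidy(model, conf.int = TRUE, conf.level = 0.95)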
To combine the output of your ten models into one dataframe, you could use
library(dplyr)
bind_rows(one, two, three, ... , .id="models")
Or, if your different models come from regressions on the same data frame, you can do it all in one dplyr pipeline:
models <- mtcars %>% group_by(gear) %>% do(data.frame(tidy(lm(mpg~cyl, data=.), conf.int=T)))
Source: local data frame [6 x 8]
Groups: gear
gear term estimate std.error statistic p.value conf.low conf.high
1 3 (Intercept) 29.783784 4.5468925 6.550360 1.852532e-05 19.960820 39.6067478
2 3 cyl -1.831757 0.6018987 -3.043297 9.420695e-03 -3.132080 -0.5314336
3 4 (Intercept) 41.275000 5.9927925 6.887440 4.259099e-05 27.922226 54.6277739
4 4 cyl -3.587500 1.2587382 -2.850076 1.724783e-02 -6.392144 -0.7828565
5 5 (Intercept) 40.580000 3.3238331 12.208796 1.183209e-03 30.002080 51.1579205
6 5 cyl -3.200000 0.5308798 -6.027730 9.153118e-03 -4.889496 -1.5105036
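If your ten models instead live in a list (as in the first answer), a purrr-based sketch gets you the same kind of combined frame; the list name fits is an assumption here:
library(purrr)
library(dplyr)
# Tidy each fitted model and stack the rows, keeping the list names as an id column
map_dfr(fits, tidy, .id = "model") %>%
  filter(term == "(Intercept)") # keep only the intercept rows, per the question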

How do I change predictors in linear regression in a loop in R?

Below is an example along with the error. Can someone please fix it?
# sample data
mpg <- mpg
str(mpg)
# array of predictors
predictors <- c("hwy", "cty")
# loop over predictors
for (predictor in predictors)
{
# fit linear regression
model <- lm(formula = predictor ~ displ + cyl,
data = mpg)
# summary of model
summary(model)
}
Error
Error in model.frame.default(formula = predictor ~ displ + cyl, data = mpg, :
variable lengths differ (found for 'displ')
We may use paste or reformulate. Also, since this is a for loop, create an object beforehand to store the output from summary:
sumry_model <- vector('list', length(predictors))
names(sumry_model) <- predictors
for (predictor in predictors) {
# fit linear regression
model <- lm(reformulate(c("displ", "cyl"), response = predictor),
data = mpg)
# with paste
# model <- lm(formula = paste0(predictor, "~ displ + cyl"), data = mpg)
# summary of model
sumry_model[[predictor]] <- summary(model)
}
Output:
> sumry_model
$hwy
Call:
lm(formula = reformulate(c("displ", "cyl"), response = predictor),
data = mpg)
Residuals:
Min 1Q Median 3Q Max
-7.5098 -2.1953 -0.2049 1.9023 14.9223
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 38.2162 1.0481 36.461 < 2e-16 ***
displ -1.9599 0.5194 -3.773 0.000205 ***
cyl -1.3537 0.4164 -3.251 0.001323 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.759 on 231 degrees of freedom
Multiple R-squared: 0.6049, Adjusted R-squared: 0.6014
F-statistic: 176.8 on 2 and 231 DF, p-value: < 2.2e-16
$cty
Call:
lm(formula = reformulate(c("displ", "cyl"), response = predictor),
data = mpg)
Residuals:
Min 1Q Median 3Q Max
-5.9276 -1.4750 -0.0891 1.0686 13.9261
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 28.2885 0.6876 41.139 < 2e-16 ***
displ -1.1979 0.3408 -3.515 0.000529 ***
cyl -1.2347 0.2732 -4.519 9.91e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.466 on 231 degrees of freedom
Multiple R-squared: 0.6671, Adjusted R-squared: 0.6642
F-statistic: 231.4 on 2 and 231 DF, p-value: < 2.2e-16
This may also be done as a multivariate response:
summary(lm(cbind(hwy, cty) ~ displ + cyl, data = mpg))
Or, if we want to use the predictors vector:
summary(lm(as.matrix(mpg[predictors]) ~ displ + cyl, data = mpg))
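If you'd rather avoid the explicit for loop altogether, the same reformulate() idea drops into lapply(). A sketch:
# Fit one model per response and keep the summaries in a named list
sumry_model <- lapply(setNames(predictors, predictors), function(predictor) {
  summary(lm(reformulate(c("displ", "cyl"), response = predictor), data = mpg))
})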

How to extract p value from ca.po function in R?

I want to get the p-value of both ca.po models. Can someone show me how?
library(data.table)
library(urca)
dt_xy = as.data.table(timeSeries::LPP2005REC[, 2:3])
res = urca::ca.po(dt_xy, type = "Pz", demean = "trend", lag = "short")
summary(res)
And here are the results; I marked the p-values I need below.
Model 1 p-value = 0.9841
Model 2 p-value = 0.1363
########################################
# Phillips and Ouliaris Unit Root Test #
########################################
Test of type Pz
detrending of series with constant and linear trend
Response SPI :
Call:
lm(formula = SPI ~ zr + trd)
Residuals:
Min 1Q Median 3Q Max
-0.036601 -0.003494 0.000243 0.004139 0.024975
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.702e-04 7.954e-04 1.220 0.223
zrSPI -1.185e-02 5.227e-02 -0.227 0.821
zrSII -3.037e-02 1.374e-01 -0.221 0.825
trd -6.961e-07 3.657e-06 -0.190 0.849
Residual standard error: 0.007675 on 372 degrees of freedom
Multiple R-squared: 0.0004236, Adjusted R-squared: -0.007637
F-statistic: 0.05255 on 3 and 372 DF, p-value: 0.9841 <--- I need this p-value
Response SII :
Call:
lm(formula = SII ~ zr + trd)
Residuals:
Min 1Q Median 3Q Max
-0.0096931 -0.0018105 -0.0002734 0.0017166 0.0115427
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.598e-05 3.012e-04 -0.252 0.8010
zrSPI -1.068e-02 1.979e-02 -0.540 0.5897
zrSII -9.574e-02 5.201e-02 -1.841 0.0664 .
trd 1.891e-06 1.385e-06 1.365 0.1730
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.002906 on 372 degrees of freedom
Multiple R-squared: 0.01476, Adjusted R-squared: 0.006813
F-statistic: 1.858 on 3 and 372 DF, p-value: 0.1363 <--- I need this p-value
Value of test-statistic is: 857.4274
Critical values of Pz are:
10pct 5pct 1pct
critical values 71.9586 81.3812 102.0167
You have to dig into the res object to see its attributes and what's available there.
attributes(res)
...
#>
#> $testreg
#> Response SPI :
#>
#> Call:
#> lm(formula = SPI ~ zr + trd)
#>
...
A long list of objects is returned, but under testreg we can see what looks like the output of summary() called on an lm fit; testreg is one of the attributes of res. We can also access attributes of res using attr(res, "name"), so let's look at the components of testreg.
names(attributes(res))
#> [1] "z" "type" "model" "lag" "cval" "res"
#> [7] "teststat" "testreg" "test.name" "class"
names(attr(res, "testreg"))
#> [1] "Response SPI" "Response SII"
As you noted above, you're looking for 2 separate p-values, so it makes sense that we have two separate models. Let's retrieve these and look at what they are.
spi <- attr(res, "testreg")[["Response SPI"]]
sii <- attr(res, "testreg")[["Response SII"]]
class(spi)
#> [1] "summary.lm"
So, each of them is a summary.lm object. There's lots of documentation on how to extract p-values from lm or summary.lm objects, so let's use the method described here.
get_pval <- function(summary_lm) {
pf(
summary_lm$fstatistic[1L],
summary_lm$fstatistic[2L],
summary_lm$fstatistic[3L],
lower.tail = FALSE
)
}
get_pval(spi)
#> value
#> 0.9840898
get_pval(sii)
#> value
#> 0.1363474
And there you go, those are the two p-values you were interested in!
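Since attr(res, "testreg") is just a named list of summary.lm objects, you can also grab both p-values in one call, for example:
sapply(attr(res, "testreg"), function(s) unname(get_pval(s)))
#> Response SPI Response SII
#>    0.9840898    0.1363474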

Avoid losing formulas when applying the lm function over a list of formulas in R

I'm trying to take all pairs of variables in the mtcars data set and make a linear model using the lm function. But my approach is causing me to lose the formulas when I go to summarize or plot the models. Here is the code that I am using.
library(tidyverse)
my_vars <- names(mtcars)
pairs <- t(combn(my_vars, 2)) # Get all possible pairs of variables
# Create formulas for the lm model
fmls <-
as.tibble(pairs) %>%
mutate(fml = paste(V1, V2, sep = "~")) %>%
select(fml) %>%
.[[1]] %>%
sapply(as.formula)
# Create a linear model for each pair of variables
mods <- lapply(fmls, function(v) lm(data = mtcars, formula = v))
# print the summary of all variables
for (i in 1:length(mods)) {
print(summary(mods[[i]]))
}
(I snagged the idea of using strings to make formulas from the question "Pass a vector of variables into lm() formula".) Here is the output of the summary for the first model (summary(mods[[1]])):
Call:
lm(formula = v, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.9814 -2.1185 0.2217 1.0717 7.5186
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.8846 2.0738 18.27 < 2e-16 ***
cyl -2.8758 0.3224 -8.92 6.11e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.206 on 30 degrees of freedom
Multiple R-squared: 0.7262, Adjusted R-squared: 0.7171
F-statistic: 79.56 on 1 and 30 DF, p-value: 6.113e-10
I'm searching for a (perhaps metaprogramming) technique so that the call line looks something like lm(formula = var1 ~ var2, data = mtcars) as opposed to formula = v.
I made pairs into a data frame, to make life easier:
library(tidyverse)
my_vars <- names(mtcars)
pairs <- t(combn(my_vars, 2)) %>%
  as.data.frame() # Get all possible pairs of variables
You can do this using eval() which evaluates an expression.
listOfRegs <- apply(pairs, 1, function(pair) {
  V1 <- as.character(pair[[1]])
  V2 <- as.character(pair[[2]])
  # Build the model call as text, then evaluate it
  fit <- eval(parse(text = paste0("lm(", V1, " ~ ", V2, ", data = mtcars)")))
  return(fit)
})
lapply(listOfRegs, summary)
Then:
> lapply(listOfRegs, summary)
[[1]]
Call:
lm(formula = mpg ~ cyl, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.9814 -2.1185 0.2217 1.0717 7.5186
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.8846 2.0738 18.27 < 2e-16 ***
cyl -2.8758 0.3224 -8.92 6.11e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.206 on 30 degrees of freedom
Multiple R-squared: 0.7262, Adjusted R-squared: 0.7171
F-statistic: 79.56 on 1 and 30 DF, p-value: 6.113e-10
... etc
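If you also want the Call: line to show the expanded formula rather than a variable name, one alternative sketch is to build the call with do.call(), which substitutes the evaluated formula object into the stored call (using the fmls list from the question):
# do.call() records the evaluated arguments, so the stored call prints as
# lm(formula = mpg ~ cyl, data = mtcars) instead of lm(formula = v, ...)
mods <- lapply(fmls, function(f) do.call("lm", list(formula = f, data = quote(mtcars))))
summary(mods[[1]])$call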

Running a regression

Background: my data set has 52 rows and 12 columns (assume column names are A - L) and the name of my data set is foo
I am told to run a regression where foo$L is the dependent variable, and all other variables are independent except for foo$K.
The way I was doing it is
fit <- lm(foo$L ~ foo$A + ... + foo$J)
then calling
summary(fit)
Is my way a good way to run a regression and find the intercept and coefficients?
Use the data argument to lm so you don't have to use the foo$ syntax for each predictor. Use dependent ~ . as the formula to have the dependent variable predicted by all other variables. Then you can use - K to exclude K:
data_mat = matrix(rnorm(52 * 12), nrow = 52)
df = as.data.frame(data_mat)
colnames(df) = LETTERS[1:12]
lm(L ~ . - K, data = df)
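Once fitted, coef() (or summary()) gives you the intercept and slope estimates directly, for example:
fit <- lm(L ~ . - K, data = df)
coef(fit)    # named vector: (Intercept), then one slope per predictor A-J
summary(fit) # full coefficient table with std. errors, t values and p-values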
You can first remove column K and then do fit <- lm(L ~ ., data = foo). This will treat the L column as the dependent variable and all the other columns as independent variables. You don't have to specify each column name in the formula.
Here is an example using the mtcars, fitting a multiple regression model to mpg with all the other variables except carb.
mtcars2 <- mtcars[, !names(mtcars) %in% "carb"]
fit <- lm(mpg ~ ., data = mtcars2)
summary(fit)
# Call:
# lm(formula = mpg ~ ., data = mtcars2)
#
# Residuals:
# Min 1Q Median 3Q Max
# -3.3038 -1.6964 -0.1796 1.1802 4.7245
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 12.83084 18.18671 0.706 0.48790
# cyl -0.16881 0.99544 -0.170 0.86689
# disp 0.01623 0.01290 1.259 0.22137
# hp -0.02424 0.01811 -1.339 0.19428
# drat 0.70590 1.56553 0.451 0.65647
# wt -4.03214 1.33252 -3.026 0.00621 **
# qsec 0.86829 0.68874 1.261 0.22063
# vs 0.36470 2.05009 0.178 0.86043
# am 2.55093 2.00826 1.270 0.21728
# gear 0.50294 1.32287 0.380 0.70745
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 2.593 on 22 degrees of freedom
# Multiple R-squared: 0.8687, Adjusted R-squared: 0.8149
# F-statistic: 16.17 on 9 and 22 DF, p-value: 9.244e-08

lm: use product of two variables as a single variable

I am running the following piece of code:
lm(ath ~ HAPP + IQ2 + OPEN2 + INCOME*EXPEC,data=data)
Which, of course, led me to the output:
Standardized weighted residuals 2:
Min 1Q Median 3Q Max
-3.2644 -0.5461 -0.0223 0.4158 3.2217
Coefficients (mean model with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 5.730e+00 3.141e+00 1.824 0.068112 .
HAPP -7.765e-02 8.958e-02 -0.867 0.386014
IQ2 5.080e-04 7.453e-05 6.816 9.38e-12 ***
OPEN2 -5.038e-06 5.114e-06 -0.985 0.324640
INCOME -1.837e-02 1.211e-01 -0.152 0.879395
EXPEC -3.336e-01 1.161e-01 -2.873 0.004067 **
INCOME:EXPEC 2.645e-03 7.597e-04 3.481 0.000499 ***
Phi coefficients (precision model with identity link):
Estimate Std. Error z value Pr(>|z|)
(phi) 9.489 1.363 6.96 3.41e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Type of estimator: ML (maximum likelihood)
Log-likelihood: 222.5 on 8 Df
Pseudo R-squared: 0.6938
Number of iterations: 36 (BFGS) + 4 (Fisher scoring)
I need to drop the INCOME and EXPEC lines (with Estimate, Std. Error, z value and Pr(>|z|)) from the regression output in a really elegant way (I need to run about a million models, so I can't do it by hand one by one). Please note that those variables (INCOME and EXPEC) were not included in the original set of individual variables. That is, only the requested variables (and the demanded interactions, of course) should be printed.
Any piece of advice?
Thanks!!! :D
You can use the I() (AsIs) function. See the example below:
fit <- lm(Sepal.Length ~ Sepal.Width + I(Petal.Length * Petal.Width) , data = iris)
fit
# Call:
# lm(formula = Sepal.Length ~ Sepal.Width + I(Petal.Length * Petal.Width),
# data = iris)
#
# Coefficients:
# (Intercept) Sepal.Width
# 4.1072 0.2688
# I(Petal.Length * Petal.Width)
# 0.1578
library(broom)
tidy(fit)
# term estimate std.error statistic p.value
# 1 (Intercept) 4.1072163 0.266529393 15.409994 1.702125e-32
# 2 Sepal.Width 0.2687704 0.081280587 3.306698 1.186597e-03
# 3 I(Petal.Length * Petal.Width) 0.1578160 0.007517941 20.991921 4.426899e-46
If you only need part of the coefficients you can just use the coef function from base R and subset the indices you like. For example:
a1 <- lm(Sepal.Length ~ Sepal.Width + I(Petal.Length * Petal.Width) , data = iris)
coefficients(a1)[1:2]
(Intercept) Sepal.Width
4.1072163 0.2687704
If you need the formula call as well you could do a1$call
a1$call
lm(formula = Sepal.Length ~ Sepal.Width + I(Petal.Length * Petal.Width),
data = iris)
Or if you need any other argument just take a look at str(a1) or str(summary(a1))
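For the many-models case in the question, one option is to combine tidy() with a filter that drops the unwanted main-effect rows. A sketch, where drop_main_effects is a hypothetical helper and mod stands for one of your fitted models:
library(broom)
library(dplyr)
# Drop the main-effect rows added by the * expansion; keep everything else
drop_main_effects <- function(mod, drop = c("INCOME", "EXPEC")) {
  tidy(mod) %>% filter(!term %in% drop)
}
drop_main_effects(mod) # mod: one of your fitted models (hypothetical name)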
