Extracting t-statistic p-values from lm in R

I have run a regression model in R using the lm function. The resulting ANOVA table gives me an F-value for each coefficient, which doesn't really make sense to me. What I would like to know is the t-statistic for each coefficient and its corresponding p-value. How do I get this? Is it stored by the function, or does it require additional computation?
Here is the code and output:
library(lubridate)
library(RCurl)
library(plyr)
fit <- lm(btc_close ~ vix_close + gold_close + eth_close, data = all_dat)
# Other useful functions
coefficients(fit) # model coefficients
confint(fit, level=0.95) # CIs for model parameters
anova(fit) # anova table
Output:
Analysis of Variance Table
Response: btc_close
Df Sum Sq Mean Sq F value Pr(>F)
vix_close 1 20911897 20911897 280.1788 <2e-16 ***
gold_close 1 91902 91902 1.2313 0.2698
eth_close 1 42716393 42716393 572.3168 <2e-16 ***
Residuals 99 7389130 74638
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
If my statistics knowledge serves me correctly, these F-values are meaningless. Theoretically, I should receive an F-value for the model and a t-value for each coefficient.

Here is an example with comments of how you can extract just the t-values.
# Some dummy data
n <- 1e3L
df <- data.frame(x = rnorm(n), z = rnorm(n))
df$y <- with(df, 0.01 * x^2 + z/3)
# Run regression
lr1 <- lm(y ~ x + z, data = df)
# R has special summary method for class "lm"
summary(lr1)
# Call:
# lm(formula = y ~ x + z, data = df)
# Residuals:
# Min 1Q Median 3Q Max
# -0.010810 -0.009025 -0.005259 0.003617 0.096771
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.0100122 0.0004313 23.216 <2e-16 ***
# x 0.0008105 0.0004305 1.883 0.06 .
# z 0.3336034 0.0004244 786.036 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# Residual standard error: 0.01363 on 997 degrees of freedom
# Multiple R-squared: 0.9984, Adjusted R-squared: 0.9984
# F-statistic: 3.09e+05 on 2 and 997 DF, p-value: < 2.2e-16
# Now, if you only want the t-values
summary(lr1)[["coefficients"]][, "t value"]
# Or (better practice as explained in comments by Axeman)
coef(summary(lr1))[, "t value"]
# (Intercept) x z
# 23.216317 1.882841 786.035718
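For the corresponding p-values, the same coefficient matrix can be indexed by its "Pr(>|t|)" column (a minimal sketch using the lr1 fit above):
coef(summary(lr1))[, "Pr(>|t|)"]
# returns a named numeric vector of p-values, one per coefficient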

You could try this:
summary(fit)

summary(fit)$coefficients[,4]  # p-values
summary(fit)$coefficients[,3]  # t-values

As Benjamin has already answered, I would advise using broom::tidy() to coerce the model object to a tidy dataframe. The statistic column will contain the relevant test statistic and is easily available for plotting with ggplot2.
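For example, a minimal sketch with the fit object from the question (assuming the broom package is installed):
library(broom)
tidy(fit)
# returns a data frame with columns term, estimate, std.error, statistic
# (the t-value) and p.value, one row per coefficient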

To extract only the t-values, you can use:
summary(fit)$coefficients[,3]

Related

multiple regression models results from a list in R

I ran multiple regressions in R and saved them in the environment as a list of length ncol(data). To display one regression result and its summary I use the command summary(lm_results[[1]]), which prints the following:
Call:
lm(formula = fml, data = data)
Residuals:
Min 1Q Median 3Q Max
-4.1615 -0.9830 -0.3605 0.3508 4.5893
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.04464 0.91212 0.049 0.961506
X2 0.34424 0.08067 4.267 0.000464 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.975 on 18 degrees of freedom
Multiple R-squared: 0.5029, Adjusted R-squared: 0.4753
F-statistic: 18.21 on 1 and 18 DF, p-value: 0.0004637
I want to print all the regression results in one command, something like
for(i in 1:ncol(data)) Regress[i] <- summary(lm_results[[i]])
and also to be able to extract only the R-squared or adjusted R-squared values of all regression models (and format them in one data frame). How can I do that in R?
You can try any of these approaches (I have used some simulated data):
#Option 1
lapply(listofmodels,function(x)summary(x)[8])
Output:
$model1
$model1$r.squared
[1] 0.01382265
$model2
$model2$r.squared
[1] 0.9271098
Or:
#Option 2
lapply(listofmodels,function(x)summary(x)[['r.squared']])
Output:
$model1
[1] 0.01382265
$model2
[1] 0.9271098
Some data used:
#Data
listofmodels <- list(model1=lm(iris$Sepal.Length~iris$Sepal.Width),model2=lm(iris$Petal.Width~iris$Petal.Length))
We could tidy or glance the model output with broom and extract the relevant component
library(broom)
library(purrr)
map_dfr(listofmodels, tidy)
To extract only the 'r.squared'
map_dfr(listofmodels, ~ glance(.x) %>%
select(r.squared))
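Since the question also asks for the adjusted R-squared values collected in one data frame, a small extension of the same idea (assuming dplyr is loaded for select()):
library(dplyr)
# glance() returns one row of model-level statistics per model;
# .id keeps the list names as a "model" column
map_dfr(listofmodels, glance, .id = "model") %>%
  select(model, r.squared, adj.r.squared)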

How R graphs regressions with a factor IV

I usually use SAS but I am trying to use R more. I am trying to show how categorizing a continuous independent variable messes up regressions. So I created some data:
set.seed(1234) #sets a seed. It is good to use the same seed all the time.
x <- rnorm(100) #X is now normally distributed with mean 0 and sd 1, N = 100
y <- 3*x + rnorm(100,0,10) #Y is related to x, but with some noise
x2 <- cut(x, 2) #Cuts x into 2 parts
Then I ran a regression on x2:
m2 <- lm(y~as.factor(x2)) #A model with the cut variable
summary(m2)
and the summary was what I expected: A coefficient for the intercept and one for the dummy variable:
Call:
lm(formula = y ~ as.factor(x2))
Residuals:
Min 1Q Median 3Q Max
-30.4646 -6.5614 0.4409 5.4936 29.6696
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.403 1.290 -1.088 0.2795
as.factor(x2)(0.102,2.55] 4.075 2.245 1.815 0.0726 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 10.56 on 98 degrees of freedom
Multiple R-squared: 0.03253, Adjusted R-squared: 0.02265
F-statistic: 3.295 on 1 and 98 DF, p-value: 0.07257
But when I graphed x vs. y and added a line for the regression from m2, the line was smooth - I would have expected a jump where x2 goes from 0 to 1.
plot(x,y)
abline(reg = m2)
What am I doing wrong? Or am I missing something basic?
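One way to see what is happening (a sketch reusing the x, y, and m2 objects above): abline() only draws a single straight line from the first two coefficients of the model, so it cannot show the step implied by the dummy variable; plotting the fitted values instead makes the jump visible.
plot(x, y)
# order by x and draw the piecewise-constant fitted values;
# the jump appears at the cut point of x2
ord <- order(x)
lines(x[ord], fitted(m2)[ord], type = "s", col = "red", lwd = 2)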

Summary Extract Correlation Coefficient

I am using lm() on a large data set in R. Using summary() one can get a lot of details about the linear regression between these two variables.
The part I am confused about is: which value in the Coefficients section of the summary is the correct one to use as the correlation coefficient?
Sample Data
c1 <- c(1:10)
c2 <- c(10:19)
output <- summary(lm(c1 ~ c2))
Summary
Call:
lm(formula = c1 ~ c2)
Residuals:
Min 1Q Median 3Q Max
-2.280e-15 -8.925e-16 -2.144e-16 4.221e-16 4.051e-15
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -9.000e+00 2.902e-15 -3.101e+15 <2e-16 ***
c2 1.000e+00 1.963e-16 5.093e+15 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.783e-15 on 8 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 2.594e+31 on 1 and 8 DF, p-value: < 2.2e-16
Is this the correlation coefficient I should use?
output$coefficients[2,1]
1
Please suggest, thanks.
The full variance-covariance matrix of the coefficient estimates is:
fm <- lm(c1 ~ c2)
vcov(fm)
and in particular sqrt(diag(vcov(fm))) equals coef(summary(fm))[, 2], the column of standard errors.
The corresponding correlation matrix is:
cov2cor(vcov(fm))
The correlation between the coefficient estimates is:
cov2cor(vcov(fm))[1, 2]
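To make the equality mentioned above concrete, here is a quick check reusing the c1 and c2 vectors from the question:
c1 <- c(1:10)
c2 <- c(10:19)
fm <- lm(c1 ~ c2)
# the standard errors reported by summary() are the square roots of the
# diagonal of the variance-covariance matrix
all.equal(sqrt(diag(vcov(fm))), coef(summary(fm))[, 2])
# [1] TRUE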

R: Translate the results from lm() to an equation

I'm using R and I want to translate the results from lm() to an equation.
My model is:
Residuals:
Min 1Q Median 3Q Max
-0.048110 -0.023948 -0.000376 0.024511 0.044190
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.17691 0.00909 349.50 < 2e-16 ***
poly(QPB2_REF1, 2)1 0.64947 0.03015 21.54 2.66e-14 ***
poly(QPB2_REF1, 2)2 0.10824 0.03015 3.59 0.00209 **
B2DBSA_REF1DONSON -0.20959 0.01286 -16.30 3.17e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.03015 on 18 degrees of freedom
Multiple R-squared: 0.9763, Adjusted R-squared: 0.9724
F-statistic: 247.6 on 3 and 18 DF, p-value: 8.098e-15
Do you have any idea?
I tried to have something like
f <- function(x) {3.17691 + 0.64947*x +0.10824*x^2 -0.20959*1 + 0.03015^2}
but when I tried to plug in an x, the f(x) value is incorrect.
Your output indicates that the model uses the poly function, which by default orthogonalizes the polynomials (this includes centering the x values, among other things). In your hand-written formula there is no orthogonalization, and that is the likely difference. You can refit the model using raw=TRUE in the call to poly to get raw coefficients that can be multiplied directly by x and x^2.
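A minimal sketch of the raw = TRUE refit on made-up data (the question's QPB2_REF1 and B2DBSA_REF1 variables are not available here):
set.seed(42)
x <- runif(30, 0, 10)
y <- 3 + 0.6 * x + 0.1 * x^2 + rnorm(30)
# with raw = TRUE the second and third coefficients apply directly to x and
# x^2, so they can be dropped into a hand-written equation
fit_raw <- lm(y ~ poly(x, 2, raw = TRUE))
coef(fit_raw)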
You may also be interested in the Function function in the rms package which automates creating functions from fitted models.
Edit
Here is an example:
library(rms)
xx <- 1:25
yy <- 5 - 1.5*xx + 0.1*xx^2 + rnorm(25)
plot(xx,yy)
fit <- ols(yy ~ pol(xx,2))
mypred <- Function(fit)
curve(mypred, add=TRUE)
mypred(c(1, 25, 3, 3.5))
You need to use the rms functions for fitting (ols and pol for this example instead of lm and poly).
If you want to calculate y-hat based on the model, you can just use predict!
Example:
set.seed(123)
my_dat <- data.frame(x=1:10, e=rnorm(10))
my_dat$y <- with(my_dat, x*2 + e)
my_lm <- lm(y~x, data=my_dat)
summary(my_lm)
Result:
Call:
lm(formula = y ~ x, data = my_dat)
Residuals:
Min 1Q Median 3Q Max
-1.1348 -0.5624 -0.1393 0.3854 1.6814
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.5255 0.6673 0.787 0.454
x 1.9180 0.1075 17.835 1e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9768 on 8 degrees of freedom
Multiple R-squared: 0.9755, Adjusted R-squared: 0.9724
F-statistic: 318.1 on 1 and 8 DF, p-value: 1e-07
Now, instead of making a function like 0.5255 + x * 1.9180 manually, I just call predict for my_lm:
predict(my_lm, data.frame(x=11:20))
Same result as this (not counting minor errors from rounding the slope/intercept estimates):
0.5255 + (11:20) * 1.9180
If you are looking to actually visualize or write out a complex equation (e.g. something that has restricted cubic spline transformations), I recommend using the rms package, fitting your model, and using the latex function to see it in LaTeX:
my_lm <- ols(y~x, data=my_dat)
latex(my_lm)
Note that you will need to render the LaTeX code to see your equation. There are websites, and, if you are using a Mac, the MacTeX software, that will render it for you.

Select regression coefs by name

After running a regression, how can I select the variable name and corresponding parameter estimate?
For example, after running the following regression, I obtain:
set.seed(1)
n=1000
x=rnorm(n,0,1)
y=.6*x+rnorm(n,0,sqrt(1-.6)^2)
(reg1=summary(lm(y~x)))
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-1.2994 -0.2688 -0.0055 0.3022 1.4577
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.006475 0.013162 -0.492 0.623
x 0.602573 0.012723 47.359 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4162 on 998 degrees of freedom
Multiple R-squared: 0.6921, Adjusted R-squared: 0.6918
F-statistic: 2243 on 1 and 998 DF, p-value: < 2.2e-16
I would like to be able to select the coefficient by the variable names (e.g., (Intercept) -0.006475)
I have tried the following but nothing works...
attr(reg1$coefficients,"terms")
names(reg1$coefficients)
Note: reg1$coefficients[1,1] works, but I want to be able to call the value by name rather than by row/column.
The package broom tidies a lot of regression models very nicely.
require(broom)
set.seed(1)
n=1000
x=rnorm(n,0,1)
y=.6*x+rnorm(n,0,sqrt(1-.6)^2)
model = lm(y~x)
tt <- tidy(model, conf.int=TRUE)
subset(tt,term=="x")
## term estimate std.error statistic p.value conf.low conf.high
## 2 x 0.602573 0.01272349 47.35908 1.687125e-257 0.5776051 0.6275409
with(tt,tt[term=="(Intercept)","estimate"])
## [1] -0.006474794
So, your code doesn't run the way you have it. I changed it a bit:
set.seed(1)
n=1000
x=rnorm(n,0,1)
y=.6*x+rnorm(n,0,sqrt(1-.6)^2)
model = lm(y~x)
Now, I can call coef(model)["x"] or coef(model)["(Intercept)"] and get the values.
> coef(model)["x"]
x
0.602573
> coef(model)["(Intercept)"]
(Intercept)
-0.006474794
