Suppose we have two variables that we wish to build a model from:
set.seed(10239)
x <- rnorm(seq(1,100,1))
y <- rnorm(seq(1,100,1))
model <- lm(x~y)
class(model)
# [1] "lm"
summary(model)
#
# Call:
# lm(formula = x ~ y)
#
# Residuals:
# Min 1Q Median 3Q Max
# -3.08676 -0.63022 -0.01115 0.75280 2.35169
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -0.07188 0.11375 -0.632 0.529
# y 0.06999 0.12076 0.580 0.564
#
# Residual standard error: 1.117 on 98 degrees of freedom
# Multiple R-squared: 0.003416, Adjusted R-squared: -0.006754
# F-statistic: 0.3359 on 1 and 98 DF, p-value: 0.5635
How do you plot the F-distribution of the model object?
If you check the structure of the summary of your model with str(summary(model)), you'll notice that the parameters of the F-distribution of interest can be found by calling summary(model)$fstatistic. The first element is the F-statistic, and the following two elements are the numerator degrees of freedom and the denominator degrees of freedom, in that order. So to plot the F-distribution, try something like the following:
df <- summary(model)$fstatistic  # named vector: value, numdf, dendf
curve(df(x, df1 = df[2], df2 = df[3]), from = 0, to = 100)
Alternatively, you can also get the parameters of the F-distribution of interest from the model object itself. The numerator degrees of freedom is one less than the number of coefficients in the model, and the denominator degrees of freedom is the number of observations minus the total number of coefficients (intercept included).
df1 <- length(model$coefficients) - 1     # numerator df
df2 <- length(model$residuals) - df1 - 1  # denominator df: n minus all coefficients
curve(df(x, df1 = df1, df2 = df2), from = 0, to = 100)
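As a quick cross-check (a sketch using the objects above), the p-value printed by summary(model) can be reproduced from these parameters with pf():
# upper-tail probability of the observed F statistic;
# should match the p-value 0.5635 reported by summary(model)
fs <- summary(model)$fstatistic
pf(fs[1], df1 = fs[2], df2 = fs[3], lower.tail = FALSE)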
I prefer the following way to show the p-value of the F-statistic, using the HH package:
fstat <- summary(model)$fstatistic
library(HH)
old.omd <- par(omd=c(.05,.88, .05,1))
F.setup(df1=fstat['numdf'], df2=fstat['dendf'])
F.curve(df1=fstat['numdf'], df2=fstat['dendf'], col='blue')
F.observed(fstat['value'], df1=fstat['numdf'], df2=fstat['dendf'])
par(old.omd)
In glm() it is possible to model Bernoulli (0/1) outcomes with a logistic regression using the following sort of syntax:
glm(bin ~ x, df, family = "binomial")
However, you can also perform aggregated binomial regression, where each observation represents a count of target events out of a fixed number of Bernoulli trials. For example, see the following data:
set.seed(1)
n <- 50
cov <- 10
x <- c(rep(0,n/2), rep(1, n/2))
p <- 0.4 + 0.2*x
y <- rbinom(n, cov, p)
With this sort of data you use slightly different syntax in glm():
mod <- glm(cbind(y, cov-y) ~ x, family="binomial")
mod
# output
# Call: glm(formula = cbind(y, cov - y) ~ x, family = "binomial")
#
# Coefficients:
# (Intercept) x
# -0.3064 0.6786
#
# Degrees of Freedom: 49 Total (i.e. Null); 48 Residual
# Null Deviance: 53.72
# Residual Deviance: 39.54 AIC: 178
I was wondering: is it possible to model this type of aggregated binomial data with the glmnet package? If so, what is the syntax?
Yes, you can do it as follows:
set.seed(1)
n <- 50
cov <- 10
x <- c(rep(0,n/2), rep(1, n/2))
x = cbind(x, xx = c(rep(0.5,20), rep(0.7, 20), rep(1,10)))
p <- 0.4 + 0.2*x
y <- rbinom(n, cov, p)
I added another covariate here, called xx, because glmnet requires the predictor matrix to have at least two columns.
In glm(), as you have it in your post:
mod <- glm(cbind(y, cov-y) ~ x, family="binomial")
mod
# output
# Call: glm(formula = cbind(y, cov - y) ~ x, family = "binomial")
# Coefficients:
# (Intercept) xx xxx
# 0.04366 0.86126 -0.64862
# Degrees of Freedom: 49 Total (i.e. Null); 47 Residual
# Null Deviance: 53.72
# Residual Deviance: 38.82 AIC: 179.3
In glmnet, fit without regularization (lambda = 0) to reproduce results similar to those from glm:
library(glmnet)
fit = glmnet(x, cbind(cov-y,y), family="binomial", lambda=0)
coef(fit)
# output
# 3 x 1 sparse Matrix of class "dgCMatrix"
# s0
# (Intercept) 0.04352689
# x 0.86111234
# xx -0.64831806
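As a quick check (a sketch using mod and fit from above), you can put the two sets of coefficients side by side; with lambda = 0 they should agree to several decimal places:
# glm vs. unpenalized glmnet coefficients
cbind(glm = coef(mod), glmnet = drop(as.matrix(coef(fit))))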
I have data where y is a continuous response and group is a categorical variable:
set.seed(1)
df <- data.frame(y = rnorm(27), group = c(rep("A",9),rep("B",9),rep("C",9)), stringsAsFactors = F)
I would like to fit the linear model: y ~ group to it, in which each of the levels in df$group is contrasted with the mean.
I thought that using Deviation Coding does that:
lm(y ~ group,contrasts = "contr.sum",data=df)
But it skips contrasting group A with the mean:
> summary(lm(y ~ group,contrasts = "contr.sum",data=df))
Call:
lm(formula = y ~ group, data = df, contrasts = "contr.sum")
Residuals:
Min 1Q Median 3Q Max
-1.6445 -0.6946 -0.1304 0.6593 1.9165
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.2651 0.3457 -0.767 0.451
groupB 0.2057 0.4888 0.421 0.678
groupC 0.3985 0.4888 0.815 0.423
Residual standard error: 1.037 on 24 degrees of freedom
Multiple R-squared: 0.02695, Adjusted R-squared: -0.05414
F-statistic: 0.3324 on 2 and 24 DF, p-value: 0.7205
Is there any function that builds a model matrix to get each of the levels of df$group contrasted with the mean in the summary?
All I can think of is manually adding a "mean" level to df$group and setting it as the baseline with dummy coding:
library(dplyr) # for the %>% pipe
df <- df %>% rbind(data.frame(y = mean(df$y), group = "mean"))
df$group <- factor(df$group, levels = c("mean","A","B","C"))
summary(lm(y ~ group,contrasts = "contr.treatment",data=df))
Call:
lm(formula = y ~ group, data = df, contrasts = "contr.treatment")
Residuals:
Min 1Q Median 3Q Max
-2.30003 -0.34864 0.07575 0.56896 1.42645
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.14832 0.95210 0.156 0.878
groupA 0.03250 1.00360 0.032 0.974
groupB -0.06300 1.00360 -0.063 0.950
groupC 0.03049 1.00360 0.030 0.976
Residual standard error: 0.9521 on 24 degrees of freedom
Multiple R-squared: 0.002457, Adjusted R-squared: -0.1222
F-statistic: 0.01971 on 3 and 24 DF, p-value: 0.9961
Similarly, suppose I have data with two categorical variables:
set.seed(1)
df <- data.frame(y = rnorm(18),
group = c(rep("A",9),rep("B",9)),
class = as.character(rep(c(rep(1,3),rep(2,3),rep(3,3)),2)))
and I would like to estimate the interaction effect per level (i.e., class1:groupB, class2:groupB, and class3:groupB) for:
lm(y ~ class*group,contrasts = c("contr.sum","contr.treatment"),data=df)
How would I obtain it?
Use +0 in the lm formula to omit the intercept, then you should get the expected contrast coding:
summary(lm(y ~ 0 + group, contrasts = "contr.sum", data=df))
Result:
Call:
lm(formula = y ~ 0 + group, data = df, contrasts = "contr.sum")
Residuals:
Min 1Q Median 3Q Max
-2.3000 -0.3627 0.1487 0.5804 1.4264
Coefficients:
Estimate Std. Error t value Pr(>|t|)
groupA 0.18082 0.31737 0.570 0.574
groupB 0.08533 0.31737 0.269 0.790
groupC 0.17882 0.31737 0.563 0.578
Residual standard error: 0.9521 on 24 degrees of freedom
Multiple R-squared: 0.02891, Adjusted R-squared: -0.09248
F-statistic: 0.2381 on 3 and 24 DF, p-value: 0.8689
If you want to do this for an interaction, here's one way:
lm(y ~ 0 + class:group,
contrasts = c("contr.sum","contr.treatment"),
data=df)
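For the first part of the question (each level contrasted with the grand mean), here is a related sketch: lm's contrasts argument expects a named list, so supply contr.sum that way, and the omitted level's deviation can be recovered by hand. The data frame is recreated as df0 below only to avoid clobbering the interaction data:
# sum-to-zero (deviation) coding on the original 27-row data
set.seed(1)
df0 <- data.frame(y = rnorm(27),
                  group = factor(c(rep("A",9), rep("B",9), rep("C",9))))
fit <- lm(y ~ group, contrasts = list(group = "contr.sum"), data = df0)
coef(fit)
# group1 and group2 are the deviations of A and B from the grand mean;
# the deviation of the omitted level C is minus their sum
-sum(coef(fit)[-1])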
Simple logistic regression example.
set.seed(1)
df <- data.frame(out=c(0,1,0,1,0,1,0,1,0),
y=rep(c('A', 'B', 'C'), 3))
result <-glm(out~factor(y), family = 'binomial', data=df)
summary(result)
#Call:
#glm(formula = out ~ factor(y), family = "binomial", data = df)
#Deviance Residuals:
# Min 1Q Median 3Q Max
#-1.4823 -0.9005 -0.9005 0.9005 1.4823
#Coefficients:
# Estimate Std. Error z value Pr(>|z|)
#(Intercept) -6.931e-01 1.225e+00 -0.566 0.571
#factor(y)B 1.386e+00 1.732e+00 0.800 0.423
#factor(y)C 3.950e-16 1.732e+00 0.000 1.000
#(Dispersion parameter for binomial family taken to be 1)
# Null deviance: 12.365 on 8 degrees of freedom
#Residual deviance: 11.457 on 6 degrees of freedom
#AIC: 17.457
#Number of Fisher Scoring iterations: 4
My reference category is now A; results for B and C relative to A are given. I would also like to get the results when B and C are the reference. One can change the reference manually by using levels = in factor(), but this would require fitting three models. Is it possible to do this in one go, or what would be a more efficient approach?
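For reference, a minimal sketch of that manual approach (using df, out, and y as defined above); it still needs one refit per reference level:
# make B the reference level, then refit; repeat analogously with C first
df$y <- factor(df$y, levels = c("B", "A", "C"))
glm(out ~ y, family = "binomial", data = df)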
If you want to do all pairwise comparisons, you should usually also do a correction for alpha-error inflation due to multiple testing. You can easily do a Tukey test with package multcomp.
set.seed(1)
df <- data.frame(out=c(0,1,0,1,0,1,0,1,0),
y=rep(c('A', 'B', 'C'), 3))
# y is already a factor; if not, coerce it before fitting the model
result <-glm(out~y, family = 'binomial', data=df)
summary(result)
library(multcomp)
comps <- glht(result, linfct = mcp(y = "Tukey"))
summary(comps)
#Simultaneous Tests for General Linear Hypotheses
#
#Multiple Comparisons of Means: Tukey Contrasts
#
#
#Fit: glm(formula = out ~ y, family = "binomial", data = df)
#
#Linear Hypotheses:
# Estimate Std. Error z value Pr(>|z|)
#B - A == 0 1.386e+00 1.732e+00 0.8 0.703
#C - A == 0 1.923e-16 1.732e+00 0.0 1.000
#C - B == 0 -1.386e+00 1.732e+00 -0.8 0.703
#(Adjusted p values reported -- single-step method)
#letter notation often used in graphs and tables
cld(comps)
# A B C
#"a" "a" "a"
My task reads as follows:
Here is some generic output from a multiple regression analysis of a model predicting Y from three numeric variables X1, X2, and X3 on n = 25 observations. I have replaced some of the values in the output by letters. You are to use the remaining values to compute the values for A, B, C, …, K. Please make it crystal clear how you obtained your answers.
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.49526 2.63720 1.325 0.19929
X1 -1.17573 0.31557 -3.725734 D
X2 0.03876 0.03193 1.213905 E
X3 -0.15228 0.05011 -3.038914 F
Residual standard error: 0.754 on 21 degrees of freedom
Multiple R-squared: 0.625634, Adjusted R-squared: 0.7150102
F-statistic: 11.7 on 3 and 21 DF, p-value: 0.0001016
anova(model)
Analysis of Variance Table
Response: Y
Df Sum Sq Mean Sq F value Pr(>F)
X1 1 8.6400 8.6400 15.2122 0.0008244 ***
X2 1 6.0468 6.0468 10.6465 0.0037181 **
X3 1 5.2459 5.2459 9.2362 0.0062376 **
Residuals 21 11.9273 0.5680
How can I find the values of D, E, and F with R commands?
To find D, E, and F you may want to look at summary.lm. In particular,
ans$coefficients <- cbind(Estimate = est, `Std. Error` = se,
`t value` = tval, `Pr(>|t|)` = 2 * pt(abs(tval), rdf,
lower.tail = FALSE))
Hence, the values of interest are
2 * pt(abs(c(-3.725734, 1.213905, -3.038914)), 21, lower.tail = FALSE)
# [1] 0.001249329 0.238260061 0.006240436
respectively. That is, we plug in the t values from the table. The fact that rdf, the residual degrees of freedom, is 21 comes from
Residual standard error: 0.754 on 21 degrees of freedom
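The same formula reproduces the p-values that were not replaced by letters, e.g. for the intercept:
2 * pt(abs(1.325), 21, lower.tail = FALSE)
# roughly 0.199, matching the 0.19929 printed for the intercept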
I would like to remove all of the as.factor() terms from the output of an ordinary least squares lm() model in R. The last line below doesn't work, but it illustrates what I'm trying to do:
frame <- data.frame(y = rnorm(100), x= rnorm(100), block = sample(c("A", "B", "C", "D"), 100, replace = TRUE))
mod <- lm(y ~ x + as.factor(block), data = frame)
summary(mod)
summary(mod)$coefficients[3:5,] <- NULL
Is there a way to remove all of these elements so that the saved `lm' object no longer has them? Thanks.
One option is to use the felm() function from the lfe package.
As stated in the package documentation:
The package is intended for linear models with multiple group fixed effects, i.e. with 2 or more factors with a large number of levels. It performs similar functions as lm, but it uses a special method for projecting out multiple group fixed effects from the normal equations, hence it is faster.
set.seed(123)
frame <- data.frame(y = rnorm(100), x= rnorm(100), block = sample(c("A", "B", "C", "D"), 100, replace = TRUE))
id<-as.factor(frame$block)
mod <- lm(y ~ x + id, data = frame) #lm
summary(mod)
Call:
lm(formula = y ~ x + id, data = frame)
Residuals:
Min 1Q Median 3Q Max
-2.53394 -0.68372 0.04072 0.67805 2.00777
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.18115 0.17201 1.053 0.2950
x -0.08310 0.09604 -0.865 0.3891
idB 0.04834 0.24645 0.196 0.8449
idC -0.51265 0.25052 -2.046 0.0435 *
idD 0.04905 0.26073 0.188 0.8512
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9002 on 95 degrees of freedom
Multiple R-squared: 0.06677, Adjusted R-squared: 0.02747
F-statistic: 1.699 on 4 and 95 DF, p-value: 0.1566
library(lfe)
est <- felm(y ~ x | id, data = frame)
summary(est)
Call:
felm(formula = y ~ x | id, data = frame)
Residuals:
Min 1Q Median 3Q Max
-2.53394 -0.68372 0.04072 0.67805 2.00777
Coefficients:
Estimate Std. Error t value Pr(>|t|)
x -0.08310 0.09604 -0.865 0.389
Residual standard error: 0.9002 on 95 degrees of freedom
Multiple R-squared(full model): 0.06677 Adjusted R-squared: 0.02747
Multiple R-squared(proj model): 0.00782 Adjusted R-squared: -0.03396
F-statistic(full model):1.699 on 4 and 95 DF, p-value: 0.1566
F-statistic(proj model): 0.7487 on 1 and 95 DF, p-value: 0.3891
P.S. A similar program for Stata is reghdfe.