My task reads as follows:
Here is some generic output from a multiple regression analysis of a
model predicting Y from three numeric variables X1, X2, and X3 on n =
25 observations. I have replaced some of the values in the output by
letters. You are to use the remaining values to compute the values for
A, B, C, … , K. Please make it crystal clear how you obtained your
answers.
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.49526 2.63720 1.325 0.19929
X1 -1.17573 0.31557 -3.725734 D
X2 0.03876 0.03193 1.213905 E
X3 -0.15228 0.05011 -3.038914 F
Residual standard error: 0.754 on 21 degrees of freedom
Multiple R-squared: 0.625634, Adjusted R-squared: 0.7150102
F-statistic: 11.7 on 3 and 21 DF, p-value: 0.0001016
anova(model)
Analysis of Variance Table
Response: Y
Df Sum Sq Mean Sq F value Pr(>F)
X1 1 8.6400 8.6400 15.2122 0.0008244 ***
X2 1 6.0468 6.0468 10.6465 0.0037181 **
X3 1 5.2459 5.2459 9.2362 0.0062376 **
Residuals 21 11.9273 0.5680
How can I find the D, E, and F values with R commands in RStudio?
To find D, E, and F, look at the source of summary.lm. In particular, it builds the coefficient table as
ans$coefficients <- cbind(Estimate = est, `Std. Error` = se,
`t value` = tval, `Pr(>|t|)` = 2 * pt(abs(tval), rdf,
lower.tail = FALSE))
Hence, the values of interest are
2 * pt(abs(c(-3.725734, 1.213905, -3.038914)), 21, lower.tail = FALSE)
# [1] 0.001249329 0.238260061 0.006240436
respectively. That is, we use the t values from the table. The fact that rdf, the number of degrees of freedom, is 21 comes from
Residual standard error: 0.754 on 21 degrees of freedom
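As a cross-check, each t value in the table is the estimate divided by its standard error, and the sums of squares in the ANOVA table determine the multiple R-squared and the overall F statistic. The sketch below only re-enters numbers printed in the output above; the variable names are my own.

```r
# Numbers copied from the printed output above
est <- c(X1 = -1.17573, X2 = 0.03876, X3 = -0.15228)
se  <- c(X1 = 0.31557,  X2 = 0.03193, X3 = 0.05011)
rdf <- 21                       # residual degrees of freedom

tval <- est / se                # t value = Estimate / Std. Error
pval <- 2 * pt(abs(tval), rdf, lower.tail = FALSE)   # D, E, F (two-sided)

# Sums of squares from the ANOVA table
ss_model <- 8.6400 + 6.0468 + 5.2459
ss_resid <- 11.9273
r2    <- ss_model / (ss_model + ss_resid)        # Multiple R-squared
fstat <- (ss_model / 3) / (ss_resid / rdf)       # overall F on 3 and 21 DF

round(cbind(tval, pval), 4)
c(R2 = r2, F = fstat)
```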
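I need to write the following formulas in R. The STAT formula replicates the result of the oneway.test function:
$$\mathrm{STAT} = \frac{n\, s_{\bar{x}}^2}{\frac{1}{m}\sum_{i=1}^{m} s_i^2}$$
where the sample variance of the sample means is
$$s_{\bar{x}}^2 = \frac{1}{m-1}\sum_{i=1}^{m} (\bar{x}_i - \bar{x})^2$$
and
$$\bar{x} = \frac{1}{m}\sum_{i=1}^{m} \bar{x}_i$$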
The variables are: m - number of samples, n - sample size, vector sample_means - mean of each sample and vector sample_vars - sample variance of each sample.
I'm trying to work with the following code, but it doesn't give the correct results when I compare it to aov:
my_anova <- function(m, n, sample_means, sample_vars) {
overall_mean <- mean(sample_means)
sample_vars <- sum((sample_means - overall_mean)^2)/(m-1)
STAT <- (n*sample_vars)/(sum(sample_vars/m))
PVAL <- pf(STAT, m - 1, m*(n - 1), lower.tail = FALSE)
}
I'm not sure where you obtained the formulas above, but from what I can gather, you want the F statistic and p-value for a one-way ANOVA. n should be the degrees of freedom, not the sample size. Try using this table:
The bottom line is that SSF should always be the sum of squared differences between each predicted (group) mean and the overall mean, whereas SSE is the sum of squared differences between the actual values and their predicted means. You then divide each by its corresponding degrees of freedom. It should look like this:
my_aov <- function(sample_values, sample_means,n){
overall_mean = mean(sample_values)
SSF = sum((sample_means - overall_mean)^2)
SSE = sum((sample_values - sample_means)^2)
DoF = c(n,length(sample_values)-1-n)
Mean_Square = c(SSF/DoF[1] , SSE/DoF[2])
FSTAT = c(Mean_Square[1]/Mean_Square[2],NA)
PVAL <- pf(FSTAT, DoF[1], DoF[2], lower.tail = FALSE)
cbind(Sum_of_Squares= c(SSF,SSE),DoF,Mean_Square,FSTAT,PVAL)
}
Using an example:
values = iris$Sepal.Length
Species_values = tapply(iris$Sepal.Length,iris$Species,mean)
predicted_values = Species_values[as.character(iris$Species)]
# since there are 3 groups, degree of freedom is 3-1
n = length(unique(iris$Species)) - 1
my_aov(values,predicted_values,n)
Sum_of_Squares DoF Mean_Square FSTAT PVAL
[1,] 63.21213 2 31.6060667 119.2645 1.669669e-31
[2,] 38.95620 147 0.2650082 NA NA
Compare with:
summary(aov(Sepal.Length ~ Species,data=iris))
Df Sum Sq Mean Sq F value Pr(>F)
Species 2 63.21 31.606 119.3 <2e-16 ***
Residuals 147 38.96 0.265
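For the original signature, which takes only per-group summaries, a minimal sketch follows. It assumes every group has the same size n. The function in the question overwrote sample_vars with the variance of the means before using it in the denominator, so the version below keeps the two quantities separate. The names ms_between and ms_within are mine.

```r
# One-way ANOVA F test from per-group summaries (equal group sizes assumed)
my_anova <- function(m, n, sample_means, sample_vars) {
  ms_between <- n * var(as.numeric(sample_means))  # n * variance of the m group means
  ms_within  <- mean(sample_vars)                  # pooled variance when all groups have size n
  STAT <- ms_between / ms_within
  PVAL <- pf(STAT, m - 1, m * (n - 1), lower.tail = FALSE)
  list(STAT = STAT, PVAL = PVAL)
}

group_means <- tapply(iris$Sepal.Length, iris$Species, mean)
group_vars  <- tapply(iris$Sepal.Length, iris$Species, var)
res <- my_anova(m = 3, n = 50, group_means, group_vars)
res$STAT
```

The result can be compared with oneway.test(Sepal.Length ~ Species, data = iris, var.equal = TRUE).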
---
Simple logistic regression example.
set.seed(1)
df <- data.frame(out=c(0,1,0,1,0,1,0,1,0),
y=rep(c('A', 'B', 'C'), 3))
result <-glm(out~factor(y), family = 'binomial', data=df)
summary(result)
#Call:
#glm(formula = out ~ factor(y), family = "binomial", data = df)
#Deviance Residuals:
# Min 1Q Median 3Q Max
#-1.4823 -0.9005 -0.9005 0.9005 1.4823
#Coefficients:
# Estimate Std. Error z value Pr(>|z|)
#(Intercept) -6.931e-01 1.225e+00 -0.566 0.571
#factor(y)B 1.386e+00 1.732e+00 0.800 0.423
#factor(y)C 3.950e-16 1.732e+00 0.000 1.000
#(Dispersion parameter for binomial family taken to be 1)
# Null deviance: 12.365 on 8 degrees of freedom
#Residual deviance: 11.457 on 6 degrees of freedom
#AIC: 17.457
#Number of Fisher Scoring iterations: 4
My reference category is now A; results for B and C relative to A are given. I would also like to get the results when B and C are the reference. One can change the reference manually by using levels = in factor(); but this would require fitting 3 models. Is it possible to do this in one go? Or what would be a more efficient approach?
If you want to do all pairwise comparisons, you should usually also do a correction for alpha-error inflation due to multiple testing. You can easily do a Tukey test with package multcomp.
set.seed(1)
df <- data.frame(out=c(0,1,0,1,0,1,0,1,0),
y=factor(rep(c('A', 'B', 'C'), 3)))
#y must be a factor for mcp(); since R 4.0.0 data.frame() no longer converts strings to factors by default
result <-glm(out~y, family = 'binomial', data=df)
summary(result)
library(multcomp)
comps <- glht(result, linfct = mcp(y = "Tukey"))
summary(comps)
#Simultaneous Tests for General Linear Hypotheses
#
#Multiple Comparisons of Means: Tukey Contrasts
#
#
#Fit: glm(formula = out ~ y, family = "binomial", data = df)
#
#Linear Hypotheses:
# Estimate Std. Error z value Pr(>|z|)
#B - A == 0 1.386e+00 1.732e+00 0.8 0.703
#C - A == 0 1.923e-16 1.732e+00 0.0 1.000
#C - B == 0 -1.386e+00 1.732e+00 -0.8 0.703
#(Adjusted p values reported -- single-step method)
#letter notation often used in graphs and tables
cld(comps)
# A B C
#"a" "a" "a"
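If the unadjusted coefficient tables for every reference level are what is wanted, the refitting can at least be automated in base R. This is a sketch of that idea, not part of the multcomp approach above; the names dat and fits are mine, and the p-values it returns carry no multiple-testing correction.

```r
dat <- data.frame(out = c(0, 1, 0, 1, 0, 1, 0, 1, 0),
                  y   = factor(rep(c("A", "B", "C"), 3)))

# Refit once per reference level and keep each coefficient table
fits <- lapply(levels(dat$y), function(ref) {
  d <- dat
  d$y <- relevel(d$y, ref = ref)        # make `ref` the baseline level
  coef(summary(glm(out ~ y, family = binomial, data = d)))
})
names(fits) <- levels(dat$y)

fits[["B"]]   # rows yA and yC are A - B and C - B
```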
When we have a linear model with a factor variable X (with levels A, B, and C)
y ~ factor(X) + Var2 + Var3
The output shows the estimates XB and XC, which are the differences B - A and C - A (assuming the reference level is A).
If we want to know the p-value of the difference between B and C (C - B),
we have to designate B or C as the reference group and re-run the model.
Can we get the p-values of the effect B - A, C - A, and C - B at one time?
You are looking for a linear hypothesis test, i.e. checking the p-value of a linear combination of regression coefficients. Building on my answer to "How to conduct linear hypothesis test on regression coefficients with a clustered covariance matrix?", which only considered sums of coefficients, I extend the function LinearCombTest to handle more general cases, with alpha holding the combination coefficients for the variables in vars:
LinearCombTest <- function (lmObject, vars, alpha, .vcov = NULL) {
## if `.vcov` missing, use the one returned by `lm`
if (is.null(.vcov)) .vcov <- vcov(lmObject)
## estimated coefficients
beta <- coef(lmObject)
## linear combination of `vars` with combination coefficients `alpha`
LinearComb <- sum(beta[vars] * alpha)
## standard error of `LinearComb`: sqrt(alpha' V alpha)
LinearComb_se <- sum(alpha * crossprod(.vcov[vars, vars], alpha)) ^ 0.5
## perform t-test on `LinearComb`
tscore <- LinearComb / LinearComb_se
pvalue <- 2 * pt(abs(tscore), lmObject$df.residual, lower.tail = FALSE)
## return a matrix
form <- paste0("(", paste(alpha, vars, sep = " * "), ")")
form <- paste0(paste0(form, collapse = " + "), " = 0")
matrix(c(LinearComb, LinearComb_se, tscore, pvalue), nrow = 1L,
dimnames = list(form, c("Estimate", "Std. Error", "t value", "Pr(>|t|)")))
}
Consider a simple example with a balanced design for three groups A, B and C, simulated with group means 0, 1 and 2 respectively and then shifted by 0.1 (so the true means are 0.1, 1.1 and 2.1).
x <- gl(3,100,labels = LETTERS[1:3])
set.seed(0)
y <- c(rnorm(100, 0), rnorm(100, 1), rnorm(100, 2)) + 0.1
fit <- lm(y ~ x)
coef(summary(fit))
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.1226684 0.09692277 1.265631 2.066372e-01
#xB 0.9317800 0.13706949 6.797866 5.823987e-11
#xC 2.0445528 0.13706949 14.916177 6.141008e-38
Since A is the reference level, xB gives B - A while xC gives C - A. To test the difference between groups B and C, i.e. C - B, we can use
LinearCombTest(fit, c("xC", "xB"), c(1, -1))
# Estimate Std. Error t value Pr(>|t|)
#(1 * xC) + (-1 * xB) = 0 1.112773 0.1370695 8.118312 1.270686e-14
Note, this function is also handy to work out the group mean of B and C, that is (Intercept) + xB and (Intercept) + xC:
LinearCombTest(fit, c("(Intercept)", "xB"), c(1, 1))
# Estimate Std. Error t value Pr(>|t|)
#(1 * (Intercept)) + (1 * xB) = 0 1.054448 0.09692277 10.87926 2.007956e-23
LinearCombTest(fit, c("(Intercept)", "xC"), c(1, 1))
# Estimate Std. Error t value Pr(>|t|)
#(1 * (Intercept)) + (1 * xC) = 0 2.167221 0.09692277 22.36029 1.272811e-65
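The calculation behind the function can be checked independently: for a combination vector a, the estimate is a'beta and its standard error is sqrt(a' V a), where V is the coefficient covariance matrix. The sketch below (variable names are mine) computes C - B this way and compares it with a model refitted with B as the reference level.

```r
x <- gl(3, 100, labels = LETTERS[1:3])
set.seed(0)
y <- c(rnorm(100, 0), rnorm(100, 1), rnorm(100, 2)) + 0.1
fit <- lm(y ~ x)

# C - B as a linear combination a'beta, with standard error sqrt(a' V a)
a   <- c("(Intercept)" = 0, xB = -1, xC = 1)
est <- sum(a * coef(fit))
se  <- sqrt(drop(t(a) %*% vcov(fit) %*% a))

# The same contrast read off a model refitted with B as the reference level
refit <- coef(summary(lm(y ~ relevel(x, ref = "B"))))
cb    <- refit[3, c("Estimate", "Std. Error")]   # third row is level C vs B

all.equal(unname(cb), c(est, se))
```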
Alternative solution with lsmeans
Consider the above toy example again:
library(lsmeans)
lsmeans(fit, spec = "x", contr = "revpairwise")
#$lsmeans
# x lsmean SE df lower.CL upper.CL
# A 0.1226684 0.09692277 297 -0.06807396 0.3134109
# B 1.0544484 0.09692277 297 0.86370603 1.2451909
# C 2.1672213 0.09692277 297 1.97647888 2.3579637
#
#Confidence level used: 0.95
#
#$contrasts
# contrast estimate SE df t.ratio p.value
# B - A 0.931780 0.1370695 297 6.798 <.0001
# C - A 2.044553 0.1370695 297 14.916 <.0001
# C - B 1.112773 0.1370695 297 8.118 <.0001
#
#P value adjustment: tukey method for comparing a family of 3 estimates
The $lsmeans component returns the marginal group means, while $contrasts returns the pairwise differences in group means, since we used the "revpairwise" contrast. See p. 32 of the lsmeans manual for the difference between "pairwise" and "revpairwise".
This is useful because we can compare it with the result from LinearCombTest: the estimates, standard errors and t values agree, so LinearCombTest is computing the contrasts correctly. Note that the lsmeans p-values are Tukey-adjusted, whereas those from LinearCombTest are not.
glht (general linear hypothesis testing) from multcomp package makes this sort of multiple hypothesis test easy without re-running a bunch of separate models. It is essentially crafting a customized contrast matrix based on your defined comparisons of interest.
Using your example comparisons and building on the data @ZheyuanLi provided:
x <- gl(3,100,labels = LETTERS[1:3])
set.seed(0)
y <- c(rnorm(100, 0), rnorm(100, 1), rnorm(100, 2)) + 0.1
fit <- lm(y ~ x)
library(multcomp)
my_ht <- glht(fit, linfct = mcp(x = c("B-A = 0",
"C-A = 0",
"C-B = 0")))
summary(my_ht) will give you the adjusted p-values for the comparisons of interest.
#Linear Hypotheses:
# Estimate Std. Error t value Pr(>|t|)
#B - A == 0 0.9318 0.1371 6.798 1.11e-10 ***
#C - A == 0 2.0446 0.1371 14.916 < 1e-10 ***
#C - B == 0 1.1128 0.1371 8.118 < 1e-10 ***
You could use the car package and its linearHypothesis function, which has a vcov. argument.
Set this to the variance-covariance matrix you want to use for your model.
The function takes either character formulas or a matrix to describe the system of equations that you would like to test.
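A minimal sketch of that call on the toy data from the earlier answers, assuming the car package is installed (the block is skipped otherwise). Here vcov. is simply set to the model's own covariance matrix; a robust or clustered matrix could be passed instead.

```r
x <- gl(3, 100, labels = LETTERS[1:3])
set.seed(0)
y <- c(rnorm(100, 0), rnorm(100, 1), rnorm(100, 2)) + 0.1
fit <- lm(y ~ x)

if (requireNamespace("car", quietly = TRUE)) {
  # Test H0: xC - xB = 0, i.e. no difference between groups C and B
  lh <- car::linearHypothesis(fit, "xC - xB = 0", vcov. = vcov(fit))
  print(lh)
}
```

With a single hypothesis, the reported F statistic is the square of the corresponding t statistic.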
I would like to remove all of the as.factor elements from the output of an ordinary least squares lm() model in R. The last line doesn't work, but for example:
frame <- data.frame(y = rnorm(100), x= rnorm(100), block = sample(c("A", "B", "C", "D"), 100, replace = TRUE))
mod <- lm(y ~ x + as.factor(block), data = frame)
summary(mod)
summary(mod)$coefficients[3:5,] <- NULL
Is there a way to remove all of these elements so that the saved lm object no longer has them? Thanks.
One option is to use the felm function in the lfe package.
As stated in the package:
The package is intended for linear models with multiple group fixed effects, i.e. with 2 or more factors with a large number of levels. It performs similar functions as lm, but it uses a special method for projecting out multiple group fixed effects from the normal equations, hence it is faster.
set.seed(123)
frame <- data.frame(y = rnorm(100), x= rnorm(100), block = sample(c("A", "B", "C", "D"), 100, replace = TRUE))
id<-as.factor(frame$block)
mod <- lm(y ~ x + id, data = frame) #lm
summary(mod)
Call:
lm(formula = y ~ x + id, data = frame)
Residuals:
Min 1Q Median 3Q Max
-2.53394 -0.68372 0.04072 0.67805 2.00777
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.18115 0.17201 1.053 0.2950
x -0.08310 0.09604 -0.865 0.3891
idB 0.04834 0.24645 0.196 0.8449
idC -0.51265 0.25052 -2.046 0.0435 *
idD 0.04905 0.26073 0.188 0.8512
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9002 on 95 degrees of freedom
Multiple R-squared: 0.06677, Adjusted R-squared: 0.02747
F-statistic: 1.699 on 4 and 95 DF, p-value: 0.1566
library(lfe)
est <- felm(y ~ x | id, data = frame)
summary(est)
Call:
felm(formula = y ~ x | id, data = frame)
Residuals:
Min 1Q Median 3Q Max
-2.53394 -0.68372 0.04072 0.67805 2.00777
Coefficients:
Estimate Std. Error t value Pr(>|t|)
x -0.08310 0.09604 -0.865 0.389
Residual standard error: 0.9002 on 95 degrees of freedom
Multiple R-squared(full model): 0.06677 Adjusted R-squared: 0.02747
Multiple R-squared(proj model): 0.00782 Adjusted R-squared: -0.03396
F-statistic(full model):1.699 on 4 and 95 DF, p-value: 0.1566
F-statistic(proj model): 0.7487 on 1 and 95 DF, p-value: 0.3891
P.S. A similar program for Stata is reghdfe.
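If the goal is only to hide the block rows when reporting results, a base-R sketch is to filter the coefficient table by row name. This changes what is displayed, not the fitted model: the lm object still contains the block effects. The names coefs, keep and shown are mine.

```r
set.seed(123)
frame <- data.frame(y = rnorm(100), x = rnorm(100),
                    block = sample(c("A", "B", "C", "D"), 100, replace = TRUE))
mod <- lm(y ~ x + as.factor(block), data = frame)

# Keep only the rows whose names do not start with "as.factor(block)"
coefs <- coef(summary(mod))
keep  <- !startsWith(rownames(coefs), "as.factor(block)")
shown <- coefs[keep, , drop = FALSE]
shown
```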
Suppose we have two variables that we wish to build a model from:
set.seed(10239)
x <- rnorm(seq(1,100,1))
y <- rnorm(seq(1,100,1))
model <- lm(x~y)
class(model)
# [1] "lm"
summary(model)
#
# Call:
# lm(formula = x ~ y)
#
# Residuals:
# Min 1Q Median 3Q Max
# -3.08676 -0.63022 -0.01115 0.75280 2.35169
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -0.07188 0.11375 -0.632 0.529
# y 0.06999 0.12076 0.580 0.564
#
# Residual standard error: 1.117 on 98 degrees of freedom
# Multiple R-squared: 0.003416, Adjusted R-squared: -0.006754
# F-statistic: 0.3359 on 1 and 98 DF, p-value: 0.5635
How do you plot the F-distribution of the model object?
If you check the structure of the model summary with str(summary(model)), you'll notice that the parameters of the F-distribution of interest are in summary(model)$fstatistic. This is a named numeric vector: the first element is the F-statistic, and the following two elements are the numerator and denominator degrees of freedom, in that order. So to plot the F-distribution, try something like the following:
fstat <- summary(model)$fstatistic
curve(df(x, df1 = fstat[2], df2 = fstat[3]), from = 0, to = 100)
Alternatively, you can get the parameters of the F-distribution from the model itself. The numerator degrees of freedom is one less than the number of coefficients in the model, and the denominator degrees of freedom is the number of observations minus the number of coefficients.
df1 <- length(model$coefficients) - 1
df2 <- length(model$residuals) - df1 - 1
curve(df(x, df1 = df1, df2 = df2), from = 0, to = 100)
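The same three numbers also give the p-value and a critical value directly, which can then be marked on the density. A small sketch, assuming the model from the question; the narrower x-range and the 5% level are my choices for illustration.

```r
set.seed(10239)
x <- rnorm(100)
y <- rnorm(100)
model <- lm(x ~ y)

fstat <- summary(model)$fstatistic     # named vector: value, numdf, dendf

# Upper-tail probability of the observed F statistic
p_value <- pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)

# 5% critical value of the same F distribution
f_crit <- qf(0.95, fstat["numdf"], fstat["dendf"])

curve(df(x, df1 = fstat["numdf"], df2 = fstat["dendf"]), from = 0.01, to = 10,
      xlab = "F", ylab = "density")
abline(v = c(fstat["value"], f_crit), lty = c(1, 2))   # observed (solid), critical (dashed)
```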
I prefer the following way to show the observed F statistic and its p-value on the F distribution:
fstat <- summary(model)$fstatistic
library(HH)
old.omd <- par(omd=c(.05,.88, .05,1))
F.setup(df1=fstat['numdf'], df2=fstat['dendf'])
F.curve(df1=fstat['numdf'], df2=fstat['dendf'], col='blue')
F.observed(fstat['value'], df1=fstat['numdf'], df2=fstat['dendf'])
par(old.omd)