Fitting a linear regression model in R with confounding variables

I have a dataset called datamoth where survival is the response variable and treatment is a variable that can be treated as either categorical or quantitative. The dataset looks as follows:
survival <- c(17,22,26,20,11,14,37,26,24,11,11,16,8,5,12,3,5,4,14,8,4,6,3,3,10,13,5,7,3,3)
treatment <- c(3,3,3,3,3,3,6,6,6,6,6,6,9,9,9,9,9,9,12,12,12,12,12,12,21,21,21,21,21,21)
days <- c(3,3,3,3,3,3,6,6,6,6,6,6,9,9,9,9,9,9,12,12,12,12,12,12,21,21,21,21,21,21)
datamoth <- data.frame(survival, treatment)
So, I can fit a linear regression model considering treatment as categorical, like this:
lmod <- lm(survival ~ factor(treatment), data = datamoth)
My question is how to fit a linear regression model with treatment as a categorical variable while also considering treatment as a quantitative confounding variable.
I have figured out something like this:
model <- lm(survival ~ factor(treatment) + factor(treatment)*days, data = datamoth)
summary(model)
Call:
lm(formula = survival ~ factor(treatment) + factor(treatment) *
days, data = datamoth)
Residuals:
Min 1Q Median 3Q Max
-9.833 -3.333 -1.167 3.167 16.167
Coefficients: (5 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 18.333 2.435 7.530 6.96e-08 ***
factor(treatment)6 2.500 3.443 0.726 0.47454
factor(treatment)9 -12.167 3.443 -3.534 0.00162 **
factor(treatment)12 -12.000 3.443 -3.485 0.00183 **
factor(treatment)21 -11.500 3.443 -3.340 0.00263 **
days NA NA NA NA
factor(treatment)6:days NA NA NA NA
factor(treatment)9:days NA NA NA NA
factor(treatment)12:days NA NA NA NA
factor(treatment)21:days NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.964 on 25 degrees of freedom
Multiple R-squared: 0.5869, Adjusted R-squared: 0.5208
F-statistic: 8.879 on 4 and 25 DF, p-value: 0.0001324
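A quick check (a minimal sketch) confirms why every days term comes back NA: in this data, days is an exact copy of treatment.
# days and treatment are identical vectors, so days and its interactions with
# factor(treatment) add no new columns to the design matrix
identical(treatment, days)  # TRUE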
But obviously this code does not work because these two variables are perfectly collinear.
Does anyone know how to fix it? Any help will be appreciated.

Related

R equivalent of Stata * in regression

I am looking for an R equivalent of Stata's * wildcard that I can use when specifying regressions.
For example, if I have a dataframe like the following:
outcome var1 var2 var3 new
3 2 3 4 3
2 3 2 4 2
4 3 2 1 4
I would like to be able to select all variable names that begin with "var" without typing each one out separately in order to run the following regression more efficiently:
lm(outcome ~ var1 + var2 + var3 + new, data = df)
This question explains how I can select the necessary columns. How can I cleanly incorporate these into a regression?
One technique is to subset the data to the required columns and then use the . operator in the formula passed to lm() to stand for the independent variables. The . operator is interpreted as "all columns not otherwise in the formula".
data <- as.data.frame(matrix(runif(1000),nrow = 100)*100)
colnames(data) <- c("outcome", "x1","x2","x3","x4", "x5","x6", "x7", "var8", "var9")
# select outcome plus vars beginning with var
desiredCols <- grepl("var",colnames(data)) | grepl("outcome",colnames(data))
# use desiredCols to subset data frame argument in lm()
summary(lm(outcome ~ .,data = data[desiredCols]))
...and the output:
> summary(lm(outcome ~ .,data = data[desiredCols]))
Call:
lm(formula = outcome ~ ., data = data[desiredCols])
Residuals:
Min 1Q Median 3Q Max
-57.902 -25.359 2.296 26.213 52.871
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 58.712722 7.334937 8.005 2.62e-12 ***
var8 0.008617 0.101298 0.085 0.932
var9 -0.154073 0.103438 -1.490 0.140
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 29.86 on 97 degrees of freedom
Multiple R-squared: 0.02249, Adjusted R-squared: 0.002331
F-statistic: 1.116 on 2 and 97 DF, p-value: 0.3319
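Another option is to build the formula from the selected names with base R's reformulate(), which keeps the full data frame intact. A minimal sketch, reusing the data frame defined above:
# build "outcome ~ var8 + var9" programmatically from the matching column names
predictors <- grep("^var", colnames(data), value = TRUE)
f <- reformulate(predictors, response = "outcome")
summary(lm(f, data = data))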

How to perform a three-way (binary factors) between-subjects ANOVA with main effects and all interactions in R

The study randomized participants by Source (Expert vs. Attractive) and by Argument (Strong vs. Weak), and participants were also categorized by Monitor type (High vs. Low). I want to test the significance of the main effects, the two-way interactions, and the three-way interaction for the following data frame. Specifically:
Main effects = Self-Monitors (High vs. Low), Argument (Strong vs. Weak), Source (Attractive vs. Expert)
Two-way interactions = Self-Monitors*Argument, Self-Monitors*Source, Argument*Source
Three-way interaction = Self-Monitors*Argument*Source
This is the code:
data<-data.frame(Monitor=c(rep("High.Self.Monitors", 24),rep("Low.Self.Monitors", 24)),
Argument=c(rep("Strong", 24), rep("Weak", 24), rep("Strong", 24), rep("Weak", 24)),
Source=c(rep("Expert",12),rep("Attractive",12),rep("Expert",12),rep("Attractive",12),
rep("Expert",12),rep("Attractive",12),rep("Expert",12),rep("Attractive",12)),
Response=c(4,3,4,5,2,5,4,6,3,4,5,4,4,4,2,3,5,3,2,3,4,3,2,4,3,5,3,2,6,4,4,3,5,3,2,3,5,5,7,5,6,4,3,5,6,7,7,6,
3,5,5,4,3,2,1,5,3,4,3,4,5,4,3,2,4,6,2,4,4,3,4,3,5,6,4,7,6,7,5,6,4,6,7,5,6,4,4,2,4,5,4,3,4,2,3,4))
data$Monitor<-as.factor(data$Monitor)
data$Argument<-as.factor(data$Argument)
data$Source<-as.factor(data$Source)
I'd like to obtain the main effects, as well as all two-way interactions and the three-way interaction. However, if I type in anova(lm(Response ~ Monitor*Argument*Source, data=data)) I obtain:
Analysis of Variance Table
Response: Response
Df Sum Sq Mean Sq F value Pr(>F)
Monitor 1 24.000 24.0000 13.5322 0.0003947 ***
Source 1 0.667 0.6667 0.3759 0.5413218
Monitor:Source 1 0.667 0.6667 0.3759 0.5413218
Residuals 92 163.167 1.7736
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
and if I enter summary(aov(Response ~ Monitor*Argument*Source, data=data))
Call:
lm.default(formula = Response ~ Monitor * Argument * Source,
data = data)
Residuals:
Min 1Q Median 3Q Max
-2.7917 -0.7917 0.2083 1.2083 2.5417
Coefficients: (4 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.4583 0.2718 12.722 < 2e-16 ***
MonitorLow.Self.Monitors 1.1667 0.3844 3.035 0.00313 **
ArgumentWeak NA NA NA NA
SourceExpert 0.3333 0.3844 0.867 0.38817
MonitorLow.Self.Monitors:ArgumentWeak NA NA NA NA
MonitorLow.Self.Monitors:SourceExpert -0.3333 0.5437 -0.613 0.54132
ArgumentWeak:SourceExpert NA NA NA NA
MonitorLow.Self.Monitors:ArgumentWeak:SourceExpert NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.332 on 92 degrees of freedom
Multiple R-squared: 0.1344, Adjusted R-squared: 0.1062
F-statistic: 4.761 on 3 and 92 DF, p-value: 0.00394
Any thoughts or ideas?
Edit
Your data are not randomized the way you describe. In order to estimate a three-way interaction you would need a group of subjects with the "Low", "Strong", and "Expert" combination of the three factors' levels, and you do not have such a group.
Look at, for example:
table(data[, 1:3])
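That cross-tabulation makes the gap visible. As a minimal sketch, you can also check programmatically whether any cell of the full three-way design is empty:
# a zero count means that combination of levels was never observed,
# so the corresponding interaction coefficient cannot be estimated
cells <- with(data, table(Monitor, Argument, Source))
any(cells == 0)  # TRUE here: for example, there is no Low/Strong/Expert group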

Multi Collinearity for Categorical Variables

For numerical/continuous data, we detect collinearity between predictor variables with Pearson's correlation coefficient and check that the predictors are not correlated among themselves but are correlated with the response variable.
But how can we detect multicollinearity in a dataset where the predictors are all categorical? I am sharing one dataset where I am trying to find out whether the predictor variables are correlated:
A (response variable)   B     C     D
Yes                     Yes   Yes   Yes
No                      Yes   Yes   Yes
Yes                     No    No    No
How can this be done?
Collinearity can be, but is not always, a property of just a pair of variables, and this is especially true when dealing with categorical variables. So although a high correlation coefficient is sufficient to establish that collinearity might be a problem, a bunch of pairwise low-to-medium correlations is not a sufficient test for lack of collinearity. The usual method for collections of continuous, mixed, or categorical variables is to look at the variance inflation factors (which my memory tells me are proportional to the eigenvalues of the variance-covariance matrix). At any rate, this is the code for the vif function in package rms:
vif <- function(fit)
{
    v <- vcov(fit, regcoef.only = TRUE)
    nam <- dimnames(v)[[1]]
    ns <- num.intercepts(fit)
    if (ns > 0) {
        v <- v[-(1:ns), -(1:ns), drop = FALSE]
        nam <- nam[-(1:ns)]
    }
    d <- diag(v)^0.5
    v <- diag(solve(v/(d %o% d)))
    names(v) <- nam
    v
}
The reason that categorical variables have a greater tendency to generate collinearity is that the three-way or four-way tabulations often form linear combinations that lead to complete collinearity. Your example is an extreme case of collinearity, but you can also get collinearity with
A B C D
1 1 0 0
1 0 1 0
1 0 0 1
Notice that this is collinear because A == B + C + D in every row. None of the pairwise correlations would be high, but the system as a whole is completely collinear.
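To see this numerically, here is a minimal sketch with made-up indicator data in the same pattern: the pairwise correlations among B, C, and D are only moderate, yet lm() still has to drop terms because A == B + C + D in every row.
set.seed(1)
g  <- sample(c("b", "c", "d"), 300, replace = TRUE)
dd <- data.frame(A = 1,                       # A is always 1
                 B = as.numeric(g == "b"),    # B, C, D are mutually exclusive dummies
                 C = as.numeric(g == "c"),
                 D = as.numeric(g == "d"))
round(cor(dd[, c("B", "C", "D")]), 2)   # pairwise correlations of roughly -0.5
y <- rnorm(300)
coef(lm(y ~ A + B + C + D, data = dd))  # A and one dummy come back NA (aliased)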
After putting your data into an R object and running lm() on it, another way to detect collinearity in R becomes apparent: lm will drop factor variables from the results when they are "aliased", which is just another term for completely collinear.
Here is an example for @Alex demonstrating highly collinear data and the output of vif in that situation. Generally you hope to see variance inflation factors below 10.
> set.seed(123)
> dat2 <- data.frame(res = rnorm(100), A = sample(1:4, 1000, repl = TRUE))
> dat2$B<-dat2$A
> head(dat2)
res A B
1 -0.56047565 1 1
2 -0.23017749 4 4
3 1.55870831 3 3
4 0.07050839 3 3
5 0.12928774 2 2
6 1.71506499 4 4
> dat2[1,2] <- 2
#change only one value to prevent the "anti-aliasing" routines in `lm` from kicking in
> mod <- lm( res ~ A+B, dat2)
> summary(mod)
Call:
lm(formula = res ~ A + B, data = dat2)
Residuals:
Min 1Q Median 3Q Max
-2.41139 -0.58576 -0.02922 0.60271 2.10760
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.10972 0.07053 1.556 0.120
A -0.66270 0.91060 -0.728 0.467
B 0.65520 0.90988 0.720 0.472
Residual standard error: 0.9093 on 997 degrees of freedom
Multiple R-squared: 0.0005982, Adjusted R-squared: -0.001407
F-statistic: 0.2984 on 2 and 997 DF, p-value: 0.7421
> vif ( mod )
A B
1239.335 1239.335
If you make a fourth variable "C" that is independent of the first two predictors (admittedly a bad name for a variable, since C is also an R function), you get a more desirable result from vif:
dat2$C <- sample(1:4, 1000, repl=TRUE)
vif ( lm( res ~ A + C, dat2) )
#---------
A C
1.003493 1.003493
Edit: I realized that I had not actually created R-representations of a "categorical variable" despite sampling from 1:4. The same sort of result occurs with factor versions of that "sample":
> dat2 <- data.frame(res = rnorm(100), A=factor( sample(1:4, 1000, repl=TRUE) ) )
> dat2$B<-dat2$A
> head(dat2)
res A B
1 -0.56047565 1 1
2 -0.23017749 4 4
3 1.55870831 3 3
4 0.07050839 3 3
5 0.12928774 2 2
6 1.71506499 4 4
> dat2[1,2] <- 2
> #change only one value to prevent the "anti-aliasing" routines in `lm` from kicking in
> mod <- lm( res ~ A+B, dat2)
> summary(mod)
Call:
lm(formula = res ~ A + B, data = dat2)
Residuals:
Min 1Q Median 3Q Max
-2.43375 -0.59278 -0.04761 0.62591 2.12461
Coefficients: (2 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.11165 0.05766 1.936 0.0531 .
A2 -0.67213 0.91170 -0.737 0.4612
A3 0.01293 0.08146 0.159 0.8739
A4 -0.04624 0.08196 -0.564 0.5728
B2 0.62320 0.91165 0.684 0.4944
B3 NA NA NA NA
B4 NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9099 on 995 degrees of freedom
Multiple R-squared: 0.001426, Adjusted R-squared: -0.002588
F-statistic: 0.3553 on 4 and 995 DF, p-value: 0.8404
Notice that two of the factor levels are omitted from the calculation of coefficients... because they are completely collinear with the corresponding A levels. So if you want to see what vif returns for factor variables that are almost (but not completely) collinear, you need to change a few more values:
> dat2[1,2] <- 2
> dat2[2,2] <-2; dat2[3,2]<-2; dat2[4,2]<-4
> mod <- lm( res ~ A+B, dat2)
> summary(mod)
Call:
lm(formula = res ~ A + B, data = dat2)
Residuals:
Min 1Q Median 3Q Max
-2.42819 -0.59241 -0.04483 0.62482 2.12461
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.11165 0.05768 1.936 0.0532 .
A2 -0.67213 0.91201 -0.737 0.4613
A3 -1.51763 1.17803 -1.288 0.1980
A4 -0.97195 1.17710 -0.826 0.4092
B2 0.62320 0.91196 0.683 0.4945
B3 1.52500 1.17520 1.298 0.1947
B4 0.92448 1.17520 0.787 0.4317
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9102 on 993 degrees of freedom
Multiple R-squared: 0.002753, Adjusted R-squared: -0.003272
F-statistic: 0.4569 on 6 and 993 DF, p-value: 0.8403
#--------------
> library(rms)
> vif(mod)
A2 A3 A4 B2 B3 B4
192.6898 312.4128 308.5177 191.2080 312.5856 307.5242
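One more base-R check worth knowing: alias() reports exactly which coefficients lm() dropped as completely collinear. A minimal sketch, re-creating the fully aliased factor case from above (B an exact copy of A):
set.seed(123)
dat3 <- data.frame(res = rnorm(1000), A = factor(sample(1:4, 1000, replace = TRUE)))
dat3$B <- dat3$A
alias(lm(res ~ A + B, dat3))  # lists B2, B3, B4 as exact linear combinations of the A dummies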

Regression summary in R returns a bunch of NAs

I am trying to run an uncomplicated regression in R and am getting a long list of coefficient values with NAs for the standard error and t value. I've never experienced this before.
Result:
summary(model)
Call:
lm(formula = fed$SPX.Index ~ fed$Fed.Treasuries...MM., data = fed)
Residuals:
ALL 311 residuals are 0: no residual degrees of freedom!
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1258.84 NA NA NA
fed$Fed.Treasuries...MM. 1,016,102 0.94 NA NA NA
fed$Fed.Treasuries...MM. 1,030,985 17.72 NA NA NA
fed$Fed.Treasuries...MM. 1,062,061 27.12 NA NA NA
fed$Fed.Treasuries...MM. 917,451 -52.77 NA NA NA
fed$Fed.Treasuries...MM. 949,612 -30.56 NA NA NA
fed$Fed.Treasuries...MM. 967,553 -23.61 NA NA NA
Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: NaN
F-statistic: NaN on 310 and 0 DF, p-value: NA
head(fed)
X Fed.Treasuries...MM. Reserve.Repurchases Agency.Debt.Held Treasuries.Maturing.in.5.10.years SPX.Index
1 10/1/2008 476,621 93,063 14,500 93,362 1161.06
2 10/8/2008 476,579 77,349 14,105 93,353 984.94
3 10/15/2008 476,555 107,819 14,105 94,336 907.84
4 10/22/2008 476,512 95,987 14,105 94,327 896.78
5 10/29/2008 476,469 94,655 13,620 94,317 930.09
6 11/5/2008 476,456 96,663 13,235 94,312 952.77
You have commas in the numbers in your CSV file, so R reads them in as text rather than as numbers. Your model then has as many levels as rows, and so it is degenerate.
Illustration. Take this CSV file:
1, "1,234", "2,345,565"
2, "2,345", "3,234,543"
3, "3,234", "3,987,766"
Read it in and fit the first column (numbers) against the third column (comma-separated numbers):
> fed = read.csv("commas.csv",head=FALSE)
> summary(lm(V1~V3, fed))
Call:
lm(formula = V1 ~ V3, data = fed)
Residuals:
ALL 3 residuals are 0: no residual degrees of freedom!
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1 NA NA NA
V3 3,234,543 1 NA NA NA
V3 3,987,766 2 NA NA NA
Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: NaN
F-statistic: NaN on 2 and 0 DF, p-value: NA
Note this is exactly what you are getting, just with different column names, so this is almost certainly what is happening in your data.
The fix: convert the column:
> fed$V3 = as.numeric(gsub(",","", fed$V3))
> summary(lm(V1~V3, fed))
Call:
lm(formula = V1 ~ V3, data = fed)
Residuals:
1 2 3
0.02522 -0.05499 0.02977
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.875e+00 1.890e-01 -9.922 0.0639 .
V3 1.215e-06 5.799e-08 20.952 0.0304 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.06742 on 1 degrees of freedom
Multiple R-squared: 0.9977, Adjusted R-squared: 0.9955
F-statistic: 439 on 1 and 1 DF, p-value: 0.03036
Repeat over columns as necessary.
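If several columns are affected, one way to convert them all in one pass is sketched below (this assumes the affected columns came in as character; wrap them with as.character() first if read.csv turned them into factors):
# find columns that look like comma-formatted numbers and strip the commas
numlike <- sapply(fed, function(x) is.character(x) && all(grepl("^[0-9,.]+$", x)))
fed[numlike] <- lapply(fed[numlike], function(x) as.numeric(gsub(",", "", x)))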

Reporting base levels of categorical predictors in regression summary

Suppose that myGlm is a glm object in R.
summary(myGlm) displays coefficient estimates for all of the interesting dummy variables. However, I often don't know what the reference levels are since I have many nominal factors with many levels. Does there exist a method to output the base levels along with the estimates?
(Apologies in advance if this should really be in SO, not sure where to put it)
edit to include example
counts <- c(18,17,15,20,10,20,25,13,12)
outcome <- gl(3,1,9)
treatment <- gl(3,3)
print(d.AD <- data.frame(treatment, outcome, counts))
glm.D93 <- glm(counts ~ outcome + treatment, family = poisson())
summary(glm.D93)
Call:
glm(formula = counts ~ outcome + treatment, family = poisson())
Deviance Residuals:
1 2 3 4 5 6 7 8 9
-0.67125 0.96272 -0.16965 -0.21999 -0.95552 1.04939 0.84715 -0.09167 -0.96656
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.045e+00 1.709e-01 17.815 <2e-16 ***
outcome2 -4.543e-01 2.022e-01 -2.247 0.0246 *
outcome3 -2.930e-01 1.927e-01 -1.520 0.1285
treatment2 8.717e-16 2.000e-01 0.000 1.0000
treatment3 4.557e-16 2.000e-01 0.000 1.0000
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Here we can see outcome2, outcome3, but not outcome1, which I'd like to see in the output (with an estimate of 0 or blank). In this particular example, it's obvious that the base level is outcome1, but if I'm working with variables such as Country with levels {USA, Mexico, Canada, ...} I might not remember which one comes first and is omitted.
Example of the output I'm looking for:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.045e+00 1.709e-01 17.815 <2e-16 ***
outcome1 0 NA NA NA
outcome2 -4.543e-01 2.022e-01 -2.247 0.0246 *
outcome3 -2.930e-01 1.927e-01 -1.520 0.1285
treatment1 0 NA NA NA
treatment2 8.717e-16 2.000e-01 0.000 1.0000
treatment3 4.557e-16 2.000e-01 0.000 1.0000
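I don't know of an option that makes summary() print the table that way, but the fitted model does record the factor levels it used, so as a minimal sketch you can at least list the reference level of each factor (this assumes the default treatment contrasts, where the first level is the baseline):
# xlevels is a named list of the levels of every factor used in the fit;
# under the default contr.treatment coding the first level is the reference
sapply(glm.D93$xlevels, function(lev) lev[1])
#   outcome treatment
#       "1"       "1"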
