Multi Collinearity for Categorical Variables - r

For numerical/continuous data, we detect collinearity between predictor variables with Pearson's correlation coefficient, making sure that the predictors are not correlated among themselves but are correlated with the response variable.
But how can we detect multicollinearity in a dataset where the predictors are all categorical? I am sharing one dataset in which I am trying to find out whether the predictor variables are correlated:
> A(Response Variable) B C D
> Yes Yes Yes Yes
> No Yes Yes Yes
> Yes No No No
How to do the same?

Collinearity can be, but is not always, a property of just a pair of variables, and this is especially true when dealing with categorical variables. So although a high correlation coefficient would be sufficient to establish that collinearity might be a problem, a bunch of pairwise low to medium correlations is not a sufficient test for lack of collinearity. The usual method for continuous, mixed, or categorical collections of variables is to look at the variance inflation factors (which are the diagonal elements of the inverse of the correlation matrix derived from the coefficients' variance-covariance matrix). At any rate, this is the code for the vif function in package:rms:
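vif <-
function (fit)
{
    v <- vcov(fit, regcoef.only = TRUE)
    nam <- dimnames(v)[[1]]
    ns <- num.intercepts(fit)
    if (ns > 0) {
        # drop the intercept rows/columns before rescaling
        v <- v[-(1:ns), -(1:ns), drop = FALSE]
        nam <- nam[-(1:ns)]
    }
    d <- diag(v)^0.5
    # rescale to a correlation matrix, invert, and keep the diagonal
    v <- diag(solve(v/(d %o% d)))
    names(v) <- nam
    v
}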
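As a rough illustration of what that function computes for an ordinary lm fit, the sketch below compares three routes to the variance inflation factors. The data are simulated and the names x1, x2, x3 and y are made up for the example; it only assumes base R and is not the rms implementation itself.

```r
# Simulated predictors (assumption for illustration only):
# x2 is built to be strongly related to x1, x3 is independent of both.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.5)
x3 <- rnorm(n)
y  <- rnorm(n)
X  <- cbind(x1, x2, x3)

# Route 1: textbook definition, VIF_j = 1 / (1 - R^2_j),
# where R^2_j comes from regressing predictor j on the other predictors
vif_r2 <- sapply(seq_len(ncol(X)), function(j) {
  r2 <- summary(lm(X[, j] ~ X[, -j]))$r.squared
  1 / (1 - r2)
})
names(vif_r2) <- colnames(X)

# Route 2: diagonal of the inverse correlation matrix of the predictors
vif_cor <- diag(solve(cor(X)))

# Route 3: the same steps as the function above, applied to vcov() of an lm fit
fit <- lm(y ~ x1 + x2 + x3)
v <- vcov(fit)[-1, -1]          # drop the intercept row and column
d <- sqrt(diag(v))
vif_vcov <- diag(solve(v / (d %o% d)))

print(rbind(vif_r2, vif_cor, vif_vcov))
```

For a plain least-squares fit the three rows should agree up to rounding, with x1 and x2 showing inflated values and x3 staying close to 1.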
The reason that categorical variables have a greater tendency to generate collinearity is that the three-way or four-way tabulations often form linear combinations that lead to complete collinearity. Your example case is an extreme case of collinearity, but you can also get collinearity with
A B C D
1 1 0 0
1 0 1 0
1 0 0 1
Notice that this is collinear because A == B+C+D in all rows. None of the pairwise correlations would be high, but the system together causes complete collinearity.
After putting your data into an R object and running lm() on it, it becomes apparent that there is another way to determine collinearity with R and that is because lm will drop factor variables from the results when they are "aliased", which is just another term for being completely collinear.
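A minimal sketch of that aliasing behaviour, using made-up data: B, C and D are hypothetical 0/1 indicators of a three-level grouping, so B + C + D equals 1 in every row and therefore duplicates the intercept column, even though no pairwise correlation is large.

```r
# Hypothetical three-level grouping, expanded by hand into 0/1 indicators
g <- rep(c("b", "c", "d"), times = 100)
set.seed(42)
dat <- data.frame(
  y = rnorm(length(g)),          # arbitrary response, unrelated to the predictors
  B = as.numeric(g == "b"),
  C = as.numeric(g == "c"),
  D = as.numeric(g == "d")
)

# Pairwise correlations among the indicators are only moderate
pairwise <- cor(dat[, c("B", "C", "D")])
print(pairwise)

# lm() detects the exact linear dependence and reports NA for the aliased term
fit <- lm(y ~ B + C + D, data = dat)
print(coef(fit))

# alias() spells out which term is a linear combination of the others
print(alias(fit))
```

The NA coefficient (rather than any single correlation) is what signals the complete collinearity here.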
Here is an example for @Alex demonstrating highly collinear data and the output of vif in that situation. Generally you hope to see variance inflation factors below 10.
> set.seed(123)
> dat2 <- data.frame(res = rnorm(100), A=sample(1:4, 1000, repl=TRUE)
+ )
> dat2$B<-dat2$A
> head(dat2)
res A B
1 -0.56047565 1 1
2 -0.23017749 4 4
3 1.55870831 3 3
4 0.07050839 3 3
5 0.12928774 2 2
6 1.71506499 4 4
> dat2[1,2] <- 2
#change only one value to prevent the "anti-aliasing" routines in `lm` from kicking in
> mod <- lm( res ~ A+B, dat2)
> summary(mod)
lm(formula = res ~ A + B, data = dat2)
Min 1Q Median 3Q Max
-2.41139 -0.58576 -0.02922 0.60271 2.10760
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.10972 0.07053 1.556 0.120
A -0.66270 0.91060 -0.728 0.467
B 0.65520 0.90988 0.720 0.472
Residual standard error: 0.9093 on 997 degrees of freedom
Multiple R-squared: 0.0005982, Adjusted R-squared: -0.001407
F-statistic: 0.2984 on 2 and 997 DF, p-value: 0.7421
> vif ( mod )
1239.335 1239.335
If you make a fourth variable "C" that is independent of the first two predictors (admittedly a bad name for a variable, since C is also an R function), you get a more desirable result from vif:
dat2$C <- sample(1:4, 1000, repl=TRUE)
vif ( lm( res ~ A + C, dat2) )
1.003493 1.003493
Edit: I realized that I had not actually created R-representations of a "categorical variable" despite sampling from 1:4. The same sort of result occurs with factor versions of that "sample":
> dat2 <- data.frame(res = rnorm(100), A=factor( sample(1:4, 1000, repl=TRUE) ) )
> dat2$B<-dat2$A
> head(dat2)
res A B
1 -0.56047565 1 1
2 -0.23017749 4 4
3 1.55870831 3 3
4 0.07050839 3 3
5 0.12928774 2 2
6 1.71506499 4 4
> dat2[1,2] <- 2
> #change only one value to prevent the "anti-aliasing" routines in `lm` from kicking in
> mod <- lm( res ~ A+B, dat2)
> summary(mod)
lm(formula = res ~ A + B, data = dat2)
Min 1Q Median 3Q Max
-2.43375 -0.59278 -0.04761 0.62591 2.12461
Coefficients: (2 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.11165 0.05766 1.936 0.0531 .
A2 -0.67213 0.91170 -0.737 0.4612
A3 0.01293 0.08146 0.159 0.8739
A4 -0.04624 0.08196 -0.564 0.5728
B2 0.62320 0.91165 0.684 0.4944
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9099 on 995 degrees of freedom
Multiple R-squared: 0.001426, Adjusted R-squared: -0.002588
F-statistic: 0.3553 on 4 and 995 DF, p-value: 0.8404
Notice that two of the factor levels are omitted from the calculation of coefficients because they are completely collinear with the corresponding A levels. So if you want to see what vif returns for factor variables that are almost collinear, you need to change a few more values:
> dat2[1,2] <- 2
> dat2[2,2] <-2; dat2[3,2]<-2; dat2[4,2]<-4
> mod <- lm( res ~ A+B, dat2)
> summary(mod)
lm(formula = res ~ A + B, data = dat2)
Min 1Q Median 3Q Max
-2.42819 -0.59241 -0.04483 0.62482 2.12461
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.11165 0.05768 1.936 0.0532 .
A2 -0.67213 0.91201 -0.737 0.4613
A3 -1.51763 1.17803 -1.288 0.1980
A4 -0.97195 1.17710 -0.826 0.4092
B2 0.62320 0.91196 0.683 0.4945
B3 1.52500 1.17520 1.298 0.1947
B4 0.92448 1.17520 0.787 0.4317
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9102 on 993 degrees of freedom
Multiple R-squared: 0.002753, Adjusted R-squared: -0.003272
F-statistic: 0.4569 on 6 and 993 DF, p-value: 0.8403
> library(rms)
> vif(mod)
A2 A3 A4 B2 B3 B4
192.6898 312.4128 308.5177 191.2080 312.5856 307.5242


Regression without intercept in R and Stata

Recently, I stumbled upon the fact that Stata and R handle regressions without intercept differently. I'm not a statistician, so please be kind if my vocabulary is not ideal.
I tried to make the example somewhat reproducible. This is my example in R:
> set.seed(20210211)
> df <- data.frame(y = runif(50), x = runif(50))
> df$d <- df$x > 0.5
> (tmp <- tempfile("data", fileext = ".csv"))
[1] "C:\\Users\\s1504gl\\AppData\\Local\\Temp\\1\\RtmpYtS6uk\\data1b2c1c4a96.csv"
> write.csv(df, tmp, row.names = FALSE)
> summary(lm(y ~ x + d, data = df))
lm(formula = y ~ x + d, data = df)
Min 1Q Median 3Q Max
-0.48651 -0.27449 0.03828 0.22119 0.53347
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.4375 0.1038 4.214 0.000113 ***
x -0.1026 0.3168 -0.324 0.747521
dTRUE 0.1513 0.1787 0.847 0.401353
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2997 on 47 degrees of freedom
Multiple R-squared: 0.03103, Adjusted R-squared: -0.0102
F-statistic: 0.7526 on 2 and 47 DF, p-value: 0.4767
> summary(lm(y ~ x + d + 0, data = df))
lm(formula = y ~ x + d + 0, data = df)
Min 1Q Median 3Q Max
-0.48651 -0.27449 0.03828 0.22119 0.53347
Estimate Std. Error t value Pr(>|t|)
x -0.1026 0.3168 -0.324 0.747521
dFALSE 0.4375 0.1038 4.214 0.000113 ***
dTRUE 0.5888 0.2482 2.372 0.021813 *
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2997 on 47 degrees of freedom
Multiple R-squared: 0.7196, Adjusted R-squared: 0.7017
F-statistic: 40.21 on 3 and 47 DF, p-value: 4.996e-13
And here is what I have in Stata (please note that I have copied the filename from R to Stata):
. import delimited "C:\Users\s1504gl\AppData\Local\Temp\1\RtmpYtS6uk\data1b2c1c4a96.csv"
(3 vars, 50 obs)
. encode d, generate(d_enc)
. regress y x i.d_enc
Source | SS df MS Number of obs = 50
-------------+---------------------------------- F(2, 47) = 0.75
Model | .135181652 2 .067590826 Prob > F = 0.4767
Residual | 4.22088995 47 .089806169 R-squared = 0.0310
-------------+---------------------------------- Adj R-squared = -0.0102
Total | 4.3560716 49 .08889942 Root MSE = .29968
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
x | -.1025954 .3168411 -0.32 0.748 -.7399975 .5348067
d_enc |
TRUE | .1512977 .1786527 0.85 0.401 -.2081052 .5107007
_cons | .4375371 .103837 4.21 0.000 .2286441 .6464301
. regress y x i.d_enc, noconstant
Source | SS df MS Number of obs = 50
-------------+---------------------------------- F(2, 48) = 38.13
Model | 9.23913703 2 4.61956852 Prob > F = 0.0000
Residual | 5.81541777 48 .121154537 R-squared = 0.6137
-------------+---------------------------------- Adj R-squared = 0.5976
Total | 15.0545548 50 .301091096 Root MSE = .34807
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
x | .976214 .2167973 4.50 0.000 .5403139 1.412114
d_enc |
TRUE | -.2322011 .1785587 -1.30 0.200 -.5912174 .1268151
As you can see, the results of the regression with intercept are identical. But if I omit the intercept (`+ 0` in R, `, noconstant` in Stata), the results differ. In R, the intercept is now captured in dFALSE, which is reasonable from what I understand. I don't understand what Stata is doing here. Also, the degrees of freedom differ.
My questions:
Can anyone explain to me how Stata is handling this?
How can I replicate Stata's behavior in R?
I believe bas pointed in the right direction, but I am still unsure why both results differ.
I am not attempting to answer the question, but to provide a deeper understanding of what Stata is doing by digging into the source of R's lm() function. In the following lines I replicate what lm() does, but skip sanity checks and options such as weights, contrasts, etc.
(I cannot yet fully understand why, in the second regression (with no constant), the dFALSE coefficient captures the effect of the intercept in the default regression (with constant).)
set.seed(20210211)
df <- data.frame(y = runif(50), x = runif(50))
df$d <- df$x > 0.5
lm() With Constant
form_default <- as.formula(y ~ x + d)
mod_frame_def <- model.frame(form_default, df)
mod_matrix_def <- model.matrix(object = attr(mod_frame_def, "terms"), mod_frame_def)
head(mod_matrix_def)
#> (Intercept) x dTRUE
#> 1 1 0.7861162 1
#> 2 1 0.2059603 0
#> 3 1 0.9793946 1
#> 4 1 0.8569093 1
#> 5 1 0.8124811 1
#> 6 1 0.7769280 1
# fit directly on the design matrix, as lm() does internally
lm.fit(
  y = model.response(mod_frame_def),
  x = mod_matrix_def
)$coefficients
#> (Intercept) x dTRUE
#> 0.4375371 -0.1025954 0.1512977
lm() No Constant
form_nocon <- as.formula(y ~ x + d + 0)
mod_frame_nocon <- model.frame(form_nocon, df)
mod_matrix_nocon <- model.matrix(object = attr(mod_frame_nocon, "terms"), mod_frame_nocon)
head(mod_matrix_nocon)
#> x dFALSE dTRUE
#> 1 0.7861162 0 1
#> 2 0.2059603 1 0
#> 3 0.9793946 0 1
#> 4 0.8569093 0 1
#> 5 0.8124811 0 1
#> 6 0.7769280 0 1
# without an intercept, the factor gets one dummy column per level
lm.fit(
  y = model.response(mod_frame_nocon),
  x = mod_matrix_nocon
)$coefficients
#> x dFALSE dTRUE
#> -0.1025954 0.4375371 0.5888348
lm() with as.numeric()
[as indicated in the comments by bas]
form_asnum <- as.formula(y ~ x + as.numeric(d) + 0)
mod_frame_asnum <- model.frame(form_asnum, df)
mod_matrix_asnum <- model.matrix(object = attr(mod_frame_asnum, "terms"), mod_frame_asnum)
head(mod_matrix_asnum)
#> x as.numeric(d)
#> 1 0.7861162 1
#> 2 0.2059603 0
#> 3 0.9793946 1
#> 4 0.8569093 1
#> 5 0.8124811 1
#> 6 0.7769280 1
# a single 0/1 column and no intercept reproduces the Stata coefficients
lm.fit(
  y = model.response(mod_frame_asnum),
  x = mod_matrix_asnum
)$coefficients
#> x as.numeric(d)
#> 0.9762140 -0.2322012
Created on 2021-03-18 by the reprex package (v1.0.0)
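The three fits above can be put side by side in a short sketch. It reuses the seed and data-generating lines from the question; the object names m_int, m_noint and m_stata are made up here. The point it tries to show is that R's `+ 0` fit is only a re-parameterisation of the intercept model (same column space, same fitted values), whereas the single-dummy fit without a constant is a genuinely different, smaller model, which would be consistent with the 48 residual degrees of freedom Stata reports.

```r
set.seed(20210211)
df <- data.frame(y = runif(50), x = runif(50))
df$d <- df$x > 0.5

m_int   <- lm(y ~ x + d, data = df)                   # intercept + one dummy
m_noint <- lm(y ~ x + d + 0, data = df)               # no intercept, two dummies
m_stata <- lm(y ~ x + as.numeric(d) + 0, data = df)   # no intercept, one dummy

# Same fitted values: dFALSE plays the role of the intercept,
# and dTRUE equals intercept + dTRUE from the first model
same_fit <- isTRUE(all.equal(fitted(m_int), fitted(m_noint)))
print(same_fit)
print(coef(m_int))
print(coef(m_noint))

# The single-dummy model has one parameter fewer, hence one more residual df
print(c(int = df.residual(m_int),
        noint = df.residual(m_noint),
        stata_like = df.residual(m_stata)))
print(coef(m_stata))
```

If the goal is to mirror the Stata output in R, the last model is the one to use; converting the logical to numeric stops R from expanding it into a full set of level dummies.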

R equivalent of Stata * in regression

I am looking for an equivalent in R of Stata's * function that I can use when running regressions.
For example, if I have a dataframe like the following:
outcome var1 var2 var3 new
3 2 3 4 3
2 3 2 4 2
4 3 2 1 4
I would like to be able to select all variable names that begin with "var" without typing each one out separately in order to run the following regression more efficiently:
lm(outcome ~ var1 + var2 + var3 + new, data = df)
This question explains how I can select the necessary columns. How can I cleanly incorporate these into a regression?
One technique is to subset the data to the required columns, and then to use the . operator for the formula object to represent the independent variables in lm(). The . operator is interpreted as "all columns not otherwise in the formula".
data <- as.data.frame(matrix(runif(1000), nrow = 100) * 100)
colnames(data) <- c("outcome", "x1","x2","x3","x4", "x5","x6", "x7", "var8", "var9")
# select outcome plus vars beginning with var
desiredCols <- grepl("^var", colnames(data)) | colnames(data) == "outcome"
# use desiredCols to subset data frame argument in lm()
summary(lm(outcome ~ .,data = data[desiredCols]))
...and the output:
> summary(lm(outcome ~ .,data = data[desiredCols]))
lm(formula = outcome ~ ., data = data[desiredCols])
Min 1Q Median 3Q Max
-57.902 -25.359 2.296 26.213 52.871
Estimate Std. Error t value Pr(>|t|)
(Intercept) 58.712722 7.334937 8.005 2.62e-12 ***
var8 0.008617 0.101298 0.085 0.932
var9 -0.154073 0.103438 -1.490 0.140
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 29.86 on 97 degrees of freedom
Multiple R-squared: 0.02249, Adjusted R-squared: 0.002331
F-statistic: 1.116 on 2 and 97 DF, p-value: 0.3319
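Another option, sketched here under the assumption that the columns of interest all start with "var", is to build the formula from the matching names with reformulate() from the stats package. The data frame below is made up to resemble the one in the question (with more rows so the model can be fitted); var_cols and form are hypothetical names.

```r
# Made-up data shaped like the example in the question
set.seed(2024)
df <- data.frame(
  outcome = rnorm(20),
  var1 = rnorm(20),
  var2 = rnorm(20),
  var3 = rnorm(20),
  new = rnorm(20)
)

# Pick every column whose name begins with "var"
var_cols <- grep("^var", names(df), value = TRUE)

# Build outcome ~ var1 + var2 + var3 without typing the names
form <- reformulate(var_cols, response = "outcome")
print(form)

fit <- lm(form, data = df)
print(coef(fit))
```

Compared with subsetting the data frame, this keeps the full data available to lm(), so further terms can be appended to var_cols if needed.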

How is Pr(>|t|) in a linear regression in R calculated?

What formula is used to calculate the value of Pr(>|t|) that is output when linear regression is performed by R?
I understand that the value of Pr(>|t|) is a p-value, but I do not understand how the value is calculated.
For example, although the value of Pr(>|t|) for x1 is displayed as 0.021 in the output below, I want to know how this value was calculated.
x1 <- c(10,20,30,40,50,60,70,80,90,100)
x2 <- c(20,30,60,70,100,110,140,150,180,190)
y <- c(100,120,150,180,210,220,250,280,310,330)
summary(lm(y ~ x1+x2))
lm(formula = y ~ x1 + x2)
Min 1Q Median 3Q Max
-6 -2 0 2 6
Estimate Std. Error t value Pr(>|t|)
(Intercept) 74.0000 3.4226 21.621 1.14e-07 ***
x1 1.8000 0.6071 2.965 0.021 *
x2 0.4000 0.3071 1.303 0.234
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.781 on 7 degrees of freedom
Multiple R-squared: 0.9971, Adjusted R-squared: 0.9963
F-statistic: 1209 on 2 and 7 DF, p-value: 1.291e-09
Basically, the values in the column t-value are obtained by dividing the coefficient estimate (which is in the Estimate column) by the standard error.
For example in your case in the second row we get that:
tval = 1.8000 / 0.6071 = 2.965
The column you are interested in is the p-value. It is the probability that the absolute value of a t-distributed random variable (with the residual degrees of freedom) exceeds 2.965. Using the symmetry of the t-distribution, this probability is:
2 * pt(abs(tval), rdf, lower.tail = FALSE)
Here rdf denotes the residual degrees of freedom, which in our case is equal to 7:
rdf = number of observations minus total number of coefficients = 10 - 3 = 7
And a simple check shows that this is indeed what R does:
2 * pt(2.965, 7, lower.tail = FALSE)
[1] 0.02095584
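The same check can be run for every coefficient at once. The sketch below uses the data from the question and recomputes the t values and p-values from the estimates and standard errors; est, se, tval and pval are names chosen for this illustration.

```r
x1 <- c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100)
x2 <- c(20, 30, 60, 70, 100, 110, 140, 150, 180, 190)
y  <- c(100, 120, 150, 180, 210, 220, 250, 280, 310, 330)
fit <- lm(y ~ x1 + x2)

est  <- coef(fit)                 # Estimate column
se   <- sqrt(diag(vcov(fit)))     # Std. Error column
tval <- est / se                  # t value column
rdf  <- df.residual(fit)          # residual degrees of freedom (10 - 3)

# Two-sided p-value from the t distribution with rdf degrees of freedom
pval <- 2 * pt(abs(tval), rdf, lower.tail = FALSE)

print(cbind(Estimate = est, `Std. Error` = se, `t value` = tval, `Pr(>|t|)` = pval))

# Compare with what summary() reports
print(all.equal(pval, coef(summary(fit))[, "Pr(>|t|)"]))
```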

Fitting a linear regression model in R with confounding variables

I have a dataset called datamoth where survival is the response variable and treatment is a variable that can be considered both categorical and quantitative. The dataset looks as follows:
survival <- c(17,22,26,20,11,14,37,26,24,11,11,16,8,5,12,3,5,4,14,8,4,6,3,3,10,13,5,7,3,3)
treatment <- c(3,3,3,3,3,3,6,6,6,6,6,6,9,9,9,9,9,9,12,12,12,12,12,12,21,21,21,21,21,21)
days <- c(3,3,3,3,3,3,6,6,6,6,6,6,9,9,9,9,9,9,12,12,12,12,12,12,21,21,21,21,21,21)
datamoth <- data.frame(survival, treatment, days)
So, I can fit a linear regression model considering treatment as categorical, like this:
lmod<-lm(survival ~ factor(treatment), datamoth)
My question is how to fit a linear regression model with treatment as categorical variable but also considering treatment as a quantitative confounding variable.
I have figured out something like this:
model <- lm(survival ~ factor(treatment) + factor(treatment)*days, data = datamoth)
lm(formula = survival ~ factor(treatment) + factor(treatment) *
days, data = datamoth)
Min 1Q Median 3Q Max
-9.833 -3.333 -1.167 3.167 16.167
Coefficients: (5 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 18.333 2.435 7.530 6.96e-08 ***
factor(treatment)6 2.500 3.443 0.726 0.47454
factor(treatment)9 -12.167 3.443 -3.534 0.00162 **
factor(treatment)12 -12.000 3.443 -3.485 0.00183 **
factor(treatment)21 -11.500 3.443 -3.340 0.00263 **
days NA NA NA NA
factor(treatment)6:days NA NA NA NA
factor(treatment)9:days NA NA NA NA
factor(treatment)12:days NA NA NA NA
factor(treatment)21:days NA NA NA NA
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.964 on 25 degrees of freedom
Multiple R-squared: 0.5869, Adjusted R-squared: 0.5208
F-statistic: 8.879 on 4 and 25 DF, p-value: 0.0001324
But obviously this code is not working, because these two variables are collinear.
Does anyone know how to fix it? Any help will be appreciated.
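The collinearity can be made explicit by looking at the rank of the design matrix. The sketch below is only a diagnostic, not a fix; it uses the vectors from the question and assumes days is stored as a column of datamoth. X and n_aliased are names introduced for the illustration.

```r
survival  <- c(17,22,26,20,11,14,37,26,24,11,11,16,8,5,12,3,5,4,14,8,4,6,3,3,10,13,5,7,3,3)
treatment <- c(3,3,3,3,3,3,6,6,6,6,6,6,9,9,9,9,9,9,12,12,12,12,12,12,21,21,21,21,21,21)
days      <- treatment   # identical values, as in the question
datamoth  <- data.frame(survival, treatment, days)

# Design matrix for the model with factor, numeric copy and their interaction
X <- model.matrix(survival ~ factor(treatment) * days, data = datamoth)
print(ncol(X))       # number of columns requested by the formula
print(qr(X)$rank)    # number of linearly independent columns

# The difference is the number of coefficients lm() reports as NA
model <- lm(survival ~ factor(treatment) * days, data = datamoth)
n_aliased <- sum(is.na(coef(model)))
print(n_aliased)
```

Because days takes exactly one value within each treatment level, the five dummy-coded columns already span everything days and the interaction columns could add, so those extra terms cannot be estimated.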

How do I use the glm() function?

I'm trying to fit a general linear model (GLM) on my data using R. I have a Y continuous variable and two categorical factors, A and B. Each factor is coded as 0 or 1, for presence or absence.
Even if just looking at the data I see a clear interaction between A and B, the GLM says that p-value>>>0.05. Am I doing something wrong?
First of all, I create the data frame containing my data for the GLM, which consists of a dependent variable Y and two factors, A and B. These are two-level factors (0 and 1). There are 3 replicates per combination.
Let's see what it looks like:
## A B Y
## 1 0 0 0.90
## 2 0 0 0.87
## 3 0 0 0.93
## 4 1 0 0.85
## 5 1 0 0.98
## 6 1 0 0.96
## 7 0 1 0.56
## 8 0 1 0.58
## 9 0 1 0.59
## 10 1 1 0.02
## 11 1 1 0.03
## 12 1 1 0.04
As we can see just by looking at the data, there is a clear interaction between factor A and factor B, as the value of Y decreases dramatically when A and B are both present (that is, A=1 and B=1). However, using the glm function I get no significant interaction between A and B, as p-value>>>0.05
## The following objects are masked _by_ .GlobalEnv:
## A, B, Y
## Warning: non-integer #successes in a binomial glm!
## Call:
## glm(formula = Y ~ A + B + A * B, family = binomial, data = my_data)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.275191 -0.040838 0.003374 0.068165 0.229196
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.1972 1.9245 1.142 0.254
## A 0.3895 2.9705 0.131 0.896
## B -1.8881 2.2515 -0.839 0.402
## A:B -4.1747 4.6523 -0.897 0.370
## (Dispersion parameter for binomial family taken to be 1)
## Null deviance: 7.86365 on 11 degrees of freedom
## Residual deviance: 0.17364 on 8 degrees of freedom
## AIC: 12.553
## Number of Fisher Scoring iterations: 6
Although you state that Y is continuous, the data show that Y is really a fraction, which is probably why you tried to apply a GLM in the first place.
Fractions (i.e. continuous values bounded by 0 and 1) can be modeled with logistic regression if certain assumptions are fulfilled; see the related Cross Validated post for details. However, from the data description it is not clear that those assumptions are fulfilled.
Alternatives for modeling fractions are beta regression and fractional response models.
See below how to apply those methods to your data. The results of both methods are consistent in terms of signs and significance.
# Beta regression
library(betareg)
result.betareg <- betareg(Y ~ A + B + A*B, data = my_data)
summary(result.betareg)
# Call:
# betareg(formula = Y ~ A + B + A * B, data = my_data)
# Standardized weighted residuals 2:
# Min 1Q Median 3Q Max
# -2.7073 -0.4227 0.0682 0.5574 2.1586
# Coefficients (mean model with logit link):
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 2.1666 0.2192 9.885 < 2e-16 ***
# A 0.6471 0.3541 1.828 0.0676 .
# B -1.8617 0.2583 -7.206 5.76e-13 ***
# A:B -4.2632 0.5156 -8.268 < 2e-16 ***
# Phi coefficients (precision model with identity link):
# Estimate Std. Error z value Pr(>|z|)
# (phi) 71.57 29.50 2.426 0.0153 *
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Type of estimator: ML (maximum likelihood)
# Log-likelihood: 24.56 on 5 Df
# Pseudo R-squared: 0.9626
# Number of iterations: 62 (BFGS) + 2 (Fisher scoring)
# ----------------------------------------------------------
# Fractional response model
library(frm)
frm(Y, cbind(A, B, AB = A*B), linkfrac = "logit")
# *** Fractional logit regression model ***
# Estimate Std. Error t value Pr(>|t|)
# INTERCEPT 2.197225 0.157135 13.983 0.000 ***
# A 0.389465 0.530684 0.734 0.463
# B -1.888120 0.159879 -11.810 0.000 ***
# AB -4.174668 0.555642 -7.513 0.000 ***
# Note: robust standard errors
# Number of observations: 12
# R-squared: 0.992
Specifying family=binomial implies logit (logistic) regression, which models a binary outcome.
From Quick-R
Logistic Regression
Logistic regression is useful when you are predicting a binary outcome
from a set of continuous predictor variables. It is frequently
preferred over discriminant function analysis because of its less
restrictive assumptions.
The data shows an interaction. Try to fit a different model, logistic is not appropriate.
with(my_data, interaction.plot(A, B, Y, fixed = TRUE, col = 2:3, type = "l"))
An analysis of variance shows clear significance for all factors and interaction.
fit <- aov(Y ~ A*B, data = my_data)
summary(fit)
Df Sum Sq Mean Sq F value Pr(>F)
A 1 0.2002 0.2002 130.6 3.11e-06 ***
B 1 1.1224 1.1224 732.0 3.75e-09 ***
A:B 1 0.2494 0.2494 162.7 1.35e-06 ***
Residuals 8 0.0123 0.0015
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
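For readers who want to reproduce this from scratch, the sketch below rebuilds my_data from the values listed in the question and fits the equivalent Gaussian linear model with lm(). The construction of my_data is an assumption based on the printed table; fit_lm, coefs and cell_means are names made up for the example.

```r
# Data as printed in the question: 3 replicates per A/B combination
my_data <- data.frame(
  A = rep(c(0, 1, 0, 1), each = 3),
  B = rep(c(0, 0, 1, 1), each = 3),
  Y = c(0.90, 0.87, 0.93, 0.85, 0.98, 0.96,
        0.56, 0.58, 0.59, 0.02, 0.03, 0.04)
)

# Ordinary linear model with main effects and interaction
fit_lm <- lm(Y ~ A * B, data = my_data)
coefs <- coef(summary(fit_lm))
print(coefs)

# Cell means make the interaction visible: the drop for A = 1, B = 1
# is far larger than the two main effects would suggest
cell_means <- with(my_data, tapply(Y, list(A = A, B = B), mean))
print(cell_means)
```

The interaction estimate is the difference of differences of the four cell means, which is why it is large and clearly distinguishable from noise at this residual variance.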
