Mixed interaction terms in a linear model - R

I am testing a mixed model with 4 predictors: 2 categorical predictors (with 6 and 7 levels respectively) and 2 quantitative predictors.
I would like to know if I am allowed, while testing my model, to create interaction terms in which I mix categorical and quantitative predictors.
Suppose Y = f(a, b) is the model I want to test, a is the quantitative predictor and b is the categorical predictor.
Am I allowed to search for (example in R):
linfit <- lm(Y ~ a + b + a:b, data = mydata)
Is the interpretation of the results similar to the one I get when both predictors are quantitative?

First, the code you wrote is right; R will give you a result. And if the class of b has already been set to factor, R will treat b as a categorical predictor in the regression.
Second, I assume you are asking about the statistical interpretation of the interaction term. The statistical meanings of the three situations below are not the same:
(1) a and b are quantitative predictors.
In the regression output from R, there will be four rows: the intercept, a, b, and a:b. The regression treats the product a*b as another quantitative variable and fits a linear regression:
y = β0 + β1⋅a + β2⋅b + β3⋅a*b
(2) a and b are categorical predictors.
Suppose a has 3 levels and b has 2. Write out the design matrix, which consists of 0s and 1s:
y = β0 + β1⋅a2 + β2⋅a3 + β3⋅b2 + β4⋅a2*b2 + β5⋅a3*b2
(3) a is categorical and b is a quantitative predictor.
Suppose a has 3 levels.
y = β0 + β1⋅a2 + β2⋅a3 + β3⋅b + β4⋅a2*b + β5⋅a3*b
For more details on interaction terms and design matrices, a text on (generalized) linear models will cover this; it is also easy to try it out in R and inspect the regression results.
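As a minimal sketch (with hypothetical simulated data), you can fit case (3) and inspect the design matrix R builds for a factor-by-numeric interaction:
# Hypothetical simulated data: a is a 3-level factor, b is numeric
set.seed(1)
mydata <- data.frame(a = factor(rep(c("A1", "A2", "A3"), each = 10)),
                     b = rnorm(30),
                     Y = rnorm(30))
linfit <- lm(Y ~ a + b + a:b, data = mydata)
summary(linfit)             # rows: (Intercept), aA2, aA3, b, aA2:b, aA3:b
head(model.matrix(linfit))  # dummy columns for a and their products with b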

Related

Correlation test for binary DV and categorical IV

I have two variables and I want to test the correlation between them. The dependent variable is binary (0/1) and the independent variable is categorical with 5 possible categories. My instinct was to do this using logistic regression, but I am wondering if there are more suitable alternatives given some of the challenges below.
Basically, I am having a little bit of trouble properly interpreting the logistic regression output in light of my specific goal. In R, the default parameters for estimating logistic regression dictate that it holds one of these categories constant (as the intercept) and reports the coefficients of the other categories relative to the intercept. That's not what I want; rather, I want to be able to report the effect of each category in the IV on the DV with all other categories held constant. I have tried suppressing the intercept, but have read elsewhere that this is generally not a good idea in logistic regression. So I am wondering if anybody can shed light on this strategy, or offer alternatives that will help me get to where I need to be. Thanks!
For testing association among categorical variables, apply a chi-squared test and inspect its Pearson residuals. You can then plot them using the corrplot package.
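A minimal sketch of that idea, with hypothetical data standing in for your 5-level IV and binary DV (the names iv and dv are assumptions):
library(corrplot)
set.seed(1)
# Hypothetical data: 5-level IV and binary DV
mydata <- data.frame(iv = sample(paste0("cat", 1:5), 300, replace = TRUE),
                     dv = rbinom(300, 1, 0.4))
tab <- table(mydata$iv, mydata$dv)        # 5 x 2 contingency table
chi <- chisq.test(tab)
chi$residuals                             # Pearson residuals: which cells deviate, and in which direction
corrplot(chi$residuals, is.corr = FALSE)  # visualize the residuals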
Explanation
I think you are misunderstanding how the intercept works with categorical variables, so it is important to remember that it is a linear equation (why this is important is detailed below). The intercept in this case is the reference level of your category. So if you have a predictor that has three categories (e.g. "Control Group", "Treatment 1", and "Treatment 2"), whichever is the default or first-assigned level will be used as the intercept (in this case "Control Group", because it is the first level).
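If you want a different category to serve as the reference (and hence the intercept), you can relevel the factor; a small sketch with a hypothetical factor:
# Hypothetical three-level factor
group <- factor(c("Control Group", "Treatment 1", "Treatment 2"))
levels(group)                                  # "Control Group" is first, so it becomes the intercept
group2 <- relevel(group, ref = "Treatment 1")  # now "Treatment 1" is the reference level
levels(group2)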
Single Predictor Case
The example below uses the hdp data I borrowed from here; it is intended for a logistic GLMM, but it still works for a simple demonstration of a regular logistic regression:
#### Load Data ####
hdp <- read.csv("https://stats.idre.ucla.edu/stat/data/hdp.csv")
hdp <- within(hdp, {
  Married <- factor(Married, levels = 0:1, labels = c("no", "yes"))
  DID <- factor(DID)
  HID <- factor(HID)
  CancerStage <- factor(CancerStage)
})
We will fit the data with remission as the binary outcome (whether or not the cancer goes into remission, coded 0/1) and Sex as a categorical predictor (with female as the reference group). We will also add a continuous predictor, red blood cell count (RBC). Then we summarize the model:
#### Fit First Model ####
fit <- glm(remission ~ Sex + RBC,
           family = binomial,
           data = hdp)
summary(fit)
If you run the last line, summary(fit), you will get a lot of information, so I have only included the coefficients below:
Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  -1.26965    0.41124  -3.087  0.00202 **
Sexmale       0.05474    0.04835   1.132  0.25753
RBC           0.07602    0.08210   0.926  0.35447
Linear Equations and Predict Function
This linear equation is the linear predictor which, because this is a logistic regression, is on the log-odds scale:
logit(remission) = -1.26965 + (0.05474 * [Sex is male]) + (0.07602 * RBC)
So for a female, the middle term drops out (female is dummy coded as 0, so 0.05474 * 0 = 0) and the equation simplifies to:
logit(remission) = -1.26965 + (0.07602 * RBC)
You can actually test this in R with the predict function. Here, I have created new data with a male and an RBC count of 5.
new.data <- data.frame(Sex = "male", RBC = 5)
Then obtain a prediction from your model with this data's linear equation:
predict(fit, newdata = new.data)
The output is below:
1
-0.8347961
This is correct, as the linear predictor when Sex is male (dummy coded 1) and RBC is five equals:
logit(remission) = -1.26965 + (0.05474 * 1) + (0.07602 * 5) = -0.8347961
For a female, this equation becomes:
logit(remission) = -1.26965 + (0.05474 * 0) + (0.07602 * 5) = -0.88955
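Since predict on a glm uses the link scale by default, these values are log-odds. As a small follow-up sketch, you can convert them to probabilities either by asking for the response scale or with plogis:
predict(fit, newdata = new.data, type = "response")  # predicted probability of remission for this new male
plogis(-0.8347961)                                   # same conversion by hand; roughly 0.30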
Resource
By the way, a good book on learning logistic regression in R is Practical Guide to Logistic Regression by Joseph Hilbe, and a specific section detailing how to interpret categorical predictors can be found on Page 28.

Error when using a multilevel regression (lme4)

I want to use a multilevel regression to analyse the effect of some independent variables on a dependent variable and use varying intercept and slope.
My regression includes non-numeric independent variables which I want to use for the varying intercept and slope. The dependent variable is a numeric variable. When using this multilevel regression I get the following error:
Error: number of observations (=88594) <= number of random effects (=337477) for term (1 + x | z); the random-effects parameters and the residual variance (or scale parameter) are probably unidentifiable
x and z are characters and are correlated with each other and with y. This is the regression I use:
multi_reg1 <- lmer(y ~ 1 + x + (1 + x | z), REML = FALSE, data = data_frame1)
Is there a way to fix this problem or is it not possible and I have to use other regression methods?

understanding lmer random effects in R

What is the point of the "1 +" in (1 + X1|X2) structure of the random effect of an lmer function in lme4 package of R, and how does this differ from (1|X1) + (1|X2)?
As the comment suggests, looking at the GLMM FAQ might be useful.
(1+X1|X2) is identical to (X1|X2) (due to R's default of adding an intercept). This fits a model where all of the effects of X1 (i.e. all of the predictors that we would get from a linear model using y ~ X1) vary across the groups/levels defined by X2, and all of the correlations among these varying effects are estimated.
if X1 is numeric, this fits a random-slopes model that estimates the variation in the intercept across groups, the variation in the slope across groups, and their covariance (correlation).
if X1 is categorical (a factor), this estimates variation based on the contrasts used for X1. Suppose X1 has three levels {A, B, C} and the default treatment contrasts are being used. Then a 3x3 covariance matrix is estimated which includes
the variation in the intercept (== the expected value in level A) across groups
the variation in the difference between A and B across groups
the variation in the difference between A and C across groups
all three pairwise covariances/correlations (A vs A-B, A vs A-C, A-B vs A-C)
The formula (1|X1) + (1|X2) only makes sense if X1 is categorical (only categorical variables, or variables that can be treated as categorical, make sense as grouping variables). This estimates the variation in the intercept (baseline response) among levels of X2 and the variation in the intercept (baseline response) among levels of X1.
As a final note, it's hard for me to think of a case where the latter formula ((1|X1) + (1|X2)) would make sense as an alternative to (X1|X2) ...
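A small sketch contrasting the two formulas with lme4 on hypothetical simulated data (the names y, X1, X2 and the simulation are assumptions, purely for illustrating the structure):
library(lme4)
set.seed(1)
# Hypothetical data: X1 is a 3-level factor, X2 a grouping factor with 20 levels
dat <- data.frame(X1 = factor(sample(c("A", "B", "C"), 400, replace = TRUE)),
                  X2 = factor(sample(1:20, 400, replace = TRUE)))
dat$y <- rnorm(400) + rnorm(20)[dat$X2]               # add some group-level variation
m1 <- lmer(y ~ X1 + (X1 | X2), data = dat)            # estimates a 3x3 covariance of X1 effects across X2
m2 <- lmer(y ~ X1 + (1 | X1) + (1 | X2), data = dat)  # only intercept variation for X1 and for X2
VarCorr(m1)  # fits on simulated noise may be (near-)singular, but the estimated structure is what matters here
VarCorr(m2)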

How to deal with NA in the output of the linear regression in R, specifically NA for the three-way interaction terms?

I am running a linear regression with a three-way interaction in R:
lm(A ~ X*Y*Z), where A is a numerical variable and X, Y, Z are all categorical variables (factors).
X has 5 levels, Y has 2 levels, Z has 4 levels. Every time I run the regression, the three-way interaction for the last level is missing. For example, if I relevel the Z factor, its last level gets dropped from the three-way interaction.
Coefficients: (8 not defined because of singularities). (This is mentioned in the R output)
I have tried suppressing the intercept, but it did not make any difference:
lm(A ~ 0 + X*Y*Z) or lm(A ~ X*Y*Z - 1), and all other possible combinations.
lm (A ~ X * Y * Z, data = dat)
X has 5 levels, Y has 2 levels and Z has 4 levels [one level of each variable is acting as the base level].
If you want to use the three categorical variables you could use a mixed-effects model package such as lme4:
model <- lmer(A ~ noise + (1 | Y) + (1 | X) + (1 | Z), data = data)
This will give you coefficients for the factors by fitting random intercepts, but you might need some additional information for the fixed effects, here the placeholder "noise".
Another approach could be creating dummy variables using the fastDummies package. This will create a binary variable out of each category that you have.
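A hedged sketch of the dummy-variable route with fastDummies (the data frame dat and its simulated columns are hypothetical, mirroring the 5/2/4-level structure described above):
library(fastDummies)
set.seed(1)
# Hypothetical data with the same structure: A numeric, X/Y/Z categorical
dat <- data.frame(A = rnorm(200),
                  X = factor(sample(paste0("x", 1:5), 200, replace = TRUE)),
                  Y = factor(sample(paste0("y", 1:2), 200, replace = TRUE)),
                  Z = factor(sample(paste0("z", 1:4), 200, replace = TRUE)))
dat_dummy <- dummy_cols(dat,
                        select_columns = c("X", "Y", "Z"),
                        remove_first_dummy = TRUE,      # drop one dummy per variable to avoid collinearity
                        remove_selected_columns = TRUE)
fit <- lm(A ~ ., data = dat_dummy)
summary(fit)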

Matlab/R - linear regression with categorical & continuous predictors - why is the continuous predictor squared?

I'm doing a linear regression with categorical predictors and a 0-to-1 numerical outcome. On this page I saw it suggested to square a numerical predictor when it appears alongside a nominal one (see the third section, Linear Regression with Categorical Predictor). The example they give (for Matlab, but this generalizes to R as well) uses the following formula, where weight is continuous and year is nominal:
mdl = fitlm(tbl,'MPG ~ Year + Weight^2')
Is this a universal rule? When I do it, I do get much stronger coefficients, but I want to make sure I'm not inflating them without warrant. Could someone explain the logic of squaring numerical predictors when they sit alongside categorical ones?
If you graph mpg vs. weight for each year separately and you see curvature, then a polynomial in weight might help correct for the non-linearity.
library(lattice)
u <- "https://raw.githubusercontent.com/shifteight/R/master/ISLR/Auto.csv"
Cars <- read.csv(u)
o <- with(Cars, order(year, weight))
xyplot(mpg ~ weight | year, Cars[o, ], type = c("p", "smooth"))
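If the panels do show curvature, a rough R analogue of the Matlab formula (year treated as categorical, weight entered with a quadratic term) might look like the sketch below; this is an assumption about how you would translate it, not a claim about the original Matlab model:
# Quadratic term in weight alongside year treated as categorical
fit2 <- lm(mpg ~ factor(year) + poly(weight, 2, raw = TRUE), data = Cars)
summary(fit2)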
