Dummy variables in R

I'm constructing a linear model to evaluate the effect of distances from a habitat boundary on the richness of an order of insects. There were some differences in the equipment used, so I am including equipment as a categorical variable to ensure that it hasn't had a significant effect on richness.
The categorical factor has 3 levels, so I asked R to produce dummy variables in the lm with the code:
lm(Richness ~ Distances + factor(Equipment), data = Data)
When I ask for the summary of the model I can see two of the levels with their coefficients. I am assuming this means R is using one of the levels as the "standard" against which the coefficients of the other levels are compared.
How can I find the coefficient for the third level in order to see what effect it has on the model?
Thank you

You can do lm(y ~ x - 1) to remove the intercept, which in your case corresponds to the reference level of your factor. That being said, there are statistical reasons for using one of the levels as a reference.
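A minimal sketch of that applied to the model in the question (assuming the Data, Richness, Distances and Equipment names from above):
# without the intercept, lm() reports one coefficient per Equipment level
lm(Richness ~ Distances + factor(Equipment) - 1, data = Data)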

To determine how to extract your coefficient, here is a simple example:
# load data
data(mtcars)
head(mtcars)
# what are the means of wt given the factor carb?
(means <- with(mtcars, tapply(wt, factor(carb), mean)))
# run the lm
mod <- with(mtcars, lm(wt~factor(carb)))
# extract the coefficients
coef(mod)
# the intercept is the reference level (i.e., carb 1)
coef(mod)[1]
coef(mod)[2:6]
coef(mod)[1] + coef(mod)[2:6]
means
So you can see that in this simple case the coefficients are simply added to the reference level (i.e., the intercept). However, if you have a covariate, it gets more complicated:
mod2 <- lm(wt ~ factor(carb) + disp, data=mtcars)
summary(mod2)
The intercept now represents carb 1 when disp = 0.
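If you want comparable group estimates from mod2, one option (a sketch, not from the original answer) is to predict each carb level at a common value of disp:
# predicted wt for each carb level, holding disp at its mean
newd <- data.frame(carb = sort(unique(mtcars$carb)), disp = mean(mtcars$disp))
predict(mod2, newdata = newd)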

Related

Is there a way to derive GVIF in Jamovi?

Apparently the car package's vif() function and the performance package's check_collinearity() function both calculate the generalized variance inflation factor (GVIF) automatically if a categorical variable is entered into a regression. As an example, this can be done with the iris dataset with the following code:
#### Categorical Fit ####
library(performance)
fit.cat <- lm(
  Petal.Length ~ Species + Petal.Width + Sepal.Width,
  iris
)
check_collinearity(fit.cat)
This gives an expected value of 26.10, which I have already hand calculated. However, Jamovi doesn't allow one to automatically add factors to a regression, so I dummy coded the factor and entered the same regression manually in Jamovi.
The value Jamovi reports does not match the one obtained from the R function. I also double checked in R to see whether Jamovi is just calculating the ordinary VIF instead:
1/(1 - summary(lm(as.numeric(Species) ~ 1 + Petal.Width + Sepal.Width,
                  iris))$r.squared)
But the values don't match, as this gives me a VIF of 12.78. Why is this happening, and is there a way to work around it in Jamovi?
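For reference, the hand calculation I mean follows Fox & Monette's (1992) determinant formula, roughly like this (a sketch; the Species column selection is just illustrative):
# correlation matrix of the predictor columns, intercept dropped
X <- model.matrix(fit.cat)[, -1]
R <- cor(X)
j <- grep("^Species", colnames(X))       # the two Species dummy columns
det(R[j, j]) * det(R[-j, -j]) / det(R)   # GVIF for the Species term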

3-way interaction in plot_model

I have fit a mixed-effects model and included a 3-way interaction between my fixed effects which are:
two categorical variables: A1(level1, level2), A2 (level1, level2)
continuous: B
model <- lmer(dependent_variable ~ A1 * A2 * B + random_factors, data = data)
To visualise the interaction, I am using plot_model from the "sjPlot" package:
plot_model(model, type="int", terms=c("A1", "A2", "B"))
The output seems to have broken my continuous variable (B) down into two categories (high B, low B) and then plotted the interaction for each category in a separate panel.
My question is:
What criterion does the "sjPlot" package use to categorise my continuous variable? What determines "high B" and what determines "low B"? And is there another, more informative way to visualise a three-way interaction?
Thank you!
Welcome to SO. I use sjPlot quite a bit and took this opportunity to consolidate some of my knowledge about it.
Calling plot_model(..., type = "int") will use the order of the variables in the regression formula to decide how to plot the interaction. Here I've drawn up an example using the mtcars dataset included in R. I transform two of the binary variables into factors first so it matches your example.
library(sjPlot)
mtcars$vs <- as.factor(as.character(mtcars$vs))
mtcars$am <- as.factor(as.character(mtcars$am))
m1 <- lm(mpg ~ vs * am * hp,
         data = mtcars)
plot_model(m1,
           type = "int")
This code produces a plot in which two values of the continuous variable hp have been selected and plotted in two separate panels.
Ben is correct: these values are pulled from the mdrt.values argument, which defaults to "minmax", meaning the values shown are the highest and lowest in that column, which we can verify:
range(mtcars$hp)
[1] 52 335
There are other options for this argument, some of which will be better or worse depending on your case. As you only have one continuous variable, we might want to show the whole predictor along the x-axis. We can do this in a few ways, one of which is changing the model formula so that hp is first.
m2 <- lm(mpg ~ hp * vs * am,
         data = mtcars)
plot_model(m2,
           type = "int")
However, this is probably not the best approach, as models sometimes take a while to fit and re-fitting the model just for plotting is a waste of time/electricity. It's useful to know that the type = "int" call is just a convenience, and that we can also plot interactions by setting type = "pred" and passing the effects we want plotted via the terms argument. The following code makes a very similar plot to the second one, but using the first model we fit. Changing the order of the terms inside c() will change the plot.
plot_model(m1,
           type = "pred",
           terms = c("hp", "vs", "am"))
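As a side note, if you want to keep the two-panel layout but pick different hp values, the mdrt.values argument mentioned above can be set directly, for example to the mean plus or minus one standard deviation (a small sketch):
plot_model(m1,
           type = "int",
           mdrt.values = "meansd")  # panels at mean - SD and mean + SD of hp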

How to find overall significance for main effects in a dummy interaction using anova()

I ran a Cox regression with two categorical variables (x1 and x2) and their interaction. I need to know the significance of the overall effect of x1, of x2, and of the interaction.
The overall effect of the interaction:
I know how to find out the overall effect of the interaction using anova():
library(survival)
fit_x1_x2 <- coxph(Surv(time, death) ~ x1 + x2 , data= df)
fit_full <- coxph(Surv(time, death) ~ x1 + x2 + x1:x2, data= df)
anova(fit_x1_x2, fit_full)
But how are we supposed to use anova() to find out the overall effect of x1 or x2? What I tried is this:
The overall effect of x1
fit_x2_ia <- coxph(Surv(time, death) ~ x2 + x1:x2, data= df)
fit_full <- coxph(Surv(time, death) ~ x1 + x2 + x1:x2, data= df)
anova(fit_x2_ia, fit_full)
The overall effect of x2
fit_x1_ia <- coxph(Surv(time, death) ~ x1 + x1:x2, data= df)
fit_full <- coxph(Surv(time, death) ~ x1 + x2 + x1:x2, data= df)
anova(fit_x1_ia, fit_full)
I am not sure whether this is how we are supposed to use anova(). The fact that the output shows zero degrees of freedom makes me sceptical. I am even more puzzled that both times, for the overall effect of x1 and of x2, the test is significant, although the log-likelihood values of the models are the same and the chi-squared value is zero.
Here is the data I used:
set.seed(1) # make it reproducible
df <- data.frame(x1= rnorm(1000), x2= rnorm(1000)) # generate data
df$death <- rbinom(1000,1, 1/(1+exp(-(1 + 2 * df$x1 + 3 * df$x2 + df$x1 * df$x2)))) # dead or not
library(tidyverse) # for cut_number() function
df$x1 <- cut_number(df$x1, 4); df$x2 <- cut_number(df$x2, 4) # make predictors to groups
df$time <- rnorm(1000); df$time[df$time<0] <- -df$time[df$time<0] # add survival times
The two models you have constructed for the "overall effect" really do not appear to satisfy the statistical property of being hierarchical, i.e., properly nested. Specifically, if you look at the actual models that get constructed with that code, you should see that they are actually the same model with different labels for the two-way crossed effects. In both cases you have 15 estimated coefficients (hence the zero degrees-of-freedom difference), and you will note that the x1 parameter in the full model has the same coefficient as the x2[-3.2532,-0.6843):x1[-0.6973,-0.0347) parameter in the "reduced" model looking for an x1 effect, namely 0.19729. The crossing operator is basically filling in all the missing cells for the main effects with interaction terms.
There really is little value in looking at interaction models without all of the main effects if you want to stay within the bounds of generally accepted statistical practice.
If you type:
fit_full
... you should get a summary of the model that has p-values for the x1 levels, the x2 levels, and the interaction levels. Because you chose to categorize each predictor into four groups with arbitrary cutpoints, you end up with a total of 15 parameter estimates. If instead you made no cuts and modeled the linear effects and the linear-by-linear interaction, you could get three p-values directly. I'm guessing there was a suspicion that the effects were not linear; if so, a cubic spline model might be more parsimonious and distort the biological reality less than discretization into 4 disjoint levels. If you think the effects might be non-linear but ordered, there is an ordered version of factor-classed variables, but the results are generally confusing to the uninitiated.
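A sketch of that linear-by-linear suggestion, assuming you keep a copy of the continuous x1 and x2 before they are cut into groups (df0 below stands for that hypothetical uncut data frame):
fit_lin <- coxph(Surv(time, death) ~ x1 * x2, data = df0)
summary(fit_lin)  # one p-value each for x1, x2 and the x1:x2 interaction
anova(fit_lin)    # sequential likelihood-ratio tests for the three terms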
The answer from 42- is informative, but after reading it I still did not know how to determine the three p-values, or whether this is possible at all. So I talked to a professor of biostatistics at my university. His answer was quite simple, and I share it in case others have similar questions.
In this design it is not possible to determine the three p-values for the overall effects of x1, x2 and their interaction. If we want to know the p-values of the three overall effects, we need to keep the continuous variables as they are. Breaking the variables up into groups answers a different question, so we cannot test the hypothesis about the overall effects, no matter which statistical model we use.

How can I apply extreme bounds analysis to a dataset of over 100 variables with the ExtremeBounds package in R?

I have a dataset consisting of 107 variables with 1794 observations. I want to implement extreme bounds analysis in order to determine which of the 106 explanatory variables are robustly correlated with the dependent variable across a wide range of regressions, each one with a different model specification. I intend to select the most robust variables for my definitive model.
I'm using Marek Hlavac's ExtremeBounds package. I'm trying to run the following line of code:
free <- eba(formula = flg_activacion_0_12 ~ ., data = Data1, k = 0:106,
            reg.fun = glm, family = binomial(link = "logit"), draws = 100)
The dependent variable flg_activacion_0_12 is a dummy, which is why I chose binomial(link = "logit") in the family argument.
The reg.fun argument tells eba() to fit generalized linear models such as the logit instead of OLS regressions.
I set the k argument to 0:106. That means I want to determine whether the variables are robust among models that include up to 106 variables. However, the total number of models to estimate would be immense. There are 106 possible models that include only one explanatory variable. There are 106!/[2!(104!)] possible models that include two explanatory variables. The argument draws = 100 limits the number of models to just 100: it runs only 100 models chosen randomly from the immense pool of models that can be written as combinations of the 106 variables.
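To put rough numbers on that:
choose(106, 1)           # 106 one-variable models
choose(106, 2)           # 5565 two-variable models
sum(choose(106, 0:106))  # all subsets of 106 variables: 2^106, about 8.1e31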
I believe the argument draws should make this task possible for my computer, but I get the following error messages:
All variables in argument 'focus' must be in the data frame.
Argument 'k' is too high for the given number of doubtful variables.
I have already checked the documentation, and since I haven't specified which variables are free, which are focus, and which are doubtful, all 106 variables should be considered focus. I don't understand why it suggests that some focus variables are not in my data frame. Please tell me what I am doing wrong and how I could do what I'm intending to do.
I think the problem here is with the formula argument. You get the same error with this code:
library(ExtremeBounds)
naive.eba <- eba(formula = mpg ~ ., data = mtcars, k = 0:9)
The model works well if you use (as in the ExtremeBounds vignette) the following command, which spells out the explanatory variables in the formula:
naive.eba <- eba(formula = mpg ~ cyl + carb + disp + hp + vs + drat + wt + qsec + gear + am, data = mtcars, k = 0:9)
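If typing out more than a hundred variable names by hand is impractical, one possible workaround (a sketch, not from the original answer) is to build the same fully spelled-out formula programmatically with reformulate():
# build "mpg ~ cyl + disp + ..." from the column names instead of typing them
rhs <- setdiff(names(mtcars), "mpg")
naive.eba <- eba(formula = reformulate(rhs, response = "mpg"),
                 data = mtcars, k = 0:9)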

Stata's xtlogit (fe, re) equivalent in R?

Stata allows for fixed-effects and random-effects specifications of the logistic regression through the xtlogit, fe and xtlogit, re commands respectively. I was wondering what the equivalent commands for these specifications are in R.
The only similar specification I am aware of is the mixed effects logistic regression
mymixedlogit <- glmer(y ~ x1 + x2 + x3 + (1 | x4), data = d, family = binomial)
but I am not sure whether this maps to any of the aforementioned commands.
The glmer command is used to quickly fit logistic regression models with varying intercepts and varying slopes (or, equivalently, a mixed model with fixed and random effects).
To fit a varying-intercept multilevel logistic regression model in R (that is, a random-effects logistic regression model), you can run the following using the built-in "mtcars" data set:
library(lme4)
data(mtcars)
head(mtcars)
m <- glmer(mtcars$am ~ 1 + mtcars$wt + (1|mtcars$gear), family="binomial")
summary(m)
# and you can examine the fixed and random effects
fixef(m); ranef(m)
To fit a varying-intercept model in Stata, you of course use the xtlogit command (using the similar but not identical built-in "auto" data set in Stata):
sysuse auto
xtset gear_ratio
xtlogit foreign weight, re
I'll add that I find the entire reference to "fixed" versus "random" effects ambiguous, and I prefer to refer to the structure of the model itself (e.g., are the intercepts varying? which slopes are varying, if any? is the model nested in 2 levels or more? are the levels cross-classified or not?). For a similar view, see Andrew Gelman's thoughts on "fixed" versus "random" effects.
Update: Ben Bolker's excellent comment below points out that in R it's more informative, when using predict(), to use the data = mtcars argument instead of, say, the dollar notation:
data(mtcars)
m1 <- glmer(mtcars$am ~ 1 + mtcars$wt + (1|mtcars$gear), family="binomial")
m2 <- glmer(am ~ 1 + wt + (1|gear), family="binomial", data=mtcars)
p1 <- predict(m1); p2 <- predict(m2)
names(p1) # not that informative...
names(p2) # very informative!
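A quick illustration of why this matters (a sketch; nd is just a made-up new observation):
nd <- data.frame(wt = 3, gear = 4)
predict(m2, newdata = nd, type = "response")  # works, because m2 carries its data and variable names
# the same call with m1 is awkward, since m1 only knows the mtcars$-prefixed names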
