How do you exclude an interaction term in R's lm?

I'm working with a model from the Prestige dataset in the car package in R.
library(car)
library(carData)
data = na.omit(Prestige)
prestige = data$prestige
income = data$income
education = data$education
type = data$type
I'm trying to fit the model lm(prestige ~ income + education + type + income:type + education:type). For class I'm starting with the full model and working down to a smaller model, just backward selection. One of the least useful covariates according to p-value is the education:typeprof. How do I just delete that covariate from the model without taking out all the education:type interactions? In general how do you exclude interactions with factors? I saw an answer with the update function specifying which interaction to exclude but it didn't work in my case. Maybe I implemented it incorrectly.
fit4 = lm(prestige ~ income + education + type + income:type + education:type)
newfit = update(fit4, . ~ . - education:typeprof)
Unfortunately this didn't work for me.

There is a way to drop a single interaction term. Suppose you have the linear model
fullmodel = lm(prestige ~ income + education + type + income:type + education:type - 1)
You can call model.matrix on fullmodel, which gives you the X matrix (the design matrix) for your linear model. From there you can pick the column you'd like to drop and refit the model on the remaining columns.
X = model.matrix(fullmodel)
drop = which(colnames(X) == 'education:typeprof')
X1 = X[, -drop]
newfit = lm(prestige ~ X1 - 1)
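As a self-contained illustration of the column-dropping approach, here is a minimal sketch on simulated data standing in for Prestige. The data frame `d`, its values, and the names `full` and `reduced` are assumptions made for illustration and are not from the original post. Unlike the model above, this sketch keeps the intercept, so the design matrix contains an "(Intercept)" column.

```r
# Simulated stand-in for the Prestige data (hypothetical values, base R only)
set.seed(1)
n <- 90
d <- data.frame(
  income    = rnorm(n, mean = 7000, sd = 2000),
  education = rnorm(n, mean = 11, sd = 2.5),
  type      = factor(rep(c("bc", "prof", "wc"), each = n / 3))
)
d$prestige <- 10 + 0.002 * d$income + 3 * d$education + rnorm(n, sd = 5)

# Full model with both sets of interactions
full <- lm(prestige ~ income + education + type + income:type + education:type,
           data = d)

# Design matrix: each factor-level interaction is its own column
X <- model.matrix(full)

# Keep every column except the single interaction to be removed
X1 <- X[, colnames(X) != "education:typeprof", drop = FALSE]

# Refit on the reduced design matrix; "- 1" avoids adding a second intercept,
# because X1 already contains the "(Intercept)" column
reduced <- lm(d$prestige ~ X1 - 1)

length(coef(full))     # number of coefficients in the full model
length(coef(reduced))  # one fewer after dropping the column
```

The coefficient names of the refitted model carry an "X1" prefix. Note that with "bc" as the reference level, dropping only education:typeprof forces the "prof" group to share the education slope of the "bc" group, which changes how the remaining interaction is interpreted.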

Related

Random effects specification in gamlss in R

I would like to use the gamlss package to fit a model, because it offers more distributions. However, I am struggling to specify my random effects correctly, or at least I think there is a mistake: when I compare the output of an lmer model with a Gaussian distribution to that of a gamlss model with a Gaussian distribution, the output differs. When I compare an lm model without random effects to a gamlss model with a Gaussian distribution and without random effects, the output is similar.
Unfortunately, I cannot share my data to reproduce it.
Here is my code:
df <- subset.data.frame(GFW_food_agg, GFW_food_agg$fourC_area_perc < 200, select = c("ISO3", "Year", "Forest_loss_annual_perc_boxcox", "fourC_area_perc", "Pop_Dens_km2", "Pop_Growth_perc", "GDP_Capita_current_USD", "GDP_Capita_growth_perc",
"GDP_AgrForFis_percGDP", "Gini_2008_2018", "Arable_land_perc", "Forest_loss_annual_perc_previous_year", "Forest_extent_2000_perc"))
fourC <- lmer(Forest_loss_annual_perc_boxcox ~ fourC_area_perc + Pop_Dens_km2 + Pop_Growth_perc + GDP_Capita_current_USD +
GDP_Capita_growth_perc + GDP_AgrForFis_percGDP + Gini_2008_2018 + Arable_land_perc + Forest_extent_2000_perc + (1|ISO3) + (1|Year),
data = df)
summary(fourC)
resid_panel(fourC)
df <- subset.data.frame(GFW_food_agg, GFW_food_agg$fourC_area_perc < 200, select = c("ISO3", "Year", "Forest_loss_annual_perc_boxcox", "fourC_area_perc", "Pop_Dens_km2", "Pop_Growth_perc", "GDP_Capita_current_USD", "GDP_Capita_growth_perc",
"GDP_AgrForFis_percGDP", "Gini_2008_2018", "Arable_land_perc", "Forest_loss_annual_perc_previous_year", "Forest_extent_2000_perc"))
df <- na.omit(df)
df$ISO3 <- as.factor(df$ISO3)
df$Year <- as.factor(df$Year)
fourC <- gamlss(Forest_loss_annual_perc_boxcox ~ fourC_area_perc + Pop_Dens_km2 + Pop_Growth_perc + GDP_Capita_current_USD +
GDP_Capita_growth_perc + GDP_AgrForFis_percGDP + Gini_2008_2018 + Arable_land_perc + Forest_extent_2000_perc + random(ISO3) + random(Year),
data = df, family = NO, control = gamlss.control(n.cyc = 200))
summary(fourC)
plot(fourC)
How do the random effects need to be specified in gamlss to be similar to the random effects in lmer?
If I specify the random effects instead using
re(random = ~1|ISO3) + re(random = ~1|Year)
I get the following error:
Error in model.frame.default(formula = Forest_loss_annual_perc_boxcox ~ :
variable lengths differ (found for 're(random = ~1 | ISO3)')
I found the + re(random = ~1|x) specification to work fairly well with my GAMLSS models. Have you double-checked that the NAs are actually being removed from your dataset? na.omit does not always behave as expected.
Have a look at this thread, which has the same error as yours, but in a GAM. You can try that code to remove your NAs:
Error in model.frame.default: variable lengths differ
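One way to act on this advice is to count the remaining missing values explicitly and to restrict the data to complete rows of exactly the variables used in the model. The following is a minimal base R sketch on a made-up data frame; the names `df_raw`, `model_vars`, and `df_clean` are hypothetical.

```r
# Toy data with missing values in both used and unused columns (invented values)
df_raw <- data.frame(
  y      = c(1.2, NA, 3.1, 4.0, 2.2),
  x      = c(0.5, 0.7, NA, 1.1, 0.9),
  ISO3   = c("AAA", "BBB", "AAA", "BBB", "CCC"),
  unused = c(NA, 1, 2, 3, NA)
)

# Only the variables that actually enter the model
model_vars <- c("y", "x", "ISO3")

# Number of missing values per model variable
colSums(is.na(df_raw[, model_vars]))

# Keep rows that are complete on the model variables only
df_clean <- df_raw[complete.cases(df_raw[, model_vars]), model_vars]

# Convert the grouping variable to a factor after filtering
df_clean$ISO3 <- factor(df_clean$ISO3)

anyNA(df_clean)  # should be FALSE before the model is fitted
nrow(df_clean)   # number of rows that remain
```

Filtering on the model variables alone avoids losing rows because of missing values in columns the model never uses.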

Problems with Fixed effects panel data

I am trying to run a regression on panel data from the Michigan Consumers Survey. It is the first time I am using panel data in R, so I am not very familiar with the "plm" package that is needed. I am setting up my panel data for fixed effects on individuals (CASEID) and time (YYYY):
Michigan_panel <- pdata.frame(Michigan_survey, index = c("CASEID", "YYYY"))
Then I am using the following regression:
mod_1 <- plm(data = Michigan_panel, ICS ~ ICE + PX1Q2 + RATEX + ZLB + INCOME + AGE + EDUC + MARRY + SEX + AGE_sq, model = "within")
However R is showing me the following error:
> mod_1 <- plm(data = Michigan_panel, ICS ~ ICE + PX1Q2 + RATEX + ZLB + INCOME + AGE + EDUC + MARRY + SEX + AGE_sq, model = "within")
Error in plm.fit(data, model, effect, random.method, random.models, random.dfcor, :
empty model
Does anyone know what I am doing wrong?
Could you give a link to this specific survey? I found various datasets under this name.
I suspect (only suspect) that your data isn't panel data; please check the CASEID variable.
Changing the order of formula and data in plm won't solve your problem.
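A quick way to check whether the data really has a panel structure is to count how often each individual appears and whether each individual/year pair is unique. The sketch below uses a toy data frame with the same column names as in the question; the values and the names `toy`, `obs_per_id`, `share_single`, and `has_duplicates` are invented for illustration.

```r
# Toy data with the question's index columns (invented values)
toy <- data.frame(
  CASEID = c(1, 1, 2, 2, 3),
  YYYY   = c(2000, 2001, 2000, 2001, 2000)
)

# Number of observations per individual
obs_per_id <- table(toy$CASEID)
summary(as.vector(obs_per_id))

# Share of individuals observed only once
share_single <- mean(obs_per_id == 1)

# TRUE if any individual/year combination occurs more than once
has_duplicates <- any(duplicated(toy[, c("CASEID", "YYYY")]))

share_single
has_duplicates
```

If nearly every individual is observed only once, there is no within-individual variation left for a fixed effects model to use.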
I think the error comes from how you write the model. Your code is this:
mod_1 <- plm(data = Michigan_panel, ICS ~ ICE + PX1Q2 + RATEX + ZLB + INCOME + AGE + EDUC + MARRY + SEX + AGE_sq, model = "within")
In my view, you have to specify the indexes in the call and follow the argument order of the plm package. I would write your model as follows:
mod_1 <- plm(ICS ~ ICE + PX1Q2 + RATEX + ZLB + INCOME + AGE + EDUC + MARRY + SEX + AGE_sq,
data = Michigan_panel,
index= c("CASEID", "YYYY"),
model = "within")
1. Different Approach
From my knowledge we can also code this formula in a more elegant format.
library(plm)
Michigan_panel <- pdata.frame(Michigan_survey, index = c("CASEID", "YYYY"))
attach(Michigan_panel)
y <- cbind(ICS)
X <- cbind(ICE,PX1Q2,RATEX,ZLB,INCOME,AGE,EDUC,MARRY,SEX,AGE_sq)
model1 <- plm(y~X+factor(CASEID)+factor(YYYY), data=Michigan_panel, model="within")
summary(model1)
detach(Michigan_panel)
Adding factor(CASEID) and factor(YYYY) adds dummy variables for each individual and each year to your model.
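To illustrate how the "within" idea relates to dummy variables, the following base R sketch compares a regression on individual-demeaned data with a regression that includes one dummy per individual. It uses simulated data and lm rather than plm, and all names (`panel`, `lsdv`, `within_fit`) are hypothetical. For individual fixed effects the two approaches give the same slope.

```r
# Simulated balanced panel: 20 individuals, 5 periods (invented values)
set.seed(42)
n_id <- 20
n_t  <- 5
panel <- data.frame(
  id   = rep(1:n_id, each = n_t),
  year = rep(1:n_t, times = n_id)
)
alpha   <- rnorm(n_id)                          # individual effects
panel$x <- rnorm(n_id * n_t) + alpha[panel$id]  # regressor correlated with them
panel$y <- 2 * panel$x + alpha[panel$id] + rnorm(n_id * n_t)

# Dummy-variable (LSDV) regression: one dummy per individual
lsdv <- lm(y ~ x + factor(id), data = panel)

# Within transformation: subtract each individual's mean from x and y
panel$x_dm <- panel$x - ave(panel$x, panel$id)
panel$y_dm <- panel$y - ave(panel$y, panel$id)
within_fit <- lm(y_dm ~ x_dm - 1, data = panel)

coef(lsdv)["x"]           # slope from the dummy-variable regression
coef(within_fit)["x_dm"]  # slope from the demeaned regression
```

The slopes agree, but the standard errors reported by lm for the demeaned regression do not account for the degrees of freedom used up by the individual means.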

did modeling in R - right set up of data in staggered model

I would appreciate any insights into staggered DiD (difference-in-differences) models.
I wanted to ask whether I am using the correct function to set up the DiD model (data structure provided below):
did=time*treated
didreg = lm(y ~ time + treated + did + x + factor(year) + factor(firm), data = sample)
The data looks like:
I'm not familiar with difference-in-differences modelling, but from skimming the Wikipedia article it seems that what you want is a simple interaction. To fit that, you don't even need to calculate a new variable (did); you can specify it directly in the model. There are a couple of ways to specify that with R formula syntax:
# Simple main effects models, no interactions
main_mod <- lm(y ~ time + treated + x + factor(year) + factor(firm), data = sample)
# Model with the interaction effect explicitly specified
did_mod1 <- lm(y ~ time + treated + time:treated + x + factor(year) + factor(firm), data = sample)
# Model with shortened syntax for specifying interactions
did_mod2 <- lm(y ~ time * treated + x + factor(year) + factor(firm), data = sample)
did_mod1 and did_mod2 are identical, did_mod2 is just a more compact way of writing the same model. The * indicates that you want both the main effects and the interactions of the variables to the left and the right. It's recommended to always fit main effects when you fit interactions, so the second way of writing the model saves time & space.
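The equivalence can be checked on simulated data. The sketch below is a minimal illustration with invented values; it leaves out the factor(year) and factor(firm) terms to stay short, and the names `sim`, `m_manual`, `m_colon`, and `m_star` are hypothetical. It also shows that a precomputed product variable gives the same interaction estimate.

```r
# Simulated two-period, two-group data (invented values)
set.seed(123)
n <- 200
sim <- data.frame(
  time    = rep(c(0, 1), each = n / 2),   # 0 = before, 1 = after
  treated = rep(c(0, 1), times = n / 2),  # 0 = control, 1 = treated
  x       = rnorm(n)
)
sim$y <- 1 + 0.5 * sim$time + 0.3 * sim$treated +
  2 * sim$time * sim$treated + 0.7 * sim$x + rnorm(n)

# Manually computed interaction variable, as in the question
sim$did <- sim$time * sim$treated

m_manual <- lm(y ~ time + treated + did + x, data = sim)
m_colon  <- lm(y ~ time + treated + time:treated + x, data = sim)
m_star   <- lm(y ~ time * treated + x, data = sim)

coef(m_manual)["did"]
coef(m_colon)["time:treated"]
coef(m_star)["time:treated"]
```

All three calls estimate the same interaction coefficient; only the label of the coefficient differs in the manual version.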

How to get p-values for random effects in glmer

I want to analyze when the claims of a protest are directed at the state, based on action and country level characteristics, using glmer. So, I would like to obtain p-values of both the fixed and random effects. My model looks like this:
targets <- glmer(state ~ ENV + HLH + HRI + LAB + SMO + Capital +
(1 + rile + parties + rep + rep2 + gdppc + election| Country),
data = df, family = binomial)
The output only gives me the Variance & Std.Dev. of the random effects, as well as the correlations among them, which makes sense for most multilevel analyses but not for my purposes. Is there any way I can get something like the estimates and the p-values for the random effects?
If this cannot be done with R, is there any other statistical software that would give such an output?
UPDATE: Following the suggestions here, I have moved this question to Cross Validated: https://stats.stackexchange.com/questions/381208/r-how-to-get-estimates-and-p-values-for-random-effects-in-glmer
library(lme4)
library(lattice)
xyplot(incidence/size ~ period|herd, cbpp, type=c('g','p','l'),
layout=c(3,5), index.cond = function(x,y)max(y))
gm1 <- glmer(cbind(incidence, size - incidence) ~ period + (1 | herd),
data = cbpp, family = binomial)
summary(gm1)

How to fit frailty survival models in R

Because this is such a long question I've broken it down into 2 parts; the first being just the basic question and the second providing details of what I've attempted so far.
Question - Short
How do you fit an individual frailty survival model in R? In particular, I am trying to re-create the coefficient estimates and SEs in the table below, which were found by fitting a semi-parametric frailty model to this dataset (link). The model takes the form:
h_i(t) = z_i h_0(t) exp(\beta'X_i)
where z_i is the unknown frailty parameter for each patient, X_i is a vector of explanatory variables, \beta is the corresponding vector of coefficients and h_0(t) is the baseline hazard function. The explanatory variables are disease, gender, bmi & age (I have included code below to clean up the factor reference levels).
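To make the structure of this model concrete, the sketch below simulates survival times from it in base R. It rests on assumptions that are not part of the question: a constant baseline hazard `h0`, a single binary covariate `x`, gamma-distributed frailties with mean 1 and variance `theta`, and no censoring. All names and parameter values are hypothetical.

```r
set.seed(2024)
n     <- 500
theta <- 0.5   # assumed frailty variance
h0    <- 0.1   # assumed constant baseline hazard h_0(t)
beta  <- 0.8   # assumed coefficient of the covariate

# One binary explanatory variable per patient
x <- rbinom(n, size = 1, prob = 0.5)

# Individual frailties z_i: gamma with mean 1 and variance theta
z <- rgamma(n, shape = 1 / theta, rate = 1 / theta)

# Individual hazard h_i(t) = z_i * h_0(t) * exp(beta * x_i); constant in t here
hazard <- z * h0 * exp(beta * x)

# With a constant hazard, survival times are exponential with that rate
surv_time <- rexp(n, rate = hazard)

summary(z)
summary(surv_time)
```

Data simulated this way can serve as a test case with known parameters when comparing how different packages estimate the frailty variance and the coefficients.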
Question - Long
I am attempting to follow and re-create the Modelling Survival Data in Medical Research textbook example for fitting frailty models. In particular, I am focusing on the semi-parametric model, for which the textbook provides parameter and variance estimates for the normal Cox model, lognormal frailty and gamma frailty, which are shown in the above table.
I am able to recreate the no frailty model estimates using
library(dplyr)
library(survival)
dat <- read.table(
"./Survival of patients registered for a lung transplant.dat",
header = TRUE
) %>%
as_tibble() %>%
mutate( disease = factor(disease, levels = c(3,1,2,4))) %>%
mutate( gender = factor(gender, levels = c(2,1)))
mod_cox <- coxph( Surv(time, status) ~ age + gender + bmi + disease ,data = dat)
mod_cox
however I am really struggling to find a package that can reliably re-create the results of the second 2 columns. Searching online I found this table which attempts to summarise the available packages:
source
Below I have posted my current findings as well as the code I've used, in case it helps someone identify whether I have simply specified the functions incorrectly:
frailtyEM - Seems to work the best for gamma however doesn't offer log-normal models
frailtyEM::emfrail(
Surv(time, status) ~ age + gender + bmi + disease + cluster(patient),
data = dat ,
distribution = frailtyEM::emfrail_dist(dist = "gamma")
)
survival - Gives warnings on the gamma, and from everything I've read it seems that its frailty functionality is classed as deprecated, with the recommendation to use coxme instead.
coxph(
Surv(time, status) ~ age + gender + bmi + disease + frailty.gamma(patient),
data = dat
)
coxph(
Surv(time, status) ~ age + gender + bmi + disease + frailty.gaussian(patient),
data = dat
)
coxme - Seems to work but provides different estimates to those in the table and doesn't support gamma distribution
coxme::coxme(
Surv(time, status) ~ age + gender + bmi + disease + (1|patient),
data = dat
)
frailtySurv - I couldn't get this to work properly; it seemed to always fit the variance parameter with a flat value of 1 and provide coefficient estimates as if a no-frailty model had been fitted. Additionally, the documentation doesn't state which strings are supported for the frailty argument, so I couldn't work out how to get it to fit a log-normal.
frailtySurv::fitfrail(
Surv(time, status) ~ age + gender + bmi + disease + cluster(patient),
dat = dat,
frailty = "gamma"
)
frailtyHL - Produces warning messages saying "did not converge"; it still produced coefficient estimates, but they were different from the textbook's.
mod_n <- frailtyHL::frailtyHL(
Surv(time, status) ~ age + gender + bmi + disease + (1|patient),
data = dat,
RandDist = "Normal"
)
mod_g <- frailtyHL::frailtyHL(
Surv(time, status) ~ age + gender + bmi + disease + (1|patient),
data = dat,
RandDist = "Gamma"
)
frailtypack - I simply don't understand the implementation (or at least it's very different from what is taught in the textbook). The function requires the specification of knots and a smoother, which seem to greatly impact the resulting estimates.
parfm - Only fits parametric models; having said that, every time I tried to use it to fit a Weibull proportional hazards model it just errored.
phmm - Have not yet tried
I fully appreciate, given the large number of packages that I've gone through unsuccessfully, that it is highly likely that the problem is me not properly understanding the implementation and misusing the packages. Any help or examples on how to successfully re-create the above estimates would be greatly appreciated.
Regarding
I am really struggling to find a package that can reliably re-create the results of the second 2 columns.
See the Survival Analysis CRAN task view under Random Effect Models or do a search on R Site Search on e.g., "survival frailty".
