fixed effects in R: plm vs lm + factor() - r

I'm trying to run a fixed effects regression model in R. I want to control for heterogeneity in variables C and D (neither is a time variable).
I tried the following two approaches:
1) Use the plm package: Gives me the following error message
formula = Y ~ A + B + C + D
reg = plm(formula, data= data, index=c('C','D'), method = 'within')
duplicate couples (time-id)Error in pdim.default(index[[1]], index[[2]]) :
I also tried creating first a panel using
data_p = pdata.frame(data,index=c('C','D'))
But I have repeated observations in both columns.
2) Use factor() and lm: works well
formula = Y ~ A + B + factor(C) + factor(D)
reg = lm(formula, data= data)
What is the difference between the two methods? Why is plm not working for me? Is it because one of the indices should be time?

That error says that variables C and D form repeated id-time pairs, so together they do not identify each observation uniquely.
Suppose you have a third variable F which, jointly with C, distinguishes each individual from the others (or defines your first dimension, whatever it is). Then with dplyr you can create a unique index, say id:
data$id <- data %>% group_indices(C, F)
The index argument in plm then becomes index = c("id", "D").
The lm + factor() approach is a solution only if you have distinct observations. If this is not the case, it will not properly weight the results within each id; that is, the fixed effect is not properly identified.
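To check whether the chosen index pair is really the problem, it can help to count the repeated pairs before and after building the combined id. The following is only a sketch on made-up data: the data frame toy and its values are hypothetical, and interaction() is used as a base-R stand-in for dplyr's group_indices(), not as the method from the answer above.

```r
# Toy data (made up for illustration): (C, D) pairs repeat, F disambiguates them.
toy <- data.frame(
  C = c(1, 1, 1, 1, 2, 2),
  F = c("a", "a", "b", "b", "a", "a"),
  D = c(1, 2, 1, 2, 1, 2),
  Y = c(3.1, 2.9, 4.0, 4.2, 5.5, 5.1)
)

# Rows that repeat an earlier (C, D) pair; any count above 0 triggers the plm error.
n_dup_before <- sum(duplicated(toy[c("C", "D")]))

# One integer per observed (C, F) combination, used as the individual index.
toy$id <- as.integer(interaction(toy$C, toy$F, drop = TRUE))

# With the combined id, every (id, D) pair should occur only once.
n_dup_after <- sum(duplicated(toy[c("id", "D")]))

print(c(before = n_dup_before, after = n_dup_after))
```

If the second count is still above zero, the pair (id, D) does not identify rows uniquely either, and plm would be expected to complain again.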

Related

Reporting interactions in a linear model in r

I am trying to report all the interactions in a linear model that reads:
mod1.lme <- lm(volume ~ Group * Treatment + Group + Treatment, data = df)
Group is a factor variable with 3 levels: A, B and C.
The result that I currently get is for (I made up the data):
These two estimates are in reference to Treatment:A, but I would like to see each effect independently. So the output that I would like to get is:
Treatment:A
Treatment:B
Treatment:C
If I eliminate the intercept adding -1 at the end I get:
What is the best way to code this?
Thanks
You are seeing this output because one of the factor levels of Treatment becomes the reference level. When interpreting the model, the coefficients become "the difference in effect from the reference level". This is necessary as long as the model includes an intercept, so the only way to get the interpretation with all coefficients shown is to remove the intercept, as shown below.
mod1.lme <- lm(volume ~ Group * Treatment - 1, data = df)
Edit:
To change the name of the interaction effect, one would have to edit the name manually
sum.lm <- summary(mod1.lme)
rownames(sum.lm$coefficients) <- c("groupA","groupB","groupC", "groupA:Treatment", "groupB:Treatment", "groupC:Treatment")
or alternatively use another package for summaries such as sjPlot
library(sjPlot)
tab_model(mod1.lme, pred.labels = c("groupA","groupB","groupC", "groupA:Treatment", "groupB:Treatment", "groupC:Treatment"))
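Before renaming rows by hand, it may be worth checking which coefficients each formula actually produces. The sketch below uses made-up data and assumes Treatment is a numeric 0/1 variable (the question does not say). The nested formula Group + Group:Treatment - 1 is a variant that is not part of the answer above; it is shown only because it appears to give one slope per group directly.

```r
set.seed(1)
# Hypothetical data: three groups of ten, Treatment coded 0/1.
df <- data.frame(
  Group     = factor(rep(c("A", "B", "C"), each = 10)),
  Treatment = rep(c(0, 1), times = 15),
  volume    = rnorm(30)
)

with_int <- lm(volume ~ Group * Treatment, data = df)               # intercept, A is the reference
no_int   <- lm(volume ~ Group * Treatment - 1, data = df)           # intercept removed
nested   <- lm(volume ~ Group + Group:Treatment - 1, data = df)     # one slope per group

# Compare the coefficient names produced by the three parameterisations.
print(names(coef(with_int)))
print(names(coef(no_int)))
print(names(coef(nested)))
```

All three models have six coefficients and the same fitted values; only the parameterisation differs, so the choice is about which numbers are convenient to report.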

Why is R treating this data.frame object as a list?

I am trying to run a linear discriminant analysis (lda()) on a data.frame I created by dividing several variables by an additional scaling variable (not shown here), using the MASS package in R. Below is a sample dataset and a sample version of the code I am using that reproduces the error.
class Var1 Var2 Var3 Var4
2 0.732459522 0.973014649 0.612952968 0.127216654
3 0.76692254 0.990230286 0.629448709 0.104675506
2 0.847487002 1.021663778 0.649046794 0.187175043
3 0.823583181 1.050274223 0.673674589 0.170018282
1 0.796279894 1.058458813 0.583702391 0.222320638
2 0.925681255 1.009909166 0.636663914 0.205615194
2 0.627334465 1.074702886 0.59762309 0.23344652
3 0.980376124 1.011447261 0.646770237 0.232215863
3 0.79342723 1.048826291 0.750234742 0.248826291
1 0.960655738 1.042622951 0.6 0.262295082
2 0.963788301 1.005571031 0.590529248 0.233983287
1 1.013157895 1.049342105 0.657894737 0.223684211
2 1.211538462 1.060897436 0.733974359 0.288461538
3 1.25083612 1.023411371 0.759197324 0.311036789
3 0.959196485 1.009416196 0.635907094 0.12868801
1 0.823681936 1.005185825 0.590319793 0.219533276
2 0.777508091 0.998381877 0.624595469 0.165048544
3 0.749114103 0.985825656 0.585400425 0.133947555
1 0.816999133 1.036426713 0.604509974 0.197745013
data<-read.csv("data.csv",header=TRUE)
data_train<-na.omit(data)
scores_train<-data_train[-c(1)]
lda_train<-lda(data_train$class~scores_train,prior = c(1,1,1)/3,CV=TRUE)
scores_test<-data[-c(1)]
lda_test<-predict(lda_train,as.data.frame(scores_test),prior = c(1,1,1)/3)
lda_train<-lda(data_train$class~as.matrix(scores_train),prior = c(1,1,1)/3,CV=TRUE)
class(scores_train)
class(scores_test)
When I try to perform the lda using the dataset, I get the following error message.
Error in model.frame.default(formula = data_train$class ~ scores_train) :
invalid type (list) for variable 'scores_train'
I am able to get the data to work by coercing it into a matrix with as.matrix(). Notably, trying to do something similar using as.data.frame() or data.frame() does not work. However, when I then try to apply the resulting discriminant function to the total dataset, I get the following message:
Error in UseMethod("predict") :
no applicable method for 'predict' applied to an object of class "list"
However, when I check the class of the objects using class(), it says both objects are data.frames. I checked the dataset to see if there were any incomplete rows or columns that could cause R to treat them as a series of lists instead of a single data.frame, but there are no missing values. Similarly, it does not appear to be due to the names of any variables.
I am not sure why R is treating the object as a list instead of a data.frame (and thereby causing the linear discriminant analysis to fail), especially as it recognizes that the objects are of class data.frame.
For lda, you have to provide the formula, so the code below works if you provide a data frame:
lda_train<-lda(class ~ .,data=data_train,prior = c(1,1,1)/3,CV=TRUE)
Otherwise, if you don't provide the formula, do:
lda(grouping=data_train$class,x=data_train[,-1],prior = c(1,1,1)/3, CV=TRUE)
When you use CV=TRUE, it uses leave-one-out cross validation to give you the posterior, but unfortunately it is not able to retain the model, and you can see it:
class(lda_train)
[1] "list"
To predict, you need to train with CV=FALSE. You then provide a data.frame or matrix that has the same columns as those used for training; in your case it will be:
lda_train<-lda(class ~ .,data=data_train,prior = c(1,1,1)/3)
data_test=data.frame(Var1=rnorm(10),Var2=rnorm(10),
Var3=rnorm(10),Var4=rnorm(10))
predict(lda_train,data_test)
For lda from MASS, there is no hyper-parameter to be obtained from training, so maybe you want to elaborate on why you need the cross-validation?
In case you want to explore it, here's how you can run cross-validation for lda with the caret package (note the use of lda2):
library(caret)
data_train$class <- factor(data_train$class)
lda_train = train(class ~ .,data=data_train,method="lda2",
trControl = trainControl(method = "cv"))
predict(lda_train,data_test)
The formula argument is looking for a structured formula declaring how the variables relate. Each variable named must be a vector. You can pass all the names in the same dataframe whilst declaring the data argument:
lda(class ~ Var1 + Var2 + Var3 + Var4,
data = data, prior = c(1,1,1)/3, CV=TRUE)
Or pass the columns separately:
lda(data$class ~ scores_train$Var1 +
scores_train$Var2 +
scores_train$Var3 +
scores_train$Var4,
prior = c(1,1,1)/3, CV=TRUE)
For the problem of predict not accepting it as an object, you need to change CV to FALSE, otherwise it only returns a list (not a lda object which predict needs):
model <- lda(data$class ~ scores_train$Var1 +
scores_train$Var2 +
scores_train$Var3 +
scores_train$Var4,
prior = c(1,1,1)/3, CV=FALSE)
predict(model)
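The "invalid type (list)" message can be reproduced without MASS, since it comes from building the model frame. The sketch below is only an illustration with a few made-up rows; the names scores and cls are hypothetical. It shows that a data frame is itself a list of columns, that using it as a single formula term fails, and that a matrix term is accepted.

```r
# Hypothetical miniature of the question's data.
scores <- data.frame(Var1 = c(0.73, 0.77, 0.85), Var2 = c(0.97, 0.99, 1.02))
cls    <- factor(c(2, 3, 2))

# A data frame is a list of columns, which is why the error message says "list".
print(is.list(scores))

# Using the whole data frame as one term fails while the model frame is built.
err <- tryCatch(
  {
    model.frame(cls ~ scores)
    NA_character_                      # only reached if no error was raised
  },
  error = function(e) conditionMessage(e)
)
print(err)

# A matrix is accepted as a single term, matching the as.matrix() workaround.
mf <- model.frame(cls ~ as.matrix(scores))
print(dim(mf))
```

This is why class() reporting data.frame and the error reporting list are not in conflict: both describe the same object.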

Plot how the estimated survival depends upon the value of a covariate of interest. Problems with relevel

I want to plot how the estimated survival from a Cox model depends upon the value of a covariate of interest, while the rest of the variables are fixed to their average values (if they are continuous variables) or their lowest values (for dummies). Following this example http://www.sthda.com/english/wiki/cox-proportional-hazards-model , I have constructed a new data frame with three rows, one for each value of my variable of interest; the other covariates are fixed. Among these covariates I have two factor vectors. I created the new dataset, and it is later passed to survfit() via the newdata argument.
When I pass the data frame to survfit(), I get the following error message: Error in relevel.default(occupation) : 'relevel' only for factors. What is the source of the problem? If it is related to the factor vectors, how can I solve it? Below is an example of the code. Unfortunately, I cannot share the data or find a dataset that produces the same error message:
I have transformed the factor variables into integer vectors in the Cox model and in the new dataset. It did not work.
I have deleted all the factor variables, and then it works.
I have tried to implement this strategy, but it did not work: Plotting predicted survival curves for continuous covariates in ggplot
fit <- coxph(Surv(entry, exit, event == 1) ~ status_plot +
exp_national + relevel(occupation, 5) + age + gender + EDUCATION , data = data)
data_rank <- with(data,
data.frame(status_plot = c(1,2,3), # factor vector of interest
exp_national=rep(mean(exp_national, na.rm = TRUE), 3),
occupation = c(5,5,5), # factor with 6 categories, number 5 is the category of reference in the cox model
age=rep(mean(age, na.rm = TRUE), 3),
gender = c(1,1,1),
EDUCATION=rep(mean(EDUCATION, na.rm = TRUE), 3) ))
surv.fin <- survfit(fit, newdata=data_rank) # this produces the error
Looking at the code, it appears you probably attempted to take the mean of a factor, so please post at least str(data) as an edit to the body of your question. You should also realize that you can give a single value to a column in a data.frame call and have it recycled to the correct length, so all the means could be entered as single items rather than using rep().
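Since the data is not available, the following is only a guess at one possible cause, sketched in base R with hypothetical values: if occupation is numeric in the new data frame, relevel() refuses it, which would produce this kind of error. The sketch also shows the recycling of single values mentioned above. The six levels 1 to 6 are an assumption taken from the comment in the question's code.

```r
# Hypothetical new data frame; the single values for age and occupation
# are recycled to match the three rows of status_plot.
newdat <- data.frame(status_plot = c(1, 2, 3), age = 40, occupation = 5)
print(nrow(newdat))

# relevel() only accepts factors, so a numeric column raises an error.
msg <- tryCatch(
  {
    relevel(newdat$occupation, 5)
    NA_character_                      # only reached if no error was raised
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# Rebuilding the column as a factor with all six levels lets relevel() work;
# a numeric ref of 5 picks the fifth level, here "5".
newdat$occupation <- factor(newdat$occupation, levels = 1:6)
occ <- relevel(newdat$occupation, 5)
print(levels(occ)[1])
```

If this is the cause, the factor columns in newdata would need the same levels as in the data used to fit the model; str(data) and str(data_rank) side by side should show whether that holds.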

Discrepancy emmeans in R (using ezAnova) vs estimated marginal means in SPSS

So this is a bit of a hail mary, but I'm hoping someone here has encountered this before. I recently switched from SPSS to R, and I'm now trying to do a mixed-model ANOVA. Since I'm not confident in my R skills yet, I use the exact same dataset in SPSS to compare my results.
I have a dataset with
dv = RT
within = Session (2 levels), Cue (3 levels), Flanker (2 levels)
between = Group(3 levels).
no covariates.
unequal number of participants per group level (25,25,23)
In R I'm using ezANOVA from the ez package to do the mixed-model ANOVA:
results <- ezANOVA(
data = ant_rt_correct
, wid = subject
, dv = rt
, between = group
, within = .(session, cue, flanker)
, detailed = T
, type = 3
, return_aov = T
)
In SPSS I use the following GLM:
GLM rt.1.center.congruent rt.1.center.incongruent rt.1.no.congruent rt.1.no.incongruent
rt.1.spatial.congruent rt.1.spatial.incongruent rt.2.center.congruent rt.2.center.incongruent
rt.2.no.congruent rt.2.no.incongruent rt.2.spatial.congruent rt.2.spatial.incongruent BY group
/WSFACTOR=session 2 Polynomial cue 3 Polynomial flanker 2 Polynomial
/METHOD=SSTYPE(3)
/EMMEANS=TABLES(group*session*cue*flanker)
/PRINT=DESCRIPTIVE
/CRITERIA=ALPHA(.05)
/WSDESIGN=session cue flanker session*cue session*flanker cue*flanker session*cue*flanker
/DESIGN=group.
The results of which line up great, ie:
R: Session F(1,70) = 46.123 p = .000
SPSS: Session F(1,70) = 46.123 p = .000
I also ask for the means per cell using:
descMeans <- ezStats(
data = ant_rt_correct
, wid = subject
, dv = rt
, between = group
, within = .(session, cue, flanker) #,cue,flanker)
, within_full = .(location,direction)
, type = 3
)
Which again line up perfectly with the descriptives from SPSS, e.g. for the cell:
Group(1) - Session(1) - Cue(center) - Flanker(1)
R: M = 484.22
SPSS: M = 484.22
However, when I try to get to the estimated marginal means, using the emmeans package:
eMeans <- emmeans(results$aov, ~ group | session | cue | flanker)
I run into discrepancies compared to the Estimated Marginal Means table from the SPSS GLM output (for the same interactions), e.g.:
Group(1) - Session(1) - Cue(center) - Flanker(1)
R: M = 522.5643
SPSS: M = 484.22
It's been my understanding that the estimated marginal means should be the same as the descriptive means in this case, as I have not included any covariates. Am I mistaken in this? And if so, how come the two give different results?
Since the group sizes are unbalanced, I also redid the analyses above after making the groups of equal size. In that case the emmeans became:
Group(1) - Session(1) - Cue(center) - Flanker(1)
R: M =521.2954
SPSS: M = 482.426
So even with equal group sizes in both conditions, I end up with quite different means. Keep in mind that the rest of the statistics and the descriptive means áre equal between SPSS and R. What am I missing... ?
Thanks!
EDIT:
The plot thickens.. If I perform the ANOVA using the AFEX package:
results <- aov_ez(
"subject"
,"rt"
,ant_rt_correct
,between=c("group")
,within=c("session", "cue", "flanker")
)
and then take the emmeans again:
eMeans <- emmeans(results, ~ group | session | cue | flanker)
I suddenly get values much closer to that of SPSS (and the descriptive means)
Group(1) - Session(1) - Cue(center) - Flanker(1)
R: M = 484.08
SPSS: M = 484.22
So perhaps ezANOVA is doing something fishy somewhere?
I suggest you try this:
library(lme4) ### I'm guessing you need to install this package first
mod <- lmer(rt ~ session + cue + flanker + (1|group),
data = ant_rt_correct)
library(emmeans)
emm <- emmeans(mod, ~ session * cue * flanker)
pairs(emm, by = c("cue", "flanker"))     # simple comparisons for session
pairs(emm, by = c("session", "flanker")) # simple comparisons for cue
pairs(emm, by = c("session", "cue"))     # simple comparisons for flanker
This fits a mixed model with random intercepts for each group. It uses REML estimation, which is likely to be what SPSS uses.
In contrast, ezANOVA fits a fixed-effects model (no within factor at all), and aov_ez uses the aov function which produces an analysis that ignores the inter-block effects. Those make a difference especially with unbalanced data.
An alternative is to use afex::mixed, which in fact uses lme4::lmer to fit the model.
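On the question's expectation that estimated means should match the descriptive cell means when there are no covariates: for a model that contains all interactions of the factors, the fitted cell values do reproduce the observed cell means, even with unbalanced cells. The sketch below checks this with plain lm() on made-up data. It deliberately ignores the repeated-measures structure and uses neither ez, afex, nor emmeans, so it illustrates the general principle only, not what those packages do internally.

```r
set.seed(42)
# Hypothetical data: 3 groups x 3 cues, four observations per cell.
toy <- expand.grid(
  rep   = 1:4,
  group = c("g1", "g2", "g3"),
  cue   = c("center", "no", "spatial")
)
toy$rt <- rnorm(nrow(toy), mean = 500, sd = 30)
toy <- toy[-1, ]                         # drop one row so the cells are unbalanced

# Descriptive cell means.
cell_means <- aggregate(rt ~ group + cue, data = toy, FUN = mean)

# Model with the full interaction: one free parameter per cell.
fit <- lm(rt ~ group * cue, data = toy)
cell_means$fitted <- predict(fit, newdata = cell_means)

# Largest difference between descriptive and fitted cell means.
max_gap <- max(abs(cell_means$rt - cell_means$fitted))
print(max_gap)
```

A large gap between descriptive and estimated cell means, as in the question, therefore suggests that the object handed to emmeans does not correspond to the full-interaction model of the cells.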

R: How to make column of predictions for logistic regression model?

So I have a data set called x. The contents are simple enough to just write out so I'll just outline it here:
the dependent variable, Report, in the first column is binary yes/no (0 = no, 1 = yes)
the subsequent 3 columns are all categorical variables (race.f, sex.f, gender.f) that have all been converted to factors, and they're designated by numbers (e.g. 1= white, 2 = black, etc.)
I have run a logistic regression on x as follows:
glm <- glm(Report ~ race.f + sex.f + gender.f, data=x,
family = binomial(link="logit"))
And I can check the fitted probabilities by looking at summary(glm$fitted).
My question: How do I create a fifth column on the right side of this data set x that will include the predictions (i.e. fitted probabilities) for Report? Of course, I could just insert the glm$fitted as a column, but I'd like to try to write a code that predicts it based on whatever is in the race, sex, gender columns for a more generalized use.
Right now I have the following code, which I hope will create a predicted column as well as lower and upper bounds for the confidence interval.
xnew <- cbind(xnew, predict(glm5, newdata = xnew, type = "link", se = TRUE))
xnew <- within(xnew, {
PredictedProb <- plogis(fit)
LL <- plogis(fit - (1.96 * se.fit))
UL <- plogis(fit + (1.96 * se.fit))
})
Unfortunately I get the error:
Error in eval(expr, envir, enclos) : object 'race.f' not found
after the cbind code.
Anyone have any idea?
There appear to be a few typos in your code. First, xnew calls on glm5, but your model, as far as I can see, is glm (by the way, using glm as the name of your output is probably not a good idea). Second, make sure the variable race.f is actually in the dataset you wish to predict from. My guess is that R can't find that variable, hence the error.
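For reference, here is a minimal sketch of the whole workflow on simulated data, with the model stored under a name that does not mask glm(). The data, the name fit_logit, and the choice of two predictors are all made up for illustration; the 1.96 multiplier is the usual normal approximation for a 95% interval, as in the question.

```r
set.seed(7)
# Hypothetical data with the same kind of columns as in the question.
x <- data.frame(
  Report = rbinom(200, 1, 0.4),
  race.f = factor(sample(1:3, 200, replace = TRUE)),
  sex.f  = factor(sample(1:2, 200, replace = TRUE))
)

fit_logit <- glm(Report ~ race.f + sex.f, data = x,
                 family = binomial(link = "logit"))

# Predict on the link (log-odds) scale; newdata must contain race.f and sex.f.
pred <- predict(fit_logit, newdata = x, type = "link", se.fit = TRUE)

# Back-transform the point estimate and the interval bounds to probabilities.
x$PredictedProb <- plogis(pred$fit)
x$LL <- plogis(pred$fit - 1.96 * pred$se.fit)
x$UL <- plogis(pred$fit + 1.96 * pred$se.fit)

print(head(x))
```

Building the interval on the link scale and then applying plogis() keeps both bounds between 0 and 1.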
