How to run Latent Class Growth Modelling (LCGM) with a multinomial response variable in R (using the flexmix package)?
And how to stratify each class by a binary/categorical dependent variable?
The idea is to let gender shape the growth curve by cluster (cf. Mikolai and Lyons-Amos (2017, p. 194/3), where the stratification is done by education; they used Mplus).
I think I might have come close with the following syntax:
lcgm_formula <- as.formula(rel_stat ~ age + I(age^2) + gender + gender:age)
lcgm <- flexmix::stepFlexmix(. ~ . | id,
                             data = d,
                             k = nr_of_classes, # would be 1:12 in real analysis
                             nrep = 1,          # would be 50 in real analysis to avoid local maxima
                             control = list(iter.max = 500, minprior = 0),
                             model = flexmix::FLXMRmultinom(lcgm_formula, varFix = TRUE, fixed = ~0))
This is close to what Wardenaar (2020, p. 10) suggests in his methodological paper for a continuous outcome:
stepFlexmix(. ~ . | ID, k = 1:4, nrep = 50,
            model = FLXMRglmfix(y ~ time, varFix = TRUE),
            data = mydata, control = list(iter.max = 500, minprior = 0))
The only difference is that FLXMRmultinom probably does not support the varFix and fixed arguments, although adding them does produce different results. The binomial equivalent of FLXMRmultinom in flexmix might be FLXMRglm (with family = "binomial") as opposed to FLXMRglmfix, so I suspect that the restrictions of the LCGM (e.g. fixed slope and intercept per class) are not specified the way they should be.
The results are otherwise sensible, but the model fails to put men and women with similar trajectories in the same classes (below are the fitted probabilities for each relationship status in each class, by gender):
We should have the following matches by cluster and gender...
1<->1
2<->2
3<->3
...but instead we have
1<->3
2<->1
3<->2
That is, if, for example, the men in class one and the women in class three were forced into the same group, the resulting group would be more homogeneous than the current first row of the plot grid.
Here is the full MVE to reproduce the issue.
I got similar results with another dataset, with a different number of classes and up to 50 iterations per class. I have tried two alternative ways to predict the probabilities, with identical results. I conclude that the problem is most likely in the model specification (stepFlexmix(..., model = FLXMRmultinom(...))), or that this is some sort of label-switching issue.
If the model is specified correctly and the issue is that similar trajectories for men/women end up in different classes, is there a way to fix that, for example by restricting the parameters?
Any assistance will be highly appreciated.
This seems to be an identifiability issue, apparently common in mixture modelling. In other words, the labels are switched: while there may not be a problem with the modelling as such, men and women end up in different groups, and that has to be dealt with one way or another.
In the newly linked code, I have swapped the order manually and calculated the predictions by hand.
I will be happy to hear from anyone with an alternative approach to the label-switching issue (such as restricting parameters or switching labels algorithmically). I am also curious whether the model could or should be specified in some other way.
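For reference, here is a sketch of one algorithmic alternative to manual swapping, assuming pred_m and pred_f are class-by-profile matrices of the by-hand fitted probabilities for men and women (one row per class); the Hungarian algorithm from the clue package then finds the minimal-cost one-to-one matching between male and female classes:
library(clue)

# pairwise distances between all class profiles, then take the
# male-by-female block as the cost matrix
d_all <- as.matrix(dist(rbind(pred_m, pred_f)))
cost  <- d_all[seq_len(nrow(pred_m)), nrow(pred_m) + seq_len(nrow(pred_f))]

# minimal-cost one-to-one matching of male classes to female classes
matching <- solve_LSAP(cost)
cbind(male_class = seq_along(matching), female_class = as.integer(matching))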
A few remarks:
I believe this is indeed performing an LCGM, as we do not specify random effects for the slopes or intercepts. I therefore assume that intercepts and slopes are fixed within classes for both sexes, which means the model performs LCGM as intended. By the same token, it seems that running a GMM with a random intercept, a random slope, or both is not possible.
Since we are calculating the predictions by hand, we need to be able to separate the parameters between the sexes. I therefore also added a gender x age^2 interaction term. The calculations seem to slow down somewhat, but the estimates are similar to the original. It also makes conceptual sense to include the interaction for age^2 if we already have it for age.
varFix = TRUE and fixed = ~0 seem to be redundant: specifying them does not change anything. The subsampling procedure (of my real data) was unaffected by the set.seed() command for some reason.
The new model specification becomes:
lcgm_formula <- as.formula(rel_stat ~ age + I(age^2) + gender + age:gender + I(age^2):gender)
lcgm <- flexmix::flexmix(. ~ . | id,
                         data = d,
                         k = nr_of_classes, # would be 1:12 in real analysis
                         # nrep = 1, # would be 50 in real analysis to avoid local maxima
                         #           # (and we would use the stepFlexmix function instead)
                         control = list(iter.max = 500, minprior = 0),
                         model = flexmix::FLXMRmultinom(lcgm_formula))
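For what it's worth, a minimal sketch of how the fitted object can be inspected with the standard flexmix accessors (lcgm is the model fitted above):
# class-specific multinomial coefficients
flexmix::parameters(lcgm)
# hard class assignments and class sizes
table(flexmix::clusters(lcgm))
# posterior class-membership probabilities per observation
head(flexmix::posterior(lcgm))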
And the plots:
Long story short:
I need to run a multinomial logit regression with both individual and time fixed effects in R.
I thought I could use the packages mlogit and survival for this purpose, but I cannot find a way to include fixed effects.
Now the long story:
I have found many questions on this topic on various Stack-related websites, but none of them provides an answer. I have also noticed a lot of confusion about what a multinomial logit regression with fixed effects is (people use different names for it) and about the R packages implementing it.
So I think it would be beneficial to provide some background before getting to the point.
Consider the following.
In a multiple-choice question, each respondent takes one choice.
Respondents are asked the same question every year. There is no a priori assumption about the extent to which the choice at time t is affected by the choice at t-1.
Now imagine a panel dataset recording these choices. The data would look like this:
set.seed(123)
# number of observations
n <- 100
# possible choices
possible_choice <- letters[1:4]
# number of years
years <- 3
# individual characteristics
x1 <- runif(n * 3, 5.0, 70.5)
x2 <- sample(1:n^2, n * 3, replace = FALSE)
# actual choice in each year
actual_choice_year_1 <- possible_choice[sample(1:4, n, replace = TRUE, prob = rep(1/4, 4))]
actual_choice_year_2 <- possible_choice[sample(1:4, n, replace = TRUE, prob = c(0.4, 0.3, 0.2, 0.1))]
actual_choice_year_3 <- possible_choice[sample(1:4, n, replace = TRUE, prob = c(0.2, 0.5, 0.2, 0.1))]
# create long dataset
df <- data.frame(choice = c(actual_choice_year_1, actual_choice_year_2, actual_choice_year_3),
                 x1 = x1, x2 = x2,
                 individual_fixed_effect = as.character(rep(1:n, years)),
                 time_fixed_effect = as.character(rep(1:years, each = n)),
                 stringsAsFactors = FALSE)
I am new to this kind of analysis, but if I understand correctly, to estimate the effects of respondents' characteristics on their choices I may use a multinomial logit regression.
In order to take advantage of the longitudinal structure of the data, I want to include in my specification individual and time fixed effects.
To the best of my knowledge, the multinomial logit regression with fixed effects was first proposed by Chamberlain (1980, Review of Economic Studies 47: 225–238). Recently, Stata users have been provided with the routines to implement this model (femlogit).
In the vignette of the femlogit package, the author refers to the R function clogit, in the survival package.
According to the help page, clogit requires data to be rearranged in a different format:
library(mlogit)
# reshape the wide dataset to long format (one row per alternative)
data_mlogit <- mlogit.data(df, id.var = "individual_fixed_effect",
                           group.var = "time_fixed_effect",
                           choice = "choice",
                           shape = "wide")
Now, if I understand correctly how clogit works, fixed effects can be passed through the strata function (see this tutorial for additional details). However, I am afraid it is not clear to me how to use this function, as no coefficient values are returned for the individual characteristic variables (i.e., I get only NAs).
library(survival)
fit <- clogit(choice ~ alt + x1 + x2 + strata(individual_fixed_effect, time_fixed_effect),
              data = as.data.frame(data_mlogit))
summary(fit)
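(A possible explanation, though I am not certain of it: within each stratum defined by individual and time, the case-specific covariates x1 and x2 are constant across the alternatives, so they drop out of the conditional likelihood, which would produce the NAs. A hedged sketch of the usual workaround, interacting them with the alternative indicator alt:)
# hypothetical variant: alternative-specific coefficients for the
# case-specific covariates x1 and x2
fit2 <- clogit(choice ~ alt + alt:x1 + alt:x2 +
                 strata(individual_fixed_effect, time_fixed_effect),
               data = as.data.frame(data_mlogit))
summary(fit2)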
Since I was not able to confirm a reason for this (there must be something I am missing about the way these functions are estimated), I have looked for a solution using other packages in R: e.g., glmnet, VGAM, nnet, globaltest, and mlogit.
Only the latter seems able to deal explicitly with panel structures using an appropriate estimation strategy, so I decided to give it a try. However, I was only able to run a multinomial logit regression without fixed effects.
# state formula
formula_mlogit <- formula("choice ~ 1| x1 + x2")
# run multinomial regression
fit <- mlogit(formula_mlogit, data_mlogit)
summary(fit)
If I understand correctly how mlogit works, here's what I have done.
By using the function mlogit.data, I have created a dataset compatible with the function mlogit. Here, I have also specified the id of each individual (id.var = "individual_fixed_effect") and the group to which individuals belong (group.var = "time_fixed_effect"). In my case, the group represents the observations recorded in the same year.
My formula specifies that there are no alternative-specific variables (the part before the |); choices are motivated only by individual characteristics (i.e., x1 and x2, the part after the |).
The help page of mlogit specifies that one can use the argument panel to apply panel techniques. Setting panel = TRUE is what I am after here.
The problem is that panel can be set to TRUE only if another argument of mlogit, rpar, is not NULL.
The argument rpar is used to specify the distribution of the random parameters.
The problem is that, since such random parameters do not exist in my case, I cannot use the argument rpar and therefore cannot set panel = TRUE.
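(A minimal illustration of that constraint, hedged since I may be misreading the interface:)
# attempting panel = TRUE without random parameters (rpar);
# wrapped in try() because mlogit is expected to reject this combination
fit_panel <- try(mlogit(formula_mlogit, data_mlogit, panel = TRUE))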
An interesting question related to this is here. A few suggestions were given, and one seems to go in my direction. Unfortunately, no examples that I can replicate are provided, and I do not understand how to follow this strategy to solve my problem.
Moreover, I am not particularly attached to mlogit; any efficient way to perform this task would be fine for me (e.g., I am OK with survival or other packages).
Do you know any solution to this problem?
Two caveats for those interested in answering:
I am interested in fixed effects, not in random effects. However, if you believe there is no other way to take advantage of the longitudinal structure of my data in R (there is indeed one in Stata, but I don't want to use it), please feel free to share your code.
I am not interested in going Bayesian. So if possible, please do not suggest this approach.
I'm attempting to run some statistical analyses on a field trial that was conducted at 2 sites over the same growing season.
At both sites (Site, levels: HF|NW) the experimental design was an RCBD with 4 blocks (n = 4) (Block, levels: 1|2|3|4 within each Site).
There were 4 treatments: 3 different forms of nitrogen fertiliser and a control (no nitrogen fertiliser) (Treatment, levels: AN, U, IU, C).
During the field trial there were 3 distinct periods that commenced with fertiliser addition and ended with harvesting of the grass. These periods have been given the levels 1|2|3 under the factor N_app.
There is a range of measurements on which I would like to test the following null hypothesis (H0):
Treatment had no effect on the measurement
Two of the measurements I am particularly interested in are: grass yield and ammonia emissions.
Starting with grass yield (Dry_tonnes_ha), as shown here: a nice, balanced data set.
The data can be downloaded in R using the following code:
library(tidyverse)
download.file('https://www.dropbox.com/s/w5ramntwdgpn0e3/HF_NW_grass_yield_data.csv?raw=1',
              destfile = "HF_NW_grass_yield_data.csv", method = "auto")
raw_data <- read.csv("HF_NW_grass_yield_data.csv", stringsAsFactors = FALSE)
HF_NW_grass <- raw_data %>%
  mutate_at(vars(Site, N_app, Block, Plot, Treatment), as.factor) %>%
  mutate(Date = as.Date(Date, format = "%d/%m/%Y"),
         Treatment = factor(Treatment, levels = c("AN", "U", "IU", "C")))
I have had a go at running an ANOVA on this using the following approach:
model_1 <- aov(formula = Dry_tonnes_ha ~ Treatment * N_app + Site/Block, data = HF_NW_grass, projections = TRUE)
I have a few concerns with this.
Firstly, what is the best way to test assumptions? For a simple one-way ANOVA I would use shapiro.test() and bartlett.test() on the dependent variable (Dry_tonnes_ha) to assess normality and homogeneity of variance. Can I use the same approach here?
Secondly, I am concerned that N_app is a repeated measure, as the same measurement is taken from the same plot over 3 different periods. What is the best way to build this repeated-measures structure into the model?
Thirdly, I'm not sure of the best way to nest Block within Site. At both sites the levels of Block are 1:4. Do I need to have unique Block levels for each site?
I have another data set for NH3 emissions here. R code to download:
download.file('https://www.dropbox.com/s/0ax16x95m2z3fb5/HF_NW_NH3_emissions.csv?raw=1',
              destfile = "HF_NW_NH3_emissions.csv", method = "auto")
raw_data_1 <- read.csv("HF_NW_NH3_emissions.csv", stringsAsFactors = FALSE)
HF_NW_NH3 <- raw_data_1 %>%
  mutate_at(vars(Site, N_app, Block, Plot, Treatment), as.factor) %>%
  mutate(Treatment = factor(Treatment, levels = c("AN", "U", "IU", "C")))
For this I have all the concerns above with the addition that the data set is unbalanced.
At HF for N_app 1 n=3, but for N_app 2 & 3 n=4
At NW n=4 for all N_app levels.
At HF measurements were only made on the Treatment levels U and IU
At NW measurements were made on Treatment levels AN, U and IU
I'm not sure how to deal with this added level of complexity, and I am tempted to just analyse the two sites separately (the fact that the N_app periods are not the same at each site may encourage this approach).
Can I use a type III sum of squares ANOVA here?
It has been suggested to me that a linear mixed modelling approach may be the way forward but I'm not familiar with using these.
I would welcome your thoughts on any of the above. Thanks for your time.
Rory
To answer your first question on the best way of testing assumptions: while your approach of using other statistical tests implemented in R is reasonable, I would simply visualize the distribution and see whether the data meet the ANOVA assumptions. This approach may seem somewhat subjective, but it works in most cases.
independently, identically distributed (i.i.d.) data: you may already have an answer to this based on how much you know about your data. It is possible to use a chi-square test to check independence (or not).
normally distributed data: use a histogram / QQ plot to check. Based on the distribution, I think it is reasonable to use aov despite the slightly bimodal distribution.
(It appears that a log transformation helps further meet the normality assumption. This is something you may want to consider, especially for downstream analyses.)
par(mfrow = c(2, 2))
plot(density(HF_NW_grass$Dry_tonnes_ha), col = "red", main = "Density")
qqnorm(HF_NW_grass$Dry_tonnes_ha, col = "red", main = "QQ plot")
qqline(HF_NW_grass$Dry_tonnes_ha)
DTH_trans <- log10(HF_NW_grass$Dry_tonnes_ha)
plot(density(DTH_trans), col = "blue", main = "Transformed density")
qqnorm(DTH_trans, col = "blue", main = "Transformed QQ plot")
qqline(DTH_trans)
Regarding your second question, on the best way to build repeated measures into the model: unfortunately, it is difficult to pinpoint such a "best" model, but based on my experience (mostly with genomics big data), you may want to use a linear mixed-effects model. This can be implemented through the lme4 R package, for example; a sketch is given below. Since you already know how to construct a linear model in R, you should have no problem applying the lme4 functions.
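Here is a minimal sketch of one plausible specification, not the definitive model. It assumes the HF_NW_grass data from the question, with Plot identifying the repeatedly measured experimental unit:
library(lme4)
# Treatment, period and their interaction as fixed effects; Site kept fixed
# (only two sites); blocks nested in sites and plots nested in blocks as
# random intercepts -- the plot term carries the repeated measures
m1 <- lmer(Dry_tonnes_ha ~ Treatment * N_app + Site +
             (1 | Site:Block) + (1 | Site:Block:Plot),
           data = HF_NW_grass)
summary(m1)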
Your third question, regarding whether to nest the two variables, is tricky. If I were you, I would start with Site and Block as if they were independent factors. However, if you know they are not independent, you should probably nest them.
I think your questions and concerns are quite open-ended. My recommendation is that as long as you have a plausible justification, you should go ahead and proceed.
I agree with @David C on the use of visual diagnostics. Simple QQ plots should work.
# dependent variable
par(mfrow = c(1, 2))
qqnorm(dt[, dry_tonnes_ha]); qqline(dt[, dry_tonnes_ha], probs = c(0.15, 0.85))
qqnorm(log(dt[, dry_tonnes_ha])); qqline(log(dt[, dry_tonnes_ha]), probs = c(0.15, 0.85))
The log transformation looks reasonable to me. You can also see this from the density plot, which is long-tailed and somewhat bimodal.
par(mfrow=c(1,1))
plot(density(dt[,dry_tonnes_ha]))
You could alternatively use lineup plots (Buja et al., 2009) if you wish; I'm not sure they're needed in this case. A vignette is provided.
library(nullabor)
library(ggplot2) # for qplot
# this may not be the best X variable; I'm not familiar with your data
dt_l <- lineup(null_permute("dry_tonnes_ha"), dt)
qplot(dry_tonnes_ha, treatment, data = dt_l) + facet_wrap(~ .sample)
For the other assumptions, you can just use the standard diagnostic plots from the lm
lm2 <- lm(log(dry_tonnes_ha) ~ treatment * n_app + site/block, data = dt)
plot(lm2)
I don't see anything too troublesome in these plots.
I have an issue with random forest and the importance / varImpPlot functions; I hope someone can help me with it.
I tried to code two versions, but I am confused about the (different) results:
1.)
rffit <- randomForest(price ~ ., data = train, mtry = x, ntree = 500)
rfvalpred <- predict(rffit, newdata = test)
varImpPlot(rffit)
importance(rffit)
This shows the plot and the “importance” data, however only “IncNodePurity”. Also, the values differ between the plot and the printed data; I tried with “scale” but it did not work.
2.)
rf.analyzed_data <- randomForest(price ~ ., data = train, mtry = x, ntree = 500, importance = TRUE)
yhat.rf <- predict(rf.analyzed_data, newdata = test)
varImpPlot(rf.analyzed_data)
importance(rf.analyzed_data)
In this case no plot is produced anymore, and the importance data show both “%IncMSE” and “IncNodePurity”, but the “IncNodePurity” values differ from the first version.
Questions:
1.) Any idea why the data are different for “IncNodePurity”?
2.) Any idea why no “%IncMSE” is shown in the first version?
3.) Why is no plot shown in the second version?
Many thanks!!
Ed
1) IncNodePurity is derived from the loss function, and you get that measure for free just by training the model. On the downside, it is a less stable estimate, as results may vary between model runs. It is also more biased, as it favors variables with many levels. I guess the differences you found are due to randomness.
2) The permutation importance, %IncMSE, takes a little extra time to compute and is therefore optional. Roughly, the values of each variable need to be shuffled, and every OOB sample needs to be predicted once per tree, for every variable. As the randomForest package is designed, you have to compute this variable importance during training: importance must be set to TRUE, otherwise varImpPlot cannot plot it, as it has not been computed.
3) Not sure. In the code example below I see both plots at least.
library(randomForest)

# simulated data
X <- data.frame(replicate(6, rnorm(1000)))
y <- with(X, X1^2 + sin(X2 * pi) + X3 * X4)
train <- data.frame(y = y, X = X)

# training
rf1 <- randomForest(y ~ ., data = train, importance = FALSE)
rf2 <- randomForest(y ~ ., data = train, importance = TRUE)

# plotting importance
varImpPlot(rf1) # plot with IncNodePurity only
varImpPlot(rf2) # bi-plot, also with %IncMSE
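The numbers behind the second plot can then be printed as a table containing both measures:
importance(rf2) # columns %IncMSE and IncNodePurity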
I am trying to create a survival plot in R for deaths from exposure to a fungal disease over a number of weeks. I have the week of death (continuous), whether they were alive (TRUE/FALSE), as well as categorical variables for diet (high/low) and sex (male/female). I have run a coxph model:
surv1 <- coxph(Surv(week_died, alive) ~ exposed + diet + sex,
               data = surv)
I would like to plot a survival curve with individual lines for exposed males on high and low diets, and the same for females on high and low diets (resulting in 4 individual survival curves on the same plot). If I use the following, I only get a single curve.
plot(survfit(surv1), xlab = "Weeks")
I have also tried to use the ggsurv function created by Edwin Thoen (http://www.r-statistics.com/2013/07/creating-good-looking-survival-curves-the-ggsurv-function/) but keep getting an error for "invalid line type". I have tried to work out what could be causing this and think it might be this last ifelse statement, but I am not sure.
pl <- if (strata == 1) {
  ggsurv.s(s, CI, plot.cens, surv.col,
           cens.col, lty.est, lty.ci,
           cens.shape, back.white, xlab,
           ylab, main)
} else {
  ggsurv.m(s, CI, plot.cens, surv.col,
           cens.col, lty.est, lty.ci,
           cens.shape, back.white, xlab,
           ylab, main)
}
Does anyone have any idea what is causing this error and how to fix it, or whether I am going about plotting these curves completely the wrong way?
Many thanks!
survfit on a coxph model without any other arguments gives the survival curve for a case whose covariate values are the averages of the population the model was fitted to. From the help for survfit.coxph:
Serious thought has been given to removing the default value for newdata, which is to use a single "pseudo" subject with covariate values equal to the means of the data set, since the resulting curve(s) almost never make sense. ... Two particularly egregious examples are factor variables and interactions. Suppose one were studying interspecies transmission of a virus, and the data set has a factor variable with levels ("pig", "chicken") and about equal numbers of observations for each. The “mean” covariate level will be 1/2 – is this a flying pig? ... Users are strongly advised to use the newdata argument.
So after you have computed surv1,
sf <- survfit(surv1,
              newdata = expand.grid(exposed = TRUE, # the model also needs a value for
                                    # "exposed"; TRUE assumes a logical indicator,
                                    # adjust to however exposure is coded
                                    diet = unique(surv$diet),
                                    sex = unique(surv$sex)))
plot(sf)
sf should also work as an argument to ggsurv, though I've not tested it.
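To distinguish the four diet-by-sex combinations on the base plot, something like the following should work (a sketch; the legend labels assume the row order produced by the expand.grid call above):
# four curves, one per row of newdata; colours and line types follow that order
plot(sf, col = 1:4, lty = 1:4, xlab = "Weeks", ylab = "Survival probability")
combos <- expand.grid(diet = unique(surv$diet), sex = unique(surv$sex))
legend("bottomleft", legend = paste(combos$diet, combos$sex, sep = ", "),
       col = 1:4, lty = 1:4)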