LMM on ChickWeight data - R

I would like to write a model with random intercepts and random slopes with respect to time. I am not sure if my code is correct.
model4<-lmer(weight~Time + Diet + Time*Diet + (1+Time|Chick), data = Data, REML = TRUE)
summary(model4)

Yes, that is the correct specification for those random effects. You can check this by fitting a similar model, temporarily dropping the fixed effect of diet and the interaction between time and diet:
library(lme4)
library(ggplot2)
model4<-lmer(weight~Time + (1+Time|Chick), data = ChickWeight, REML = TRUE)
Column-bind the original data with the predictions from this simpler model, and select five random chicks to plot:
weight_hat = predict(model4)
cw = cbind(ChickWeight,weight_hat)
random_chicks = sample(unique(cw$Chick),5)
ggplot(cw[cw$Chick %in% random_chicks,], aes(Time, color=Chick)) +
geom_point(aes(y=weight), size=2) +
geom_line(aes(y=weight_hat), size=1.5) +
theme(legend.position="bottom")+
guides(color=guide_legend(nrow=1))
You can see that the intercept and slope differ for each chick.
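If you would rather check this numerically than by eye, here is a minimal sketch, assuming the lme4 package and the built-in ChickWeight data (the object names fit and chick_coefs are just for illustration). It pulls out the per-chick intercepts and slopes implied by the model:

```r
library(lme4)

# Random intercept and random slope for Time, fitted to the built-in ChickWeight data
fit <- lmer(weight ~ Time + (1 + Time | Chick), data = ChickWeight, REML = TRUE)

# Per-chick coefficients: the fixed effect plus that chick's random deviation
chick_coefs <- coef(fit)$Chick
head(chick_coefs)

# The random deviations on their own, and their estimated variances and correlation
head(ranef(fit)$Chick)
VarCorr(fit)
```

Each row of chick_coefs is one chick, with one column for the intercept and one for the Time slope, so varying values down both columns are what the plot is showing.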

Related

ROC for Logistic regression in R

I would like to ask for help with my project. My goal is to get ROC curve from existing logistic regression.
First of all, here is what I'm analyzing.
glm.fit <- glm(Severity_Binary ~ Side + State + Timezone + Temperature.F. + Wind_Chill.F. + Humidity... + Pressure.in. + Visibility.mi. + Wind_Direction + Wind_Speed.mph. + Precipitation.in. + Amenity + Bump + Crossing + Give_Way + Junction + No_Exit + Railway + Station + Stop + Traffic_Calming + Traffic_Signal + Sunrise_Sunset , data = train_data, family = binomial)
glm.probs <- predict(glm.fit,type = "response")
glm.probs = predict(glm.fit, newdata = test_data, type = "response")
glm.pred = ifelse(glm.probs > 0.5, "1", "0")
This part works fine: I am able to show a table of predictions and the mean result. But here is my problem. I'm using the pROC library, though I am open to anything else you can help me with. I'm using test_data with approximately 975 rows, but the variable proc has only 3 sensitivity/specificity values.
library(pROC)
proc <- roc(test_data$Severity_Binary,glm.probs)
test_data$sens <- proc$sensitivities[1:975]
test_data$spec <- proc$specificities[1:975]
ggplot(test_data, aes(x=spec, y=sens)) + geom_line()
Here's what I have as a result:
With Warning message:
Removed 972 row(s) containing missing values (geom_path).
As I found out, proc has only 3 values as I said.
You can't (and shouldn't) assign the sensitivity and specificity to the data. They are summary data and exist in a different dimension than your data.
Specifically, these two lines are wrong and make no sense at all:
test_data$sens <- proc$sensitivities[1:975]
test_data$spec <- proc$specificities[1:975]
Instead you must either save them to a new data.frame, or use some of the existing functions like ggroc:
ggroc(proc)
If you consider what the ROC curve does, there is no reason to expect it to have the same dimensions as your dataframe. It provides summary statistics of your model performance (sensitivity, specificity) evaluated on your dataset for different thresholds in your prediction.
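As a rough sketch of the "new data.frame" option, assuming the pROC package and simulated data (x, y, p, r and roc_df are made-up names, not from the question), the ROC summary can be stored on its own rather than in the test data:

```r
library(pROC)

set.seed(42)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(x))           # simulated binary outcome
fit <- glm(y ~ x, family = binomial)
p <- predict(fit, type = "response")   # predicted probabilities

r <- roc(response = y, predictor = p, quiet = TRUE)

# One row per threshold, not one row per observation
roc_df <- data.frame(
  threshold   = r$thresholds,
  sensitivity = r$sensitivities,
  specificity = r$specificities
)
nrow(roc_df)
head(roc_df)
```

The number of rows follows the number of thresholds, so it generally will not match the number of rows in the data. A data frame like this can then be passed to ggplot with specificity on a reversed x axis and sensitivity on y.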
Usually you would expect some more nuance on the curve (more than the 3 data points at thresholds -Inf, 0.5, Inf). You can look at the distribution of your glm.probs: this ROC curve indicates that all predictions are either 0 or 1, with very little in between (hence only one threshold at 0.5 on your curve). [This could also mean that you unintentionally used your binary glm.pred for calculating the ROC curve, and not glm.probs as shown in the question (?)]
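To see the difference between the two cases, here is a small sketch on simulated data (assuming pROC; probs and pred are stand-ins for glm.probs and a numeric version of glm.pred). It compares the number of thresholds you get from probabilities with the number you get from hard 0/1 predictions:

```r
library(pROC)

set.seed(1)
n <- 300
x <- rnorm(n)
y <- rbinom(n, 1, plogis(2 * x))
probs <- predict(glm(y ~ x, family = binomial), type = "response")
pred  <- ifelse(probs > 0.5, 1, 0)   # hard 0/1 classification

# How many distinct predicted values are there?
length(unique(probs))
length(unique(pred))

roc_probs <- roc(y, probs, quiet = TRUE)
roc_pred  <- roc(y, pred,  quiet = TRUE)

# Number of points on each ROC curve
length(roc_probs$thresholds)
length(roc_pred$thresholds)
```

If length(unique(glm.probs)) on your own data is tiny, the model really is producing near-binary predictions; if it is large but the curve still has only three points, the binary vector was probably passed to roc() by mistake.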
This seems to be more an issue with your model than with your code - here is an example from an unrelated dataset, using the same steps you took (glm(..., family = binomial), then predict(..., type = "response")). This produces a ROC curve with 333 steps for ~1300 data points.
PS: (Ignore the fact that this is evaluated on training data; the point is that the code looks alright up to the point of generating the ROC curve.)
m1 <- glm(survived ~ passengerClass + sex + age, data = dftitanic, family = binomial)
myroc <- roc(dftitanic$survived,predict(m1, dftitanic, type = "response"))
plot(myroc)

Accelerated Failure Time modelling: plotting survival probabilities with CI-s (example provided)

As the proportional hazards assumption is violated in my real data, I am using an AFT model and trying to calculate adjusted survival probabilities for the study groups of interest. The example below uses the kidney data, and I tried to follow the ciTools vignette.
library(tidyverse)
library(ciTools)
library(here)
library(survival)
library(survminer)
#data
kidney
Model
fit1 = survreg(Surv(time, censored) ~ sex + age + disease, data = kidney)
Call:
survreg(formula = Surv(time, censored) ~ sex + age + disease,
data = kidney)
Coefficients:
(Intercept) sexfemale age diseaseGN diseaseAN diseasePKD
8.41830937 -0.93959839 -0.01578812 -0.25274448 -0.38306425 -0.32830433
Scale= 1.642239
Loglik(model)= -122.1 Loglik(intercept only)= -122.7
Chisq= 1.33 on 5 degrees of freedom, p= 0.931
n= 76
Adding survival probabilities for both sexes for surviving at least 365 days
probs = ciTools::add_probs(kidney, fit1, q = 365,
name = c("prob", "lcb", "ucb"),
comparison = ">")
probs
I am trying to plot one-year survival probabilities for both sexes, but there are multiple point estimates for geom_point.
It seems to me that these point estimates are given for each age value. Can I edit the prediction so that it is made for the mean or median age?
probs %>% ggplot(aes(x = prob, y = sex)) +
ggtitle("1-year survival probability") +
xlim(c(0,1)) +
theme_bw() +
geom_point(aes(x = prob), alpha = 0.5, colour = "red")+
geom_linerange(aes(xmin = lcb, xmax = ucb), alpha = 0.5)
However, this approach seems to work with a simple model
fit2 = survreg(Surv(time, censored) ~ sex, data = kidney)
probs2 = ciTools::add_probs(kidney, fit2, q = 365,
name = c("prob", "lcb", "ucb"),
comparison = ">")
probs2 %>% ggplot(aes(x = prob, y = sex)) +
ggtitle("1-year survival probability") +
xlim(c(0,1)) +
theme_bw() +
geom_point(aes(x = prob), alpha = 0.5, colour = "red")+
geom_linerange(aes(xmin = lcb, xmax = ucb), alpha = 0.5)
Questions:
How can I get adjusted survival probabilities for both sexes? Or, if this is impossible, what would be possible alternatives? Code would help with alternatives.
If I would like to get adjusted survival probabilities for both sexes and for different time points, should I edit the "q" value in the ciTools::add_probs() function? For example: q = 30 for one month, q = 90 for three months, etc. Or should I run a separate model for each time period?
The way you have set up these models, the predictions are being returned for each individual in the kidney data set, based on the covariate values included in the model.
In your first model, you have included sex + age + disease so that you get a prediction for each combination of those 3 covariate values in your data set.
In the second model, you have only included sex as a predictor, so you only get predictions based on sex.
Survival model prediction functions allow you to specify a set of covariate values from which to predict in a new data frame. According to the manual for add_probs.survreg, you do so by specifying a new data frame with specified covariate values in the df argument to the function. You used the kidney data frame there, so you got predictions for all those cases.
I'm not familiar with ciTools::add_probs specifically, but such software typically will (without warning) accept your values for the covariates you specify and then use some type of "average" value for the covariates that you don't specify. As "averages" don't have much meaning for categorical covariates like disease, it's usually better to specify a complete useful set of values for all covariates yourself.
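As a rough illustration of specifying the covariate values yourself, here is a sketch using only the survival package. It assumes survival::kidney, where the event indicator is called status rather than censored, and the default Weibull distribution; newdata, lp and surv_at are made-up names. It gives point estimates only, without confidence intervals:

```r
library(survival)

kd <- survival::kidney
# In survival::kidney, sex may be coded 1 = male, 2 = female; make it a labelled factor
if (is.numeric(kd$sex)) {
  kd$sex <- factor(kd$sex, levels = c(1, 2), labels = c("male", "female"))
} else {
  kd$sex <- factor(kd$sex)
}
kd$disease <- factor(kd$disease)

fit <- survreg(Surv(time, status) ~ sex + age + disease, data = kd)  # Weibull by default

# One row per sex, with the other covariates fixed at values chosen by hand
newdata <- data.frame(
  sex     = factor(levels(kd$sex), levels = levels(kd$sex)),
  age     = median(kd$age),
  disease = factor(levels(kd$disease)[1], levels = levels(kd$disease))
)

# Linear predictor on the log-time scale for each row of newdata
lp <- predict(fit, newdata = newdata, type = "lp")

# Weibull AFT survival function: S(t) = exp(-exp((log(t) - lp) / scale))
surv_at <- function(t) exp(-exp((log(t) - lp) / fit$scale))

newdata$surv_30  <- surv_at(30)
newdata$surv_90  <- surv_at(90)
newdata$surv_365 <- surv_at(365)
newdata
```

The same fitted model is evaluated at several time points here, so no refitting per time point is involved. A data frame like newdata is also what you would pass as the df argument of add_probs() in place of kidney if you want the intervals from ciTools.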
The functions in the rms package in R often do a better job at providing useful predictions, as they choose typical rather than "average" values for unspecified covariates, based on an initial evaluation of the data set by a datadist() function whose output you then specify as a system option. The learning curve for this package is a bit steep but well worth it if you will be doing a lot of survival or other regression modeling.

Not getting a smooth curve using ggplot2

I am trying to fit a mixed effects model using the lme4 package. Unfortunately I cannot share the data that I am working with, and I couldn't find a toy data set relevant to my problem. So here are the steps that I have followed so far:
First I plotted the overall trend of the data as follows:
p21 <- ggplot(data = sub_data, aes(x = age_cent, y = y))
p21+ geom_point() + geom_smooth()
Based on this, there seems to be some nonlinear trend in the data. Hence I tried to fit a quadratic model as follows:
sub_data$age_cent=sub_data$age-mean((sub_data)$age)
sub_data$age_centsqr=(sub_data$age-mean((sub_data)$age))^2
m1= lmer(y ~ 1 + age_cent + age_centsqr +(1 | id) , sub_data, REML = TRUE)
In the above model I only included a random intercept because I don't have enough data to include both a random slope and a random intercept. Then I extracted the population-level predictions of this model as follows:
pred1=predict(m1,re.form=NA)
Next I plotted these predictions along with a smooth quadratic function like this:
p21+ geom_point() + geom_smooth(method = "lm", formula = y ~ I(x) + I(x^2)
,col="red")+geom_line(aes(y=pred1,group = id) ,col="blue", lwd = 0.5)
In the above plot, the curve corresponding to the predictions is not smooth. Can anyone help me figure out the reason for that?
Am I doing anything wrong here?
Update :
As eipi10 pointed out, this may be due to fitting different curves for different people.
But when I tried the same thing using a toy data set from the lme4 package, I got the same curve for each person, as follows:
m1 <- lmer(Reaction ~ 1+I(Days) + (1+ Days| Subject) , data = sleepstudy)
pred1new1=predict(m1,re.form=NA)
p21 <- ggplot(data = sleepstudy, aes(x = Days, y = Reaction))
p21+ geom_point() + geom_smooth()
p21+ geom_point() + geom_smooth()+ geom_line(aes(y=pred1new1,group = Subject) ,col="red", lwd = 0.5)
What may be the reason for the different results? Is this due to the imbalance in the data?
The data I used were collected at 3 time steps, and some people didn't have measurements for all 3 time steps. But the toy data set is balanced.
Thank you
tl;dr use expand.grid() or something like it to generate a balanced/evenly spaced sample for every group (if you have a strongly nonlinear curve you may want to generate a larger/more finely spaced set of x values than in the original data)
You could also take a look at the sjPlot package, which does a lot of this stuff automatically ...
You need both an unbalanced data set and a non-linear (e.g. polynomial) model for the fixed effects to see this effect.
If the model is linear, then you don't notice missing values because the linear interpolation done by geom_line() works perfectly.
If the data are balanced, then there are no gaps to get weirdly filled by linear interpolation.
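A tiny numeric sketch of those two points, using base R's approx() as a stand-in for the straight-line joining that geom_line() does between observed points (the functions quad and lin and the choice of missing x values are made up for illustration):

```r
x_all <- 1:10
keep  <- c(1, 2, 3, 7, 8, 9, 10)   # x = 4, 5, 6 are "missing" for this group

quad <- function(x) x^2     # a curved (quadratic) trend
lin  <- function(x) 2 * x   # a straight-line trend

# Linear interpolation across the gap, evaluated at x = 5
quad_interp <- approx(x_all[keep], quad(x_all[keep]), xout = 5)$y
lin_interp  <- approx(x_all[keep], lin(x_all[keep]),  xout = 5)$y

c(true = quad(5), interpolated = quad_interp)   # 25 vs 29: the gap is filled wrongly
c(true = lin(5),  interpolated = lin_interp)    # 10 vs 10: no visible problem
```

The straight segment across the gap overshoots the quadratic curve but sits exactly on the linear one, which is why the kink only shows up with a nonlinear model and unbalanced data.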
Generate an example with quadratic effects and an unbalanced data set; fit the model
library(lme4)
set.seed(101)
dd <- expand.grid(id=factor(1:10),x=1:10)
dd$y <- simulate(~poly(x,2)+(poly(x,2)|id),
newdata=dd,
family=gaussian,
newparams=list(beta=c(0,0,0.1),
theta=rep(0.1,6),
sigma=1))[[1]]
## subsample randomly (missing values)
dd <- dd[sort(sample(nrow(dd),size=round(0.7*nrow(dd)))),]
m1 <- lmer(y ~ poly(x,2) + (poly(x,2)|id) , data = dd)
Naive prediction and plot:
dd$pred1 <- predict(m1,re.form=NA)
library(ggplot2)
p11 <- (ggplot(data = dd, aes(x = x, y = y))
+ geom_point() + geom_smooth(method="lm",formula=y~poly(x,2))
)
p11 + geom_line(aes(y=pred1,group = id) ,col="red", lwd = 0.5)
Now generate a balanced data set. This version generates 51 evenly spaced points between the min and max - this will be useful if the original data are unevenly spaced. If you have NA values in your x variable, don't forget na.rm=TRUE ...
pframe <- with(dd,expand.grid(id=levels(id),x=seq(min(x),max(x),length.out=51)))
Make predictions, and overlay them on the original plot:
pframe$pred1 <- predict(m1,newdata=pframe,re.form=NA)
p11 + geom_line(data=pframe,aes(y=pred1,group = id) ,col="red", lwd = 0.5)

How to add weight to variable for GAM model?

I am running a gam model based on a large dataset with many variables. My response variable is the level of "recruitment" by a herd every fall/autumn. This is calculated by the fawn:female ratio every fall/autumn over a 60 year period.
My problem is that there are many years and study sites where only between 1 - 10 females are recorded. This means that the robustness of the ratio is not trustworthy. For example if one female and one fawn is seen, it has a recruitment of 100%, but if they see one more female, that drops by 50%!
I need to tell the model that years/study sites with smaller sample sizes should be weighted less than those with larger sample sizes as these smaller sample sizes are no doubt affecting the results.
Above is a table of the females observed every year and a histogram of the same.
My model is as follows:
gamFIN <- gam(Fw.FratioFall
~ s(year)
+ s(percentage_woody_coverage)
+ s(kmRoads.km2)
+ s(WELLS_ACTIVEinsideD)
+ s(d3)
+ s(WT_DEER_springsurveys)
+ s(BadlandsCoyote.1000_mi)
+ s(Average_mintemp_winter, BadlandsCoyote.1000_mi)
+ s(BadlandsCoyote.1000_mi, WELLS_ACTIVEinsideD)
+ s(BadlandsCoyote.1000_mi, d3)
+ s(YEAR, bs = "re") + s(StudyArea, bs = "re"), method = "REML", select = T, data = mydata)
How might I tell the model to weight my response variable by the sample sizes they are based on?
Do not model this as a ratio for your outcome. Instead, model the fawn counts as your outcome and include the female counts via an offset() term, using logged values, on the RHS of the formula. That is, you should be offsetting with the log of the female count. So the formula would look like this:
Fawns
~ s(year)
+ all_those_smooth_terms
+ offset( lnFemale_counts)
With a Poisson family, gam uses a log link, which is the reason for logging the female counts.
Edit (Gavin's correct: the default for gam is not a log link, so the family has to be specified explicitly):
gamFIN <- gam(FawnFall ~ s(year) + s(percentage_woody_coverage) + s(kmRoads.km2) +
s(WELLS_ACTIVEinsideD) + s(d3) + s(WT_DEER_springsurveys) +
s(BadlandsCoyote.1000_mi) + s(Average_mintemp_winter, BadlandsCoyote.1000_mi) +
s(BadlandsCoyote.1000_mi, WELLS_ACTIVEinsideD) + s(BadlandsCoyote.1000_mi, d3) +
s(YEAR, bs = "re") + s(StudyArea, bs = "re") + offset(log(FemaleFall)),
family="poisson", method = "REML", select = TRUE, data = mydata)
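For a self-contained illustration of the offset idea, here is a sketch on simulated data, assuming the mgcv package; the data frame d, its columns fawns and females, and the single smooth of year are invented and much simpler than the real model:

```r
library(mgcv)

set.seed(123)
n <- 200
d <- data.frame(
  year    = sample(1960:2019, n, replace = TRUE),
  females = sample(1:40, n, replace = TRUE)   # sample size behind each ratio
)
# Simulated fawn:female rate that drifts smoothly over the years
true_rate <- exp(-0.5 + 0.3 * sin((d$year - 1960) / 10))
d$fawns <- rpois(n, lambda = d$females * true_rate)

# Counts as the outcome, log(females) as the offset, log link via the Poisson family
fit <- gam(fawns ~ s(year) + offset(log(females)),
           family = poisson, method = "REML", data = d)
summary(fit)

# Predicted fawns per female: set females = 1 so the offset is log(1) = 0
nd <- data.frame(year = 1960:2019, females = 1)
nd$ratio_hat <- predict(fit, newdata = nd, type = "response")
head(nd)
```

Rows based on many females carry more information in the Poisson likelihood than rows based on one or two, which is how the sample size enters without an explicit weight.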

Interpreting interactions in a regression model

A simple question I hope.
I have an experimental design where I measure some response (let's say blood pressure) from two groups: a control group and an affected group, where both are given three treatments: t1, t2, t3. The data are not paired in any sense.
Here is an example data:
set.seed(1)
df <- data.frame(response = c(rnorm(5,10,1),rnorm(5,10,1),rnorm(5,10,1),
rnorm(5,7,1),rnorm(5,5,1),rnorm(5,10,1)),
group = as.factor(c(rep("control",15),rep("affected",15))),
treatment = as.factor(rep(c(rep("t1",5),rep("t2",5),rep("t3",5)),2)))
What I am interested in is quantifying the effect that each treatment has on the affected group relative to the control group. How would I model this, say using a linear model (for example lm in R)?
Am I wrong thinking that:
lm(response ~ 0 + treatment * group, data = df)
which is equivalent to:
lm(response ~ 0 + treatment + group + treatment:group, data = df)
is not what I need? I think that in this model the treatment:group interaction terms are relative to the mean over all baseline group and baseline treatment measurements.
I therefore thought that this model:
lm(response ~ 0 + treatment:group, data = df)
is what I need but it's quantifying each combination of treatment and group interaction terms: treatmentt1:groupcontrol treatmentt1:groupaffected treatmentt2:groupcontrol treatmentt2:groupaffected treatmentt3:groupcontrol treatmentt3:groupaffected
So perhaps this model:
lm(response ~ 0 + treatment + treatment:group, data = df)
is the correct one?
Although in addition to quantifying each combination of treatment and groupaffected interaction term it's also quantifying the effect of each treatment. I'm not sure what is the baseline each of the treatment and groupaffected interaction terms are compared to in this model.
Help would be appreciated.
Also, let's say I ran a fourth treatment which is actually the combination of two treatments, say t1+t3, where I don't know what the expectation of their combined effect is: additive/subtractive or synergistic. Is there any way this can be combined?
The interaction term tells you that the difference between groups is dependent on treatment, that is, that the difference between affected and control is not the same for t1, t2 and t3.
I would model the intercept though.
lm(response ~ group + treatment + group:treatment, data=df)
After getting a significant interaction term, I would use t-tests to investigate further and to help with interpretation.
As can be seen, the interaction is driven by the larger effect of t2 relative to the others.
library(data.table)
library(dplyr)
library(ggplot2)
set.seed(1)
df <- data.frame(response = c(rnorm(5,10,1),rnorm(5,10,1),rnorm(5,10,1),rnorm(5,7,1),rnorm(5,5,1),rnorm(5,10,1)),
group = as.factor(c(rep("control",15),rep("affected",15))),
treatment = as.factor(rep(c(rep("t1",5),rep("t2",5),rep("t3",5)),2)))
# t tests of the desired comparisons to see if there is a difference and get 95% confidence intervals
t.test(df$response[df$treatment=="t1"] ~ df$group[df$treatment=="t1"])
t.test(df$response[df$treatment=="t2"] ~ df$group[df$treatment=="t2"])
t.test(df$response[df$treatment=="t3"] ~ df$group[df$treatment=="t3"])
# plot 95% C.I. (difference in means, affected minus control)
ci_plot <- matrix(nrow=3, ncol=3)
ci_plot <- as.data.frame(ci_plot)
colnames(ci_plot) <- c("treatment", "lci", "uci")
ci_plot[,1] <- c("t1", "t2", "t3")
# column 2 (lci) holds the lower bounds, column 3 (uci) the upper bounds
ci_plot[,2] <- c(t.test(df$response[df$treatment=="t1"] ~ df$group[df$treatment=="t1"])$conf.int[1],
t.test(df$response[df$treatment=="t2"] ~ df$group[df$treatment=="t2"])$conf.int[1],
t.test(df$response[df$treatment=="t3"] ~ df$group[df$treatment=="t3"])$conf.int[1])
ci_plot[,3] <- c(t.test(df$response[df$treatment=="t1"] ~ df$group[df$treatment=="t1"])$conf.int[2],
t.test(df$response[df$treatment=="t2"] ~ df$group[df$treatment=="t2"])$conf.int[2],
t.test(df$response[df$treatment=="t3"] ~ df$group[df$treatment=="t3"])$conf.int[2])
ggplot(ci_plot, aes(x=treatment, y=uci)) +
geom_errorbar(aes(ymin=lci, ymax=uci), width=0.5, position=position_dodge(0.9)) +
xlab("Treatment") +
ylab("Change in mean relative to control (95% C.I.)") +
theme_bw() +
theme(panel.border = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.line = element_line(colour = "black"),
axis.text.x = element_text(angle = 90, hjust = 1))
Your first specification is fine.
lm(response ~ 0 + treatment * group, data = df)
Call:
lm(formula = response ~ 0 + treatment * group, data = df)
Coefficients:
treatmentt1 treatmentt2 treatmentt3
7.460 5.081 9.651
groupcontrol treatmentt2:groupcontrol treatmentt3:groupcontrol
2.670 2.384 -2.283
The first coefficient, 7.460, is the estimated mean response for a participant who is both treated with t1 and affected. Going from left to right, the second coefficient, 5.081, is the estimated mean for a participant who is both treated with t2 and affected, etc.
So, for example, when a participant is treated with t2 and in the control group, the estimated mean is 5.081 + 2.670 + 2.384 = 10.135 (the t2 term, plus the control term, plus their interaction).
If I were doing this analysis, I would keep the intercept.
Call:
lm(formula = response ~ treatment * group, data = df)
Coefficients:
(Intercept) treatmentt2 treatmentt3
7.460 -2.378 2.192
groupcontrol treatmentt2:groupcontrol treatmentt3:groupcontrol
2.670 2.384 -2.283
Now the second coefficient, going from left to right, represents the effect of participants treated with t2 and affected relative to participants treated with t1 and affected. To see this, notice that 7.460 - 2.378 ≈ 5.081 (the second coefficient in the first specification, up to rounding). I like this approach because it makes it easier to interpret the relative effects.
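If you want to convince yourself how the coefficients combine, one option is to compare sums of coefficients with the observed cell means. This is only a sketch on the example data from the question; cell_means, b0, b1 and the t2_* names are made up for illustration:

```r
set.seed(1)
df <- data.frame(response = c(rnorm(5,10,1), rnorm(5,10,1), rnorm(5,10,1),
                              rnorm(5,7,1),  rnorm(5,5,1),  rnorm(5,10,1)),
                 group = as.factor(c(rep("control",15), rep("affected",15))),
                 treatment = as.factor(rep(c(rep("t1",5), rep("t2",5), rep("t3",5)), 2)))

# Observed mean response in each treatment-by-group cell
cell_means <- tapply(df$response, list(df$treatment, df$group), mean)

# Specification without an intercept
b0 <- coef(lm(response ~ 0 + treatment * group, data = df))
t2_affected_0 <- b0[["treatmentt2"]]
t2_control_0  <- b0[["treatmentt2"]] + b0[["groupcontrol"]] +
                 b0[["treatmentt2:groupcontrol"]]

# Specification with an intercept (baseline: t1, affected)
b1 <- coef(lm(response ~ treatment * group, data = df))
t2_affected_1 <- b1[["(Intercept)"]] + b1[["treatmentt2"]]
t2_control_1  <- t2_affected_1 + b1[["groupcontrol"]] +
                 b1[["treatmentt2:groupcontrol"]]

# Both parameterisations reproduce the same cell means
all.equal(t2_affected_0, cell_means["t2", "affected"])
all.equal(t2_control_0,  cell_means["t2", "control"])
all.equal(t2_affected_1, t2_affected_0)
all.equal(t2_control_1,  t2_control_0)
```

Both models are saturated (one parameter per cell), so they describe the same six means and differ only in which comparisons the individual coefficients express.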
That all being said, @MrFlick is right. This is a question for Cross Validated.
