R: predict.averaging is not taking an offset into account when plotting

I'm currently trying to use the predict.averaging function in MuMIn to create some graphs from model averaging I've done on some GLMMs. I'm interested in whether the number of insects caught per daylight hour in some traps changes when the traps are left out for different lengths of time, so I included offset(log(Daylight)) in my GLMMs to account for this. But when I use the predict function it doesn't take the offset into account, and I get the same graph I would get if I hadn't included the offset in the first place. I know the offset is having an effect, though, from the output of my model-averaged GLMMs, and it's the kind of effect I would expect from my observations of the data.
Does anyone know why this might be happening and how I can make predict.averaging take the offset into account? I've included the code I'm using below:
# global model for total insect abundance
glmm11 <- glmmadmb(Total_polls ~ Max_temp + Wind + Precipitation + Veg_height +
                     Season + Year + log(Mean.nectar + 1) +
                     I(log(Nectar + 1) - log(Mean.nectar + 1)) +
                     Pan_colour * Assoc_col + Treatment * Area * Depth +
                     (1 | Transect) + offset(log(Daylight)),
                   data = ab, zeroInflation = FALSE, family = "nbinom")
# make predictions based on model averaging output (subset delta < 2)
preds <- predict(ave21, full = FALSE, type = "response", backtransform = FALSE) # on the response scale
Where ave21 is a model-averaging object generated using pdredge and model.avg, constrained so that the offset (and the random effect) appear in every model:
model11 <- pdredge(glmm11, cluster = clust, fixed = ~ offset(log(Daylight)) + (1 | Transect))
The object itself looks like this:
Call:
model.avg(object = get.models(object = model11, subset = delta < 2))
Component model call:
glmmadmb(formula = Total_polls ~ <3 unique rhs>, data = ab, family = nbinom,
zeroInflation = FALSE)
Component models:
df logLik AICc delta weight
1/2/3/4/5/6/7/8/9/10/11/12 20 -864.14 1769.22 0.00 0.47
1/2/3/4/5/6/7/8/9/10/11/12/13 23 -861.39 1770.03 0.81 0.31
1/3/4/5/6/7/8/9/10/11/12 19 -865.97 1770.79 1.57 0.21
Term codes:
 1  Area
 2  Assoc_col
 3  Depth
 4  I(log(Nectar + 1) - log(Mean.nectar + 1))
 5  Max_temp
 6  Pan_colour
 7  Season
 8  Treatment
 9  Year
10  log(Mean.nectar + 1)
11  offset(log(Daylight))
12  Area:Depth
13  Assoc_col:Pan_colour
Which I then used to get predictions:
pred_results <- cbind(glmm21$frame, preds)  # append predictions to the original data frame
plot(pred_results$preds ~ pred_results$Treatment)  # Treatment = trap duration (hours)
This code might go around the houses a little, as I borrowed it from a fellow PhD student. The graph I get when I plot my predictions looks like this: [Model predictions vs. Trap duration (hours)][1], which is very different from the picture given by the summary results of my model averaging:
(conditional average)
Estimate Std. Error Adjusted SE z value Pr(>|z|)
(Intercept) -5.896725 0.948102 0.949386 6.211 < 2e-16 ***
Treatment24 -0.714283 0.130226 0.130403 5.478 < 2e-16 ***
Treatment48 -0.983881 0.122416 0.122582 8.026 < 2e-16 ***
Any help would be great, as I can't find any specific instances where this has been addressed on the site to date. Thank you in advance and please let me know if you need me to add anything to make this question better.
Tom
[1]: https://i.stack.imgur.com/Pn4dK.jpg
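One thing worth checking (a sketch of my own, assuming the ave21 and ab objects from above, not something from the original thread): predict on the link scale and add the offset back in by hand. If the manually offset predictions vary with trap duration while the original ones do not, then predict.averaging is dropping the offset term.
# Link-scale predictions; if the offset is being ignored, exp(lp) is the
# catch rate per daylight hour rather than the expected count.
lp <- predict(ave21, full = FALSE, type = "link", backtransform = FALSE)
# Expected counts with the offset added back manually:
preds_manual <- exp(lp + log(ab$Daylight))
# Catch rates per daylight hour, comparable across Treatment levels
# regardless of exposure:
plot(exp(lp) ~ ab$Treatment)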

Related

How to Interpret a Coefficient table for Multinom() Function in R

I have a dataset with weather = 0 if temp is < 65 degrees Fahrenheit, weather = 1 if temp is between 65 and 68 degrees Fahrenheit, and weather = 2 if temp is > 68 degrees Fahrenheit. I need to estimate the probability that the temp is between 65 and 68 degrees Fahrenheit (weather = 1), given days = 20. Here is the formula and output:
multinom(formula = weather ~ days, data = USWeather13)
Which gives the coefficient table:
Coefficients:
(Intercept) days
1 5.142 -.252
2 25.120 .343
Std. Errors:
(Intercept) days
1 1.742 .007
2 1.819 .004
Does anyone know how I can interpret this or figure out this problem?
In your example, weather = 0 is the reference level, and the coefficients are the changes in the log-odds of weather = 1 or weather = 2 (each versus weather = 0) per unit of your predictor days.
It's an example without the complete information, but reading your coefficients: for every unit increase in days, the log-odds of weather = 1 versus weather = 0 decrease by 0.252, and the log-odds of weather = 2 versus weather = 0 increase by 0.343.
If you need to figure out the respective probabilities at days = 20, you do:
library(nnet)  # multinom() comes from the nnet package
fit <- multinom(formula = weather ~ days, data = USWeather13)
predict(fit, newdata = data.frame(days = 20), type = "prob")
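To see where those probabilities come from, you can also compute them by hand from the printed coefficients with the multinomial logit formula (a quick sketch using the numbers in the coefficient table above; weather = 0 is the reference, so its linear predictor is 0):
b1 <- c(5.142, -0.252)   # weather = 1 vs 0: intercept, days
b2 <- c(25.120, 0.343)   # weather = 2 vs 0: intercept, days
days <- 20
eta1 <- b1[1] + b1[2] * days
eta2 <- b2[1] + b2[2] * days
denom <- 1 + exp(eta1) + exp(eta2)
c(p0 = 1 / denom, p1 = exp(eta1) / denom, p2 = exp(eta2) / denom)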
I think this website might provide a good guide on how to interpret the coefficients.

Get marginal effect and predicted probability for glmer model in R

I'm trying to calculate both the predicted probability values and marginal effects values (with p-values) for a categorical variable over time in a logistic regression model in R. Basically, I want to know 1) the predicted probability of the response variable (an event occurring) in each year for sample sites in one of 2 categories and 2) the average marginal effect of a site being in 1 category vs. the other in each year. I can get predicted probability values using the ggeffects package and marginal effects values from the margins package, but I haven't figured out a way to get both sets of values from a single package.
So my questions are 1) is there a package/method to get both of these sets of values, and 2) if I get the predicted probability values from ggeffects and the marginal effects values from margins, are these values compatible? Or are there differences in the ways that the packages treat the models that mean I can't assume the marginal effects from one correspond to the predicted probabilities of the other? 3) In the margins package, how can I get the average marginal effect of the interaction of two factor variables over time? And 4) how can I get margins() to work with a large dataset?
Here is some sample data:
### Make dataset
df <- data.frame(year = rep(2001:2010, each = 100),
                 state = rep(c("montana", "idaho",
                               "colorado", "wyoming", "utah"),
                             times = 10, each = 20),
                 site_id = as.factor(rep(1:100, times = 10)),
                 cat_variable = as.factor(rep(0:1, times = 5, each = 10)),
                 ind_cont_variable = rnorm(100, mean = 20, sd = 5),
                 event_occurred = as.factor(sample(c(0, 1),
                                                   replace = TRUE,
                                                   size = 1000)))
### Add dummy columns for states
library(fastDummies)
df <- dummy_cols(df,
                 select_columns = "state",
                 remove_first_dummy = TRUE)
I'm interested in the effects of the state and the categorical variable on the probability that the event occurred, and in how the effect of the state and categorical variable changed over time. Here's the model:
library(lme4)
fit_state <- glmer(event_occurred ~ ind_cont_variable +
                     cat_variable * year * state +
                     (1 | site_id),
                   data = df,
                   family = binomial(link = "logit"),
                   nAGQ = 0,
                   control = glmerControl(optimizer = "nloptwrap"))
I can use ggeffects to get the predicted probability values for each state and category combination over time:
library(ggeffects)
fit_pp_state <- data.frame(ggpredict(fit_state,
                                     terms = c("year [all]",
                                               "cat_variable",
                                               "state")))
head(fit_pp_state)
### x = year, predicted = predicted probability, group = categorical variable level, facet = state
#   x    predicted std.error  conf.low conf.high group    facet
# 1 2001 0.2835665 0.3981910 0.1535170 0.4634655     0 colorado
# 2 2001 0.5911911 0.3762090 0.4089121 0.7514289     0    idaho
# 3 2001 0.5038673 0.3719418 0.3288209 0.6779708     0  montana
# 4 2001 0.7101610 0.3964843 0.5297327 0.8420101     0     utah
# 5 2001 0.5714579 0.3747205 0.3901606 0.7354088     0  wyoming
# 6 2001 0.6788503 0.3892568 0.4963910 0.8192719     1 colorado
This is really great for visualizing the changes in predicted probability over time in the 5 states. But I can't figure out how to go from these values to estimates of marginal effects using ggeffects. Using the margins package, I can get the marginal effect of the categorical variable over time, but I'm not sure how to interpret the outputs of the two different packages together or if that's even appropriate (my first two questions). In addition, I'm not sure how to get margins to give me the marginal effect of a sample site being in each combination of categorical variable level/state over time (bringing me to my third question):
library(margins)
fit_state_me <- summary(margins(fit_state,
                                at = list(year = 2001:2010),
                                variables = "cat_variable"))
head(fit_state_me)
# factor year AME SE z p lower
# cat_variable1 2001.0000 0.0224 0.0567 0.3953 0.6926 -0.0887
# cat_variable1 2002.0000 0.0146 0.0490 0.2978 0.7659 -0.0814
# cat_variable1 2003.0000 0.0062 0.0418 0.1478 0.8825 -0.0757
# cat_variable1 2004.0000 -0.0026 0.0359 -0.0737 0.9413 -0.0731
# cat_variable1 2005.0000 -0.0117 0.0325 -0.3604 0.7186 -0.0754
# cat_variable1 2006.0000 -0.0208 0.0325 -0.6400 0.5222 -0.0845
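On questions 1 and 2, one package worth looking at (my suggestion, not something tested on this data) is marginaleffects, which computes predictions and marginal effects from the same back-end, so the two sets of values are mutually consistent by construction:
# Sketch, assuming the fit_state model and df data frame from above.
library(marginaleffects)
# Predicted probabilities on a year x category x state grid:
preds <- predictions(fit_state,
                     newdata = datagrid(year = 2001:2010,
                                        cat_variable = levels(df$cat_variable),
                                        state = unique(df$state)))
# Average marginal effect of cat_variable within each year:
ames <- avg_slopes(fit_state, variables = "cat_variable", by = "year")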
The actual dataset I'm using is fairly large (the csv of raw data is 1.51 GB and the regression model object is 1.29 GB when I save it as a .rds file). When I try to use margins() on my data, I get an error message:
Error: cannot allocate vector of size 369.5 Gb
Any advice for getting around this issue so that I can use this function on my data?
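On question 4, one workaround to try (an untested sketch; it assumes the memory blow-up comes from expanding the design matrix for all at() combinations at once) is to compute the effects one year at a time and bind the results:
# Loop over years so each margins() call expands a smaller design matrix.
library(margins)
me_by_year <- do.call(rbind, lapply(2001:2010, function(yr) {
  summary(margins(fit_state,
                  at = list(year = yr),
                  variables = "cat_variable"))
}))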
I'd be grateful for any tips: packages I should check out, mistakes I'm making in my code or in my conceptual understanding, etc. Thank you!

Get lm estimate for each categorical variable

So I am doing a multiple linear regression to see whether fracture density and rock type affect retreat rates in rocks.
retreat <- lm(retreat_rate ~ fracture_dens + rock_unit, data = coast)
summary(retreat)
I would like it to treat the 'rock_unit' as a category. I have two rock types in the vector. Here is my current result.
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.22631 0.53806 -0.421 0.676353
fracture_dens 0.11467 0.02704 4.241 0.000132 ***
rock_unitSC_mudstone 1.73490 0.36097 4.806 2.3e-05 ***
I would like there to be 'SC_mudstone' and 'Purisima' (the other rock type) instead of the 'rock_unitSC_mudstone' it is giving me now.
This is the typical output for linear models: the variable rock_unitSC_mudstone is a dummy variable defined as
rock_unitSC_mudstone = 1 if rock_unit = SC_mudstone and 0 otherwise.
Adding a further variable rock_unitPurisima would cause the model matrix $X$ to not have full rank.
Anyway, you do not need the rock_unitPurisima variable. You can interpret the results as follows:
Expected retreat rate for SC_mudstone (at fracture_dens = 0) = -0.22631 + 1.73490
Expected retreat rate for Purisima (at fracture_dens = 0) = -0.22631
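You can verify this interpretation with predict() (a quick check using the fitted retreat object from the question; holding fracture_dens at 0 is my choice, since the intercept is the expected rate at zero fracture density):
# Predicted retreat rate for each rock type at zero fracture density:
newdat <- data.frame(fracture_dens = 0,
                     rock_unit = c("Purisima", "SC_mudstone"))
predict(retreat, newdata = newdat)
# Purisima:    -0.22631
# SC_mudstone: -0.22631 + 1.73490 = 1.50859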
If you insist on a variable rock_unitPurisima, you can set the intercept to zero:
retreat2 <- lm(retreat_rate ~ 0 + fracture_dens + rock_unit, data = coast)
But as I said, an intercept and both dummy variables would simply contain too much information.
Hope that this was helpful.

R - 2x2 mixed ANOVA with repeated measures simple effect analysis

I would like to ask how to correctly perform a simple main effects analysis in R when an interaction effect between the Group and Stage variables is present.
A friend of mine did the same analysis in SPSS (using a Bonferroni correction), and I am trying to reproduce his results in R.
I have data set of following structure:
ID Group Stage Y
1 I pre 0.123
1 I post 0.453
2 II pre 0.676
2 II post 0.867
3 I pre 0.324
3 I post 0.786
4 II pre 0.986
4 II post 0.112
... ... ... ...
This is a 2x2 mixed ANOVA design (1 between-subjects variable, Group, and 1 within-subjects variable, Stage, which constitutes a repeated measure of the dependent variable Y).
I analysed it using the ezANOVA function:
ezANOVA(data = dat, dv = y, wid = ID, between = Group, within = Stage, detailed = TRUE, type = "III")
I found a significant Stage*Group interaction, so I have to determine the simple effects using a Bonferroni correction. I have tried to do that with many methods. For example, to test the effect of Stage within one group, I tried:
dataControl <- subset(dat, Group == "control")
ezANOVA(data = dataControl, dv = y, wid = ID, within = Stage, detailed = TRUE, type = "III")  # method 1
aov(data = dataControl, y ~ Stage + Error(ID/Stage))  # method 2
t.test(y ~ Stage, data = dataControl, paired = TRUE)  # method 3
But every method gave me a different p-value, and none of them matched those calculated with SPSS. Interestingly, the main-effect p-values and the other calculations gave the same results in SPSS and R, so I conclude that I am using the wrong method for the simple main effects analysis.
I would be very thankful if you could help me.
If you want R to give you the same numbers as SPSS, do this:
#pairwise comparisons
library(asbio)
bonf <- pairw.anova(data$dv, data$group, method="bonf") #also try "tukey" or "lsd"
print(bonf)
# plot(bonf)  # can plot the CIs
This will give you the mean differences with lower and upper bounds, the decision, and the adjusted p-value for each pairwise comparison (the Diff, Lower, Upper, Decision, and Adj. p-value columns).
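Another route that often reproduces SPSS's EMMEANS ... COMPARE ADJ(BONFERRONI) output (my suggestion, untested on this data; it assumes the dat data frame from the question) is emmeans on the full mixed ANOVA, comparing Stage levels within each Group:
library(emmeans)
dat$ID <- factor(dat$ID)  # subject identifier must be a factor for Error()
fit <- aov(Y ~ Group * Stage + Error(ID / Stage), data = dat)
# Simple effects of Stage within each Group, Bonferroni-adjusted:
emm <- emmeans(fit, ~ Stage | Group)
pairs(emm, adjust = "bonferroni")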

Converting Repeated Measures mixed model formula from SAS to R

There are several questions and posts about mixed models for more complex experimental designs, so I thought this simpler model would help other beginners in this process, as well as me.
So, my question is how to formulate, in R, a repeated-measures ANCOVA equivalent to this SAS PROC MIXED procedure:
proc mixed data=df1;
  FitStatistics=akaike
  class GROUP person day;
  model Y = GROUP X1 / solution alpha=.1 cl;
  repeated / type=cs subject=person group=GROUP;
  lsmeans GROUP;
run;
Here is the SAS output using the data created in R (below):
Effect     panel  Estimate  Error    DF  t Value  Pr > |t|  Alpha  Lower    Upper
Intercept         -9.8693   251.04    7  -0.04    0.9697    0.1    -485.49  465.75
panel      1      -247.17   112.86    7  -2.19    0.0647    0.1    -460.99  -33.3510
panel      2      0         .         .  .        .         .      .        .
X1                20.4125   10.0228   7  2.04     0.0811    0.1    1.4235   39.4016
Below is how I formulated the model in R using the nlme package, but I am not getting similar coefficient estimates:
## create a reproducible example fake panel data set:
set.seed(94); subject.id = abs(round(rnorm(10) * 10000, 0))
set.seed(99); sds = rnorm(10, 15, 5); means = 1:10 * runif(10, 7, 13); trends = runif(10, 0.5, 2.5)
this = NULL; set.seed(98)
for (i in 1:10) { this = c(this, rnorm(6, mean = means[i], sd = sds[i]) * trends[i] * 1:6) }
set.seed(97)
that = sort(rep(rnorm(10, mean = 20, sd = 3), 6))
df1 = data.frame(day = rep(1:6, 10),
                 GROUP = c(rep('TEST', 30), rep('CONTROL', 30)),
                 Y = this,
                 X1 = that,
                 person = sort(rep(subject.id, 6)))
## use package nlme
require(nlme)
## run repeated measures mixed model using compound symmetry covariance structure:
summary(lme(Y ~ GROUP + X1,
            random = ~ 1 | person,
            correlation = corCompSymm(form = ~ day | person),
            na.action = na.exclude, data = df1, method = 'REML'))
Now, the output from R, which I now realize is similar to the output from lm():
Value Std.Error DF t-value p-value
(Intercept) -626.1622 527.9890 50 -1.1859379 0.2413
GROUPTEST -101.3647 156.2940 7 -0.6485518 0.5373
X1 47.0919 22.6698 7 2.0772934 0.0764
I believe I'm close as to the specification, but not sure what piece I'm missing to make the results match (within reason..). Any help would be appreciated!
UPDATE: Using the code in the answer below, the R output becomes:
> summary(model2)
Scroll to the bottom for the parameter estimates: look, identical to SAS.
Linear mixed-effects model fit by REML
Data: df1
AIC BIC logLik
776.942 793.2864 -380.471
Random effects:
Formula: ~GROUP - 1 | person
Structure: Diagonal
GROUPCONTROL GROUPTEST Residual
StdDev: 184.692 14.56864 93.28885
Correlation Structure: Compound symmetry
Formula: ~day | person
Parameter estimate(s):
Rho
-0.009929987
Variance function:
Structure: Different standard deviations per stratum
Formula: ~1 | GROUP
Parameter estimates:
TEST CONTROL
1.000000 3.068837
Fixed effects: Y ~ GROUP + X1
Value Std.Error DF t-value p-value
(Intercept) -9.8706 251.04678 50 -0.0393178 0.9688
GROUPTEST -247.1712 112.85945 7 -2.1900795 0.0647
X1 20.4126 10.02292 7 2.0365914 0.0811
Please try the following:
model1 <- lme(
Y ~ GROUP + X1,
random = ~ GROUP | person,
correlation = corCompSymm(form = ~ day | person),
na.action = na.exclude, data = df1, method = "REML"
)
summary(model1)
I think the random = ~ groupvar | subjvar option in R's lme provides a similar result to the repeated / subject = subjvar group = groupvar option in SAS/MIXED in this case.
Edit:
SAS/MIXED
R (a revised model2)
model2 <- lme(
Y ~ GROUP + X1,
random = list(person = pdDiag(form = ~ GROUP - 1)),
correlation = corCompSymm(form = ~ day | person),
weights = varIdent(form = ~ 1 | GROUP),
na.action = na.exclude, data = df1, method = "REML"
)
summary(model2)
So, I think these covariance structures are very similar (σ_g1 = τ_g^2 + σ_1).
Edit 2:
Covariance parameter estimates (SAS/MIXED):
Cov Parm   Subject   Group           Estimate
Variance   person    GROUP TEST       8789.23
CS         person    GROUP TEST        125.79
Variance   person    GROUP CONTROL    82775
CS         person    GROUP CONTROL    33297
So
TEST group diagonal element
= 125.79 + 8789.23
= 8915.02
CONTROL group diagonal element
= 33297 + 82775
= 116072
where diagonal element = σ_k1 + σ_k2.
Covariance parameter estimates (R lme):
Random effects:
Formula: ~GROUP - 1 | person
Structure: Diagonal
GROUP1TEST GROUP2CONTROL Residual
StdDev: 14.56864 184.692 93.28885
Correlation Structure: Compound symmetry
Formula: ~day | person
Parameter estimate(s):
Rho
-0.009929987
Variance function:
Structure: Different standard deviations per stratum
Formula: ~1 | GROUP
Parameter estimates:
1TEST 2CONTROL
1.000000 3.068837
So
TEST group diagonal element
= 14.56864^2 + (3.068837^0.5 * 93.28885 * -0.009929987) + 93.28885^2
= 8913.432
CONTROL group diagonal element
= 184.692^2 + (3.068837^0.5 * 93.28885 * -0.009929987) + (3.068837 * 93.28885)^2
= 116070.5
where diagonal element = τ_g^2 + σ_1 + σ_g^2.
Oooh, this is going to be a tricky one, and if it's even possible using standard nlme functions, it's going to take some serious study of Pinheiro & Bates.
Before you spend the time doing that, though, you should make absolutely sure that this is the exact model you need. Perhaps there's something else that fits the story of your data better. Or maybe there's something R can do more easily that is just as good, but not quite the same.
First, here's my take on what you're doing in SAS with this line:
repeated / type=cs subject=person group=GROUP;
This type=cs subject=person is inducing correlation between all the measurements on the same person, and that correlation is the same for all pairs of days. The group=GROUP is allowing the correlation for each group to be different.
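In matrix form (my addition, to make the structure concrete), CS for one subject's six days is

$$\Sigma_{\text{CS}} = \sigma^2 I_6 + \sigma_1 J_6 =
\begin{pmatrix}
\sigma^2 + \sigma_1 & \sigma_1 & \cdots & \sigma_1 \\
\sigma_1 & \sigma^2 + \sigma_1 & \cdots & \sigma_1 \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_1 & \sigma_1 & \cdots & \sigma^2 + \sigma_1
\end{pmatrix},$$

and group=GROUP gives each group its own pair $(\sigma_g^2, \sigma_{g1})$.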
In contrast, here's my take on what your R code is doing:
random = ~ +1 | person,
correlation=corCompSymm(form=~day|person)
This code is actually adding almost the same effect in two different ways; the random line is adding a random effect for each person, and the correlation line is inducing correlation between all the measurements on the same person. However, these two things are almost identical; if the correlation is positive, you get the exact same result by including either of them. I'm not sure what happens when you include both, but I do know that only one is necessary. Regardless, this code has the same correlation for all individuals, it's not allowing each group to have their own correlation.
To let each group have their own correlation, I think you have to build a more complicated correlation structure up out of two different pieces; I've never done this but I'm pretty sure I remember Pinheiro/Bates doing it.
You might consider instead adding a random effect for person and then letting the variance differ between the groups with weights=varIdent(form=~1|group) (from memory, check my syntax, please), as sketched below. This won't be quite the same but tells a similar story. The story in SAS is that the measurements on some individuals are more correlated than the measurements on other individuals. Thinking about what that means, the measurements for individuals with higher correlation will be closer together than the measurements for individuals with lower correlation. In contrast, the story in R is that the variability of measurements within individuals varies; thinking about that, measurements with higher variability will have lower correlation. So they tell similar stories, but come at it from opposite sides.
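Spelling that suggestion out as code (my sketch; the varIdent syntax follows its use in model2 earlier in this thread, and GROUP, person, and df1 are the objects from the question):
# Random intercept per person plus a group-specific residual variance.
library(nlme)
model3 <- lme(Y ~ GROUP + X1,
              random = ~ 1 | person,
              weights = varIdent(form = ~ 1 | GROUP),
              na.action = na.exclude, data = df1, method = "REML")
summary(model3)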
It is even possible (but I would be surprised) that these two models end up being different parameterizations of the same thing. My intuition is that the overall measurement variability will be different in some way. But even if they aren't the same thing, it would be worth writing out the parameterizations just to be sure you understand them and to make sure that they are appropriately describing the story of your data.
