I am using glmmTMB to analyze a negative binomial generalized linear mixed model (GLMM) where the dependent variable is count data (CT), which is over-dispersed.
There are 115 samples (rows) in the relevant data frame. There are two fixed effects (F1, F2) and a random intercept (R), within which is nested a further random effect (NR). There is also an offset, consisting of the natural logarithm of the total counts in each sample (LOG_TOT).
An example of a data frame, df, is:
CT F1 F2 R NR LOG_TOT
77 0 0 1 1 12.9
167 0 0 2 6 13.7
289 0 0 3 11 13.9
253 0 0 4 16 13.9
125 0 0 5 21 13.7
109 0 0 6 26 13.6
96 1 0 1 2 13.1
169 1 0 2 7 13.7
190 1 0 3 12 13.8
258 1 0 4 17 13.9
101 1 0 5 22 13.5
94 1 0 6 27 13.5
89 1 25 1 4 13.0
166 1 25 2 9 13.6
175 1 25 3 14 13.7
221 1 25 4 19 13.8
131 1 25 5 24 13.5
118 1 25 6 29 13.6
58 1 75 1 5 12.9
123 1 75 2 10 13.4
197 1 75 3 15 13.7
208 1 75 4 20 13.8
113 1 8 1 3 13.2
125 1 8 2 8 13.7
182 1 8 3 13 13.7
224 1 8 4 18 13.9
104 1 8 5 23 13.5
116 1 8 6 28 13.7
122 2 0 1 2 13.1
115 2 0 2 7 13.6
149 2 0 3 12 13.7
270 2 0 4 17 14.1
116 2 0 5 22 13.5
94 2 0 6 27 13.7
73 2 25 1 4 12.8
61 2 25 2 9 13.0
185 2 25 3 14 13.8
159 2 25 4 19 13.7
125 2 25 5 24 13.6
75 2 25 6 29 13.5
121 2 8 1 3 13.0
143 2 8 2 8 13.8
219 2 8 3 13 13.9
191 2 8 4 18 13.7
98 2 8 5 23 13.5
115 2 8 6 28 13.6
110 3 0 1 2 12.8
123 3 0 2 7 13.6
210 3 0 3 12 13.9
354 3 0 4 17 14.4
160 3 0 5 22 13.7
101 3 0 6 27 13.6
69 3 25 1 4 12.6
112 3 25 2 9 13.5
258 3 25 3 14 13.8
174 3 25 4 19 13.5
171 3 25 5 24 13.9
117 3 25 6 29 13.7
38 3 75 1 5 12.1
222 3 75 2 10 14.1
204 3 75 3 15 13.5
235 3 75 4 20 13.7
241 3 75 5 25 13.8
141 3 75 6 30 13.9
113 3 8 1 3 12.9
90 3 8 2 8 13.5
276 3 8 3 13 14.1
199 3 8 4 18 13.8
111 3 8 5 23 13.6
109 3 8 6 28 13.7
135 4 0 1 2 13.1
144 4 0 2 7 13.6
289 4 0 3 12 14.2
395 4 0 4 17 14.6
154 4 0 5 22 13.7
148 4 0 6 27 13.8
58 4 25 1 4 12.8
136 4 25 2 9 13.8
288 4 25 3 14 14.0
113 4 25 4 19 13.5
162 4 25 5 24 13.7
172 4 25 6 29 14.1
2 4 75 1 5 12.3
246 4 75 3 15 13.7
247 4 75 4 20 13.9
114 4 8 1 3 13.1
107 4 8 2 8 13.6
209 4 8 3 13 14.0
190 4 8 4 18 13.9
127 4 8 5 23 13.5
101 4 8 6 28 13.7
167 6 0 1 2 13.4
131 6 0 2 7 13.5
369 6 0 3 12 14.5
434 6 0 4 17 14.9
172 6 0 5 22 13.8
126 6 0 6 27 13.8
90 6 25 1 4 13.1
172 6 25 2 9 13.7
330 6 25 3 14 14.2
131 6 25 4 19 13.7
151 6 25 5 24 13.9
141 6 25 6 29 14.2
7 6 75 1 5 12.2
194 6 75 2 10 14.2
280 6 75 3 15 13.7
253 6 75 4 20 13.8
45 6 75 5 25 13.4
155 6 75 6 30 13.9
208 6 8 1 3 13.5
97 6 8 2 8 13.5
325 6 8 3 13 14.3
235 6 8 4 18 14.1
112 6 8 5 23 13.6
188 6 8 6 28 14.1
The random and nested random effects are treated as factors. The fixed effect F1 has the values 0, 1, 2, 3, 4 and 6. The fixed effect F2 has the values 0, 8, 25 and 75. I am treating the fixed effects as continuous, rather than ordinal, because I would like to identify monotonic, unidirectional changes in the dependent variable CT rather than up-and-down changes.
I previously used the lme4 package to analyze the data as a mixed model:
library(lme4)
m1 <- lmer(CT ~ F1*F2 + (1|R/NR) +
offset(LOG_TOT), data = df, verbose=FALSE)
Followed by the use of glht in the multcomp package for post-hoc analysis employing the formula approach:
library(multcomp)
glht_fixed1 <- glht(m1, linfct = c(
"F1 == 0",
"F1 + 8*F1:F2 == 0",
"F1 + 25*F1:F2 == 0",
"F1 + 75*F1:F2 == 0",
"F1 + (27)*F1:F2 == 0"))
glht_fixed2 <- glht(m1, linfct = c(
"F2 + 1*F1:F2 == 0",
"F2 + 2*F1:F2 == 0",
"F2 + 3*F1:F2 == 0",
"F2 + 4*F1:F2 == 0",
"F2 + 6*F1:F2 == 0",
"F2 + (3.2)*F1:F2 == 0"))
glht_omni <- glht(m1)
Here is the corresponding negative binomial glmmTMB model, which I now prefer:
library(glmmTMB)
m2 <- glmmTMB(CT ~ F1*F2 + (1|R/NR) +
offset(LOG_TOT), data = df, verbose=FALSE, family="nbinom2")
According to this suggestion by Ben Bolker (https://stat.ethz.ch/pipermail/r-sig-mixed-models/2017q3/025813.html), the best approach to post hoc testing with glmmTMB is to use lsmeans (or its more recent successor, emmeans).
I followed Ben's suggestion, running
source(system.file("other_methods","lsmeans_methods.R",package="glmmTMB"))
and I can then use emmeans on the glmmTMB object. For example,
as.glht(emmeans(m2,~(F1 + 27*F1:F2)))
General Linear Hypotheses
Linear Hypotheses:
Estimate
3.11304347826087, 21 == 0 -8.813
But this does not seem correct. I can also change F1 and F2 to factors and then try this:
as.glht(emmeans(m2, ~(F1 + 27*F1:F2)))
General Linear Hypotheses
Linear Hypotheses:
Estimate
0, 0 == 0 -6.721
1, 0 == 0 -6.621
2, 0 == 0 -6.342
3, 0 == 0 -6.740
4, 0 == 0 -6.474
6, 0 == 0 -6.967
0, 8 == 0 -6.694
1, 8 == 0 -6.651
2, 8 == 0 -6.227
3, 8 == 0 -6.812
4, 8 == 0 -6.371
6, 8 == 0 -6.920
0, 25 == 0 -6.653
1, 25 == 0 -6.648
2, 25 == 0 -6.282
3, 25 == 0 -6.766
4, 25 == 0 -6.338
6, 25 == 0 -6.702
0, 75 == 0 -6.470
1, 75 == 0 -6.642
2, 75 == 0 -6.091
3, 75 == 0 -6.531
4, 75 == 0 -5.762
6, 75 == 0 -6.612
But, again, I am not sure how to bend this output to my will. If some kind person could tell me how to correctly carry over the use of formulae in glht and linfct to the emmeans scenario with glmmTMB, I would be very grateful. I have read all the manuals and vignettes until I am blue in the face (or it feels that way, at least), but I am still at a loss. In my defense (culpability?), I am a statistical tyro, so many apologies if I am asking a question with very obvious answers here.
The glht software and post hoc testing carry directly over to the glmmADMB package, but glmmADMB is 10x slower than glmmTMB. I need to perform multiple runs of this analysis, each with 300,000 examples of the negative binomial mixed model, so speed is essential.
Many thanks for your suggestions and help!
The second argument (specs) to emmeans is not the same as the linfct argument in glht, so you can't use it in the same way. You have to call emmeans() using it the way it was intended. The as.glht() function converts the result to a glht object, but it really is not necessary to do that as the emmeans summary yields similar results.
I think the results you were trying to get are obtainable via
emmeans(m2, ~ F2, at = list(F2 = c(0, 8, 25, 75)))
(using the original model with the predictors as quantitative variables). This will compute the adjusted means holding F1 at its average, and at each of the specified values of F2.
Please look at the documentation for emmeans(). In addition, there are lots of vignettes that provide explanations and examples -- starting with https://cran.r-project.org/web/packages/emmeans/vignettes/basics.html.
Following the advice of my excellent statistical consultant, I think the solution below provides what I had previously obtained using glht and linfct.
The slopes for F1 are calculated at the various levels of F2 by using contrast and emmeans to compute the differences in the dependent variable between two values of F1 separated by one unit (i.e. c(0,1)). (Since the regression is linear, the two values of F1 are arbitrary, provided they are separated by one unit, e.g. c(3,4).) Vice versa for the slopes of F2.
Thus, slopes of F1 at F2 = 0, 8, 25, 75 and 27 (27 is average of F2):
contrast(emmeans(m2, specs="F1", at=list(F1=c(0,1), F2=0)), list(c(-1,1)))
(above equivalent to: summary(m2)$coefficients$cond["F1",])
contrast(emmeans(m2, specs="F1", at=list(F1=c(0,1), F2=8)), list(c(-1,1)))
contrast(emmeans(m2, specs="F1", at=list(F1=c(0,1), F2=25)), list(c(-1,1)))
contrast(emmeans(m2, specs="F1", at=list(F1=c(0,1), F2=75)), list(c(-1,1)))
contrast(emmeans(m2, specs="F1", at=list(F1=c(0,1), F2=27)), list(c(-1,1)))
and slopes of F2 at F1 = 0, 1, 2, 3, 4, 6 and 3.2 (3.2 is the average of F1, excluding the zero value):
contrast(emmeans(m2, specs="F2", at=list(F2=c(0,1), F1=0)), list(c(-1,1)))
(above equivalent to: summary(m2)$coefficients$cond["F2",])
contrast(emmeans(m2, specs="F2", at=list(F2=c(0,1), F1=1)), list(c(-1,1)))
contrast(emmeans(m2, specs="F2", at=list(F2=c(0,1), F1=2)), list(c(-1,1)))
contrast(emmeans(m2, specs="F2", at=list(F2=c(0,1), F1=3)), list(c(-1,1)))
contrast(emmeans(m2, specs="F2", at=list(F2=c(0,1), F1=4)), list(c(-1,1)))
contrast(emmeans(m2, specs="F2", at=list(F2=c(0,1), F1=6)), list(c(-1,1)))
contrast(emmeans(m2, specs="F2", at=list(F2=c(0,1), F1=3.2)), list(c(-1,1)))
Interaction of F1 and F2 slopes at F1 = 0 and F2 = 0:
contrast(emmeans(m2, specs=c("F1","F2"), at=list(F1=c(0,1), F2=c(0,1))), list(c(1,-1,-1,1)))
(above equivalent to: summary(m2)$coefficients$cond["F1:F2",])
From the resulting emmGrid objects provided by contrast(), one can pick out as desired the estimate of the slope (estimate), the standard error of the estimated slope (SE), the Z score for the difference of the estimated slope from a null-hypothesized slope of zero (z.ratio, calculated by emmGrid as estimate divided by SE), and the corresponding P value (p.value, calculated by emmGrid as 2*pnorm(-abs(z.ratio))).
For example:
contrast(emmeans(m2, specs="F2", at=list(F2=c(0,1), F1=0)), list(c(-1,1)))
yields:
NOTE: Results may be misleading due to involvement in interactions
contrast estimate SE df z.ratio p.value
c(-1, 1) 0.001971714 0.002616634 NA 0.754 0.4511
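As a check, the p.value in the printout can be reproduced by hand from the rounded z.ratio:

```r
z <- 0.754                # z.ratio = estimate / SE
p <- 2 * pnorm(-abs(z))   # two-sided p under a standard normal
round(p, 4)               # approximately 0.451, matching the printout up to rounding
```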
Postscript added 1.25 yrs later:
The above gives the correct solutions but, as Russell Lenth pointed out, the answers are more easily obtained using emtrends. However, I have selected this answer as correct, since there may be some didactic value in showing how to calculate slopes using emmeans to find the resulting change in the predicted dependent variable when the independent variable changes by 1.
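For reference, the emtrends route would look something like this (a sketch, assuming the glmmTMB model m2 from above and the emmeans package; the at values mirror those used earlier and are not run here):

```r
library(emmeans)

# Slopes of F1 at chosen values of F2 (27 is the mean of F2)
emtrends(m2, ~ F2, var = "F1", at = list(F2 = c(0, 8, 25, 75, 27)))

# Slopes of F2 at chosen values of F1 (3.2 is the mean of the nonzero F1 values)
emtrends(m2, ~ F1, var = "F2", at = list(F1 = c(0, 1, 2, 3, 4, 6, 3.2)))
```

Each call returns the fitted slope, its SE, and a test against zero directly, without constructing one-unit contrasts by hand.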
I have a large data set consisting of factor variables, numeric variables, and a target column that I'm trying to feed into xgboost, with the objective of making an xgb.DMatrix and training a model.
I'm confused about the proper processing to get my data frame into an xgb.DMatrix object. Specifically, I have NAs in both factor and numeric variables, and I want to make a sparse.model.matrix from my data frame before creating the xgb.DMatrix. The proper handling of the NAs is really tripping me up.
I have the following sample data frame df, consisting of one binary categorical variable, two continuous variables, and a target. The categorical variable and one continuous variable have NAs:
'data.frame': 10 obs. of 4 variables:
$ v1 : Factor w/ 2 levels "0","1": 1 2 2 1 NA 2 1 1 NA 2
$ v2 : num 3.2 5.4 8.3 NA 7.1 8.2 9.4 NA 9.9 4.2
$ v3 : num 22.1 44.1 57 64.2 33.1 56.9 71.2 33.9 89.3 97.2
$ target: Factor w/ 2 levels "0","1": 1 1 2 2 1 1 1 2 1 1
v1 v2 v3 target
1 0 3.2 22.1 0
2 1 5.4 44.1 0
3 1 8.3 57.0 1
4 0 NA 64.2 1
5 <NA> 7.1 33.1 0
6 1 8.2 56.9 0
7 0 9.4 71.2 0
8 0 NA 33.9 1
9 <NA> 9.9 89.3 0
10 1 4.2 97.2 0
sparse.model.matrix from the Matrix package won't accept NAs. It eliminates the rows (which I don't want), so I'll need to change the NAs to a numeric replacement like -999.
if I use the simple command:
df[is.na(df)] = -999
it only replaces the NAs in the numeric columns:
v1 v2 v3 target
1 0 3.2 22.1 0
2 1 5.4 44.1 0
3 1 8.3 57.0 1
4 0 -999.0 64.2 1
5 <NA> 7.1 33.1 0
6 1 8.2 56.9 0
7 0 9.4 71.2 0
8 0 -999.0 33.9 1
9 <NA> 9.9 89.3 0
10 1 4.2 97.2 0
So I (think I) first need to change the factor variables to numeric and then do the substitution. Doing that I get:
v1 v2 v3 target
1 1 3.2 22.1 0
2 2 5.4 44.1 0
3 2 8.3 57.0 1
4 1 -999.0 64.2 1
5 -999 7.1 33.1 0
6 2 8.2 56.9 0
7 1 9.4 71.2 0
8 1 -999.0 33.9 1
9 -999 9.9 89.3 0
10 2 4.2 97.2 0
but converting the factor variable back to a factor (I think this is necessary so xgboost will later know it's a factor), I get three levels:
'data.frame': 10 obs. of 4 variables:
$ v1 : Factor w/ 3 levels "-999","1","2": 2 3 3 2 1 3 2 2 1 3
$ v2 : num 3.2 5.4 8.3 -999 7.1 8.2 9.4 -999 9.9 4.2
$ v3 : num 22.1 44.1 57 64.2 33.1 56.9 71.2 33.9 89.3 97.2
$ target: Factor w/ 2 levels "0","1": 1 1 2 2 1 1 1 2 1 1
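The sequence above can be reproduced with a base-R sketch (using the sample data from the question; note that as.numeric() on a factor returns the level codes 1 and 2, not the labels "0" and "1"):

```r
df <- data.frame(
  v1     = factor(c("0", "1", "1", "0", NA, "1", "0", "0", NA, "1")),
  v2     = c(3.2, 5.4, 8.3, NA, 7.1, 8.2, 9.4, NA, 9.9, 4.2),
  v3     = c(22.1, 44.1, 57, 64.2, 33.1, 56.9, 71.2, 33.9, 89.3, 97.2),
  target = factor(c(0, 0, 1, 1, 0, 0, 0, 1, 0, 0))
)

df$v1 <- as.numeric(df$v1)   # level codes: "0" -> 1, "1" -> 2; NA stays NA
df[is.na(df)] <- -999        # now replaces NAs in both v1 and v2
df$v1 <- factor(df$v1)       # back to a factor: three levels "-999", "1", "2"
```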
I'm now not sure that making the sparse.model.matrix, and ultimately the xgb.DMatrix object, will be meaningful, because v1 appears messed up.
To make matters more confusing, xgb.DMatrix() has a missing argument that I can use to identify numeric values (such as -999) that represent NAs. But this can only be used for a dense matrix. If I submitted the dense matrix I'd just have the NAs and wouldn't need it; in the sparse matrix, where I have -999s, I can't use it.
I hope I'm not overlooking something easy. Been through xgboost.pdf extensively and looked on Google.
Please help. Thanks in advance.
options(na.action='na.pass'), as mentioned by @mtoto, is the best way to deal with this problem. It will make sure that you don't lose any data while building the model matrix.
Specifically, the XGBoost implementation handles NAs by checking which split direction gives the higher gain while growing the tree. For example, if a split on a variable var1 (range [0,1]) is chosen at the value 0.5, XGBoost calculates the gain from treating var1's NAs as < 0.5 and as > 0.5, and assigns the NAs to whichever direction yields more gain. So the NAs end up with a default range, [0,0.5] or [0.5,1], but no actual value is imputed for them. Refer to the original author tqchen's comment on Aug 12, 2014.
If you impute -99xxx there, you are limiting the algorithm's ability to learn the NAs' proper range (conditional on the labels).
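A minimal sketch of the na.pass route (assuming the sample df from the question, with the Matrix and xgboost packages installed; not run here):

```r
library(Matrix)
library(xgboost)

options(na.action = "na.pass")                       # keep rows containing NA
X <- sparse.model.matrix(target ~ . - 1, data = df)  # NAs propagate into the matrix
y <- as.numeric(as.character(df$target))             # 0/1 label vector

dtrain <- xgb.DMatrix(data = X, label = y)           # NA is treated as missing by default
bst <- xgb.train(params = list(objective = "binary:logistic"),
                 data = dtrain, nrounds = 10)
```

With this approach no -999 imputation is needed, and xgboost learns the split direction for the missing values itself.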
I have a pretty large data frame in R stored in long form. It contains body temperature data collected from 40 different individuals, with 10 sec intervals, over 16 days. Individuals have been exposed to conditions (cond1 and cond2). It essentially looks like this:
ID Cond1 Cond2 Day ToD Temp
1 A B 1 18.0 37.1
1 A B 1 18.3 37.2
1 A B 2 18.6 37.5
2 B A 1 18.0 37.0
2 B A 1 18.3 36.9
2 B A 2 18.6 36.9
3 A A 1 18.0 36.8
3 A A 1 18.3 36.7
3 A A 2 18.6 36.7
...
I want to create four separate line plots, one for each combination of conditions (AB, BA, AA, BB), that show mean temp over time (days 1-16).
p.s. ToD stands for time of day. Not sure if I need to provide it in order to create the plot.
So far I have tried to define the dataset as time series by doing
ts <- ts(data=dataset$Temp, start=1, end=16, frequency=8640)
plot(ts)
This returns a plot of Temp, but I can't figure out how to define condition values for breaking up the data.
Edit:
Essentially I want a plot like the linked example, but one for each group separately, and using mean Temp values. The linked plot is just for one individual in one condition, and I want one that shows the mean for all individuals in the same condition.
You can use summarise and group_by to group the data by condition and then plot it. Is this what you're looking for?
library(dplyr)
library(ggplot2)
## I created a dataframe df that looks like this:
ID Cond1 Cond2 Day ToD Temp
1 1 A B 1 18.0 37.1
2 1 A B 1 18.3 37.2
3 1 A B 2 18.6 37.5
4 2 B A 1 18.0 37.0
5 2 B A 1 18.3 36.9
6 2 B A 2 18.6 36.9
7 3 A A 1 18.0 36.8
8 3 A A 1 18.3 36.7
9 3 A A 2 18.6 36.7
df$Cond <- paste0(df$Cond1, df$Cond2)
d <- summarise(group_by(df, Cond, Day), t = mean(Temp))
ggplot(d, aes(Day, t, color = Cond)) + geom_line()
which results in a single plot with one line per condition combination.
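If you want four separate panels rather than colored lines on one plot, a facet_wrap variant of the same idea would be (a sketch, assuming the same df and the dplyr and ggplot2 packages; not run here):

```r
library(dplyr)
library(ggplot2)

df$Cond <- paste0(df$Cond1, df$Cond2)
d <- df %>%
  group_by(Cond, Day) %>%
  summarise(t = mean(Temp), .groups = "drop")

ggplot(d, aes(Day, t)) +
  geom_line() +
  facet_wrap(~ Cond)   # one panel per condition combination (AA, AB, BA, BB)
```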
I'm not sure how best to describe this, so I'll just show you. I have two variables.
A:
ID
1 121
2 122
3 123
4 124
5 125
6 126
7 127
8 128
9 129
and B:
var1 var2 var3
1 57.1 116.5 73.0
2 38.1 15.8 22.7
3 84.2 99.2 72.2
and I would like them to end up as such:
ID
1 121 57.1
2 122 116.5
3 123 73.0
4 124 38.1
5 125 15.8
6 126 22.7
7 127 84.2
8 128 99.2
9 129 72.2
Does that make sense? I'd like to maintain the original variable and add a column that contains the rows, in order, of the other variable. Preferably I'd like this as a data frame.
Thanks in advance.
data.frames and matrices are filled by column by default. As you want to create a numeric vector filled by row, you will need to transpose the data.frame before coercing it to a numeric vector, so it will be in the order you want.
A$value <- c(t(B))
Transposing a data.frame gives a matrix, which is coerced to a numeric vector by c().
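A self-contained check with the data from the question confirms the row-wise order:

```r
A <- data.frame(ID = 121:129)
B <- data.frame(var1 = c(57.1, 38.1, 84.2),
                var2 = c(116.5, 15.8, 99.2),
                var3 = c(73.0, 22.7, 72.2))

A$value <- c(t(B))   # t() makes c() read B row by row
A$value[1:3]         # 57.1 116.5 73.0, as in the desired output
```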
Assuming B is a data.frame, you can do:
cbind(A, var.name = as.vector(t(as.matrix(B))))
Note the t(): as.vector() reads a matrix column by column, so the matrix must be transposed first to get the row-by-row order shown above. You can pass the new column name instead of var.name.