Extracting logLik from multiple glmer models into a data frame

I'm trying to make an AIC table as described by Anderson et al. (Suggestions for Presenting the Results of Data Analyses. J. Wildl. Manage. 65(3), 2001). To do so, I have been using ICtab from the bbmle package.
aic2 <- ICtab(du1500o, du1500a, du1500b, du1500c, du1500d, du1500e, du1500f,
              du1500g, du1500h, du1500i, du1500j, du1500k, du1500l,
              type = "AICc", weights = TRUE, delta = TRUE, sort = TRUE,
              base = TRUE, nobs = 87)
This gives me a nice little ranked table.
         AICc   df  dAICc  weight
du1500a  402.4  6    0.0   0.4201
du1500k  403.4  6    1.0   0.2580
du1500j  404.4  6    2.0   0.1520
du1500f  405.6  6    3.2   0.0842
du1500g  406.8  6    4.4   0.0459
du1500e  408.7  6    6.3   0.0176
du1500d  410.2  6    7.8   0.0084
du1500l  410.7  6    8.3   0.0065
du1500i  412.3  6    9.9   0.0030
du1500o  412.4  4   10.0   0.0029
du1500c  415.2  6   12.8   <0.001
du1500h  416.6  6   14.2   <0.001
du1500b  416.6  6   14.2   <0.001
I want to add the log likelihood of each model to this table. I can extract one log likelihood at a time with the logLik function, but I have been unable to find a more efficient way to extract the LL from all of the models at once.
head(logLik(du1500a))
[1] -194.6726
Should I turn aic2 into a data frame? How?
If so, how can I then efficiently append the LL of each model to that data frame?
Much appreciated,
Nava
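One way to approach this, sketched under assumptions: put the fitted models in a named list and loop over it with sapply(), then assemble the table yourself as a data frame. The three lm() fits on mtcars below are hypothetical stand-ins for the glmer fits (du1500a and so on), and the AICc is computed by hand from logLik rather than taken from ICtab. As far as I recall, ICtab also accepts a logLik = TRUE argument, which may be the shorter route if your bbmle version has it.

```r
# Hypothetical stand-ins for the fitted glmer models; any objects
# that have a logLik() method can be placed in this named list.
models <- list(
  m_wt    = lm(mpg ~ wt, data = mtcars),
  m_hp    = lm(mpg ~ hp, data = mtcars),
  m_wt_hp = lm(mpg ~ wt + hp, data = mtcars)
)

n  <- nrow(mtcars)                                        # number of observations
ll <- sapply(models, function(m) as.numeric(logLik(m)))   # log likelihoods
k  <- sapply(models, function(m) attr(logLik(m), "df"))   # estimated parameters

# Small-sample corrected AIC, computed directly from the log likelihood
aicc <- -2 * ll + 2 * k + 2 * k * (k + 1) / (n - k - 1)

tab <- data.frame(logLik = ll, df = k, AICc = aicc)
tab$dAICc  <- tab$AICc - min(tab$AICc)                    # difference from best model
tab$weight <- exp(-tab$dAICc / 2) / sum(exp(-tab$dAICc / 2))
tab <- tab[order(tab$AICc), ]                             # rank by AICc
print(round(tab, 3))
```

Because the list is named, the model names carry through as row names, so the log likelihoods stay matched to the right models after sorting.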

Related

R - two types of prediction in cross validation

When I use cross-validation on my data, it gives me two types of prediction: cvpred and Predicted. What is the difference between the two? I guess cvpred is the cross-validation prediction, but what is the other?
Here is some of my code:
crossvalpredict <- cv.lm(data = total,form.lm = formula(verim~X4+X4.1),m=5)
And this is the result:
fold 1
Observations in test set: 5
3 11 15 22 23
Predicted 28.02 32.21 26.53 25.1 21.28
cvpred 20.23 40.69 26.57 34.1 26.06
verim 30.00 31.00 28.00 24.0 20.00
CV residual 9.77 -9.69 1.43 -10.1 -6.06
Sum of squares = 330 Mean square = 66 n = 5
fold 2
Observations in test set: 5
2 7 21 24 25
Predicted 28.4 32.0 26.2 19.95 25.9
cvpred 52.0 81.8 36.3 14.28 90.1
verim 30.0 33.0 24.0 21.00 24.0
CV residual -22.0 -48.8 -12.3 6.72 -66.1
Sum of squares = 7428 Mean square = 1486 n = 5
fold 3
Observations in test set: 5
6 14 18 19 20
Predicted 34.48 36.93 19.0 27.79 25.13
cvpred 37.66 44.54 16.7 21.15 7.91
verim 33.00 35.00 18.0 31.00 26.00
CV residual -4.66 -9.54 1.3 9.85 18.09
Sum of squares = 539 Mean square = 108 n = 5
fold 4
Observations in test set: 5
1 4 5 9 13
Predicted 31.91 29.07 32.5 32.7685 28.9
cvpred 30.05 28.44 54.9 32.0465 11.4
verim 32.00 27.00 31.0 32.0000 30.0
CV residual 1.95 -1.44 -23.9 -0.0465 18.6
Sum of squares = 924 Mean square = 185 n = 5
fold 5
Observations in test set: 5
8 10 12 16 17
Predicted 27.8 30.28 26.0 27.856 35.14
cvpred 50.3 33.92 45.8 31.347 29.43
verim 28.0 30.00 24.0 31.000 38.00
CV residual -22.3 -3.92 -21.8 -0.347 8.57
Sum of squares = 1065 Mean square = 213 n = 5
Overall (Sum over all 5 folds)
ms
411
You can check that by reading the help page of the function you are using, cv.lm. There you will find this paragraph:
The input data frame is returned, with additional columns
‘Predicted’ (Predicted values using all observations) and ‘cvpred’
(cross-validation predictions). The cross-validation residual sum
of squares (‘ss’) and degrees of freedom (‘df’) are returned as
attributes of the data frame.
This says that Predicted is a vector of predicted values made using all the observations. In other words, these appear to be predictions made on your "training" data, i.e. "in sample".
To check whether this is so, you can fit the same model using lm:
fit <- lm(verim~X4+X4.1, data=total)
And see if the predicted values from this model:
predict(fit)
are the same as those returned by cv.lm
When I tried it on the iris dataset in R, the Predicted column from cv.lm() matched the values from predict() on the lm fit. So in that case they are in-sample predictions, where the model is fitted and evaluated on the same observations.
lm() does not give "better results." I am not sure how predict() and cv.lm() can be the same. predict() returns the expected value of Y for each observation, computed from the fitted model (the covariates X and their estimated beta coefficients). Those coefficients, and the model error, were estimated from that same data, so evaluating the model with predict() on it gives an overly optimistic estimate of model performance. That is why it seems better. You get a more realistic estimate of model performance from an iterated holdout technique such as cross-validation (CV). The least biased estimate comes from leave-one-out CV, and the estimate with the least uncertainty comes from 2-fold (K=2) CV.
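The difference between the two columns can be reproduced by hand. This is a minimal sketch in base R, not the cv.lm implementation: it assumes mtcars and the formula mpg ~ wt + hp as stand-ins for the poster's data and model, and assigns folds at random.

```r
set.seed(1)                      # fold assignment is random
dat  <- mtcars                   # stand-in for the poster's 'total' data
k    <- 5
fold <- sample(rep(seq_len(k), length.out = nrow(dat)))

# "Predicted": one model fitted on all rows, predicting those same rows
in_sample <- predict(lm(mpg ~ wt + hp, data = dat))

# "cvpred": each row is predicted by a model that never saw it
cvpred <- numeric(nrow(dat))
for (i in seq_len(k)) {
  test  <- fold == i
  fit_i <- lm(mpg ~ wt + hp, data = dat[!test, ])
  cvpred[test] <- predict(fit_i, newdata = dat[test, ])
}

mse_in <- mean((dat$mpg - in_sample)^2)   # in-sample mean squared error
mse_cv <- mean((dat$mpg - cvpred)^2)      # cross-validated mean squared error
c(in_sample = mse_in, cross_validated = mse_cv)
```

Comparing the two mean squared errors shows how much of the apparent accuracy comes from predicting rows the model was fitted on.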

Find where species accumulation curve reaches asymptote

I have used the specaccum() command to develop species accumulation curves for my samples.
Here is some example data:
site1<-c(0,8,9,7,0,0,0,8,0,7,8,0)
site2<-c(5,0,9,0,5,0,0,0,0,0,0,0)
site3<-c(5,0,9,0,0,0,0,0,0,6,0,0)
site4<-c(5,0,9,0,0,0,0,0,0,0,0,0)
site5<-c(5,0,9,0,0,6,6,0,0,0,0,0)
site6<-c(5,0,9,0,0,0,6,6,0,0,0,0)
site7<-c(5,0,9,0,0,0,0,0,7,0,0,3)
site8<-c(5,0,9,0,0,0,0,0,0,0,1,0)
site9<-c(5,0,9,0,0,0,0,0,0,0,1,0)
site10<-c(5,0,9,0,0,0,0,0,0,0,1,6)
site11<-c(5,0,9,0,0,0,5,0,0,0,0,0)
site12<-c(5,0,9,0,0,0,0,0,0,0,0,0)
site13<-c(5,1,9,0,0,0,0,0,0,0,0,0)
species_counts<-rbind(site1,site2,site3,site4,site5,site6,site7,site8,site9,site10,site11,site12,site13)
accum <- specaccum(species_counts, method="random", permutations=100)
plot(accum)
To make sure I have sampled sufficiently, I need to check that the species accumulation curve reaches an asymptote, defined as a slope of <0.3 between the last two points (i.e. between sites 12 and 13).
results <- with(accum, data.frame(sites, richness, sd))
Produces this:
sites richness sd
1 1 3.46 0.9991916
2 2 4.94 1.6625403
3 3 5.94 1.7513054
4 4 7.05 1.6779918
5 5 8.03 1.6542263
6 6 8.74 1.6794660
7 7 9.32 1.5497149
8 8 9.92 1.3534841
9 9 10.51 1.0492422
10 10 11.00 0.8408750
11 11 11.35 0.7017295
12 12 11.67 0.4725816
13 13 12.00 0.0000000
I feel like I'm getting there. I could generate an lm with site vs richness and extract the exact slope (tangent?) between sites 12 and 13. Going to search a bit longer here.
Streamlining your data generation process a little bit:
species_counts <- matrix(c(0,8,9,7,0,0,0,8,0,7,8,0,
5,0,9,0,5,0,0,0,0,0,0,0, 5,0,9,0,0,0,0,0,0,6,0,0,
5,0,9,0,0,0,0,0,0,0,0,0, 5,0,9,0,0,6,6,0,0,0,0,0,
5,0,9,0,0,0,6,6,0,0,0,0, 5,0,9,0,0,0,0,0,7,0,0,3,
5,0,9,0,0,0,0,0,0,0,1,0, 5,0,9,0,0,0,0,0,0,0,1,0,
5,0,9,0,0,0,0,0,0,0,1,6, 5,0,9,0,0,0,5,0,0,0,0,0,
5,0,9,0,0,0,0,0,0,0,0,0, 5,1,9,0,0,0,0,0,0,0,0,0),
byrow=TRUE,nrow=13)
Always a good idea to set.seed() before running randomization tests (and let us know that specaccum is in the vegan package):
set.seed(101)
library(vegan)
accum <- specaccum(species_counts, method="random", permutations=100)
Extract the richness and sites components from the returned object and compute d(richness)/d(sites). Note that the slope vector is one element shorter than the original sites/richness vectors, so be careful if you're trying to match up slopes with particular numbers of sites.
(slopes <- with(accum,diff(richness)/diff(sites)))
## [1] 1.45 1.07 0.93 0.91 0.86 0.66 0.65 0.45 0.54 0.39 0.32 0.31
In this case, the slope never actually goes below 0.3, so this code for finding the first time that the slope falls below 0.3:
which(slopes<0.3)[1]
returns NA.
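Since the criterion in the question concerns only the last interval, the same calculation can be narrowed to that one slope. The sketch below does not need vegan: it assumes the mean richness values printed in the question's results table (a different random run from the set.seed(101) output above), and the names sites, richness and threshold are just for illustration.

```r
# Mean richness per number of sites, copied from the results table in the question
sites    <- 1:13
richness <- c(3.46, 4.94, 5.94, 7.05, 8.03, 8.74, 9.32,
              9.92, 10.51, 11.00, 11.35, 11.67, 12.00)

# Slope of each interval, labelled by the pair of site counts it spans
slopes <- diff(richness) / diff(sites)
names(slopes) <- paste(head(sites, -1), tail(sites, -1), sep = "-")

threshold  <- 0.3                            # cut-off stated in the question
last_slope <- unname(tail(slopes, 1))        # slope between sites 12 and 13
reached    <- last_slope < threshold         # has the curve levelled off?

first_below <- which(slopes < threshold)[1]  # NA if no interval qualifies

print(round(slopes, 2))
```

Labelling each slope with its interval avoids the off-by-one problem mentioned above when matching slopes to numbers of sites.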

Why MARS (earth package) generates so many predictors?

I am working on a MARS model using the earth package in R. My dataset (CE.Rda) consists of one dependent variable (D9_RTO_avg) and 10 potential predictors (NDVI_l1, NDVI_f0, NDVI_f1, NDVI_f2, NDVI_f3, LST_l1, LST_f0, LST_f1, LST_f2, LST_f3). Here is the head of my dataset:
D9_RTO_avg NDVI_l1 NDVI_f0 NDVI_f1 NDVI_f2 NDVI_f3 LST_l1 LST_f0 LST_f1 LST_f2 LST_f3
2 1.866667 0.3082 0.3290 0.4785 0.4330 0.5844 38.25 30.87 31 21.23 17.92
3 2.000000 0.2164 0.2119 0.2334 0.2539 0.4686 35.7 29.7 28.35 21.67 17.71
4 1.200000 0.2324 0.2503 0.2640 0.2697 0.4726 40.13 33.3 28.95 22.81 16.29
5 1.600000 0.1865 0.2070 0.2104 0.2164 0.3911 43.26 35.79 30.22 23.07 17.88
6 1.800000 0.2757 0.3123 0.3462 0.3778 0.5482 43.99 36.06 30.26 21.36 17.93
7 2.700000 0.2265 0.2654 0.3174 0.2741 0.3590 41.61 35.4 27.51 23.55 18.88
After creating my earth model as follows
mymodel.mod <- earth(D9_RTO_avg ~ ., data=CE, nk=10)
I print the summary of the resulting model by typing
print(summary(mymodel.mod, digits=2, style="pmax"))
and I obtain the following output
D9_RTO_avg =
4.1
+ 38 * LST_f128.68
+ 6.3 * LST_f216.41
- 2.9 * pmax(0, 0.66 - NDVI_l1)
- 2.3 * pmax(0, NDVI_f3 - 0.23)
Selected 5 of 7 terms, and 4 of 13169 predictors
Termination condition: Reached nk 10
Importance: LST_f128.68, NDVI_l1, NDVI_f3, LST_f216.41, NDVI_f0-unused, NDVI_f1-unused, NDVI_f2-unused, ...
Number of terms at each degree of interaction: 1 4 (additive model)
GCV 2 RSS 4046 GRSq 0.29 RSq 0.29
My question is: why does earth identify 13169 predictors when there are actually only 10? It seems that MARS is treating individual observations of the candidate predictors as predictors themselves. How can I prevent MARS from doing so?
Thanks for your help
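One possible explanation, which the post does not confirm: term names such as LST_f128.68 look like a variable name (LST_f1) pasted onto a factor level (28.68), which is how R names dummy columns when a predictor is stored as a factor or character vector rather than as a number. If that is the case here, every distinct value becomes its own 0/1 predictor. The toy data frame below is hypothetical and uses base R's model.matrix() rather than earth to show the effect and a possible fix; running str(CE) on the real data would show whether any columns are factors.

```r
# Hypothetical toy data: 'lst' holds numbers but is stored as a factor
toy <- data.frame(
  y    = c(1.9, 2.0, 1.2, 1.6, 1.8, 2.7),
  lst  = factor(c("38.25", "35.7", "40.13", "43.26", "43.99", "41.61")),
  ndvi = c(0.31, 0.22, 0.23, 0.19, 0.28, 0.23)
)

# A factor expands into one dummy column per level (minus the reference level)
n_before <- ncol(model.matrix(y ~ ., data = toy))

# Convert factor columns back to numbers; go through as.character() so the
# labels are converted, not the internal integer codes
is_fac <- sapply(toy, is.factor)
toy[is_fac] <- lapply(toy[is_fac], function(x) as.numeric(as.character(x)))

n_after <- ncol(model.matrix(y ~ ., data = toy))

c(before = n_before, after = n_after)
```

If the columns became factors when the data were read in, a stray non-numeric entry in the source file is a common cause and is worth checking for.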

Obtain the baseline hazard function/survival function from an extended Cox model (with external time-dependent covariates)

I am applying an extended Cox model with external time-dependent covariates. Here is a small example (df) which I borrowed and modified from Therneau and Grambsch's book, Modeling Survival Data: Extending the Cox Model (2001):
id start stop event trt bili albumin
1 0 188 0 1 1.8 2.54
1 188 372 0 1 1.6 2.88
1 372 729 0 1 1.7 2.80
1 729 1254 0 1 3.2 2.92
1 1254 1462 0 1 3.7 2.59
1 1462 1824 0 1 4.0 2.59
1 1824 1925 1 1 5.3 1.83
2 0 56 0 0 1.8 2.36
2 56 172 0 0 1.6 1.89
2 172 521 1 0 1.7 1.56
3 0 36 0 1 3.2 2.10
3 36 232 0 1 3.7 2.32
3 232 352 0 1 4.0 1.96
3 352 610 1 1 5.3 2.05
I would like to obtain the baseline hazard/survival function from the extended Cox model. In the classical Cox PH model which handles time-independent covariates, it seems that we can obtain the estimate of H(t) using the Nelson-Aalen estimator:
# x1 + x2 is a placeholder for the time-independent covariates
fit1 <- coxph(Surv(time, event) ~ x1 + x2, data = df)
sfit<-survfit(fit1)
sfit$surv
H<- -log(sfit$surv)
H<- c(H, tail(H, 1))
I am wondering how to obtain the baseline hazard/survival function from the extended Cox model, when external time-dependent covariates are used instead. Could I use a similar method, like this?
model_1 <- coxph(Surv(start, stop, event) ~ trt + log(bili) + log(albumin), data = df)
mfit<-survfit(model_1)
mfit$surv
H1<- -log(mfit$surv)
H1<- c(H1, tail(H1, 1))
Thanks.
The survfit object also has an element named 'cumhaz'. That would seem to be the correct item to pull rather than recalculating it.
> all.equal( -log(mfit$surv), mfit$cumhaz )
[1] TRUE
It's unclear what you expect as a "baseline hazard" if you are using time-dependent covariates. At least in the survival package, a "baseline hazard" is calculated on the basis of the estimated hazard for a hypothetical subject who has the mean value of each covariate. The package's author, Terry Therneau, devotes the entire first paragraph of the Details section of the help page for survfit.coxph to discussing why he thinks this often delivers a result of questionable value. I seriously doubt that he would consider a baseline hazard to be meaningful in a time-dependent covariate model, where the baseline would be jumping around. He has specifically suggested that calculating survival curves in that setting is statistically unsupportable (despite the fact that they often appear in medical articles).
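With that caveat about interpretation standing, pulling the cumulative hazard directly can be sketched as follows. This assumes the heart data set that ships with the survival package as a stand-in for the poster's df, since it is already in (start, stop] form; the covariates are chosen only for illustration.

```r
library(survival)  # recommended package that ships with R

# Assumption: built-in 'heart' data stand in for the poster's df
fit <- coxph(Surv(start, stop, event) ~ age + surgery + transplant,
             data = heart)

# Cumulative hazard for a hypothetical subject at the covariate means,
# taken from the 'cumhaz' element instead of recomputing -log(surv)
sf     <- survfit(fit)
H_mean <- data.frame(time = sf$time, cumhaz = sf$cumhaz)

# Cumulative hazard with all covariates set to zero (reference levels)
H0 <- basehaz(fit, centered = FALSE)

head(H_mean)
head(H0)
```

Which of the two counts as "the" baseline depends on whether you want the reference subject at the covariate means or at zero; neither resolves the concern about time-dependent covariates raised above.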

Error in R: multi effects models

I'm having a few issues I'd appreciate some help with.
head(new.data)
WSZ_Code Treatment_Code Year Month TTHM CL2_FREE BrO3 Colour PH TURB seasons
1 2 3 1996 1 30.7 0.35 0.5000750 0.75 7.4 0.055 winter
2 6 1 1996 2 24.8 0.25 0.5001375 0.75 6.9 0.200 winter
3 7 4 1996 2 60.4 0.05 0.5001375 0.75 7.1 0.055 winter
4 7 4 1996 2 58.1 0.15 0.5001570 0.75 7.5 0.055 winter
5 7 4 1996 3 62.2 0.20 0.5003881 2.00 7.6 0.055 spring
6 5 2 1996 3 40.3 0.15 0.5003500 2.00 7.7 0.055 spring
library(nlme)
> mod3 <- lme(TTHM ~ CL2_FREE, random= ~ 1| Treatment_Code/WSZ_Code, data=new.data, method ="ML")
> mod3
Linear mixed-effects model fit by maximum likelihood
Data: new.data
Log-likelihood: -1401.529
Fixed: TTHM ~ CL2_FREE
(Intercept) CL2_FREE
54.45240 -40.15033
Random effects:
Formula: ~1 | Treatment_Code
(Intercept)
StdDev: 0.004156934
Formula: ~1 | WSZ_Code %in% Treatment_Code
(Intercept) Residual
StdDev: 10.90637 13.52372
Number of Observations: 345
Number of Groups:
Treatment_Code WSZ_Code %in% Treatment_Code
4 8
> plot(augPred(mod3))
Error in plot(augPred(mod3)) :
error in evaluating the argument 'x' in selecting a method for function 'plot': Error in sprintf(gettext(fmt, domain = domain), ...) :
invalid type of argument[1]: 'symbol'
I'm not sure why I get this error. The ranef plot seems OK
plot(ranef(mod3))
But that only gives the value of the random intercepts, no TTHM predictions.
I'm looking for a way to plot the predictions like in a typical augPred which would show all the random effects for each zone. Hope that makes sense.
You need a groupedData object to use augPred. I hope this helps.
Best wishes #CSJCampbell
con <- textConnection("
WSZ_Code Treatment_Code Year Month TTHM CL2_FREE BrO3 Colour PH TURB seasons
2 3 1996 1 30.7 0.35 0.5000750 0.75 7.4 0.055 winter
6 1 1996 2 24.8 0.25 0.5001375 0.75 6.9 0.200 winter
7 4 1996 2 60.4 0.05 0.5001375 0.75 7.1 0.055 winter
7 4 1996 2 58.1 0.15 0.5001570 0.75 7.5 0.055 winter
7 4 1996 3 62.2 0.20 0.5003881 2.00 7.6 0.055 spring
5 2 1996 3 40.3 0.15 0.5003500 2.00 7.7 0.055 spring
")
new.data <- read.table(con, header = TRUE)
library(nlme)
new.data.grp <- groupedData(TTHM ~ CL2_FREE | Treatment_Code/WSZ_Code, data = new.data)
mod3 <- lme(TTHM ~ CL2_FREE, random= ~ 1| Treatment_Code/WSZ_Code, data=new.data.grp, method ="ML")
mod3
ap3 <- augPred(mod3)
plot(ap3)
I realize most are probably using ggplot2 and lme4 at this point, but I'm a bit crufty.
Here are a couple of things that I've found working with lists of response variables that are fit using lme().
So, I've been working with a number of response variables that I want to fit to a particular set of inputs. In short my code looks something like
mymodels = list()
for(resp in my_response_vars){
  f = as.formula(paste(resp, paste(my_input_vars, collapse='+'), sep='~'))
  mymodels[[resp]] = lme(fixed=f, random=~wave|group, method="ML",
                         data=mydata, na.action=na.exclude)
}
I've been successful in treating the entries in the resulting list as normal lme() objects. The problem comes when I want to plot predictions via augPred(). Specifically I get the following error,
Error in tapply(object[[nm]], groups, FUN[["numeric"]], ...) :
arguments must have same length
So, after much searching, I decided to have a look under the hood of augPred() via debug(). Here are some of the insights I came to... I'm not sure that these qualify as bugs or if they would require a patch, but I hope they can help others with similar problems.
When calling augPred() the function looks for the name of the data that was used in the original lme() call, then inherits this object from the parent.frame() via a call to eval(). I'm not sure if this defaults to the object frame or the global, but, when I change this to data = object$data in the debug, things work. So, ostensibly, if you have used a subset of these data in your model, it will call on the full set of data.
The above causes issues if one response has missing values and you are interested in one that does not. Since it includes everything in the data.frame as part of an eventual call to gsummary() the missing values in the non-response variable will throw a wrench into things.
So, missing values mess things up. I have defaulted to making a temporary data.frame with the columns of interest, then running complete.cases() on this prior to fitting the lme() model.
mymods = list()
for(resp in my_response_vars){
  f = as.formula(paste(resp, paste(my_input_vars, collapse='+'), sep='~'))
  v2keep = all.vars(f)                     # grab terms
  smdat = mydata[, c(v2keep, 'group')]     # include group
  smdat = smdat[complete.cases(smdat), ]   # scrub missing
  tmpmod = lme(fixed=f, random=~wave|group,
               method='ML', data=smdat)
  mymods[[resp]] = tmpmod
  # include augPred() call here
}
If you are not including a primary argument in your call to augPred() it will require that your data.frame is a groupedData() object.
So, if you are running into the "arguments must have same length" error, try subsetting your data first under a different name, and make sure to remove rows with missing values explicitly before fitting your model.
