Unnest nested tidydrc models - r

Problem
I've been using tidydrc, a tidy wrapper for the drc package, to build growth curves. It produces a tidy version of the normal output (best for ggplot). However, due to the inherent nesting of the models, I can't run simple drc functions, since the models are nested inside a data frame. I've attached code below that mirrors the drc and tidydrc packages.
Goal
To compare information criteria from multiple model fits for the tidydrc output using the drc function mselect(), ultimately to select the best-fitting model efficiently.
Ideal Result (works with drc)
library(tidydrc) # To load the Puromycin data
library(drc)
model_1 <- drm(rate ~ conc, state, data = Puromycin, fct = MM.3())
mselect(model_1, list(LL.3(), LL.5(), W1.3(), W1.4(), W2.4(), baro5()))
# DESIRED OUTPUT SIMILAR TO THIS
         logLik       IC Lack of fit  Res var
MM.3  -78.10685 170.2137   0.9779485 70.54874 # Best fitting model
LL.3  -78.52648 171.0530   0.9491058 73.17059
W1.3  -79.22592 172.4518   0.8763679 77.75903
W2.4  -77.87330 173.7466   0.9315559 78.34783
W1.4  -78.16193 174.3239   0.8862192 80.33907
LL.5  -77.53835 177.0767   0.7936113 87.80627
baro5 -78.00206 178.0041   0.6357592 91.41919
Not Working Example with tidydrc
library(tidyverse) # tidydrc utilizes tidyverse functions
model_2 <- tidydrc_model(data = Puromycin, conc, rate, state, model = MM.3())
summary(model_2)
Error: summary.vctrs_list_of() not implemented.
Now, I can manually tease apart the list of models in the dataframe model_2 but can't seem to figure out the correct apply statements (it's a mess) to get this working.
Progress Thus Far
These both produce the same error, so at least I've subsetted a level down but now I'm stuck and pretty sure this is not the ideal solution.
mselect(model_2$drmod, list(LL.3(), LL.5(), W1.3(), W1.4(), W2.4(), baro5()))
model_2_sub <- model_2$drmod # Manually subset the drmod column
apply(model_2_sub, 2, mselect(list(LL.3(), LL.5(), W1.3(), W1.4(), W2.4(), baro5())))
Error in UseMethod("logLik") :
no applicable method for 'logLik' applied to an object of class "list"
I've even tried the tidyverse unnesting function unnest_longer(), to no avail:
model_2_unnest <- model_2 %>% unnest_longer(drmod, indices_include = FALSE)
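The logLik error above arises because a whole list of models is handed to a function that expects a single model. A minimal base-R sketch of the general pattern follows, with lm() standing in for drm() and AIC() standing in for mselect() (both substitutions are assumptions for illustration; this does not use tidydrc and has not been checked against its nested output):

```r
# Puromycin ships with base R (datasets package).
# Fit one model per group and keep the fits in a list column.
groups <- split(Puromycin, Puromycin$state)
nested <- data.frame(state = names(groups))
nested$fit <- lapply(groups, function(d) lm(rate ~ log(conc), data = d))

# A function written for ONE model must be mapped over the list column,
# element by element, rather than applied to the column as a whole.
ic <- sapply(nested$fit, AIC)
ic
```

The same element-wise mapping (lapply(), or purrr::map() in the tidyverse) would be the natural way to call a per-model function on each entry of model_2$drmod.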

Related

Cannot use coxph.predict for type="expected" with newdata in Competing Risks context

I'm using a Cox proportional hazards model (survival::coxph) in a competing risks context, i.e. multiple event types with one endpoint for each observation. I'm having a hard time using the predict method for coxph objects (predict.coxph) to get an estimate of the expected number of events given a supplied set of covariates and follow-up time.
Here is an example using the mgus2 dataset in the survival package:
library(survival)
#Modify data so each subject transitions only once to a state.
crdata <- mgus2
crdata$etime <- pmin(crdata$ptime, crdata$futime)
crdata$event <- ifelse(crdata$pstat==1, 1, 2*crdata$death)
crdata$event <- factor(crdata$event, 0:2, c("censor", "PCM", "death"))
cfit <- coxph(Surv(etime, event) ~ I(age/10) + sex + mspike,
id = id, crdata)
Once I fit a model, and create a "newdata" data frame, R throws an error.
I tried using a from-scratch dataframe but this results in an error suggesting that the column size or the number of rows does not mesh:
#providing both follow-up time and covariates
nd=data.frame(etime=81 ,sex= "M", age=60, mspike=1.2)
predict(cfit, newdata=nd ,type="expected")
> Data is not the same size as it was in the original fit
I get the same issue when using model.frame to extract the same data frame that was used to fit the model.
nd=model.frame(cfit)
predict(cfit,newdata=nd,type="expected")
> Data is not the same size as it was in the original fit
This results in the same error. Trying to use the original data frame to make predictions doesn't work either:
nd=crdata[1,]
predict(cfit,newdata=nd,type="expected")
> Data is not the same size as it was in the original fit
I'm wondering what I'm missing here. Thanks in advance!
I've updated my survival package from 2.7 to 3.1 and the error thrown states that "expected" predict type is not available for multistate coxph.
> predict(fit,type="expected",newdata=newdat)
Error in predict.coxphms(fit, type = "expected", newdata = newdat) :
predict method not yet available for multistate coxph
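Since type="expected" is reported as unavailable for the multistate fit, one possible route is a cause-specific model for a single endpoint, which is an ordinary coxph object. The sketch below is an assumption-laden illustration, not a replacement for the multistate model: it treats death as censoring for the PCM endpoint (which changes the interpretation of the result), and the column name pcm is introduced here.

```r
library(survival)

# Same time variable as in the question, one endpoint per subject.
crdata <- mgus2
crdata$etime <- pmin(crdata$ptime, crdata$futime)
# Hypothetical 0/1 indicator for the PCM endpoint only;
# deaths without PCM are treated as censored (an assumption).
crdata$pcm <- as.integer(crdata$pstat == 1)

fit_pcm <- coxph(Surv(etime, pcm) ~ I(age / 10) + sex + mspike, data = crdata)

# For type = "expected", newdata must also contain the response variables
# (follow-up time and status), not just the covariates.
nd <- data.frame(etime = 81, pcm = 0, sex = "M", age = 60, mspike = 1.2)
p <- predict(fit_pcm, newdata = nd, type = "expected")
p
```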

How do you get stargazer to recognise a model from lmList?

I have data for three different years and am running a regression for each separate year with lmList(). When I try to get the LaTeX code with stargazer, I get an error saying it doesn't recognize the object type. When running stargazer for a normal linear regression, it works just fine, even though the classes of the objects are the same.
This is my regression with lmList
fit <- lmList((lndeltaoms) ~ size + factor(gender)| year, data = tser)
stargazer(fit[["2008"]])
% Error: Unrecognized object type.
Compare this to a normal regression, where it works.
fit2 <- lm((lndeltaoms) ~ size + factor(gender), data=tser)
stargazer(fit2)
But when I compare the classes, they're the same.
class(fit[["2008"]])
[1] "lm"
class(fit2)
[1] "lm"
Since they're the same class, it feels like stargazer should recognize both of them in the same way, but there seems to be some issue when extracting a model from the lmList.
Is there any way I can work around this?
It should work fine with lmList() from the nlme package (not the one from lme4). Try out:
fit1 <- nlme::lmList((lndeltaoms) ~ size + factor(gender)| year, data = tser)
stargazer(fit1[["2008"]]) # ok
fit2 <- lme4::lmList((lndeltaoms) ~ size + factor(gender)| year, data = tser)
stargazer(fit2[["2008"]]) # this does not work
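It looks like stargazer() works fine with models extracted from an lmList object but not with those extracted from the lmList4 object that lme4::lmList() returns.
Also, be careful when both nlme and lme4 are loaded: whichever package is attached last masks the other's lmList(), so it is safest to call the function with an explicit namespace prefix.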
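A small sketch of the explicit-namespace call, using the Orthodont data that ships with nlme as a stand-in for the asker's tser data (an assumption, since tser is not available here):

```r
# Calling nlme::lmList() by its full name sidesteps masking entirely,
# regardless of whether lme4 is attached and in which order.
fit_nlme <- nlme::lmList(distance ~ age | Sex, data = nlme::Orthodont)

# Each element of the list is an ordinary lm fit for one group.
male_fit <- fit_nlme[["Male"]]
class(male_fit)
```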

Error when using predict() on a randomForest object trained with caret's train() using formula

Using R 3.2.0 with caret 6.0-41 and randomForest 4.6-10 on a 64-bit Linux machine.
When trying to use the predict() method on a randomForest object trained with the train() function from the caret package using a formula, the function returns an error.
When training via randomForest() and/or using x= and y= rather than a formula, it all runs smoothly.
Here is a working example:
library(randomForest)
library(caret)
data(imports85)
imp85 <- imports85[, c("stroke", "price", "fuelType", "numOfDoors")]
imp85 <- imp85[complete.cases(imp85), ]
imp85[] <- lapply(imp85, function(x) if (is.factor(x)) x[,drop=TRUE] else x) ## Drop empty levels for factors.
modRf1 <- randomForest(numOfDoors~., data=imp85)
caretRf <- train( numOfDoors~., data=imp85, method = "rf" )
modRf2 <- caretRf$finalModel
modRf3 <- randomForest(x=imp85[,c("stroke", "price", "fuelType")], y=imp85[, "numOfDoors"])
caretRf <- train(x=imp85[,c("stroke", "price", "fuelType")], y=imp85[, "numOfDoors"], method = "rf")
modRf4 <- caretRf$finalModel
p1 <- predict(modRf1, newdata=imp85)
p2 <- predict(modRf2, newdata=imp85)
p3 <- predict(modRf3, newdata=imp85)
p4 <- predict(modRf4, newdata=imp85)
Among the last 4 lines, only the second one p2 <- predict(modRf2, newdata=imp85) returns the following error:
Error in predict.randomForest(modRf2, newdata = imp85) :
variables in the training data missing in newdata
It seems that the reason for this error is that the predict.randomForest method uses rownames(object$importance) to determine the name of the variables used to train the random forest object. And when looking at
rownames(modRf1$importance)
rownames(modRf2$importance)
rownames(modRf3$importance)
rownames(modRf4$importance)
We see:
[1] "stroke" "price" "fuelType"
[1] "stroke" "price" "fuelTypegas"
[1] "stroke" "price" "fuelType"
[1] "stroke" "price" "fuelType"
So somehow, using the caret train() function with a formula changes the names of the (factor) variables in the importance field of the randomForest object.
Is it really an inconsistency between the formula and non-formula versions of the caret train() function? Or am I missing something?
First, almost never use the $finalModel object for prediction. Use predict.train. This is one good example of why.
There is some inconsistency between how some functions (including randomForest and train) handle dummy variables. Most functions in R that use the formula method will convert factor predictors to dummy variables because their models require numerical representations of the data. The exceptions to this are tree- and rule-based models (that can split on categorical predictors), naive Bayes, and a few others.
So randomForest will not create dummy variables when you use randomForest(y ~ ., data = dat) but train (and most others) will using a call like train(y ~ ., data = dat).
The error occurs because fuelType is a factor. The dummy variables created by train don't have the same names so predict.randomForest can't find them.
Using the non-formula method with train will pass the factor predictors to randomForest and everything will work.
TL;DR
Use the non-formula method with train if you want the same levels or use predict.train
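The renaming can be reproduced without caret. The sketch below uses base R's model.matrix(), which is the standard way a formula gets expanded into dummy variables; that train() goes through this exact function is an assumption here, and the toy data frame is made up for illustration:

```r
# Toy data with one numeric and one factor predictor.
d <- data.frame(
  stroke   = c(2.7, 3.4, 3.1),
  fuelType = factor(c("diesel", "gas", "gas"))
)

# Expanding the formula turns the factor into a dummy column whose name
# is the variable name pasted onto the non-reference level.
mm <- model.matrix(~ ., data = d)
colnames(mm)
# [1] "(Intercept)" "stroke"      "fuelTypegas"
```

A model fitted on such a matrix only knows the column fuelTypegas, which is why a data frame containing fuelType no longer matches.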
There can be two reasons why you get this error.
1. The categories of the categorical variables in the train and test sets don't match. To check that, you can run something like the following.
Well, first of all, it is good practice to keep the independent variables/features in a list. Say that list is "vars". And say, you separated "Data" into "Train" and "Test". Let's go:
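for (v in vars){
# is.factor() also handles ordered factors, whose class() has length 2
if (is.factor(Data[,v])){
print(v)
# print(levels(Train[,v]))
# print(levels(Test[,v]))
# TRUE only when both sets have the same levels in the same order
print(identical(levels(Train[,v]), levels(Test[,v])))
}
}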
Once you find the non-matching categorical variables, you can go back, align the categories of the Test data with those of the Train data, and then re-build your model. In a loop similar to the one above, for each nonMatchingVar, you can do
# factor(..., levels = ) keeps the values and only changes the level set;
# values not present in Train become NA
Test$nonMatchingVar <- factor(Test$nonMatchingVar, levels = levels(Train$nonMatchingVar))
2. A silly one. If you accidentally leave the dependent variable in the set of independent variables, you may run into this error message. I have made that mistake. Solution: Just be more careful.
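When aligning levels between train and test sets, it matters how the levels are set. The following sketch, with made-up vectors, contrasts re-creating the factor with an explicit level set against assigning to levels() directly, which relabels by position:

```r
train_f <- factor(c("gas", "diesel", "gas"))   # levels: "diesel", "gas"
test_f  <- factor(c("gas", "gas"))             # levels: "gas" only

# Re-create the factor with the training levels: values stay the same.
aligned <- factor(test_f, levels = levels(train_f))
as.character(aligned)
# [1] "gas" "gas"

# Assigning levels() renames the existing levels in order, so the first
# (and only) test level "gas" is silently relabeled as "diesel".
relabeled <- test_f
levels(relabeled) <- levels(train_f)
as.character(relabeled)
# [1] "diesel" "diesel"
```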
Another way is to explicitly code the testing data using model.matrix, e.g.
p2 <- predict(modRf2, newdata=model.matrix(~., imp85))

Extracting predictions from a GAM model with splines and lagged predictors

I have some data and am trying to teach myself how to use lagged predictors within regression models. I'm currently trying to generate predictions from a generalized additive model that uses splines to smooth the data and contains lags.
Let's say I have the following data and have split the data into training and test samples.
head(mtcars)
Train <- sample(1:nrow(mtcars), ceiling(nrow(mtcars)*3/4), replace=FALSE)
Great, let's train the gam model on the training set.
f_gam <- gam(hp ~ s(qsec, bs="cr") + s(lag(disp, 1), bs="cr"), data=mtcars[Train,])
summary(f_gam)
When I go to predict on the holdout sample, I get an error message.
f_gam.pred <- predict(f_gam, mtcars[-Train,]); f_gam.pred
Error in ExtractData(object, data, NULL) :
'names' attribute [1] must be the same length as the vector [0]
Calls: predict ... predict.gam -> PredictMat -> Predict.matrix3 -> ExtractData
Can anyone help diagnose the issue and suggest a solution? I get that lag(__,1) leaves a data point as NA and that this is likely the reason for the lengths being different. However, I don't have a solution to the problem.
I'm going to assume you're using gam() from the mgcv library. It appears that gam() doesn't like functions that are not defined in "base" in the s() terms. You can get around this by adding a column which includes the transformed variable and then modeling using that variable. For example
tmtcars <- transform(mtcars, ldisp=lag(disp,1))
Train <- sample(1:nrow(mtcars), ceiling(nrow(mtcars)*3/4), replace=FALSE)
f_gam <- gam(hp ~ s(qsec, bs="cr") + s(ldisp, bs="cr"), data= tmtcars[Train,])
summary(f_gam)
predict(f_gam, tmtcars[-Train,])
works without error.
The problem appears to be coming from the mgcv:::get.var function. It tries to decode the terms with something like
eval(parse(text = txt), data, enclos = NULL)
and because they explicitly set the enclosure to NULL, variable and function names outside of base cannot be resolved. So because mean() is in the base package, this works
eval(parse(text="mean(x)"), data.frame(x=1:4), enclos=NULL)
# [1] 2.5
but because var() is defined in stats, this does not
eval(parse(text="var(x)"), data.frame(x=1:4), enclos=NULL)
# Error in eval(expr, envir, enclos) : could not find function "var"
and lag(), like var(), is defined in the stats package.
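A side note on the workaround, offered as a hedged sketch: on a plain numeric vector, stats::lag() only shifts the time-series attributes and leaves the values where they are, so a column built with lag(disp, 1) holds the same numbers as disp. If an actual one-row shift is wanted, one base-R option is shown below (the helper name shift1 is hypothetical):

```r
x <- c(10, 20, 30, 40)

# stats::lag() attaches a shifted "tsp" attribute but keeps the values.
lagged <- stats::lag(x, 1)
values_unchanged <- identical(as.numeric(lagged), x)
values_unchanged
# [1] TRUE

# Hypothetical helper: shift values down by one row, padding with NA.
shift1 <- function(v) c(NA, head(v, -1))
shifted <- shift1(x)
shifted
# [1] NA 10 20 30
```

dplyr::lag() behaves like shift1() here, so which lag() is on the search path changes the result.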

How to use lmer inside a function

I'm trying to write a function that collects some calls I use often in scripts.
I use the sleepstudy data of the lme4 package in my examples.
Here's (a simplified version of) the function I started with:
trimModel1 <- function(frm, df) {
require(LMERConvenienceFunctions)
require(lme4)
lm<-lmer(frm,data=df)
lm.trimmed = romr.fnc(lm, df)
df = lm.trimmed$data
# update initial model on trimmed data
lm<-lmer(frm,data=df)
# lm@call$formula<-frm
mcp.fnc(lm)
lm
}
When I call this function like below:
(fm1<-trimModel1(Reaction ~ Days + (Days|Subject),sleepstudy))
The first three lines of the output look like this:
Linear mixed model fit by REML
Formula: frm
Data: df
If I had called the commands of the trimModel1 function in the console the first three lines of the summary of the model look like this:
Linear mixed model fit by REML
Formula: Reaction ~ Days + (Days | Subject)
Data: sleepstudy
The difference is a problem because several packages that use the lme4 package make use of the formula and data fields. For instance the effects package uses these fields and a command like below will not work when I use the trimModel1 function above:
library(effects)
plot(allEffects(fm1))
I looked around on Stack Overflow and R discussion groups for a solution and saw that you could change the formula field of the model. If you uncomment the lm@call$formula<-frm line in the trimModel1 function, the formula field in the summary is displayed correctly. Unfortunately, when I run a function from the effects package I still get the error:
Error in terms.formula(formula, data = data) :
'data' argument is of the wrong type
This is because the data field is still incorrect.
Another possible solution I found is this function:
trimModel2 <- function(frm, df) {
require(LMERConvenienceFunctions)
require(lme4)
lm<-do.call("lmer",list(frm,data=df))
lm.trimmed = romr.fnc(lm, df)
df = lm.trimmed$data
# update initial model on trimmed data
lm<-do.call("lmer",list(frm,data=df))
mcp.fnc(lm)
lm
}
When I now type the following commands in the console I get no errors:
(fm2<-trimModel2(Reaction ~ Days + (Days|Subject),sleepstudy))
plot(allEffects(fm2))
The allEffects function works, but now the problem is that the summary of the fm2 model displays the raw sleepstudy data. That is not a big problem with the sleepstudy data, but with very large datasets RStudio sometimes crashes when displaying a model.
Does anyone know how to make one (or both) of these functions work correctly?
I think I have to change the fm1@call$data field but I don't know how.
Do it like this:
trimModel1 <- function(frm, df) {
require(LMERConvenienceFunctions)
require(lme4)
dfname <- as.name(deparse(substitute(df)))
lm<-lmer(frm,data=df)
lm.trimmed = romr.fnc(lm, df)
df = lm.trimmed$data
# update initial model on trimmed data
lm<-lmer(frm,data=df)
lm@call$formula <- frm
lm@call$data <- dfname
mcp.fnc(lm)
lm
}
It's the "deparse-substitute trick" to get an object name from the object itself.
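The trick can be tried in isolation with base R. The sketch below uses lm() instead of lmer() (an assumption made so it runs without lme4; lm stores its call in fit$call rather than in an S4 slot), and the function name fit_named is hypothetical:

```r
fit_named <- function(frm, df) {
  # Capture the NAME of the object passed as df, before df is modified.
  dfname <- as.name(deparse(substitute(df)))
  fit <- lm(frm, data = df)
  # Overwrite the recorded call so it shows the caller's formula and data
  # name instead of the local argument names frm and df.
  fit$call$formula <- frm
  fit$call$data <- dfname
  fit
}

fit <- fit_named(mpg ~ wt, mtcars)
fit$call
# lm(formula = mpg ~ wt, data = mtcars)
```

Because the stored call now names mtcars, functions that re-evaluate the call can find the data in the caller's environment.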

Resources