Getting the class of the predictors from an lme4 model

After fitting an lme4 model, how can I get the class of each predictor that appears in terms(fit)[[3]]?
Here is a simple example, but I would appreciate an answer that works for any other lme4 model.
Note: everything has to be extracted from the model.
library(lme4)
h <- read.csv('https://raw.githubusercontent.com/hkil/m/master/h.csv')
h$year <- as.factor(h$year)
m <- lmer(scale ~ year*group + (1|stid), data = h)
terms(m)[[3]] ## What are the `class`es of the variables in here (e.g., `integer`, `factor` etc.)

Maybe not perfectly robust, but:
Extract the names of the variables from the terms object:
av <- all.vars(terms(m)[[3]]) ## c("year","group")
Then look them up in the data frame supplied as data=:
setNames(lapply(av, function(x) class(h[[x]])), av)
$year
[1] "factor"
$group
[1] "character"
If you want to get everything from the model, this gets much harder in general, because the original variables are not necessarily stored. In the example you gave, this works:
setNames(lapply(av, function(x) class(model.frame(m)[[x]])), av)
$year
[1] "factor"
$group
[1] "factor"
You'll notice that group has been converted to a factor. You can break this, e.g., by using a term like log(x) in the model ...
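For instance, here is a minimal sketch of that breakage, using mtcars instead of the original data (so the variable names here are purely illustrative): the model frame stores the transformed column log(hp), not the raw variable hp.
library(lme4)
m2 <- lmer(mpg ~ log(hp) + (1 | cyl), data = mtcars)
av2 <- all.vars(terms(m2)[[3]]) ## "hp" -- the raw variable name
names(model.frame(m2))          ## "mpg" "log(hp)" "cyl" -- no column named "hp"
class(model.frame(m2)[["hp"]])  ## "NULL" -- the model-frame lookup now fails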

Related

How can I prevent the glm() function from adding back-ticks to some of my variable names?

I'm using glm() to create some models. The variables in my input data.table have names like "day" and "day^2". The model generated by glm() puts back-ticks around any variable with a polynomial term, which breaks various components of my downstream workflow.
library(data.table)
# Generate some sample data
covariates <- data.table('day'   = 1:10,
                         'day^2' = (1:10)^2)
theta <- sample(x = 0:1, size = 10, replace = TRUE)
# Produce a model from glm()
mod <- glm(formula = theta ~ .,
           data = covariates,
           family = 'binomial')
I want to be able to check the model coefficients and see "day" and "day^2", but instead I'm seeing "day" and "`day^2`":
# Check the coefficient names output from glm()
names(mod$coefficients)
# [1] "(Intercept)" "day" "`day^2`"
# Check the colnames(covariates) to verify we did not put in back-ticks
colnames(covariates)
# [1] "day" "day^2"
# Eliminate the "(Intercept)" term from mods names,
# then compare the outcome to the colnames of covariates
# to verify that we are not imagining the back-ticks
names(mod$coefficients[-1]) == colnames(covariates)
# [1] TRUE FALSE
Am I out of luck? Is glm() just not going to respect my polynomial variable names?
EDITED TO ADD:
Well, data.table is so fully entrenched in my workflow that I didn't even consider trying this out with a data.frame, but it just occurred to me. Here is what happens with data.frames:
# Generate some sample data using a traditional data.frame instead of data.table
covariates <- data.frame('day'   = 1:10,
                         'day^2' = (1:10)^2)
# Check the column names; data.frame coerces '^' to '.':
colnames(covariates)
# [1] "day" "day.2"
Thus, perhaps the issue is that data.table is not fully compatible with glm() in this case: data.table allows much more flexibility with my polynomial variable names, which data.frame would have coerced to '.' (and which glm ultimately ends up quoting with back-ticks!).
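One possible workaround (my own suggestion, not from the original post): the back-ticks appear only because "day^2" is not a syntactically valid R name, so they can be stripped from the fitted coefficient names after the fact.
# Strip the back-ticks that glm() added around the non-syntactic name
names(mod$coefficients) <- gsub("`", "", names(mod$coefficients))
names(mod$coefficients)
# [1] "(Intercept)" "day"         "day^2"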

predict.lm with arbitrary coefficients

I'm trying to predict an lm object using predict.lm. However, I would like to use manually inserted coefficients.
To do this I tried:
model$coefficients <- coeff
(where "coeff" is a vector of correct coefficients)
which would indeed modify the coefficients as I want. Nevertheless, when I execute
predict.lm(model, new.data)
I just get predictions calculated with the "old" parameters. Is there a way I could force predict.lm to use the new ones?
Post Scriptum: I need to do this to fit a bin-smooth (also called regressogram).
In addition, when I predict "by hand" (i.e. using matrix multiplication) the results are fine, hence I'm quite sure that the problem lies in predict.lm not recognizing my new coefficients.
Thanks in advance for the help!
Hacking the $coefficients element does indeed seem to work. Can you show what doesn't work for you?
dd <- data.frame(x=1:5,y=1:5)
m1 <- lm(y~x,dd)
m1$coefficients <- c(-2,1)
m1
## Call:
## lm(formula = y ~ x, data = dd)
##
## Coefficients:
## [1] -2 1
predict(m1,newdata=data.frame(x=7)) ## 5 = -2+1*7
predict.lm(...) gives the same results.
I would be very careful with this approach, checking the results each time you do something different with the hacked model.
It would be nice if predict and simulate methods took a newparams argument, but in general they don't ...
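For comparison, here is a minimal sketch of the "by hand" route mentioned in the question, using the same toy data and hypothetical replacement coefficients: build the design matrix for the new data from the model's terms and multiply directly.
new_coef <- c(-2, 1)  # replacement coefficients, in the order of coef(m1)
X <- model.matrix(delete.response(terms(m1)), data = data.frame(x = 7))
drop(X %*% new_coef)  ## 5 = -2 + 1*7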

What is the difference between type = "response" and type = "scores" in R

I'm trying to understand the predict function in R.
There is a parameter called type which I can set to "response" or "scores".
I'm having difficulty understanding the difference.
Thanks.
This isn't precisely an answer, but it shows how I looked through all of the predict() methods available in base R to see what the possible values of type were for all of those methods ...
m <- methods("predict")
p <- lapply(m, getAnywhere)
tt <- function(x) {
    obj <- formals(x$objs[[1]])
    r <- eval(obj$type)
}
res <- setNames(lapply(p, tt),
                sapply(p, "[[", "name"))
res[!sapply(res, is.null)]
Results:
$predict.glm
[1] "link" "response" "terms"
$predict.lm
[1] "response" "terms"
So, you're going to have to tell us what S3 predict() method allows type="scores" as an option ...
Googling cran predict type="scores": maybe the pls package?
From ?predict.mvr:
When type is "scores", predicted score values are returned for
the components given in comps. If comps is missing or NULL,
ncomps is used instead.
I believe that the score values are the predicted principal component scores for a given set of predictors (as opposed to the predicted values of the original predictor variables).
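A hedged illustration with pls (assuming that is indeed the package in question), using the yarn data shipped with the package:
library(pls)
data(yarn)
fit <- plsr(density ~ NIR, ncomp = 3, data = yarn)
pr <- predict(fit, newdata = yarn, type = "response") # predicted values of density
sc <- predict(fit, newdata = yarn, type = "scores")   # predicted component scores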

Error when using predict() on a randomForest object trained with caret's train() using formula

Using R 3.2.0 with caret 6.0-41 and randomForest 4.6-10 on a 64-bit Linux machine.
When trying to use the predict() method on a randomForest object trained with the train() function from the caret package using a formula, the function returns an error.
When training via randomForest() and/or using x= and y= rather than a formula, it all runs smoothly.
Here is a working example:
library(randomForest)
library(caret)
data(imports85)
imp85 <- imports85[, c("stroke", "price", "fuelType", "numOfDoors")]
imp85 <- imp85[complete.cases(imp85), ]
imp85[] <- lapply(imp85, function(x) if (is.factor(x)) x[,drop=TRUE] else x) ## Drop empty levels for factors.
modRf1 <- randomForest(numOfDoors~., data=imp85)
caretRf <- train( numOfDoors~., data=imp85, method = "rf" )
modRf2 <- caretRf$finalModel
modRf3 <- randomForest(x=imp85[,c("stroke", "price", "fuelType")], y=imp85[, "numOfDoors"])
caretRf <- train(x=imp85[,c("stroke", "price", "fuelType")], y=imp85[, "numOfDoors"], method = "rf")
modRf4 <- caretRf$finalModel
p1 <- predict(modRf1, newdata=imp85)
p2 <- predict(modRf2, newdata=imp85)
p3 <- predict(modRf3, newdata=imp85)
p4 <- predict(modRf4, newdata=imp85)
Among the last 4 lines, only the second one p2 <- predict(modRf2, newdata=imp85) returns the following error:
Error in predict.randomForest(modRf2, newdata = imp85) :
variables in the training data missing in newdata
It seems that the reason for this error is that the predict.randomForest method uses rownames(object$importance) to determine the names of the variables used to train the random forest. And when looking at
rownames(modRf1$importance)
rownames(modRf2$importance)
rownames(modRf3$importance)
rownames(modRf4$importance)
We see:
[1] "stroke" "price" "fuelType"
[1] "stroke" "price" "fuelTypegas"
[1] "stroke" "price" "fuelType"
[1] "stroke" "price" "fuelType"
So somehow, using the caret train() function with a formula changes the names of the (factor) variables in the importance field of the randomForest object.
Is it really an inconsistency between the formula and non-formula versions of the caret train() function? Or am I missing something?
First, almost never use the $finalModel object for prediction. Use predict.train. This is one good example of why.
There is some inconsistency between how some functions (including randomForest and train) handle dummy variables. Most functions in R that use the formula method will convert factor predictors to dummy variables because their models require numerical representations of the data. The exceptions to this are tree- and rule-based models (that can split on categorical predictors), naive Bayes, and a few others.
So randomForest will not create dummy variables when you use randomForest(y ~ ., data = dat), but train (and most others) will when using a call like train(y ~ ., data = dat).
The error occurs because fuelType is a factor. The dummy variables created by train don't have the same names so predict.randomForest can't find them.
Using the non-formula method with train will pass the factor predictors to randomForest and everything will work.
TL;DR
Use the non-formula method with train if you want the same factor levels, or use predict.train.
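Applied to the example above, a minimal sketch of the recommended route: call predict() on the train object itself so it dispatches to predict.train, which reapplies the same dummy-variable encoding used at fit time.
caretRf <- train(numOfDoors ~ ., data = imp85, method = "rf")
p2_ok <- predict(caretRf, newdata = imp85) # predict.train: no "missing in newdata" error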
There can be two reasons why you get this error.
1. The categories of the categorical variables in the train and test sets don't match. To check that, you can run something like the loop below.
First of all, it is good practice to keep the independent variables/features in a list; say that list is vars. And say you have split Data into Train and Test. Let's go:
for (v in vars){
    if (class(Data[, v]) == 'factor'){
        print(v)
        # print(levels(Train[, v]))
        # print(levels(Test[, v]))
        print(all.equal(levels(Train[, v]), levels(Test[, v])))
    }
}
Once you find the non-matching categorical variables, you can go back, align the factor levels of the Test data with those of the Train data, and then re-run the prediction. In a loop similar to the one above, for each nonMatchingVar, you can do
levels(Test$nonMatchingVar) <- levels(Train$nonMatchingVar)
(Note that levels<- renames the existing levels positionally; factor(Test$nonMatchingVar, levels = levels(Train$nonMatchingVar)) is usually the safer way to do this.)
2. A silly one. If you accidentally leave the dependent variable in the set of independent variables, you may run into this error message. I have made that mistake myself. Solution: just be more careful.
Another way is to explicitly encode the testing data using model.matrix, e.g.
p2 <- predict(modRf2, newdata = model.matrix(~., imp85))
This works because model.matrix expands the factors into the same dummy-variable columns (e.g. fuelTypegas) that train created when fitting.

Predicting with lm object in R - black box paradigm

I have a function that returns an lm object. I want to produce predicted values based on some new data. The new data is a data.frame in the exact format as the data passed to the lm function, except that the response has been removed (since we're predicting, not training). I would expect to execute the following, but get an error:
predict( model , newdata )
"Error in eval(expr, envir, enclos) : object 'ModelResponse' not found"
In my case, ModelResponse was the name of the response column in the data I originally trained on. So just for kicks, I tried to insert an NA response:
newdata$ModelResponse = NA
predict( model , newdata )
Error in terms.default(object, data = data) : no terms component nor attribute
Highly frustrating! R's notion of models/regression doesn't match mine: 1. I train a model with some data and get a model object. 2. I can score new data from any environment/function/frame/etc. so long as I feed the model object data that "looks like" the data I trained on (i.e. has the same column names). This is a standard black-box paradigm.
So here are my questions:
1. What concept(s) am I missing here?
2. How do I get my scenario to work?
3. How can I get the model object to be portable? str(model) shows me that the model object saved the original data it trained on! So the model object is massive. I want my model to be portable to any function/environment/etc. and to contain only the data it needs to score.
In the absence of str() on either the model or the data offered to the model, here's my guess regarding this error message:
predict( model , newdata )
"Error in eval(expr, envir, enclos) : object 'ModelResponse' not found"
I guess that you made a model object named "model", that your outcome variable (the left-hand side of the formula in the original call to lm) was named "ModelResponse", and that you then named a column in newdata by the same name. But what you should have done was leave out the "ModelResponse" column (because that is what you are predicting) and put in the "Model_Predictor1", "Model_Predictor2", etc. ... i.e. all the names on the right-hand side of the formula given to lm().
The coef() function will allow you to extract the information needed to make the model portable.
mod.coef <- coef(model)
mod.coef
Since you expressed interest in the Function from the rms/Hmisc package combination, here it is using the help example from ols, comparing the output of the extracted function with the rms Predict method. Note the capitals: these are designed to work with the package equivalents of lm, glm(..., family="binomial"), and coxph, which in rms become ols, lrm, and cph.
> set.seed(1)
> x1 <- runif(200)
> x2 <- sample(0:3, 200, TRUE)
> distance <- (x1 + x2/3 + rnorm(200))^2
> d <- datadist(x1,x2)
> options(datadist="d") # No d -> no summary, plot without giving all details
>
>
> f <- ols(sqrt(distance) ~ rcs(x1,4) + scored(x2), x=TRUE)
>
> Function(f)
function(x1 = 0.50549065,x2 = 1) {0.50497361+1.0737604* x1-
0.79398383*pmax(x1-0.083887788,0)^3+ 1.4392827*pmax(x1-0.38792825,0)^3-
0.38627901*pmax(x1-0.65115162,0)^3-0.25901986*pmax(x1-0.92736774,0)^3+
0.06374433*x2+ 0.60885222*(x2==2)+0.38971577*(x2==3) }
<environment: 0x11b4568e8>
> ols.fun <- Function(f)
> pred1 <- Predict(f, x1=1, x2=3)
> pred1
x1 x2 yhat lower upper
1 1 3 1.862754 1.386107 2.339401
Response variable (y): sqrt(distance)
Limits are 0.95 confidence limits
# The "yhat" is the same as one produces with the extracted function
> ols.fun(x1=1, x2=3)
[1] 1.862754
(I have learned through experience that the restricted-cubic-spline fit functions coming from rms need to have spaces and carriage returns added to improve readability.)
Thinking long-term, you should probably take a look at the caret package. Many or most modeling functions work with data frames and matrices, others have a preference, and there may be other variations of their expectations. It's important to quickly get your head around each, but if you want a single wrapper that will simplify life for you, making the intricacies into a "black box", then caret is as close as you can get.
As a disclaimer: I do not use caret, as I don't think modeling should be a black box. I've sent more than a few emails to maintainers of modeling packages after looking into their code and seeing something amiss. Wrapping that in another layer would not serve my interests. So, in the very long run, avoid caret and develop an enjoyment for dissecting what goes into and out of the different modeling functions. :)
