Error in Bagging with party::cforest - r

I'm trying to bag conditional inference trees following the advice of Kuhn et al in 'Applied Predictive Modeling', Ch.8:
Conditional inference trees can also be bagged using the cforest function in the party package if the argument mtry is equal to the number of predictors:
library(party)
# The mtry parameter should be the number of predictors
# (the number of columns minus 1 for the outcome).
bagCtrl <- cforest_control(mtry = ncol(trainData) - 1)
baggedTree <- cforest(y ~ ., data = trainData, controls = bagCtrl)
Note there may be a typo in the above code (and also in the package's help file), as discussed here:
R package 'partykit' unused argument in ctree_control
However, when I try to replicate this code using a data frame (trainData in the code above is also a data frame) with more than one independent/predictor variable, I get an error, although it works fine with just one independent variable:
Some dummy code for simulations:
library(party)
df = data.frame(y = runif(5000), x = runif(5000), z = runif(5000))
bagCtrl <- cforest_control(mtry = ncol(df) - 1)
baggedTree_cforest <- cforest(y ~ ., data = df, control = bagCtrl)
The error message is:
Error: $ operator not defined for this S4 class
Thanks for any help.

As suggested, posting my comment from above as an answer, since it's a general R 'trick' when something that should work doesn't and the program has several libraries loaded:

What solved it was adding the party namespace explicitly to the function call, so party::cforest() instead of just cforest(). I've also got library(partykit) loaded in my actual program, which has its own cforest() function, and the error could be stemming from there, though the two functions are essentially the same.

caret::train() is another example where this often pops up.
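To make the fix concrete, here is the dummy example again with the namespace pinned explicitly (a minimal sketch of the workaround; note that controls, with an s, is party's spelling of the argument):
library(party)
library(partykit)  # also exports a cforest(), which can mask party's

df <- data.frame(y = runif(5000), x = runif(5000), z = runif(5000))

# build the control object and the forest from the same package so
# partykit's cforest() cannot be picked up by mistake
bagCtrl <- party::cforest_control(mtry = ncol(df) - 1)
baggedTree <- party::cforest(y ~ ., data = df, controls = bagCtrl)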

Related

Error in eval(parse()) - r unable to find argument input

I am very new to R, and this is my first time encountering the eval() function. I am trying to use the med and boot.med functions from the mma package to conduct mediation analysis. med and boot.med take in models such as linear models, plus data frames that specify mediators and predictors, and then estimate the mediation effect of each mediator.
The author of the package offers the flexible option of specifying one's own custom.function. From the source code of med, it can be seen that the custom.function string is passed to eval(). So I tried to insert the gbmt function as the custom function. However, R kept giving me the error message: Error during wrapup: Number of trees to be used in prediction must be provided. I have been searching online for days and have tried many ways of specifying the number-of-trees parameter n.trees, but nothing works (I believe others have raised similar issues: post 1, post 2).
The following lines are part of the source code of the med function:
cf1 = gsub("responseY", "y[,j]", custom.function[j])  # outcome column
cf1 = gsub("dataset123", "x2", cf1)                   # internal data frame
cf1 = gsub("weights123", "w", cf1)                    # case weights
full.model[[j]] <- eval(parse(text = cf1))            # build and run the call
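To see what that substitution produces, here is a small standalone sketch (using the glm example from the documentation and j = 1; the placeholder names follow the source snippet above):
custom.function <- 'glm(responseY~.,data=dataset123,family="quasibinomial",weights=weights123)'
cf1 <- gsub("responseY", "y[,1]", custom.function)
cf1 <- gsub("dataset123", "x2", cf1)
cf1 <- gsub("weights123", "w", cf1)
cf1
# [1] "glm(y[,1]~.,data=x2,family=\"quasibinomial\",weights=w)"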
One custom function example the author gives in the package documentation is as follows:
temp1<-med(data=data.bin,n=2,custom.function = 'glm(responseY~.,data=dataset123,family="quasibinomial",
weights=weights123)')
Here glm is the custom function. This example works, and you can replicate it easily (if you have mma installed and loaded). However, when I try to use the gbmt function on a survival object, I get errors. Here is what my code looks like:
temp1 <- med(data = data.surv, n = 2, type = "link",
             custom.function = 'gbmt(responseY ~ .,
                                     data = dataset123,
                                     distribution = dist,
                                     train_params = start_stop,
                                     cv_folds = 10,
                                     keep_gbm_data = TRUE)')
Does anyone have an idea how the number-of-trees argument n.trees can be added somewhere in the above code?
Many thanks in advance!
Update: in order to replicate the example code, please install mma and try the following:
library("mma")
data("weight_behavior") ##binary x #binary y
x=weight_behavior[,c(2,4:14)]
pred=weight_behavior[,3]
y=weight_behavior[,15]
data.bin<-data.org(x,y,pred=pred,contmed=c(7:9,11:12),binmed=c(6,10), binref=c(1,1),catmed=5,catref=1,predref="M",alpha=0.4,alpha2=0.4)
temp1<-med(data=data.bin,n=2) #or use self-defined final function
temp1<-med(data=data.bin,n=2, custom.function = 'glm(responseY~.,data=dataset123,family="quasibinomial",
weights=weights123)')
I changed the custom.function to gbmt and used a survival object as responseY, and the error occurs. When I use the gbmt function on my data outside the med function, there is no error.

Extracting predictions from a GAM model with splines and lagged predictors

I have some data and am trying to teach myself how to use lagged predictors within regression models. I'm currently trying to generate predictions from a generalized additive model that uses splines to smooth the data and contains lags.
Let's say I have the following data and have split the data into training and test samples.
head(mtcars)
Train <- sample(1:nrow(mtcars), ceiling(nrow(mtcars)*3/4), replace=FALSE)
Great, let's train the gam model on the training set.
f_gam <- gam(hp ~ s(qsec, bs="cr") + s(lag(disp, 1), bs="cr"), data=mtcars[Train,])
summary(f_gam)
When I go to predict on the holdout sample, I get an error message.
f_gam.pred <- predict(f_gam, mtcars[-Train,]); f_gam.pred
Error in ExtractData(object, data, NULL) :
'names' attribute [1] must be the same length as the vector [0]
Calls: predict ... predict.gam -> PredictMat -> Predict.matrix3 -> ExtractData
Can anyone help diagnose the issue and suggest a solution? I get that lag(__, 1) leaves a data point as NA and that is likely the reason for the lengths being different. However, I don't have a solution to the problem.
I'm going to assume you're using gam() from the mgcv library. It appears that gam() doesn't like functions that are not defined in "base" inside the s() terms. You can get around this by adding a column that contains the transformed variable and then modeling using that variable. For example:
tmtcars <- transform(mtcars, ldisp=lag(disp,1))
Train <- sample(1:nrow(mtcars), ceiling(nrow(mtcars)*3/4), replace=FALSE)
f_gam <- gam(hp ~ s(qsec, bs="cr") + s(ldisp, bs="cr"), data= tmtcars[Train,])
summary(f_gam)
predict(f_gam, tmtcars[-Train,])
works without error.
The problem appears to be coming from the mgcv:::get.var function. It tries to decode the terms with something like
eval(parse(text = txt), data, enclos = NULL)
and because they explicitly set the enclosure to NULL (which eval() interprets as baseenv()), variable and function names outside of base cannot be resolved. So because mean() is in the base package, this works
eval(parse(text="mean(x)"), data.frame(x=1:4), enclos=NULL)
# [1] 2.5
but because var() is defined in stats, this does not
eval(parse(text="var(x)"), data.frame(x=1:4), enclos=NULL)
# Error in eval(expr, envir, enclos) : could not find function "var"
and lag(), like var(), is defined in the stats package.
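As a small follow-up check (my own, not from the original answer): the :: operator itself lives in base, so a fully qualified name still resolves even with the enclosure set to NULL, confirming that this is purely a name-lookup problem:
# var() alone fails here, but base's `::` can fetch it from the
# stats namespace explicitly
eval(parse(text="stats::var(x)"), data.frame(x=1:4), enclos=NULL)
# [1] 1.666667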

predict in caret ConfusionMatrix is removing rows

I'm fairly new to using the caret library and it's causing me some problems. Any help/advice would be appreciated. My situation is as follows:
I'm trying to run a generalized linear model on some data and, when I run it through confusionMatrix, I get 'the data and reference factors must have the same number of levels'. I know what this error means (I've run into it before), but I've double- and triple-checked my data manipulation and it all looks correct (I'm using the right variables in the right places), so I'm not sure why the two arguments to confusionMatrix disagree. I've run almost exactly the same code for a different variable and it works fine.
I went through every variable and everything was balanced until I got to the predictions passed to confusionMatrix. I discovered this by doing the following:
a <- table(testing2$hold1yes0no)
a[1]+a[2]
1543
b <- table(predict(modelFit,trainTR2))
dim(b)
[1] 1538
Those two values shouldn't disagree. Where are the missing 5 rows?
My code is below:
set.seed(2382)
inTrain2 <- createDataPartition(y=HOLD$hold1yes0no, p = 0.6, list = FALSE)
training2 <- HOLD[inTrain2,]
testing2 <- HOLD[-inTrain2,]
preProc2 <- preProcess(training2[-c(1,2,3,4,5,6,7,8,9)], method="BoxCox")
trainPC2 <- predict(preProc2, training2[-c(1,2,3,4,5,6,7,8,9)])
trainTR2 <- predict(preProc2, testing2[-c(1,2,3,4,5,6,7,8,9)])
modelFit <- train(training2$hold1yes0no ~ ., method ="glm", data = trainPC2)
confusionMatrix(testing2$hold1yes0no, predict(modelFit,trainTR2))
I'm not sure, as I don't know your data structure, but I wonder if this is due to the way you set up modelFit using the formula method. In this case, you are specifying y = training2$hold1yes0no and x = everything else. Perhaps you should try:
modelFit <- train(trainPC2, training2$hold1yes0no, method="glm")
Which specifies y = training2$hold1yes0no and x = trainPC2.
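A further diagnostic, assuming the row mismatch comes from missing values (my own suggestion, not part of the answer above): predict.train drops incomplete rows by default (na.action = na.omit), so five NA-containing rows in the processed holdout set would produce exactly this 1543 vs 1538 gap. A quick check:
# count holdout rows that are incomplete after preprocessing;
# if this is 5, dropped NA rows explain the mismatch
sum(!complete.cases(trainTR2))

# compare the two lengths directly
length(predict(modelFit, trainTR2))
nrow(testing2)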

Naive Bayes in R

I am getting an error while running a naive Bayes classifier in R. I am using the following code:
mod1 <- naiveBayes(factor(X20) ~ factor(X1) + factor(X2) + factor(X3) + factor(X4) +
                     factor(X5) + factor(X6) + factor(X7) + factor(X8) + factor(X9) +
                     factor(X10) + factor(X11) + factor(X12) + factor(X13) + factor(X14) +
                     factor(X15) + factor(X16) + factor(X17) + factor(X18) + factor(X19),
                   data = intent.test)
res1 <- predict(mod1)$posterior
The first part of this code runs fine. But when I try to predict the posterior probabilities, it throws the following error:
Error in as.data.frame(newdata) :
  argument "newdata" is missing, with no default
I tried running something like
res1 <- predict(mod1,new_data=intent.test)$posterior
but this also gives the same error.
You seem to be using the e1071::naiveBayes algorithm, which expects a newdata argument for prediction, hence the two errors raised when running your code. (You can check the source code of the predict.naiveBayes function on CRAN; the second line in the code expects a newdata, as newdata <- as.data.frame(newdata).) Also, as pointed out by @Vincent, you're better off converting your variables to factors before calling the NB algorithm, although this has certainly nothing to do with the above errors.
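So with e1071 the immediate fix is to spell the argument correctly (newdata, not new_data) and to request the posterior probabilities via type = "raw", since predict.naiveBayes does not return a $posterior component:
# class predictions on the same data used for training
pred.class <- predict(mod1, newdata = intent.test)
# posterior probabilities come back as a matrix with type = "raw"
pred.post <- predict(mod1, newdata = intent.test, type = "raw")
head(pred.post)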
Using NaiveBayes from the klaR package, no such problem would happen. E.g.,
data(spam, package="ElemStatLearn")
library(klaR)
# set up a training sample
train.ind <- sample(1:nrow(spam), ceiling(nrow(spam)*2/3), replace=FALSE)
# apply NB classifier
nb.res <- NaiveBayes(spam ~ ., data=spam[train.ind,])
# predict on holdout units
nb.pred <- predict(nb.res, spam[-train.ind,])
# but this also works on the training sample, i.e. without using a `newdata`
head(predict(nb.res))
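And since the original goal was the posterior probabilities: unlike e1071, klaR's predict() returns a list with a posterior component, so they can be pulled out directly:
# posterior class probabilities for the holdout units
head(nb.pred$posterior)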

Predicting with lm object in R - black box paradigm

I have a function that returns an lm object. I want to produce predicted values based on some new data. The new data is a data.frame in exactly the same format as the data passed to the lm function, except that the response has been removed (since we're predicting, not training). I would expect to execute the following, but I get an error:
predict( model , newdata )
"Error in eval(expr, envir, enclos) : object 'ModelResponse' not found"
In my case, ModelResponse was the name of the response column in the data I originally trained on. So just for kicks, I tried to insert an NA response:
newdata$ModelResponse = NA
predict( model , newdata )
Error in terms.default(object, data = data) : no terms component nor attribute
Highly frustrating! R's notion of models/regression doesn't match mine:
1. I train a model with some data and get a model object.
2. I can score new data from any environment/function/frame/etc. so long as I input data into the model object that "looks like" the data I trained on (i.e. same column names). This is a standard black-box paradigm.
So here are my questions:
1. What concept(s) am I missing here?
2. How do I get my scenario to work?
3. How can I get the model object to be portable? str(model) shows me that the model object saved the original data it trained on! So the model object is massive. I want my model to be portable to any function/environment/etc. and to contain only the data it needs to score.
In the absence of str() on either the model or the data offered to the model, here's my guess regarding this error message:
predict( model , newdata )
"Error in eval(expr, envir, enclos) : object 'ModelResponse' not found"
I guess that you made a model object named "model", that your outcome variable (the left-hand side of the formula in the original call to lm) was named "ModelResponse", and that you then named a column in newdata by the same name. But what you should have done was leave out the "ModelResponse" column (because that is what you are predicting) and put in the "Model_Predictor1", "Model_Predictor2", etc. columns, i.e. all the names on the right-hand side of the formula given to lm().
The coef() function will allow you to extract the information needed to make the model portable.
mod.coef <- coef(model)
mod.coef
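As an aside (my own sketch, not part of the original answer): once you have the coefficient vector, you can score new rows by hand with a model matrix, which keeps only the numbers the model needs. The predictor names below are the hypothetical ones used above:
# manual scoring from the coefficients alone; assumes a plain lm fit
# with an intercept and the hypothetical predictors named above
X <- model.matrix(~ Model_Predictor1 + Model_Predictor2, data = newdata)
pred.manual <- drop(X %*% mod.coef)  # should match predict(model, newdata)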
Since you expressed interest in the Function() from the rms/Hmisc package combo, here it is, using the help example from ols and comparing the output of the extracted function with the rms Predict method. Note the capitals, since these are designed to work with the package equivalents of lm, glm(..., family="binomial"), and coxph, which in rms become ols, lrm, and cph.
> set.seed(1)
> x1 <- runif(200)
> x2 <- sample(0:3, 200, TRUE)
> distance <- (x1 + x2/3 + rnorm(200))^2
> d <- datadist(x1,x2)
> options(datadist="d") # No d -> no summary, plot without giving all details
>
>
> f <- ols(sqrt(distance) ~ rcs(x1,4) + scored(x2), x=TRUE)
>
> Function(f)
function(x1 = 0.50549065,x2 = 1) {0.50497361+1.0737604* x1-
0.79398383*pmax(x1-0.083887788,0)^3+ 1.4392827*pmax(x1-0.38792825,0)^3-
0.38627901*pmax(x1-0.65115162,0)^3-0.25901986*pmax(x1-0.92736774,0)^3+
0.06374433*x2+ 0.60885222*(x2==2)+0.38971577*(x2==3) }
<environment: 0x11b4568e8>
> ols.fun <- Function(f)
> pred1 <- Predict(f, x1=1, x2=3)
> pred1
x1 x2 yhat lower upper
1 1 3 1.862754 1.386107 2.339401
Response variable (y): sqrt(distance)
Limits are 0.95 confidence limits
# The "yhat" is the same as one produces with the extracted function
> ols.fun(x1=1, x2=3)
[1] 1.862754
(I have learned through experience that the restricted cubic-spline fit functions coming from rms need spaces and carriage returns added to improve readability.)
Thinking long-term, you should probably take a look at the caret package. Many or most modeling functions work with data frames and matrices, others have a preference, and there may be other variations of their expectations. It's important to quickly get your head around each, but if you want a single wrapper that will simplify life for you, making the intricacies into a "black box", then caret is as close as you can get.
As a disclaimer: I do not use caret, as I don't think modeling should be a black box. I've had more than a few emails to maintainers of modeling packages resulting from looking into their code and seeing something amiss. Wrapping that in another layer would not serve my interests. So, in the very long run, avoid caret and develop an enjoyment for dissecting what goes into and comes out of the different modeling functions. :)
