SuperLearner Predict error - r

I am using the SuperLearner R package.
I am trying to generate predicted Y values for both the training and test sets.
I first fit a SuperLearner model without specifying "newX", so that I get predictions on the training set and can compute the MSE and plot predicted vs. actual Y values. I then use "predict" to generate Y values for the test set by running the following code:
sl.cv <- SuperLearner(Y = label, X = train,
                      SL.library = c("SL.randomForest", "SL.glmnet", "SL.svm"),
                      method = "method.NNLS", verbose = TRUE,
                      cvControl = list(V = 10))
pred.sl.cv <- predict(sl.cv, newdata = test, onlySL = TRUE)
Then, I get the following error after "predict":
"Error in object$whichScreen : $ operator is invalid for atomic vectors"
I browsed many online sources to learn how to use "predict" after fitting a SuperLearner model, and I am doing exactly what others do: passing the fitted SuperLearner object (here, "sl.cv") followed by the new test set. I never typed the $ operator myself.
Why am I getting this error message? How do I solve this problem?
Another question is: Does adding cvControl=list(V=10) as an option make any change? I think the default setting for SuperLearner model is to conduct 10-fold cross-validation. So, removing "cvControl=list(V=10)" will not change anything, right?
I would appreciate your advice. Thank you!

The problem is that you are passing matrices as your train and/or test data. You should use a data.frame instead, so change your code to the following:
sl.cv <- SuperLearner(Y = label, X = as.data.frame(train),
                      SL.library = c("SL.randomForest", "SL.glmnet", "SL.svm"),
                      method = "method.NNLS", verbose = TRUE,
                      cvControl = list(V = 10))
pred.sl.cv <- predict(sl.cv, newdata = as.data.frame(test), onlySL = TRUE)
Also, make sure your labels (Y) are a numeric vector.
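To see where a message like this can come from, here is a minimal base R sketch. It does not call SuperLearner at all; the object names (train_mat, train_df) are made up for illustration, and it only assumes that something inside predict ends up applying $ to a matrix.

```r
# A matrix is an atomic vector with dimensions, so `$` does not work on it.
set.seed(1)
train_mat <- matrix(rnorm(20), nrow = 5,
                    dimnames = list(NULL, c("a", "b", "c", "d")))

# Capture the error message instead of stopping the script.
msg <- tryCatch(train_mat$a, error = function(e) conditionMessage(e))
print(msg)

# After conversion, `$` works and the column names are preserved.
train_df <- as.data.frame(train_mat)
print(head(train_df$a))
```

Checking class(train) and class(test) before fitting is a quick way to confirm whether this applies to your data.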

Related

Prediction Intervals for R tidymodels Stacked model from stacks()

Is it possible to calculate prediction intervals from a tidymodels stacked model?
Working through the example from the stacks package yields the stacked frog model (tree_frogs_model_st) and the testing data:
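library(tidymodels)  # provides %>%, filter(), select(), initial_split()
library(stacks)      # provides the tree_frogs data
data("tree_frogs")
# keep only rows with an observed latency and drop unused columns
tree_frogs <- tree_frogs %>%
  filter(!is.na(latency)) %>%
  select(-c(clutch, hatched))
set.seed(1)
tree_frogs_split <- initial_split(tree_frogs)
tree_frogs_train <- training(tree_frogs_split)
tree_frogs_test <- testing(tree_frogs_split)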
I tried to run something like this:
pi <- predict(tree_frogs_model_st, tree_frogs_test, type = "pred_int")
but this gives an error:
Error in UseMethod("stack_predict") : no applicable method for 'stack_predict' applied to an object of class "NULL"
Reading the documentation of stacks() I also tried passing "pred_int" in the opts list:
pi <- predict(tree_frogs_model_st, tree_frogs_test, opts = list(type = "pred_int"))
but this just gives: opts is only used with type = raw and was ignored.
For reference, I am trying to do something similar to what is done in Ch. 19 of the Tidy Modeling with R book:
lm_fit <- fit(lm_wflow, data = Chicago_train)
predict(lm_fit, Chicago_test, type = "pred_int")
which seems to work fine for a single model fit like lm_fit, but apparently not for a stacked model?
Am I missing something? Is it not possible to calculate prediction intervals for stacked models for some reason?
This is very difficult to do.
Even if glmnet produced a prediction interval, it would be a significant underestimate since it doesn’t know anything about the error in each of the ensemble members.
We would have to get the standard error of prediction from all of the models to compute it for the stacking model. A lot of these models don’t/can’t generate that standard error.
The alternative is to use bootstrapping to get the interval, but you would have to bootstrap each model a large number of times to get the overall prediction interval.
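As a rough illustration of the bootstrapping idea, here is a base R sketch. It does not use stacks: the "ensemble" is a hypothetical stand-in (the plain average of two lm fits on simulated data), and the interval is a simple percentile interval that adds one resampled residual to each bootstrap prediction. Treat it as a sketch of the general approach under those assumptions, not as a validated method for a stacked model.

```r
set.seed(42)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(n)

# Hypothetical two-member ensemble: every member is refit on each resample.
fit_ensemble <- function(d) {
  list(m1 = lm(y ~ x1, data = d),
       m2 = lm(y ~ x1 + x2, data = d))
}

# Ensemble prediction = unweighted average of the member predictions.
predict_ensemble <- function(fits, newdata) {
  (predict(fits$m1, newdata) + predict(fits$m2, newdata)) / 2
}

new_obs <- data.frame(x1 = 0.5, x2 = -0.2)
B <- 500  # number of bootstrap resamples

boot_draws <- replicate(B, {
  d_b <- dat[sample(n, replace = TRUE), ]            # resample rows
  fits_b <- fit_ensemble(d_b)                        # refit all members
  resid_b <- d_b$y - predict_ensemble(fits_b, d_b)   # ensemble residuals
  # prediction for the new point plus one resampled residual as noise
  unname(predict_ensemble(fits_b, new_obs)) + sample(resid_b, 1)
})

# Percentile 90% interval for a new observation at new_obs
pred_int <- quantile(boot_draws, c(0.05, 0.95))
print(pred_int)
```

With a real stack, every member model would have to be refit inside the loop, which is where the computational cost mentioned above comes from.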

Solving error for the good beginning of the day

Let's consider the following data:
library(plm)
library(pglm)
data("EmplUK", package="plm")
I will add new column with 0 and 1 randomly placed. After that I want to perform logit random effects model.
df1<-EmplUK
#adding 0's and 1's
df1<-cbind(df1,'binary'=sample(0:1,1031,replace=T))
#Performing logit regression
pglm(binary~output+wage, data=df1, family=quasibinomial(link='logit'), start = NULL, model = 'random')
And the following problem occurs :
Error in maxRoutine(fn = logLik, grad = grad, hess = hess, start = start, :
argument "start" is missing, with no default
I'm not sure exactly what the reason is. I've read about this error, and it seems there are some problems when you try to estimate a 'within' model, but I get this error for every model type. Could you please help me find the cause of this error?
I don't think the quasibinomial family is set up in this function. Inside pglm there is a function pglm:::starting.values that looks for specific families:
"binomial"
"ordinal"
"poisson"
"negbin"
"gaussian"
"tobit"
Negative binomial allows for modelling of the variance, so that may suit your needs; otherwise binomial(link='logit') works fine if there's no evidence of overdispersion.
edit: happy to be corrected on this, I haven't worked with this package before :)
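A quick base R check of the family names involved. This does not call pglm; it assumes, following the answer above, that the lookup is done on the family name, and the vector supported simply copies the names listed above.

```r
# Family names the answer lists as handled by pglm:::starting.values
supported <- c("binomial", "ordinal", "poisson", "negbin", "gaussian", "tobit")

fam_quasi <- quasibinomial(link = "logit")
fam_binom <- binomial(link = "logit")

# The name stored in a family object is what a lookup by name would see.
print(fam_quasi$family)                 # "quasibinomial"
print(fam_quasi$family %in% supported)  # FALSE
print(fam_binom$family %in% supported)  # TRUE
```

If that assumption holds, no starting values are computed for quasibinomial, which would be consistent with the missing "start" argument in the error.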

Get test error in a logistic regression model in R

I'm running some experiments with logistic regression in R using the Auto dataset.
I've split the data into a training part (80%) and a test part (20%), normalizing each part individually.
I can create the model without any problem with the line:
mlr <- glm(mpg ~ displacement + horsepower + weight, data = train)
I can even predict train$mpg with the train set:
trainpred<-predict(mlr,train,type="response")
And with this I calculate the in-sample error:
etab <- table(trainpred, train[,1])
insampleerror<-sum(diag(etab))/sum(etab)
The problem comes when I want predict with the test set. I use the following line:
testpred <- predict(mlr, test, type = "response")
Which gives me this warning:
'newdata' had 79 rows but variables found have 313 rows
and it doesn't work, because testpred has the same length as trainpred (it should be shorter). When I try to calculate the test error using testpred with the following line:
etabtest <- table(testpred, test[,1])
I get the following error:
Error in table(testpred, test[, 1]) :
all arguments must have the same length
What am I doing wrong?
I'm answering my own question in case someone has the same problem:
The arguments to glm specify what I want to predict, namely the Auto$mpg labels restricted to the training rows, so my glm call must be:
attach(Auto)
mlr<-glm(mpg ~
displacement + horsepower + weight, data=Auto, subset=indexes_train)
Now when I call predict, table, etc., there is no longer any mismatch in object sizes. Fixing this mistake made it work for me.
As imo says:
"More importantly, you might check that this creates a logistic regression. I think it is actually OLS. You have to set the link and family arguments."
Set family = 'binomial'.
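For comparison, here is a minimal base R sketch of the whole train/test workflow with an actual logistic regression. The data are simulated as a stand-in for Auto (the column names and the binary outcome high_mpg are made up for illustration), so the numbers mean nothing; the point is only the shape of the calls.

```r
set.seed(1)
n <- 400
# Simulated stand-in for the Auto data (hypothetical values)
dat <- data.frame(horsepower = rnorm(n), weight = rnorm(n))
dat$high_mpg <- rbinom(n, 1, plogis(-1.0 * dat$horsepower - 1.5 * dat$weight))

# 80/20 split into training and test rows
idx <- sample(n, size = 0.8 * n)
train <- dat[idx, ]
test <- dat[-idx, ]

# family = binomial makes this a logistic regression rather than OLS.
# The formula uses bare column names, so predict() can find the same
# columns in newdata.
fit <- glm(high_mpg ~ horsepower + weight, data = train, family = binomial)

# Predicted probabilities for the test rows, then hard 0/1 labels
test_prob <- predict(fit, newdata = test, type = "response")
test_class <- as.integer(test_prob > 0.5)

# Confusion table and misclassification rate on the test set
conf_mat <- table(predicted = test_class, actual = test$high_mpg)
test_error <- mean(test_class != test$high_mpg)
print(conf_mat)
print(test_error)
```

Note that sum(diag(tab)) / sum(tab) on such a table is the accuracy; the error rate is one minus that.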

Trouble getting se.fit and confidence intervals using clmm2 from ordinal package

I'm using the clmm function from the ordinal package in R to fit cumulative mixed models to my data. It worked fine until I tried to get predicted probabilities: I can't get either standard errors or confidence intervals by specifying se.fit=TRUE and interval=TRUE. My model looks like this:
mod1<-clmm2(response~X0+X1+X2+X3+X4+X5+X7+X0*X2*X3+X2*X3*X4+X0:X4, random=X6,
data=df,link ="logistic", threshold ="flexible",
Hess=TRUE, nAGQ=7)
As you can see, there are a bunch of interactions in there (all important). I tried to create a dummy dataset to make my problem reproducible, but clmm can't achieve convergence with a simpler dataset. So I took the wine dataset included in the ordinal package and changed the formula to mimic my own (I don't think it makes any sense, though):
library(ordinal)
data(wine)
fm1 <- clmm2(rating ~ temp + contact+bottle+temp:contact:bottle+temp:contact+ temp:bottle+bottle:contact,random=judge, data=wine,link ="logistic", threshold ="flexible",
Hess=TRUE, nAGQ=7)
head(do.call("cbind", predict(fm1, se.fit=TRUE, interval=TRUE)))
And then I get this error:
Error in head(do.call("cbind", predict(fm1, se.fit = TRUE, interval = TRUE))) :
error in evaluating the argument 'x' in selecting a method for function 'head' : Error in do.call("cbind", predict(fm1, se.fit = TRUE, interval = TRUE)) : second argument must be a list
My guess is that predict doesn't even compute SEs and CIs in a case like this. Does anybody know why? Is there any way to get those values?
Thanks a lot!
The predict method for clmm2 objects does not offer std-errors. See its help page. This is in keeping with the usual practice of R package authors when dealing with mixed effects models.

Getting predicted classes from R glmnet object

I am trying to build simple multi-class logistic regression models using glmnet in R. However when I try to predict the test data and obtain contingency table I get an error. A sample session is reproduced below.
> mat = matrix(1:100,nrow=10)
> test = matrix(1:50,nrow=5)
> classes <- as.factor(11:20)
> model <- glmnet(mat, classes, family="multinomial", alpha=1)
> pred <- predict(model, test)
> table(pred, as.factor(11:15))
Error in table(pred, as.factor(11:15)) :
all arguments must have the same length
Any help will be appreciated. R noob here.
Thanks.
The predict method for a glmnet object takes an argument s, which specifies the value(s) of the regularization parameter for which you want predictions.
(glmnet fits the model for several values of this regularization parameter simultaneously.)
So if you don't specify a value for s, predict.glmnet returns predictions for all the values. If you want just a single set of predictions, you need to either set a value for s when you call predict, or you need to extract the relevant column after the fact.
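A small sketch of the first option, using the built-in iris data instead of the toy matrices from the question (every class there has a single observation, which is not a usable multinomial problem). The value s = 0.01 is an arbitrary choice for illustration, and type = "class" asks for predicted labels rather than linear predictors.

```r
library(glmnet)

set.seed(1)
x <- as.matrix(iris[, 1:4])   # glmnet expects a numeric matrix
y <- iris$Species             # factor with three classes
idx <- sample(nrow(x), 120)   # 120 training rows, 30 test rows

fit <- glmnet(x[idx, ], y[idx], family = "multinomial", alpha = 1)

# One value of s gives one column of predictions; type = "class"
# returns the predicted class label for each test row.
pred <- predict(fit, newx = x[-idx, ], s = 0.01, type = "class")

# Both arguments now have one entry per test row, so table() works.
print(table(predicted = as.vector(pred), actual = y[-idx]))
```

In practice s is usually chosen by cross-validation, for example with cv.glmnet, rather than fixed by hand.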
