Unable to compute variable importance in R

I'm trying to assess variable importance before running the actual regression, but when I attempt to do so, I get this error:
Error in varImp(regressor, scale = FALSE) :
trying to get slot "responses" from an object (class "randomForest.formula") that is not an S4 object
I've tried looking up the error, but there wasn't much information available online. What can I do to fix this?
all = read.csv('https://raw.githubusercontent.com/bandcar/massShootings/main/all.csv')
# Check Variable importance with randomForest
regressor <- randomForest::randomForest(total_victims ~ ., data = all, importance = TRUE) # fit the random forest with default parameters
caret::varImp(regressor, scale = FALSE) # get variable importance, based on mean decrease in accuracy
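As a side note, the importance scores can also be read directly from the randomForest fit itself, without going through caret::varImp; a minimal sketch, assuming the fit above succeeded:
randomForest::importance(regressor)  # importance matrix (mean decrease in accuracy / node impurity)
randomForest::varImpPlot(regressor)  # dot chart of the same scores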

Related

ggcoef_model error with two random intercepts

When trying to graph the conditional fixed effects of a glmmTMB model with two random intercepts in GGally I get the error:
There was an error calling "tidy_fun()". Most likely, this is because the
function supplied in "tidy_fun=" was misspelled, does not exist, is not
compatible with your object, or was missing necessary arguments (e.g. "conf.level=" or "conf.int="). See error message below.
Error: Error in "stop_vctrs()":
! Can't recycle "..1" (size 3) to match "..2" (size 2).
I have tinkered with figuring out the issue and it seems to be related to the two random intercepts included in the model. I have also tried extracting the coefficient and standard error information separately through broom.mixed::tidy and then feeding the data frame into GGally::ggcoef(), to no avail. Any suggestions?
# Example with built-in randu data set
library(glmmTMB)
library(GGally)   # provides ggcoef_model()
data(randu)
randu$A <- factor(rep(c(1,2), 200))
randu$B <- factor(rep(c(1,2,3,4), 100))
# Model
test <- glmmTMB(y ~ x + z + (0 + x|A) + (1|B), family = "gaussian", data = randu)
# A few of my attempts at graphing; these work fine when only one random-effects term is in the model
ggcoef_model(test)
ggcoef_model(test, tidy_fun = broom.mixed::tidy)
ggcoef_model(test, tidy_fun = broom.mixed::tidy, conf.int = T, intercept=F)
ggcoef_model(test, tidy_fun = broom.mixed::tidy(test, effects="fixed", component = "cond", conf.int = TRUE))
There are some (old!) bugs that have recently been fixed (here, here) that would make confidence interval reporting on RE parameters break for any model with multiple random terms (I think). I believe that if you are able to install updated versions of both glmmTMB and broom.mixed:
remotes::install_github("glmmTMB/glmmTMB/glmmTMB#ci_tweaks")
remotes::install_github("bbolker/broom.mixed")
then ggcoef_model(test) will work.

Error when calculating variable importance with categorical variables using the caret package (varImp)

I've been trying to compute the variable importance for a model with mixed scale features using the varImp function in the caret package. I've tried a number of approaches, including renaming and coding my levels numerically. In each case, I am getting the following error:
Error in auc3_(actual, predicted, ranks) :
Not compatible with requested type: [type=character; target=double].
The following dummy example should illustrate my point (edited to reflect @StupidWolf's correction):
library(caret)
#create small dummy dataset
set.seed(124)
dummy_data = data.frame(Label = factor(sample(c("a","b"),40, replace = TRUE)))
dummy_data$pred1 = ifelse(dummy_data$Label=="a",rnorm(40,-.5,2),rnorm(40,.5,2))
dummy_data$pred2 = factor(ifelse(dummy_data$Label=="a",rbinom(40,1,0.3),rbinom(40,1,0.7)))
# check varImp
control.lvq <- caret::trainControl(method="repeatedcv", number=10, repeats=3)
model.lvq <- caret::train(Label ~ ., data = dummy_data,
                          method = "lvq", preProcess = "scale", trControl = control.lvq)
varImp.lvq <- caret::varImp(model.lvq, scale=FALSE)
The issue persists when using different models (like randomForest and SVM).
If anyone knows a solution or can tell me what is going wrong, I would highly appreciate that.
Thanks!
When you call varImp on an lvq model, it defaults to filterVarImp() because there is no model-specific variable importance method for lvq. Now if you check the help page:
For two class problems, a series of cutoffs is applied to the
predictor data to predict the class. The sensitivity and specificity
are computed for each cutoff and the ROC curve is computed.
Now if you read the source code of varImp.train(), the data it feeds into filterVarImp() is the original data frame, not whatever comes out of the preProcess step.
This means that if the original data contain a factor variable, filterVarImp() cannot apply cutoffs to that variable and will throw an error like this:
filterVarImp(data.frame(dummy_data$pred2),dummy_data$Label)
Error in auc3_(actual, predicted, ranks) :
Not compatible with requested type: [type=character; target=double].
So, using my example, and as you have pointed out, you need to one-hot encode it:
set.seed(111)
dummy_data = data.frame(Label = rep(c("a","b"),each=20))
dummy_data$pred1 = rnorm(40,rep(c(-0.5,0.5),each=20),2)
dummy_data$pred2 = rbinom(40,1,rep(c(0.3,0.7),each=20))
dummy_data$pred2 = factor(dummy_data$pred2)
control.lvq <- caret::trainControl(method="repeatedcv", number=10, repeats=3)
ohe_data = data.frame(
  Label = dummy_data$Label,
  model.matrix(Label ~ 0 + ., data = dummy_data))
model.lvq <- caret::train(Label ~ ., data = ohe_data,
                          method = "lvq", preProcess = "scale",
                          trControl = control.lvq)
caret::varImp(model.lvq, scale=FALSE)
ROC curve variable importance
       Importance
pred1      0.6575
pred20     0.6000
pred21     0.6000
If you use a model that doesn't have a specific variable importance method, one option is to calculate the variable importance first and run the model after that.
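For instance, a minimal sketch of that workflow, reusing the one-hot-encoded ohe_data built above:
# ROC-based filter importance, computed before any model is fitted
caret::filterVarImp(x = ohe_data[, -1], y = factor(ohe_data$Label))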
Note that this problem can be circumvented by replacing ordinal features (with d levels) by their (d-1)-dimensional indicator encoding:
model.matrix(~ dummy_data$pred2 - 1)[, 1:(length(levels(dummy_data$pred2)) - 1)]
However, why does varImp not handle this automatically? Further, this has the drawback that it yields an importance score for each of the d-1 indicators, not one unified importance score for the original feature.

Automate Selection of Optimum GARCH model in R

I've created 7 GARCH models in R (CIGarch1, CIGarch2, etc.) using the functions ugarchspec and ugarchfit. I would like to automate the selection of the GARCH model with the lowest Akaike score and have that model's name feed into my ugarchboot call. I tried the following code:
# Create Akaike value table to find model with minimum score ----
CIGarch_Akaike_tbl <- data.frame(
  model = c("CIGarch1","CIGarch2","CIGarch3","CIGarch4","CIGarch5","CIGarch6","CIGarch7"),
  akaike_score = c(CIGarch1_Akaike, CIGarch2_Akaike, CIGarch3_Akaike, CIGarch4_Akaike,
                   CIGarch5_Akaike, CIGarch6_Akaike, CIGarch7_Akaike))
# Looks for minimum Akaike value and returns the model name to be used in the following predict function
min_Akaike_Garch_model <- as.character(CIGarch_Akaike_tbl[which(CIGarch_Akaike_tbl$akaike_score==min(CIGarch_Akaike_tbl$akaike_score)),1])
# Predicting stock price with ugarchboot ----
# Update model based on best_Garch_model
CIpredict <- ugarchboot(min_Akaike_Garch_model, n.ahead = 10,
method = c("Partial", "Full")[1])
I get the following error message:
"Error in UseMethod("ugarchboot") :
no applicable method for 'ugarchboot' applied to an object of class "character".
I've tried other as.#### functions, but to no avail. Is there a way to do what I'm trying to do?
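One possible fix (a sketch, assuming the seven fitted objects still exist in the workspace under those names) is to convert the selected name back into the fitted object before calling ugarchboot, since ugarchboot dispatches on the fit object itself rather than on its name:
best_fit <- get(min_Akaike_Garch_model)  # look the winning fit up by its name
CIpredict <- ugarchboot(best_fit, n.ahead = 10, method = c("Partial", "Full")[1])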

Extracting predictions from a GAM model with splines and lagged predictors

I have some data and am trying to teach myself how to use lagged predictors within regression models. I'm currently trying to generate predictions from a generalized additive model that uses splines to smooth the data and contains lags.
Let's say I have the following data and have split the data into training and test samples.
head(mtcars)
Train <- sample(1:nrow(mtcars), ceiling(nrow(mtcars)*3/4), replace=FALSE)
Great, let's train the gam model on the training set.
f_gam <- gam(hp ~ s(qsec, bs="cr") + s(lag(disp, 1), bs="cr"), data=mtcars[Train,])
summary(f_gam)
When I go to predict on the holdout sample, I get an error message.
f_gam.pred <- predict(f_gam, mtcars[-Train,]); f_gam.pred
Error in ExtractData(object, data, NULL) :
'names' attribute [1] must be the same length as the vector [0]
Calls: predict ... predict.gam -> PredictMat -> Predict.matrix3 -> ExtractData
Can anyone help diagnose the issue and help with a solution? I get that lag(__,1) leaves a data point as NA, and that is likely the reason for the lengths being different. However, I don't have a solution to the problem.
I'm going to assume you're using gam() from the mgcv library. It appears that gam() doesn't like functions that are not defined in "base" inside the s() terms. You can get around this by adding a column which includes the transformed variable and then modelling using that variable. For example:
tmtcars <- transform(mtcars, ldisp=lag(disp,1))
Train <- sample(1:nrow(mtcars), ceiling(nrow(mtcars)*3/4), replace=FALSE)
f_gam <- gam(hp ~ s(qsec, bs="cr") + s(ldisp, bs="cr"), data= tmtcars[Train,])
summary(f_gam)
predict(f_gam, tmtcars[-Train,])
works without error.
The problem appears to be coming from the mgcv:::get.var function. It tries to decode the terms with something like
eval(parse(text = txt), data, enclos = NULL)
and because they explicitly set the enclosure to NULL, variable and function names outside of base cannot be resolved. So because mean() is in the base package, this works
eval(parse(text="mean(x)"), data.frame(x=1:4), enclos=NULL)
# [1] 2.5
but because var() is defined in stats, this does not
eval(parse(text="var(x)"), data.frame(x=1:4), enclos=NULL)
# Error in eval(expr, envir, enclos) : could not find function "var"
and lag(), like var(), is defined in the stats package.
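For contrast, with eval()'s default enclosure (parent.frame()) the same lookup succeeds, which shows that it is really the enclos = NULL setting that breaks the resolution:
eval(parse(text = "var(x)"), data.frame(x = 1:4))
# [1] 1.666667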

Issues with predict function when building a CART model via CrossValidation using the train command

I am trying to build a CART model via cross-validation using the train function of the "caret" package.
My data is a 4500 x 110 data frame in which all the predictor variables (except the first two, UserId and YOB (Year of Birth), which I am not using for model building) are factors with 2 levels, except the dependent variable, which is of type integer (although it has only two values, 1 and 0). Gender is one of the independent variables.
When I ran the rpart command to get a CART model (using the package "rpart"), I didn't have any problem with the predict function. However, I wanted to improve the model via cross-validation, so I used the train function from the package "caret" with the following command:
tr = train(y ~ ., data = subImpTrain, method = "rpart", trControl = tr.control, tuneGrid = cp.grid)
This built the model with the following warning:
Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
But it did give me a final model (best.tree). However, when I try to run the predict function using the following command:
best.tree.pred = predict(best.tree, newdata = subImpTest)
on the test data, it is giving me the following error:
Error in eval(expr, envir, enclos) : object 'GenderMale' not found
The Gender variable has two values: Female, Male
Can anybody help me understand the error?
As @lorelai suggested, caret dummy-codes your variables if you supply it a formula. An alternative is to provide it the variables themselves, like so:
tr = train(y = subImpTrain$y, x = subImpTrain[, setdiff(names(subImpTrain), "y")],
           method = "rpart", trControl = tr.control, tuneGrid = cp.grid)
More importantly, however, you shouldn't use predict.rpart; use predict.train instead, like so:
predict(tr, subImpTest)
In that case it works just fine with the formula interface.
I have had a similar problem in the past, although concerning another algorithm.
Basically, some algorithms transform the factor variables into dummy variables and rename them accordingly.
My solution was to create my own dummies and leave them in numerical format.
I read that decision trees manage to work properly even so.
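As a rough sketch of that approach (using the Gender variable from the question for illustration), you could build the numeric indicator by hand so that the same column, with the same name, exists in both the training and test frames:
subImpTrain$GenderMale <- as.numeric(subImpTrain$Gender == "Male")
subImpTest$GenderMale  <- as.numeric(subImpTest$Gender == "Male")
# drop the original factor so only the numeric indicator is used
subImpTrain$Gender <- NULL
subImpTest$Gender  <- NULL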
