How can I get the R iml FeatureImp() function to work?

I am trying to get the FeatureImp function from the iml package to work, but it keeps throwing an error. Below is an example using the diamonds dataset, on which I train a random forest model.
library(iml)
library(caret)
library(randomForest)
data(diamonds)
# create some binary classification target (without specific meaning)
diamonds$target <- as.factor(ifelse(diamonds$color %in% c("D", "E", "F"), "X", "Y"))
# drop categorical variables (to keep it simple for demonstration purposes)
diamonds <- subset(diamonds, select = -c(color, clarity, cut))
# train model
mdl_diamonds <- train(target ~ ., method = "rf", data = diamonds)
# create iml predictor
x_pred <- Predictor$new(model = mdl_diamonds, data = diamonds[, 1:7], y = diamonds$target, type = "prob")
# calculate feature importance
x_imp <- FeatureImp$new(x_pred, loss = "mae")
This ends with the following error:
Error in if (self$original.error == 0) { :
missing value where TRUE/FALSE needed
In addition: Warning message:
In Ops.factor(actual, predicted) : ‘-’ not meaningful for factors
I don't understand what I'm doing wrong. Can anyone give me a clue?
I'm working on R version 3.5.1, iml package version 0.9.0.

I have found the problem. I was using "mae" as the loss function, which (as I could have known) is not applicable to a classification target. Using "ce" or "f1" returns output as expected.
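With a classification loss, the call from the question runs as expected:
x_imp <- FeatureImp$new(x_pred, loss = "ce")  # "f1" works too
plot(x_imp)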

Because you trained a classification model, a regression loss like "mae" doesn't apply; try loss = 'ce' (classification error).

Related

Changing glmnet from binomial to multinomial gives error

I'm trying to adapt code for a binomial glmnet to make it work for a multinomial problem, but for some reason I keep getting an error.
Here's the original code for the binomial model that works perfectly:
library(splitstackshape)  # stratified()
library(dplyr)            # anti_join()
library(glmnet)
traininglasso <- stratified(sp_lasso, group = "Cat",
                            select = list(Cat = c("A", "B", "C")),
                            size = c(86), replace = FALSE)
traininglasso[, Cat := factor(Cat, labels = c("B", "B", "C"))]
check_lasso <- anti_join(sp_lasso, traininglasso, by = c("Accepted Symbol"))
check_lasso[, Cat := factor(Cat, labels = c("B", "B", "C"))]
use_for_lasso <- within(for_lasso, Cat <- relevel(Cat, ref = "C"))
lassod <- model.matrix(Cat ~ ., use_for_lasso)[, -1]
cv.lassod <- cv.glmnet(lassod, use_for_lasso$Cat, alpha = 1, family = "binomial")
lambdad <- cv.lassod
lasso_modeld <- glmnet(lassod, use_for_lasso$Cat, alpha = 1, family = "binomial",
                       lambda = lambdad$lambda.1se)
coefd <- coef(lasso_modeld)
check_lasso_matrix <- model.matrix(Cat ~ ., check_lasso)[, -1]
probslasso4 <- as.data.frame(predict.glmnet(lasso_modeld, type = "response",
                                            newx = check_lasso_matrix))
Sorry that it's so wordy, but basically my steps are these:
1. Conduct a stratified random sample of the original dataset to get 86 observations of each of three categories (Cat): "A", "B", and "C".
2. Join categories A and B together so that the outcome is binary (two categories, just B and C).
3. Assemble all observations not in the random sample to use for checking model accuracy at the end, and recategorize those as well.
4. Run the steps for a LASSO GLM as recommended.
5. In the last line, generate predictions for checking the accuracy of the model using the non-training data.
Again, all of this works perfectly fine. However, when I leave my data as three categories and change the family to multinomial (those are quite literally the only changes I've made in the code below, everything else including the data is the same) I get this error message:
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 'as.matrix': requires numeric/complex matrix/vector arguments
I've read about other people getting this error and simply needing to reformat their matrices, but I suspect that's not my issue since the binomial code works with the matrix I used for that.
Here's the code that I've tried for the multinomial version that isn't working. I ran the entire code chunk above again, but I'm only including here the 4 lines that I edited to go from binomial to multinomial:
traininglasso[, Cat := factor(Cat, labels = c("A", "B", "C"))]
check_lasso[, Cat := factor(Cat, labels = c("A", "B", "C"))]
cv.lassod <- cv.glmnet(lassod, use_for_lasso$Cat, alpha = 1, family = "multinomial")
lasso_modeld <- glmnet(lassod, use_for_lasso$Cat, alpha = 1, family = "multinomial",
                       lambda = lambdad$lambda.1se)
Figured it out. When you create a multinomial model with glmnet, you need to use the generic predict() function instead of calling predict.glmnet() directly: a multinomial fit comes back with class "multnet", and the generic dispatches to the method that understands its per-class coefficient structure. Using the same model matrix for both multinomial and binomial models works fine; the error actually has nothing to do with the matrix format.
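With that change, the last prediction line becomes (a sketch, reusing the objects defined above; for a multinomial fit, type = "response" returns an array with one probability column per class):
probs_multi <- predict(lasso_modeld, newx = check_lasso_matrix, type = "response")
# drop the lambda dimension (a single lambda was supplied at fit time)
probslasso4_multi <- as.data.frame(probs_multi[, , 1])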

How to write custom predict function for classification model in R?

I am trying to use the flashlight package with the h2o package. An example of doing this with a regression model can be found here. I am trying to make it work for a classification model instead, following the example given in the link: flashlight will work with h2o if you provide your own custom predict function. However, the predict function in the example below does not work for classification.
Here is the code I'm using:
library(flashlight)
library(h2o)
h2o.init()
h2o.no_progress()
iris_hf <- as.h2o(iris)
iris_dl <- h2o.deeplearning(x = 1:4, y = "Species", training_frame = iris_hf, seed=123456)
pred_fun <- function(mod, X) as.vector(unlist(h2o.predict(mod, as.h2o(X))))
fl_NN <- flashlight(model = iris_dl, data = iris, y = "Species", label = "NN",
predict_function = pred_fun)
But when I try to check the importance or interactions, I get an error. For example:
light_interaction(fl_NN, type = "H",
pairwise = TRUE)
This throws back the error:
Error: Assigned data predict(x, data = X[, cols, drop = FALSE]) must
be compatible with existing data. Existing data has 22500 rows.
Assigned data has 90000 rows. ℹ Only vectors of size 1 are recycled.
I need to change the predict function somehow to make it work, but I have had no success yet. Any suggestions as to how I could change it?
EDIT UPDATE: So, I found a custom predict function that works with the light_interaction function. That is:
pred_fun <- function(mod, X) as.vector(unlist(h2o.predict(mod, as.h2o(X))[,2]))
Where the above is indexed for the specific category. However, the above doesn't work for calculating the importance. For example:
light_importance(fl_NN)
Gives the error:
Warning messages:
1: In Ops.factor(actual, predicted) : ‘-’ not meaningful for factors
2: In Ops.factor(actual, predicted) : ‘-’ not meaningful for factors
3: In Ops.factor(actual, predicted) : ‘-’ not meaningful for factors
4: In Ops.factor(actual, predicted) : ‘-’ not meaningful for factors
5: In Ops.factor(actual, predicted) : ‘-’ not meaningful for factors
So, I'm still trying to figure this out!
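As in the iml question above, these warnings come from a regression-style loss being compared against a factor response. One way out (a sketch, not from the original thread) is to make the target numeric so it matches the numeric probability returned by the custom predict function. It is assumed here that h2o.predict() returns the predicted label in column 1 and the per-class probabilities in the following columns, so column 3 would be P(versicolor):
# recode the target as 0/1 for the class being scored
iris2 <- transform(iris, is_versicolor = as.numeric(Species == "versicolor"))
# return only the assumed P(versicolor) column as a plain numeric vector
pred_fun2 <- function(mod, X) as.vector(unlist(h2o.predict(mod, as.h2o(X))[, 3]))
fl_NN2 <- flashlight(model = iris_dl, data = iris2, y = "is_versicolor",
                     label = "NN", predict_function = pred_fun2)
light_importance(fl_NN2)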

Make predictions on new data after training the GLM Lasso model

I have trained a classification model on 13,000 rows of labels with lasso in R's glmnet library. I have checked its accuracy and it looks decent; now I want to make predictions for the rest of the dataset, which is 300,000 rows. My approach is to label the rest of the rows using the trained model. I'm not sure if that's the most effective strategy for approximate labeling.
But when I try to label the rest of the data, I run into this error:
Error in asMethod(object) : Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105
Even if I break the dataset down to 5,000 rows for prediction, I still get the same error.
Here's my code:
library(glmnet)
library(quanteda)  # corpus(), dfm(), dfm_trim()
library(dplyr)     # filter(), mutate()
# the subset of the original dataset
data.text <- data.text_filtered %>% filter(!label1 == "NA")
# quanteda corpus
data_corpus <- corpus(data.text$text, docvars = data.frame(labels = data.text$label1))
set.seed(1234)
dataShuffled <- corpus_sample(data_corpus, size = 12845)
dataDfm <- dfm_trim(dfm(dataShuffled, verbose = FALSE), min_termfreq = 10)
# model to train the classifier
lasso <- cv.glmnet(x = dataDfm[1:10000, ], y = trainclass[1:10000],
                   alpha = 1, nfolds = 5, family = "binomial")
# plot the cross-validation curve
plot(lasso)
# predictions on the held-out part of the labelled data
dataPreds <- predict(lasso, dataDfm[10001:12845, ], type = "class")
(movTable <- table(dataPreds, docvars(dataShuffled, "labels")[10001:12845]))
# make predictions on the rest of the dataset (300,000 rows)
data.text_NAs <- data.text_filtered %>% filter(label1 == "NA")
data_NADfm <- dfm_trim(dfm(corpus(data.text_NAs$text), verbose = FALSE), min_termfreq = 10)
data.text_filtered <- data.text_filtered %>%
  mutate(label = predict(lasso, as.matrix(data_NADfm), type = "class", s = "lambda.1se"))
Thanks much for any help.
The problem lies in as.matrix(data_NADfm): this converts the dfm into a dense matrix, which makes it too large to handle.
Solution: keep it sparse. Either remove the as.matrix() wrapper, or, if predict() does not like a raw dfm input, coerce it to a plain sparse matrix (from the Matrix package) using as(data_NADfm, "dgCMatrix"). This should be fine, since both cv.glmnet() and its predict() method can handle sparse matrix inputs.
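A sketch of the fixed prediction step, reusing the objects from the question:
library(Matrix)
# coerce the dfm to a plain sparse matrix instead of densifying it
data_NAsparse <- as(data_NADfm, "dgCMatrix")
data.text_filtered <- data.text_filtered %>%
  mutate(label = predict(lasso, data_NAsparse, type = "class", s = "lambda.1se"))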

Fail to predict woe in R

I used the following to compute WoE with the woe package:
library("woe")
woe.object <- woe(data, Dependent = "target", FALSE,
                  Independent = "shop_id", C_Bin = 20, Bad = 0, Good = 1)
Then I want to predict WoE for the test data:
test.woe <- predict(woe.object, newdata = test, replace = TRUE)
And it gives me an error:
Error in UseMethod("predict") :
no applicable method for 'predict' applied to an object of class "data.frame"
Any suggestions please?
For prediction, you cannot do it with the woe package; you need to use the klaR package. Take note of the masking of the function woe, see below:
# let's say woe was loaded first and then klaR, so klaR::woe masks woe::woe
library(klaR)
data <- data.frame(target = sample(0:1, 100, replace = TRUE),
                   shop_id = sample(1:3, 100, replace = TRUE),
                   another_var = sample(letters[1:3], 100, replace = TRUE))
# make sure both dependent and independent variables are factors
data$target <- factor(data$target)
data$shop_id <- factor(data$shop_id)
data$another_var <- factor(data$another_var)
You need two or more independent variables:
woemodel <- klaR::woe(target ~ shop_id + another_var,
                      data = data)
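With this model, the prediction step from the question works through klaR's predict method for woe objects:
# replace the factor levels in new data with their WoE values
test.woe <- predict(woemodel, newdata = data, replace = TRUE)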
If you only provide one, you have an error:
woemodel <- klaR::woe(target ~ shop_id,
                      data = data)
Error in woe.default(x, grouping, weights = weights, ...) :
All factors with unique levels. No woes calculated!
In addition: Warning message:
In woe.default(x, grouping, weights = weights, ...) :
Only one single input variable. Variable name in resulting object$woe is only conserved in formula call.
If you want to predict the dependent variable with only one independent variable, something like logistic regression will work:
mdl <- glm(target ~ shop_id, data = data, family = "binomial")
prob <- predict(mdl, data, type = "response")
# glm predicts P(target = second factor level), and R indexing starts at 1
predicted_label <- ifelse(prob > 0.5, levels(data$target)[2], levels(data$target)[1])

Calculating prediction accuracy of a tree using rpart's predict method

I have constructed a decision tree for a dataset using rpart.
I divided the data into two parts, a training dataset and a test dataset, and built the tree on the training data. I want to calculate the accuracy of the predictions based on the model that was created.
My code is shown below:
library(rpart)
# read the data
data <- read.table("source")
names(data) <- c("a", "b", "c", "d", "class")
# generate test and train data: selected randomly with an 80/20 split
trainIndex <- sample(1:nrow(data), 0.8 * nrow(data))
train <- data[trainIndex, ]
test <- data[-trainIndex, ]
# tree construction based on information gain
tree <- rpart(class ~ a + b + c + d, data = train, method = 'class',
              parms = list(split = "information"))
I now want to calculate the accuracy of the predictions generated by the model by comparing them with the actual values in the test data; however, I am facing an error while doing so.
My code is shown below:
t_pred = predict(tree,test,type="class")
t = test['class']
accuracy = sum(t_pred == t)/length(t)
print(accuracy)
I get an error message that states:
Error in t_pred == t : comparison of these types is not implemented
In addition: Warning message:
Incompatible methods ("Ops.factor", "Ops.data.frame") for "=="
On checking the type of t_pred, I found that it is of type integer; however, the documentation
(https://stat.ethz.ch/R-manual/R-devel/library/rpart/html/predict.rpart.html)
states that the predict() method should return a vector.
I am unable to understand why the type of the variable is an integer and not a list. Where have I made the mistake, and how can I fix it?
Try calculating the confusion matrix first:
confMat <- table(test$class,t_pred)
Now you can calculate the accuracy by dividing the sum of the diagonal of the matrix, which counts the correct predictions, by the total sum of the matrix:
accuracy <- sum(diag(confMat))/sum(confMat)
My answer is very similar to @mtoto's, but a bit simpler. I hope it also helps:
mean(test$class == t_pred)
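For reference, the original error comes from test['class'], which returns a one-column data.frame rather than a vector, and "==" between a factor and a data.frame is not implemented. Extracting the column as a vector makes the original comparison work as well:
t <- test$class  # or test[["class"]]: a factor vector, not a data.frame
accuracy <- sum(t_pred == t) / length(t)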
