How to write a custom predict function for a classification model in R?

I am trying to use the flashlight package with the h2o package. An example of doing this with a regression model can be found here. I want to make it work for a classification model, so I followed the example given in the link. flashlight will work with h2o if you provide your own custom predict function. However, the predict function from that example, shown below, does not work for classification.
Here is the code I'm using:
library(flashlight)
library(h2o)
h2o.init()
h2o.no_progress()
iris_hf <- as.h2o(iris)
iris_dl <- h2o.deeplearning(x = 1:4, y = "Species", training_frame = iris_hf, seed=123456)
pred_fun <- function(mod, X) as.vector(unlist(h2o.predict(mod, as.h2o(X))))
fl_NN <- flashlight(model = iris_dl, data = iris, y = "Species", label = "NN",
predict_function = pred_fun)
But when I try to check the importance or interactions, I get an error. For example:
light_interaction(fl_NN, type = "H",
pairwise = TRUE)
Throws back the error:
Error: Assigned data predict(x, data = X[, cols, drop = FALSE]) must
be compatible with existing data. Existing data has 22500 rows.
Assigned data has 90000 rows. ℹ Only vectors of size 1 are recycled.
I need to change the predict function somehow to make it work, but I have had no success so far. Any suggestions on how I could change it?
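A note on the numbers in that error: 90000 is exactly four times 22500, which would be consistent with a classification prediction frame of four columns (a predicted label plus one probability per class) being flattened by unlist. The sketch below mimics that with a mock data frame in base R. The column layout is an assumption for illustration, not something taken from the h2o documentation.

```r
# Mock of an ASSUMED classification prediction layout:
# one label column followed by one probability column per class.
n <- 5
mock_pred <- data.frame(
  predict    = factor(rep("setosa", n),
                      levels = c("setosa", "versicolor", "virginica")),
  setosa     = rep(0.80, n),
  versicolor = rep(0.15, n),
  virginica  = rep(0.05, n)
)

# Flattening every column yields 4 * n values instead of n
flat_all <- as.vector(unlist(mock_pred))
length(flat_all)

# Selecting a single probability column yields one value per row
flat_one <- as.vector(unlist(mock_pred[, "setosa"]))
length(flat_one)
```

If the real prediction frame has this shape, a predict function has to pick one column so that it returns one value per input row.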
EDIT UPDATE: I found a custom predict function that works with the light_interaction function:
pred_fun <- function(mod, X) as.vector(unlist(h2o.predict(mod, as.h2o(X))[,2]))
Here the prediction is indexed to a specific category. However, this does not work for calculating the importance. For example:
light_importance(fl_NN)
Gives the following warnings:
Warning messages:
1: In Ops.factor(actual, predicted) : ‘-’ not meaningful for factors
2: In Ops.factor(actual, predicted) : ‘-’ not meaningful for factors
3: In Ops.factor(actual, predicted) : ‘-’ not meaningful for factors
4: In Ops.factor(actual, predicted) : ‘-’ not meaningful for factors
5: In Ops.factor(actual, predicted) : ‘-’ not meaningful for factors
So I'm still trying to figure this out!
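The Ops.factor warnings suggest that the metric is subtracting numeric predictions from the factor response Species. One possible workaround, not confirmed anywhere in this thread, is to give the explainer a numeric 0/1 response for the class whose probability the predict function returns. The base R sketch below only shows why the subtraction fails for a factor and works for a 0/1 column. The names iris_bin and is_setosa are hypothetical, and the predictions are placeholders rather than real model output.

```r
# Hypothetical numeric 0/1 response for one class
iris_bin <- transform(iris, is_setosa = as.numeric(Species == "setosa"))

# Placeholder probabilities standing in for the output of pred_fun
predicted <- rep(0.5, nrow(iris_bin))

# factor minus numeric: warns "'-' not meaningful for factors", returns NA
res_factor <- suppressWarnings(iris$Species - predicted)

# numeric minus numeric: residuals are well defined
res_numeric <- iris_bin$is_setosa - predicted
rmse <- sqrt(mean(res_numeric^2))
rmse

# ASSUMPTION: the flashlight object would then be built with
# data = iris_bin and y = "is_setosa" instead of y = "Species".
```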

Related

Error when trying to fit Hierarchical GAMs (Model GS or S) using mgcv

I have a large dataset (~100k observations) of presence/absence data to which I am trying to fit a hierarchical GAM with individual effects that share a penalty (e.g. model 'S' in Pedersen et al. 2019). The data consist of temp (numeric) and region (a factor with 5 groups).
Here is a simple version of the model that I am trying to fit.
modS1 <- gam(occurrence ~ s(temp, region), family = binomial,
data = df, method = "REML")
modS2 <- gam(occurrence ~ s(temp, region, k = c(10, 4)), family = binomial,
data = df, method = "REML")
In the first case I received the following error, which I assumed was because k was set too high for region, given that there are only 5 different regions in the data set:
Error in smooth.construct.tp.smooth.spec(object, dk$data, dk$knots) :
NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning messages:
1: In mean.default(xx) : argument is not numeric or logical: returning NA
2: In Ops.factor(xx, shift[i]) : ‘-’ not meaningful for factors
In the second case I attempted to lower k for region and received this error:
Error in if (k < M + 1) { : the condition has length > 1
In addition: Warning messages:
1: In mean.default(xx) : argument is not numeric or logical: returning NA
2: In Ops.factor(xx, shift[i]) : ‘-’ not meaningful for factors
I can fit models G, GI, and I from Pedersen et al. 2019 with no issues. It is with models GS and S that I run into problems.
If anyone has any insights I would really appreciate it!
The bs = "fs" argument in the code you're using as a guide is important. If we start at the ?s help page and click on the link to the ?smooth.terms help page, we see:
Factor smooth interactions
bs="fs" Smooth factor interactions are often produced using by variables (see gam.models), but a special smoother class (see factor.smooth.interaction) is available for the case in which a smooth is required at each of a large number of factor levels (for example a smooth for each patient in a study), and each smooth should have the same smoothing parameter. The "fs" smoothers are set up to be efficient when used with gamm, and have penalties on each null space component (i.e. they are fully ‘random effects’).
You need to use a smoothing basis appropriate for factors.
Notably, if you take your source code, remove the bs = "fs" argument, and attempt to run gam(log(uptake) ~ s(log(conc), Plant_uo, k=5, m=2), data=CO2, method="REML"), it will produce the same error that you got.
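For comparison, here is a minimal sketch of the same call with bs = "fs" kept in, using the built-in CO2 data. The object name co2_data is made up here. Plant_uo is an unordered copy of the Plant factor, following the variable name used in the call above.

```r
library(mgcv)

# Unordered copy of the Plant factor (Plant is an ordered factor in CO2)
co2_data <- transform(CO2, Plant_uo = factor(Plant, ordered = FALSE))

# Factor-smooth interaction: one smooth of log(conc) per plant,
# all sharing a single smoothing parameter
CO2_modS <- gam(log(uptake) ~ s(log(conc), Plant_uo, k = 5, bs = "fs", m = 2),
                data = co2_data, method = "REML")
summary(CO2_modS)

# ASSUMPTION: for the data in the question, the analogous term would be
# s(temp, region, bs = "fs"), with k chosen for the temp dimension only.
```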

How do you compute average marginal effects for glm.cluster models?

I am looking for a way to compute average marginal effects with clustered standard errors, which I seem to be having a few problems with. My model is as follows:
cseLogit <- miceadds::glm.cluster(data = data_long,
formula = follow ~ f1_distance + f2_distance + PolFol + MediaFol,
cluster = "id",
family = binomial(link = "logit"))
Where the dependent variable is binary (0/1) and all explanatory variables are numeric. I've tried two different ways of getting average marginal effects. The first one is:
marginaleffects <- margins(cseLogit, vcov = your_matrix)
Which gives me the following error:
Error in find_data.default(model, parent.frame()) :
'find_data()' requires a formula call
I've also tried this:
marginaleffects <- with(cseLogit, margins(glm_res, vcov=vcov))
which gives me this error:
Error in eval(predvars, data, env) :
object 'f1_distance' was not found
In addition: warnings:
1: In dydx.default(X[[i]], ...) :
Class of variable, f1_distance, is unrecognized. Returning NA.
2: In dydx.default(X[[i]], ...) :
Class of variable, f2_distance, is unrecognized. Returning NA.
Can you tell me what I'm doing wrong? If I haven't provided enough information, please let me know. Thanks in advance.
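As a cross-check for whatever approach ends up working, the point estimates of the average marginal effects in a logit model can be computed by hand in base R: for a numeric predictor, the AME is the coefficient times the mean of p(1 - p) over the sample. Clustering should change only the standard errors, not these point estimates. The sketch below uses simulated data with hypothetical variable names (x1, x2, y) and does not compute clustered standard errors.

```r
set.seed(42)

# Simulated data, for illustration only
n <- 500
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- rbinom(n, 1, plogis(-0.5 + 0.8 * d$x1 - 0.4 * d$x2))

fit <- glm(y ~ x1 + x2, data = d, family = binomial(link = "logit"))

# Fitted probabilities on the response scale
p_hat <- fitted(fit)

# AME of a numeric predictor in a logit model: beta_j * mean(p * (1 - p))
ame <- coef(fit)[c("x1", "x2")] * mean(p_hat * (1 - p_hat))
ame
```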

Fail to predict woe in R

I used this code to get the WOE:
library("woe")
woe.object <- woe(data, Dependent="target", FALSE,
Independent="shop_id", C_Bin=20, Bad=0, Good=1)
Then I want to predict woe for the test data
test.woe <- predict(woe.object, newdata = test, replace = TRUE)
And it gives me an error
Error in UseMethod("predict") :
no applicable method for 'predict' applied to an object of class "data.frame"
Any suggestions please?
For prediction, you cannot do it with the woe package. You need to use the klaR package. Take note that loading it masks the function woe, see below:
# let's say woe was loaded first and then klaR
library(klaR)
data = data.frame(target=sample(0:1,100,replace=TRUE),
shop_id = sample(1:3,100,replace=TRUE),
another_var = sample(letters[1:3],100,replace=TRUE))
#make sure both dependent and independent are factors
data$target=factor(data$target)
data$shop_id = factor(data$shop_id)
data$another_var = factor(data$another_var)
You need two or more independent variables:
woemodel <- klaR::woe(target~ shop_id+another_var,
data = data)
If you only provide one, you get an error:
woemodel <- klaR::woe(target~ shop_id,
data = data)
Error in woe.default(x, grouping, weights = weights, ...) : All
factors with unique levels. No woes calculated! In addition: Warning
message: In woe.default(x, grouping, weights = weights, ...) : Only
one single input variable. Variable name in resulting object$woe is
only conserved in formula call.
If you want to predict the dependent variable with only one independent, something like logistic regression will work:
mdl = glm(target ~ shop_id,data=data,family="binomial")
prob = predict(mdl,data,type="response")
# R indexing starts at 1; prob is the probability of the second factor level
predicted_label = ifelse(prob>0.5,levels(data$target)[2],levels(data$target)[1])
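If the goal is only to attach WOE values for a single variable to test data, one option is to compute the WOE table by hand on the training data and look it up for the test rows. This is a minimal base R sketch on simulated data. It takes 1 as "good" and 0 as "bad", as in the question's call, and uses the convention WOE = log(share of goods / share of bads). Other packages may use the opposite sign or add smoothing for empty cells.

```r
set.seed(1)

# Simulated training and test data, for illustration only
train <- data.frame(
  target  = sample(0:1, 200, replace = TRUE),
  shop_id = factor(sample(1:3, 200, replace = TRUE))
)
test <- data.frame(
  shop_id = factor(sample(1:3, 20, replace = TRUE),
                   levels = levels(train$shop_id))
)

# Counts per shop (rows) and target class (columns "0" and "1")
tab <- table(train$shop_id, train$target)

# Share of all goods / all bads that fall into each shop
dist_good <- tab[, "1"] / sum(tab[, "1"])
dist_bad  <- tab[, "0"] / sum(tab[, "0"])

# WOE per shop, named by factor level
woe_table <- log(dist_good / dist_bad)

# Look up the training WOE for each test row
test$shop_id_woe <- unname(woe_table[as.character(test$shop_id)])
head(test)
```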

How can I get the R IML FeatureImp() function to work?

I am trying to get the FeatureImp function from the IML package to work, but it keeps throwing an error. Below is an example from the diamonds dataset, on which I train a random forest model.
library(iml)
library(caret)
library(randomForest)
data(diamonds)
# create some binary classification target (without specific meaning)
diamonds$target <- as.factor(ifelse(diamonds$color %in% c("D", "E", "F"), "X", "Y"))
# drop categorical variables (to keep it simple for demonstration purposes)
diamonds <- subset(diamonds, select = -c(color, clarity, cut))
# train model
mdl_diamonds <- train(target ~ ., method = "rf", data = diamonds)
# create iml predictor
x_pred <- Predictor$new(model = mdl_diamonds, data = diamonds[, 1:7], y = diamonds$target, type = "prob")
# calculate feature importance
x_imp <- FeatureImp$new(x_pred, loss = "mae")
This ends with the following error:
Error in if (self$original.error == 0) { :
missing value where TRUE/FALSE needed
In addition: Warning message:
In Ops.factor(actual, predicted) : ‘-’ not meaningful for factors
I don't understand what I'm doing wrong. Can anyone give me a clue?
I'm working on R version 3.5.1, iml package version 0.9.0.
I have found the problem. I was using "mae" as the loss function, which, as I could have known, is not applicable to a classification target. Using "ce" or "f1" returns output as expected.
Because the target is a factor (this is a classification model), try loss = 'ce'.
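The difference between the two losses can be seen in base R without iml. Subtracting factors yields NA, which is consistent with the "missing value where TRUE/FALSE needed" error once that NA reaches an if() test, whereas a classification error only compares labels. The vectors below are made up for illustration, and "ce" is taken here to mean the share of misclassified rows.

```r
# Made-up labels, for illustration only
actual    <- factor(c("X", "Y", "X", "Y"))
predicted <- factor(c("X", "X", "X", "Y"), levels = levels(actual))

# Mean absolute error needs a subtraction, which is undefined for factors
mae_try <- suppressWarnings(mean(abs(actual - predicted)))
mae_try

# Classification error: share of rows where the labels differ
ce <- mean(actual != predicted)
ce
```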

Using ksvm from the kernlab package for prediction gives an error

I use the ksvm function to train on the data, but I get an error when predicting. Here is the code:
svmmodel4 <- ksvm(svm_train[,1]~., data=svm_train,kernel = "rbfdot",C=2.4,
kpar=list(sigma=.12),cross=5)
Warning message:
In .local(x, ...) : Variable(s) `' constant. Cannot scale data.
pred <- predict(svmmodel4, svm_test[,-1])
Error in eval(expr, envir, enclos) : object 'res_var' not found.
If I add the response variable, it works:
pred <- predict(svmmodel4, svm_test)
But if you add the response variable, how can it be called "predicting"? What is wrong with my code? Thanks for your help!
The complete code:
library(kernlab)
svmData <- read.csv("svmData.csv",header=T,stringsAsFactors = F)
svmData$res_var <- as.factor(svmData$res_var)
svm_train <- svmData[1:2110,]
svm_test <- svmData[2111:2814,]
svmmodel4 <- ksvm(svm_train[,1]~.,data = svm_train,kernel = "rbfdot",C=2.4,
kpar=list(sigma=.12),cross=5)
pred1 <- predict(svmmodel4,svm_test[,-1])
You cannot remove the response column from your test dataset. You simply split your data by rows, meaning the response column must be present in your training and testing datasets, and in the validation dataset if you have one.
Your call
pred <- predict(svmmodel4, svm_test)
works just fine: the predict function will take your data, recognize the factor column, and test the rest against the model. Your training and testing datasets must have the same number of columns, but the number of rows can differ.
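A side note on why predict asks for res_var at all: when the response is written as svm_train[,1] rather than by name, the dot in the formula may expand to every column of the data, including res_var itself. The base R sketch below shows this expansion on a small made-up data frame. Whether this fully accounts for the kernlab error is an assumption, not something verified against kernlab here.

```r
# Small made-up data frame with the response in the first column
df <- data.frame(
  res_var = factor(c("a", "b", "a", "b")),
  x1      = 1:4,
  x2      = c(2.5, 1.0, 3.0, 0.5)
)

# Response given by position: the dot expands to ALL columns
rhs_indexed <- labels(terms(df[, 1] ~ ., data = df))
rhs_indexed

# Response given by name: the dot excludes the response
rhs_named <- labels(terms(res_var ~ ., data = df))
rhs_named
```

With the response named in the formula (res_var ~ .), the predictors no longer include res_var.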
