Write custom classifier in R and predict function

I would like to implement my own custom classifier in R, e.g., myClassifier(trainingSet, ...), which returns the learnt model m from a specified training set. I would like to call it just like any other classifier in R:
m <- myClassifier(trainingSet)
and then I want to overload (I don't know if this is the correct word) the generic function predict()
result <- predict(m, myNewData)
I have just a basic knowledge of R and don't know which resources I should read in order to accomplish the desired task. Do I need to create a package for this to work? I am looking for some initial directions.
Does the model m contain information about the overridden predict method, or how does R know which predict.* method corresponds to model m?

Here is some code that shows how to write a method for a generic function for your own class.
# create a function that returns an object of class myClassifierClass
myClassifier = function(trainingData, ...) {
  model = structure(list(x = trainingData[, -1], y = trainingData[, 1]),
                    class = "myClassifierClass")
  return(model)
}
# create a method for the generic function predict for class myClassifierClass
# (the argument list should match the generic: predict(object, ...))
predict.myClassifierClass = function(object, ...) {
  return(rlogis(length(object$y)))
}
# test
mA = matrix(rnorm(100*10), nrow = 100, ncol = 10)
modelA = myClassifier(mA)
predict(modelA)
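To answer the dispatch question: the model does not store the method itself. predict() is an S3 generic, and R dispatches on the class attribute of its first argument, so predict(modelA) finds predict.myClassifierClass because class(modelA) is "myClassifierClass". You can verify this:
class(modelA)
# [1] "myClassifierClass"
getS3method("predict", "myClassifierClass")  # the method predict() dispatches to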

Related

how does parsnip know how to match `fit` arguments to function arguments for a model?

I am trying to create a new model for the parsnip package from an existing modeling function foo.
I have followed the tutorial on building new models in parsnip and the README on GitHub, but I still cannot figure out some things.
How does the fit function in parsnip know how to assign its input data (e.g. a matrix) to my idiosyncratic function call?
Imagine there was an idiosyncratic model function foo where the conventional roles of the x and y arguments were reversed: i.e., foo(x, y) where x should be an outcome vector and y should be a predictor matrix, bizarrely.
For example: suppose a is a matrix of predictors and b is a vector of outcomes, and I call fit_xy(object = my_model, x = a, y = b). Internally, how does fit_xy() know to call foo(x = b, y = a)?
The function that validates the input is check_final_param, which requires that each argument be named. That is why the order is not important.
https://github.com/tidymodels/parsnip/blob/f7ba069671684f61af0ca1eadb1927fedec8a9c6/R/misc.R#L235
The README file you linked points out:
"To create the model fit call, the protect arguments are populated with the appropriate objects (usually from the data set), and rlang::call2 is used to create a call that can be executed."
For example, randomForest uses ntree instead of the default trees argument, so a translation call is created and used during evaluation.
https://github.com/tidymodels/parsnip/blob/228a6dc6975fc91562b63d191e43d2164cc78e3d/R/rand_forest_data.R#L339
If we use call2 and splice in the named arguments, their order does not matter, and we know the arguments will be properly named because of the additional translation step.
args <- list(na.rm = TRUE, trim = 0)
rlang::call2("mean", 1:10, !!!args)
# mean(1:10, na.rm = TRUE, trim = 0)
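As a quick sanity check (not from the original answer), the constructed call evaluates identically however the named arguments are ordered:
library(rlang)
eval(call2("mean", 1:10, !!!list(na.rm = TRUE, trim = 0)))
# [1] 5.5
eval(call2("mean", 1:10, !!!list(trim = 0, na.rm = TRUE)))
# [1] 5.5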
The way we do this is through the set_fit() function. Most models are pretty sensible and we can use default mappings (for example, from a data argument to a data argument, or x to x), but you are right that some models use different norms. An example of this is the Spark models, which use x to mean what we might normally call data with a formula method.
The random forest set_fit() function for Spark looks like this:
set_fit(
  model = "rand_forest",
  eng = "spark",
  mode = "classification",
  value = list(
    interface = "formula",
    data = c(formula = "formula", data = "x"),
    protect = c("x", "formula", "type"),
    func = c(pkg = "sparklyr", fun = "ml_random_forest"),
    defaults = list(seed = expr(sample.int(10 ^ 5, 1)))
  )
)
Notice especially the data element of the value argument. You can read a bit more here.
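Applying this to the hypothetical reversed foo() from the question, the data element is where the remapping would go. A minimal sketch, assuming foo() lives in a hypothetical package foopkg and that "my_model" and its engine have already been registered with set_new_model() and set_model_engine():
library(parsnip)
set_fit(
  model = "my_model",   # hypothetical model
  eng = "foo_engine",   # hypothetical engine
  mode = "regression",
  value = list(
    interface = "matrix",
    # names are parsnip's arguments, values are the engine's arguments:
    # parsnip's x (predictors) feeds foo's y, parsnip's y (outcome) feeds foo's x
    data = c(x = "y", y = "x"),
    protect = c("x", "y"),
    func = c(pkg = "foopkg", fun = "foo"),
    defaults = list()
  )
)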

Can't give a subset when using randomForest inside a function

I want to create a function that uses within it the randomForest function from the randomForest package. randomForest takes a "subset" argument, which is a vector of row numbers of the data frame to use for training. However, if I use this argument when calling randomForest inside another defined function, I get the error:
Error in eval(substitute(subset), data, env) :
object 'tr_subset' not found
Here is a reproducible example, where we attempt to train a random forest to classify a response "type" as either "A" or "B", based on three numerical predictors:
library(randomForest)
# define a random data frame to train with
train.data = data.frame(
  type = rep(NA, times = 500),
  x = runif(500),
  y = runif(500),
  z = runif(500)
)
train.data$type[runif(500) >= 0.5] = "A"
train.data$type[is.na(train.data$type)] = "B"
train.data$type = as.factor(train.data$type)
# define the training range
training.range = sample(500)[1:300]
# formula to use
tr_form = formula(type ~ x + y + z)
# Function that includes the randomForest function
train_rf = function(form, all_data, tr_subset) {
  p = randomForest(
    formula = form,
    data = all_data,
    subset = tr_subset,
    na.action = na.omit
  )
  return(p)
}
# test the new defined function
test_tree = train_rf(form = tr_form, all_data = train.data, tr_subset = training.range)
Running this gives the error:
Error in eval(substitute(subset), data, env) :
object 'tr_subset' not found
If, however, subset = tr_subset is removed from the randomForest call and tr_subset is removed from the train_rf arguments, this code runs fine, but then the whole data set is used for training!
It should be noted that the subset argument in randomForest works completely fine when it is not wrapped in another function, and this is the intended usage, as described in the vignette linked above.
I know that in the meantime I could just define another training set containing only the required rows and train on all of that, but is there a reason why my original code doesn't work?
Thanks.
EDIT: I conjecture that, as subset() is a base R function, R is getting confused and thinking I want to use the base R function rather than the argument of randomForest. I'm not an expert, though, so I may be wrong.
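For what it's worth, the error message itself points at the real cause: randomForest captures its subset argument with substitute() and evaluates it via eval(substitute(subset), data, env), i.e. in the data frame and the formula's environment, where tr_subset does not exist. One common workaround for this kind of non-standard-evaluation problem (a sketch, not the only possible fix) is to build the call with the argument values already evaluated, e.g. via do.call():
# do.call() splices the evaluated values into the call, so randomForest
# never has to look up the symbol tr_subset in the wrong environment
train_rf = function(form, all_data, tr_subset) {
  do.call(randomForest, list(
    formula = form,
    data = all_data,
    subset = tr_subset,
    na.action = na.omit
  ))
}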

Create list of predefined S3 objects

I am busy comparing different machine learning techniques in R. This is the case: I made several functions that, in an automated way, each create a different prediction model (e.g. logistic regression, random forest, neural network, hybrid ensemble, etc.), predictions, confusion matrices, several statistics (e.g. AUC and F-score), and different plots.
I managed to create an S3 object that is able to store the required data.
However, when I try to create a list of my defined objects, this fails and all data is stored sequentially in one big list.
This is my S3 object (as this is the first time I have created an S3 class, I am not really sure the code is 100% correct):
modelObject <- function(modelName, modelObject, modelPredictions, rocCurve, aUC, confusionMatrix)
{
  modelObject <- list(
    model.name = modelName,
    model.object = modelObject,
    model.predictions = modelPredictions,
    roc.curve = rocCurve,
    roc.auc = aUC,
    confusion.matrix = confusionMatrix
  )
  ## Set the name for the class
  class(modelObject) <- "modelObject"
  return(modelObject)
}
At the end of each machine learning function, I define and return the object. A shortened example:
NeuralNetworkAnalysis <- function() {
  # I removed the unnecessary code, as only the end of the code is relevant
  nn.model <- modelObject(modelName = "Neural.Network", modelObject = NN, modelPredictions = predNN, rocCurve = roc, aUC = auc, confusionMatrix = confu)
  return(nn.model)
}
At last, in my 'script' function, I create an empty list and try to append the different objects
# function header and arguments before this part are irrelevant
  # Build predictive model(s)
  modelList = list("model" = modelObject)
  modelList <- append(modelList, NeuralNetworkAnalysis())
  modelList <- append(modelList, RandomForestAnalysis())
  mod <<- RandomForestAnalysis() # this is to test the outcome when I do not put it in a list
  return(modelList)
} # end of the function ModelBuilding
models <- ModelBuilding( '01/01/2013' , '01/01/2014' , '02/01/2014' , '02/01/2015' )
Now, when I take a look at the models list, I don't have a list of objects; I just have one list with all the data of each algorithm.
class(models)
# [1] "list"
class(mod)
# [1] "modelObject"
How can I fix this problem, so that I can have a list that supports, for example:
list$random.forest$variable.I.want.to.access (most favorable)
or
list[[i]]$variable.of.random.forest.that.I.want.to.access
thx in advance!
Olivier
Not sure if I understand correctly, but maybe the issue is only in how your model list is built. If you try
modelList[["neural.network"]] <- NeuralNetworkAnalysis()
modelList[["random.forest"]] <- RandomForestAnalysis()
etc., does that give you the access methods you are looking for?
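The underlying reason (not spelled out above) is that append() and c() concatenate the elements of lists, so appending the classed list returned by modelObject() splices its six fields into modelList and drops the class. Assigning with [[<- stores each object as a single, intact element:
modelList <- list()
modelList[["neural.network"]] <- NeuralNetworkAnalysis()
modelList[["random.forest"]] <- RandomForestAnalysis()
class(modelList[["random.forest"]])  # "modelObject"
modelList$random.forest$model.name   # fields are accessible as hoped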

Change SMOTE parameters inside CARET k-fold cross-validation classification

I have a classification problem with a very skewed class to predict (e.g. a 90% / 10% unbalanced binary variable).
In order to deal with that issue, I want to use the SMOTE method to oversample the minority class. However, as I read here (http://www.marcoaltini.com/blog/dealing-with-imbalanced-data-undersampling-oversampling-and-proper-cross-validation), it is best practice to use SMOTE inside the k-fold loop to avoid overfitting.
As I'm using the caret package to perform my analysis, I'm referring to this link (http://topepo.github.io/caret/sampling.html). I understand everything perfectly except the last part, where it explains how to change the SMOTE parameters:
smotest <- list(name = "SMOTE with more neighbors!",
                func = function(x, y) {
                  library(DMwR)
                  dat <- if (is.data.frame(x)) x else as.data.frame(x)
                  dat$.y <- y
                  dat <- SMOTE(.y ~ ., data = dat, k = 10)
                  list(x = dat[, !grepl(".y", colnames(dat), fixed = TRUE)],
                       y = dat$.y)
                },
                first = TRUE)
I simply don't understand this. Would someone care to explain? Let's say I want to set the SMOTE parameters perc.over, k, and perc.under: how would I do that?
Thank you very much.
EDIT:
Actually, I realized I could probably just add these parameters inside the SMOTE() call in the above function, which would give something like:
smotest <- list(name = "SMOTE with more neighbors!",
                func = function(x, y) {
                  library(DMwR)
                  dat <- if (is.data.frame(x)) x else as.data.frame(x)
                  dat$.y <- y
                  dat <- SMOTE(.y ~ ., data = dat, k = 10, perc.over = 1200, perc.under = 100)
                  list(x = dat[, !grepl(".y", colnames(dat), fixed = TRUE)],
                       y = dat$.y)
                },
                first = TRUE)
I am not sure I have understood what you do not understand, but here is an attempt to clarify what this piece of code does.
The smotest object is created as a list because that is how the sampling argument of the trainControl function must be represented. The first element of this list is a name used only for display purposes. The second, func, is the actual sampling function. The third, first, is a logical value indicating whether sampling must be done before or after the pre-processing step.
The func element is just a wrapper around the SMOTE function. In this wrapper, line 3 is there because only a data.frame can be passed to the SMOTE function. Line 4 is added because SMOTE uses a formula combined with a data.frame rather than a pair x, y. Line 6 ensures that the appropriate format is returned to trainControl.
And, to answer your last question: yes, you can do what you proposed to set additional parameters for SMOTE.
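For completeness (not part of the original answer), this is roughly how such a sampling list is plugged into caret, assuming a two-class factor outcome Class in a data frame training:
library(caret)
ctrl <- trainControl(method = "cv", number = 10,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary,
                     sampling = smotest)  # the custom list defined above
model <- train(Class ~ ., data = training,
               method = "rf",
               metric = "ROC",
               trControl = ctrl)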

access tuneValue from the model object when it is an S4 object - caret, custom model

I am using a custom model in caret which basically builds on the vanilla "cforest" method.
To build my model, I fetch the modelInfo for the cforest model:
newModel <- getModelInfo("cforest", regex=F)[[1]]
I need to implement a custom predict function so I do:
newModel$predict = function(modelFit, newdata, submodels = NULL) {
  # new predict function, which needs the tuned parameters
  best.params <- modelFit$tuneValue
  # rest of the code using best.params
}
The content of the new predict function is in itself irrelevant. The point is that I need the tuned values from within the predict function.
While the code works perfectly fine with other models, it won't work with cforest because in this case modelFit is a "RandomForest" S4 object and I cannot access tuneValue. (The exact error is "Error in modelFit$tuneValue : $ operator not defined for this S4 class".)
I explored the "RandomForest" object and it does not appear to contain the tuned values in any slot.
My guess is that, since it is an S4 object, the caret code that stores the tuned values into $tuneValue does not work in this particular case.
Maybe I can save the tuned values manually at some point during the fitting process, but I don't know:
1 - when I should do it (when are the tuned values selected?)
2 - where I should save them to have access to them during predict
Does anyone have an idea how I could do this?
Here is minimal code to generate a RandomForest S4 object:
library(caret)
x <- matrix(rnorm(20 * 10), 20, 10)
y <- x %*% rnorm(10)
y <- factor(y < mean(y), levels = c(T, F), labels = c("high", "low"))
new_model <- getModelInfo("cforest", regex = F)[[1]]
fit <- train(x = x, y = y, method = new_model)
# this is a RandomForest S4 object
fit$finalModel
It took me a while to figure out, but it was actually kind of straightforward. Since the model is an S4 object and I want to add information to it... I built my own S4 class inheriting from the model's class!
In order to do this, I had to change the "fit" function.
# loading the vanilla model
newModel <- getModelInfo("cforest", regex = F)[[1]]
# backing up the old fit function
newModel$old_fit <- newModel$fit
# editing the fit function to wrap the S4 model into my custom S4 object
newModel$fit <- function(x, y, wts, param, lev, last, classProbs, ...) {
  tmp <- newModel$old_fit(x, y, wts, param, lev, last, classProbs, ...)
  if (isS4(tmp)) {
    old_class <- as.character(class(tmp))
    # creating a custom class, with the slots I need, that inherits from the model's class
    setClass(".custom", slots = list(.threshold = "numeric", .levels = "character"), contains = old_class)
    # instantiating the new class with values taken from the arguments of fit()
    tmp <- new(".custom", .threshold = param$threshold, .levels = lev, tmp)
  }
  tmp
}
And now the model objects are consistently of class ".custom", so I can do:
newModel$predict = function(modelFit, newdata, submodels = NULL) {
  if (isS4(modelFit)) {
    if (!is(modelFit, ".custom"))
      stop("predict() received an S4 object whose class was not '.custom'")
    obsLevels <- modelFit@.levels
    threshold <- modelFit@.threshold
  } else {
    obsLevels <- modelFit$obsLevels
    threshold <- modelFit$tuneValue$threshold
  }
  # rest of the code
}
This is great: now my custom model can extend any caret model, regardless of whether it relies on S4 objects (like cforest) or not (like svm)!
