R - Top x Important Variables for Each Individual Sample in Classification

I'm building a churn model using the C5.0 algorithm in R. After fitting the model and successfully predicting on the data, how do I find the top 3 most important predictors for each customer that will churn? That way I would know why the model classifies, for example, customers A, B, D, and F as positive and the others as negative. Is this possible?
Thanks.

Many models have built-in approaches for measuring the aggregate effect of the predictors on the model. The caret package contains a general function, varImp, for calculating or returning these values for many model classes, including C5.0, JRip, PART, RRF, RandomForest, bagEarth, classbagg, cubist, dsa, earth, fda, gam, gbm, glm, glmnet, lm, multinom, mvr, nnet, pamrtrained, plsda, randomForest, regbagg, rfe, rpart, sbf, and train.
For example,
> library(caret)
> set.seed(1401)
> ctrl <- trainControl(method = 'repeatedcv', number = 6, repeats = 2, classProbs = TRUE)
> C5fit <- train(x = iris[, 1:4], y = iris$Species, method = "C5.0", metric = "ROC", trControl = ctrl)
> varImp(C5fit, scale = FALSE)
C5.0 variable importance

             Overall
Petal.Width      100
Sepal.Width        0
Petal.Length       0
Sepal.Length       0

You can plot the trees within the model. If you use a single C5.0 tree, this gives you an easy way to show the exact reasoning the tree uses for each prediction.
library(C50)
set.seed(1401)
C5tree <- C5.0(x = iris[, 1:4], y = iris$Species, trials = 1) # A single C50 tree
C5imp(C5tree)
plot(C5tree, trial = 0)
If you use boosting (i.e. trials > 1 when you train the trees), then this approach is likely too complicated due to the number of trees.
C5boosted <- C5.0(x = iris[, 1:4], y = iris$Species, trials = 3) # Boost three trees
C5imp(C5boosted)
# Plot each of the trees
for(i in 0:2){ # the trial index starts counting at 0, see ?plot.C5.0
  plot(C5boosted, trial = i)
}
Instead, you can rely on the variable importance for a general report of important variables or use partial dependence plots that show the (non-linear) effect of one variable relative to all other variables. I suggest having a look at package pdp on CRAN.
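As a minimal sketch of the partial dependence idea (not from the original answer; it assumes pdp's partial() can work with a C5.0 fit, and the predictor and class shown here are chosen only for illustration):
library(C50)
library(pdp)

# Single C5.0 tree, as above
C5tree <- C5.0(x = iris[, 1:4], y = iris$Species, trials = 1)

# Partial dependence of the class-1 ("setosa") probability on Petal.Width,
# averaged over the other predictors
pd <- partial(C5tree, pred.var = "Petal.Width", which.class = 1L,
              prob = TRUE, train = iris[, 1:4])
plotPartial(pd)  # probability of "setosa" as a function of Petal.Width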

Related

R feature selection with LASSO

I have a small data set (37 observations x 23 features) and want to perform feature selection with LASSO regression in order to reduce its dimensionality. To achieve this, I designed the code below based on online tutorials:
#Load the libraries
library(mlbench)
library(elasticnet)
library(caret)
library(dplyr)   # needed for %>% and select()
#Initialize cross-validation and train LASSO
cv_5 <- trainControl(method = "cv", number = 5)
lasso <- train(ColumnY ~ ., data = My_Data_Frame, method = 'lasso', trControl = cv_5)
#Filter out the variables whose coefficients have shrunk to 0
drop <- predict.enet(lasso$finalModel, type = 'coefficients', s = lasso$bestTune$fraction, mode = 'fraction')$coefficients
drop <- drop[drop == 0] %>% names()
My_Data_Frame <- My_Data_Frame %>% select(-all_of(drop))
In most cases the code runs without errors but it occasionally throws the following:
Warning messages:
1: model fit failed for Fold2: fraction=0.9 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
2: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
I sense this happens because my data has few rows and some variables have low variance.
Is there a way I can bypass or fix this issue (e.g. setting a parameter in the flow)?
You have a low number of observations, so there's a good chance that in some training folds some of your columns will be all zero or have very low variance. For example:
library(caret)
set.seed(222)
df = data.frame(ColumnY = rnorm(37),matrix(rbinom(37*23,1,p=0.15),ncol=23))
cv_5 <- trainControl(method="cv", number=5)
lasso <- train( ColumnY ~., data=df, method='lasso', trControl=cv_5)
Warning messages:
1: model fit failed for Fold4: fraction=0.9 Error in elasticnet::enet(as.matrix(x), y, lambda = 0, ...) :
Some of the columns of x have zero variance
Before running the code below, check that none of your categorical columns has only one positive label.
One way is to increase the number of CV folds: with 5 folds you train on 80% of the data, so try 10 folds to use 90% of the data:
cv_10 <- trainControl(method="cv", number=10)
lasso <- train( ColumnY ~., data=df, method='lasso', trControl=cv_10)
And as you might have seen, since the dataset is so small, cross-validation might not offer you that much advantage; you can also do leave-one-out cross-validation:
tr <- trainControl(method="LOOCV")
lasso <- train( ColumnY ~., data=df, method='lasso', trControl=tr)
You can use the FSinR package to perform feature selection. It is written in R and available from CRAN. It has a wide variety of filter and wrapper methods that you can combine with search methods. The interface for generating the wrapper evaluator follows the caret interface. For example:
# Load the library
library(FSinR)
# Choose one of the search methods
searcher <- searchAlgorithm('sequentialForwardSelection')
# Choose one of the filter/wrapper evaluators. You can remove the fitting and
# resampling params if you want to make it simpler; these are the parameters
# of caret's train and trainControl.
resamplingParams <- list(method = "cv", number = 5)
fittingParams <- list(preProc = c("center", "scale"), metric="Accuracy", tuneGrid = expand.grid(k = c(1:20)))
evaluator <- wrapperEvaluator('knn', resamplingParams, fittingParams)
# You make the feature selection (returns the best features)
results <- featureSelection(My_Data_Frame, 'ColumnY', searcher, evaluator)

Custom Performance Function in caret Package using predicted Probability

This SO post is about using a custom performance measurement function in the caret package. You want to find the best prediction model, so you build several and compare them by calculating a single metric that is derived from comparing the observed and predicted values. There are default functions to calculate this metric, but you can also define your own metric function. This custom function must take the observed and predicted values as input.
In classification problems (let's say only two classes) the predicted value is 0 or 1. However, what I need to evaluate is also the probability calculated in the model. Is there any way to achieve this?
The reason is that there are applications where you need to know whether a 1 prediction is actually with a 99% probability or with a 51% probability - not just if the prediction is 1 or 0.
Can anyone help?
Edit
OK, so let me try to explain a little bit better. In the documentation of the caret package, under 5.5.5 (Alternate Performance Metrics), there is a description of how to use your own custom performance function, like so:
fitControl <- trainControl(method = "repeatedcv",
number = 10,
repeats = 10,
## Estimate class probabilities
classProbs = TRUE,
## Evaluate performance using
## the following function
summaryFunction = twoClassSummary)
twoClassSummary is the custom performance function in this example. The function provided here needs to take as input a data frame or matrix with columns obs and pred. And here's the point: I want to use a function that takes not the observed and predicted values, but the observed values and the predicted probabilities.
One more thing:
Solutions from other packages are also welcome. The only thing I am not looking for is "This is how you write your own cross-validation function."
caret does support passing class probabilities to custom summary functions when you specify classProbs = TRUE in trainControl. In that case the data argument passed to your custom summary function will have two additional columns, named after the classes, containing the probability of each class. The names of these classes are available in the lev argument, which is a vector of length 2.
See the Example:
library(caret)
library(mlbench)
data(Sonar)
Custom summary LogLoss:
LogLoss <- function (data, lev = NULL, model = NULL){
  obs <- data[, "obs"] #truth
  cls <- levels(obs) #find class names
  probs <- data[, cls[2]] #use second class name to extract probs for 2nd class
  probs <- pmax(pmin(as.numeric(probs), 1 - 1e-15), 1e-15) #bound probability; this line and below are just the log loss calculation, irrelevant for your question
  logPreds <- log(probs)
  log1Preds <- log(1 - probs)
  real <- (as.numeric(data$obs) - 1)
  out <- c(mean(real * logPreds + (1 - real) * log1Preds)) * -1
  names(out) <- c("LogLoss") #important since this is specified in the call to train. Output can be a named vector of multiple values.
  out
}
fitControl <- trainControl(method = "cv",
number = 5,
classProbs = TRUE,
summaryFunction = LogLoss)
fit <- train(Class ~.,
data = Sonar,
method = "rpart",
metric = "LogLoss" ,
tuneLength = 5,
trControl = fitControl,
maximize = FALSE) #important, depending on calculated performance measure
fit
#output
CART
208 samples
60 predictor
2 classes: 'M', 'R'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 166, 166, 166, 167, 167
Resampling results across tuning parameters:
cp LogLoss
0.00000000 1.1220902
0.01030928 1.1220902
0.05154639 1.1017268
0.06701031 1.0694052
0.48453608 0.6405134
LogLoss was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.4845361.
Alternatively, use the lev argument, which contains the class levels, and add some error checking:
LogLoss <- function (data, lev = NULL, model = NULL){
  if (length(lev) > 2) {
    stop(paste("Your outcome has", length(lev), "levels. The LogLoss() function isn't appropriate."))
  }
  obs <- data[, "obs"] #truth
  probs <- data[, lev[2]] #use second class name
  probs <- pmax(pmin(as.numeric(probs), 1 - 1e-15), 1e-15) #bound probability
  logPreds <- log(probs)
  log1Preds <- log(1 - probs)
  real <- (as.numeric(data$obs) - 1)
  out <- c(mean(real * logPreds + (1 - real) * log1Preds)) * -1
  names(out) <- c("LogLoss")
  out
}
Check out this section of the caret book for additional info: https://topepo.github.io/caret/model-training-and-tuning.html#metrics
It's a great book to read if you plan on using caret, and even if you don't, it's a good read.
Sadly, I just found the answer to my question. There is this one little sentence in the caret documentation...
"...If none of these parameters are satisfactory, the user can also compute custom performance metrics. The trainControl function has a argument called summaryFunction that specifies a function for computing performance. The function should have these arguments:
data is a reference for a data frame or matrix with columns called obs and pred for the observed and predicted outcome values (either numeric data for regression or character values for classification). Currently, class probabilities are not passed to the function. The values in data are the held-out predictions (and their associated reference values) for a single combination of tuning..."
For the sake of documentation: This is written on 2020-07-03 with caret package documentation from 2019-03-27.
I am not really sure I understand your question correctly:
To receive predicted probabilities from a model mdl, you can use predict(mdl, type = "prob").
I.e.,
library(caret)
#> Loading required package: lattice
#> Loading required package: ggplot2
df <- iris
df$isSetosa <- factor(df$Species == "setosa", levels = c(FALSE, TRUE), labels = c("not-setosa", "is-setosa"))
df$Species <- NULL
mdl <- train(isSetosa ~ ., data = df, method = "glm",
             family = "binomial",
             trControl = trainControl(method = "cv"))
preds <- predict(mdl, newdata = df, type = "prob")
head(preds)
#> not-setosa is-setosa
#> 1 2.220446e-16 1
#> 2 2.220446e-16 1
#> 3 2.220446e-16 1
#> 4 1.875722e-12 1
#> 5 2.220446e-16 1
#> 6 2.220446e-16 1
Created on 2020-07-02 by the reprex package (v0.3.0)
I.e., we see that case 4 is predicted to be a setosa with ~100% (tbh, this toy model is way too good to be true)...
Now we can create a custom function that collapses the values to a single metric.
true <- df$isSetosa
# a very basic model metric that just sums the absolute differences between truth and probability
custom_model_metric <- function(preds, true) {
  d <- data.frame(true = true)
  tt <- predict(dummyVars(~true, d), d)
  colnames(tt) <- c("not-setosa", "is-setosa")
  sum(abs(tt - preds))
}
custom_model_metric(preds, true)
#> [1] 3.294029e-09
Created on 2020-07-02 by the reprex package (v0.3.0)

R: how to improve gradient boosting model fit

I tried fitting a gradient boosted model (weak learners are max.depth = 2 trees) to the iris data set using gbm in the gbm package. I set the number of iterations to M = 1000 with a learning rate of learning.rate = 0.001. I then compared the results to those of a regression tree (using rpart). However, it seems that the regression tree is outperforming the gradient boosted model. What's the reason behind this? And how can I improve the gradient boosted model's performance? I thought a learning rate of 0.001 should suffice with 1000 iterations/boosted trees.
library(rpart)
library(gbm)
data(iris)
train.dat <- iris[1:100, ]
test.dat <- iris[101:150, ]
learning.rate <- 0.001
M <- 1000
gbm.model <- gbm(Sepal.Length ~ ., data = train.dat, distribution = "gaussian", n.trees = M,
                 interaction.depth = 2, shrinkage = learning.rate, bag.fraction = 1, train.fraction = 1)
yhats.gbm <- predict(gbm.model, newdata = test.dat, n.trees = M)
tree.mod <- rpart(Sepal.Length ~ ., data = train.dat)
yhats.tree <- predict(tree.mod, newdata = test.dat)
> sqrt(mean((test.dat$Sepal.Length - yhats.gbm)^2))
[1] 1.209446
> sqrt(mean((test.dat$Sepal.Length - yhats.tree)^2))
[1] 0.6345438
In the iris dataset there are 3 different species: the first 50 rows are setosa, the next 50 versicolor, and the last 50 virginica, so your training and test sets contain different species. I think it's better if you shuffle the rows, and the Species column is also highly relevant:
library(ggplot2)
ggplot(iris,aes(x=Sepal.Width,y=Sepal.Length,col=Species)) + geom_point()
Secondly, you should do this over a few replicates to see the uncertainty. For this we can use caret: we can define the training samples beforehand and also provide a fixed tuning grid. What we are interested in is the cross-validated error during training, which is similar to what you are doing:
library(caret)
set.seed(999)
idx = split(sample(nrow(iris)), 1:nrow(iris) %% 3)
tr = trainControl(method = "cv", index = idx)
this_grid = data.frame(interaction.depth = 2, shrinkage = 0.001,
                       n.minobsinnode = 10, n.trees = 1000)
gbm_fit = train(Sepal.Width ~ ., data = iris, method = "gbm",
                distribution = "gaussian", tuneGrid = this_grid, trControl = tr)
Then we use the same samples to fit rpart:
#the default for rpart
this_grid = data.frame(cp = 0.01)
rpart_fit = train(Sepal.Width ~ ., data = iris, method = "rpart",
                  trControl = tr, tuneGrid = this_grid)
Finally we compare them, and they are very similar:
gbm_fit$resample
       RMSE  Rsquared       MAE Resample
1 0.3459311 0.5000575 0.2585884        0
2 0.3421506 0.4536114 0.2631338        1
3 0.3428588 0.5600722 0.2693837        2

rpart_fit$resample
       RMSE  Rsquared       MAE Resample
1 0.3492542 0.3791232 0.2695451        0
2 0.3320841 0.4276960 0.2550386        1
3 0.3284239 0.4343378 0.2570833        2
So I suspect there's something weird going on in your example above. It always depends on your data: for some data, like iris, rpart might be good enough because there are very strong predictors. For complex models like gbm, you most likely need to tune the parameters using something like the above to find the optimal ones, as in the sketch below.
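As a rough sketch of that kind of tuning (not part of the answer above; the grid values here are arbitrary choices), you could let caret search over shrinkage and the number of trees instead of fixing them at 0.001 and 1000:
library(caret)
library(gbm)

set.seed(999)
tr <- trainControl(method = "cv", number = 5)
tune_grid <- expand.grid(interaction.depth = 2,
                         shrinkage = c(0.001, 0.01, 0.1),
                         n.minobsinnode = 10,
                         n.trees = c(100, 500, 1000))
gbm_tuned <- train(Sepal.Width ~ ., data = iris, method = "gbm",
                   distribution = "gaussian", tuneGrid = tune_grid,
                   trControl = tr, verbose = FALSE)
gbm_tuned$bestTune  # CV-selected shrinkage / n.trees combination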

R: svm from e1071 predictions differ based on "probability" argument setting

Under certain circumstances, there are differences in predictions from e1071 package svm models depending on the setting of the probability input argument. This code example:
rm(list = ls())
library(e1071)
data(iris)
## Training and testing subsets
set.seed(73) # For reproducibility
ri = sample(seq(1, nrow(iris)), round(nrow(iris)*0.8))
train = iris[ri, ]
test = iris[-ri,]
## Models and predictions with probability setting F or T
set.seed(42) # Just to exclude that randomness in algorithm itself is the cause
m1 <- svm(Species ~ ., data = train, probability = F)
pred1 = predict(m1, newdata = test, probability = F)
set.seed(42) # Just to exclude that randomness in algorithm itself is the cause
m2 <- svm(Species ~ ., data = train, probability = T)
pred2 = predict(m2, newdata = test, probability = T)
## Accuracy
acc1 = sum(test$Species == pred1)/nrow(iris)
acc2 = sum(test$Species == pred2)/nrow(iris)
will give
acc1 = 0.18666...
acc2 = 0.19333...
My conclusion is that svm() performs calculations differently based on the setting of the probability parameter.
Is that correct?
If so, why and how does it differ?
I haven't seen anything about this in the docs for the package or function.
The reason I bother with this is that I have found the performance of the classification to be not only different, but consistently slightly worse when probability = T in a project where I do classification based on ~800 observations of ~250 gene abundances (bioinformatics stuff). The code from that project contains data cleaning and uses cross-validation, making it a bit bulky to include here, so you'll have to take my word for it.
Any ideas folks?

Differences when tuning neural network with two output variables using caret and neuralnet packages

I'm using the caret package with the 'neuralnet' model to find the best tuning parameters for a neural network, based on a data set that contains several predictors transformed by PCA. This data set also contains two numeric output variables, and I want to model these two variables against the predictors, so I'm performing regression.
When using the package 'neuralnet', I get the desired output: a network whose output layer consists of two neurons, corresponding to the two output variables that I want to model, as you can see from the following code.
library(neuralnet)
neuralnet.network <- neuralnet(x + y ~ PC1 + PC2, train.pca.groundTruth, hidden=2, rep=5, algorithm = "rprop+", linear.output=T)
> head(compute(neuralnet.network, test.pca[,c(1,2)])$net.result)
[,1] [,2]
187 0.5890781796 0.3481661367
72 0.7182396668 0.4330461404
107 0.5854193907 0.3446555435
228 0.6114171607 0.3648684296
262 0.6727465772 0.4035759540
135 0.5559830113 0.3288717153
However, when using the same model with train function from caret package, the output consists of just one single variable, named '.outcome', which is in fact the sum of the two variables. This is the code:
paramGrid <- expand.grid(.layer1 = c(2), .layer2 = 0, .layer3 = 0)
ctrl <- trainControl(method = "repeatedcv", repeats = 5)
set.seed(23)
caret.neuralnet <- train(x + y ~ PC1 + PC2, data = train.pca.groundTruth, method = "neuralnet", metric = "RMSE", tuneGrid = paramGrid, trControl = ctrl, algorithm = "rprop+", linear.output = T)
> head(predict(caret.neuralnet, test.pca[,c(1,2)]))
[1] 0.9221328635 1.1953289038 1.0333353272 0.9561434406 1.0409961115 0.8834807926
Is there any way to prevent caret's train function from interpreting the '+' symbol in the formula as a summation, and instead treat it as the specification of several output variables, just as neuralnet does? I've tried the x - y form, though it doesn't work.
I would like to know whether there is any way to do this without training a separate model for each output variable.
Thank you so much!
train doesn't support multiple outcomes, so the intended symbolic formula x + y resolves to a literal one that adds x and y.
Max
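Given that limitation, one workaround (a sketch only, not part of the answer above; it reuses the question's train.pca.groundTruth and test.pca objects) is to train one caret model per output variable and combine the predictions:
library(caret)

ctrl <- trainControl(method = "repeatedcv", repeats = 5)
paramGrid <- expand.grid(.layer1 = 2, .layer2 = 0, .layer3 = 0)

# One model per outcome variable
fit.x <- train(x ~ PC1 + PC2, data = train.pca.groundTruth, method = "neuralnet",
               metric = "RMSE", tuneGrid = paramGrid, trControl = ctrl,
               algorithm = "rprop+", linear.output = TRUE)
fit.y <- train(y ~ PC1 + PC2, data = train.pca.groundTruth, method = "neuralnet",
               metric = "RMSE", tuneGrid = paramGrid, trControl = ctrl,
               algorithm = "rprop+", linear.output = TRUE)

# Combine the two sets of predictions into a two-column result
preds <- data.frame(x = predict(fit.x, test.pca[, c(1, 2)]),
                    y = predict(fit.y, test.pca[, c(1, 2)]))
head(preds)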
