Bernoulli vs Adaboost GBM? - r

I don't really understand the practical difference between distribution = 'adaboost' and distribution = 'bernoulli' in gbm.
library(MASS)
library(gbm)
data=Boston
data$chas = factor(data$chas)
ada_model = gbm(chas~ . , data, distribution ='adaboost')
bern_model = gbm(chas ~ . , data, distribution = 'bernoulli')
ada_model
bern_model
I don't understand why bernoulli doesn't give any results. I guess I have a fundamental misunderstanding of how this works?
I'm looking for:
1. explanation why bernoulli doesn't work. I thought documentation said this can be used for classification?
2. if they can both be used for classification, what are the practical differences?

Bernoulli is breaking for you because the factor call recodes the 0/1s to 1/2s:
> str(factor(data$chas[350:400]))
Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 2 2 ...
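A minimal base-R sketch of that recoding, using a made-up 0/1 vector rather than the Boston data. The factor's integer codes start at 1, so anything that reads the codes numerically sees 1/2 instead of 0/1; going through as.character() recovers the original values.

```r
# Hypothetical 0/1 outcome (illustrative, not the Boston data)
y <- c(0L, 0L, 1L, 0L, 1L)

f <- factor(y)
codes <- as.integer(f)                     # internal codes: 1 for "0", 2 for "1"
restored <- as.integer(as.character(f))    # back to the original 0/1 values

print(codes)
print(restored)
```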

> str(data$chas)
int [1:506] 0 0 0 0 0 0 0 0 0 0 ...
> sum(data$chas==0) + sum(data$chas==1)
[1] 506
There are currently 506 integers, all either zero or one, so there is nothing to do. Remove line 4 (the factor() call), as @Neal Fultz recommended in his original comment and explained in his answer. If you want to explicitly bound the variable to {0,1}, you can use as.logical, and your code becomes:
library(MASS)
library(gbm)
data=Boston
data$chas = as.logical(data$chas) # optionally cast as logical to force range into 0 or 1
ada_model = gbm(chas~ . , data, distribution ='adaboost')
bern_model = gbm(chas ~ . , data, distribution = 'bernoulli')
ada_model
bern_model
Reading between the lines a little, I'm guessing that your real problem is that your production dataset has values other than {0,1}. Casting them to logical will convert every nonzero value to TRUE (1), and you're ready to go. If that's not what you want, then use this to find them and examine them case by case:
which((data$chas != 0) & (data$chas != 1))
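On the second part of the question, one practical difference is the loss each option minimizes. As far as I know, the gbm documentation describes 'bernoulli' as logistic regression for 0/1 outcomes and 'adaboost' as the AdaBoost exponential loss for 0/1 outcomes. The sketch below is plain R, not gbm internals; the helper names and margin values are made up for illustration. It evaluates both losses on the margin m = (2y - 1) * f, where f is the model's score.

```r
# Illustrative helpers (hypothetical names, not part of gbm)
bernoulli_loss <- function(m) log(1 + exp(-m))   # negative Bernoulli log-likelihood
adaboost_loss  <- function(m) exp(-m)            # exponential loss

margins <- c(-4, -2, 0, 2, 4)   # negative margin = wrong side of the boundary
loss_table <- data.frame(margin    = margins,
                         bernoulli = bernoulli_loss(margins),
                         adaboost  = adaboost_loss(margins))
print(round(loss_table, 3))
```

For badly misclassified points (large negative margin) the exponential loss grows much faster than the Bernoulli loss, so adaboost is likely to be more sensitive to outliers and mislabeled rows. With bernoulli, the score f is on the log-odds scale.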

Related

R glmnet classification - family = 'binomial', type = 'class', no errors, why am I still getting regression predictions?

I've been mucking around for about a week now, trying to figure this one out, so any help would be greatly appreciated.
I've got a data set with a binary target, and continuous predictors.
The input looks like this (with more variables, but you get the idea - it's pretty sparse):
18.425 0 0 0 0
0.000 0 0 0 0
0.000 0 0 0 0
0.000 0 0 3.234 0
0.000 0 0 0 0
The target is binary, 0 or 1, and also quite sparse:
0 1 0 0 0
I'm trying the following code:
ridge_fit <- glmnet(x = as.matrix(train_input),
y = as.factor(train_target),
family="binomial")
ridge_predict <- predict.glmnet(ridge_fit,
newx = test_input,
type = 'class')
And getting output like this:
       s0        s1        s2        s3        s4
-3.391069 -3.396630 -3.400896 -3.404444 -3.407538
-3.391069 -3.388934 -3.388549 -3.388796 -3.389314
-3.391069 -3.396621 -3.400882 -3.404427 -3.407517
-3.391069 -3.396630 -3.400896 -3.404444 -3.407538
-3.391069 -3.396630 -3.400896 -3.404444 -3.407538
I've tried playing around with the family in fitting, the type in predicting, run things as factor, as matrix, played around with different alpha values (aiming for ridge, but willing to try anything that works at this point) and different lambda sequences, tried some smaller data sets (then I'd get entire variables that were null values, and some errors cropped up).
Super, super confused about what else I can try. The data set works fine for regression, but it keeps spitting out regression-ish values when I try it with a classification target.
No idea what to do next . . . thanks in advance for any feedback!
There are several things here:
1. Use the predict() S3 generic instead of calling predict.glmnet directly. Because class(ridge_fit) is c("lognet", "glmnet"), predict() will dispatch to predict.lognet first, which is the method that turns the linear predictor into classes or probabilities. If you need probabilities, use type = 'response'.
2. The answer you got is a matrix in which each column corresponds to a particular lambda value. You can get the lambda values from the ridge_fit object (ridge_fit$lambda).
3. If you need a single prediction, consider using cv.glmnet() to pick an optimal lambda by cross-validation.
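The three points above can be sketched as follows. This assumes the glmnet package is installed and uses synthetic data (the x and y below are made up, not the asker's train_input and train_target).

```r
library(glmnet)

set.seed(1)
# Hypothetical data: 200 rows, 5 continuous predictors, binary target
x <- matrix(rnorm(200 * 5), ncol = 5)
y <- factor(rbinom(200, 1, plogis(x[, 1])))

# alpha = 0 requests ridge; the fit covers a whole sequence of lambda values
ridge_fit <- glmnet(x, y, family = "binomial", alpha = 0)

# predict() dispatches on the fitted object's class; one column per lambda
all_lambdas <- predict(ridge_fit, newx = x, type = "class")
dim(all_lambdas)

# Cross-validation picks a single lambda, giving one prediction per row
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 0)
pred_class <- predict(cv_fit, newx = x, s = "lambda.min", type = "class")
pred_prob  <- predict(cv_fit, newx = x, s = "lambda.min", type = "response")
table(pred_class)
```

Note that newx must be a matrix, so a data frame of test inputs would need as.matrix() first.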

How to change data of a corpus to appropriate format for training with 'caret' package in R?

Q-1. How to change data of a corpus to appropriate format for training with 'caret' package?
First of all, I would like to give you some context for this question, and then I will show you where I am stuck.
Environment
This is the corpus, which is called rt. (R Code)
require(tm)
require(tm.corpus.Reuters21578) # to load data
data(Reuters21578)
rt<-Reuters21578
The training document-term matrix, called dtmTrain, is created from the training corpus rtTrain.
(R Code)
dtmTrain <- DocumentTermMatrix(rtTrain)
I have 10 classes in total for this project. The classes are in the metadata of each document.
c("earn","acq","money-fx","grain","crude","trade","interest","ship","wheat","corn")
I have created a data frame from rt with documents as rows and classes as columns. It is called docLabels.
Docs earn acq money-fx grain crude trade interest ship wheat corn
   1    0   0        0     0     0     0        0    0     0    0
   2    0   0        0     0     0     0        0    0     0    0
   3    0   0        0     0     0     0        0    0     0    0
   4    0   0        0     0     0     0        0    0     0    0
   5    0   0        0     1     0     0        0    0     1    1
   6    0   0        0     1     0     0        0    0     1    1
I assume that everything is clear so far.
Problem
I have a document-term matrix that holds the data and a data frame that holds the classes, as you can see. How can I merge these two data objects for training with the 'caret' package?
Q-2. How to train multiclass data with 'caret' package?
Once the data is in the right format, how do I train on it with the caret package?
This is from caret package documentation.
## S3 method for class 'formula'
train(form, data, ..., weights, subset, na.action, contrasts = NULL)
So, what should form be?
Since you are working with matrices, you should consider the default method for caret::train rather than the formula interface. Note under ?train that you can pass the arguments like:
x: an object where samples are in rows and features are in columns. This could be a simple matrix...
y: a numeric or factor vector containing the outcome for each sample.
This will be simpler than building a formula. So let's discuss how to obtain x and y.
Getting x: We want to pass caret::train an x matrix with only those terms we want to use in the model. So we have to narrow the DocumentTermMatrix, which is a sparse matrix, down to those terms:
# You need to tell people where to find the file so your example is reproducible
install.packages("tm.corpus.Reuters21578", repos = "http://datacube.wu.ac.at")
library(tm.corpus.Reuters21578)
data(Reuters21578)
rt <- Reuters21578
dtm <- DocumentTermMatrix(rt)
# these are the terms you care about
your_terms <- c("earn","acq","money-fx","grain","crude","trade",
"interest","ship","wheat","corn")
your_columns <- which(tolower(dtm$dimnames$Terms) %in% your_terms) # only 8 are found
your_dtm <- as.matrix(dtm[,your_columns]) # unpack selected columns of sparse matrix
Getting y:
Your question is not at all clear in terms of what your dependent variable is -- the thing you are trying to predict. For this answer I will show you how to predict whether the document includes one or more uses of the word "debt." If one of the classes in your_terms is actually your dependent variable, then remove it from your_terms and use it instead of "debt" in this example:
your_target <- as.integer(as.matrix(dtm[,'debt'])[,1] > 0) # 0/1 integer vector, one entry per document
Training a model in caret.
First we will split the target vector and the explanatory matrix into 60/40 train/test sets.
library('caret')
set.seed(123)
train_rows <- createDataPartition(your_target, p=0.6, list=FALSE) # row indices for a 60% training set; list=FALSE so they can be used for subsetting
dtm_train <- your_dtm[train_rows,]
y_train <- your_target[train_rows]
dtm_test <- your_dtm[-train_rows,]
y_test <- your_target[-train_rows]
Now you need to decide what kind of model(s) you want to try. For our example, we will use a lasso/ridge regression glmnet model. You should also try tree-based approaches such as rf or gbm.
Using the parallel backend is not strictly necessary but will speed up large jobs. Feel free to try this example without it.
tr_ctrl <- trainControl(method='repeatedcv', number=8, # train using 8-fold CV w/ 3 reps
                        repeats=3, returnResamp='none')
library(parallel)
library(doParallel) # works on Windows, Linux and macOS
use_cores <- detectCores()-1
cl <- makeCluster(use_cores)
registerDoParallel(cl) # register the cluster as the parallel backend
set.seed(123)
glm <- train(x = dtm_train, y = y_train,            # You can ignore the warning about
             method='glmnet', trControl = tr_ctrl)  # classification vs. regression.
stopCluster(cl)
Of course there is a lot more tuning you could do here.
Testing the model. You can use AUC here.
library('pROC')
auc_train <- roc(y_train,
predict(glm, newdata = dtm_train, type='raw') )
auc_test <- roc(y_test,
predict(glm, newdata = dtm_test, type='raw') )
writeLines(paste('AUC using glm:', round(auc_train$auc,4),'on training/validation set',
round(auc_test$auc,4),'on test set.'))
Running this I get AUC using glm: 0.6389 on training/validation set 0.6552 on test set. So make sure to try other models and see if you can improve performance.
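To make concrete what those AUC numbers measure, here is a small base-R sketch (not pROC's implementation) that computes AUC from the rank-sum identity: the probability that a randomly chosen positive gets a higher score than a randomly chosen negative. The function name and the toy vectors are made up for illustration.

```r
# Hypothetical helper: AUC via the rank-sum (Mann-Whitney) identity
auc_rank <- function(labels, scores) {
  stopifnot(length(labels) == length(scores), all(labels %in% c(0, 1)))
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  r <- rank(scores)                       # ties receive average ranks
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

labels <- c(0, 0, 1, 1)
scores <- c(0.1, 0.4, 0.35, 0.8)
auc_toy <- auc_rank(labels, scores)
print(auc_toy)
```

An AUC of 0.5 corresponds to random ranking and 1 to perfect ranking, which puts the values of roughly 0.64 to 0.66 reported above in context.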

R multiclass/multinomial classification ROC using multiclass.roc (Package ‘pROC’)

I am having difficulty understanding what the multiclass.roc parameters should look like.
Here is a snapshot of my data:
> head(testing.logist$cut.rank)
[1] 3 3 3 3 1 3
Levels: 1 2 3
> head(mnm.predict.test.probs)
              1            2          3
9  1.013755e-04 3.713862e-02 0.96276001
10 1.904435e-11 3.153587e-02 0.96846413
12 6.445101e-23 1.119782e-11 1.00000000
13 1.238355e-04 2.882145e-02 0.97105472
22 9.027254e-01 7.259787e-07 0.09727389
26 1.365667e-01 4.034372e-01 0.45999610
>
I tried calling multiclass.roc with:
multiclass.roc(
response=testing.logist$cut.rank,
predictor=mnm.predict.test.probs,
formula=response~predictor
)
but naturally I get an error:
Error in roc.default(response, predictor, levels = X, percent = percent, :
Predictor must be numeric or ordered.
When it's a binary classification problem I know that 'predictor' should contain probabilities (one per observation). However, in my case, I have 3 classes, so my predictor is a set of rows, each with 3 columns (or a sublist of 3 values) corresponding to the probability of each class.
Does anyone know what my 'predictor' should look like instead?
The pROC package is not really designed to handle this case, where you get multiple predictions (as probabilities for each class). Typically you would assess your P(class = 1):
multiclass.roc(
response=testing.logist$cut.rank,
predictor=mnm.predict.test.probs[,1])
And then do it again with P(class = 2) and P(class = 3). Or better, determine the most likely class:
predicted.class <- apply(mnm.predict.test.probs, 1, which.max)
multiclass.roc(
response=testing.logist$cut.rank,
predictor=predicted.class)
Consider multiclass.roc as a toy that can sometimes be helpful but most likely won't really fit your needs.
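The "most likely class" step can be checked on the six rows shown in the question. This sketch uses base R only. Mapping through colnames() is my addition, so that the result carries class labels rather than bare column positions, which matters when the levels are not simply 1, 2, 3.

```r
# The six rows shown in the question, re-entered by hand
probs <- matrix(c(1.013755e-04, 3.713862e-02, 0.96276001,
                  1.904435e-11, 3.153587e-02, 0.96846413,
                  6.445101e-23, 1.119782e-11, 1.00000000,
                  1.238355e-04, 2.882145e-02, 0.97105472,
                  9.027254e-01, 7.259787e-07, 0.09727389,
                  1.365667e-01, 4.034372e-01, 0.45999610),
                ncol = 3, byrow = TRUE,
                dimnames = list(c("9", "10", "12", "13", "22", "26"),
                                c("1", "2", "3")))
observed <- factor(c(3, 3, 3, 3, 1, 3), levels = 1:3)

# Column index of the largest probability, mapped back to the class label
idx <- apply(probs, 1, which.max)
predicted <- factor(colnames(probs)[idx], levels = levels(observed))

# Rows are observed classes, columns are predicted classes
confusion <- table(observed = observed, predicted = predicted)
print(confusion)
```

A confusion table like this is often easier to interpret than a single multiclass AUC, since it shows which classes get mixed up.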

Logistic regression - defining reference level in R

I am going nuts trying to figure this out. How can I define, in R, the reference level to use in a binary logistic regression? What about multinomial logistic regression? Right now my code is:
logistic.train.model3 <- glm(class~ x+y+z,
family=binomial(link=logit), data=auth, na.action = na.exclude)
My response variable is "YES" and "NO". I want to predict the probability of someone responding with "YES".
I DO NOT want to recode the variable to 0 / 1. Is there a way I can tell the model to predict "YES" ?
Thank you for your help.
Note that, when using auth$class <- relevel(auth$class, ref = "YES"), you are actually predicting "NO".
To predict "YES", the reference level must be "NO". Therefore, you have to use auth$class <- relevel(auth$class, ref = "NO").
It's a common mistake people make, since most of the time their outcome variable is a vector of 0 and 1, and people want to predict 1.
But when such a vector is treated as a factor variable, the reference level is 0 (see below), so people effectively predict 1. Likewise, your reference level must be "NO" so that you will predict "YES".
set.seed(1234)
x1 <- sample(c(0, 1), 50, replace = TRUE)
x2 <- factor(x1)
str(x2)
#Factor w/ 2 levels "0","1": 1 2 2 2 2 2 1 1 2 2 ...
You can see that the reference level is 0.
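Assuming you have class saved as a factor, use the relevel() function. glm() models the probability of the non-reference level, so to predict "YES" the reference must be "NO":
auth$class <- relevel(auth$class, ref = "NO")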
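A quick way to check which level glm() treats as the event is to fit the same model under both reference levels and compare. The sketch below uses synthetic data (resp and x are made-up names, not the asker's auth data frame). With a factor response, glm() treats the first level as failure and models the probability of the other level.

```r
set.seed(42)
# Hypothetical data: larger x makes "YES" more likely
x <- rnorm(200)
resp <- factor(ifelse(rbinom(200, 1, plogis(1.5 * x)) == 1, "YES", "NO"))
levels(resp)                                     # "NO" "YES": "NO" is the reference

fit_ref_no <- glm(resp ~ x, family = binomial)   # models P(resp == "YES")

resp_yes_ref <- relevel(resp, ref = "YES")       # now "YES" is the reference
fit_ref_yes <- glm(resp_yes_ref ~ x, family = binomial)  # models P(resp == "NO")

coef(fit_ref_no)["x"]    # positive: x raises P("YES")
coef(fit_ref_yes)["x"]   # same size, opposite sign
```

The fitted probabilities from the two models sum to 1 for every row, which confirms that changing the reference level only flips which outcome is being predicted.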

R decision tree using all the variables

I would like to perform a decision tree analysis. I want the decision tree to use all the variables in the model.
I also need to plot the decision tree. How can I do that in R?
This is a sample of my dataset
> head(d)
  TargetGroup2000 TargetGroup2012 SmokingGroup_Kai PA_Score wheeze3 asthma3 tres3
1               2               2                4        2       0       0     0
2               2               2                4        3       1       0     0
3               2               2                5        1       0       0     0
4               2               2                4        2       1       0     0
5               2               3                3        1       0       0     0
6               2               3                3        2       0       0     0
>
I would like to use the formula
myFormula <- wheeze3 ~ TargetGroup2000 + TargetGroup2012 + SmokingGroup_Kai + PA_Score
Note that all the variables are categorical.
EDIT:
My problem is that some variables do not appear in the final decision tree.
The depth of the tree should be defined by a penalty parameter alpha. I do not know how to set this penalty so that all the variables appear in my model.
In other words, I would like a model that minimizes the training error.
As mentioned above, if you want to run the tree on all the variables you should write it as
library(party) # provides ctree() and ctree_control()
ctree(wheeze3 ~ ., d)
The penalty you mentioned is set through ctree_control(). You can set the p-value threshold there (mincriterion is 1 minus the p-value), along with the minimum split and bucket sizes. So in order to maximize the chance that all the variables will be included, you should do something like this:
ctree(wheeze3 ~ ., d, controls = ctree_control(mincriterion = 0.85, minsplit = 0, minbucket = 0))
The problem is that you run the risk of overfitting.
The last thing you need to understand is that the reason you may not see all the variables in the output of the tree is that they don't have a significant influence on the dependent variable. Unlike linear or logistic regression, which will show all the variables and give you a p-value to determine whether each is significant, the decision tree does not return the insignificant variables, i.e., it doesn't split on them.
For better understanding of how ctree works, please take a look here: https://stats.stackexchange.com/questions/12140/conditional-inference-trees-vs-traditional-decision-trees
The easiest way is to use the rpart package, which ships with standard R installations as a recommended package.
library(rpart)
model <- rpart( wheeze3 ~ ., data=d )
summary(model)
plot(model)
text(model)
The . in the formula argument means use all the other variables as independent variables.
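For the rpart route, the complexity parameter cp plays roughly the role of the penalty asked about in the question. The sketch below is one way to grow an unpenalized tree and list which variables were actually used. It runs on a synthetic stand-in (d_toy is made up, with random values, and is not the asker's d).

```r
library(rpart)

set.seed(7)
# Hypothetical stand-in for the question's data frame `d`
d_toy <- data.frame(
  wheeze3          = factor(rbinom(300, 1, 0.3)),
  TargetGroup2000  = factor(sample(1:3, 300, replace = TRUE)),
  TargetGroup2012  = factor(sample(1:3, 300, replace = TRUE)),
  SmokingGroup_Kai = factor(sample(1:5, 300, replace = TRUE)),
  PA_Score         = factor(sample(1:3, 300, replace = TRUE))
)

# cp = 0 switches off the complexity penalty; small minsplit/minbucket allow deep trees
full_tree <- rpart(wheeze3 ~ ., data = d_toy, method = "class",
                   control = rpart.control(cp = 0, minsplit = 2, minbucket = 1))

# Variables that were actually used for at least one split
used_vars <- setdiff(unique(as.character(full_tree$frame$var)), "<leaf>")
print(used_vars)
```

This pushes training error down at the cost of overfitting, and a variable can still be absent if no split on it improves the fit.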
plot(ctree(myFormula, data = d)) # myFormula is already a formula, so pass it directly
