How to apply machine learning techniques / how to use model outputs - R

I am a plant scientist new to machine learning. I have had success writing code and following tutorials on machine learning techniques. My issue is understanding how to actually apply these techniques to answer real-world questions; I don't really understand how to use the model outputs to answer questions.
I recently followed a tutorial creating an algorithm to detect credit card fraud. All of the models ran nicely and I understand how to build them, but how in the world do I take this information and translate it into a definitive answer? Following the same example, let's say I wrote this code for my job: how would I then take real credit card data and screen it using this algorithm? I really want to establish the link between running these models and generating a useful output from real data.
Thank you all.
In the interest of being concise, I will highlight some specific examples using the same data set, found here:
https://drive.google.com/file/d/1CTAlmlREFRaEN3NoHHitewpqAtWS5cVQ/view
# Import (read_csv comes from the readr package)
library(readr)
creditcard_data <- read_csv('PATH')
# Restructure: standardise Amount and drop the Time column
creditcard_data$Amount <- scale(creditcard_data$Amount)
NewData <- creditcard_data[, -c(1)]
head(NewData)
# Split
library(caTools)
set.seed(123)
data_sample <- sample.split(NewData$Class, SplitRatio = 0.80)
train_data <- subset(NewData, data_sample == TRUE)
test_data <- subset(NewData, data_sample == FALSE)
1) Decision Tree
library(rpart)
library(rpart.plot)
# Note: this fits and scores the tree on the full data set; use train_data/test_data instead to gauge generalisation
decisionTree_model <- rpart(Class ~ ., creditcard_data, method = 'class')
predicted_val <- predict(decisionTree_model, creditcard_data, type = 'class')
probability <- predict(decisionTree_model, creditcard_data, type = 'prob')
rpart.plot(decisionTree_model)
2) Artificial Neural Network
library(neuralnet)
ANN_model <- neuralnet(Class ~ ., train_data, linear.output = FALSE)
plot(ANN_model)
predANN <- compute(ANN_model, test_data)
resultANN <- predANN$net.result
# convert the network's output probabilities into 0/1 class labels
resultANN <- ifelse(resultANN > 0.5, 1, 0)
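One quick way to see what these 0/1 outputs are good for is to compare them with the true labels of the held-out test set, for example with a confusion matrix (a minimal sketch, assuming test_data$Class holds the 0/1 labels):
confusionANN <- table(predicted = resultANN, actual = test_data$Class)
confusionANN
sum(diag(confusionANN)) / sum(confusionANN)   # overall accuracy on the test set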
3) Gradient Boosting
library(gbm, quietly = TRUE)
library(pROC)   # provides roc() used below
# train GBM model
system.time(
  model_gbm <- gbm(Class ~ .
                   , distribution = "bernoulli"
                   , data = rbind(train_data, test_data)
                   , n.trees = 100
                   , interaction.depth = 2
                   , n.minobsinnode = 10
                   , shrinkage = 0.01
                   , bag.fraction = 0.5
                   , train.fraction = nrow(train_data) / (nrow(train_data) + nrow(test_data))
  )
)
# best iteration
gbm.iter <- gbm.perf(model_gbm, method = "test")
model.influence <- relative.influence(model_gbm, n.trees = gbm.iter, sort. = TRUE)
# partial-dependence plot of the first predictor
plot(model_gbm)
# score the test set and compute the AUC
gbm_test <- predict(model_gbm, newdata = test_data, n.trees = gbm.iter)
gbm_auc <- roc(test_data$Class, gbm_test, plot = TRUE, col = "red")
print(gbm_auc)

You develop your model with, preferably, three data sets:
Training, Testing and Validation. (Sometimes different terminology is used.)
Here, the Train and Test sets are used to develop the model.
The model you decide upon must never see any of the Validation set. This set is used to see how good your model really is; in effect it simulates the real-world new data that may come to you in the future. Once you decide your model performs to an acceptable level, you can go back and refit it on all of your data to produce the final operational model. Any new 'live' data of interest is then fed to that model, which produces an output. In the case of fraud detection it would output a probability: here you need human input to decide at what level you would flag an event as fraudulent enough to warrant further investigation.
At periodic intervals, as new data arrives, or when your model's performance weakens (fraudsters may become more cunning!), you repeat the whole process.
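As a concrete sketch of that last step in R (final_model and new_transactions are hypothetical names, and the 0.9 cut-off is only a placeholder to be agreed with the investigators):
# Refit the chosen model on ALL available labelled data, e.g. the decision tree from the question.
final_model <- rpart(Class ~ ., data = NewData, method = 'class')
# 'new_transactions' stands for incoming live data, preprocessed the same way
# (Amount scaled, Time column dropped) so its columns match the training data.
new_transactions$fraud_prob <- predict(final_model, newdata = new_transactions, type = 'prob')[, "1"]
# Flag everything above an agreed threshold for manual review.
flagged <- new_transactions[new_transactions$fraud_prob > 0.9, ]
write.csv(flagged, "transactions_to_review.csv", row.names = FALSE)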

Related

Consistent results with multiple runs of h2o deeplearning

For a certain combination of parameters in the deeplearning function of h2o, I get different results each time I run it.
args <- list(list(hidden = c(200, 200, 200),
                  loss = "CrossEntropy",
                  hidden_dropout_ratio = c(0.1, 0.1, 0.1),
                  activation = "RectifierWithDropout",
                  epochs = EPOCHS))
run <- function(extra_params) {
  model <- do.call(h2o.deeplearning,
                   modifyList(list(x = columns, y = c("Response"),
                                   validation_frame = validation, distribution = "multinomial",
                                   l1 = 1e-5, balance_classes = TRUE,
                                   training_frame = training), extra_params))
}
model <- lapply(args, run)
What would I need to do in order to get consistent results for the model each time I run this?
Deep learning with H2O will not be reproducible if it is run on more than a single core. The results and performance metrics may vary slightly each time you train the model. The implementation in H2O uses a technique called "Hogwild!", which increases training speed at the cost of reproducibility on multiple cores.
So if you want reproducible results you will need to restrict H2O to run on a single core and make sure to use a seed in the h2o.deeplearning call.
Edit based on comment by Darren Cook:
I forgot to include the reproducible = TRUE parameter, which needs to be set in combination with the seed to make the run truly reproducible. Note that this will make training a lot slower, and it is not advisable for a large dataset.
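A minimal sketch of the single-core, seeded setup, reusing columns, training, validation and EPOCHS from the question (the seed value is arbitrary, and the dropout parameter appears to be spelled hidden_dropout_ratios in recent h2o releases):
library(h2o)
# Restrict H2O to one core so the Hogwild! updates cannot interleave differently between runs.
h2o.init(nthreads = 1)
model <- h2o.deeplearning(x = columns, y = "Response",
                          training_frame = training,
                          validation_frame = validation,
                          distribution = "multinomial",
                          hidden = c(200, 200, 200),
                          activation = "RectifierWithDropout",
                          hidden_dropout_ratios = c(0.1, 0.1, 0.1),
                          l1 = 1e-5, balance_classes = TRUE,
                          epochs = EPOCHS,
                          reproducible = TRUE,  # deterministic (and slower) training
                          seed = 1234)          # fix the random number stream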
More information on "Hogwild!"

R e1071 SVM leave-one-out cross-validation result differs from manual LOOCV

I'm using the e1071 svm function to classify my data.
I tried two different ways of doing LOOCV.
The first one is like this:
svm.model <- svm(mem ~ ., data, kernel = "sigmoid", cost = 7, gamma = 0.009, cross = subSize)
svm.pred = data$mem
svm.pred[which(svm.model$accuracies==0 & svm.pred=='good')]=NA
svm.pred[which(svm.model$accuracies==0 & svm.pred=='bad')]='good'
svm.pred[is.na(svm.pred)]='bad'
conMAT <- table(pred = svm.pred, true = data$mem)
summary(svm.model)
I set cross to the number of subjects to perform LOOCV, but the classification result is different from my manual version of LOOCV, which looks like this:
CMAT <- 0                    # running confusion matrix across folds
CORR <- numeric(subSize)     # number of correct predictions in each fold
for (i in 1:subSize){
  data_Tst <- data[i, 1:dSize]
  data_Trn <- data[-i, 1:dSize]
  svm.model1 <- svm(mem ~ ., data = data_Trn, kernel = "linear", cost = 2, gamma = 0.02)
  svm.pred1 <- predict(svm.model1, data_Tst[, -dSize])
  conMAT <- table(pred = svm.pred1, true = data_Tst[, dSize])
  CMAT <- CMAT + conMAT
  CORR[i] <- sum(diag(conMAT))
}
In my opinion, LOOCV accuracy should not vary across runs of the code, because the SVM builds a model with all the data except one observation and repeats this until the end of the loop. However, with the svm function's 'cross' argument, the accuracy differs on every run of the code.
Which way is more accurate? Thanks for reading this post! :-)
You are using different hyper-parameters (cost, gamma) and different kernels (linear, sigmoid). If you want identical results, these should be the same on each run (see the sketch after these points).
Also, it depends on how Leave One Out (LOO) is implemented:
Does your LOO method leave one out randomly or as a sliding window over the dataset?
Does your LOO method leave one out from one class at a time or both classes at the same time?
Is the training set always the same, or are you using a randomisation procedure before splitting between a training and testing set (assuming you have a separate, independent testing set)? If so, the examples you are cross-validating would change on each run.
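For example, a minimal sketch that keeps the kernel and hyper-parameters identical in both versions, reusing data and mem from the question, might look like this:
library(e1071)
# Built-in leave-one-out CV: setting cross to the number of rows gives LOO.
svm.cv <- svm(mem ~ ., data = data, kernel = "sigmoid",
              cost = 7, gamma = 0.009, cross = nrow(data))
mean(svm.cv$accuracies) / 100   # per-fold accuracies are reported in percent
# Manual LOO with the SAME kernel and hyper-parameters.
correct <- logical(nrow(data))
for (i in seq_len(nrow(data))) {
  fit <- svm(mem ~ ., data = data[-i, ], kernel = "sigmoid",
             cost = 7, gamma = 0.009)
  correct[i] <- predict(fit, data[i, , drop = FALSE]) == data$mem[i]
}
mean(correct)                   # should be close to the built-in estimate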

The xgboost package and random forest regression

The xgboost package allows you to build a random forest (in fact, it chooses a random subset of columns for the whole tree, not for each node as in the classical version of the algorithm, but that can be tolerated). But it seems that for regression only one tree from the forest (maybe the last one built) is used.
To ensure that, consider just a standard toy example.
library(xgboost)
library(randomForest)
data(agaricus.train, package = 'xgboost')
dtrain = xgb.DMatrix(agaricus.train$data,
                     label = agaricus.train$label)
bst = xgb.train(data = dtrain,
                nround = 1,
                subsample = 0.8,
                colsample_bytree = 0.5,
                num_parallel_tree = 100,
                verbose = 2,
                max_depth = 12)
answer1 = predict(bst, dtrain)
(answer1 - agaricus.train$label) %*% (answer1 - agaricus.train$label)
forest = randomForest(x = as.matrix(agaricus.train$data), y = agaricus.train$label, ntree = 50)
answer2 = predict(forest, as.matrix(agaricus.train$data))
(answer2 - agaricus.train$label) %*% (answer2 - agaricus.train$label)
Yes, of course, the default version of the xgboost random forest uses not a Gini score function but simply the MSE; that can be changed easily. It is also not strictly correct to validate on the training data, and so on, but none of that affects the main problem. Regardless of which sets of parameters are tried, the results are surprisingly bad compared with the randomForest implementation. This holds for other data sets as well.
Could anybody provide a hint about this strange behaviour? For the classification task the algorithm does work as expected.
Well, all trees are grown and all are used to make a prediction. You may check that using the 'ntreelimit' parameter of the 'predict' function.
The main problem remains: is the specific form of the random forest algorithm produced by the xgboost package valid?
Cross-validation, parameter tuning and so on have nothing to do with that -- anyone can add the necessary corrections to the code and see what happens.
You may specify the 'objective' option like this:
mse = function(preds, dtrain)
{
  real = getinfo(dtrain, 'label')
  # gradient and hessian of the squared-error loss
  return(list(grad = 2 * (preds - real),
              hess = rep(2, length(real))))
}
This ensures that the MSE is used when choosing a variable for the split. Even after that, the results are surprisingly bad compared to those of randomForest.
Maybe the problem is of an academic nature and concerns the way the random subset of features for a split is chosen. The classical implementation chooses a subset of features (its size is specified with 'mtry' in the randomForest package) for EVERY split separately, whereas the xgboost implementation chooses one subset per tree (specified with 'colsample_bytree').
So this fine difference appears to be of great importance, at least for some types of datasets. It is interesting, indeed.
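For what it is worth, more recent xgboost releases expose colsample_bynode, which resamples the columns at every split and is therefore much closer to mtry. A hedged sketch reusing dtrain from above (parameter availability depends on your xgboost version):
# Bagged trees with per-split column sampling, closer to a classic random forest.
bst_rf <- xgb.train(params = list(max_depth = 12,
                                  subsample = 0.8,
                                  colsample_bynode = 0.5,   # resampled at each split, like mtry
                                  num_parallel_tree = 100,
                                  eta = 1),                 # no shrinkage for a single RF-style round
                    data = dtrain,
                    nrounds = 1)
answer_rf <- predict(bst_rf, dtrain)
(answer_rf - agaricus.train$label) %*% (answer_rf - agaricus.train$label)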
xgboost (random forest style) does use more than one tree to predict, but there are many other differences to explore.
I myself am new to xgboost, but curious, so I wrote the code below to visualize the trees. You can run the code yourself to verify or explore other differences.
Your data set of choice is a classification problem, as the labels are either 0 or 1. I prefer to switch to a simple regression problem to visualize what xgboost does.
True model: $y = x_1 x_2 + \varepsilon$ (noise).
Whether you train a single tree or multiple trees, the code examples below show that the learned model structure does contain more than one tree. You cannot tell from prediction accuracy alone how many trees were trained.
Maybe the predictions are different because the implementations are different. None of the ~5 RF implementations I know of are exactly alike, and this xgboost (RF style) is at best a distant "cousin".
I observe that colsample_bytree is not equal to mtry, as the former uses the same subset of variables/columns for the entire tree. My regression problem is one big interaction only, which cannot be learned if trees use only x1 or only x2. Thus in this case colsample_bytree must be set to 1 to use both variables in all trees, whereas a regular RF could model this problem with mtry = 1, as each node would use either x1 or x2.
I see your randomForest predictions are not out-of-bag cross-validated. If you draw any conclusions from predictions you must cross-validate, especially for fully grown trees.
NB: you need to fix the function vec.plot, as it does not support xgboost out of the box, because xgboost does not accept a data.frame as valid input. The instructions in the code should be clear.
library(xgboost)
library(rgl)
library(forestFloor)
Data = data.frame(replicate(2,rnorm(5000)))
Data$y = Data$X1*Data$X2 + rnorm(5000)*.5
gradientByTarget =fcol(Data,3)
plot3d(Data,col=gradientByTarget) #true data structure
fix(vec.plot) # change these two lines in the function, as xgboost does not accept a data.frame
#16# yhat.vec = predict(model, as.matrix(Xtest.vec))
#21# yhat.obs = predict(model, as.matrix(Xtest.obs))
#1 single deep tree
xgb.model = xgboost(data = as.matrix(Data[,1:2]),label=Data$y,
nrounds=1,params = list(max.depth=250))
vec.plot(xgb.model,as.matrix(Data[,1:2]),1:2,col=gradientByTarget,grid=200)
plot(Data$y,predict(xgb.model,as.matrix(Data[,1:2])),col=gradientByTarget)
#clearly just one tree
#100 trees (gbm boosting)
xgb.model = xgboost(data = as.matrix(Data[,1:2]),label=Data$y,
nrounds=100,params = list(max.depth=16,eta=.5,subsample=.6))
vec.plot(xgb.model,as.matrix(Data[,1:2]),1:2,col=gradientByTarget)
plot(Data$y,predict(xgb.model,as.matrix(Data[,1:2])),col=gradientByTarget) ##predictions are not OOB cross-validated!
#20 deep trees (bagging), columns sampled once per tree
xgb.model = xgboost(data = as.matrix(Data[,1:2]),label=Data$y,
nrounds=1,params = list(max.depth=250,
num_parallel_tree=20,colsample_bytree = .5, subsample = .5))
vec.plot(xgb.model,as.matrix(Data[,1:2]),1:2,col=gradientByTarget) #bagged mix of trees
plot(Data$y,predict(xgb.model,as.matrix(Data[,1:2]))) #terrible fit!!
#problem, colsample_bytree is NOT mtry as columns are only sampled once
# (this could be raised as an issue on their github page, that this does not mimic RF)
#200 deep trees (bagging), no column limitation
xgb.model = xgboost(data = as.matrix(Data[,1:2]), label = Data$y,
                    nrounds = 1, params = list(max.depth = 500,
                    num_parallel_tree = 200, colsample_bytree = 1, subsample = .5))
vec.plot(xgb.model, as.matrix(Data[,1:2]), 1:2, col = gradientByTarget) #bagged mix of trees
plot(Data$y, predict(xgb.model, as.matrix(Data[,1:2])))
#voila, the model can fit the data

R Neural Net Issues

Here is the updated code. My issue is with the output of "results", which I'll post below for readability.
library("neuralnet")
library("ggplot2")
setwd("C:/Users/Aaron/Documents/UMUC/R/Data For Assignments")
trainset <- read.csv("SOTS.csv")
head(trainset)
## val data classification
str(trainset)
## building the neural network
risknet <- neuralnet(Overall.Risk.Value ~ Finance + Personnel + Information.Dissemenation.C, trainset, hidden = 10, lifesign = "minimal", linear.output = FALSE, threshold = 0.1)
##plot nn
plot(risknet, rep="best")
##import scoring set
score_set <- read.csv("SOSS.csv")
##select subsets-training and scoring match
score_test <- subset(score_set, select = c("Finance", "Personnel", "Information.Dissemenation.C"))
##display values of score_test
head(score_test)
##neural network compute function score_test and the neural net "risknet"
risknet.results <- compute(risknet, score_test)
##Actual value of Overall.Risk.Value variable wanting to predict. net.result = a matrix containing the overall result of the neural network
results <- data.frame(Actual = score_set$Overall.Risk.Value, Prediction = risknet.results$net.result)
results[1:14, ]
The output of results is not as expected. For instance, the actual data is a number between 5 and 8, whereas "Prediction" displays outputs like 0.9995... for each result.
Thanks again for the help.
This is how you train and predict:
Use training data to learn model parameters (the variable risknet in your case)
Use parameters to predict scores on test data
Here is an example very much similar to yours that explains how this is done.
The default activation function in neuralnet is "logistic". When linear.output is set to FALSE, the output is mapped by the activation function onto the interval [0, 1] (R Journal, neuralnet, Frauke Günther).
Simply setting linear.output = TRUE in your code makes the final result look much better.
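A minimal sketch of the changed call, reusing trainset, score_test and score_set from the question (scaling the predictors may also help convergence):
## linear.output = TRUE keeps the output neuron linear, so predictions stay on the
## original scale of Overall.Risk.Value instead of being squashed into [0, 1].
risknet <- neuralnet(Overall.Risk.Value ~ Finance + Personnel + Information.Dissemenation.C,
                     data = trainset, hidden = 10, lifesign = "minimal",
                     linear.output = TRUE, threshold = 0.1)
risknet.results <- compute(risknet, score_test)
head(data.frame(Actual = score_set$Overall.Risk.Value,
                Prediction = risknet.results$net.result))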
Thanks for the help!

Using R and Weka: how can I use meta-algorithms along with an n-fold evaluation method?

Here is an example of my problem
library(RWeka)
iris <- read.arff("iris.arff")
Perform n-fold cross-validation to obtain the proper accuracy of the classifier:
m<-J48(class~., data=iris)
e<-evaluate_Weka_classifier(m,numFolds = 5)
summary(e)
The results provided here are obtained by building the model with one part of the dataset and testing it with another part, and therefore give a realistic estimate of accuracy.
Now I perform AdaBoost to optimize the parameters of the classifier:
m2 <- AdaBoostM1(class ~. , data = temp ,control = Weka_control(W = list(J48, M = 30)))
summary(m2)
The results here are obtained by evaluating the model on the same dataset that was used to build it, so the accuracy is not representative of real-life performance, where the model is evaluated on unseen instances. Nevertheless, this procedure is helpful for optimizing the model that is built.
The main problem is that I cannot optimize the model and, at the same time, test it with data that was not used to build it, or simply use an n-fold validation method to obtain the proper accuracy.
I think you are misinterpreting what evaluate_Weka_classifier does. In both cases, evaluate_Weka_classifier only performs cross-validation based on the training data; it does not change the model itself. Compare the confusion matrices of the following code:
m<-J48(Species~., data=iris)
e<-evaluate_Weka_classifier(m,numFolds = 5)
summary(m)
e
m2 <- AdaBoostM1(Species ~. , data = iris ,
control = Weka_control(W = list(J48, M = 30)))
e2 <- evaluate_Weka_classifier(m2,numFolds = 5)
summary(m2)
e2
In both cases, summary() gives you the evaluation based on the training data, while evaluate_Weka_classifier() gives you the correct cross-validation. Neither for J48 nor for AdaBoostM1 does the model itself get updated based on the cross-validation.
Now regarding the AdaBoost algorithm itself: it does use some kind of "weighted cross-validation" to arrive at the final classifier. Wrongly classified items are given more weight in the next building step, but the evaluation is done using equal weight for all observations. So using cross-validation to optimize the result doesn't really fit the general idea behind the adaptive boosting algorithm.
If you want a true cross-validation using a training set and an evaluation set, you could do the following:
id <- sample(1:length(iris$Species),length(iris$Species)*0.5)
m3 <- AdaBoostM1(Species ~. , data = iris[id,] ,
control = Weka_control(W = list(J48, M=5)))
e3 <- evaluate_Weka_classifier(m3,numFolds = 5)
# true crossvalidation
e4 <- evaluate_Weka_classifier(m3,newdata=iris[-id,])
summary(m3)
e3
e4
If you want a model that gets updated based on cross-validation, you'll have to turn to a different algorithm, e.g. randomForest() from the randomForest package, which builds a collection of trees and assesses them internally on held-out (out-of-bag) samples. It can be used in combination with the RWeka package as well.
Edit: corrected the code for a true cross-validation. The subset argument has an effect in evaluate_Weka_classifier() as well.
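As a sketch of that randomForest() alternative (strictly speaking it relies on bagging and out-of-bag error rather than cross-validation; id is the 50% sample drawn above):
library(randomForest)
rf <- randomForest(Species ~ ., data = iris[id, ], ntree = 500)
rf$err.rate[rf$ntree, "OOB"]                          # out-of-bag error of the final forest
mean(predict(rf, iris[-id, ]) != iris$Species[-id])   # error on the held-out half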
