R Neural Net Issues - r

Here is the updated code. My issue is with the output of "results". I'll post below as the format for readability.
library("neuralnet")
library("ggplot2")
setwd("C:/Users/Aaron/Documents/UMUC/R/Data For Assignments")
trainset <- read.csv("SOTS.csv")
head(trainset)
## val data classification
str(trainset)
## building the neural network
risknet <- neuralnet(Overall.Risk.Value ~ Finance + Personnel + Information.Dissemenation.C, trainset, hidden = 10, lifesign = "minimal", linear.output = FALSE, threshold = 0.1)
##plot nn
plot(risknet, rep="best")
##import scoring set
score_set <- read.csv("SOSS.csv")
##select subsets-training and scoring match
score_test <- subset(score_set, select = c("Finance", "Personnel", "Information.Dissemenation.C"))
##display values of score_test
head(score_test)
##neural network compute function score_test and the neural net "risknet"
risknet.results <- compute(risknet, score_test)
##Actual value of Overall.Risk.Value variable wanting to predict. net.result = a matrix containing the overall result of the neural network
results <- data.frame(Actual = score_set$Overall.Risk.Value, Prediction = risknet.results$net.result)
results[1:14, ]
The output of results is not as expected. For instance, the actual data is a number between 5 and 8, whereas "Prediction" displays outputs of .9995...for each result.
Thanks again for the help.

This is how you train and predict:
Use training data to learn model parameters (the variable risknet in your case)
Use parameters to predict scores on test data
Here is an example very much similar to yours that explains how this is done.

The default activation function in neuralnet is "logistic". When linear.output is set as FALSE, it ensures that the output is mapped by the activation function to the interval [0,1].(R_Journal (neuralnet)- Frauke Günther)
I just updated linear.output=TRUE in your code and final result looks much better.
Thanks for the help!

Related

Output is lagging when trying to get lambda and alpha values after running Elastic-Net Regression Model

I am new to R and Elastic-Net Regression Model. I am running Elastic-Net Regression Model on the default dataset, titanic. I am trying to obtain the Alpha and Lambda values after running the train function. However when I run the train function, the output keeps on lagging and I had to wait for the output but there is no output at all. it is empty.... I am trying Tuning Parameters.
data(Titanic)
example<- as.data.frame(Titanic)
example['Country'] <- NA
countryunique <- array(c("Africa","USA","Japan","Australia","Sweden","UK","France"))
new_country <- c()
#Perform looping through the column, TLD
for(loopitem in example$Country)
{
#Perform random selection of an array, countryunique
loopitem <- sample(countryunique, 1)
#Load the new value to the vector
new_country<- c(new_country,loopitem)
}
#Override the Country column with new data
example$Country<- new_country
example$Class<- as.factor(example$Class)
example$Sex<- as.factor(example$Sex)
example$Age<- as.factor(example$Age)
example$Survived<- as.factor(example$Survived)
example$Country<- as.factor(example$Country)
example$Freq<- as.numeric(example$Freq)
set.seed(12345678)
trainRowNum <- createDataPartition(example$Survived, #The outcome variable
#proportion of example to form the training set
p=0.3,
#Don't store the result in a list
list=FALSE);
# Step 2: Create the training mydataset
trainData <- example[trainRowNum,]
# Step 3: Create the test mydataset
testData <- example[-trainRowNum,]
alphas <- seq(0.1,0.9,by=0.1);
lambdas <- 10^seq(-3,3,length=100)
#Logistic Elastic-Net Regression
en <- train(Survived~. ,
data = trainData,
method = "glmnet",
preProcess = NULL,
trControl = trainControl("repeatedcv",
number = 10,
repeats = 5),
tuneGrid = expand.grid(alpha = alphas,
lambda = lambdas)
)
Could you please kindly advise on what values are recommended to assign to Alpha and lambda?
Thank you
I'm not quite sure what the problem is. Your code runs fine for me. If I look at the en object it says:
Accuracy was used to select the optimal model using the
largest value.
The final values used for the model were alpha = 0.1 and lambda
= 0.1.
It didn't take long to run for me. Do you have a lot stored in your R session memory that could be slowing down your system and causing it to lag? Maybe try re-starting RStudio and running the above code from scratch.
To see the full results table with Accuracy for all combinations of Alpha and Lambda, look at en$results
As a side-note, you can easily carry out cross-validation directly in the glmnet package, using the cv.glmnet function. A helper package called glmnetUtils is also available, that lets you select the optimal Alpha and Lambda values simultaneously using the cva.glmnet function. This allows for parallelisation, so may be quicker than doing the cross-validation via caret.

R training and tuning random forest classifier with hardware challenge

I'm sorry if this is not appropriate to ask here but please forgive a noobie.
I'm training a random forest multiple class(8) classifier using R Caret on my experimental data using my desktop, 32GB RAM and a 4 core CPU. However, I'm facing constant complains from RStudio reporting it cannot allocate vector of 9GB. So I have to reduce the training set all the way to 1% of the data just to run fold CV and some grid search. As a result my mode accuracy is ~50% and the resulting features selected aren't very good at all. Only 2 out of 8 classes are being distinguished somewhat truthfully. Of course it could be that I don't have any good features. But I want to at least test train and tune my model on a decent size of training data first. What are the solutions can help? or is there anywhere I can upload my data and train somewhere? I'm so new that I don't know if something like cloud based things can help me? Pointers will be appreciated.
Edit: I have uploaded the data table and my codes so maybe it is my bad coding screwed things up.
Here is a link to the data:
https://drive.google.com/file/d/1wScYKd7J-KlRvvDxHAmG3_If1o5yUimy/view?usp=sharing
Here are my codes:
#load libraries
library(data.table)
library(caret)
library(caTools)
library(e1071)
#read the data in
df.raw <-fread("CLL_merged_sampled_same_ctrl_40percent.csv", header =TRUE,data.table = FALSE)
#get the useful data
#subset and get rid of useless labels
df.1 <- subset(df.raw, select = c(18:131))
df <- subset(df.1, select = -c(2:4))
#As I want to build a RF model to classify drug treatments
# make the treatmentsun as factors
#there should be 7 levels
df$treatmentsum <- as.factor(df$treatmentsum)
df$treatmentsum
#find nearZerovarance features
#I did not remove them. Just flagged them
nzv <- nearZeroVar(df[-1], saveMetrics= TRUE)
nzv[nzv$nzv==TRUE,]
possible.nzv.flagged <- nzv[nzv$nzv=="TRUE",]
write.csv(possible.nzv.flagged, "Near Zero Features flagged.CSV", row.names = TRUE)
#identify correlated features
df.Cor <- cor(df[-1])
highCorr <- sum(abs(df.Cor[upper.tri(df.Cor)]) > .99)
highlyCor <- findCorrelation(df.Cor, cutoff = .99,verbose = TRUE)
#Get rid off strongly correlated features
filtered.df<- df[ ,-highlyCor]
str(filtered.df)
#identify linear dependecies
linear.combo <- findLinearCombos(filtered.df[-1])
linear.combo #no linear ones detected
#splt datainto training and test
#Here is my problem, I want to use 80% of the data for training
#but in my computer, I can only use 0.002
set.seed(123)
split <- sample.split(filtered.df$treatmentsum, SplitRatio = 0.8)
training_set <- subset(filtered.df, split==TRUE)
test_set <- subset(filtered.df, split==FALSE)
#scaling numeric data
#leave the first column labels out
training_set[-1] = scale(training_set[-1])
test_set[-1] = scale(test_set[-1])
training_set[1]
#build RF
#use Cross validation for model training
#I can't use repeated CV as it fails on my machine
#I set a grid search for tuning
control <- trainControl(method="cv", number=10,verboseIter = TRUE, search = 'grid')
#default mtry below, is around 10
#mtry <- sqrt(ncol(training_set))
#I used ,mtry 1:12 to run, but I wanted to test more, limited again by machine
tunegrid <- expand.grid(.mtry = (1:20))
model <- train(training_set[,-1],as.factor(training_set[,1]), data=training_set, method = "rf", trControl = control,metric= "Accuracy", maximize = TRUE ,importance = TRUE, type="classification", ntree =800,tuneGrid = tunegrid)
print(model)
plot(model)
prediction2 <- predict(model, test_set[,-1])
cm<-confusionMatrix(prediction2, as.factor(test_set[,1]), positive = "1")

How to apply machine learning techniques / how to use model outputs

I am a plant scientist new to machine learning. I have had success writing code and following tutorials of machine learning techniques. My issue is trying to understand how to actually apply these techniques to answer real world questions. I don't really understand how to use the model outputs to answer questions.
I recently followed a tutorial creating an algorithm to detect credit card fraud. All of the models ran nicely and I understand how to build them; but, how in the world do I take this information and translate it into a definitive answer? Following the same example, lets say I wrote this code for my job how would I then take real credit card data and screen it using this algorithm? I really want to establish a link between running these models and generating a useful output from real data.
Thank you all.
In the name of being concise I will highlight some specific examples using the same data set found here:
https://drive.google.com/file/d/1CTAlmlREFRaEN3NoHHitewpqAtWS5cVQ/view
# Import
creditcard_data <- read_csv('PATH')
# Restructure
creditcard_data$Amount=scale(creditcard_data$Amount)
NewData=creditcard_data[,-c(1)]
head(NewData)
#Split
library(caTools)
set.seed(123)
data_sample = sample.split(NewData$Class,SplitRatio=0.80)
train_data = subset(NewData,data_sample==TRUE)
test_data = subset(NewData,data_sample==FALSE)
1) Decision Tree
library(rpart)
library(rpart.plot)
decisionTree_model <- rpart(Class ~ . , creditcard_data, method = 'class')
predicted_val <- predict(decisionTree_model, creditcard_data, type = 'class')
probability <- predict(decisionTree_model, creditcard_data, type = 'prob')
rpart.plot(decisionTree_model)
2) Artificial Neural Network
library(neuralnet)
ANN_model =neuralnet (Class~.,train_data,linear.output=FALSE)
plot(ANN_model)
predANN=compute(ANN_model,test_data)
resultANN=predANN$net.result
resultANN=ifelse(resultANN>0.5,1,0)
3) Gradient Boosting
library(gbm, quietly=TRUE)
# train GBM model
system.time(
model_gbm <- gbm(Class ~ .
, distribution = "bernoulli"
, data = rbind(train_data, test_data)
, n.trees = 100
, interaction.depth = 2
, n.minobsinnode = 10
, shrinkage = 0.01
, bag.fraction = 0.5
, train.fraction = nrow(train_data) / (nrow(train_data) + nrow(test_data))
)
)
# best iteration
gbm.iter = gbm.perf(model_gbm, method = "test")
model.influence = relative.influence(model_gbm, n.trees = gbm.iter, sort. = TRUE)
# plot
plot(model_gbm)
# plot
gbm_test = predict(model_gbm, newdata = test_data, n.trees = gbm.iter)
gbm_auc = roc(test_data$Class, gbm_test, plot = TRUE, col = "red")
print(gbm_auc)
You develop your model with, preferably, three data sets.
Training, Testing and Validation. (Sometimes different terminology is used.)
Here, Train and Test sets are used to develop the model.
The model you decide upon must never see any of the Validation set. This set is used to see how good your model is, in effect it would simulate real-world new data that may come to you in the future. Once you decide your model does perform to an acceptable level you can then go back to running all your data to produce the final operational model. Then any new 'live' data of interest is fed to the model and produces an output. In the case of the fraud detection it would output some probability: here you need human input to decide at what level you would flag the event as fraudulent enough to warrant further investigation.
At periodic intervals or as you data arrives or your model performance weakens (fraudsters may become more cunning!) you would repeat the whole process.

How to wrap the following code into a for loop in R?

I'm new in R and I'm currently working with RandomForest Analysis.
I need to create at least 100 replicates of a RF model, each one with different test/train data.
I would like to automate the task wrapping the code into a loop if that's possible, and save the results of every model.
Without a loop, I have to run the code every time and manually write the output.
This is my code:
#split data into 80 for training/20 for testing
obs_split <- obs_split %>%
split(if_else(runif(nrow(.)) <= 0.8, "train", "test"))
map_int(obs_split, nrow)
# grow random forest with ranger package
detection_freq <- mean(obs_split$train$species_observed)
# ranger requires a factor response to do classification
obs_split$train$species_observed <- factor(obs_split$train$species_observed)
rf <- ranger(formula = species_observed ~ .,
data = obs_split$train,
importance = "impurity",
probability = TRUE,
replace = TRUE,
sample.fraction = c(detection_freq, detection_freq))
I would appreciate any solution! Thank you

How to predict in kknn function? library(kknn)

I try to use kknn + loop to create a leave-out-one cross validation for a model, and compare that with train.kknn.
I have split the data into two parts: training (80% data), and test (20% data). In the training data, I exclude one point in the loop to manually create LOOCV.
I think something gets wrong in predict(knn.fit, data.test). I have tried to find how to predict in kknn through the kknn package instruction and online but all the examples are "summary(model)" and "table(validation...)" rather than the prediction on a separate test data. The code predict(model, dataset) works successfully in train.kknn function, so I thought I could use the similar arguments in kknn.
I am not sure if there is such a prediction function in kknn. If yes, what arguments should I give?
Look forward to your suggestion. Thank you.
library(kknn)
for (i in 1:nrow(data.train)) {
train.data <- data.train[-i,]
validation.data <- data.train[i,]
knn.fit <- kknn(as.factor(R1)~., train.data, validation.data, k = 40,
kernel = "rectangular", scale = TRUE)
# train.data + validation.data is the 80% data I split.
}
pred.knn <- predict(knn.fit, data.test) # data.test is 20% data.
Here is the error message:
Error in switch(type, raw = object$fit, prob = object$prob,
stop("invalid type for prediction")) : EXPR must be a length 1
vector
Actually I try to compare train.kknn and kknn+loop to compare the results of the leave-out-one CV. I have two more questions:
1) in kknn: is it possible to use another set of data as test data to see the knn.fit prediction?
2) in train.kknn: I split the data and use 80% of the whole data and intend to use the rest 20% for prediction. Is it an correct common practice?
2) Or should I just use the original data (the whole data set) for train.kknn, and create a loop: data[-i,] for training, data[i,] for validation in kknn? So they will be the counterparts?
I find that if I use the training data in the train.kknn function and use prediction on test data set, the best k and kernel are selected and directly used in generating the predicted value based on the test dataset.
In contrast, if I use kknn function and build a loop of different k values, the model generates the corresponding prediction results based on
the test data set each time the k value is changed. Finally, in kknn + loop, the best k is selected based on the best actual prediction accuracy rate of test data. In short, the best k train.kknn selected may not work best on test data.
Thank you.
For objects returned by kknn, predict gives the predicted value or the predicted probabilities of R1 for the single row contained in validation.data:
predict(knn.fit)
predict(knn.fit, type="prob")
The predict command also works on objects returned by train.knn.
For example:
train.kknn.fit <- train.kknn(as.factor(R1)~., data.train, ks = 10,
kernel = "rectangular", scale = TRUE)
class(train.kknn.fit)
# [1] "train.kknn" "kknn"
pred.train.kknn <- predict(train.kknn.fit, data.test)
table(pred.train.kknn, as.factor(data.test$R1))
The train.kknn command implements a leave-one-out method very close to the loop developed by #vcai01. See the following example:
set.seed(43210)
n <- 500
data.train <- data.frame(R1=rbinom(n,1,0.5), matrix(rnorm(n*10), ncol=10))
library(kknn)
pred.kknn <- array(0, nrow(data.train))
for (i in 1:nrow(data.train)) {
train.data <- data.train[-i,]
validation.data <- data.train[i,]
knn.fit <- kknn(as.factor(R1)~., train.data, validation.data, k = 40,
kernel = "rectangular", scale = TRUE)
pred.kknn[i] <- predict(knn.fit)
}
knn.fit <- train.kknn(as.factor(R1)~., data.train, ks = 40,
kernel = "rectangular", scale = TRUE)
pred.train.kknn <- predict(knn.fit, data.train)
table(pred.train.kknn, pred.kknn)
# pred.kknn
# pred.train.kknn 1 2
# 0 374 14
# 1 9 103

Resources