How to calculate probability for a CNN model in R? - r

I have built and trained a CNN model for image classification using the MXNet package, and I predicted the test results with the snippet below.
pred_test <- predict(model,test_array)
pred_test_label <- max.col(t(pred_test))-1
print(pred_test_label)
Along with this, I want to know the probability that each test result matches its predicted class. Is there any way to check this?

You can do something like this:
# Prediction of test set
preds <- predict(model, test.array)
pred.label = max.col(t(preds))-1
accuracy <- function(label, pred) {
ypred = max.col(t(as.array(pred)))
return(sum((as.array(label) + 1) == ypred) / length(label))
}
print(paste0("Finish prediction...accuracy=", accuracy(test.y, preds)))

pred_test holds one column per test sample. Sum the elements of each column to get, say, out_sum, and then divide every element of that column by out_sum. Each column then sums to one, and its entries can be taken as the probabilities of the corresponding output nodes of the CNN.
Alternatively, you can get probabilities directly by configuring the model with out_activation="softmax" at initialization, as below:
model <- mx.mlp(train.x, train.y, hidden_node=10, out_node=5, out_activation="softmax")
With this configuration, the model's outputs are bound to sum to 1, so each output node can be taken as the probability of its corresponding class.
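The column-wise normalization described above can be sketched in base R. This is only an illustration under assumptions: the matrix below is a made-up stand-in for the output of predict() (classes in rows, test samples in columns), all scores are assumed non-negative, and the names probs, pred_label and pred_prob are hypothetical.

```r
# Hypothetical raw scores: 3 classes (rows) x 4 test samples (columns).
# In practice this would be the matrix returned by predict(model, test_array).
pred_test <- matrix(c(2,   1,   1,
                      0.5, 0.5, 3,
                      1,   4,   1,
                      2,   2,   4), nrow = 3)

# Divide every column by its own sum so that each column sums to 1.
probs <- sweep(pred_test, 2, colSums(pred_test), "/")

# Predicted class (0-based, as in the question) and its probability.
pred_label <- max.col(t(probs)) - 1
pred_prob  <- apply(probs, 2, max)

print(data.frame(label = pred_label, probability = pred_prob))
```

If the network already ends in a softmax layer, the columns should sum to 1 as they are, and the normalization step leaves them unchanged.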

Related

Output is lagging when trying to get lambda and alpha values after running Elastic-Net Regression Model

I am new to R and to Elastic-Net regression. I am running an Elastic-Net regression model on the built-in Titanic dataset and trying to obtain the alpha and lambda values after running the train function. However, when I run train, it lags: I wait for the output, but nothing is printed at all. I am trying to tune the parameters.
library(caret) # provides createDataPartition(), trainControl() and train()
data(Titanic)
example<- as.data.frame(Titanic)
example['Country'] <- NA
countryunique <- array(c("Africa","USA","Japan","Australia","Sweden","UK","France"))
new_country <- c()
#Perform looping through the column, TLD
for(loopitem in example$Country)
{
#Perform random selection of an array, countryunique
loopitem <- sample(countryunique, 1)
#Load the new value to the vector
new_country<- c(new_country,loopitem)
}
#Override the Country column with new data
example$Country<- new_country
example$Class<- as.factor(example$Class)
example$Sex<- as.factor(example$Sex)
example$Age<- as.factor(example$Age)
example$Survived<- as.factor(example$Survived)
example$Country<- as.factor(example$Country)
example$Freq<- as.numeric(example$Freq)
set.seed(12345678)
trainRowNum <- createDataPartition(example$Survived, #The outcome variable
#proportion of example to form the training set
p=0.3,
#Don't store the result in a list
list=FALSE);
# Step 2: Create the training mydataset
trainData <- example[trainRowNum,]
# Step 3: Create the test mydataset
testData <- example[-trainRowNum,]
alphas <- seq(0.1,0.9,by=0.1);
lambdas <- 10^seq(-3,3,length=100)
#Logistic Elastic-Net Regression
en <- train(Survived~. ,
data = trainData,
method = "glmnet",
preProcess = NULL,
trControl = trainControl("repeatedcv",
number = 10,
repeats = 5),
tuneGrid = expand.grid(alpha = alphas,
lambda = lambdas)
)
Could you please kindly advise on what values are recommended to assign to Alpha and lambda?
Thank you
I'm not quite sure what the problem is. Your code runs fine for me. If I look at the en object it says:
Accuracy was used to select the optimal model using the
largest value.
The final values used for the model were alpha = 0.1 and lambda
= 0.1.
It didn't take long to run for me. Do you have a lot stored in your R session memory that could be slowing down your system and causing it to lag? Maybe try re-starting RStudio and running the above code from scratch.
To see the full results table with Accuracy for all combinations of Alpha and Lambda, look at en$results
As a side-note, you can easily carry out cross-validation directly in the glmnet package, using the cv.glmnet function. A helper package called glmnetUtils is also available, that lets you select the optimal Alpha and Lambda values simultaneously using the cva.glmnet function. This allows for parallelisation, so may be quicker than doing the cross-validation via caret.
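The cv.glmnet route mentioned above can be sketched roughly as follows. This is a minimal sketch, not the asker's setup: it uses two classes of the built-in iris data instead of the Titanic table, and the names dat, foldid, cv_one and cv_grid are made up for illustration. Fixing foldid keeps the folds identical across alpha values, so the cross-validated errors are comparable.

```r
library(glmnet)

set.seed(1)
# Two-class problem from a built-in dataset (stand-in for the real data).
dat <- droplevels(iris[iris$Species != "setosa", ])
x <- as.matrix(dat[, 1:4])
y <- dat$Species

# Same fold assignment for every alpha.
foldid <- sample(rep(1:5, length.out = nrow(x)))
alphas <- seq(0.1, 0.9, by = 0.2)

# Cross-validate lambda for one alpha and keep the best lambda and its error.
cv_one <- function(a) {
  fit <- cv.glmnet(x, y, family = "binomial", alpha = a,
                   foldid = foldid, type.measure = "class")
  c(alpha = a, lambda = fit$lambda.min, error = min(fit$cvm))
}

cv_grid <- as.data.frame(t(sapply(alphas, cv_one)))
best <- cv_grid[which.min(cv_grid$error), ]
print(best)
```

With caret, the corresponding selected pair is stored in en$bestTune once train() has finished.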

R - Determine goodness of fit of new data with predict function based on existing lm

I am trying to apply an existing model to a new data set. I try to explain it with an example. I am wondering what an elegant way to determine the goodness of the fit would look like.
Basically, I run a regression and obtain a model. With the summary function I obtain the usual output such as adjusted R-squared, p-value etc.
model.lm <- lm(Sepal.Length ~ Petal.Length, data = iris[1:75,])
summary(model.lm)
Now I want to run the predict function on new data and I am curious to know how the model performs on the new data.
pred.dat <- predict(model.lm, newdata = iris[76:150,])
I wanted to ask how I can, for instance, get an adjusted R-squared for the predicted values with the new data. Is there something similar to the summary function? Ideally, I would like to find out what the best practice is for obtaining the goodness of fit of an existing model based on new data.
Many thanks
You can translate the formula of R-squared into a function, such as:
r_squared <- function(vals, preds) {
1 - (sum((vals - preds)^2) / sum((vals - mean(preds))^2))
}
# Test
> r_squared(iris[76:150,]$Sepal.Length, pred.dat)
#[1] 0.5675686
Building upon this function, and using the correct formula we can also define adjusted R-squared as:
r_squared_a <- function(vals, preds, k) {
1 - ((1-r_squared(vals, preds))*(length(preds)-1))/(length(preds) - k - 1)
}
Where k is the number of predictors, thus:
> r_squared_a(iris[76:150,]$Sepal.Length, pred.dat, 1)
#[1] 0.5616448
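Note that the textbook definition of R-squared measures the total sum of squares around the mean of the observed values, mean(vals), whereas the function above centres on mean(preds), so the two can give different numbers on new data. A minimal sketch of the textbook form is below; the name r_squared_obs is hypothetical, and which baseline is appropriate for out-of-sample data is a judgment call.

```r
# R-squared with the total sum of squares taken around the observed mean.
r_squared_obs <- function(vals, preds) {
  ss_res <- sum((vals - preds)^2)        # residual sum of squares
  ss_tot <- sum((vals - mean(vals))^2)   # total sum of squares
  1 - ss_res / ss_tot
}

# Same model and hold-out split as in the question.
model.lm <- lm(Sepal.Length ~ Petal.Length, data = iris[1:75, ])
pred.dat <- predict(model.lm, newdata = iris[76:150, ])

print(r_squared_obs(iris[76:150, ]$Sepal.Length, pred.dat))
```

On new data this quantity can be negative, which simply means the model predicts worse than the mean of the new observations.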

Predictive model decision tree

I want to build a predictive model using decision tree classification in R. I used this code:
library(rpart)
library(caret)
DataYesNo <- read.csv('DataYesNo.csv', header=T)
summary(DataYesNo)
worktrain <- sample(1:50, 40)
worktest <- setdiff(1:50, worktrain)
DataYesNo[worktrain,]
DataYesNo[worktest,]
M <- ncol(DataYesNo)
input <- names(DataYesNo)[1:(M-1)]
target <- "YesNo"
tree <- rpart(YesNo~Var1+Var2+Var3+Var4+Var5,
data=DataYesNo[worktrain, c(input,target)],
method="class",
parms=list(split="information"),
control=rpart.control(usesurrogate=0, maxsurrogate=0))
summary(tree)
plot(tree)
text(tree)
I got just one root (Var3) and two leaves (yes, no). I'm not sure about this result. How can I get the confusion matrix, accuracy, sensitivity, and specificity?
Can I get them with the caret package?
If you use your model to make predictions on your test set, you can use confusionMatrix() to get the measures you're looking for.
Something like this...
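# Predict class labels (not probabilities) for the held-out rows
predictions <- predict(tree, DataYesNo[worktest, ], type = "class")
# confusionMatrix() needs both arguments as factors with the same levels
cmatrix <- confusionMatrix(predictions, factor(DataYesNo$YesNo[worktest], levels = levels(predictions)))
print(cmatrix)
The printed confusionMatrix object also reports accuracy, sensitivity, and specificity.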
For your example, the confusion matrix can be obtained as follows.
fitted <- predict(tree, DataYesNo[worktest, c(input,target)], type = "class")
actual <- DataYesNo[worktest, c(target)]
confusion <- table(data.frame(fitted = fitted, actual = actual))
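From a confusion table like the one above, accuracy, sensitivity, and specificity can also be computed by hand in base R. The sketch below uses made-up label vectors in place of the real predictions (DataYesNo.csv is not available here) and assumes "yes" is the positive class.

```r
# Hypothetical actual and predicted labels for 10 test rows.
lv <- c("no", "yes")
actual <- factor(c("yes", "yes", "yes", "no", "no",
                   "no", "no", "yes", "no", "yes"), levels = lv)
fitted <- factor(c("yes", "no", "yes", "no", "no",
                   "yes", "no", "yes", "no", "yes"), levels = lv)

# Rows are predictions, columns are the true classes.
confusion <- table(fitted = fitted, actual = actual)

accuracy    <- sum(diag(confusion)) / sum(confusion)
# Sensitivity: share of true "yes" cases predicted as "yes".
sensitivity <- confusion["yes", "yes"] / sum(confusion[, "yes"])
# Specificity: share of true "no" cases predicted as "no".
specificity <- confusion["no", "no"] / sum(confusion[, "no"])

print(confusion)
print(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity))
```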

Plot in SVM model (e1071 Package) using DocumentTermMatrix

I am trying to create a plot for a model built with SVM in the e1071 package.
My code to build the model, predict, and build the confusion matrix is:
ptm <- proc.time()
svm.classifier = svm(x = train.set.list[[0.999]][["0_0.1"]],
y = train.factor.list[[0.999]][["0_0.1"]],
kernel ="linear")
pred = predict(svm.classifier, test.set.list[[0.999]][["0_0.1"]], decision.values = TRUE)
time[["svm"]] = proc.time() - ptm
confmatrix = confusionMatrix(pred,test.factor.list[[0.999]][["0_0.1"]])
confmatrix
train.set.list and test.set.list contain the train and test sets for several conditions. train.factor.list and test.factor.list hold the true labels for each set. The train and test sets are both DocumentTermMatrix objects.
Then I tried to plot my data with
plot(svm.classifier, train.set.list[[0.999]][["0_0.1"]])
but i got the message:
"Error in plot.svm(svm.classifier, train.set.list[[0.999]][["0_0.1"]]) :
missing formula."
What am I doing wrong? The confusion matrix looks good to me even without using the formula parameter in the svm function.
Without runnable code, it's hard to say exactly what the problem is. My guess comes from
?plot.svm
which says
formula: formula selecting the visualized two dimensions. Only needed if more than two input variables are used.
So your data probably has more than two predictors. You should specify a formula in your plot call:
plot(svm.classifier, train.set.list[[0.999]][["0_0.1"]], predictor1 ~ predictor2)
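As an illustration of the formula argument, here is a small sketch on the built-in iris data, which stands in for the document-term data (an assumption, since the original data is not available). plot.svm works on a data frame, so a DocumentTermMatrix would likely need converting first, for example with as.data.frame(as.matrix(...)). The slice values below are arbitrary.

```r
library(e1071)

# Fit on a data frame with four predictors.
fit <- svm(Species ~ ., data = iris, kernel = "linear")

# With more than two predictors, the formula picks the two plotted dimensions;
# slice holds the remaining predictors at fixed values.
plot(fit, iris, Petal.Width ~ Petal.Length,
     slice = list(Sepal.Width = 3, Sepal.Length = 4))
```

With thousands of term columns, a 2D slice of this kind may not be very informative, so the choice of the two plotted terms matters.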

R random forest - training set using target column for prediction

I am learning how to use various random forest packages and coded up the following from example code:
library(party)
library(randomForest)
set.seed(415)
#I'll try to reproduce this with a public data set; in the mean time here's the existing code
data = read.csv(data_location, sep = ',')
test = data[1:65] #basically data w/o the "answers"
m = sample(1:(nrow(data)),nrow(data)/2,replace=FALSE)
o = sample(1:(nrow(data)),nrow(data)/2,replace=FALSE)
train2 = data[m,]
train3 = data[o,]
#random forest implementation
fit.rf <- randomForest(train2[,66] ~., data=train2, importance=TRUE, ntree=10000)
Prediction.rf <- predict(fit.rf, test) #to see if the predictions are accurate -- but it errors out unless I give it all data[1:66]
#cforest implementation
fit.cf <- cforest(train3[,66]~., data=train3, controls=cforest_unbiased(ntree=10000, mtry=10))
Prediction.cf <- predict(fit.cf, test, OOB=TRUE) #to see if the predictions are accurate -- but it errors out unless I give it all data[1:66]
Data[,66] is the target factor I'm trying to predict, but it seems that using "~ ." to solve for it causes the formula to use the factor in the prediction model itself.
How do I solve for the dimension I want on high-ish dimensionality data, without having to spell out exactly which dimensions to use in the formula (so I don't end up with some sort of cforest(data[,66] ~ data[,1] + data[,2] + data[,3] ... etc.)?
EDIT:
On a high level, I believe one basically
loads full data
breaks it down to several subsets to prevent overfitting
trains via subset data
generates a fitting formula so one can predict values of target (in my case data[,66]) given data[1:65].
So my PROBLEM is now: if I give it a new set of test data, let's say test = data[1:65], it says "Error in eval(expr, envir, enclos) :" where it is expecting data[,66]. I want to basically predict data[,66] given the rest of the data!
I think that if the response is in train3 then it will be used as a feature.
I believe this is more like what you want:
crtl <- cforest_unbiased(ntree=1000, mtry=3)
mod <- cforest(iris[,5] ~ ., data = iris[,-5], controls=crtl)
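Another common way around this is to name the response column in the formula instead of indexing it, so that "." expands to every other column and prediction no longer needs the target in the new data. A minimal sketch with randomForest on iris (standing in for the 66-column data; train_idx, train and test are illustrative names):

```r
library(randomForest)

set.seed(415)
train_idx <- sample(nrow(iris), nrow(iris) / 2)
train <- iris[train_idx, ]
# Test data without the target column.
test <- iris[-train_idx, names(iris) != "Species"]

# Naming the response lets "." mean "all other columns in data".
fit.rf <- randomForest(Species ~ ., data = train, importance = TRUE, ntree = 500)

# Prediction works on data that lacks the response column.
pred.rf <- predict(fit.rf, newdata = test)
print(table(pred.rf))
```

For the data in the question, the same idea would mean using the actual name of column 66, e.g. names(data)[66], on the left-hand side of the formula.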
