Output is lagging when trying to get lambda and alpha values after running Elastic-Net Regression Model - r

I am new to R and Elastic-Net Regression Model. I am running Elastic-Net Regression Model on the default dataset, titanic. I am trying to obtain the Alpha and Lambda values after running the train function. However when I run the train function, the output keeps on lagging and I had to wait for the output but there is no output at all. it is empty.... I am trying Tuning Parameters.
data(Titanic)
example<- as.data.frame(Titanic)
example['Country'] <- NA
countryunique <- array(c("Africa","USA","Japan","Australia","Sweden","UK","France"))
new_country <- c()
#Perform looping through the column, TLD
for(loopitem in example$Country)
{
#Perform random selection of an array, countryunique
loopitem <- sample(countryunique, 1)
#Load the new value to the vector
new_country<- c(new_country,loopitem)
}
#Override the Country column with new data
example$Country<- new_country
example$Class<- as.factor(example$Class)
example$Sex<- as.factor(example$Sex)
example$Age<- as.factor(example$Age)
example$Survived<- as.factor(example$Survived)
example$Country<- as.factor(example$Country)
example$Freq<- as.numeric(example$Freq)
set.seed(12345678)
trainRowNum <- createDataPartition(example$Survived, #The outcome variable
#proportion of example to form the training set
p=0.3,
#Don't store the result in a list
list=FALSE);
# Step 2: Create the training mydataset
trainData <- example[trainRowNum,]
# Step 3: Create the test mydataset
testData <- example[-trainRowNum,]
alphas <- seq(0.1,0.9,by=0.1);
lambdas <- 10^seq(-3,3,length=100)
#Logistic Elastic-Net Regression
en <- train(Survived~. ,
data = trainData,
method = "glmnet",
preProcess = NULL,
trControl = trainControl("repeatedcv",
number = 10,
repeats = 5),
tuneGrid = expand.grid(alpha = alphas,
lambda = lambdas)
)
Could you please kindly advise on what values are recommended to assign to Alpha and lambda?
Thank you

I'm not quite sure what the problem is. Your code runs fine for me. If I look at the en object it says:
Accuracy was used to select the optimal model using the
largest value.
The final values used for the model were alpha = 0.1 and lambda
= 0.1.
It didn't take long to run for me. Do you have a lot stored in your R session memory that could be slowing down your system and causing it to lag? Maybe try re-starting RStudio and running the above code from scratch.
To see the full results table with Accuracy for all combinations of Alpha and Lambda, look at en$results
As a side-note, you can easily carry out cross-validation directly in the glmnet package, using the cv.glmnet function. A helper package called glmnetUtils is also available, that lets you select the optimal Alpha and Lambda values simultaneously using the cva.glmnet function. This allows for parallelisation, so may be quicker than doing the cross-validation via caret.

Related

How to predict in kknn function? library(kknn)

I try to use kknn + loop to create a leave-out-one cross validation for a model, and compare that with train.kknn.
I have split the data into two parts: training (80% data), and test (20% data). In the training data, I exclude one point in the loop to manually create LOOCV.
I think something gets wrong in predict(knn.fit, data.test). I have tried to find how to predict in kknn through the kknn package instruction and online but all the examples are "summary(model)" and "table(validation...)" rather than the prediction on a separate test data. The code predict(model, dataset) works successfully in train.kknn function, so I thought I could use the similar arguments in kknn.
I am not sure if there is such a prediction function in kknn. If yes, what arguments should I give?
Look forward to your suggestion. Thank you.
library(kknn)
for (i in 1:nrow(data.train)) {
train.data <- data.train[-i,]
validation.data <- data.train[i,]
knn.fit <- kknn(as.factor(R1)~., train.data, validation.data, k = 40,
kernel = "rectangular", scale = TRUE)
# train.data + validation.data is the 80% data I split.
}
pred.knn <- predict(knn.fit, data.test) # data.test is 20% data.
Here is the error message:
Error in switch(type, raw = object$fit, prob = object$prob,
stop("invalid type for prediction")) : EXPR must be a length 1
vector
Actually I try to compare train.kknn and kknn+loop to compare the results of the leave-out-one CV. I have two more questions:
1) in kknn: is it possible to use another set of data as test data to see the knn.fit prediction?
2) in train.kknn: I split the data and use 80% of the whole data and intend to use the rest 20% for prediction. Is it an correct common practice?
2) Or should I just use the original data (the whole data set) for train.kknn, and create a loop: data[-i,] for training, data[i,] for validation in kknn? So they will be the counterparts?
I find that if I use the training data in the train.kknn function and use prediction on test data set, the best k and kernel are selected and directly used in generating the predicted value based on the test dataset.
In contrast, if I use kknn function and build a loop of different k values, the model generates the corresponding prediction results based on
the test data set each time the k value is changed. Finally, in kknn + loop, the best k is selected based on the best actual prediction accuracy rate of test data. In short, the best k train.kknn selected may not work best on test data.
Thank you.
For objects returned by kknn, predict gives the predicted value or the predicted probabilities of R1 for the single row contained in validation.data:
predict(knn.fit)
predict(knn.fit, type="prob")
The predict command also works on objects returned by train.knn.
For example:
train.kknn.fit <- train.kknn(as.factor(R1)~., data.train, ks = 10,
kernel = "rectangular", scale = TRUE)
class(train.kknn.fit)
# [1] "train.kknn" "kknn"
pred.train.kknn <- predict(train.kknn.fit, data.test)
table(pred.train.kknn, as.factor(data.test$R1))
The train.kknn command implements a leave-one-out method very close to the loop developed by #vcai01. See the following example:
set.seed(43210)
n <- 500
data.train <- data.frame(R1=rbinom(n,1,0.5), matrix(rnorm(n*10), ncol=10))
library(kknn)
pred.kknn <- array(0, nrow(data.train))
for (i in 1:nrow(data.train)) {
train.data <- data.train[-i,]
validation.data <- data.train[i,]
knn.fit <- kknn(as.factor(R1)~., train.data, validation.data, k = 40,
kernel = "rectangular", scale = TRUE)
pred.kknn[i] <- predict(knn.fit)
}
knn.fit <- train.kknn(as.factor(R1)~., data.train, ks = 40,
kernel = "rectangular", scale = TRUE)
pred.train.kknn <- predict(knn.fit, data.train)
table(pred.train.kknn, pred.kknn)
# pred.kknn
# pred.train.kknn 1 2
# 0 374 14
# 1 9 103

Using the caret::train package for calculating prediction error (MdAE) of glmms with beta-binomial errors

The question is more or less as the title indicates. I would like to use the caret::train function with beta-binomial models made with glmmTMB package (although I am not opposed to other functions capable of fitting beta-binomial models) to calculate median absolute error (MdAE) estimates through jack-knife (leave-one-out) cross-validation. The glmmTMBControl function is already capable of estimating the optimal dispersion parameter but I was hoping to retain this information somehow as well... or having caret do the calculation possibly?
The dataset I am working with looks like this:
df <- data.frame(Effect = rep(seq(from = 0.05, to = 1, by = 0.05), each = 5), Time = rep(seq(1:20), each = 5))
Ideally I would be able to pass the glmmTMB function to trainControl like so:
BB.glmm1 <- train(Time ~ Effect,
data = df, method = "glmmTMB",
method = "", metric = "MAD")
The output would be as per the examples contained in train, although possibly with estimates for the dispersion parameter.
Although I am in no way opposed to work arounds - Thank you in advance!
I am unsure how to perform the required operation with caret without creating a custom method but I trust it is fairly easy to implement it with a for (lapply) loop.
In the example I will use the sleepstudy data set since your example data throws a bunch of warnings.
library(glmmTMB)
to perform LOOCV - for every row, create a model without that row and predict on that row:
data(sleepstudy,package="lme4")
LOOCV <- lapply(1:nrow(sleepstudy), function(x){
m1 <- glmmTMB(Reaction ~ Days + (Days|Subject),
data = sleepstudy[-x,])
return(predict(m1, sleepstudy[x,], type = "response"))
})
get the median of the residuals (I think this is MdAE? if not post a comment on how its calculated):
median(abs(unlist(LOOCV) - sleepstudy$Reaction))

R e1071 SVM leave one out cross validation function result differ from manual LOOCV

I'm using e1071 svm function to classify my data.
I tried two different ways to LOOCV.
First one is like that,
svm.model <- svm(mem ~ ., data, kernel = "sigmoid", cost = 7, gamma = 0.009, cross = subSize)
svm.pred = data$mem
svm.pred[which(svm.model$accuracies==0 & svm.pred=='good')]=NA
svm.pred[which(svm.model$accuracies==0 & svm.pred=='bad')]='good'
svm.pred[is.na(svm.pred)]='bad'
conMAT <- table(pred = svm.pred, true = data$mem)
summary(svm.model)
I typed cross='subject number' to make LOOCV, but the result of classification is different from my manual version of LOOCV, which is like...
for (i in 1:subSize){
data_Tst <- data[i,1:dSize]
data_Trn <- data[-i,1:dSize]
svm.model1 <- svm(mem ~ ., data = data_Trn, kernel = "linear", cost = 2, gamma = 0.02)
svm.pred1 <- predict(svm.model1, data_Tst[,-dSize])
conMAT <- table(pred = svm.pred1, true = data_Tst[,dSize])
CMAT <- CMAT + conMAT
CORR[i] <- sum(diag(conMAT))
}
In my opinion, through LOOCV, accuracy should not vary across many runs of code because SVM makes model with all the data except one and does it until the end of the loop. However, with the svm function with argument 'cross' input, the accuracy differs across every runs of code.
Which way is more accurate? Thanks for read this post! :-)
You are using different hyper-parameters (cost, gamma) and different kernels (linear, sigmoid). If you want identical results, then these should be the same each run.
Also, it depends how Leave One Out (LOO) is implemented:
Does your LOO method leave one out randomly or as a sliding window over the dataset?
Does your LOO method leave one out from one class at a time or both classes at the same time?
Is the training set always the same, or are you using a randomisation procedure before splitting between a training and testing set (assuming you have a separate independent testing set)? In which case, the examples you are cross-validating would change each run.

leave-one-out cross validation with knn in R

I have defined my training and test sets as follows:
colon_samp <-sample(62,40)
colon_train <- colon_data[colon_samp,]
colon_test <- colon_data[-colon_samp,]
And the KNN function:
knn_colon <- knn(train = colon_train[1:12533], test = colon_test[1:12533], cl = colon_train$class, k=2)
Here is my LOOCV loop for KNN:
newColon_train <- data.frame(colon_train, id=1:nrow(colon_train))
id <- unique(newColon_train$id)
loo_colonKNN <- NULL
for(i in id){
knn_colon <- knn(train = newColon_train[newColon_train$id!=i,], test = newColon_train[newColon_train$id==i,],cl = newColon_train[newColon_train$id!=i,]$Y)
loo_colonKNN[[i]] <- knn_colon
}
print(loo_colonKNN)
When I print loo_colonKNNit gives me 40 predictions (i.e. the 40 train set predictions), however, I would like it to give me the 62 predictions (all of my n samples in the original dataset). How might I go about doing this?
Thank you.
You would simply call the knn function again, using a different test parameter:
[...]
knn_colon2 <- knn(train = newColon_train[newColon_train$id!=i,],
test = newColon_test[newColon_test$id==i,],
cl = newColon_train[newColon_train$id!=i,]$Y)
This is caused by KNN being an non-parametric, instance based model: the data itself is the model, hence "training" is just holding the data for "later" prediction and does not require any computationally intensive model fitting procedure. Consequently it is unproblematic to call the training procedure multiple times to apply it to multiple test sets.
But be aware that the idea of CV is to only evaluate on the left partition each time, so looking at all samples is probably not what you want to do. And, instead of coding this yourself, you might be better off using e.g. the knn.cv function or the caret framework instead, which provides APIs for partitioning, resampling, etc. all in one, therefore is pretty convenient in such tasks.

Plot in SVM model (e1071 Package) using DocumentTermMatrix

i trying do create a plot for my model create using SVM in e1071 package.
my code to build the model, predict and build confusion matrix is
ptm <- proc.time()
svm.classifier = svm(x = train.set.list[[0.999]][["0_0.1"]],
y = train.factor.list[[0.999]][["0_0.1"]],
kernel ="linear")
pred = predict(svm.classifier, test.set.list[[0.999]][["0_0.1"]], decision.values = TRUE)
time[["svm"]] = proc.time() - ptm
confmatrix = confusionMatrix(pred,test.factor.list[[0.999]][["0_0.1"]])
confmatrix
train.set.list and test.set.list contains the test and train set for several conditions. train and set factor has the true label for each set. Train.set and test.set are both documenttermmatrix.
Then i tried to see a plot of my data, i tried with
plot(svm.classifier, train.set.list[[0.999]][["0_0.1"]])
but i got the message:
"Error in plot.svm(svm.classifier, train.set.list[[0.999]][["0_0.1"]]) :
missing formula."
what i'm doing wrong? confusion matrix seems good to me even not using formula parameter in svm function
Without given code to run, it's hard to say exactly what the problem is. My guess, given
?plot.svm
which says
formula formula selecting the visualized two dimensions. Only needed if more than two input variables are used.
is that your data has more than two predictors. You should specify in your plot function:
plot(svm.classifier, train.set.list[[0.999]][["0_0.1"]], predictor1 ~ predictor2)

Resources