R: Cross-validation of a lasso penalty parameter and comparison to cv.glmnet

I have two questions about cross-validation. To build intuition, I tried to write the cross-validation code myself; in particular, I want to find the optimal penalty parameter lambda for the lasso.
Here is my code for the CV:
CV_lambda <- matrix(0, length(lambda_grid), 2)
colnames(CV_lambda) <- c("lambda", "CV-est")
# Loop over the vector of lambda values
for (l in 1:length(lambda_grid)) {
  # Number of folds
  nfolds <- 5
  # Fold IDs (note: re-drawn for every lambda)
  foldid <- sample(rep(seq(nfolds), length = N))
  # Output storage
  cv_out <- as.list(seq(nfolds))
  mspe_out <- matrix(0, 1, nfolds)
  # Cross-validation
  for (i in seq(nfolds)) {
    y_train <- y[!foldid == i, ]
    X_train <- X[!foldid == i, ]
    # lasso() is a self-written function estimating a lasso regression
    cv_out[[i]] <- lasso(y_train, X_train, lambda = lambda_grid[l])
    pred <- X[foldid == i, ] %*% cv_out[[i]]
    mspe_out[i] <- mean((y[foldid == i, ] - pred)^2)
  }
  CV_lambda[l, ] <- c(lambda_grid[l], mean(mspe_out))
}
lambda_opt <- CV_lambda[which.min(CV_lambda[,2]),1]
I have two questions about this CV. I have been sitting here for hours trying to find a solution to both, and I would be very grateful if some of you could help me. The code is not very polished, since I am a beginner. My questions are the following:
For general understanding: lasso regression requires a penalty parameter, which has a strong influence on the results. The goal is to choose the lambda that minimizes the mean squared prediction error.
1) Should the split into folds be the same for every lambda in the sequence, or should it differ (as in my code, where foldid is re-drawn inside the lambda loop)? I am a bit skeptical about this.
2) Since I am very uncertain about my code, I want to compare my results to glmnet. Is there a way to do this? cv.glmnet also draws a random fold assignment, so a direct comparison is not possible. Does someone have an idea how to reproduce the cv.glmnet results with my code, in order to find mistakes in it?
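To make the two approaches directly comparable, one option (a sketch, assuming the glmnet package and the same y, X, N and lambda_grid as above) is to draw the fold assignment once, hand it to cv.glmnet through its foldid argument, and reuse the same vector inside the hand-written loop, so both routines see identical folds and the same lambda sequence:
library(glmnet)
set.seed(1)
foldid <- sample(rep(1:5, length.out = N))   # one fixed fold assignment, reused in both CVs
# Reference run: same folds, same lambda grid, mean squared error loss
cv_ref <- cv.glmnet(X, y, alpha = 1, lambda = lambda_grid,
                    foldid = foldid, type.measure = "mse")
# cv_ref$cvm holds the cross-validated MSE per lambda and cv_ref$lambda.min the minimizer;
# compare these against CV_lambda and lambda_opt from the loop above (small differences can
# remain because glmnet standardizes the predictors and fits an intercept by default).
cbind(cv_ref$lambda, cv_ref$cvm)
cv_ref$lambda.min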

Related

Perceptron in R not converging

I am trying to understand neural networks better, so I am implementing a simple perceptron from scratch in R. I know this is very inefficient, since many libraries do it extremely well optimized, but my goal is to understand the basics of neural networks and work my way up to more complex models.
I created some artificial test data with a very simple linear decision boundary and split it into a training set and a test set. I then ran a logistic regression on the training data, checked the predictions on the test set and got over 99% accuracy, which was to be expected given the simple nature of the data. I then tried implementing a perceptron with 2 inputs, 1 neuron, 1000 iterations, a learning rate of 0.1 and a sigmoid activation function.
I would expect accuracy very similar to the logistic regression model, but my results are a lot worse (around 70% correct classifications in the training set), so I definitely did something wrong. The predictions only seem to improve over the first couple of iterations and then just oscillate around a specific value (I tried many different learning rates, no success). I am attaching my script and I am thankful for any advice! I think the problem lies in the calculation of the error or in the weight adjustment, but I cannot put my finger on it...
### Reproducible Example for StackOverflow
#### Setup
# loading libraries
library(data.table)
#remove scientifc notation
options(scipen = 999)
# setting seed for random number generation
seed <- 123
#### Selfmade Test Data
# input points
x1 <- runif(10000,-100,100)
x2 <- runif(10000,-100,100)
# setting decision boundary to create output
output <- vector()
output[0.5*x1 + -1.2*x2 >= 50] <- 0
output[0.5*x1 + -1.2*x2 < 50] <- 1
# combining to dataframe
points <- cbind.data.frame(x1,x2,output)
# plotting all data points
plot(points$x1,points$x2, col = as.factor(points$output), main = "Self-created data", xlab = "x1",ylab = "x2")
# split into test and training sets
trainsize = 0.2
set.seed(seed)
train_rows <- sample(1:dim(points)[1], size = trainsize * dim(points)[1])
train <- points[train_rows,]
test <- points[-c(train_rows),]
# plotting training set only
plot(train$x1,train$x2, col = as.factor(train$output), main = "Self-created data (training set)", xlab = "x1",ylab = "x2")
#### Approaching the problem with logistic regression
# building model
train_logit <- glm(output ~ x1 + x2, data = train, family = "binomial", maxit = 10000)
summary(train_logit)
# testing performance in training set
table(round(train_logit$fitted.values) == train$output)
# testing performance of train_logit model in test set
table(test$output == round(predict(train_logit,test[,c(1,2)], type = "response")))
# We get 100% accuracy in the training set and near 100% accuracy in the test set
#### Approaching Problem with a Perceptron from scratch
# setting inputs, outputs and weights
inputs <- as.matrix(train[,c(1,2)])
output <- as.matrix(train[,3])
set.seed(123456)
weights <- as.matrix(runif(dim(inputs)[2],-1,1))
## Defining activation function + derivative
# defining the sigmoid and its derivative
sigmoid <- function(x) {1 / (1 + exp(-x))}
sig_dir <- function(x) {sigmoid(x) * (1 - sigmoid(x))}
## Perceptron Initial Settings
bias <- 1
# number of iterations
iterations <- 1000
# setting learning rate
alpha <- 0.1
## Perceptron
# creating vectors for saving results per iteration
weights_list <- list()
weights_list[[1]] <- weights
errors_vec <- vector()
outputs_vec <- vector()
# saving results across iterations
weights_list_all <- list()
outputs_list <- list()
errors_list <- list()
# looping through the backpropagation algorithm "iteration" # times
for (j in 1:iterations) {
  # Loop for backpropagation with updating weights after every datapoint
  for (i in 1:dim(train)[1]) {
    # taking the weights from the last iteration of the outer loop as a starting point
    if (j > 1) {
      weights_list[[1]] <- weights
    }
    # Feed Forward (Should we really round this?!)
    output_pred <- round(sigmoid(sum(inputs[i,] * as.numeric(weights)) + bias))
    error <- output_pred - output[i]
    # Backpropagation (Do I need the sigmoid derivative AND a learning rate? Or should I only take one of them?)
    weight_adjustments <- inputs[i,] * (error * sig_dir(output_pred)) * alpha
    weights <- weights - weight_adjustments
    # saving progress for later plots
    weights_list[[i + 1]] <- weights
    errors_vec[i] <- error
    outputs_vec[[i]] <- output_pred
  }
  # saving results for each iteration
  weights_list_all[[j]] <- weights_list
  outputs_list[[j]] <- outputs_vec
  errors_list[[j]] <- errors_vec
}
#### Formatting Diagnostics for easier plotting
# implementing empty list to transform weightslist
WeightList <- list()
# collapsing individual weightslist into datafames
for (i in 1:iterations) {
  WeightList[[i]] <- t(data.table::rbindlist(weights_list_all[i]))
}
# pasting dataframes together
WeightFrame <- do.call(rbind.data.frame, WeightList)
colnames(WeightFrame) <- paste("w",1:dim(WeightFrame)[2], sep = "")
# pasting dataframes together
ErrorFrame <- do.call(rbind.data.frame, errors_list)
OutputFrame <- do.call(rbind.data.frame, outputs_list)
##### Plotting Results
# Development of Mean Error per iteration
plot(rowMeans(abs(ErrorFrame)),
type = "l",
xlab = "Sum of absolute Error terms")
# Development of Weights over time
plot(WeightFrame$w1, type = "l",xlim = c(1,dim(train)[1]), ylim = c(min(WeightFrame),max(WeightFrame)), ylab = "Weights", xlab = "Iterations")
lines(WeightFrame$w2, col = "green")
# lines(WeightFrame$w3, col = "blue")
# lines(WeightFrame$w4, col = "red")
# lines(WeightFrame$w5, col = "orange")
# lines(WeightFrame$w6, col = "cyan")
# lines(WeightFrame$w7, col = "magenta")
# Empty vector for number of correct categorizations per iteration
NoCorr <- vector()
# Computing percentage of correct predictions per iteration
colnames(OutputFrame) <- paste("V",1:dim(OutputFrame)[2], sep = "")
Output_mat <- as.matrix(OutputFrame)
for (i in 1:iterations) {
  NoCorr[i] <- sum(output == Output_mat[i,]) / nrow(train)
}
# plotting number of correct predictions per iteration
plot(NoCorr, type = "l")
# Performance in training set after last iteration
table(output,round(OutputFrame[iterations,]))
First of all, welcome to the world of neural networks :).
Secondly, I want to recommend a great article which I personally used to get a better understanding of backpropagation and the whole NN learning process: https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/. It can be a bit rough to get through at times, and for the actual implementation I think it is much easier to follow pseudocode from an NN book, but for understanding what is going on this article is very nice!
Thirdly, I will hopefully solve your problem :)
You already ask yourself in a comment whether you should really round that output_pred. Yes, you should... if you want to use output_pred to make a prediction! However, using the rounded value for learning is generally not good. The reason is that if you round during learning, an output that was rounded up from 0.51 to 1 with target output 1 will not learn anything, because the rounded output equals the target and thus looks perfect. However, 0.99 would have been a much better prediction than 0.51, so there is definitely still something to learn!
I am not 100% sure this solves all your problems (I am not an R programmer) and gets your accuracy up to 99%, but it should solve some of them, and hopefully the intuition is clear as well :)
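As a minimal sketch of that change (reusing inputs, output, weights, bias, alpha, sigmoid and sig_dir from the question; as a side note, the derivative is evaluated at the pre-activation here rather than at the rounded output):
# Sketch of the inner update loop without rounding during learning
for (i in 1:nrow(inputs)) {
  net        <- sum(inputs[i, ] * as.numeric(weights)) + bias
  activation <- sigmoid(net)             # keep the raw sigmoid output for learning
  error      <- activation - output[i]   # 0.51 vs 0.99 now make a difference
  weights    <- weights - alpha * error * sig_dir(net) * inputs[i, ]
}
# round(sigmoid(...)) is only applied afterwards, when a hard 0/1 prediction is needed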

Random data generation leading to good prediction on random labels

I've been playing around with implementing CV in R, but I encountered a weird problem with the values returned across folds in LOOCV.
First I randomly generate data as well as labels, then I fit a randomForest on what should be pure noise. From the loop's results I get not only a good AUC but also a significant p-value from a t-test. I don't understand how this can happen in theory, so I was curious whether the way I generate the data and labels is correct?
Here is a code snippet that shows my issue.
library(randomForest)
library(pROC)
n=30
p=900
set.seed(3)
XX=matrix(rnorm(n*p, 0, 1) , nrow=n)
YY=as.factor(sample(c('P', 'C'), n, replace=T))
resp = vector()
for(i in 1:n){
  fit = randomForest(XX[-i,], YY[-i])
  pred = predict(fit, XX[i,], type = "prob")[2]
  resp[i] <- pred
}
t.test(resp~YY)$p.value
roc(YY, resp)$auc
I tried multiple ways of generating the data, all of which lead to the same result:
XX=matrix(runif(n*p), nrow=n)
XX=matrix(rnorm(n*p, 0, 1) , nrow=n)
and
random_data = matrix(0, n, p)
for(i in 1:n){
  random_data[i,] = jitter(runif(p), factor = 1, amount = 10)
}
XX=as.matrix(random_data)
Since the randomForest is finding relevant predictors in this scenario, it leads me to believe that the data may not be truly random. Is there a better way I could generate the data, or generate the random labels? Is it possible that this is an issue with R?
This is a partial answer: I modified your roc() call to fix the direction argument, so that the AUC values can fall anywhere between 0 and 1 instead of being pushed above 0.5 by the default direction = "auto", which picks the direction from the data itself. Then I ran it 20 times. Mean AUC and p-value are 0.73 and 0.12, respectively. Improved, but still better than random... (see the small illustration after the code below).
library(ROCR)
library(randomForest)
library(pROC)
n=30
p=900
pvs=vector()
aucs=vector()
for (j in seq(20)){
  XX = matrix(rnorm(n*p, 0, 1), nrow = n)
  YY = as.factor(sample(c('C', 'P'), n, replace = T))
  resp = vector()
  for(i in 1:n){
    fit = randomForest(XX[-i,], YY[-i])
    pred = predict(fit, XX[i,], type = "prob")[2]
    resp[i] <- pred
  }
  pvs[j] = t.test(resp~YY)$p.value
  aucs[j] = roc(YY, resp, direction = '>')$auc
}
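To see the effect of the direction argument in isolation, here is a hypothetical toy illustration using only pROC, independent of the random forest:
library(pROC)
set.seed(1)
labels <- factor(sample(c("C", "P"), 30, replace = TRUE))  # random labels
scores <- rnorm(30)                                        # pure-noise predictor
# With the default direction = "auto", the direction is chosen from the data itself,
# which biases the AUC optimistically above 0.5 even when there is no signal.
roc(labels, scores)$auc
# With a fixed direction the AUC can fall on either side of 0.5, as it should under the null.
roc(labels, scores, direction = ">")$auc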

R: PDA on 2 manners but with different results?

I'm working on a project where I have to classify data about breast cancer. I want to use PDA, and I'm trying to find the optimal value for lambda by 10-fold cross-validation.
I've started with:
breast[,1] <- as.numeric(breast[,1]) #Because this code only works when the classes are numeric (so not factors like they were)
I found that this could be done by:
library(penalizedLDA)
PenalizedLDA.cv(breast[,-1], breast[,1], lambdas = seq(0,0.5,by=.01), nfold = 10)
But I can also do this manually:
# fda() and gen.ridge come from the mda package
library(mda)
lam <- seq(0, 1, by = .01)
length(lam)
k <- 10                               # number of folds
error.lam <- rep(0, length(lam))
for (la in 1:length(lam)){
  error.pda <- rep(0, k)
  for (fold in 1:k){
    currentFold <- folds[fold][[1]]   # folds: list of k index vectors built beforehand
    breast.pda = fda(Diagnosis~., data = breast[-currentFold,], method = gen.ridge, lambda = lam[la])
    pred.pda = predict(breast.pda, newdata = breast[currentFold,-1], type = "class")
    CV.table = table(breast[currentFold,1], pred.pda)
    error.pda[fold] <- CV.table[1,2] + CV.table[2,1]
  }
  error.lam[la] <- sum(error.pda)/nrow(breast)
}
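The loop above assumes a folds object already exists; purely for completeness, a minimal sketch of how such a list of 10 index vectors could be built:
set.seed(1)
folds <- split(sample(nrow(breast)), rep(1:10, length.out = nrow(breast)))  # list of 10 random index sets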
The strange thing now is that I get two different results. For the first approach I get a CV error of 4.4% with lambda 0.055, while for the second I get a CV error of 6.32% with lambda 0.55. Could someone explain the difference between these two methods?
Silke

Cross validation for glm() models

I'm trying to do a 10-fold cross-validation for some glm models that I built earlier in R. I'm a little confused about the cv.glm() function in the boot package, although I've read a lot of help files. When I make the following call:
library(boot)
cv.glm(data, glmfit, K=10)
Does the "data" argument here refer to the whole dataset or only to the test set?
The examples I have seen so far pass the test set as the "data" argument, but that did not really make sense to me: why do 10 folds on the same test set? They are all going to give exactly the same result (I assume!).
Unfortunately ?cv.glm explains it in a foggy way:
data: A matrix or data frame containing the data. The rows should be
cases and the columns correspond to variables, one of which is the
response
My other question would be about the $delta[1] result. Is this the average prediction error over the 10 trials? What if I want to get the error for each fold?
Here's what my script looks like:
##data partitioning
sub <- sample(nrow(data), floor(nrow(data) * 0.9))
training <- data[sub, ]
testing <- data[-sub, ]
##model building
model <- glm(formula = groupcol ~ var1 + var2 + var3,
family = "binomial", data = training)
##cross-validation
cv.glm(testing, model, K=10)
I am always a little cautious about using the various packages' 10-fold cross-validation methods. I have my own simple script to create the test and training partitions manually for any machine learning package:
#Randomly shuffle the data
yourData <- yourData[sample(nrow(yourData)), ]
#Create 10 equally sized folds
folds <- cut(seq(1, nrow(yourData)), breaks = 10, labels = FALSE)
#Perform 10-fold cross-validation
for (i in 1:10) {
  #Segment your data by fold using the which() function
  testIndexes <- which(folds == i, arr.ind = TRUE)
  testData <- yourData[testIndexes, ]
  trainData <- yourData[-testIndexes, ]
  #Use the test and train data partitions however you desire...
}
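As a hypothetical filling-in of that last comment (reusing the yourData, groupcol and var1-var3 names from the question, and assuming groupcol is coded 0/1), the loop body could fit the model and keep one error value per fold, which also answers the question about per-fold errors:
# Sketch: per-fold prediction error with the manual 10-fold loop
fold_errors <- numeric(10)
for (i in 1:10) {
  testIndexes <- which(folds == i, arr.ind = TRUE)
  testData  <- yourData[testIndexes, ]
  trainData <- yourData[-testIndexes, ]
  fit  <- glm(groupcol ~ var1 + var2 + var3, family = "binomial", data = trainData)
  prob <- predict(fit, newdata = testData, type = "response")
  fold_errors[i] <- mean((testData$groupcol - prob)^2)  # same cost as cv.glm's default
}
fold_errors         # one error per fold
mean(fold_errors)   # close to delta[1] when the folds are equally sized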
@Roman provided some answers in his comments; however, the answer to your questions comes from inspecting the code of cv.glm:
I believe this bit of code randomly splits the data set into the K folds, adjusting K by rounding if necessary when it does not divide n:
if ((K > n) || (K <= 1))
    stop("'K' outside allowable range")
K.o <- K
K <- round(K)
kvals <- unique(round(n/(1L:floor(n/2))))
temp <- abs(kvals - K)
if (!any(temp == 0))
    K <- kvals[temp == min(temp)][1L]
if (K != K.o)
    warning(gettextf("'K' has been set to %f", K), domain = NA)
f <- ceiling(n/K)
s <- sample0(rep(1L:K, f), n)
This bit here shows that the delta value is NOT the root mean squared error. It is, as the help file says, "the average squared error function". What does this mean? We can see this by inspecting the function declaration:
function (data, glmfit, cost = function(y, yhat) mean((y - yhat)^2),
K = n)
which shows that within each fold we calculate the average of the squared error, where the error is, in the usual sense, the difference between the predicted and the actual response.
delta[1] is then simply the weighted average of these per-fold costs, with each fold weighted by its share of the observations; see my inline comments in the code of cv.glm:
for (i in seq_len(ms)) {
    j.out <- seq_len(n)[(s == i)]
    j.in <- seq_len(n)[(s != i)]
    Call$data <- data[j.in, , drop = FALSE]
    d.glm <- eval.parent(Call)
    p.alpha <- n.s[i]/n # create weighted average for later
    cost.i <- cost(glm.y[j.out], predict(d.glm, data[j.out, , drop = FALSE],
        type = "response"))
    CV <- CV + p.alpha * cost.i # add weighted average error to running total
    cost.0 <- cost.0 - p.alpha * cost(glm.y, predict(d.glm,
        data, type = "response"))
}
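Reading that loop, data is split into the K folds internally, so it should be the full modelling data set rather than a held-out test set. A minimal usage sketch, reusing the training data frame and formula from the question:
library(boot)
model  <- glm(groupcol ~ var1 + var2 + var3, family = "binomial", data = training)
cv_res <- cv.glm(training, model, K = 10)
cv_res$delta[1]  # weighted average of the per-fold squared-error costs
cv_res$delta[2]  # bias-adjusted version of the same estimate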

Feature selection + cross-validation, but how to make ROC-curves in R

I'm stuck with the following problem. I divide my data into 10 folds. Each time, I use 1 fold as the test set and the other 9 as the training set (I do this ten times). On each training set, I do feature selection (a filter method with chi.squared) and then I build an SVM model with the training set and the selected features.
So in the end, I get 10 different models (because of the feature selection). But now I want to make one ROC curve in R for this filter method in general. How can I do this?
Silke
You can indeed store the predictions if they are all on the same scale (be especially careful about this as you perform feature selection... some methods may produce scores that are dependent on the number of features) and use them to build a ROC curve. Here is the code I used for a recent paper:
library(pROC)
data(aSAH)
k <- 10
n <- dim(aSAH)[1]
indices <- sample(rep(1:k, ceiling(n/k))[1:n])
all.response <- all.predictor <- aucs <- c()
for (i in 1:k) {
  test = aSAH[indices==i,]
  learn = aSAH[indices!=i,]
  model <- glm(as.numeric(outcome)-1 ~ s100b + ndka + as.numeric(wfns), data = learn, family=binomial(link = "logit"))
  model.pred <- predict(model, newdata=test)
  aucs <- c(aucs, roc(test$outcome, model.pred)$auc)
  all.response <- c(all.response, test$outcome)
  all.predictor <- c(all.predictor, model.pred)
}
roc(all.response, all.predictor)
mean(aucs)
The ROC curve is built from all.response and all.predictor, which are appended to at each step. The code also stores the AUC of each fold in aucs for comparison. Both results should be quite similar when the sample size is sufficiently large; with small samples within the cross-validation, however, the per-fold AUCs may be underestimated, because the ROC curve built from all the data tends to be smoother and less underestimated by the trapezoidal rule.
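A short follow-up sketch for actually drawing the cross-validated curve from the pooled predictions, assuming the loop above has run:
pooled <- roc(all.response, all.predictor)
plot(pooled, main = "Cross-validated ROC (pooled predictions)")
auc(pooled)   # AUC of the pooled curve
mean(aucs)    # average of the per-fold AUCs, for comparison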
