MM Estimation in Robust Regression - r

I am working with different linear regression models in R. I used the DATASET, which has 21263 rows and 82 columns.
All of the regression models have acceptable time consumption except the MM-estimate regression using the R function lmrob.
I was waiting for more than 10 hours to run the first for loop (#Block A), and it does not work. By "does not work", I mean It may give me an output after two days. I tried this code with a smaller DATASET which has 9568 rows, 5 columns and it runs in a one minute.
I am using my standard Laptop.
The steps of my analysis as follows
Uploading and scaling the dataset and then used k-folds split with k=30 because I want to calculate the variance of coefficients for each variable within the k split.
Could you please provide me with any guide?
wdbc = read.csv("train.csv") #critical_temp is the dependent varaible.
wdbcc=as.data.frame(scale(wdbc)) # scaling the variables
### k-folds split ###
set.seed(12345)
k = 30
folds <- createFolds(wdbcc$critical_temp, k = k, list = TRUE, returnTrain = TRUE)
############ Start of MM Regression Model #################
#Block A
lmrob = list()
for (i in 1:k) {
lmrob[[i]] = lmrob(critical_temp~ .,
data = wdbcc[folds[[i]],],setting="KS2014")
}
#Block B
lmrob_coef = list()
lmrob_coef_var = list()
for(j in 1:(lmrob[[1]]$coefficients %>% length())){
for(i in 1:k){
lmrob_coef[[i]] = lmrob[[i]]$coefficients[j]
lmrob_coef_var[[j]] = lmrob_coef %>% unlist() %>% var()
}
}
#Block C
lmrob_var = unlist(lmrob_coef_var)
lmrob_df = cbind(coefficients = lmrob[[1]]$coefficients %>% names() %>% as.data.frame()
, variance = lmrob_var %>% as.data.frame())
colnames(lmrob_df) = c("coefficients", "variance_lmrob")
#Block D
lmrob_var_sum = sum(lmrob_var)

Not an answer, but some code to help you test this for yourself. I didn't run lmrob() on the full dataset, but everything I show below suggests that one full realization of the model (all observations, all predictors) should run in about 10-20 minutes [on a 10-year old MacOS desktop machine], which would extrapolate to approximately 5 hours for 30-fold cross-validation. (It looks like the time scales a little worse than the square root of the number of observations, and nonlinearly even on the log scale with the number of predictors ...) You can try the code below to see if things are much slower on your machine, and to predict how long you think it should take to do the whole problem. Other general suggestions:
is there a chance you're running out of memory? Memory constraints can make things run much slower
if the problem is just that things are too slow, you can easily parallelize across folds if you have access to multiple cores (probably don't do this on a laptop, you'll burn it up)
AWS and other cloud services can be very useful
I set up a test function to record the time taken by lmrob() running on a random subset of predictors and observations from your data set.
Extract data, load packages:
unzip("superconduct.zip")
xx <- read.csv("train.csv")
library(robustbase)
library(ggplot2); theme_set(theme_bw())
library(cowplot)
Define a test function for timing lmrob runs for different numbers of observations and predictors:
nc <- ncol(xx) ## response vble is last column, "critical_temp"
test <- function(nobs=1000,npred=10,seed=NULL, ...) {
if (!is.null(seed)) set.seed(seed)
dd <- xx[sample(nrow(xx),size=nobs),
c(sample(nc-1,size=npred),nc)]
tt <- system.time(fit <- lmrob(critical_temp ~ ., data=dd, ...))
tt[c("user.self","sys.self","elapsed")]
}
t0 <- test()
The minimal example here (1000 observations, 10 predictors) is very fast (0.2 seconds).
This is the basic loop I ran:
res <- expand.grid(nobs=seq(1000,10000,by=1000), npred=seq(10,30,by=2))
res$user.self <- res$sys.self <- res$elapsed <- NA
for (i in seq(nrow(res))) {
cat(res$nobs[i],res$npred[i],"\n")
res[i,-(1:2)] <- test(res$nobs[i],res$npred[i],seed=101)
}
(As you can see in the plot below, I did this again with larger numbers of observations and predictors and used rbind() to combine the results into a single data frame.) I also tried fitting linear models to make predictions of the time taken to do the full data set with all predictors. (Plotting [see below] suggests that the time is log-log-linear in number of observations but nonlinear in number of predictors ...)
m1 <- lm(log10(elapsed)~poly(log10(npred),2)*log10(nobs), data=resc)
pp <- predict(m1, newdata=data.frame(npred=ncol(xx)-1,nobs=nrow(xx)),
interval="confidence")
10^pp ## convert from log10(predicted seconds) to seconds
Test the full data set.
t_all <- test(nobs=nrow(xx),npred=ncol(xx)-1)
I then realized that you were using setting = "KS2014" (as suggested in the documentation) rather than the default: this is at least 5x slower, as suggested by the following comparison:
test(nobs=10000,npred=30)
test(nobs=10000,npred=30,setting = "KS2014")
I re-ran some of the stuff above with setting="KS2014". Making the prediction for the full data set suggested a run-time of about 700 seconds (CI from 300 to 2000 seconds) - still nowhere near as slow as you're suggesting.
gg0 <- ggplot(resc2,aes(x=npred,y=elapsed,colour=nobs,linetype=setting))+
geom_point()+geom_line(aes(group=interaction(nobs,setting)))+
scale_x_log10()+scale_y_log10()
gg1 <- ggplot(resc2,aes(x=nobs,y=elapsed,colour=npred, linetype=setting))+
geom_point()+geom_line(aes(group=interaction(npred,setting)))+
scale_x_log10()+scale_y_log10()
plot_grid(gg0,gg1,nrow=1)
ggsave("lmrob_times.pdf")

Related

How to apply a long list of functions automatically in 10 imputed datasets in R

I have 10 datasets that are the result of multiple imputation, which i named: data1, data2, ..., data10. For each of them, I want to do:
Create a logistic regression model
Do multiple steps which include creating a LASSO model, resampling 200 times from my imputed dataset, recreate LASSO model in each resampling, evaluate measures of performance.
I'm able to do it separately for each dataset but I was wondering if there was a way to automatically do all of the steps for each imputed dataset. Below, I included an example of all the steps I do to get results separately for each imputation.
To do it automatically, i first thought about using lapply to create regressions for every imputation:
log01.1 <- lapply(paste0("data",1:10), function(x){lrm(y ~ x1 + x2 + x3, data=eval(parse(text = x)), x=T, y=T)})
Then I wanted to use lapply again on the whole block of code below with something like :
lapply(log01.1,fun(x){*All the steps following the regression*}
But I realized it doesn't work since lapply can only be applied to one function at a time as I understand it + at model.L1 <- glmnet(x=log01.1$x, y=log01.1$y, alpha=1, lambda=cv.glmmod$lambda.1se, family="binomial")
it wouldn't work since my lambda would come from a list. And I can't use lapply both on log01.1 and on cv.glmmod at the same time. Add to that the resampling with the 200 repetitons and I'm sure I would run into other problems I can't even think of right now.
And that's about the extent of my knowledge on lapply and other functions that could do similar things. Is there a way to take the chunk of code I wrote below and tell R to repeat it for every one of my 10 imputations and then store into separate lists the objects that would have been created? Or maybe not in lists but I would get for example App1, App2, App3, etc.?
Or am I better off just repeating it 10 times and storing the results?
log01.1 <- lrm(y ~ x1 + x2 + x3 , data=data1, x=T, y=T)})
reps <- 200;App=numeric(reps);Test=numeric(reps)
for(i in 1:reps){
#1.Construct LASSO model in sample i
cv.glmmod <- cv.glmnet(x=log01.1$x, y=log01.1$y, alpha=1, family="binomial")
model.L1 <- glmnet(x=log01.1$x, y=log01.1$y, alpha=1,
lambda=cv.glmmod$lambda.1se, family="binomial") #use optimum penalty
lp1 <- log01.1$x %*% model.L1$beta #for apparent performance
#2. Draw bootstrap sample with replacement from sample i
j <- sample(nrow(data1), replace=T) #for sample Bi
#3. Construct a model in sample Bi replaying every step that was done in the imputed sample
#I, especially model specification steps such as selection of predictors.
#Determine the bootstrap performance as the apparent performance in sample Bi.
#3 Construct LASSO model in sample i replaying every step done in imputed sample i
cv.j <- cv.glmnet (x=log01.1$x[j,], y=log01.1$y[j,], alpha = 1, family="binomial")
model.L1j <- glmnet (x=log01.1$x[j,], y=log01.1$y[j,], alpha=1,
lambda=cv.j$lambda.1se, family="binomial") #use optimum penalty for Bi
lp1j <- log01.1$x[j,] %*% model.L1j$beta #apparent performance in Bi
App[i] <- lrm.fit(y=log01.1$y[j,], x=lp1j)$stats[6] #apparent c for Bi
#4. Apply model from Bi to the original sample i without any modification to determine the test performance
lp1 <- log01.1$x %*% model.L1j$beta #Validated performance in I
Test[i] <- lrm.fit(y=log01.1$y, x=lp1)$stats[6]} #Test c in I
That is the code I would like to repeat automatically for every imputed set.

How to use lapply with get.confusion_matrix() in R?

I am performing a PLS-DA analysis in R using the mixOmics package. I have one binary Y variable (presence or absence of wetland) and 21 continuous predictor variables (X) with values ranging from 1 to 100.
I have made the model with the data_training dataset and want to predict new outcomes with the data_validation dataset. These datasets have exactly the same structure.
My code looks like:
library(mixOmics)
model.plsda<-plsda(X,Y, ncomp = 10)
myPredictions <- predict(model.plsda, newdata = data_validation[,-1], dist = "max.dist")
I want to predict the outcome based on 10, 9, 8, ... to 2 principal components. By using the get.confusion_matrix function, I want to estimate the error rate for every number of principal components.
prediction <- myPredictions$class$max.dist[,10] #prediction based on 10 components
confusion.mat = get.confusion_matrix(truth = data_validatie[,1], predicted = prediction)
get.BER(confusion.mat)
I can do this seperately for 10 times, but I want do that a little faster. Therefore I was thinking of making a list with the results of prediction for every number of components...
library(BBmisc)
prediction_test <- myPredictions$class$max.dist
predictions_components <- convertColsToList(prediction_test, name.list = T, name.vector = T, factors.as.char = T)
...and then using lapply with the get.confusion_matrix and get.BER function. But then I don't know how to do that. I have searched on the internet, but I can't find a solution that works. How can I do this?
Many thanks for your help!
Without reproducible there is no way to test this but you need to convert the code you want to run each time into a function. Something like this:
confmat <- function(x) {
prediction <- myPredictions$class$max.dist[,x] #prediction based on 10 components
confusion.mat = get.confusion_matrix(truth = data_validatie[,1], predicted = prediction)
get.BER(confusion.mat)
}
Now lapply:
results <- lapply(10:2, confmat)
That will return a list with the get.BER results for each number of PCs so results[[1]] will be the results for 10 PCs. You will not get values for prediction or confusionmat unless they are included in the results returned by get.BER. If you want all of that, you need to replace the last line to the function with return(list(prediction, confusionmat, get.BER(confusion.mat)). This will produce a list of the lists so that results[[1]][[1]] will be the results of prediction for 10 PCs and results[[1]][[2]] and results[[1]][[3]] will be confusionmat and get.BER(confusion.mat) respectively.

Perceptron in R not converging

I am trying to understand Neural Networks better so I am trying to implement a simple perceptron from scratch in R. I know that this is very inefficient as there are many libraries that do this extemely well optimized but my goal is to understand the basics of neural networks better and work my way forward to more complex models.
I have created some artificial test data with a very simple linear decision boundary and split this into a training set and a test set. I then ran a logistic regression on the training data and checked the predictions from the test-set and got +99% accuray, which was to be expected given the simple nature of the data. I then tried implementing a perceptron with 2 inputs, 1 neuron, 1000 iterations, a learning rate of 0.1 and a sigmoid activation function.
I would expect to get very similar accuracy to the logistic regression model but my results are a lot worse (around 70% correct classifications in the training set). so I definitly did something wrong. The predictions only seem to get better after the first couple of iterations and then just go back and forth around a specific value (I tried with many different learning rates, no success). I'm attaching my script and I#m thankful for any advice! I think the problem lies in the calculation of the error or the weight adjustment but I can't put my finger on it...
### Reproducible Example for StackOverflow
#### Setup
# loading libraries
library(data.table)
#remove scientifc notation
options(scipen = 999)
# setting seed for random number generation
seed <- 123
#### Selfmade Test Data
# input points
x1 <- runif(10000,-100,100)
x2 <- runif(10000,-100,100)
# setting decision boundary to create output
output <- vector()
output[0.5*x1 + -1.2*x2 >= 50] <- 0
output[0.5*x1 + -1.2*x2 < 50] <- 1
# combining to dataframe
points <- cbind.data.frame(x1,x2,output)
# plotting all data points
plot(points$x1,points$x2, col = as.factor(points$output), main = "Self-created data", xlab = "x1",ylab = "x2")
# split into test and training sets
trainsize = 0.2
set.seed(seed)
train_rows <- sample(1:dim(points)[1], size = trainsize * dim(points)[1])
train <- points[train_rows,]
test <- points[-c(train_rows),]
# plotting training set only
plot(train$x1,train$x2, col = as.factor(train$output), main = "Self-created data (training set)", xlab = "x1",ylab = "x2")
#### Approaching the problem with logistic regression
# building model
train_logit <- glm(output ~ x1 + x2, data = train, family = "binomial", maxit = 10000)
summary(train_logit)
# testing performance in training set
table(round(train_logit$fitted.values) == train$output)
# testing performance of train_logit model in test set
table(test$output == round(predict(train_logit,test[,c(1,2)], type = "response")))
# We get 100% accuracy in the training set and near 100% accuracy in the test set
#### Approaching Problem with a Perceptron from scratch
# setting inputs, outputs and weights
inputs <- as.matrix(train[,c(1,2)])
output <- as.matrix(train[,3])
set.seed(123456)
weights <- as.matrix(runif(dim(inputs)[2],-1,1))
## Defining activation function + derivative
# defining sigmoid and it's derivative
sigmoid <- function(x) {1 / (1 + exp(-x))}
sig_dir <- function(x){sigmoid(x)*(1 - sigmoid(x))}
## Perceptron nitial Settings
bias <- 1
# number of iterations
iterations <- 1000
# setting learning rate
alpha <- 0.1
## Perceptron
# creating vectors for saving results per iteration
weights_list <- list()
weights_list[[1]] <- weights
errors_vec <- vector()
outputs_vec <- vector()
# saving results across iterations
weights_list_all <- list()
outputs_list <- list()
errors_list <- list()
# looping through the backpropagation algorithm "iteration" # times
for (j in 1:iterations) {
# Loop for backpropagation with updating weights after every datapoint
for (i in 1:dim(train)[1]) {
# taking the weights from the last iteration of the outer loop as a starting point
if (j > 1) {
weights_list[[1]] <- weights
}
# Feed Forward (Should we really round this?!)
output_pred <- round(sigmoid(sum(inputs[i,] * as.numeric(weights)) + bias))
error <- output_pred - output[i]
# Backpropagation (Do I need the sigmoid derivative AND a learning rate? Or should I only take one of them?)
weight_adjustments <- inputs[i,] * (error * sig_dir(output_pred)) * alpha
weights <- weights - weight_adjustments
# saving progress for later plots
weights_list[[i + 1]] <- weights
errors_vec[i] <- error
outputs_vec[[i]] <- output_pred
}
# saving results for each iteration
weights_list_all[[j]] <- weights_list
outputs_list[[j]] <- outputs_vec
errors_list[[j]] <- errors_vec
}
#### Formatting Diagnostics for easier plotting
# implementing empty list to transform weightslist
WeightList <- list()
# collapsing individual weightslist into datafames
for (i in 1:iterations) {
WeightList[[i]] <- t(data.table::rbindlist(weights_list_all[i]))
}
# pasting dataframes together
WeightFrame <- do.call(rbind.data.frame, WeightList)
colnames(WeightFrame) <- paste("w",1:dim(WeightFrame)[2], sep = "")
# pasting dataframes together
ErrorFrame <- do.call(rbind.data.frame, errors_list)
OutputFrame <- do.call(rbind.data.frame, outputs_list)
##### Plotting Results
# Development of Mean Error per iteration
plot(rowMeans(abs(ErrorFrame)),
type = "l",
xlab = "Sum of absolute Error terms")
# Development of Weights over time
plot(WeightFrame$w1, type = "l",xlim = c(1,dim(train)[1]), ylim = c(min(WeightFrame),max(WeightFrame)), ylab = "Weights", xlab = "Iterations")
lines(WeightFrame$w2, col = "green")
# lines(WeightFrame$w3, col = "blue")
# lines(WeightFrame$w4, col = "red")
# lines(WeightFrame$w5, col = "orange")
# lines(WeightFrame$w6, col = "cyan")
# lines(WeightFrame$w7, col = "magenta")
# Empty vector for number of correct categorizations per iteration
NoCorr <- vector()
# Computing percentage of correct predictions per iteration
colnames(OutputFrame) <- paste("V",1:dim(OutputFrame)[2], sep = "")
Output_mat <- as.matrix(OutputFrame)
for (i in 1:iterations) {
NoCorr[i] <- sum(output == Output_mat[i,]) / nrow(train)
}
# plotting number of correct predictions per iteration
plot(NoCorr, type = "l")
# Performance in training set after last iteration
table(output,round(OutputFrame[iterations,]))
First of all, welcome to the world of Neural Networks :).
Secondly, I want to recommend a great article to you, which I personally used to get a better understanding of backtracking and the whole NN learning stuff: https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/. Might be a bit rough to get through sometimes, and for the general implementation I think it is much easier to follow pseudocode from a NN book. However, to understand what is going on this is article is very nice!
Thirdly, I will hopefully solve your problem :)
You comment yourself already with whether you should really round that output_pred. Yes you should.. if you want to use that output_pred to make a prediction! However, if you want to use it for learning it is generally not good! The reason for this is that if you round it for learning, than an output which was rounded up from 0.51 to 1 with target output 1 will not learn anything as the output was the same as the target and thus is perfect. However, 0.99 would have been a lot better of a prediction than 0.51 and thus there is definitely something to learn!
I am not 100% sure if this solves all your problems (im not an R programmer) and gets your accuracy up to 99%, but it should solve some of it, and hopefully the intuition is also clear :)

Feature selection + cross-validation, but how to make ROC-curves in R

I'm stuck with the next problem. I divide my data into 10 folds. Each time, I use 1 fold as test set and the other 9 as training set (I do this ten times). On each training set, I do feature selection (filter methode with chi.squared) and then I make a SVMmodel with my training set and the selected features.
So at the end, I become 10 different models (because of the feature selection). But now I want to make a ROC-curve in R from this filter methode in general. How can I do this?
Silke
You can indeed store the predictions if they are all on the same scale (be especially careful about this as you perform feature selection... some methods may produce scores that are dependent on the number of features) and use them to build a ROC curve. Here is the code I used for a recent paper:
library(pROC)
data(aSAH)
k <- 10
n <- dim(aSAH)[1]
indices <- sample(rep(1:k, ceiling(n/k))[1:n])
all.response <- all.predictor <- aucs <- c()
for (i in 1:k) {
test = aSAH[indices==i,]
learn = aSAH[indices!=i,]
model <- glm(as.numeric(outcome)-1 ~ s100b + ndka + as.numeric(wfns), data = learn, family=binomial(link = "logit"))
model.pred <- predict(model, newdata=test)
aucs <- c(aucs, roc(test$outcome, model.pred)$auc)
all.response <- c(all.response, test$outcome)
all.predictor <- c(all.predictor, model.pred)
}
roc(all.response, all.predictor)
mean(aucs)
The roc curve is built from all.response and all.predictor that are updated at each step. This code also stores the AUC at each step in auc for comparison. Both results should be quite similar when the sample size is sufficiently large, however small samples within the cross-validation may lead to underestimated AUC as the ROC curve with all data will tend to be smoother and less underestimated by the trapezoidal rule.

plot multiple fit and predictions for logistic regression

I am running multiple times a logistic regression over more than 1000 samples taken from a dataset. My question is what is the best way to show my results ? how can I plot my outputs for both the fit and the prediction curve?
This is an example of what I am doing, using the baseball dataset from R. For example I want to fit and predict the model 5 times. Each time I take one sample out (for the prediction) and use another for the fit.
library(corrgram)
data(baseball)
#Exclude rows with NA values
dataset=baseball[complete.cases(baseball),]
#Create vector replacing the Leage (A our N) by 1 or 0.
PA=rep(0,dim(dataset)[1])
PA[which(dataset[,2]=="A")]=1
#Model the player be league A in function of the Hits,Runs,Errors and Salary
fit_glm_list=list()
prd_glm_list=list()
for (k in 1:5){
sp=sample(seq(1:length(PA)),30,replace=FALSE)
fit_glm<-glm(PA[sp[1:15]]~baseball$Hits[sp[1:15]]+baseball$Runs[sp[1:15]]+baseball$Errors[sp[1:15]]+baseball$Salary[sp[1:15]])
prd_glm<-predict(fit_glm,baseball[sp[16:30],c(6,8,20,21)])
fit_glm_list[[k]]=fit_glm;prd_glm_list[[k]]=fit_glm
}
There are a number of issues here.
PA is a subset of baseball$League but the model is constructed on columns from the whole baseball data frame, i.e. they do not match.
PA is treated as a continuous response when using the default family (gaussian), it should be changed to a factor and binomial family.
prd_glm_list[[k]]=fit_glm should probably be prd_glm_list[[k]]=prd_glm
You must save the true class labels for the predictions otherwise you have nothing to compare to.
My take on your code looks like this.
library(corrgram)
data(baseball)
dataset <- baseball[complete.cases(baseball),]
fits <- preds <- truths <- vector("list", 5)
for (k in 1:5){
sp <- sample(nrow(dataset), 30, replace=FALSE)
fits[[k]] <- glm(League ~ Hits + Runs + Errors + Salary,
family="binomial", data=dataset[sp[1:15],])
preds[[k]] <- predict(fits[[k]], dataset[sp[16:30],], type="response")
truths[[k]] <- dataset$League[sp[1:15]]
}
plot(unlist(truths), unlist(preds))
The model performs poorly but at least the code runs without problems. The y-axis in the plot shows the estimated probabilities that the examples belong to league N, i.e. ideally the left box should be close to 0 and the right close to 1.

Resources