Type measure differences in glmnet package?

What is the difference between using "mse" and "class" in the glmnet package?
log_x <- model.matrix(response~.,train)
log_y <- ifelse(train$response=="good",1,0)
log_cv <- cv.glmnet(log_x,log_y,alpha=1,family="binomial", type.measure = "class")
summary(log_cv)
plot(log_cv)
vs.
log_x <- model.matrix(response~.,train)
log_y <- ifelse(train$response=="good",1,0)
log_cv <- cv.glmnet(log_x,log_y,alpha=1,family="binomial", type.measure = "mse")
summary(log_cv)
plot(log_cv)
I'm noticing that I get a slightly different curve (or smoothness) in my plot, and a few percent difference in accuracy. But for predicting a binomial class response, is one type measure more appropriate than the other?

It depends on your case study and what you want to learn from your model. From the help files:
The default is type.measure="deviance", which uses squared-error for gaussian models (a.k.a type.measure="mse" there) [...]. type.measure="class" applies to binomial and multinomial logistic regression only, and gives misclassification error.
Therefore, you have to ask yourself whether, in your problem, you want to minimize misclassification error or mean squared error.
There is no straightforward answer as to which is best. They are two different statistics by which the model decides which penalization parameter to choose among the models generated by the cross-validation.
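For a concrete look at the difference, here is a minimal sketch (reusing log_x and log_y from the question) that compares the penalty each criterion selects; passing a shared foldid keeps the cross-validation folds identical across the two calls:
set.seed(1)
# one shared fold assignment so the two criteria see identical folds
folds <- sample(rep(1:10, length.out = nrow(log_x)))
cv_class <- cv.glmnet(log_x, log_y, alpha = 1, family = "binomial",
                      type.measure = "class", foldid = folds)
cv_mse <- cv.glmnet(log_x, log_y, alpha = 1, family = "binomial",
                    type.measure = "mse", foldid = folds)
cv_class$lambda.min  # penalty minimizing CV misclassification error
cv_mse$lambda.min    # penalty minimizing CV mean squared error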

Related

R language, how to use bootstraps to generate maximum likelihood and AICc?

Sorry for a quite stupid question. I am doing multiple comparisons of morphological traits through correlations of bootstrapped data. I'm curious whether such multiple comparisons are impacting my level of inference, as well as about the effect of potential multicollinearity in my data. Perhaps a reasonable option would be to use my bootstraps to generate maximum likelihoods and then AICc values to do comparisons across all of my parameters, to see what comes out as most important... the problem is that although I have the way (more or less) clear, I don't know how to implement it in R. Can anybody be so kind as to throw some light on this for me?
So far, here is an example (in R, but not with my data):
library(boot)
data(iris)
head(iris)
# The statistic to bootstrap: Pearson correlation plus the two medians
pearson <- function(data, indices){
  dt <- data[indices, ]
  c(
    cor(dt[, 1], dt[, 2], method = 'p'),
    median(dt[, 1]),
    median(dt[, 2])
  )
}
# One example: iris$Sepal.Length ~ iris$Sepal.Width
# I calculate the Pearson correlation with 1000 replications
set.seed(12345)
dat <- iris[,c(1,2)]
dat <- na.omit(dat)
results <- boot(dat, statistic=pearson, R=1000)
# 95% CIs
boot.ci(results, type="bca")
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL :
boot.ci(boot.out = results, type = "bca")

Intervals :
Level       BCa
95%   (-0.2490,  0.0423 )
Calculations and Intervals on Original Scale
plot(results)
I have several more pairs of comparisons.
More of a Cross Validated question.
Multicollinearity shouldn't be a problem if you're just assessing the relationship between two variables (in your case correlation). Multicollinearity only becomes an issue when you fit a model, e.g. multiple regression, with several highly correlated predictors.
Multiple comparisons are always a problem, though, because they inflate your type-I error rate. The way to address that is to apply a multiple-comparison correction, e.g. Bonferroni-Holm or the less conservative FDR. That can have its downsides, though: especially if you have many predictors and few observations, it may lower your power so much that you won't be able to find any effect, no matter how big it is.
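As an aside, applying such a correction is a one-liner in base R with p.adjust(); the p-values below are purely hypothetical:
pvals <- c(0.001, 0.012, 0.030, 0.045, 0.200)  # hypothetical raw p-values
p.adjust(pvals, method = "holm")  # Bonferroni-Holm
p.adjust(pvals, method = "BH")    # FDR (Benjamini-Hochberg)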
In a high-dimensional setting like this, your best bet may be some sort of regularized regression method. With regularization, you put all predictors into your model at once, similarly to multiple regression; the trick is that you constrain the model so that all of the regression slopes are pulled towards zero, and only the ones with big effects "survive". The machine learning versions of regularized regression are called ridge, LASSO, and elastic net, and they can be fitted using the glmnet package. There are also Bayesian equivalents, the so-called shrinkage priors, such as the horseshoe (see e.g. https://avehtari.github.io/modelselection/regularizedhorseshoe_slides.pdf). You can fit Bayesian regularized regression using the brms package.
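A minimal sketch of the glmnet route, using iris purely for illustration and assuming Sepal.Length as the outcome with the remaining numeric traits as predictors:
library(glmnet)
X <- as.matrix(iris[, 2:4])                # numeric predictors
y <- iris$Sepal.Length                     # outcome
cv_fit <- cv.glmnet(X, y, alpha = 1)       # alpha = 1 -> LASSO
coef(cv_fit, s = "lambda.1se")             # slopes surviving the shrinkage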

How to deal with spatially autocorrelated residuals in GLMM

I am conducting an analysis of where on the landscape a predator encounters potential prey. My response data is binary with an Encounter location = 1 and a Random location = 0 and my independent variables are continuous but have been rescaled.
I originally used a GLM structure
glm_global <- glm(Encounter ~ Dist_water_cs + coverMN_cs + I(coverMN_cs^2) +
                    Prey_bio_stand_cs + Prey_freq_stand_cs + Dist_centre_cs,
                  data = Data_scaled, family = binomial)
but realized that this failed to account for potential spatial-autocorrelation in the data (a spline correlogram showed high residual correlation up to ~1000m).
Correlog_glm_global <- spline.correlog(x = Data_scaled[, "Y"],
                                       y = Data_scaled[, "X"],
                                       z = residuals(glm_global, type = "pearson"),
                                       xmax = 1000)
I attempted to account for this by implementing a GLMM (in lme4) with the predator group as the random effect.
glmm_global <- glmer(Encounter ~ Dist_water_cs + coverMN_cs + I(coverMN_cs^2) +
                       Prey_bio_stand_cs + Prey_freq_stand_cs + Dist_centre_cs +
                       (1 | Group),
                     data = Data_scaled, family = binomial)
When comparing the AIC of the global GLMM (1144.7) to the global GLM (1149.2), I get a delta-AIC > 2, which suggests that the GLMM fits the data better. However, I am still getting essentially the same correlation in the residuals, as shown in the spline correlogram for the GLMM model.
Correlog_glmm_global <- spline.correlog(x = Data_scaled[, "Y"],
                                        y = Data_scaled[, "X"],
                                        z = residuals(glmm_global, type = "pearson"),
                                        xmax = 10000)
I also tried explicitly including the Lat*Long of all the locations as an independent variable but results are the same.
After reading up on options, I tried running generalized estimating equations (GEEs) in geepack, thinking this would allow me more flexibility in explicitly defining the correlation structure (as in GLS models for normally distributed response data) instead of being limited to compound symmetry (which is what we get with a GLMM). However, I realized that my data still demanded compound symmetry ("exchangeable" in geepack), since I have no temporal sequence in the data. When I ran the global model
gee_global <- geeglm(Encounter ~ Dist_water_cs + coverMN_cs + I(coverMN_cs^2) +
                       Prey_bio_stand_cs + Prey_freq_stand_cs + Dist_centre_cs,
                     id = Pride, corstr = "exchangeable",
                     data = Data_scaled, family = binomial)
(using scaled or unscaled data made no difference so this is with scaled data for consistency)
suddenly none of my covariates were significant. However, being a novice with GEE modelling, I don't know (a) whether this is a valid approach for these data or (b) whether this has even accounted for the residual autocorrelation that has been evident throughout.
I would be most appreciative of some constructive feedback on 1) which direction to go once I realized that the GLMM (with predator group as a random effect) still showed spatially autocorrelated Pearson residuals (up to ~1000m), 2) whether GEE models make sense at this point, and 3) whether I have missed something in my GEE modelling. Many thanks.
Taking the spatial autocorrelation into account in your model can be done in many ways. I will restrict my response to the main R packages that deal with random effects.
First, you could go with the nlme package and specify a correlation structure for your residuals (many are available: corGaus, corLin, corSpher, ...). You should try several of them and keep the best model. In this case the spatial autocorrelation is treated as continuous and approximated by a global function.
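For a binomial response, one common way to fit such a structure is MASS::glmmPQL, which accepts the nlme correlation classes; a hedged sketch reusing the model and the X/Y coordinates from the question, with corExp as just one candidate structure to compare against others:
library(MASS)   # glmmPQL
library(nlme)   # corExp, corGaus, corSpher, ...
pql_global <- glmmPQL(Encounter ~ Dist_water_cs + coverMN_cs + I(coverMN_cs^2) +
                        Prey_bio_stand_cs + Prey_freq_stand_cs + Dist_centre_cs,
                      random = ~ 1 | Group,
                      correlation = corExp(form = ~ X + Y),  # exponential spatial decay
                      family = binomial, data = Data_scaled)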
Second, you could go with the mgcv package and add a bivariate spline (over the spatial coordinates) to your model. This way you can capture a spatial pattern and even map it. Strictly speaking, this method doesn't model the spatial autocorrelation itself, but it may solve the problem. If space is discrete in your case, you could use a Markov random field smooth. This website is very helpful for finding examples: https://www.fromthebottomoftheheap.net
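A hedged sketch of that idea, adding a bivariate smooth over the question's X/Y coordinates to the original model:
library(mgcv)
gam_global <- gam(Encounter ~ Dist_water_cs + coverMN_cs + I(coverMN_cs^2) +
                    Prey_bio_stand_cs + Prey_freq_stand_cs + Dist_centre_cs +
                    s(X, Y),   # bivariate spline absorbing the smooth spatial trend
                  family = binomial, data = Data_scaled)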
Third, you could go with the brms package, which allows you to specify very complex models with other correlation structures in the residuals (CAR and SAR). The package uses a Bayesian approach.
I hope this helps. Good luck!

How to build regression models and then compare their fits with data held out from the model training-testing?

I have been building a couple different regression models using the caret package in R in order to make predictions about how fluorescent certain genetic sequences will become under certain experimental conditions.
I have followed the basic protocol of splitting my data into two sets: one "training-testing set" (80%) and one "hold-out set" (20%), the former of which would be utilized to build the models, and the latter would be used to test them in order to compare and pick the final model, based on metrics such as their R-squared and RMSE values. One such guide of the many I followed can be found here (http://www.kimberlycoffey.com/blog/2016/7/16/compare-multiple-caret-run-machine-learning-models).
However, I run into a block in that I do not know how to test and compare the different models based on how well they predict the scores in the hold-out set. In the guide I linked above, the author uses confusionMatrix() to calculate the specificity and accuracy of each model after building a predict.train object that applies the newly built models to the hold-out data (referred to as test in the link). However, confusionMatrix() can only be applied to classification models, where the outcome (or response) is a categorical value (as far as my research has indicated; please correct me if this is wrong, as I have not been able to conclude beyond doubt that this is the case).
I have found that the resamples method is capable of comparing multiple models against each other (source: https://www.rdocumentation.org/packages/caret/versions/6.0-77/topics/resamples), but it cannot take into account how the new models fit with the data that I excluded from the training-testing sessions.
I tried to create predict objects using the newly built models and the hold-out data, then calculated R-squared and RMSE values using caret's R2 and RMSE functions. But I'm not sure whether such an approach is the best possible way of comparing and picking the best model.
At this point, I should note that all the model building methods I am using are based on linear regression, since I need to be able to extract the coefficients and apply them in a separate Python script.
Another option I considered was setting a threshold on my outcome, where any genetic sequence with a fluorescence value over 100 is considered useful, while sequences scoring under 100 are not. This would allow me to utilize confusionMatrix(). But I'm not sure how I should implement this within my R code to create these two classes in my outcome variable, and I'm further concerned that this approach might make it difficult to apply my regression models to other data and make predictions.
For what it's worth, each of the predictors is either an integer or a float, and their values are not normally distributed.
Here is the code I have been using thus far:
library(caret)

data <- read.csv("mydata.csv")  # comma-separated file, so read.csv
sorted_Data <- data[order(data$fluorescence, decreasing = TRUE), ]

splitprob <- 0.8
traintestindex <- createDataPartition(sorted_Data$fluorescence, p = splitprob, list = FALSE)
holdoutset <- sorted_Data[-traintestindex, ]
trainingset <- sorted_Data[traintestindex, ]
traindata <- trainingset[c('x1', 'x2', 'x3', 'x4', 'x5', 'fluorescence')]

cvCtrl <- trainControl(method = "repeatedcv", number = 20, repeats = 20, verboseIter = FALSE)
modelglmStepAIC <- train(fluorescence ~ ., traindata, method = "glmStepAIC",
                         preProc = c("center", "scale"), trControl = cvCtrl)
model_rlm <- train(fluorescence ~ ., traindata, method = "rlm",
                   preProc = c("center", "scale"), trControl = cvCtrl)

# predict via the caret train objects (not finalModel) so the
# center/scale preprocessing is also applied to the hold-out data
pred_glmStepAIC <- predict(modelglmStepAIC, holdoutset)
pred_rlm <- predict(model_rlm, holdoutset)

glmStepAIC_r2 <- R2(pred_glmStepAIC, holdoutset$fluorescence)
glmStepAIC_rmse <- RMSE(pred_glmStepAIC, holdoutset$fluorescence)
rlm_r2 <- R2(pred_rlm, holdoutset$fluorescence)
rlm_rmse <- RMSE(pred_rlm, holdoutset$fluorescence)
The out-of-sample performance measures offered by caret are RMSE, MAE, and the squared correlation between fitted and observed values (called R2); see https://topepo.github.io/caret/measuring-performance.html for more.
At least in a time-series regression context, RMSE is the standard measure of out-of-sample performance for regression models.
I would advise against discretising the continuous outcome variable, because discretising essentially throws away information.
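As a compact alternative for the hold-out comparison, caret's postResample() returns RMSE, R-squared, and MAE in one call; a minimal sketch reusing pred_rlm and holdoutset from the question:
postResample(pred = pred_rlm, obs = holdoutset$fluorescence)
# returns a named vector with RMSE, Rsquared and MAE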

Feature Selection for Regression Models in R

I'm trying to find a feature selection package in R that can be used for regression; most packages implement their methods for classification, using a factor or class as the response variable. In particular, I'm interested in whether there is a method using Random Forest for that purpose. A good paper in this field would also be helpful.
IIRC the randomForest package also does regression trees. You could start with the Breiman paper and go from there.
There are many ways you can use random forests to calculate variable importance.
I. Mean Decrease in Impurity (MDI) / Gini Importance:
This makes use of a random forest model or a decision tree. When a tree is trained, each feature's importance is measured by how much it decreases the weighted impurity in the tree. For a forest, the impurity decrease from each feature can be averaged, and the features ranked according to this measure. Here is an example in R.
fit <- randomForest(Target ~ ., importance = TRUE, ntree = 500, data = training_data)
var.imp1 <- data.frame(importance(fit, type = 2))
var.imp1$Variables <- row.names(var.imp1)
varimp1 <- var.imp1[order(var.imp1$MeanDecreaseGini, decreasing = TRUE), ]
par(mar = c(10, 5, 1, 1))
giniplot <- barplot(t(varimp1[-2] / sum(varimp1[-2])), las = 2,
                    cex.names = 1,
                    main = "Gini Impurity Index Plot")
And the output will look like this: [Gini Importance Plot]
II. Permutation Importance / Mean Decrease in Accuracy (MDA): permutation importance is assessed for each feature by breaking the association between that feature and the target. This is achieved by randomly permuting the values of the feature and measuring the resulting increase in error. Keep in mind that correlation among features can affect this measure. Example in R:
fit <- randomForest(Target ~ ., importance = TRUE, ntree = 500, data = training_data)
var.imp1 <- data.frame(importance(fit, type = 1))
var.imp1$Variables <- row.names(var.imp1)
# type = 1 importance is MeanDecreaseAccuracy (not MeanDecreaseGini)
varimp1 <- var.imp1[order(var.imp1$MeanDecreaseAccuracy, decreasing = TRUE), ]
par(mar = c(10, 5, 1, 1))
permplot <- barplot(t(varimp1[-2] / sum(varimp1[-2])), las = 2,
                    cex.names = 1,
                    main = "Permutation Importance Plot")
These two are the ones that use Random Forest directly. There are some additional, easy-to-use metrics for variable importance: the 'Boruta' method, as well as Weight of Evidence (WOE) and Information Value (IV), might also be helpful.
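Since the question is about regression, note that with a numeric response the type = 1 importance column is reported as %IncMSE rather than MeanDecreaseAccuracy; a hedged sketch, assuming the same hypothetical training_data with a numeric Target:
library(randomForest)
fit <- randomForest(Target ~ ., importance = TRUE, ntree = 500, data = training_data)
imp <- importance(fit, type = 1)                         # %IncMSE for regression
imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]  # features ranked by importance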

Difference between glmnet() and cv.glmnet() in R?

I'm working on a project that would show the potential influence a group of events have on an outcome. I'm using the glmnet() package, specifically using the Poisson feature. Here's my code:
# de <- data imported from sql connection
x <- model.matrix(~., data = de[, 2:7])
y <- de[, 1]
reg <- cv.glmnet(x, y, family = "poisson", alpha = 1)
reg1 <- glmnet(x, y, family = "poisson", alpha = 1)
Co <- coef(?reg or reg1?, s = ???)
summ <- summary(Co)
c <- data.frame(Name = rownames(Co)[summ$i],
                Lambda = summ$x)
c2 <- c[with(c, order(-Lambda)), ]
The beginning imports a large amount of data from my database in SQL. I then put it in matrix format and separate the response from the predictors.
This is where I'm confused: I can't figure out exactly what the difference is between the glmnet() function and the cv.glmnet() function. I realize that the cv.glmnet() function is a k-fold cross-validation of glmnet(), but what exactly does that mean in practical terms? They provide the same value for lambda, but I want to make sure I'm not missing something important about the difference between the two.
I'm also unclear as to why it runs fine when I specify alpha=1 (supposedly the default), but not if I leave it out?
Thanks in advance!
glmnet() fits regularized regression models such as the lasso; it comes from the R package of the same name. The alpha argument determines which type of model is fit: alpha=0 gives a ridge model and alpha=1 gives a lasso model (values in between give the elastic net).
cv.glmnet() performs cross-validation, by default 10-fold, which can be adjusted using nfolds. A 10-fold CV randomly divides your observations into 10 non-overlapping groups/folds of approximately equal size; each fold is used once as the validation set while the model is fit on the remaining 9 folds. The bias-variance trade-off is usually the motivation for using such validation methods. In the case of lasso and ridge models, CV helps choose the value of the tuning parameter lambda.
In your example, you can use plot(reg) or reg$lambda.min to see the value of lambda that results in the smallest CV error, and then derive the test MSE for that value of lambda. By default, glmnet() performs ridge or lasso regression over an automatically selected range of lambda values, which may not give the lowest test MSE. Hope this helps!
Between reg$lambda.min and reg$lambda.1se: lambda.min will give you the lowest cross-validated error, but depending on how much error you can tolerate, you may want to choose reg$lambda.1se, as this value further shrinks the number of predictors. You may also choose the mean of reg$lambda.min and reg$lambda.1se as your lambda value.
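To make this concrete, a minimal sketch of filling in the coef() placeholder from the question, taking the coefficients from the cross-validated fit at a chosen penalty:
# reg is the cv.glmnet object from the question
Co <- coef(reg, s = "lambda.min")   # or s = "lambda.1se" for a sparser model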
