glmnet: at what lambda is each coefficient shrunk to 0?

I am using LASSO (from package glmnet) to select variables. I have fitted a glmnet model and plotted the coefficients against lambda.
library(glmnet)
set.seed(47)
x = matrix(rnorm(100 * 3), 100, 3)
y = rnorm(100)
fit = glmnet(x, y)
plot(fit, xvar = "lambda", label = TRUE)
Now I want to get the order in which coefficients become 0. In other words, at what lambda does each coefficient become 0?
I don't find a function in glmnet to extract such result. How can I get it?

Function glmnetPath in my initial answer is now in an R package called solzy.
## you may need to first install package "remotes" from CRAN
remotes::install_github("ZheyuanLi/solzy")
## Zheyuan Li's R functions on Stack Overflow
library(solzy)
## use function `glmnetPath` for your example
glmnetPath(fit)
#$enter
# i j ord var lambda
#1 3 2 1 V3 0.15604809
#2 2 19 2 V2 0.03209148
#3 1 24 3 V1 0.02015439
#
#$leave
# i j ord var lambda
#1 1 23 1 V1 0.02211941
#2 2 18 2 V2 0.03522036
#3 3 1 3 V3 0.17126258
#
#$ignored
#[1] i var
#<0 rows> (or 0-length row.names)
Interpretation of enter
As lambda decreases, variables (see i for numeric ID and var for variable names) enter the model in turn (see ord for the order). The corresponding lambda for the event is fit$lambda[j].
variable 3 enters the model at lambda = 0.15604809, the 2nd value in fit$lambda;
variable 2 enters the model at lambda = 0.03209148, the 19th value in fit$lambda;
variable 1 enters the model at lambda = 0.02015439, the 24th value in fit$lambda.
Interpretation of leave
As lambda increases, variables (see i for numeric ID and var for variable names) leave the model in turn (see ord for the order). The corresponding lambda for the event is fit$lambda[j].
variable 1 leaves the model at lambda = 0.02211941, the 23rd value in fit$lambda;
variable 2 leaves the model at lambda = 0.03522036, the 18th value in fit$lambda;
variable 3 leaves the model at lambda = 0.17126258, the 1st value in fit$lambda.
Interpretation of ignored
If not an empty data.frame, it lists variables that never enter the model. That is, they are effectively ignored. (Yes, this can happen!)
Note: fit$lambda is decreasing, so j is in ascending order in enter but in descending order in leave.
To further explain indices i and j, take variable 2 as an example. It leaves the model (i.e., its coefficient becomes 0) at j = 18 and enters the model (i.e., its coefficient becomes non-zero) at j = 19. You can verify this:
fit$beta[2, 1:18]
## all zeros
fit$beta[2, 19:ncol(fit$beta)]
## all non-zeros
See Obtain variable selection order from glmnet for a more complicated example.
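For readers who would rather not install an extra package, the enter table can be approximated directly from the coefficient matrix. The sketch below uses a hypothetical helper, entry_lambda (not part of glmnet or solzy), which returns for each variable the first column with a non-zero coefficient and the matching lambda. It runs on a made-up toy matrix; with a real fit one would presumably pass as.matrix(fit$beta) and fit$lambda. It assumes lambda is in decreasing order and reports only the first entry, so a variable that leaves and later re-enters is not tracked.

```r
# Hypothetical helper (not from glmnet or solzy): first non-zero column per variable.
entry_lambda <- function(beta, lambda) {
  beta <- as.matrix(beta)
  stopifnot(ncol(beta) == length(lambda))
  vars <- rownames(beta)
  if (is.null(vars)) vars <- paste0("V", seq_len(nrow(beta)))
  # Column index of the first non-zero coefficient; NA if the variable never enters
  j <- apply(beta != 0, 1, function(nz) if (any(nz)) which(nz)[1] else NA_integer_)
  j <- unname(j)
  out <- data.frame(var = vars, j = j, lambda = lambda[j],
                    stringsAsFactors = FALSE)
  out <- out[order(out$j), ]   # earliest entry first, never-entered variables last
  rownames(out) <- NULL
  out
}

# Toy path (made-up numbers): 3 variables, 4 decreasing lambda values
beta <- rbind(V1 = c(0, 0,   0,   0.2),
              V2 = c(0, 0,   0.1, 0.3),
              V3 = c(0, 0.5, 0.6, 0.7))
lambda <- c(0.4, 0.3, 0.2, 0.1)
res <- entry_lambda(beta, lambda)
print(res)
```

The lambda at which a variable leaves (as lambda increases) is then the grid value just before its entry column, i.e. lambda[j - 1] when j > 1.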

Related

cor function with NA values due to 0 variance

Beginner R user here. I am using the cor function to get Kendall's tau-b rank correlation coefficient between 2 columns of a dataframe. Examples of such columns are as follows:
A B
1 1
1 2
1 3
When I use cor(d,method="kendall")
the result is NA for the correlation between A and B. Shouldn't it be 0? And if not, is there a way to replace this NA result with 0 using a parameter of the cor function?
Consider what would happen if we slightly perturb the constant column. We get vastly different solutions depending on the particular perturbation used. In fact we can get any correlation we like with different perturbations. As a result it really makes no sense to use any particular value for the correlation and it would be best left as NA.
x <- c(1, 1, 1)
y <- 1:3
cor(x + (1:3) * 1e-10, y, method = "spearman")
## [1] 1
cor(x - (1:3) * 1e-10, y, method = "spearman")
## [1] -1
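If a 0 is nevertheless needed for downstream code, cor has no parameter for this (its use argument only controls how missing input values are handled), so the replacement has to be done afterwards. The following is a minimal sketch on the data from the question; whether 0 is a sensible stand-in is a modelling decision, given the argument above.

```r
d <- data.frame(A = c(1, 1, 1), B = c(1, 2, 3))

# cor() may warn about a zero standard deviation; the A-vs-B entry is missing
m <- suppressWarnings(cor(d, method = "kendall"))

# Post-process: replace missing correlations by 0 (an explicit, arbitrary choice)
m0 <- m
m0[is.na(m0)] <- 0
print(m0["A", "B"])
```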

How to interpret cross validation output from cv.kknn (kknn package)

I'm trying to interpret the results I'm getting when trying to cross validate my data for a k-nearest neighbors model. My data set is set up like
variable1(int) | variable2(int) | variable3(int) | variable4 (int) | Response (factor)
I split my data 80% into cvdata and 20% for testing once I choose my model.
A single iteration for my code is below:
cv <- cv.kknn(formula = Response~., cvdata, kcv = 10, k = 7, kernel = 'optimal', scale = TRUE)
cv
When I run 'cv' it just returns a list() containing some seemingly random numbers as the rownames, the observed outcome variable (y) and predicted outcome variable (yhat). I'm trying to calculate some sort of accuracy to the test set. Should I be comparing y to yhat to validate?
EDIT: output added below
[[1]]
y yhat
492 1 0.724282776
654 0 0.250394372
427 0 0.125159894
283 0 0.098561768
218 1 0.409990851
[[2]]
[1] 0.2267058 0.1060212
The first element in [[2]] is mean absolute error, the second is mean squared error.
Suppose df is your dataframe, then those values can be easily tested by mean(abs(df$y - df$yhat)) and mean((df$y - df$yhat)^2).
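As a rough illustration, the sketch below types in the five rows shown in the question and computes both error measures plus a classification accuracy. The 0.5 cut-off for turning yhat into a class is an assumption, not something cv.kknn prescribes, and with a real result df would presumably come from as.data.frame(cv[[1]]). Because only five rows are used, the numbers will not match the values in [[2]].

```r
# The five rows printed in the question, entered by hand
df <- data.frame(
  y    = c(1, 0, 0, 0, 1),
  yhat = c(0.724282776, 0.250394372, 0.125159894, 0.098561768, 0.409990851)
)

mae <- mean(abs(df$y - df$yhat))   # mean absolute error
mse <- mean((df$y - df$yhat)^2)    # mean squared error

# Assumption: predict class 1 when yhat exceeds 0.5
pred_class <- as.integer(df$yhat > 0.5)
accuracy <- mean(pred_class == df$y)

print(c(mae = mae, mse = mse, accuracy = accuracy))
```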

Minimal depth interaction from randomForestExplainer package

So when using the minimal depth interaction feature of the randomForestExplainer package, in R, I'm getting some hard to interpret results.
I simulated some data (x1, x2,..., x5) where x1 is binary and x2-x5 are continuous. In my model, there are no interactions.
I'm using the randomForest package to create a random forest and then running it through the randomForestExplainer package.
Here's the code I'm using to simulate the data and random forest:
library(randomForest)
library(randomForestExplainer)
n <- 100
p <- 4
# Create data:
xrandom <- matrix(rnorm(n*p)+5, nrow=n)
colnames(xrandom)<- paste0("x",2:5)
d <- data.frame(xrandom)
d$x1 <- factor(sample(1:2, n, replace=T))
# Equation:
y <- d$x2 + rnorm(n)/5
y[d$x1==1] <- y[d$x1==1]+5
d$y <- y
# Random Forest:
fr <- randomForest(y ~ ., data=d,localImp=T)
# Random Forest Explainer:
interactions_frame <- min_depth_interactions(fr, names(d)[-6])
head(interactions_frame, 2)
This produces the following:
variable root_variable mean_min_depth occurrences interaction
1 x1 x1 4.670732 0 x1:x1
2 x1 x2 2.606190 221 x2:x1
uncond_mean_min_depth
1 1.703252
2 1.703252
So, my question is: if x1:x1 has 0 occurrences (which is expected), then how can it also have a mean_min_depth?
Surely if it has 0 occurrences, then it can't possibly have a minimum depth? [or rather, the min depth = 0 or NA]
What's going on here? Am I misinterpreting something?
Thanks
My understanding is that this has to do with the choice of the mean_sample argument of min_depth_interactions. The default choice replaces NAs with the depth of the maximal subtree whose root is x1. Details below.
What is this argument mean_sample for? It specifies how to deal with trees where the interaction of interest is not present. There are three options:
relevant_trees. This only considers the trees where the interaction of interest is present. In your example, this gives NA for mean_min_depth of interaction x1:x1, which is the behavior you were looking for.
interactions_frame <- min_depth_interactions(fr, names(d)[-6], mean_sample = "relevant_trees")
head(interactions_frame, 2)
variable root_variable mean_min_depth occurrences interaction uncond_mean_min_depth
1 x1 x1 NA 0 x1:x1 1.947475
2 x1 x2 1.426606 218 x2:x1 1.947475
all_trees. relevant_trees has a major problem: for an interaction that shows up in only a small number of trees, taking the mean of the conditional minimum depth ignores the fact that the interaction is not that important. In this case, a small mean conditional minimum depth doesn't mean the interaction is important. To address this, mean_sample = "all_trees" replaces the missing conditional minimum depth of the interaction of interest with the mean depth of the maximal subtree of the root variable. Basically, for the interaction x1:x2, a tree where this interaction is absent is given the depth of the deepest subtree whose root is x1. This gives a (hopefully large) numeric value to mean_min_depth of x1:x2, thus making it look less important.
interactions_frame <- min_depth_interactions(fr, names(d)[-6], mean_sample = "all_trees")
head(interactions_frame, 2)
variable root_variable mean_min_depth occurrences interaction uncond_mean_min_depth
1 x1 x1 4.787879 0 x1:x1 1.97568
2 x1 x2 3.654522 218 x2:x1 1.97568
top_trees. This is the default choice for mean_sample. My understanding is that it's similar to all_trees but tries to down-weight the contribution of the replaced missing values. The motivation is that all_trees pulls mean_min_depth toward the same value when there are many parameters but not enough observations, i.e. shallow trees. To reduce the contribution of the replaced values, top_trees calculates the mean conditional minimal depth only on a subset of n trees, where n is the number of trees in which ANY interaction with the specified root is present. Say that in your example only 300 of the 500 trees have any interaction x1:whatever; then only those 300 trees are considered when filling in the value for x1:x1. Because this interaction has 0 occurrences, replacing 500 NAs or 300 NAs with the same value doesn't affect the mean, so it's the same value, 4.787879. (There's a slight difference between our results; I think it has to do with seed values.)
interactions_frame <- min_depth_interactions(fr, names(d)[-6], mean_sample = "top_trees")
head(interactions_frame, 2)
variable root_variable mean_min_depth occurrences interaction uncond_mean_min_depth
1 x1 x1 4.787879 0 x1:x1 1.947475
2 x1 x2 2.951051 218 x2:x1 1.947475
This answer is based on my understanding of the package author's thesis: https://rawgit.com/geneticsMiNIng/BlackBoxOpener/master/randomForestExplainer_Master_thesis.pdf
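The three options can be illustrated with plain arithmetic. The sketch below is a simplified toy reading of the description above, not the package's actual implementation: depth, fill and n_top are made-up values standing for the per-tree conditional minimal depth (NA where the interaction is absent), the replacement depth, and the number of trees that contain any interaction with the given root.

```r
# Toy per-tree conditional minimal depths; NA = interaction absent in that tree
depth <- c(2, NA, 1, NA, NA, 3)
fill  <- 4   # assumed replacement value (depth of the maximal subtree of the root)
n_top <- 4   # assumed number of trees with ANY interaction for this root

n_obs <- sum(!is.na(depth))

# relevant_trees: average over trees where the interaction occurs
relevant <- mean(depth, na.rm = TRUE)

# all_trees: fill every missing tree with the replacement value
all_trees <- mean(ifelse(is.na(depth), fill, depth))

# top_trees: fill only as many missing trees as needed to reach n_top trees
top_trees <- (sum(depth, na.rm = TRUE) + (n_top - n_obs) * fill) / n_top

print(c(relevant = relevant, all_trees = all_trees, top_trees = top_trees))

# An interaction with 0 occurrences gets the fill value under both filling schemes
never <- rep(NA_real_, 6)
never_all <- mean(ifelse(is.na(never), fill, never))
never_top <- (sum(never, na.rm = TRUE) + n_top * fill) / n_top
```

In this toy case top_trees lands between the other two, and the never-occurring interaction receives the same value under all_trees and top_trees, which mirrors the x1:x1 rows above.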

R package caret confusionMatrix with missing categories

I am using the function confusionMatrix in the R package caret to calculate some statistics for some data I have. I have been putting my predictions as well as my actual values into the table function to get the table to be used in the confusionMatrix function as so:
table(predicted,actual)
However, there are multiple possible outcomes (e.g. A, B, C, D), and my predictions do not always represent all the possibilities (e.g. only A, B, D). The resulting output of the table function does not include the missing outcome and looks like this:
A B C D
A n1 n2 n3 n4
B n5 n6 n7 n8
D n9 n10 n11 n12
# Note how there is no corresponding row for `C`.
The confusionMatrix function can't handle the missing outcome and gives the error:
Error in !all.equal(nrow(data), ncol(data)) : invalid argument type
Is there a way I can use the table function differently to get the missing rows with zeros or use the confusionMatrix function differently so it will view missing outcomes as zero?
As a note: Since I am randomly selecting my data to test with, there are times that a category is also not represented in the actual result as opposed to just the predicted. I don't believe this will change the solution.
You can use union to ensure similar levels:
library(caret)
# Sample Data
predicted <- c(1,2,1,2,1,2,1,2,3,4,3,4,6,5) # Levels 1,2,3,4,5,6
reference <- c(1,2,1,2,1,2,1,2,1,2,1,3,3,4) # Levels 1,2,3,4
u <- union(predicted, reference)
t <- table(factor(predicted, u), factor(reference, u))
confusionMatrix(t)
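To see what the shared levels do to the table itself, here is a small base R sketch with made-up letter labels like those in the question; it does not call caret. The names lv and tab are illustrative only.

```r
# Made-up labels: class "C" is never predicted
predicted <- c("A", "B", "D", "A", "B")
actual    <- c("A", "B", "C", "D", "B")

# Use the same, sorted set of levels for both factors
lv  <- sort(union(predicted, actual))
tab <- table(predicted = factor(predicted, levels = lv),
             actual    = factor(actual,    levels = lv))
print(tab)   # square table; the row for "C" is all zeros
```

Sorting the union is optional, but it keeps the rows and columns in a predictable order.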
First note that confusionMatrix can be called as confusionMatrix(predicted, actual) in addition to being called with table objects. However, the function throws an error if predicted and actual (both regarded as factors) do not have the same number of levels.
This (and the fact that the caret package spit an error on me because they don't get the dependencies right in the first place) is why I'd suggest to create your own function:
# Create a confusion matrix from the given outcomes, whose rows correspond
# to the predicted and whose columns correspond to the actual classes.
createConfusionMatrix <- function(act, pred) {
  # You've mentioned that neither actual nor predicted may give a complete
  # picture of the available classes, hence:
  numClasses <- max(act, pred)
  # Split the predictions by actual class. Using a factor with all levels
  # keeps a column even for a class that never occurs in `act`.
  act <- factor(act, levels = seq_len(numClasses))
  sapply(split(pred, act), tabulate, nbins = numClasses)
}
# Generate random data since you've not provided an actual example.
actual <- sample(1:4, 1000, replace=TRUE)
predicted <- sample(c(1L,2L,4L), 1000, replace=TRUE)
print( createConfusionMatrix(actual, predicted) )
which will give you:
1 2 3 4
[1,] 85 87 90 77
[2,] 78 78 79 95
[3,] 0 0 0 0
[4,] 89 77 82 83
I had the same problem and here is my solution:
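tab <- table(my_prediction, my_real_label)
if (nrow(tab) != ncol(tab)) {
  # Classes that occur in the labels but were never predicted.
  # Assumes every predicted class also occurs in the labels.
  missings <- setdiff(colnames(tab), rownames(tab))
  missing_mat <- matrix(0, nrow = length(missings), ncol = ncol(tab),
                        dimnames = list(missings, colnames(tab)))
  # Append the all-zero rows, then reorder the rows to match the column order
  tab <- as.table(rbind(as.matrix(tab), missing_mat)[colnames(tab), , drop = FALSE])
}
my_conf <- confusionMatrix(tab)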
Cheers
Cankut

How to use glmnet in R for classification problems

I want to use the glmnet in R to do classification problems.
The sample data is as follows:
y,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11
1,0.766126609,45,2,0.802982129,9120,13,0,6,0,2
0,0.957151019,40,0,0.121876201,2600,4,0,0,0,1
0,0.65818014,38,1,0.085113375,3042,2,1,0,0,0
y is a binary response (0 or 1).
I used the following R code:
prr=cv.glmnet(x,y,family="binomial",type.measure="auc")
yy=predict(prr,newx, s="lambda.min")
However, the predicted yy by glmnet is scattered between [-24,5].
How can I restrict the output values to [0,1] so that I can use them for classification?
I have read the manual again and found that type="response" in predict method will produce what I want:
lassopre2=predict(prr,newx, type="response")
will output values between [0,1]
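The reason is that, for family="binomial", predict returns the linear predictor (log-odds) by default, and type="response" applies the logistic function to it. The sketch below shows that mapping in base R on made-up log-odds values; the 0.5 cut-off for a class label is an assumption. glmnet's predict also accepts type="class" for binomial models, which returns the predicted labels directly.

```r
# Made-up linear predictor values on the log-odds scale
link <- c(-24, -1.2, 0, 5)

# Logistic function: 1 / (1 + exp(-link)); this is what type = "response" reports
prob <- plogis(link)

# Assumption: label an observation 1 when its probability exceeds 0.5
cls <- as.integer(prob > 0.5)

print(data.frame(link = link, prob = prob, class = cls))
```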
A summary of the glmnet path at each step is displayed if we just enter the object name or use the print function:
print(fit)
##
## Call: glmnet(x = x, y = y)
##
## Df %Dev Lambda
## [1,] 0 0.0000 1.63000
## [2,] 2 0.0553 1.49000
## [3,] 2 0.1460 1.35000
## [4,] 2 0.2210 1.23000
It shows from left to right the number of nonzero coefficients (Df), the percent (of null) deviance explained (%Dev) and the value of λ (Lambda). Although by default glmnet calls for 100 values of lambda, the program stops early if %Dev does not change sufficiently from one lambda to the next (typically near the end of the path).
We can obtain the actual coefficients at one or more λ's within the range of the sequence:
coef(fit,s=0.1)
## 21 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 0.150928
## V1 1.320597
## V2 .
## V3 0.675110
## V4 .
## V5 -0.817412
For more information, see the original explanation by Hastie.
