How to use glmnet in R for classification problems

I want to use glmnet in R for classification problems.
The sample data is as follows:
y,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11
1,0.766126609,45,2,0.802982129,9120,13,0,6,0,2
0,0.957151019,40,0,0.121876201,2600,4,0,0,0,1
0,0.65818014,38,1,0.085113375,3042,2,1,0,0,0
y is a binary response (0 or 1).
I used the following R code:
prr=cv.glmnet(x,y,family="binomial",type.measure="auc")
yy=predict(prr,newx, s="lambda.min")
However, the yy predicted by glmnet is scattered over roughly [-24, 5].
How can I restrict the output to [0, 1] so that I can use it for classification?

I have read the manual again and found that type="response" in the predict method produces what I want:
lassopre2=predict(prr, newx, type="response")
This outputs values in [0, 1], i.e. predicted probabilities.
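For completeness, here is a minimal sketch of the different prediction scales, assuming x is a numeric matrix, y is a 0/1 vector, and newx has the same columns as x:

library(glmnet)
prr <- cv.glmnet(x, y, family = "binomial", type.measure = "auc")

## linear predictor (log-odds), unbounded; this is what the default type = "link" returns
eta <- predict(prr, newx, s = "lambda.min", type = "link")

## predicted probabilities, always in [0, 1]
p <- predict(prr, newx, s = "lambda.min", type = "response")

## hard 0/1 class labels, e.g. by thresholding the probabilities at 0.5
yhat <- ifelse(p > 0.5, 1, 0)

## type = "class" returns the most probable class label directly
cls <- predict(prr, newx, s = "lambda.min", type = "class")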

A summary of the glmnet path at each step is displayed if we just enter the object name or use the print function:
print(fit)
##
## Call: glmnet(x = x, y = y)
##
## Df %Dev Lambda
## [1,] 0 0.0000 1.63000
## [2,] 2 0.0553 1.49000
## [3,] 2 0.1460 1.35000
## [4,] 2 0.2210 1.23000
It shows from left to right the number of nonzero coefficients (Df), the percent of null deviance explained (%Dev) and the value of λ (Lambda). Although by default glmnet calls for 100 values of lambda, the program stops early if %Dev does not change sufficiently from one lambda to the next (typically near the end of the path).
We can obtain the actual coefficients at one or more values of λ within the range of the sequence:
coef(fit,s=0.1)
## 21 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 0.150928
## V1 1.320597
## V2 .
## V3 0.675110
## V4 .
## V5 -0.817412
See the original explanation by Hastie for more information.
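The same idea applies to the cross-validated fit from the question above; a short sketch, assuming prr is the cv.glmnet object:

## coefficients at the lambda minimizing the cross-validation error,
## or at the more conservative lambda.1se
coef(prr, s = "lambda.min")
coef(prr, s = "lambda.1se")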

Related

glmnet: at what lambda is each coefficient shrunk to 0?

I am using the LASSO (from package glmnet) to select variables. I have fitted a glmnet model and plotted the coefficients against the lambda values.
library(glmnet)
set.seed(47)
x = matrix(rnorm(100 * 3), 100, 3)
y = rnorm(100)
fit = glmnet(x, y)
plot(fit, xvar = "lambda", label = TRUE)
Now I want to get the order in which coefficients become 0. In other words, at what lambda does each coefficient become 0?
I can't find a function in glmnet to extract such a result. How can I get it?
The function glmnetPath from my initial answer is now in an R package called solzy.
## you may need to first install package "remotes" from CRAN
remotes::install_github("ZheyuanLi/solzy")
## Zheyuan Li's R functions on Stack Overflow
library(solzy)
## use function `glmnetPath` for your example
glmnetPath(fit)
#$enter
# i j ord var lambda
#1 3 2 1 V3 0.15604809
#2 2 19 2 V2 0.03209148
#3 1 24 3 V1 0.02015439
#
#$leave
# i j ord var lambda
#1 1 23 1 V1 0.02211941
#2 2 18 2 V2 0.03522036
#3 3 1 3 V3 0.17126258
#
#$ignored
#[1] i var
#<0 rows> (or 0-length row.names)
Interpretation of enter
As lambda decreases, variables (see i for numeric ID and var for variable names) enter the model in turn (see ord for the order). The corresponding lambda for the event is fit$lambda[j].
variable 3 enters the model at lambda = 0.15604809, the 2nd value in fit$lambda;
variable 2 enters the model at lambda = 0.03209148, the 19th value in fit$lambda;
variable 1 enters the model at lambda = 0.02015439, the 24th value in fit$lambda.
Interpretation of leave
As lambda increases, variables (see i for numeric ID and var for variable names) leave the model in turn (see ord for the order). The corresponding lambda for the event is fit$lambda[j].
variable 1 leaves the model at lambda = 0.02211941, the 23rd value in fit$lambda;
variable 2 leaves the model at lambda = 0.03522036, the 18th value in fit$lambda;
variable 3 leaves the model at lambda = 0.17126258, the 1st value in fit$lambda.
Interpretation of ignored
If not an empty data.frame, it lists variables that never enter the model. That is, they are effectively ignored. (Yes, this can happen!)
Note: fit$lambda is decreasing, so j is in ascending order in enter but in descending order in leave.
To further explain indices i and j, take variable 2 as an example. It leaves the model (i.e., its coefficient becomes 0) at j = 18 and enters the model (i.e., its coefficient becomes non-zero) at j = 19. You can verify this:
fit$beta[2, 1:18]
## all zeros
fit$beta[2, 19:ncol(fit$beta)]
## all non-zeros
See Obtain variable selection order from glmnet for a more complicated example.
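If you only need the "enter" information, here is a minimal base-R sketch (not part of solzy; it simply scans fit$beta) that reports, for each variable, the first lambda at which its coefficient becomes nonzero:

beta <- as.matrix(fit$beta)  ## p x n_lambda coefficient matrix
first_nz <- apply(beta != 0, 1, function(z) if (any(z)) which(z)[1] else NA)
data.frame(var = rownames(beta),
           j = first_nz,
           lambda = fit$lambda[first_nz])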

Generating data from correlation matrix: the case of bivariate distributions [duplicate]

This question already has an answer here:
Simulating correlated Bernoulli data
An apparently simple problem: I want to generate 2 (simulated) variables (x, y) from a bivariate distribution with a given correlation matrix between them. In other words, I want two variables/vectors with values of either 0 or 1 and a defined correlation between them.
The case of the normal distribution is easy with the MASS package.
library(MASS)
library(dplyr)
df_norm = mvrnorm(
  100, mu = c(x = 0, y = 0),
  Sigma = matrix(c(1, 0.5, 0.5, 1), nrow = 2),
  empirical = TRUE) %>%
  as.data.frame()
cor(df_norm)
x y
x 1.0 0.5
y 0.5 1.0
Yet, how could I generate binary data from a given correlation matrix?
This is not working:
df_bin = df_norm %>%
  mutate(
    x = ifelse(x < 0, 0, 1),
    y = ifelse(y < 0, 0, 1))
x y
1 0 1
2 0 1
3 1 1
4 0 1
5 1 0
6 0 0
7 1 1
8 1 1
9 0 0
10 1 0
This creates binary variables, but the correlation is not (even close to) 0.5.
cor(df_bin)
x y
x 1.0000000 0.2994996
y 0.2994996 1.0000000
Ideally I would like to be able to specify the type of distribution as an argument in the function (as in the lm() function).
Any idea?
I guessed that you weren't looking for binary, as in values of either zero or one. If that is what you're looking for, this isn't going to help.
I think what you want to look at is the construction of a binary pair-copula.
You said you wanted to specify the distribution. The package VineCopula would be a good start.
You can use the correlation matrix to simulate the data after selecting the distribution. You mentioned lm(), and Gaussian (the normal distribution) is an option.
You can read about this approach in Lin and Chagnaty (2021). The package documentation isn't based on their work, but that's where I started when I looked for your answer.
I used a correlation of 0.5 and the Gaussian copula to create 100 pairs of points in this example:
# vine copula simulation
library(VineCopula)
set.seed(246543)
df <- BiCopSim(100, 1, .5)  # family = 1 is the Gaussian copula, par = 0.5
head(df)
# [,1] [,2]
# [1,] 0.07585682 0.38413426
# [2,] 0.44705686 0.76155029
# [3,] 0.91419758 0.56181837
# [4,] 0.65891869 0.41187594
# [5,] 0.49187672 0.20168128
# [6,] 0.05422541 0.05756005
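If you do want 0/1 values with a target correlation (this is my own extension of the example above, not part of the original answer), one option is to threshold the Gaussian copula sample and compensate for the attenuation that dichotomizing causes: for 0.5-probability margins, thresholding a Gaussian copula with parameter rho gives binary correlation 2*asin(rho)/pi, so you can invert that with rho = sin(pi*r/2).

library(VineCopula)
set.seed(246543)
r <- 0.5                        # target correlation of the binary variables
rho <- sin(pi * r / 2)          # copula parameter that yields r after thresholding at 0.5
u <- BiCopSim(10000, 1, rho)    # family = 1: Gaussian copula
df_bin2 <- as.data.frame(ifelse(u > 0.5, 1, 0))
cor(df_bin2)                    # off-diagonal entries should be close to 0.5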

Why are my estimated and theoretical results for Sobol sensitivity analysis different?

I am working on Sobol sensitivity analysis. I am trying to compute the first-order and total-effect indices both by estimation and theoretically.
First, I computed the estimated values by following the steps in the Wikipedia article "Variance-based sensitivity analysis".
Here is the code:
set.seed(123)
x_1<-ceiling(1000*runif(1000)) ## generate x1 from range [1,1000]
x_2<-ceiling(100*runif(1000)) ## generate x2 from range [1,100]
x_3<-ceiling(10*runif(1000)) ## generate x3 from range [1,10]
x<-cbind(x_1,x_2,x_3) ## combine as one matrix
A<-matrix(x[1:500,],ncol=3) ## divide this one matrix into two
B<-matrix(x[501:1000,],ncol=3)
AB1<-cbind(B[,1],A[,-1]) ## replace the first column of sample A by the first column of sample B
AB2<-cbind(A[,1],B[,2],A[,3]) ## replace the second column of sample A by the second column of sample B
AB3<-cbind(A[,-3],B[,3]) ## replace the third column of sample A by the third column of sample B
trial<-function(x){ ## define the trial function: x1 + x2*x3^2
  x[,1]+(x[,2])*(x[,3])^2
}
Y_A<-trial(A) ## the output of A
Y_B<-trial(B) ## the output of B
Y_AB1<-trial(AB1) ## the output of AB1
Y_AB2<-trial(AB2) ## the output of AB2
Y_AB3<-trial(AB3) ## the output of AB3
Y<-matrix(cbind(Y_A,Y_B),ncol=1) ## the matrix of total outputs
S1<-mean(Y_B*(Y_AB1-Y_A))/var(Y) ## first order effect of x1
St1<-(sum((Y_A-Y_AB1)^2)/(2*500))/var(Y) ## total order effect of x1
S2<-mean(Y_B*(Y_AB2-Y_A))/var(Y) ## first order effect of x2
St2<-(sum((Y_A-Y_AB2)^2)/(2*500))/var(Y) ## total order effect of x2
S3<-mean(Y_B*(Y_AB3-Y_A))/var(Y) ## first order effect of x3
St3<-(sum((Y_A-Y_AB3)^2)/(2*500))/var(Y) ## total order effect of x3
S<-matrix(c(S1,St1,S2,St2,S3,St3),nrow=2,ncol=3) ## define the results matrix
rownames(S)<-list("first order","total")
colnames(S)<-list("X1","X2","X3")
S ## print result
The results are:
X1 X2 X3
first order 0.01734781 0.2337758 0.5261082
total 0.01523861 0.4078107 0.6471387
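One way to sanity-check these pick-and-freeze estimates (my addition, not from the original post) is the sensitivity package, which implements estimators of the same type; a rough sketch, assuming the soboljansen() model/X1/X2 interface:

library(sensitivity)
sj <- soboljansen(model = trial,
                  X1 = as.data.frame(A),  # first half of the sample
                  X2 = as.data.frame(B))  # second half of the sample
sj$S  # first-order indices
sj$T  # total-effect indices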
Then I wanted to compute the theoretical results to validate the estimated results above. I followed the equations in the picture below:
In order to compute the partial variance, I approached it in two ways: Monte Carlo integration and doing the integration by hand (I just don't trust the computer...).
Here is the code:
pimc=function(n){ ## Monte Carlo integration
  a=runif(n)
  g=(a+mean(x_2)*(mean(x_3)^2))^2
  pimc=mean(g)
  pimc
}
pimc(10000)/var(Y)
# doing the integration by hand
(1/3+mean(x_2)*(mean(x_3))^2+((mean(x_2))^2)*((mean(x_3))^4))/var(Y)
## first order effect x1
(mean(x_1)^2+mean(x_1)*mean(x_3)^2+(1/3)*mean(x_3)^4)/var(Y)
## first order effect x2
(mean(x_1)^2+(2/3)*(mean(x_1)*mean(x_2))+(1/5)*(mean(x_2)^2))/var(Y)
## first order effect x3
The results are the following:
[,1]
[1,] 0.4414232  ## first-order index of x1 computed by Monte Carlo integration
[,1]
[1,] 0.4414238  ## first-order index of x1 computed by hand
[,1]
[1,] 0.0505044  ## first-order index of x2 computed by hand
[,1]
[1,] 0.05086958 ## first-order index of x3 computed by hand
The results are consistent between Monte Carlo integration and hand calculation, but they are quite different from the estimated results. Why is this?
When we do the partial integral, we only look at one parameter and treat the other parameters as constants, so I chose to take the mean of each parameter as the constant. Is that wrong?

How to use the mgarchBEKK package in R?

I am trying to estimate a BEKK GARCH model using the mgarchBEKK package, which is available here.
library(quantmod)
library(rugarch)
library(mgarchBEKK)
eps<- read.csv("C.csv", header=TRUE)
> head(eps)
v1 v2
1 -0.001936598 0.001968415
2 -0.000441797 -0.002724438
3 0.003752762 -0.010221719
4 -0.004511632 -0.014637860
5 -0.001426905 0.010597786
6 0.007435739 -0.005880712
> tail(eps)
v1 v2
1954 -0.043228944 0.0000530712
1955 0.082546871 -0.0028188110
1956 0.025058992 0.0058264010
1957 0.001751445 -0.0298050150
1958 -0.007973320 -0.0037243560
1959 -0.005207348 0.0012664230
## Simulate a BEKK process:
simulated <- simulateBEKK(2,1959, c(1,1), params = NULL)
## Prepare the input for the estimation process:
simulated1 <- do.call(cbind, simulated$eps)
## Estimate with default arguments:
estimated <- BEKK(simulated1)
H IS SINGULAR!...
H IS SINGULAR!...
Warning message:
In BEKK(simulated1) : negative inverted hessian matrix element
## Show diagnostics:
diagnoseBEKK(estimated)
## Likewise, you can estimate an mGJR process:
estimated2 <- mGJR(simulated1[,1], simulated1[,2])
I don't know what the problem in my code is, because the results show 3968 estimated series instead of 2.
You are estimating a model.
What does that mean?
To quote my statistics professors, the philosophy is to "find a stochastic model which may have created the observed series."
That is exactly what mgarchBEKK is doing. The model is being fitted to the data you supplied (your V1 and V2 series). In simple words this means that a lot of different parameter combinations are tried out, and of those (in your case 3968 tries) the combination that "fits best" is what you see in your results.
I did the same with 3 time series of length 8596. My results look something like this:
Number of estimated series : 25788
Length of estimated series : 8596
Estimation Time : 3.258482
So the number of estimated series is way above the 3 vectors I used.
The estimates look something like this (since this is a bi- or multivariate model estimation, you get parameter matrices, not single values as you would with a one-dimensional GARCH model):
C estimates:
[,1] [,2] [,3]
[1,] 0.9797469 0.2189191 0.202451941
[2,] 0.0000000 1.0649323 0.003050169
[3,] 0.0000000 0.0000000 0.896492130
ARCH estimates:
[,1] [,2] [,3]
[1,] 0.29110077 -0.008445699 0.008570904
[2,] -0.02109381 0.419092657 0.325321939
[3,] -0.01280835 -0.057648910 0.482502301
GARCH estimates:
[,1] [,2] [,3]
[1,] -0.27770297 0.03587415 -0.73029389
[2,] -0.05172256 -0.25601327 0.01918367
[3,] 0.07945086 0.03364686 -0.50664759
I can't describe the math behind this fitting; as far as I know, some form of maximum likelihood estimation is used.
I'm a rookie in statistics, so if anything I said is wrong please feel free to correct me.
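As a side note (my addition, not from the original answer): in the question the model is estimated on the output of simulateBEKK() rather than on the series read from C.csv. If the goal is to fit the two observed return series themselves, a minimal sketch using only the calls already shown above would be:

library(mgarchBEKK)
eps <- read.csv("C.csv", header = TRUE)
## fit a BEKK model (default settings) directly to the two observed series
estimated <- BEKK(as.matrix(eps))
diagnoseBEKK(estimated)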

Kernel PCA with kernlab and classification of the colon-cancer dataset

I need to perform kernel PCA on the colon-cancer dataset, and then plot the number of principal components vs. classification accuracy with the PCA data.
For the first part I am using kernlab in R as follows (let the number of features be 2; I will then vary it from, say, 2 to 100):
kpc <- kpca(~.,data=data[,-1],kernel="rbfdot",kpar=list(sigma=0.2),features=2)
I am having a tough time understanding how to use this PCA data for classification (I can use any classifier, e.g. an SVM).
EDIT: My question is how to feed the output of the PCA into a classifier.
(Screenshots of the cleaned data and the uncleaned original data were shown in the original post.)
I will show you a small example of how to use the kpca function of the kernlab package.
I checked the colon-cancer file, but it needs a bit of cleaning before it can be used, so I will use a random data set to demonstrate.
Assume the following data set:
y <- rep(c(-1,1), c(50,50))
x1 <- runif(100)
x2 <- runif(100)
x3 <- runif(100)
x4 <- runif(100)
x5 <- runif(100)
df <- data.frame(y,x1,x2,x3,x4,x5)
> df
y x1 x2 x3 x4 x5
1 -1 0.125841208 0.040543611 0.317198114 0.40923767 0.635434021
2 -1 0.113818719 0.308030825 0.708251147 0.69739496 0.839856000
3 -1 0.744765204 0.221210582 0.002220568 0.62921565 0.907277935
4 -1 0.649595597 0.866739474 0.609516644 0.40818013 0.395951297
5 -1 0.967379006 0.926688915 0.847379556 0.77867315 0.250867680
6 -1 0.895060293 0.813189446 0.329970821 0.01106764 0.123018797
7 -1 0.192447416 0.043720717 0.170960540 0.03058768 0.173198036
8 -1 0.085086619 0.645383728 0.706830885 0.51856286 0.134086770
9 -1 0.561070374 0.134457795 0.181368729 0.04557505 0.938145228
In order to run the kernel PCA you need to do:
kpc <- kpca(~., data = df[,-1], kernel = "rbfdot", kpar = list(sigma = 0.2), features = 4)
which is the same way you use it. However, I need to point out that the features argument is the number of principal components and not the number of classes in your y variable. Maybe you knew this already, but having 2000 variables and producing only 2 principal components might not be what you are looking for. You need to choose this number carefully by checking the eigenvalues. In your case I would probably compute 100 principal components and choose the first n of them according to the highest eigenvalues. Let's see this in my random example after running the previous code:
In order to see the eigenvalues:
> kpc@eig
Comp.1 Comp.2 Comp.3 Comp.4
0.03756975 0.02706410 0.02609828 0.02284068
In my case all of the components have extremely low eigenvalues because my data is random. In your case I assume you will get better ones. You need to choose the n components that have the highest values. A value of zero shows that the component does not explain any of the variance. (Just for the sake of the demonstration I will use all of them in the SVM below.)
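One simple way to act on that advice (my own sketch, assuming the kpc object from above) is to look at the share of variance among the extracted components and keep enough of them to reach some cumulative share:

eig_prop <- kpc@eig / sum(kpc@eig)           # share of variance among the extracted components
cumsum(eig_prop)                             # cumulative share
n_keep <- which(cumsum(eig_prop) >= 0.9)[1]  # e.g. keep enough components to reach 90%
n_keep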
In order to access the principal components, i.e. the PCA output, you do this:
> kpc@pcv
[,1] [,2] [,3] [,4]
[1,] -0.1220123051 1.01290883 -0.935265092 0.37279158
[2,] 0.0420830469 0.77483019 -0.009222970 1.14304032
[3,] -0.7060568260 0.31153129 -0.555538694 -0.71496666
[4,] 0.3583160509 -0.82113573 0.237544936 -0.15526000
[5,] 0.1158956953 -0.92673486 1.352983423 -0.27695507
[6,] 0.2109994978 -1.21905573 -0.453469345 -0.94749503
[7,] 0.0833758766 0.63951377 -1.348618472 -0.26070127
[8,] 0.8197838629 0.34794455 0.215414610 0.32763442
[9,] -0.5611750477 -0.03961808 -1.490553198 0.14986663
...
...
This returns a matrix with 4 columns, i.e. the value of the features argument, which is the PCA output, i.e. the principal components. kernlab uses the S4 method dispatch system, and that is why you use @ as in kpc@pcv.
You then need to use the above matrix to feed an SVM in the following way:
svmmatrix <- kpc@pcv
library(e1071)
svm(svmmatrix, as.factor(y))
Call:
svm.default(x = svmmatrix, y = as.factor(y))
Parameters:
SVM-Type: C-classification
SVM-Kernel: radial
cost: 1
gamma: 0.25
Number of Support Vectors: 95
And that's it! A very good explanation of PCA that I found on the internet can be found here, in case you or anyone else reading this wants to find out more.
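For the second part of the question (plotting the number of principal components against classification accuracy), here is a rough sketch building on the same random df example; the 2:5 range and the use of in-sample accuracy are arbitrary choices for illustration, and on the real data you would want a proper train/test split:

library(kernlab)
library(e1071)
acc <- sapply(2:5, function(k) {
  kp <- kpca(~., data = df[,-1], kernel = "rbfdot", kpar = list(sigma = 0.2), features = k)
  fit <- svm(kp@pcv, as.factor(df$y))      # train an SVM on the k principal components
  pred <- predict(fit, kp@pcv)             # predict on the same components (in-sample)
  mean(pred == as.factor(df$y))            # classification accuracy
})
plot(2:5, acc, type = "b",
     xlab = "number of principal components", ylab = "classification accuracy")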
