Confusion about the 'standardize' option of the glmnet package in R

I am confused about the standardize option of the glmnet package in R. I get different coefficients when I standardize the covariate matrix myself and set standardize=FALSE than when I leave the covariates unstandardized and set standardize=TRUE. I assumed the two would be the same! The example below illustrates this with the models ridge.mod1 and ridge.mod2. I also fit a third model (ridge.mod3) that standardizes the outcome as well as the covariate matrix and uses standardize=FALSE, just to check whether I also needed to standardize the outcome to reproduce the coefficients of ridge.mod1.
set.seed(1)
y <- rnorm(30, 20, 10)
x1 <- rnorm(30, 5, 2)
x2 <- x1 + rnorm(30, 0, 5)
cor(x1,x2)
x <- as.matrix(cbind(x1,x2))
z1 <- scale(x1)
z2 <- scale(x2)
z <- as.matrix(cbind(z1,z2))
y.scale <- scale(y)
n <- 30
# Fixing foldid for proper comparison
foldid=sample(rep(seq(5),length=n))
table(foldid)
library(glmnet)
cv.ridge.mod1 <- cv.glmnet(x, y, alpha = 0, nfolds = 5, foldid=foldid, standardize = TRUE)
ridge.mod1 <- glmnet(x, y, alpha = 0, standardize = TRUE)
coef(ridge.mod1, s=cv.ridge.mod1$lambda.min)
> coef(ridge.mod1, s=cv.ridge.mod1$lambda.min)
3 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 2.082458e+01
x1 2.856136e-37
x2 4.334910e-38
cv.ridge.mod2 <- cv.glmnet(z, y, alpha = 0, nfolds = 5, foldid=foldid, standardize = FALSE)
ridge.mod2 <- glmnet(z, y, alpha = 0, standardize = FALSE)
coef(ridge.mod2, s=cv.ridge.mod2$lambda.min)
> coef(ridge.mod2, s=cv.ridge.mod2$lambda.min)
3 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 2.082458e+01
V1 4.391657e-37
V2 2.389751e-37
cv.ridge.mod3 <- cv.glmnet(z, y.scale, alpha = 0, nfolds = 5, foldid=foldid, standardize = FALSE)
ridge.mod3 <- glmnet(z, y.scale, alpha = 0, standardize = FALSE)
coef(ridge.mod3, s=cv.ridge.mod3$lambda.min)
> coef(ridge.mod3, s=cv.ridge.mod3$lambda.min)
3 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 1.023487e-16
V1 4.752255e-38
V2 2.585973e-38
Could anyone please tell me what's going on there and if (or how) I can get the same coefficients as in ridge.mod1 with prior standardization (in the data processing step) and then using standardize=FALSE?
Update (what I tried based on the comments below):
I tried standardizing by SS/n instead of SS/(n-1), and I also tried standardizing both y and x. Neither gave me coefficients equal to the de-standardized coefficients of model 1.
## Standardizing by sqrt(SS(X)/n), as glmnet does, instead of sqrt(SS(X)/(n-1)), which is what the scale command uses
Xs <- apply(x, 2, function(m) (m - mean(m)) / sqrt(sum(m^2) / n))
Ys <- (y-mean(y)) / sqrt(sum(y^2) / n)
# Standardizing only X by sqrt(SS(X)/n)
cv.ridge.mod4 <- cv.glmnet(Xs, y, alpha = 0, nfolds = 5, foldid=foldid, standardize = FALSE)
ridge.mod4 <- glmnet(Xs, y, alpha = 0, standardize = FALSE)
coef(ridge.mod4, s=cv.ridge.mod4$lambda.min)
> coef(ridge.mod4, s=cv.ridge.mod4$lambda.min)[2]/sd(x1)
[1] 7.995171e-38
> coef(ridge.mod4, s=cv.ridge.mod4$lambda.min)[3]/sd(x2)
[1] 2.957854e-38
# Standardizing both Y and X by sqrt(SS(.)/n), but with neither centered
cv.ridge.mod6 <- cv.glmnet(Xs.noncentered, Ys.noncentered, alpha = 0, nfolds = 5, foldid=foldid, standardize = FALSE)
ridge.mod6 <- glmnet(Xs.noncentered, Ys.noncentered, alpha = 0, standardize = FALSE)
coef(ridge.mod6, s=cv.ridge.mod6$lambda.min)
> coef(ridge.mod6, s=cv.ridge.mod6$lambda.min)[2] / (sqrt(sum(x1^2) / n))
[1] 1.019023e-39
> coef(ridge.mod6, s=cv.ridge.mod6$lambda.min)[3] / (sqrt(sum(x2^2) / n))
[1] 9.189263e-40
What am I still doing wrong here?

I tweaked your code so that I could work with a more sensible problem. To reproduce the coefficients across the standardize=TRUE and standardize=FALSE options, you first need to standardize the variables using the 1/N variance estimator (the one glmnet uses internally). For this example I also centered the variables to get rid of the intercept, and I focus only on the coefficients of the variables. Because the second model is fit on standardized y and x, its coefficients are on the standardized scale, so you then have to invert that standardization to recover the de-standardized coefficients. I do that in the following code.
set.seed(1)
x1 <- rnorm(300, 5, 2)
x2 <- x1 + rnorm(300, 0, 5)
x3 <- rnorm(300, 6, 5)
e= rnorm(300, 0, 1)
y <- 0.3*x1+3.5*x2+x3+e
x <- as.matrix(cbind(x1,x2,x3))
sdN = function(x){
  # standard deviation with the 1/N (population) divisor, as glmnet uses internally
  sigma = sqrt((1/length(x)) * sum((x - mean(x))^2))
  return(sigma)
}
n=300
foldid=sample(rep(seq(5),length=n))
g1=(x1-mean(x1))/sdN(x1)
g2=(x2-mean(x2))/sdN(x2)
g3=(x3-mean(x3))/sdN(x3)
gy=(y-mean(y))/sdN(y)
equis <- as.matrix(cbind(g1,g2,g3))
library(glmnet)
cv.ridge.mod1 <- cv.glmnet(x, y, alpha = 0, nfolds = 5, foldid=foldid,standardize = TRUE)
coef(cv.ridge.mod1, s=cv.ridge.mod1$lambda.min)
cv.ridge.mod2 <- cv.glmnet(equis, gy, alpha = 0, nfolds = 5, foldid=foldid, standardize = FALSE, intercept=FALSE)
beta=coef(cv.ridge.mod2, s=cv.ridge.mod2$lambda.min)
beta[2]*sdN(y)/sdN(x1)
beta[3]*sdN(y)/sdN(x2)
beta[4]*sdN(y)/sdN(x3)
coef(cv.ridge.mod1, s=cv.ridge.mod1$lambda.min)
This yields the following results:
> beta[2]*sdN(y)/sdN(x1)
[1] 0.5984356
> beta[3]*sdN(y)/sdN(x2)
[1] 3.166033
> beta[4]*sdN(y)/sdN(x3)
[1] 0.9145646
>
> coef(cv.ridge.mod1, s=cv.ridge.mod1$lambda.min)
4 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 0.5951423
x1 0.5984356
x2 3.1660328
x3 0.9145646
As you can see, the coefficients agree to four decimal places. I hope this answers your question.
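If you also want the intercept on the original scale (it was removed here by centering y and the predictors), you can recover it from the de-standardized slopes and the sample means. A minimal sketch, reusing beta, sdN and the raw variables from the code above:
# de-standardized slopes, as computed above
b <- c(beta[2], beta[3], beta[4]) * sdN(y) / c(sdN(x1), sdN(x2), sdN(x3))
# intercept on the original scale: ybar - sum(b_j * xbar_j)
b0 <- mean(y) - sum(b * c(mean(x1), mean(x2), mean(x3)))
b0  # compare with the (Intercept) entry of coef(cv.ridge.mod1, s = cv.ridge.mod1$lambda.min)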

Related

Monte Carlo simulations for VAR models

I've been trying to estimate VAR models using Monte Carlo simulation. I have 3 endogenous variables and need some guidance on two points.
First, I want to add outliers as a percentage of the sample size.
Second (a second simulation for the same model), instead of outliers I want to draw the errors from a multivariate contaminated normal distribution, 0.9 N(0, I) + 0.1 N((0, 0, 0)', diag(100, 100, 100)).
Could you tell me how to do these?
Thank you.
library(tsDyn)       # for VECM()
library(MonteCarlo)  # for MonteCarlo() and MakeTable()

RR <- function(n, out){
  # n is the number of observations
  k <- 3 # number of endogenous variables
  p <- 2 # number of lags
  # add outlier
  n[1] <- n[1] + out
  # Generate coefficient matrices
  B1 <- matrix(c(.1, .3, .4, .1, -.2, -.3, .03, .1, .1), k) # coefficient matrix of lag 1
  B2 <- matrix(c(0, .2, .1, .07, -.4, -.1, .5, 0, -.1), k)  # coefficient matrix of lag 2
  M <- cbind(B1, B2) # companion form of the coefficient matrices
  # Generate the series
  DT <- matrix(0, k, n + 2*p) # raw series, initialised with zeros
  for (i in (p + 1):(n + 2*p)){ # generate the series with e ~ N(0, 1)
    DT[, i] <- B1 %*% DT[, i-1] + B2 %*% DT[, i-2] + rnorm(k, 0, 1)
  }
  DT <- ts(t(DT[, -(1:p)])) # convert to time series format
  #names <- c("V1", "V2", "V3") # rename variables
  colnames(DT) <- c("Y1", "Y2", "Y3")
  #plot.ts(DT) # plot the series
  # estimate VECMs with cointegration ranks 2 and 1
  vecm1 <- VECM(DT, lag = 2, r = 2, include = "const", estim = "ML")
  vecm2 <- VECM(DT, lag = 2, r = 1, include = "const", estim = "ML")
  # mean squared errors
  mse1 <- mean(vecm1$residuals^2)
  mse2 <- mean(vecm2$residuals^2)
  #param_list <- unname(param_list)
  return(list("mse1" = mse1, "mse2" = mse2))
}
# define the parameter grids (the ranges of the parameters we want to run the function over)
n_grid = c(50, 80, 200, 400)
out_grid = c(0 ,5, 10)
# collect parameter grids in a list (to enter it into the Monte Carlo function)
prml = list("n" = n_grid, "out" = out_grid)
# run simulation
RRS <- MonteCarlo(func = RR, nrep = 1000, param_list = prml)
summary(RRS)
# make table:
rows = "n"
cols = "out"
MakeTable(output = RRS, rows = rows, cols = cols)
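For the contaminated-normal part of the question, one way to generate the shocks is to flip a Bernoulli(0.1) coin for each draw and inflate the variance of the contaminated draws. A rough sketch, reading the 100s in the question as variances (so sd = 10); the helper name rcontam is mine:
# draw n shocks in k dimensions from 0.9*N(0, I) + 0.1*N(0, 100*I)
rcontam <- function(n, k, p_contam = 0.1, sd_infl = 10) {
  contaminated <- rbinom(n, 1, p_contam) == 1                      # TRUE for roughly 10% of the rows
  e <- matrix(rnorm(n * k), n, k)                                  # N(0, I) component
  e[contaminated, ] <- rnorm(sum(contaminated) * k, sd = sd_infl)  # N(0, 100*I) component
  e
}
# In RR(), the innovation rnorm(k, 0, 1) inside the loop could then be replaced by as.vector(rcontam(1, k))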

How to calculate DRPS (Discrete Ranked Probability Score)

I am working on replicating the scoring rule from the paper Forecasting the intermittent demand for slow-moving inventories: A modelling approach.
The paper describes the scoring rule as a sum, over all possible outcome values, of the squared difference between the forecast CDF and a step function located at the actual value.
This is my attempt:
y <- rpois(n = 100, lambda = 10) # forecasted distribution
x <- 10 # actual value
drps_score <- function(x = value, y = q){
  # x = actual value (single observation); y = quantile forecasted value (vector)
  Fy = ecdf(y) # cdf function
  indicator <- ifelse(y - x > 0, 1, 0) # Heaviside
  score <- sum((indicator - Fy(y))^2)
  return(score)
}
> drps_score(x = x, y = y)
[1] 53.028
This seems to work well until I provide a vector of 0s as follows:
y <- rep(x = 0, 100)
> drps_score(x = x, y = y)
[1] 0
I know that one of the methods used in the paper was an all-zeros forecast, and their results did not show a DRPS of 0 for it. This makes me think that my calculation is off.
I think there are a few issues at play here.
First off, I don't think you are computing the correct sum inside the scoring function. The score asks you to sum across all possible values of y, i.e. across all non-negative integers, not across the forecasted samples of y; in other words something like \sum_{y=0}^{\infty} \big(H(y - x) - \hat F(y)\big)^2, where H is the Heaviside step function.
Second, I don't think the above definition gives the desired result: with the Heaviside indicator defined to be 0 when y = x (a strict inequality), you don't get a zero score for a forecast with a point mass at the true value. (Yes, I'm saying that source is "wrong", or at least has a definition that doesn't give the desired result.) Here is a re-formulated function that I think fixes both issues:
x <- 10 # actual value
drps_score <- function(x = value, y = q, nsum = 100){
  # x = actual value (single observation); y = quantile forecasted value (vector)
  Fy = ecdf(y) # cdf function
  ysum <- 0:nsum
  indicator <- ifelse(ysum - x >= 0, 1, 0) # Heaviside
  score <- sum((indicator - Fy(ysum))^2)
  return(score)
}
> drps_score(x = x, y = rpois(n = 1000, lambda = 8))
[1] 1.248676
> drps_score(x = x, y = rpois(n = 1000, lambda = 9))
[1] 0.878183
> drps_score(x = x, y = rpois(n = 1000, lambda = 10))
[1] 0.692667
> drps_score(x = x, y = rep(10, 100))
[1] 0
> drps_score(x = x, y = rpois(n = 1000, lambda = 11))
[1] 0.883333
The above shows that, among forecasts that aren't a point mass, the distribution centered on the true value (lambda = 10) has the lowest score.
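As a quick check on the original concern, the all-zeros forecast no longer scores 0 with this version: the ecdf of a vector of zeros equals 1 at every non-negative y, while the indicator is 0 for y = 0, ..., 9, so those ten terms each contribute 1.
drps_score(x = 10, y = rep(0, 100))  # each of y = 0,...,9 contributes (0 - 1)^2 = 1, so the score is 10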

How to simulate PCA Data?

I am trying to simulate PCA Data as follows:
q <- 5 # no. of PCs
p <- 20 # no. of variables
n <- 2000 # no. of individuals
eps <- 0.05 # error variance (the noise sd used below is sqrt(eps))
# Eigenvalues
Sig <- seq(3, 1, length.out = q)^2
Lambda <- diag(Sig)
# Matrix of Principal Components
library(mvtnorm) # for rmvnorm()
H <- rmvnorm(n = n, mean = rep(0, q), sigma = Lambda)
# Add gaussian noise
E <- matrix(rnorm(n*p, sd = sqrt(eps)), ncol = p)
# Data matrix
Y <- H %*% t(Amat) + E
# Perform PCA
summary(m1 <- prcomp(Y, scale = T)) # and so on...
However, I have no idea how to create the matrix of Loadings Amat in a meaningful way.
Thanks for any help, I appreciate it!
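One simple way to construct the loadings (an assumption on my part, not the only meaningful choice) is to draw a random Gaussian matrix and orthonormalize its columns with a QR decomposition, so Amat has orthonormal columns and the q components remain identifiable:
# p x q loadings with orthonormal columns
Amat <- qr.Q(qr(matrix(rnorm(p * q), nrow = p, ncol = q)))
crossprod(Amat)          # approximately the identity: columns are orthonormal
Y <- H %*% t(Amat) + E   # data matrix as in the question
With this construction, prcomp(Y) should recover roughly q dominant components with variances close to Sig plus the noise level.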
This does not use the same structure as the OP, but it simulates data for a PCA with 4 different groups (which could be species), each with 3 "traits" (each trait has a different mean and sd, based for example on biological data found in the literature).
set.seed(123) # setting this so the random results will be repeatable
library(MASS)
# Simulating 3 traits for 4 different species
n = 200 # number of "individuals"
# Generate the groups
Amat1 = MASS::mvrnorm(n, mu = c(11.2,11.8,9.91), Sigma = diag(c(1.31,1.01,1.02)))
Amat2 = MASS::mvrnorm(n, mu = c(7.16,8.54,6.82), Sigma = diag(c(0.445,0.546,0.350)))
Amat3 = MASS::mvrnorm(n, mu = c(15.6,14.6,13.5), Sigma = diag(c(1.43,0.885,0.990)))
Amat4 = MASS::mvrnorm(n, mu = c(8.65,14.1,8.24), Sigma = diag(c(0.535,0.844,0.426)))
# Combine the data
Amat = rbind(Amat1,Amat2,Amat3,Amat4)
# Make group data
Amat.gr = cbind(Amat, gl(4,k=n,labels = c(1,2,3,4)))
# Calculate the covariance matrix for each group
by(Amat.gr[,1:3],INDICES = Amat.gr[,4],FUN = cov) # calculate covariance matrix for all groups
# Plot the result
summary(m1 <- prcomp(Amat, scale= T))
# biplot(m1, xlabs=rep(".", nrow(Amat)), cex = 2)
plot(vegan::scores(m1), asp = 1, pch = 19, col = gl(4,k=n,labels = c(1,2,3,4)))
plot(Amat[,1],Amat[,2], pch = 19, col = gl(4,k=n,labels = c(1,2,3,4)))
The plot on the left shows the PCA and on the right the raw data.
I added a toy example with data to show the algorithm for computing a PCA in R, following Legendre and Legendre (2012).
# Generate vectors (example from Legendre and Legendre 2012)
v1 = c(2,3,5,7,9)
v2 = c(1,4,0,6,2)
# If you want to play with sample size
# n = 100
# v1 = rnorm(n = n, mean = mean(v1), sd = sd(v1))
# v2 = rnorm(n = n, mean = mean(v2), sd = sd(v2))
# Get the y matrix
y = cbind(v1,v2)
# Centered y matrix
yc = apply(y, 2, FUN = function(x) x-mean(x))
# Dispersion matrix
s = 1/(nrow(y)-1)*t(yc) %*% yc
# Compute the singular value decomposition of the dispersion matrix to get the eigenvectors
ev = svd(s)$v
# get the principal components
f = yc %*% ev
# This gives the identity matrix
round(t(svd(s)$v) %*% svd(s)$v,2)
# these are the eigenvalues
svd(s)$d
-svd(yc)$v #p. 104
plot(f, pch = 19); abline(h=0,v=0, lty = 3)
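As a sanity check, the hand-rolled decomposition can be compared directly with prcomp on the same data (eigenvector signs may differ, which is expected):
m <- prcomp(y, center = TRUE, scale. = FALSE)
m$rotation  # matches ev up to sign
m$sdev^2    # matches svd(s)$d, the eigenvalues of the dispersion matrix
m$x         # matches f up to sign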

Xgboost multiclass predicton with linear booster

Does it make sense to use a linear booster to predict a categorical outcome?
I thought it could work like multinomial logistic regression.
An example in R is as follows,
library(xgboost)

y <- c(0, 1, 2, 0, 1, 2) # target variable with numeric encoding
x1 <- c(1, 3, 5, 3, 5, 7)
x2 <- rnorm(n = 6, sd = 1) + x1
df <- data.matrix(data.frame(x1, x2, y))
xgb <- xgboost(data = df[, c("x1", "x2")], label = df[, "y"],
               params = list(booster = "gblinear", objective = "multi:softmax",
                             num_class = 3),
               save_period = NULL, nrounds = 1)
xgb.importance(model = xgb)
I don't get an error, but the importance table has 6 features instead of the expected 2. Is there any interpretation of the 6 importances in terms of the 2 input variables, or does this not make sense and only gbtree is sensible?
Thanks
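For intuition on the count: with objective multi:softmax and num_class = 3, the linear booster fits a separate linear score for each class, so each of the 2 features gets one weight per class, and 2 x 3 = 6 weights is a plausible reading of the importance table. A multinomial logistic regression is structured the same way (nnet just fixes one class as the reference); a small sketch using nnet::multinom, which is my choice of illustration and not part of the question:
library(nnet)
d <- data.frame(y = factor(y), x1 = x1, x2 = x2)
m_multi <- multinom(y ~ x1 + x2, data = d, trace = FALSE)
coef(m_multi)  # one row per non-reference class, one column per term (intercept, x1, x2)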

Why is the gradient of the first iteration step singular in nls with a bivariate normal?

I am trying to fit a non-linear regression model in which the mean function is the bivariate normal CDF. The parameter to estimate is the correlation rho.
The problem: "gradient of first iteration step is singular". Why?
Here is a little example with simulated data.
library(mnormt) # for pmnorm()

# given values for independent variables
x1 <- c(rep(0.1,5), rep(0.2,5), rep(0.3,5), rep(0.4,5), rep(0.5,5))
x2 <- c(rep(c(0.1,0.2,0.3,0.4,0.5),5))
## 1: generate values for the dependent variable (incl. error term)
# from a bivariate normal distribution with assumed correlation rho = 0.5
fun <- function(b) pmnorm(x = c(qnorm(x1[b]), qnorm(x2[b])),
                          mean = c(0, 0),
                          varcov = matrix(c(1, 0.5, 0.5, 1), nrow = 2))
set.seed(123)
y <- sapply(1:25, function(b) fun(b)) + runif(25)/1000
# put it in a data frame
dat <- data.frame(y=y, x1=x1, x2=x2)
## 2: fit the non-linear regression on the generated data
# use rho = 0.51 as the starting value
fun <- function(x1, x2, rho) pmnorm(x = c(qnorm(x1), qnorm(x2)),
                                    mean = c(0, 0),
                                    varcov = matrix(c(1, rho, rho, 1), nrow = 2))
nls(formula = y ~ fun(x1, x2, rho), data = dat, start = list(rho = 0.51),
    lower = 0, upper = 1, trace = TRUE)
This yields an error message:
Error in nls(formula = y ~ fun(x1, x2, rho), data = dat, start = list(rho = 0.51), :
singular gradient
In addition: Warning message:
In nls(formula = y ~ fun(x1, x2, rho), data = dat, start = list(rho = 0.51), :
Upper or lower bounds ignored unless algorithm = "port"
What I don't understand is:
I have only one parameter (rho), so the gradient matrix has a single column, which can only be singular if that gradient is zero. Why should the gradient be zero?
The starting value cannot be the problem, since I know the true rho = 0.5, so a start of 0.51 should be fine, shouldn't it?
The data cannot be perfectly collinear, since I added an error term to y.
I would very much appreciate any help. Thanks in advance.
Perhaps "optim" does a better job than "nls":
library(mnormt)
# given values for independent variables
x1 <- c(rep(0.1,5), rep(0.2,5), rep(0.3,5), rep(0.4,5), rep(0.5,5))
x2 <- c(rep(c(0.1,0.2,0.3,0.4,0.5),5))
## 1: generate values for the dependent variable (incl. error term)
# from a bivariate normal distribution with assumed correlation rho = 0.5
fun <- function(b) pmnorm(x = c(qnorm(x1[b]), qnorm(x2[b])),
                          mean = c(0, 0),
                          varcov = matrix(c(1, 0.5, 0.5, 1), nrow = 2))
set.seed(123)
y <- sapply(1:25, function(b) fun(b)) + runif(25)/1000
# put it in a data frame
dat <- data.frame(y=y, x1=x1, x2=x2)
## 2: fit the non-linear regression on the generated data
# use rho = 0.51 as the starting value
fun <- function(x1, x2, rho) pmnorm(x = c(qnorm(x1), qnorm(x2)),
                                    mean = c(0, 0),
                                    varcov = matrix(c(1, rho, rho, 1), nrow = 2))
# sum of squared residuals as a function of rho
f <- function(rho) {
  sum(sapply(1:nrow(dat), function(i){
    (fun(dat[i, 2], dat[i, 3], rho) - dat[i, 1])^2
  }))
}
optim(0.51, f, method = "BFGS")
The result is not that bad:
> optim(0.51, f, method="BFGS")
$par
[1] 0.5043406
$value
[1] 3.479377e-06
$counts
function gradient
14 4
$convergence
[1] 0
$message
NULL
Maybe even a little bit better than 0.5:
> f(0.5043406)
[1] 3.479377e-06
> f(0.5)
[1] 1.103484e-05
>
Let's check another start value:
> optim(0.8, f, method="BFGS")
$par
[1] 0.5043407
$value
[1] 3.479377e-06
$counts
function gradient
28 6
$convergence
[1] 0
$message
NULL
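Since there is only a single bounded parameter, one-dimensional optimize() is another natural choice here and respects the [0, 1] bounds directly. A minimal sketch, reusing f from the code above:
optimize(f, interval = c(0, 1))  # $minimum is the rho estimate, $objective the sum of squared residuals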
