Lasso in R package glmnet: solution not optimal? - r

I am just trying to understand the lasso implemented in R package glmnet.
I am fitting the data but the solution I get for a lambda of my choice is
not minimising the criteria of the lasso, for instance if I move the solution in one way
I can get a smaller cost function.
What am i understanding wrong?
library(lattice)
library(Matrix)
library(glmnet)
# Target vector
Target <- c(1.3906275E7, -1.8241672E7, 8181847.0, 1.6927098E7, -6547966.5, -1363836.375)
# Observation vector
Obs <- matrix(c( -0.944, 0.869 ,-0.795,-0.996, 0.617, 0.886,
-0.472 , 0.936 , 0.063 ,-0.080,-0.751 ,-0.834,
-0.107 , 0.343 , 0.261 , 0.327,-0.255,0.705,
-1.803,-0.781,0.168,0.211,-0.349, -0.040),6,4)
#fitting
fits <-glmnet(Obs,Target)
# arbitrary choice of lambda
lambda <- 221800
coef = predict(fits,s = lambda ,type="coefficients")
res = c(coef[2,1],coef[3,1],coef[4,1],coef[5,1])
# Computing the lasso criteria
newbookrisklassor = Target-Obs%*%res
cost = sum(abs(res))
newRisklassor = t(newbookrisklassor) %*% newbookrisklassor+lambda*cost
# Moving solution slightly in 1 way and computing the lasso criteria
epsilon = 500000
resP = res + c(0,epsilon,0,0)
costP = sum(abs(resP))
newbookrisklassorP = Target-Obs%*%resP
newRisklassorP = t(newbookrisklassorP) %*% newbookrisklassorP+lambda*costP
# Error it seems that the resP solution is better
newRisklassor-newRisklassorP

You should add the intercept in the fitting
newbookrisklassor = Target-Obs%*%res- intercept...

Related

Approach for comparing linear, non-linear and different parameterization non-linear models

I search for one approach for comparing linear, non-linear and different parameterization non-linear models. For this:
#Packages
library(nls2)
library(minpack.lm)
# Data set - Diameter in function of Feature and Age
Feature<-sort(rep(c("A","B"),22))
Age<-c(60,72,88,96,27,
36,48,60,72,88,96,27,36,48,60,72,
88,96,27,36,48,60,27,27,36,48,60,
72,88,96,27,36,48,60,72,88,96,27,
36,48,60,72,88,96)
Diameter<-c(13.9,16.2,
19.1,19.3,4.7,6.7,9.6,11.2,13.1,15.3,
15.4,5.4,7,9.9,11.7,13.4,16.1,16.2,
5.9,8.3,12.3,14.5,2.3,5.2,6.2,8.6,9.3,
11.3,15.1,15.5,5,7,7.9,8.4,10.5,14,14,
4.1,4.9,6,6.7,7.7,8,8.2)
d<-dados <- data.frame(Feature,Age,Diameter)
str(d)
I will create three different models, two non-linear models with specific parametization and one linear model. In my example
a suppose that all the coefficients of each mode were significant (and not considering real results).
# Model 1 non-linear
e1<- Diameter ~ a1 * Age^a2
#Algoritm Levenberg-Marquardt
m1 <- nlsLM(e1, data = d,
start = list(a1 = 0.1, a2 = 10),
control = nls.control(maxiter = 1000))
# Model 2 linear
m2<-lm(Diameter ~ Age, data=d)
# Model 3 another non-linear
e2<- Diameter ~ a1^(-Age/a2)
m3 <- nls2(e2, data = d, alg = "brute-force",
start = data.frame(a1 = c(-1, 1), a2 = c(-1, 1)),
control = nls.control(maxiter = 1000))
Now, my idea is comparing the "better" model despite the different nature of each model, than I try a proportional measure
and for this I use each mean square error of each model comparing of total square error in data set, when a make this I have if
a comparing model 1 and 2:
## MSE approach (like pseudo R2 approach)
#Model 1
SQEm1<-summary(m1)$sigma^2*summary(m1)$df[2]# mean square error of model
SQTm1<-var(d$Diameter)*(length(d$Diameter)-1)#total square error in data se
R1<-1-SQEm1/SQTm1
R1
#Model 2
SQEm2<-summary(m2)$sigma^2*summary(m2)$df[2]# mean square error of model
R2<-1-SQEm2/SQTm1
R2
In my weak opinion model 1 is "better" that model 2. My question is, does this approach sounds correct? Is there any way to compare these models types?
Thanks in advance!
#First cross-validation approach ------------------------------------------
#Cross-validation model 1
set.seed(123) # for reproducibility
n <- nrow(d)
frac <- 0.8
ix <- sample(n, frac * n) # indexes of in sample rows
e1<- Diameter ~ a1 * Age^a2
#Algoritm Levenberg-Marquardt
m1 <- nlsLM(e1, data = d,
start = list(a1 = 0.1, a2 = 10),
control = nls.control(maxiter = 1000), subset = ix)# in sample model
BOD.out <- d[-ix, ] # out of sample data
pred <- predict(m1, new = BOD.out)
act <- BOD.out$Diameter
RSS1 <- sum( (pred - act)^2 )
RSS1
#[1] 56435894734
#Cross-validation model 2
m2<-lm(Diameter ~ Age, data=d,, subset = ix)# in sample model
BOD.out2 <- d[-ix, ] # out of sample data
pred <- predict(m2, new = BOD.out2)
act <- BOD.out2$Diameter
RSS2 <- sum( (pred - act)^2 )
RSS2
#[1] 19.11031
# Sum of squares approach -----------------------------------------------
deviance(m1)
#[1] 238314429037
deviance(m2)
#[1] 257.8223
Based in gfgm and G. Grothendieck comments, RSS2 has lower error that RSS1 and comparing deviance(m2) and deviance(m2) too, than model 2 is better than model 1.

How does ar.yw estimate the variance

In R, how does the function ar.yw estimate the variance? Specifically, where does the number "var.pred" come from? It does not seem to come from the usual YW estimate of the variance, nor the sum of squared residuals divided by df (even though there is disagreement about what the df should be, none of the choices give an answer equivalent to var.pred). And yes, I know that there are better methods than YW; just trying to figure out what R is doing.
set.seed(82346)
temp <- arima.sim(n=10, list(ar = 0.5), sd=1)
fit <- ar(temp, method = "yule-walker", demean = FALSE, aic=FALSE, order.max=1)
## R's estimate of the sigma squared
fit$var.pred
## YW estimate
sum(temp^2)/10 - fit$ar*sum(temp[2:10]*temp[1:9])/10
## YW if there was a mean
sum((temp-mean(temp))^2)/10 - fit$ar*sum((temp[2:10]-mean(temp))*(temp[1:9]-mean(temp)))/10
## estimate based on residuals, different possible df.
sum(na.omit(fit$resid^2))/10
sum(na.omit(fit$resid^2))/9
sum(na.omit(fit$resid^2))/8
sum(na.omit(fit$resid^2))/7
Need to read the code if it's not documented.
?ar.yw
Which says: "In ar.yw the variance matrix of the innovations is computed from the fitted coefficients and the autocovariance of x." If that is not enough explanation, then you need to look at the code:
methods(ar.yw)
#[1] ar.yw.default* ar.yw.mts*
#see '?methods' for accessing help and source code
getAnywhere(ar.yw.default)
# there are two cases that I see
x <- as.matrix(x)
nser <- ncol(x)
if (nser > 1L) # .... not your situation
#....
else{
r <- as.double(drop(xacf))
z <- .Fortran(C_eureka, as.integer(order.max), r, r,
coefs = double(order.max^2), vars = double(order.max),
double(order.max))
coefs <- matrix(z$coefs, order.max, order.max)
partialacf <- array(diag(coefs), dim = c(order.max, 1L,
1L))
var.pred <- c(r[1L], z$vars)
#.......
order <- if (aic)
(0L:order.max)[xaic == 0L]
else order.max
ar <- if (order)
coefs[order, seq_len(order)]
else numeric()
var.pred <- var.pred[order + 1L]
var.pred <- var.pred * n.used/(n.used - (order + 1L))
So you now need to find the Fortran code for C_eureka. I think I'm finding it here: https://svn.r-project.org/R/trunk/src/library/stats/src/eureka.f This is the code that aI think is returning the var.pred estimate. I'm not a time series guy and It's your responsibility to review this process for applicability to your problem.
subroutine eureka (lr,r,g,f,var,a)
c
c solves Toeplitz matrix equation toep(r)f=g(1+.)
c by Levinson's algorithm
c a is a workspace of size lr, the number
c of equations
c
snipped
c estimate the innovations variance
var(l) = var(l-1) * (1 - f(l,l)*f(l,l))
if (l .eq. lr) return
d = 0.0d0
q = 0.0d0
do 50 i = 1, l
k = l-i+2
d = d + a(i)*r(k)
q = q + f(l,i)*r(k)
50 continue

Estimating t-statistics from Johannsen cointegration method in ca.jo in R

I'm working my way through the vignette workded example in the “vars” library in R. I’m understand most of the example in the vignette expect Table 5 of the vignette here
Running the following code, I see where the cointegrating vector and loadings point estimates are derived, but I do not understand where the t-statistics have been derived.
I ran this code:
library("vars")
data("Canada")
Canada <- Canada[, c("prod", "e", "U", "rw")]
vecm <- ca.jo(Canada[, c("rw", "prod", "e", "U")], type = "trace", ecdet = "trend", K = 3, spec = "transitory")
vecm.r1 <- cajorls(vecm, r = 1)
And got these eigenvectors and weightings,
Eigenvectors, normalised to first column:
(These are the cointegration relations)
rw.l1 prod.l1 e.l1 U.l1 trend.l1
rw.l1 1.00000000 1.0000000 1.0000000 1.000000 1.0000000
prod.l1 0.54487553 -3.0021508 0.7153696 -7.173608 0.4087221
e.l1 -0.01299605 -3.8867890 -2.0625220 -30.429074 -3.3884676
U.l1 1.72657188 -10.2183404 -5.3124427 -49.077209 -5.1326687
trend.l1 -0.70918872 0.6913363 -0.3643533 11.424630 0.1157125
Weights W:
(This is the loading matrix)
rw.l1 prod.l1 e.l1 U.l1 trend.l1
rw.d -0.084814510 0.048563997 -0.02368720 -0.0016583069 5.722004e-12
prod.d -0.011994081 0.009204887 -0.09921487 0.0020567547 -7.478364e-12
e.d -0.015606039 -0.038019447 -0.01140202 -0.0005559337 -1.229460e-11
U.d -0.008659911 0.020499657 0.02896325 0.0009140795 1.103862e-11
Which give me the correct alpha and beta values in Table 5.
But I have no idea where the t-statistics in table 5 have been derived in the code. Can anybody out there point me in the right direction please?
Regards,
James
My biggest gripe with urca: it does not calculate t-statistics for the beta vectors. It looks like the author maybe at some point wanted to include the feature (hence table 5 in the vignette you cite above) but did not get around to actually doing it.
The good news: For the case of one cointegrating vector (r=1), there is sample code available to get the standard errors. See this discussion thread
Standard Errors for the Long-Run Cointegrating Relationship with urca when there is a single cointegrating vector
Continuing from your example above, the code given by the nabble thread discussed above is:
alpha <- coef(vecm.r1$rlm)[1, ] # the coefficients on ecm1
beta <- vecm.r1$beta # the point estimates of beta
resids <- resid(vecm.r1$rlm)
N <- nrow(resids)
sigma <- crossprod(resids) / N
## t-stats for beta
beta.se <- sqrt(diag(kronecker(solve(crossprod(vecm#RK[, -1])),
solve(t(alpha) %*% solve(sigma) %*% alpha))))
beta.t <- c(NA, beta[-1] / beta.se)
names(beta.t) <- rownames(vecm.r1$beta)
beta.t
This reproduces exactly the standard errors given by the vignette.
Compare with Gretl
If we run the same model with gretl, we get exactly identical point estimates for beta but the standard errors roughly agree:
Bonus: Create LaTeX table of the cointegrating vector with standard errors using urca and texreg
You need to write a slightly clumsy "extract" function to assemble the standard errors and point estimates together into the form that texreg likes:
extract.cajo_beta <- function(cajo, orls) {
alpha <- coef(orls$rlm)[1, ];
resids <- resid(orls$rlm);
N <- nrow(resids);
sigma <- crossprod(resids) / N;
# get standard errors and p-values
beta <- orls$beta
beta.se <- sqrt(diag(kronecker(solve(crossprod(cajo#RK[, -1])),
solve(t(alpha) %*% solve(sigma) %*% alpha))));
beta.se2 <- c(NA, beta.se);
beta.t <- c(NA, beta[-1] / beta.se);
beta.pval <- dt(beta.t, df= orls$rlm$df.residual)
tr <- createTexreg(coef.names = as.character(rownames(beta)), coef = as.numeric(beta), se = beta.se2,
pvalues = beta.pval,
# remove this goodness of fit measure afterwards from the table
gof.names = c('Dummy'), gof = c(1), gof.decimal = c(FALSE)
);
return(tr);
}
Then run:
> screenreg(extract.cajo_beta(vecm, vecm.r1))
=================
Model 1
-----------------
rw.l1 1.00
prod.l1 0.54
(0.61)
e.l1 -0.01
(0.68)
U.l1 1.73
(1.45)
trend.l1 -0.71 *
(0.28)
-----------------
Dummy 1
=================
*** p < 0.001, ** p < 0.01, * p < 0.05

How do I get standard errors of maximum-likelihood estimates in STAN?

I am using maximum-likelihood optimization in Stan, but unfortunately the optimizing() function doesn't report standard errors:
> MLb4c <- optimizing(get_stanmodel(fitb4c), data = win.data, init = inits)
STAN OPTIMIZATION COMMAND (LBFGS)
init = user
save_iterations = 1
init_alpha = 0.001
tol_obj = 1e-012
tol_grad = 1e-008
tol_param = 1e-008
tol_rel_obj = 10000
tol_rel_grad = 1e+007
history_size = 5
seed = 292156286
initial log joint probability = -4038.66
Iter log prob ||dx|| ||grad|| alpha alpha0 # evals Notes
13 -2772.49 9.21091e-005 0.0135987 0.07606 0.9845 15
Optimization terminated normally:
Convergence detected: relative gradient magnitude is below tolerance
> t2 <- proc.time()
> print(t2 - t1)
user system elapsed
0.11 0.19 0.74
>
> MLb4c
$par
psi alpha beta
0.9495000 0.4350983 -0.2016895
$value
[1] -2772.489
> summary(MLb4c)
Length Class Mode
par 3 -none- numeric
value 1 -none- numeric
How do I get the standard errors of the estimates (or confidence interval - quantiles), and possibly p-values?
EDIT: I did as kindly advised by #Ben Goodrich:
> MLb4cH <- optimizing(get_stanmodel(fitb4c), data = win.data, init = inits, hessian = TRUE)
> sqrt(diag(solve(-MLb4cH$hessian)))
psi alpha beta
0.21138314 0.03251696 0.03270493
But these "unconstrained" standard errors seem to be very different from the real ones - here as is the output from bayesian fitting using stan():
> print(outb4c, dig = 5)
Inference for Stan model: tmp_stan_model.
3 chains, each with iter=500; warmup=250; thin=1;
post-warmup draws per chain=250, total post-warmup draws=750.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
alpha 0.43594 0.00127 0.03103 0.37426 0.41578 0.43592 0.45633 0.49915 594 1.00176
beta -0.20262 0.00170 0.03167 -0.26640 -0.22290 -0.20242 -0.18290 -0.13501 345 1.00402
psi 0.94905 0.00047 0.01005 0.92821 0.94308 0.94991 0.95656 0.96632 448 1.00083
lp__ -2776.94451 0.06594 1.15674 -2780.07437 -2777.50643 -2776.67139 -2776.09064 -2775.61263 308 1.01220
You can specify the hessian = TRUE argument to the optimizing function, which will return the Hessian as part of the list of output. Thus, you can obtain estimated standard errors via sqrt(diag(solve(-MLb4c$hessian))); however those standard errors pertain to the estimates in the unconstrained space. To obtain estimated standard errors for the parameters in the constrained space, you could either use the delta method or draw many times from a multivariate normal distribution whose mean vector is MLb4c$par and whose variance-covariance is solve(-MLb4c$hessian), convert those draws to the constrained space with the constrain_pars function, and estimate the standard deviation of each column.
Here is some R code you could adapt to your case
# 1: Compile and save a model (make sure to pass the data here)
model <- stan(file="model.stan", data=c("N","K","X","y"), chains = 0, iter = 0)
# 2: Fit that model
fit <- optimizing(object=get_stanmodel(model), as_vector = FALSE,
data=c("N","K","X","y"), hessian = TRUE)
# 3: Extract the vector theta_hat and the Hessian for the unconstrained parameters
theta_hat <- unlist(fit$par)
upars <- unconstrain_pars(linear, relist(theta_hat, fit$par))
Hessian <- fit$hessian
# 4: Extract the Cholesky decomposition of the (negative) Hessian and invert
R <- chol(-Hessian)
V <- chol2inv(R)
rownames(V) <- colnames(V) <- colnames(Hessian)
# 5: Produce a matrix with some specified number of simulation draws from a multinormal
SIMS <- 1000
len <- length(theta_hat)
unconstrained <- upars + t(chol(V)) %*%
matrix(rnorm(SIMS * len), nrow = len, ncol = SIMS)
theta_sims <- t(apply(unconstrained, 2, FUN = function(upars) {
unlist(constrain_pars(linear, upars))
}))
# 6: Produce estimated standard errors for the constrained parameters
se <- apply(theta_sims, 2, sd)

Understanding different results of optim() and lm()

Given:
set.seed(1001)
outcome<-rnorm(1000,sd = 1)
covariate<-rnorm(1000,sd = 1)
log-likelihood of normal pdf:
loglike <- function(par, outcome, covariate){
cov <- as.matrix(cbind(1, covariate))
xb <- cov * par
(- 1/2* sum((outcome - xb)^2))
}
optimize:
opt.normal <- optim(par = 0.1,fn = loglike,outcome=outcome,cov=covariate, method = "BFGS", control = list(fnscale = -1),hessian = TRUE)
However I get different results when running an simple OLS. However maximizing log-likelihhod and minimizing OLS should bring me to a similar estimate. I suppose there is something wrong with my optimization.
summary(lm(outcome~covariate))
Umm several things... Here's a proper working likelihood function (with names x and y):
loglike =
function(par,x,y){cov = cbind(1,x); xb = cov %*% par;(-1/2)*sum((y-xb)^2)}
Note use of matrix multiplication operator.
You were also only running it with one par parameter, so it was not only broken because your loglike was doing element-element multiplication, it was only returning one value too.
Now compare optimiser parameters with lm coefficients:
opt.normal <- optim(par = c(0.1,0.1),fn = loglike,y=outcome,x=covariate, method = "BFGS", control = list(fnscale = -1),hessian = TRUE)
opt.normal$par
[1] 0.02148234 -0.09124299
summary(lm(outcome~covariate))$coeff
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.02148235 0.03049535 0.7044466 0.481319029
covariate -0.09124299 0.03049819 -2.9917515 0.002842011
shazam.
Helpful hints: create data that you know the right answer for - eg x=1:10; y=rnorm(10)+(1:10) so you know the slope is 1 and the intercept 0. Then you can easily see which of your things are in the right ballpark. Also, run your loglike function on its own to see if it behaves as you expect.
Maybe you will find it usefull to see the difference between these two methods from my code. I programmed it the following way.
data.matrix <- as.matrix(hprice1[,c("assess","bdrms","lotsize","sqrft","colonial")])
loglik <- function(p,z){
beta <- p[1:5]
sigma <- p[6]
y <- log(data.matrix[,1])
eps <- (y - beta[1] - z[,2:5] %*% beta[2:5])
-nrow(z)*log(sigma)-0.5*sum((eps/sigma)^2)
}
p0 <- c(5,0,0,0,0,2)
m <- optim(p0,loglik,method="BFGS",control=list(fnscale=-1,trace=10),hessian=TRUE,z=data.matrix)
rbind(m$par,sqrt(diag(solve(-m$hessian))))
And for the lm() method I find this
m.ols <- lm(log(assess)~bdrms+lotsize+sqrft+colonial,data=hprice1)
summary(m.ols)
Also if you would like to estimate the elasticity of assessed value with respect to the lotsize or calculate a 95% confidence interval
for this parameter, you could use the following
elasticity.at.mean <- mean(hprice1$lotsize) * m$par[3]
var.coefficient <- solve(-m$hessian)[3,3]
var.elasticity <- mean(hprice1$lotsize)^2 * var.coefficient
# upper bound
elasticity.at.mean + qnorm(0.975)* sqrt(var.elasticity)
# lower bound
elasticity.at.mean + qnorm(0.025)* sqrt(var.elasticity)
A more simple example of the optim method is given below for a binomial distribution.
loglik1 <- function(p,n,n.f){
n.f*log(p) + (n-n.f)*log(1-p)
}
m <- optim(c(pi=0.5),loglik1,control=list(fnscale=-1),
n=73,n.f=18)
m
m <- optim(c(pi=0.5),loglik1,method="BFGS",hessian=TRUE,
control=list(fnscale=-1),n=73,n.f=18)
m
pi.hat <- m$par
numerical calculation of s.d
rbind(pi.hat=pi.hat,sd.pi.hat=sqrt(diag(solve(-m$hessian))))
analytical
rbind(pi.hat=18/73,sd.pi.hat=sqrt((pi.hat*(1-pi.hat))/73))
Or this code for the normal distribution.
loglik1 <- function(p,z){
mu <- p[1]
sigma <- p[2]
-(length(z)/2)*log(sigma^2) - sum(z^2)/(2*sigma^2) +
(mu*sum(z)/sigma^2) - (length(z)*mu^2)/(2*sigma^2)
}
m <- optim(c(mu=0,sigma2=0.1),loglik1,
control=list(fnscale=-1),z=aex)

Resources