LARS algorithm with lasso modification in R

I'm trying to implement the LARS algorithm with the lasso modification.
I'm stuck at step 3: I want to program it, but I don't really understand it.
Steps 1 and 2 are already done; here is the code:
#1. Standardize the predictors to have mean zero and unit norm.
set.seed(19875)
n <- 10
p <- 5
real_p <- 5
x <- matrix(rnorm(n*p), nrow=n, ncol=p)
x <- x - matrix(apply(x, 2, mean), ncol = ncol(x), nrow = nrow(x), byrow = TRUE)
x <- x / matrix(apply(x, 2, sd), ncol = ncol(x), nrow = nrow(x), byrow = TRUE)
y <- apply(x[,1:real_p], 1, sum) + rnorm(n)
# Start with the residual r = y - mean(y) and beta_1 = beta_2 = ... = beta_p = 0
r <- y - mean(y)
beta <- matrix(0, ncol = ncol(x), nrow = 1)
#2. Find the predictor xj most correlated with r.
co <- t(x) %*% r
j <- which.max(abs(co))  # index of the first predictor attaining the maximal |correlation|
# 3. Move beta_j from 0 towards its least-squares coefficient <x_j, r>, until some
#    other competitor x_k has as much correlation with the current residual
#    as does x_j.
I would very much appreciate any clarification.
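In case it helps, here is a minimal sketch of what step 3 amounts to for the very first LARS step, where only one predictor is active. It is an illustration under the standardisation above, not a full LARS implementation: it moves beta_j towards its least-squares coefficient on r and stops at the smallest step at which another predictor's absolute correlation with the residual catches up.
# Sketch of step 3 for the first step (single active predictor j); illustrative only
cj <- co[j]                              # current correlation of the active predictor
s  <- sign(cj)                           # direction in which beta_j moves
C  <- abs(cj)                            # maximal absolute correlation
A  <- sum(x[, j]^2)                      # squared norm of the active predictor
a  <- as.vector(t(x) %*% (s * x[, j]))   # inner products with the direction of movement
gamma <- C / A   # never move past the least-squares coefficient of r on x_j
for (k in setdiff(seq_len(ncol(x)), j)) {
  cand <- c((C - co[k]) / (A - a[k]),    # x_k catches up with the same sign
            (C + co[k]) / (A + a[k]))    # x_k catches up with the opposite sign
  cand <- cand[is.finite(cand) & cand > 1e-12]
  if (length(cand)) gamma <- min(gamma, min(cand))
}
beta[j] <- beta[j] + s * gamma           # move the coefficient by the step length
r <- r - gamma * s * x[, j]              # update the residual accordingly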

Related

Constrained optimisation using maxLik

I have read the maxLik documentation on how to do constrained optimization, but I do not understand how to apply it. I have a custom likelihood function, shown below. The value of rho should lie between 0 and 1 (which amounts to two constraints). How exactly do I specify those constraints? I have 3 parameters.
I have seen an almost identical question with 3 constraints and 3 parameters, but I am really a novice and do not understand the proposed hints on how to include the constraints: how to use maxLik() to do the constrained optimization in R
require(maxLik)

data <- matrix(rnorm(3600, 5, 1), ncol = 20)
Y <- data[, 1:20]
Y <- as.matrix(Y, ncol = 20)
p <- 4
T <- nrow(Y)
X <- Y[p:(T - 1), 1:4]
unos <- rep(1, T)
X <- cbind(unos, X)

set.seed(101)
loglik <- function(theta) {
  eta <- theta[1]
  n   <- theta[2]
  rho <- theta[3]
  coefis <- as.matrix(c(mu0 = 0.0112, mu1 = 0.0115, mu2 = 0.009, mu3 = 0.021,
                        mu4 = 0.01237), ncol = 1)  # coefficients for the intercept and four lags
  resi <- Y - X %*% coefis
  y <- Custom_lik(resi, eta, n, rho, T)  # my custom likelihood function
  return(-y[[1]])
}
m <- maxLik(loglik, start = c(eta = 1.1, n = 1.5, rho = 0.5))
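Not knowing Custom_lik, here is only a hedged sketch of how the bounds on rho might be passed via maxLik's constraints argument, which accepts inequality constraints of the form A %*% theta + B > 0. The matrix A and vector B below are my own illustration, and method = "BFGS" is chosen because the default Newton-Raphson method does not handle inequality constraints.
# rho > 0 and 1 - rho > 0, written as A %*% theta + B > 0 with theta = (eta, n, rho)
A <- rbind(c(0, 0,  1),    # picks out  rho:      rho > 0
           c(0, 0, -1))    # picks out -rho:  1 - rho > 0
B <- c(0, 1)
m_constr <- maxLik(loglik,
                   start = c(eta = 1.1, n = 1.5, rho = 0.5),
                   constraints = list(ineqA = A, ineqB = B),
                   method = "BFGS")
summary(m_constr)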

Getting wrong betas when doing OLS regression in R

My first question here. This problem has stolen days from my life. I know it's not that important, but at the same time: I need to know! I know there are many good functions for running regressions in R, but when I try to do it with good old arithmetic, just to get the hang of it, I get ridiculous values for beta.
The beta vector is supposed to be (X'X)^(-1) X'y (where X is the matrix of regressors and y the vector of responses). Here is one example (that it's not a sensible dataset for OLS is irrelevant; I just want the b's here):
X <- matrix(1:10)
y <- matrix(2:11)
b <- (t(X) %*% X)^(-1) %*% t(X) %*% y
This gives b = 1.142857, while summary(lm(y~X)) gives beta = 1 and an intercept of 1. If I add a constant column to X to get an intercept, X <- cbind(X, 1), the result I get is b = (2.324675, 14.5), which doesn't make sense at all. What am I doing wrong here?
There are two problems here. The first is a problem of notation. The power of -1 in the formula actually indicates a matrix inverse. That is calculated with solve in R and not with ^-1, which indicates element-wise reciprocals.
Then, you need to create a design matrix that actually contains an intercept.
X <- matrix(1:10)
y <- matrix(2:11)^2
coef(lm(y~X))
#(Intercept) X
# -21 13
X <- cbind(1, X)
solve(t(X) %*% X) %*% t(X) %*% y
# [,1]
#[1,] -21
#[2,] 13
Obviously, you should not actually do this matrix inversion in real world applications (and R's lm doesn't do it).
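For reference, a short sketch of a numerically safer route that avoids forming t(X) %*% X at all, via the QR decomposition that lm itself relies on (using the X and y defined just above):
qr.solve(X, y)        # least-squares coefficients via QR, same (-21, 13) as above
# equivalently: qr.coef(qr(X), y)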
The issue is with using ^(-1) for the inverse. It doesn't work like that for matrices. solve is used to get the inverse of a matrix: https://www.statmethods.net/advstats/matrix.html
# use solve
b <- solve(t(X) %*% X) %*% t(X) %*% y

# fit model without intercept
m <- lm(y ~ -1 + X)
summary(m)

# same coefficients
b
m$coefficients

# with intercept
X2 <- cbind(rep(1, 10), X)
b2 <- solve(t(X2) %*% X2) %*% t(X2) %*% y
m2 <- lm(y ~ X)
summary(m2)
b2
m2$coefficients

X <- cbind(1, matrix(1:10))
b <- solve(t(X) %*% X) %*% t(X) %*% y
https://www.rdocumentation.org/packages/Matrix/versions/0.3-26/topics/solve.Matrix

R: Confidence intervals on non-linear fit with a non-analytic model

I need to fit x-y data with a model, which is non-analytic. I have a function f(x) that calculates the model for each x numerically, but there is no analytical equation. For the fit, I use optim in R. I minimise RMS between the model and the data. It works well and returns reasonable parameters.
I would like to find confidence intervals (or at least standard errors) for the best-fitting parameters. I found on the internet that this can be done from the Hessian matrix, but only when maximising a log-likelihood function. I don't know how to do that; all I have is x, y and f(x), from which I compute the RMS. Alas, I have no good way of estimating the errors on y.
How can I find confidence intervals on my fit parameters?
Edit: perhaps an example in R might help explain what I'm asking for. This example uses a simple analytic function to fit the data; in my real case the function is non-analytic, so I cannot use, e.g., nls.
set.seed(666)
# generate data
x <- seq(100) / 100
y <- 0.5 * x + rnorm(100, sd = 0.03) + 0.2
# function to fit
f <- function(x, a, b) {
  a * x + b
}
# error function to minimise: RMS
errfun <- function(par, x, y) {
  a <- par[1]
  b <- par[2]
  sqrt(sum((f(x, a, b) - y)^2))
}
# use optim to fit the model to the data
par <- c(1, 0)
res <- optim(par, errfun, gr=NULL, x, y)
# best-fitting parameters
best_a <- res$par[1]
best_b <- res$par[2]
The best fitting parameters are a = 0.50 and b = 0.20. I need to find 95% confidence intervals on these.
This is a job for the bootstrap:
(1) Create a large number of synthetic datasets x*. These are created by sampling from x with replacement, drawing the same number of data points as were in x. For example, if your data are (1,2,3,4,5,6), an x* might be (5,2,4,4,2,3) (note that values might appear multiple times, or not at all, because we are sampling with replacement).
(2) For each x*, calculate f(x*). If there are other parameters which don't depend on the data, don't change them (so f(x,a,b,c) becomes f(x*,a,b,c) as long as a,b,c don't depend on x). Call these quantities f*.
(3) You can estimate anything you want from these f*. If you want the standard deviation of f(x), take the standard deviation of f*. If you want the 95% confidence interval, take the range from the 2.5 to the 97.5 percentiles of f*. More formally, if you want to estimate g(f(x)) you estimate it as g(f(x*)).
I should say this is a very practically-oriented explanation of the bootstrap. I have glossed over many theoretical details, but the bootstrap is near-universally applicable (basically as long as the thing you are trying to estimate actually exists, you are usually okay).
To apply this to the example you have given in your code:
x <- seq(100) / 100
y <- 0.5 * x + rnorm(100, sd = 0.03) + 0.2
# function to fit
f <- function(x, a, b) {
  a * x + b
}
# error function to minimise: RMS
errfun <- function(par, x, y) {
  a <- par[1]
  b <- par[2]
  sqrt(sum((f(x, a, b) - y)^2))
}
# this is the part where we bootstrap
# use optim to fit the model to the data
best_a <- best_b <- numeric(10000)
for (i in 1:10000) {
  j <- sample(100, replace = TRUE)
  x.boot <- x[j]; y.boot <- y[j]
  par <- c(1, 0)
  res <- optim(par, errfun, gr = NULL, x.boot, y.boot)
  # best-fitting parameters
  best_a[i] <- res$par[1]
  best_b[i] <- res$par[2]
}
# now, we look at the *vector* best_a
# for example, if you want the standard deviation of a,
sd(best_a)
# or a 95% confidence interval for b,
quantile(best_b,c(0.025,0.975))
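For completeness, here is a hedged sketch of the Hessian route mentioned in the question. It assumes roughly i.i.d. Gaussian residuals, and it minimises the sum of squares rather than the RMS, because the curvature of the sum of squares is what maps onto the Gaussian log-likelihood; sse is a helper introduced here for illustration only.
# sum-of-squares objective (not the RMS) so the Hessian has the right scale
sse <- function(par, x, y) sum((f(x, par[1], par[2]) - y)^2)
fit <- optim(c(1, 0), sse, x = x, y = y, hessian = TRUE)

n         <- length(y)
sigma2    <- fit$value / (n - length(fit$par))   # residual variance estimate
vcov_hat  <- 2 * sigma2 * solve(fit$hessian)     # approximate covariance of (a, b)
se        <- sqrt(diag(vcov_hat))                # approximate standard errors

# approximate 95% confidence intervals for a and b
cbind(lower = fit$par - 1.96 * se, upper = fit$par + 1.96 * se)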

How to find interval probability for a given distribution?

Suppose I have some data and I fit them to a gamma distribution, how to find the interval probability for Pr(1 < x <= 1.5), where x is an out-of-sample data point?
require(fitdistrplus)
a <- c(2.44121289,1.70292449,0.30550832,0.04332383,1.0553436,0.26912546,0.43590885,0.84514809,
0.36762336,0.94935435,1.30887437,1.08761895,0.66581035,0.83108270,1.7567334,1.00241339,
0.96263021,1.67488277,0.87400413,0.34639636,1.16804671,1.4182144,1.7378907,1.7462686,
1.7427784,0.8377457,0.1428738,0.71473956,0.8458882,0.2140742,0.9663167,0.7933085,
0.0475603,1.8657773,0.18307362,1.13519144)
fit <- fitdist(a, "gamma",lower = c(0, 0))
Someone does not like my approach above, which is conditional on the MLE, so let's look at something unconditional. If we take direct integration, we need a triple integral: one over the shape, one over the rate, and finally one over x. That is not appealing, so I will just produce a Monte Carlo estimate instead.
By the central limit theorem, the MLEs are approximately normally distributed. fitdistrplus::fitdist does not give standard errors, but we can use MASS::fitdistr, which performs the required inference here.
library(MASS)
fit <- fitdistr(a, "gamma", lower = c(0, 0))
b <- fit$estimate
#   shape     rate
#1.739737 1.816134
V <- fit$vcov  ## covariance matrix of the MLE
#          shape      rate
#shape 0.1423679 0.1486193
#rate  0.1486193 0.2078086
Now we would like to sample from the (approximate normal) distribution of the parameters and get samples of the target probability.
set.seed(0)
## sample from bivariate normal with mean `b` and covariance `V`
## Cholesky method is used here
X <- matrix(rnorm(1000 * 2), 1000) ## 1000 `N(0, 1)` normal samples
R <- chol(V) ## upper triangular Cholesky factor of `V`
X <- X %*% R ## transform X under desired covariance
X <- X + b ## shift to desired mean
## you can use `cov(X)` to check it is very close to `V`
## now samples for `Pr(1 < x < 1.5)`
p <- pgamma(1.5, X[,1], X[,2]) - pgamma(1, X[,1], X[,2])
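As an aside, the manual Cholesky construction above could be replaced by a single call to MASS::mvrnorm (my addition, assuming MASS is loaded as above); the probabilities are then computed exactly as before:
X <- mvrnorm(1000, mu = b, Sigma = V)   # 1000 draws from N(b, V)
p <- pgamma(1.5, X[, 1], X[, 2]) - pgamma(1, X[, 1], X[, 2])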
We can make a histogram of p (and maybe do a density estimation if you want):
hist(p, prob = TRUE)
Now, we often want the sample mean as a point estimate:
mean(p)
# [1] 0.1906975
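If an interval for this probability is wanted as well, a simple option (my addition) is to take empirical quantiles of the sampled values:
## a 95% interval for Pr(1 < x < 1.5) from the sampled values
quantile(p, c(0.025, 0.975))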
Here is an example that uses MCMC techniques and a Bayesian mode of inference to estimate the posterior probability that a new observation falls in the interval (1, 1.5). This is an unconditional estimate, as opposed to the conditional estimate obtained by integrating the gamma distribution with maximum-likelihood parameter estimates.
This code requires that JAGS be installed on your computer (free and easy to install).
library(rjags)
a <- c(2.44121289,1.70292449,0.30550832,0.04332383,1.0553436,0.26912546,0.43590885,0.84514809,
0.36762336,0.94935435,1.30887437,1.08761895,0.66581035,0.83108270,1.7567334,1.00241339,
0.96263021,1.67488277,0.87400413,0.34639636,1.16804671,1.4182144,1.7378907,1.7462686,
1.7427784,0.8377457,0.1428738,0.71473956,0.8458882,0.2140742,0.9663167,0.7933085,
0.0475603,1.8657773,0.18307362,1.13519144)
# Specify the model in JAGS language using diffuse priors for shape and scale
sink("GammaModel.txt")
cat("model{
# Priors
shape ~ dgamma(.001,.001)
rate ~ dgamma(.001,.001)
# Model structure
for(i in 1:n){
a[i] ~ dgamma(shape, rate)
}
}
", fill=TRUE)
sink()
jags.data <- list(a=a, n=length(a))
# Give overdispersed initial values (not important for this simple model, but very important if running complicated models where you need to check convergence by monitoring multiple chains)
inits <- function(){list(shape=runif(1,0,10), rate=runif(1,0,10))}
# Specify which parameters to monitor
params <- c("shape", "rate")
# Set-up for MCMC run
nc <- 1 # number of chains
n.adapt <-1000 # number of adaptation steps
n.burn <- 1000 # number of burn-in steps
n.iter <- 500000 # number of posterior samples
thin <- 10 # thinning of posterior samples
# Running the model
gamma_mod <- jags.model('GammaModel.txt', data = jags.data, inits=inits, n.chains=nc, n.adapt=n.adapt)
update(gamma_mod, n.burn)
gamma_samples <- coda.samples(gamma_mod,params,n.iter=n.iter, thin=thin)
# Summarize the result
summary(gamma_samples)
# Draw samples from the posterior predictive distribution of x
# (index the coda samples by column name so shape and rate cannot be swapped)
x <- rep(NA, 50000)
for(i in 1:50000){
  x[i] <- rgamma(1, gamma_samples[[1]][i, "shape"], rate = gamma_samples[[1]][i, "rate"])
}
# Proportion of posterior predictive draws that fall in the desired range
length(which(x > 1 & x < 1.5)) / length(x)
Answer:
Pr(1 < x <= 1.5) = 0.194
So pretty close to the conditional estimate, but this is not guaranteed to be the case in general.
You can just use pgamma with the estimated parameters in fit.
b <- fit$estimate
# shape rate
#1.739679 1.815995
pgamma(1.5, b[1], b[2]) - pgamma(1, b[1], b[2])
# [1] 0.1896032
Thanks. But how about P(x > 2)?
Check out the lower.tail argument:
pgamma(q, shape, rate = 1, scale = 1/rate, lower.tail = TRUE, log.p = FALSE)
By default, pgamma(q) evaluates Pr(x <= q). Setting lower.tail = FALSE gives Pr(x > q). So you can do:
pgamma(2, b[1], b[2], lower.tail = FALSE)
# [1] 0.08935687
Or you can also use
1 - pgamma(2, b[1], b[2])
# [1] 0.08935687

Random effects variance of intercept being zero

I am running a power analysis using a normal LMM in R. I have seven input parameters, two of which I do not need to test (the number of years and the number of sites). The other five parameters are the intercept, the slope, and the random-effects standard deviations of the residual, the intercept and the slope.
Given that my response data (year is the sole explanatory variable in the model) are bounded between -1 and +1, the intercept also falls in this range. However, what I am finding is that if I run, say, 1000 simulations with a given intercept and slope (which I treat as constant over 10 years), then once the random-effects intercept SD falls below a certain value, many simulations return an estimated intercept SD of zero. If I inflate the intercept SD, it is simulated correctly (below I use residual SD = 0.25, intercept SD = 0.10 and slope SD = 0.05; if I increase the intercept SD to 0.2 the variance parameters are recovered correctly, as they are if I drop the residual SD to, say, 0.05).
Is this problem due to my coercion of the range to (-1, +1)?
I include the code for my function and the processing of the simulations below, in case it helps:
Function generating the data:
normaldata <- function(J, K, beta0, beta1, sigma_resid,
                       sigma_beta0, sigma_beta1) {
  year <- rep(rep(0:J), K)         # 0:J replicated K times
  site <- rep(1:K, each = (J+1))   # 1:K sites, repeated J years
  mu.beta0_true <- beta0
  mu.beta1_true <- beta1
  # random effects variance parameters:
  sigma_resid_true <- sigma_resid
  sigma_beta0_true <- sigma_beta0
  sigma_beta1_true <- sigma_beta1
  # site-level parameters:
  beta0_true <<- rnorm(K, mu.beta0_true, sigma_beta0_true)
  beta1_true <<- rnorm(K, mu.beta1_true, sigma_beta1_true)
  # data
  y <<- rnorm(n = (J+1)*K, mean = beta0_true[site] + beta1_true[site]*year,
              sd = sigma_resid_true)
  # NOT SURE WHETHER TO IMPOSE THE LIMITS HERE OR LATER IN CODE:
  y[y < -1] <- -1   # absolute minimum
  y[y > 1]  <- 1    # absolute maximum
  return(data.frame(y, year, site))
}
Processing the simulations:
library(lme4)
library(arm)  # for se.fixef()

n.sims <- 1000
sigma.resid <- rep(0, n.sims)
sigma.intercept <- rep(0, n.sims)
sigma.slope <- rep(0, n.sims)
intercept <- rep(0, n.sims)
slope <- rep(0, n.sims)
signif <- rep(0, n.sims)

for (s in 1:n.sims) {
  y.data <- normaldata(10, 200, 0.30, ((0 - 0.30)/10), 0.25, 0.1, 0.05)
  lme.power <- lmer(y ~ year + (1 + year | site), data = y.data)
  summary(lme.power)
  theta.hat <- fixef(lme.power)[["year"]]
  theta.se <- se.fixef(lme.power)[["year"]]
  signif[s] <- ((theta.hat + 1.96*theta.se) < 0) |
               ((theta.hat - 1.96*theta.se) > 0)  # returns TRUE or FALSE
  betas <- fixef(lme.power)
  intercept[s] <- betas[1]
  slope[s] <- betas[2]
  vc1 <- as.data.frame(VarCorr(lme.power))
  vc2 <- as.numeric(attributes(VarCorr(lme.power)$site)$stddev)
  sigma.resid[s] <- vc1[4, 5]
  sigma.intercept[s] <- vc2[1]
  sigma.slope[s] <- vc2[2]
  cat(paste(s, " ")); flush.console()
}

power <- mean(signif)  # proportion of TRUE
power
summary(sigma.resid)
summary(sigma.intercept)
summary(sigma.slope)
summary(intercept)
summary(slope)
Thank you in advance for any help you can offer.
This is really more of a statistical than a computational question, but the short answer is: you haven't made any mistakes; this is exactly as expected. This example on rpubs runs some simulations of a normally distributed response (i.e. it corresponds exactly to the model assumed by LMM software, so the constraint you're worried about isn't an issue).
The left-hand histogram in that example is from simulations with 25 samples in 5 groups and equal variance (of 1) within and between groups; the right-hand histogram is from simulations with 15 samples in 3 groups.
The sampling distribution of variance estimates for null cases (i.e., no real between-group variation) is known to have a point mass or "spike" at zero; it is not surprising (although, as far as I know, not theoretically worked out) that the sampling distribution should also have a point mass at zero when the between-group variance is non-zero but small, and/or when the sample is small and/or noisy.
http://bbolker.github.io/mixedmodels-misc/glmmFAQ.html#zero-variance has more on this topic.
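To see the effect without the (-1, +1) truncation at all, here is a minimal toy simulation of my own (not the rpubs code, and not your model) showing how often lmer estimates a small between-group variance as exactly zero; expect many "boundary (singular) fit" messages, which is precisely the point:
library(lme4)
set.seed(1)

n_sim   <- 200
var_hat <- numeric(n_sim)
for (s in 1:n_sim) {
  g <- factor(rep(1:5, each = 5))          # 5 groups of 5 observations
  b <- rnorm(5, 0, 0.2)                    # small true between-group SD
  y <- b[g] + rnorm(25)                    # residual SD = 1
  m <- lmer(y ~ 1 + (1 | g))
  var_hat[s] <- as.numeric(VarCorr(m)$g)   # estimated between-group variance
}
mean(var_hat == 0)   # proportion of fits where the variance collapses to zero
hist(var_hat)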
