R: how to add an L1 norm line to a plot from glmnet

I'm doing lasso regression, and I want to choose the beta coefficients that best explain my model by using leave-one-out cross-validation.
Here is my code:
library(glmnet)
set.seed(19875)
n=100
p=500
real_p=15
x=matrix(rnorm(n*p), nrow=n, ncol=p)
y=as.matrix(apply(x[, 1:real_p], 1, sum) + rnorm(n))
lasso=glmnet(x,y,alpha = 1)
plot(lasso)
#computing loocv
cvlassofit<-cv.glmnet(x,y, nfolds =n, grouped = FALSE )
plot(cvlassofit)
The first plot shows the beta coefficient paths against the L1 norm.
I then want to add a vertical line to that plot marking the coefficients that give the smallest mean squared error.
In the cross-validation part of the code I get the best lambda, the one with the smallest MSE (mean squared error), and cv.glmnet plots that error against log(lambda).
Now, can I somehow, based on that lambda, get a value for the L1 norm, so that I can add a vertical line to the first plot? Or, instead of log(lambda) in the last plot, could I use the L1 norm?

You can do it as follows:
lambda_min <- cvlassofit$lambda.min                      # lambda with the smallest CV error
## coefficients at lambda.min (newer glmnet versions need x and y again when exact = TRUE);
## the intercept is dropped because plot(lasso) shows the L1 norm of the slopes only
estimates <- as.vector(coef(lasso, s = lambda_min, exact = TRUE, x = x, y = y))[-1]
norm. <- sum(abs(estimates))                             # L1 norm at lambda.min
plot(lasso, xlim = range(0, norm., as.vector(lasso$beta)))
abline(v = norm., col = "red")                           # mark the chosen model
The result is the coefficient-path plot with a red vertical line at the chosen L1 norm.
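To answer the second part of the question: the coefficient matrix lasso$beta gives the L1 norm at every step of the path, so the cross-validation error can also be plotted against the L1 norm instead of log(lambda). A minimal sketch, assuming the lambda sequences of lasso and cvlassofit coincide (they do here, since both were fit on the same x and y with default settings):
l1_norms <- colSums(abs(as.matrix(lasso$beta)))   # L1 norm at each lambda on the path
plot(l1_norms, cvlassofit$cvm, type = "l",
     xlab = "L1 norm", ylab = "Mean squared error (LOOCV)")
abline(v = l1_norms[which.min(cvlassofit$cvm)], col = "red")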

Related

Determine the break point for the slope change in R

I'm trying to implement a "change point" analysis, or a multiphase regression using nls() in R.
Here's some fake data I've made. The formula I want to use to fit the data is:
$y = \beta_0 + \beta_1x + \beta_2\max(0,x-\delta)$
What this is supposed to do is fit the data up to a certain point with a certain intercept and slope ($\beta_0$ and $\beta_1$), then, after a certain x value ($\delta$), augment the slope by $\beta_2$. That's what the whole max thing is about. Before the $\delta$ point, it'll equal 0, and $\beta_2$ will be zeroed out.
So, here's my function to do this:
changePoint <- function(x, b0, slope1, slope2, delta){
  b0 + (x * slope1) + (max(0, x - delta) * slope2)
}
And I try to fit the model this way
nls(y ~ changePoint(x, b0, slope1, slope2, delta),
    data = data,
    start = c(b0 = 50, slope1 = 0, slope2 = 2, delta = 48))
I chose those starting parameters because I know they are the true parameters: I made the data up.
However, I get this error:
Error in nlsModel(formula, mf, start, wts) :
singular gradient matrix at initial parameter estimates
Have I just made unfortunate data? I tried fitting this on real data first, and was getting the same error, and I just figured that my initial starting parameters weren't good enough.
At first I thought the problem might be that max is not vectorized, but that's not it. It does make changePoint a pain to work with, hence the following modification:
changePoint <- function(x, b0, slope1, slope2, delta) {
  b0 + (x * slope1) + (sapply(x - delta, function(t) max(0, t)) * slope2)
}
This R-help mailing list post describes one way in which this error may result: the rhs of the formula is overparameterized, such that changing two parameters in tandem gives the same fit to the data. I can't see how that is true of your model, but maybe it is.
In any case, you can write your own objective function and minimize it. The following function gives the squared error for data points (x,y) and a certain value of the parameters (the weird argument structure of the function is to account for how optim works):
sqerror <- function(par, x, y) {
  sum((y - changePoint(x, par[1], par[2], par[3], par[4]))^2)
}
Then we say:
optim(par = c(50, 0, 2, 48), fn = sqerror, x = x, y = data)
And see:
$par
[1] 54.53436800 -0.09283594 2.07356459 48.00000006
Note that for my fake data (x <- 40:60; data <- changePoint(x, 50, 0, 2, 48) + rnorm(21, 0, 0.5)) there are lots of local minima depending on the initial parameter values you give. I suppose if you wanted to take this seriously you'd call the optimizer many times with random initial parameters and examine the distribution of results, as sketched below.
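A minimal multi-start sketch of that idea, assuming x, data, sqerror and the vectorized changePoint defined above; the random starting ranges are arbitrary guesses:
set.seed(1)
## 25 random starting vectors: (b0, slope1, slope2, delta)
starts <- replicate(25, c(runif(1, 40, 60), rnorm(1), rnorm(1, mean = 2),
                          runif(1, min(x), max(x))))
fits <- lapply(seq_len(ncol(starts)),
               function(j) optim(par = starts[, j], fn = sqerror, x = x, y = data))
errors <- sapply(fits, function(f) f$value)      # final squared error of each run
fits[[which.min(errors)]]$par                    # estimates from the best run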
Just wanted to add that you can do this with many other packages. If you want to get an estimate of uncertainty around the change point (something nls cannot do), try the mcp package.
# Simulate the data
df = data.frame(x = 1:100)
df$y = c(rnorm(20, 50, 5), rnorm(80, 50 + 1.5*(df$x[21:100] - 20), 5))
# Fit the model
model = list(
  y ~ 1,    # Intercept
  ~ 0 + x   # Joined slope
)
library(mcp)
fit = mcp(model, df)
Let's plot it with a prediction interval (green line). The blue density is the posterior distribution for the change point location:
# Plot it
plot(fit, q_predict = TRUE)
You can inspect individual parameters in more detail using plot_pars(fit) and summary(fit).

Method of Moments for the Gamma distribution - histogram and superimposing the PDF

I have this question: 'Model the data in nfsold (nfsold is just a vector containing 150 numbers) as a set of 150 independent observations from a Gamma(lambda, k) distribution. Use the Method of Moments to obtain estimates of k and lambda. Draw a histogram of the data and superimpose the PDF of your fitted gamma distribution as a preliminary check that this distribution matches the observed data.'
This is the code I have written.
#The first moment of each Xi, i = 1,...,n, is E(Xi) = k/lamda.
#The second moment of each Xi is E(Xi^2) = k(k+1)/(lamda)^2
#Since we have to find 2 two things, k and lamda we require 2 moments to do this.
x_bar = mean = sum(nfsold)/150 #This is the first moment
mean
second_moment = sum(nfsold^2)/150
second_moment
#(1/n)(sum xi) = k/lamda
#(1/n)(sum x^2i) = k(k+1)/(lamda)^2
#By solving these because of the methods of moments we get lambda and k.
lamda_hat = (x_bar)/((second_moment)-(x_bar)^2)
lamda_hat
k_hat = (x_bar)^2/ ((second_moment)-(x_bar)^2)
k_hat
independent_observations = dgamma(x,k_hat, rate = lamda_hat)
hist( independent_observations, breaks = 15, prob = TRUE, main="Histogram for the Gamma Distribution of the data in nfsold", xlab="Independent Observations", ylab="P.D.F")
curve(dgamma(x,k_hat, rate =lamda_hat), add=TRUE, col="green")
My problem is that the superimposed curve does not follow my histogram, so I feel like there is something wrong with my code. Could I please have some help with correcting it?
Thanks!
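For reference, the usual check draws a histogram of the raw data and overlays the fitted density, rather than taking a histogram of density values. A minimal sketch, assuming nfsold is the data vector and k_hat, lamda_hat are the estimates computed above:
hist(nfsold, breaks = 15, prob = TRUE,
     main = "nfsold with fitted Gamma density", xlab = "nfsold")
curve(dgamma(x, shape = k_hat, rate = lamda_hat), add = TRUE, col = "green")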

How to run a Monte Carlo simulation from a custom distribution in R

I would like to pull 1000 samples from a custom distribution in R.
I have the following custom distribution:
library(gamlss)
mu <- 1
sigma <- 2
tau <- 3
kappa <- 3
rate <- 1
Rmax <- 20
x <- seq(1, 2e1, 0.01)
points <- Rmax * dexGAUS(x, mu = mu, sigma = sigma, nu = tau) * pgamma(x, shape = kappa, rate = rate)
plot(points ~ x)
How can I randomly sample via Monte Carlo simulation from this distribution?
My first attempt was the following code, which produced a histogram shape I did not expect:
hist(sample(points, 1000), breaks = 51)
This is not what I was looking for as it does not follow the same distribution as the pdf.
If you want a Monte Carlo simulation, you'll need to sample from the distribution a large number of times, not take a large sample one time.
Your object, points, has values that increase as the index increases up to a threshold around 400, level off, and then decrease. That's what plot(points ~ x) shows. It may describe a distribution, but the actual distribution of the values in points is different: a histogram shows how often values fall within each range. You'll notice that the x axis of your histogram corresponds to the y axis of the plot(points ~ x) plot. The actual distribution of the values in points is easy enough to see, and it is similar to what you get when sampling 1000 values at random, without replacement, from an object with 1900 values in it. Here's the distribution of values in points (no simulation required):
hist(points, 100)
I used 100 breaks on purpose so you could see some of the fine details.
Notice the little bump in the upper tail, which you may not be expecting if you want the histogram to look like the plot of the values vs. the index (or some increasing x). It means that there are more values in points around 2 than around 1. Look at how the curve of plot(points ~ x) flattens when the value is around 2, and how steep it is between 0.5 and 1.5. Notice also the large hump at the low end of the histogram, and look at the plot(points ~ x) curve again: most of the values (whether at the low end or the high end of that curve) are close to 0, or at least less than 0.25. If you look at those details, you may be able to convince yourself that the histogram is, in fact, exactly what you should expect :)
If you want a Monte Carlo simulation of a sample from this object, you might try something like:
samples <- replicate(1000, sample(points, 100, replace = TRUE))
If you want to generate data using points as a probability density function, that question has been asked and answered here
Let's define your (not normalized) probability density function as a function:
library(gamlss)
fun <- function(x, mu = 1, sigma = 2, tau = 3, kappa = 3, rate = 1, Rmax = 20) {
  Rmax * dexGAUS(x, mu = mu, sigma = sigma, nu = tau) *
    pgamma(x, shape = kappa, rate = rate)
}
Now one approach is to use some MCMC (Markov chain Monte Carlo) method. For instance,
simMCMC <- function(N, init, fun, ...) {
  out <- numeric(N)
  out[1] <- init
  for (i in 2:N) {
    pr <- out[i - 1] + rnorm(1, ...)                # random-walk proposal
    r <- fun(pr) / fun(out[i - 1])                  # Metropolis acceptance ratio
    out[i] <- ifelse(runif(1) < r, pr, out[i - 1])  # accept or keep the previous draw
  }
  out
}
It starts from the point init and gives N draws. The approach can be improved in many ways, but I'm simply going to start from init = 5, use a burn-in period of 20000 draws, and keep every second draw to reduce the number of repeated values:
d <- tail(simMCMC(20000 + 2000, init = 5, fun = fun), 2000)[c(TRUE, FALSE)]
plot(density(d))
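To check the draws visually, you can overlay the normalized target density on the plot above; a small sketch, assuming fun and d as defined above (the normalizing constant is computed numerically):
Z <- integrate(fun, -Inf, Inf)$value                  # numerical normalizing constant
curve(fun(x) / Z, add = TRUE, col = "red", lty = 2)   # overlay on plot(density(d))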
You invert the ECDF of the distribution:
ecd.points <- ecdf(points)
invecdfpts <- with(environment(ecd.points), approxfun(y, x))
samp.inv.ecd <- function(n = 100) invecdfpts(runif(n))
plot(density(samp.inv.ecd(100)))
plot(density(points))
png(); layout(matrix(1:2, 1))
plot(density(samp.inv.ecd(100)), main = "The Sample")
plot(density(points), main = "The Original")
dev.off()
Here's another way to do it, drawing on "R: Generate data from a probability density distribution" and "How to create a distribution function in R?":
x <- seq(1, 2e1, 0.01)
points <- 20 * dexGAUS(x, mu = 1, sigma = 2, nu = 3) * pgamma(x, shape = 3, rate = 1)
f <- function(x) 20 * dexGAUS(x, mu = 1, sigma = 2, nu = 3) * pgamma(x, shape = 3, rate = 1)
C <- integrate(f, -Inf, Inf)
C$value
# [1] 11.50361
# normalize by C$value
f <- function(x)
  20 * dexGAUS(x, mu = 1, sigma = 2, nu = 3) * pgamma(x, shape = 3, rate = 1) / 11.50361
pdf <- data.frame(x = x, y = f(x))   # discretized density over the grid x
# inverse-CDF sampling by interpolating the cumulative sums
random.points <- approx(cumsum(pdf$y) / sum(pdf$y), pdf$x, runif(10000))$y
hist(random.points, 1000)
hist(random.points * 40, 1000) will give the scaling of your original function.

How to draw $\alpha$ confidence areas on a 2D plot?

There are a lot of answers about plotting confidence intervals.
I'm reading the paper by Lourme, A. et al. (2016) and I'd like to draw the 90% confidence boundary and the 10% exceptional points like in Fig. 2 of the paper.
I can't typeset the definition of the confidence areas here, so I refer to the paper for it; here is my setup:
library("MASS")
library(copula)
set.seed(612)
n <- 1000 # length of sample
d <- 2 # dimension
# random vector with uniform margins on (0,1)
u1 <- runif(n, min = 0, max = 1)
u2 <- runif(n, min = 0, max = 1)
u = matrix(c(u1, u2), ncol=d)
Rg <- cor(u) # d-by-d correlation matrix
Rg1 <- ginv(Rg) # inv. matrix
# round(Rg %*% Rg1, 8) # check
# the multivariate c.d.f of u is a Gaussian copula
# with parameter Rg[1,2]=0.02876654
normal.cop = normalCopula(Rg[1,2], dim=d)
fit.cop = fitCopula(normal.cop, u, method="itau") #fitting
# Rg.hat = fit.cop@estimate[1]
# [1] 0.03097071
sim = rCopula(n, normal.cop) # in (0,1)
# Taking the quantile function of N1(0, 1)
y1 <- qnorm(sim[,1], mean = 0, sd = 1)
y2 <- qnorm(sim[,2], mean = 0, sd = 1)
par(mfrow=c(2,2))
plot(y1, y2, col="red"); abline(v=mean(y1), h=mean(y2))
plot(sim[,1], sim[,2], col="blue")
hist(y1); hist(y2)
Reference: Lourme, A., and F. Maurer (2016). Testing the Gaussian and Student's t copulas in a risk management framework. Economic Modelling.
Question: could anyone explain the variables $v = (v_1, \dots, v_d)$ and $G(v_1), \dots, G(v_d)$ in the equation?
I think v is a non-random matrix; its dimensions should be $k^2$ (grid points) by $d = 2$ (dimensions). For example,
axis_x <- seq(0, 1, 0.1) # 11 grid points
axis_y <- seq(0, 1, 0.1) # 11 grid points
v <- expand.grid(axis_x, axis_y)
plot(v, type = "p")
So, your question is about the vector nu and the corresponding G(nu).
nu is simply a random vector drawn from any distribution with domain (0,1) (here I use the uniform distribution). Since you want your samples in 2D, a single nu can be nu = runif(2). Given the explanations above, G is the Gaussian quantile function qnorm with mean 0 and sd 1, applied componentwise, and Rg is the copula's correlation matrix (2x2 in 2D).
Now what the paragraph says: if you have a random sample nu and you want to check whether it falls inside the confidence area Gamma for dimension d and confidence level alpha, you compute the statistic (G(nu) %*% Rg^-1) %*% G(nu) and check that it is below the alpha quantile of the Chi^2 distribution with d degrees of freedom.
For example:
# This is the copula parameter
Rg <- matrix(c(1,runif(2),1), ncol = 2)
# But we need to compute the inverse for sampling
Rginv <- MASS::ginv(Rg)
sampleResult <- replicate(10000, {
  # we draw our nu from the uniform distribution, but others that map to (0,1), e.g. beta, are possible too
  nu <- runif(2)
  # we compute G(nu), the Gaussian quantile transform of the sample
  Gnu <- qnorm(nu, mean = 0, sd = 1)
  # then we compute the statistic as given in the formula
  stat <- (Gnu %*% Rginv) %*% Gnu
  # and return the result
  list(nu = nu, Gnu = Gnu, stat = stat)
})
theSamples <- sapply(sampleResult["nu",], identity)
# this is the critical value of the Chi^2 with alpha = 0.95 and df = number of dimensions
# old and buggy threshold <- pchisq(0.95, df = 2)
# new and awesome - we are looking for the statistic at alpha = .95 quantile
threshold <- qchisq(0.95, df = 2)
# we can accept samples given the threshold (like in equation)
inArea <- sapply(sampleResult["stat",], identity) < threshold
plot(t(theSamples), col = as.integer(inArea)+1)
The red points are the points you would keep (I plot all points here).
As for drawing the decision boundaries, I think it is a little bit more complicated, since you need to compute the exact pairs of nu for which (Gnu %*% Rginv) %*% Gnu == qchisq(alpha, df = 2). You solve this for Gnu and then apply the inverse transform to get the nu values on the decision boundary.
edit: Reading the paragraph again, I noticed that the parameter for Gnu does not change; it is simply Gnu <- qnorm(nu, mean = 0, sd = 1).
edit: There was a bug: for the threshold you need to use the quantile function qchisq instead of the distribution function pchisq. This is now corrected in the code above (and the figures updated).
This has two parts: first, compute the copula value as a function of X and Y; then, plot the curve giving the boundary where the copula exceeds the threshold.
Computing the value is basically linear algebra, which @drey has answered. This is a rewritten version so that the copula is given by a function.
cop1 <- function(x)
{
  Gnu <- qnorm(x)
  Gnu %*% Rginv %*% Gnu
}

copula <- function(x)
{
  apply(x, 1, cop1)
}
Plotting the boundary curve can be done using the same method as here (which in turn is the method used by the textbooks Modern Applied Stats with S, and Elements of Stat Learning). Create a grid of values, and use interpolation to find the contour line at the given height.
Rg <- matrix(c(1,runif(2),1), ncol = 2)
Rginv <- MASS::ginv(Rg)
# draw the contour line where value == threshold
# define a grid of values first: avoid x and y = 0 and 1, where infinities exist
xlim <- 1e-3
delta <- 1e-3
xseq <- seq(xlim, 1-xlim, by=delta)
grid <- expand.grid(x=xseq, y=xseq)
prob.grid <- copula(grid)
threshold <- qchisq(0.95, df=2)
contour(x=xseq, y=xseq, z=matrix(prob.grid, nrow=length(xseq)), levels=threshold,
col="grey", drawlabels=FALSE, lwd=2)
# add some points
data <- data.frame(x=runif(1000), y=runif(1000))
points(data, col=ifelse(copula(data) < threshold, "red", "black"))

Obtain standardised residuals and a "Residuals vs. Fitted" plot for an "mlm" object from `lm()`

set.seed(0)
## 2 responses of 10 observations each
response <- matrix(rnorm(20), 10, 2)
## 3 covariates with 10 observations each
predictors <- matrix(rnorm(30), 10, 3)
fit <- lm(response ~ predictors)
I have been generating residual plots for the entire model using:
plot(fitted(fit),residuals(fit))
However, I would like to make individual plots for each predictor covariate. I can do them one at a time by:
f <- fitted(fit)
r <- residuals(fit)
plot(f[,1],r[,1])
The issue with this approach, however, is that it needs to generalize to data sets with more predictor covariates. Is there a way to use plot while iterating through each column of f and r? Or is there a way for plot() to group each covariate by colour?
Make sure you are using standardised residuals rather than raw residuals
I often see plot(fitted(fit), residuals(fit)), but it is statistically wrong. We use plot(fit) to generate diagnostic plots because we need standardised residuals rather than raw ones. Read ?plot.lm for more. But the plot method for "mlm" is poorly supported:
plot(fit)
# Error: 'plot.mlm' is not implemented yet
Define "rstandard" S3 method for "mlm"
plot.mlm is not supported for many reasons, one of which is the lack of rstandard.mlm. For the "lm" and "glm" classes, the generic rstandard has S3 methods to get standardised residuals:
methods(rstandard)
# [1] rstandard.glm* rstandard.lm*
There is no support for "mlm". So we shall fill this gap first.
It is not difficult to get standardised residuals. Let hii be the diagonal of the hat matrix; the point-wise estimated standard error of the residuals is sqrt(1 - hii) * sigma, where sigma = sqrt(RSS / df.residual) is the estimated residual standard error, RSS is the residual sum of squares, and df.residual is the residual degrees of freedom.
hii can be computed from the Q factor of the QR factorization of the model matrix: hii = rowSums(Q ^ 2). For "mlm" there is only one QR decomposition, since the model matrix is the same for all responses, hence we only need to compute hii once.
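For the example fit above you can check this directly; a small sketch (fit$qr is the QR decomposition that lm() stores by default):
Q <- qr.Q(fit$qr)[, seq_len(fit$rank), drop = FALSE]  # first `rank` columns of Q
hii <- rowSums(Q ^ 2)                                 # hat diagonals, all between 0 and 1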
Each response has its own sigma; the per-response residual sums of squares are simply colSums(residuals(fit) ^ 2), and dividing by df.residual(fit) and taking the square root gives all the sigmas at once.
Now, let's wrap up those ideas to get our own "rstandard" method for "mlm":
## define our own "rstandard" method for "mlm" class
rstandard.mlm <- function (model) {
  Q <- with(model, qr.qy(qr, diag(1, nrow = nrow(qr$qr), ncol = qr$rank)))  ## Q matrix
  hii <- rowSums(Q ^ 2)                        ## diagonal of the hat matrix QQ'
  RSS <- colSums(model$residuals ^ 2)          ## residual sums of squares (for each model)
  sigma <- sqrt(RSS / model$df.residual)       ## Pearson estimate of the residual standard error (for each model)
  pointwise_sd <- outer(sqrt(1 - hii), sigma)  ## point-wise residual standard error (for each model)
  model$residuals / pointwise_sd               ## standardised residuals
}
Note the use of .mlm in the function name: it tells R that this is the S3 method for the "mlm" class. Once we have defined this function, it shows up among the "rstandard" methods:
## now there are method for "mlm"
methods(rstandard)
# [1] rstandard.glm* rstandard.lm* rstandard.mlm
To call this function, we don't have to explicitly call rstandard.mlm; calling rstandard is enough:
## test with your fitted model `fit`
rstandard(fit)
# [,1] [,2]
#1 1.56221865 2.6593505
#2 -0.98791320 -1.9344546
#3 0.06042529 -0.4858276
#4 0.18713629 2.9814135
#5 0.11277397 1.4336484
#6 -0.74289985 -2.4452868
#7 0.03690363 0.7015916
#8 -1.58940448 -1.2850961
#9 0.38504435 1.3907223
#10 1.34618139 -1.5900891
Standardised residuals are approximately N(0, 1) distributed.
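As a quick sanity check (a sketch, not from the original answer): each column should agree with the built-in rstandard() applied to a separate single-response fit, because an "mlm" fit is just a set of per-response OLS fits sharing one model matrix:
fit1 <- lm(response[, 1] ~ predictors)   # refit the first response on its own
all.equal(unname(rstandard(fit1)), unname(rstandard(fit)[, 1]))   # should be TRUE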
Getting a residuals vs. fitted plot for "mlm"
Your initial try with:
f <- fitted(fit); r <- rstandard(fit); plot(f, r)
is not a bad idea, provided that dots for different models can be identified from each other. So we can try using different point colours for different models:
plot(f, r, col = as.numeric(col(f)), pch = 19)
Graphical arguments like col, pch and cex can take vector input. I ask plot to use col = j for r[, j] ~ f[, j], where j = 1, 2, ..., ncol(f). Read the "Color Specification" section of ?par for what col = j means. pch = 19 tells plot to draw solid dots. Read the basic graphical parameters for various choices.
Finally you may want a legend. You can do
plot(f, r, col = as.numeric(col(f)), pch = 19, ylim = c(-3, 4))
legend("topleft", legend = paste0("response ", 1:ncol(f)), pch = 19,
col = 1:ncol(f), text.col = 1:ncol(f))
In order to leave space for the legend box we extend ylim a little bit. As standardised residuals are N(0,1), ylim = c(-3, 3) is a good range. Should we want to place the legend box on the top left, we extend ylim to c(-3, 4). You can customize your legend even more via ncol, title, etc.
How many responses do you have?
If you have no more than a few responses, the suggestion above works nicely. If you have plenty, plotting them in separate plots is suggested. A for loop, as you found out, is decent, except that you need to split the plotting region into subplots, possibly using par(mfrow = c(?, ?)). Also set the inner margin mar and the outer margin oma if you take this approach. You may read "How to produce a nicer plot for my categorical time series data in a matrix?" for one example of doing this.
If you have even more responses, you might want a mixture of both. Say you have 42 responses: you can do par(mfrow = c(2, 3)), then plot 7 responses in each subfigure. Now the solution is more opinion based. A sketch of the subplot approach follows.
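A minimal sketch of that subplot approach, assuming f and r are the fitted values and standardised residuals from above:
op <- par(mfrow = c(1, ncol(f)), mar = c(4, 4, 2, 1))
for (j in seq_len(ncol(f))) {
  plot(f[, j], r[, j], pch = 19,
       xlab = "Fitted values", ylab = "Standardised residuals",
       main = paste("Response", j))
  abline(h = 0, lty = 2)
}
par(op)   # restore the previous graphical parameters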
This is how I solved it.
for (i in 1:ncol(f)) {
  plot(f[, i], r[, i])
}
Mind blown.
