How does glmnet standardize variables when weights are present?

glmnet allows the user to input a vector of observation weights through the weights argument. glmnet also standardizes (by default) the predictor variables to have zero mean and unit variance. My question is: when weights is provided, does glmnet standardize the predictors using the weighted mean (and standard deviation) of each column or the unweighted mean (and standard deviation)?

There's a description of glmnet's standardization at Link
In the post you can see the Fortran code snippet from glmnet's source that computes the standardization (the "Proof" paragraph, second bullet).
I'm not familiar with Fortran, but to me it looks very much like it is in fact using the weighted mean and SD.
Edit: From the glmnet vignette:
"weights is for the observation weights. Default is 1 for each
observation. (Note: glmnet rescales the weights to sum to N, the
sample size.)"
With w in the Fortran code being the rescaled weights, this seems to be consistent with weighted mean standardization.
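One way to check this reading numerically (a sketch added here, not from the original post: it assumes glmnet rescales the weights to sum to n, standardizes each column by its weighted mean and weighted SD with a 1/n-type denominator, and reports coefficients back on the original scale):
library(glmnet)
set.seed(42)
n <- 100; p <- 3
x <- matrix(rnorm(n * p), n)
y <- rnorm(n)
w <- runif(n)
ws <- w / sum(w) * n                                  # weights rescaled to sum to n
mu <- colSums(ws * x) / n                             # weighted column means
sdv <- sqrt(colSums(ws * sweep(x, 2, mu)^2) / n)      # weighted SDs (1/n denominator)
xs <- sweep(sweep(x, 2, mu), 2, sdv, "/")             # manually standardized predictors
f1 <- glmnet(x,  y, weights = w, standardize = TRUE)  # glmnet's internal standardization
f2 <- glmnet(xs, y, weights = w, standardize = FALSE) # pre-standardized by hand
# If the reading above is right, the two fits agree after undoing the scaling:
all.equal(as.numeric(coef(f1, s = 0.1))[-1],
          as.numeric(coef(f2, s = 0.1))[-1] / sdv)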

For what it's worth, consistent with the accepted answer, the weights in glmnet are sampling weights, and not inverse variance weights. For example if you have many more observations than unique observations, you can compress your dataset and get the same coefficient estimates:
library(glmnet)
set.seed(1)
n <- 50
m <- 5
y_norm <- rnorm(n)
y_bool <- rbinom(n, 1, .5)
x <- matrix(rnorm(n * m), n)
w <- rpois(n, 3) + 1          # integer weights
w_indx <- rep(1:n, times = w) # index that expands each row w[i] times
# Gaussian case: weighted fit vs. fit on the expanded data
m1 <- glmnet(x, y_norm, weights = w)
m2 <- glmnet(x[w_indx, ], y_norm[w_indx])
all.equal(coef(m1, s = .1), coef(m2, s = .1))
#> [1] TRUE
# Binomial case
M1 <- glmnet(x, y_bool, weights = w, family = "binomial")
M2 <- glmnet(x[w_indx, ], y_bool[w_indx], family = "binomial")
all.equal(coef(M1, s = .1), coef(M2, s = .1))
#> [1] TRUE
Of course, a bit more care needs to be used when using weights with cv.glmnet since the weights of aggregated records should be spread across folds using a multinomial distribution...
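For what that caveat could look like in code, here is a hypothetical sketch (not from the original answer): each aggregated record's weight is split across K folds with a multinomial draw, the record is duplicated once per fold that receives weight, and the resulting foldid is passed to cv.glmnet. It reuses x, y_norm and w from the example above.
K <- 10
set.seed(2)
split_w <- t(sapply(w, function(wi) rmultinom(1, wi, rep(1 / K, K)))) # n x K weight counts
keep <- which(split_w > 0, arr.ind = TRUE)      # (record, fold) pairs that received weight
x_cv <- x[keep[, "row"], ]
y_cv <- y_norm[keep[, "row"]]
w_cv <- split_w[keep]                           # weight assigned to each duplicated row
foldid <- keep[, "col"]                         # fold membership of each duplicated row
cvfit <- cv.glmnet(x_cv, y_cv, weights = w_cv, foldid = foldid)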

Simulating a likelihood ratio test (LRT) p-value using the Monte Carlo method

I'm trying to figure out my assignment: simulating LRT p-values using the Monte Carlo method. As far as I understand, the LRT is supposed to test which model is the "better", more accurate one.
I know how to perform such a test:
library(lmtest)
nested <- glm(finalgrade ~ absences, data = grades)
complex <- glm(finalgrade ~ absences + age, data = grades)
lrtest(nested, complex)
From there I can extract the p-value and compute quantities like type I and type II error rates or the power of the test, and see how they change depending on the number of simulations.
My question is how I am supposed to simulate the random data. It doesn't have to be grades or school-related data; that was just to show my understanding.
I was thinking about making a data frame with 3 to 4 columns, with one column being a binary dependent variable (0/1) and the rest being random numbers generated from a normal distribution or some other distribution.
But I don't know if this approach will create understandable results, or if this even makes sense.
I looked at this function, but it didn't really help me understand anything.
I came up with something like this:
library(lmtest)
n <- 1000
depentend <- sample(c(0, 1), replace = TRUE, size = n)
pvalue <- c()
for (i in 1:1000) {
  independend_x <- rnorm(n, mean = 2, sd = 0.2)
  independend_y <- rnorm(n, mean = 7, sd = 0.5)
  nested <- lm(depentend ~ independend_x)
  complex <- lm(depentend ~ independend_x + independend_y)
  # p-value of the LRT comparing the two models
  pvalue <- c(pvalue, lrtest(nested, complex)[["Pr(>Chisq)"]][2])
}
but I don't know if this is the right direction.
I would be really thankful if someone could help me to understand how to simulate data for the Monte Carlo sampling method.
Monte Carlo simulations are performed to compute a distribution of something that is difficult to compute or for which one is too lazy to perform the exact computation.
The likelihood ratio test computes a p-value based on the distribution of the likelihood ratio $\Lambda$, and that distribution is what you want to simulate instead of compute or estimate with formulas. The trick is to use simulation instead of computation.
Your problem does not seem to be so much how to perform the simulations, but rather what distribution you are interested in and want to simulate, and what boundary conditions you need to fix. Which computation or estimation is it that you want to replace/estimate with simulation?
For your likelihood ratio test you probably want to test the hypothesis $H_0: \theta_{age} = 0$ against the alternative hypothesis $H_a: \theta_{age} \neq 0$. In this case you compute the ratio of the likelihood $\mathcal{L}$ where one of the hypotheses is a composite hypothesis and you select the highest likelihood among them.
$$\Lambda = \frac{\mathcal{L}(\theta_{age} = 0 \mid \text{some data})}{\sup_{\theta_{age} \neq 0}\mathcal{L}(\theta_{age} \mid \text{some data})} = \frac{\mathcal{L}(\theta_{age} = 0 \mid \text{some data})}{\mathcal{L}(\hat\theta_{age} \mid \text{some data})}$$ where the supremum is found by evaluating the likelihood at the maximum likelihood estimator $\hat\theta_{age}$.
To compute these likelihood functions you need assumptions about the distributions. In your case you do this with glm (where you need to decide on a distribution and link function) or, more simply, lm (which assumes a Gaussian conditional distribution for the data).
The simulations are then computed for a given null hypothesis. For instance, given some data, you assume that $\theta_{age} = 0$ and you want to compute the distribution of the outcomes of $\Lambda$. You need some more data and parameters:
The independent variables. These you probably want to fix at some values that relate to your practical problem, since you want to know the distribution given some independent variables. Potentially you may wish to study what happens when there is an error in these independent variables; in that case you may also simulate them.
The variance/dispersion/noise level of the conditional distribution. This you may vary to see how it influences the statistic, or you may have some value of interest, for instance if you have data from which you estimated the noise.
The other coefficients. These you may likewise vary or keep fixed depending on the situation, whether you want to model a particular situation or a wider range of situations.
Example
The code below computes a simulation for a given regressor matrix (the independent variables) and given other coefficients. For large sample size the distribution will approach a chi-squared distribution. The simulation shows that using that limit as an estimate for the distribution underestimates the p-value by a lot.
(I ran the code with only 5000 simulations because I am using an online R editor and compiler; on your own computer you can get more precise results.)
n_sim = 5*10^3
### simulate likelihood ratio test
### given coefficient and independent variables
### we assume a logistic model with binomial distribution
sim = function(theta1, X) {
### compute model
Z = X %*% theta1
p = 1/(1+exp(-Z))
### simulate dependent variable
Y = rbinom(length(p), 1, p)
### compute (log)likelihood ratio
mod1 = glm(Y ~ 1 + X[,2] + X[,3], family = binomial)
mod0 = glm(Y ~ 1 + X[,2], family = binomial)
logratio = -2*(logLik(mod0)-logLik(mod1))
return(as.numeric(logratio))
}
set.seed(1)
n = 10
### coefficients with the last one zero
theta1 = c(1,1,0)
### some regressor matrix, independent variables
X = cbind(rep(1,n), matrix(rnorm(n*2),n)) ### first column is intercept
### simulate
Lsim = replicate(n_sim,sim(theta1,X))
### ordering for empirical distribution
Lsim = Lsim[order(Lsim)]
perc = c(1:length(Lsim))/length(Lsim)
plot(Lsim, 1-perc, main = "empirical distribution", ylab = "P(likelihood > L)", xlab = "L", type = "l")
lines(qchisq(perc,1),1-perc, lty = 2)
legend(8,1, c("n=10","n=40", "chi-squared estimate"), lty = c(1,1,2), col = c(1,2,1))
#### repeat with larger n
set.seed(1)
n = 40
theta1 = c(1,1,0)
X = cbind(rep(1,n), matrix(rnorm(n*2),n))
Lsim2 = replicate(n_sim,sim(theta1,X))
Lsim2 = Lsim2[order(Lsim2)]
lines(Lsim2, 1-perc, col = 2)
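As an illustrative follow-up (not part of the original answer): once the null distribution has been simulated, a Monte Carlo p-value for an observed likelihood ratio statistic is simply the fraction of simulated statistics that are at least as large. The observed value below is hypothetical.
L_obs <- 3.2                                        # hypothetical observed -2*log(Lambda)
p_mc  <- mean(Lsim2 >= L_obs)                       # simulated p-value (n = 40 case)
p_chi <- pchisq(L_obs, df = 1, lower.tail = FALSE)  # chi-squared approximation
c(simulated = p_mc, chisq = p_chi)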
Note that there are many variants and this is just an example of what simulation does. Here we simulate data based on a given distribution. (And it replaces a computation that we could not perform: we had an estimate based on a chi-squared distribution, but that is not accurate for small $n$.)
Other times this distribution is not known, and one uses the data together with some resampling method to simulate/estimate the distribution of the statistic.
For your situation you need to figure out what exact computation (for which information/conditions are given) it is that you want to replace by using simulations.

How to find the RMSE value? And what is a good RMSE value?

I am doing forecasting of electrical power output, and I have different sets of data that vary from 200 to 4000 observations. I have computed the forecasts, but I do not know how to calculate the RMSE value and R (the correlation coefficient) in R. I tried to calculate it in Excel and the result for RMSE was 0.0078. So I basically have two questions here.
How do I calculate the RMSE and R value in R?
What is a good RMSE value? Is 0.007 a reasonably good value?
Here are two functions: one computes the MSE, and the second calls the first and takes the square root to give the RMSE.
These functions accept a fitted model, not a data set, for instance the output of lm, glm, and many others.
mse <- function(x, na.rm = TRUE, ...){
  e <- resid(x)
  mean(e^2, na.rm = na.rm)
}
rmse <- function(x, ...) sqrt(mse(x, ...))
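For example (an illustrative fit on the built-in mtcars data, not part of the original answer), these functions can be applied to an lm object, and cor() gives the correlation coefficient R asked about in the question:
fit <- lm(mpg ~ wt, data = mtcars)   # any fitted model with a resid() method works
rmse(fit)                            # RMSE of the fitted model
cor(fitted(fit), mtcars$mpg)         # correlation R between fitted and observed values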
Like I said in a comment to the question, a value is not good on its own; it's good when compared to values obtained from other fitted models.
Root Mean Square Error (RMSE) is the standard deviation of the prediction errors. Prediction errors (residuals) measure how far from the regression line the data points are; RMSE is a measure of how spread out these residuals are. In other words, it tells you how concentrated the data are around the line of best fit. Root mean square error is commonly used in climatology, forecasting, and regression analysis to verify experimental results.
The formula is:
$$\text{RMSE} = \sqrt{\overline{(f - o)^2}}$$
where:
f = forecasts (expected values or unknown results),
o = observed values (known results).
The bar above the squared differences denotes the mean (similar to $\bar{x}$). The same formula can be written with the following, slightly different, notation:
$$\text{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (z_{f_i} - z_{o_i})^2}$$
where:
$\Sigma$ = summation ("add up"),
$(z_{f_i} - z_{o_i})^2$ = the squared differences,
$N$ = the sample size.
You can use whichever form you want, as both express the same thing. The "R" you are referring to is the Pearson correlation coefficient, which reflects how much of the variation in the data is shared between the forecasts and the observations.
Coming to question 2: a good RMSE value always depends on the range (upper and lower bounds) of your data; a smaller value is better, since it indicates less prediction error.
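As an illustrative translation of the formula above into R (the vector names forecast and observed are hypothetical; they must be numeric vectors of equal length):
rmse_val <- sqrt(mean((forecast - observed)^2))   # root mean squared error
r_val    <- cor(forecast, observed)               # Pearson correlation coefficient R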

Estimating bias in linear regression and linear mixed model in R simulation

I want to run simulations to estimate the bias in a linear model and a linear mixed model. The bias is $E(\hat\beta) - \beta$, where $\beta$ is the association between my X and Y.
I generated my X variable from a normal distribution and Y from a multivariate normal distribution.
I understand how I can calculate $E(\hat\beta)$ from simulations: it is the sum of the beta estimates from all simulations divided by the total number of simulations. But I am not sure how I can estimate the true beta.
meanY <- meanY + X*betaV
This is how I generated meanY (betaV is the effect size), which is then used to generate the multivariate Y outcome, as shown below.
Y[jj,] <- rnorm(nRep, mean=meanY[jj], sd=sqrt(varY))
From my limited understanding, the true beta is not obtained from the data but from the simulation setting, where I set a fixed beta value.
Based on how I generated my data, how can I estimate the true beta?
There are a couple of methods for simulating bias. I'll take an easy example using a linear model. A linear mixed model could likely use a similar approach; however, I am not certain it would work as well for a generalized linear mixed model.
A simple method for estimating bias, when working with a simple linear model, is to 'choose' the model from which to estimate the bias. Let's say, for example, Y = 3 + 4 * X + e. I have chosen beta <- c(3, 4), and as such I only need to simulate my data. For a linear model, the model assumptions are:
Observations are independent
Observations are normally distributed
The mean can be described by the linear predictor
Using these 3 assumptions, simulating a fixed design is simple.
set.seed(1)
xseq <- seq(-10, 10)
xlen <- length(xseq)
nrep <- 100
# Simulate X from a uniform (flat) distribution; a normal distribution would likely work fine as well
X <- sample(xseq, size = xlen * nrep, replace = TRUE)
beta <- c(3, 4)
esd <- 1
emu <- 0
e <- rnorm(xlen * nrep, emu, esd)
Y <- cbind(1, X) %*% beta + e
fit <- lm(Y ~ X)
bias <- coef(fit) - beta
bias
#> (Intercept)            X
#> 0.0121017239 0.0001369908
which indicates a small bias. To test whether this bias is significant, we could perform a Wald test or t-test (or replicate the process 1000 times and check the distribution of outcomes).
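For the first option, here is a rough sketch (an illustration added here, not from the original answer) of a Wald-type check that the estimates do not differ significantly from the true values:
se <- coef(summary(fit))[, "Std. Error"]
tval <- (coef(fit) - beta) / se                                # t statistics for H0: estimate = true value
2 * pt(abs(tval), df = df.residual(fit), lower.tail = FALSE)   # two-sided p-values
The second option, replicating the whole process, is what the code below does.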
# Simulate the linear model many times
model_frame <- cbind(1, X)
emany <- matrix(rnorm(xlen * nrep * 1000, emu, esd), ncol = 1000)
# Add simulated noise: sweep adds X %*% beta across all columns of emany
Ymany <- sweep(emany, 1, model_frame %*% beta, "+")
# Fit all 1000 models simultaneously (lm accepts a matrix response, one column per simulation)
manyFits <- lm(Ymany ~ X)
# Plot density of fitted parameters
par(mfrow = c(1, 2))
plot(density(coef(manyFits)[1, ]), main = "Density of intercept")
plot(density(coef(manyFits)[2, ]), main = "Density of beta")
# Calculate bias; sweep subtracts beta across all rows of the coefficient matrix
biasOfMany <- rowMeans(sweep(coef(manyFits), 1, beta, "-"))
biasOfMany
#> (Intercept)             X
#> 5.896473e-06 -1.710337e-04
Here we see that the bias is reduced quite a bit, and it has changed sign for the X coefficient, giving reason to believe the bias is insignificant.
Changing the design would allow one to look into bias of interactions, outliers and other stuff using the same method.
For linear mixed models one could use the same method; however, here you would have to design the random variables, which requires some more work, and the implementation of lmer, as far as I know, does not fit a model across all columns of Y.
However, b (the random effects) could be simulated, and so could any noise parameters. Do note, however, that since b is a single vector containing a single outcome of the simulation (often from a multivariate normal distribution), one would have to re-run the model for each simulation of b. Basically, this increases the number of times one has to re-run the model-fitting procedure in order to get a good estimate of the bias.
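A minimal sketch of that mixed-model case, assuming the lme4 package; the group structure, variance components, and number of replications below are illustrative choices, not values from the original answer:
library(lme4)
set.seed(1)
n_group <- 20                    # number of groups
n_per <- 10                      # observations per group
beta <- c(3, 4)                  # true fixed effects: intercept and slope
sd_b <- 1                        # SD of the random intercepts
sd_e <- 1                        # residual SD
n_rep <- 200                     # number of simulations (increase for a better estimate)
group <- rep(seq_len(n_group), each = n_per)
X <- rnorm(n_group * n_per)
one_fit <- function() {
  b <- rnorm(n_group, 0, sd_b)                     # new draw of the random effects
  y <- beta[1] + beta[2] * X + b[group] + rnorm(length(X), 0, sd_e)
  fixef(lmer(y ~ X + (1 | group)))                 # refit the model for this draw of b
}
est <- replicate(n_rep, one_fit())
rowMeans(est) - beta                               # estimated bias of the fixed effects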

R glmnet force coefficients to sum up to 1

I was wondering whether it is possible, in the R package glmnet, to force the coefficients to sum to 1, as if those coefficients were weights in [0, 1] on the individual predictors.
I figured out how to force the coefficients to be in [0, 1] using:
cvfit <- cv.glmnet(X,y, lower.limits=rep(0,ncol(X)),
upper.limits=rep(1,ncol(X)))
And I figured out how to force the intercept to be zero using:
cvfit <- cv.glmnet(X,y, lower.limits=rep(0,ncol(X)),
upper.limits=rep(1,ncol(X)), intercept=FALSE)
But I don't know how to make the coefficients sum to 1.
Thanks!
All the best,
Kathy

Residuals and plots in ordered multinomial regression

I need to plot a binned residual plot of fitted versus residual values from an ordered multinomial logit regression.
How can I extract residuals when using polr? Is there any other function that runs an ordered multinomial logit from which residuals can be extracted?
This is the code I used:
library(MASS)  # for polr
library(arm)   # for binnedplot
options(contrasts = c("contr.treatment", "contr.poly"))
mod1 <- polr(as.ordered(y) ~ x1 + x2 + x3, data, method = 'logistic')
fit <- mod1$fitted.values
res <- residuals(mod1)
binnedplot(fit, res)
The problem is that the object 'res' is NULL.
Thanks
For a start, can you tell us how residuals would be defined in principle for a model with categorical responses? fitted.values is a matrix of probabilities. You could define residuals in terms of correct prediction (defining the most likely outcome as the prediction, as in the default predict method for polr objects) -- or you could compute an n-by-n table of true values and predicted values. Alternatively you could reduce the ordinal data back to an integer scale and compute a mean outcome as the prediction ... but I can't see that there's any unique way to define the residuals in the first place.
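For instance, the first of those definitions (take the most likely category as the prediction) can be tabulated against the observed categories; the housing data from MASS is used here purely as a stand-in example, since the question's data are not shown:
library(MASS)
fit <- polr(Sat ~ Infl + Type + Cont, weights = Freq, data = housing)
table(observed = housing$Sat, predicted = predict(fit))   # cross-table of observed vs. predicted categories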
polr() does not provide a function that returns residuals; you would have to calculate them manually from your chosen definition.
There are actually plenty of ways to get residuals from an ordinal probit/logit model. Although polr does not provide any residuals, vglm provides several. See ?residualsvglm in the VGAM package (see also below).
NOTE: However, for a Control Function/2SRI approach, Wooldridge (2014) suggests using the generalised residuals as described in Vella (1993). As far as I know these are currently not available in R, although I am working on that, but they are available in Stata (using predict gr, score).
Residuals in VGLM
Surrogate residuals for polr
You can use the sure package (link) to calculate surrogate residuals with its resids function. The package is based on this paper in the Journal of the American Statistical Association.
library(sure) # for the resids function and the sample data sets df1, df2, df3
library(MASS) # for the polr function
# combine the sample data sets into one data frame with three predictors
df1 <- df1
df1$x1 <- df1$x
df1$x <- NULL
df1$y <- df2$y
df1$x2 <- df2$x
df1$x3 <- df3$x
options(contrasts = c("contr.treatment", "contr.poly"))
mod1 <- polr(as.ordered(y) ~ x1 + x2 + x3, data = df1, method = 'probit')
fit <- mod1$fitted.values
res <- resids(mod1)
EDIT: One big issue is the following (from ?resids):
"Note: Surrogate residuals require sampling from a continuous distribution; consequently, the result will be different with every call to resids. The internal functions used for sampling from truncated distributions when method = "latent" are based on modified versions of rtrunc and qtrunc."
Even when running resids(mod1, nsim = 1000, method = "latent"), the results do not converge across calls.
