Is it possible, in the R package glmnet, to force the coefficients to sum to 1, as if they were weights in [0,1] on the individual predictors?
I figured out how to force the coefficients to lie in [0,1] using:
cvfit <- cv.glmnet(X,y, lower.limits=rep(0,ncol(X)),
upper.limits=rep(1,ncol(X)))
And I figured out how to force the intercept to be zero using:
cvfit <- cv.glmnet(X,y, lower.limits=rep(0,ncol(X)),
upper.limits=rep(1,ncol(X)), intercept=FALSE)
But I don't know how to make the coefficients sum to 1.
Thanks!
All the best,
Kathy
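(For reference: glmnet has no argument for linear equality constraints, so a sum-to-one constraint cannot be expressed in cv.glmnet directly. If the lasso penalty is not essential, one sketch is to solve the box- and simplex-constrained least-squares problem with quadprog instead. Everything below is illustrative, not a glmnet feature; X and y are assumed to be in scope as in the question.)
library(quadprog)
# minimize ||y - X b||^2  subject to  sum(b) = 1  and  0 <= b <= 1
# (assumes X'X is positive definite, i.e. full-rank X with more rows than columns)
p <- ncol(X)
Dmat <- crossprod(X)                          # X'X
dvec <- drop(crossprod(X, y))                 # X'y
Amat <- cbind(rep(1, p), diag(p), -diag(p))   # columns: sum(b) = 1, b >= 0, -b >= -1
bvec <- c(1, rep(0, p), rep(-1, p))
fit <- solve.QP(Dmat, dvec, Amat, bvec, meq = 1)  # meq = 1: first constraint is an equality
fit$solution                                  # the constrained coefficients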
I'm trying to switch from lm() to the faster lm.fit() in order to calculate r² values from large matrices more quickly. (I don't think I can use cor() when x is a matrix, per Function to calculate R2 (R-squared) in R.)
Why do lm() and lm.fit() calculate different fitted values and residuals?
set.seed(0)
x <- matrix(runif(50), 10)
y <- 1:10
lm(y ~ x)$residuals
lm.fit(x, y)$residuals
I wasn't able to penetrate the lm() source code to figure out what could be contributing to the difference...
From ?lm.fit, x "should be design matrix of dimension n * p", where p is the number of coefficients. lm(), by contrast, builds the intercept column for you. You therefore have to pass a column of ones to lm.fit() yourself to get the same model.
Thus estimating
lm.fit(cbind(1,x), y)
will give the same parameters as
lm(y ~ x)
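A quick empirical check with the data from the question (the unname() calls are only there because the two functions name the coefficients differently):
set.seed(0)
x <- matrix(runif(50), 10)
y <- 1:10
all.equal(unname(coef(lm(y ~ x))),
          unname(lm.fit(cbind(1, x), y)$coefficients))
# [1] TRUE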
I am calculating a multivariate OLS regression in R, and I know the residuals are autocorrelated. I know I can use a Newey-West correction when performing a t-test of whether one of the coefficients is zero. I can do that using:
require(lmtest)    # coeftest() lives in lmtest, not sandwich
require(sandwich)  # NeweyWest()
model <- lm(y ~ x1 + x2)
coeftest(model, vcov = NeweyWest(model))
where y is the variable to regress and x1 and x2 are the predictors. This seems a good approach since my sample size is large.
But what if I want to run an F-test of whether the coefficient of x1 is 1 and the coefficient of x2 is zero simultaneously? I cannot find a way to do that in R while accounting for the autocorrelation of the residuals. For instance, if I use the function linearHypothesis in R, it seems that Newey-West cannot be used as an argument of vcov. Any suggestions? An alternative would be to bootstrap a confidence ellipse for my point (1,0), but I was hoping to use an F-test if possible. Thank you!
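(One possibility, not from the original post: car::linearHypothesis does accept a covariance matrix through its vcov. argument, at least in recent versions of car, so a Newey-West joint test can be sketched as follows; y, x1, and x2 are assumed to be in scope.)
library(car)       # linearHypothesis()
library(sandwich)  # NeweyWest()
model <- lm(y ~ x1 + x2)
# joint test of H0: coefficient on x1 equals 1 AND coefficient on x2 equals 0,
# using the Newey-West covariance matrix in place of the OLS one
linearHypothesis(model, c("x1 = 1", "x2 = 0"), vcov. = NeweyWest(model))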
glmnet allows the user to input a vector of observation weights through the weights argument. glmnet also standardizes (by default) the predictor variables to have zero mean and unit variance. My question is: when weights is provided, does glmnet standardize the predictors using the weighted mean (and standard deviation) of each column or the unweighted mean (and standard deviation)?
There's a description of glmnet's standardization at Link
In that post you can see the Fortran code snippet from glmnet's source that computes the standardization (the "Proof" paragraph, second bullet).
I'm not familiar with Fortran, but to me it looks very much like it is in fact using the weighted mean and sd.
Edit: From the glmnet vignette:
"weights is for the observation weights. Default is 1 for each
observation. (Note: glmnet rescales the weights to sum to N, the
sample size.)"
With w in the Fortran code being the rescaled weights, this seems to be consistent with weighted mean standardization.
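As a sanity check, here is a small R transcription of what that Fortran snippet appears to compute; the variable names are mine, and w is rescaled to sum to the sample size, as the vignette describes:
w <- w * length(w) / sum(w)                          # rescale weights to sum to n
xm <- colSums(w * x) / sum(w)                        # weighted column means
xs <- sqrt(colSums(w * sweep(x, 2, xm)^2) / sum(w))  # weighted column sds (1/n variance)
x_std <- sweep(sweep(x, 2, xm), 2, xs, "/")          # standardized predictors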
For what it's worth, and consistent with the accepted answer, the weights in glmnet are sampling weights, not inverse-variance weights. For example, if you have many more observations than unique observations, you can compress your dataset and get the same coefficient estimates:
library(glmnet)
set.seed(1)                      # added for reproducibility
n <- 50
m <- 5
y_norm <- rnorm(n)
y_bool <- rbinom(n, 1, .5)
x <- matrix(rnorm(n * m), n)
w <- rpois(n, 3) + 1             # integer observation weights
w_indx <- rep(1:n, times = w)    # row index repeating each record w times
m1 <- glmnet(x, y_norm, weights = w)
m2 <- glmnet(x[w_indx, ], y_norm[w_indx])
all.equal(coef(m1, s = .1),
          coef(m2, s = .1))
# [1] TRUE
M1 <- glmnet(x, y_bool, weights = w, family = "binomial")
M2 <- glmnet(x[w_indx, ], y_bool[w_indx], family = "binomial")
all.equal(coef(M1, s = .1),
          coef(M2, s = .1))
# [1] TRUE
Of course, a bit more care needs to be used when using weights with cv.glmnet since the weights of aggregated records should be spread across folds using a multinomial distribution...
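A hypothetical sketch of that last point, using cv.glmnet's foldid argument on the expanded data so that the copies of a heavy record are scattered across folds at random rather than all landing in one fold:
nfolds <- 5
# each expanded row gets an independent uniform fold draw, i.e. a record's
# weight is split across the folds multinomially
foldid <- sample(1:nfolds, length(w_indx), replace = TRUE)
cv_m <- cv.glmnet(x[w_indx, ], y_norm[w_indx], foldid = foldid)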
I recently wanted to compute the standard error for a Fama-MacBeth test, and to compute a standard error we need a standard deviation. This test's variance is \frac{1}{n^2}\sum_i (x_i-\bar x)^2, whereas in my mind the denominator for a normal computation of the variance is n. So my question is: when a program such as R or EViews runs a linear regression, does it also report the coefficients' standard errors using a variance with the \frac{1}{n^2} denominator?
Thanks, everyone.
The formula for calculating the standard error of coefficients in a linear regression can be found in introductory textbooks or e.g. How are the standard errors of coefficients calculated in a regression?.
The variance of the coefficients is given by
\widehat{\operatorname{Var}}(\hat\beta) = \hat\sigma^2 (X^\top X)^{-1}
where \hat\sigma^2 is the sum of squared residuals divided by the degrees of freedom, n - k - 1, with n the number of observations and k the number of covariates, assuming the model has an intercept.
We can verify this empirically. Using the built-in mtcars dataset
fit <- lm(mpg ~ wt, mtcars)
we can see that
vcv <- (sum(fit$residuals^2) / (nrow(mtcars) - 2)) *  # sigma-hat-squared with df = n - 2
  solve(t(model.matrix(fit)) %*% model.matrix(fit))   # (X'X)^{-1}
all.equal(summary(fit)$coefficients[, "Std. Error"],
          sqrt(diag(vcv)))
# [1] TRUE
I have a matrix x containing 100 samples (rows) and 10000 independent features (columns). The observations are binary: each sample is either good or bad, {0,1}, stored in the vector y. I want to perform leave-one-out cross-validation and determine the Area Under the Curve (AUC) for each feature separately (something like colAUC from the caTools package). I tried to use glmnet, but it didn't work. As the manual suggests, I tried setting the nfolds parameter equal to the number of observations (100):
result <- cv.glmnet(x, y, nfolds = 100, type.measure = "auc", family = "binomial")
And I'm getting these warnings:
>"Warning messages:
1: Too few (< 10) observations per fold for type.measure='auc' in
cv.lognet; changed to type.measure='deviance'. Alternatively, use smaller
value for nfolds
2: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per
fold"
Any ideas what I'm doing wrong? And is there any other way or R package to obtain LOO-balanced AUC values for each of the features?
I'll really appreciate any help. Thank you!
When you do LOO-CV, your test set at each step contains only 1 sample, so of course you cannot build an AUC from it alone. However, you can loop and store the prediction made at each step:
library(glmnet)
k <- nrow(x)
predictions <- numeric(k)
for (i in 1:k) {
  model <- glmnet(x[-i, ], y[-i], family = "binomial")
  # newx must be a matrix, and a single lambda must be chosen;
  # s = 0.01 is only an example value here
  predictions[i] <- predict(model, newx = x[i, , drop = FALSE], s = 0.01)
}
So that in the end you can make a ROC curve, for example:
library(pROC)
roc(y, predictions)
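Since the question asked for an AUC per feature, here is a hypothetical per-feature variant, not from the original answer; it uses plain glm() rather than glmnet, since a single predictor needs no penalty, and it will be slow with 10000 features:
library(pROC)
feature_auc <- sapply(seq_len(ncol(x)), function(j) {
  # LOO predictions for feature j alone
  preds <- vapply(seq_len(nrow(x)), function(i) {
    d <- data.frame(y = y[-i], f = x[-i, j])
    fit <- glm(y ~ f, data = d, family = binomial)
    predict(fit, newdata = data.frame(f = x[i, j]), type = "response")
  }, numeric(1))
  as.numeric(auc(y, preds))  # LOO AUC for feature j
})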