Reducing the weighting of one of the x variables in glmnet (R)

I can create a model using glmnet.
However, if I know beforehand that the contribution of one variable is weak (say x[,1]), how can I reduce its contribution in the glmnet function?
library(glmnet)
x <- matrix(rnorm(100*20),100,20)
y <- rnorm(100)
fit1 <- glmnet(x,y)
print(fit1)
coef(fit1,s=0.01) # extract coefficients at a single value of lambda
predict(fit1,newx=x[1:10,],s=c(0.01,0.005)) # make predictions

This is a bit of a hack, but because penalized regression approaches assume the variables are all on a comparable scale, shrinking the scale of a predictor means a larger coefficient is needed to describe the same magnitude of effect, and that larger coefficient in turn incurs a greater penalty. For example:
x2 <- scale(x)
x2[,1] <- x2[,1]/100
fit2 <- glmnet(x2,y,standardize=FALSE)
coef(fit2,s=0.01)
shows that the first variable has been eliminated. (You have to force glmnet not to re-standardize the x-variables internally; you should make sure you've scaled the predictors yourself ...)
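As an aside (not part of the original answer): if the goal is to penalize one predictor more heavily rather than rescale it, glmnet also has a penalty.factor argument, a per-variable multiplier on the penalty. A minimal sketch, reusing the x and y above:
pf <- rep(1, ncol(x))
pf[1] <- 100                         # penalize the first variable 100x more heavily
fit3 <- glmnet(x, y, penalty.factor = pf)
coef(fit3, s = 0.01)                 # x[,1] is shrunk toward (or to) zero
Setting the factor to Inf excludes the variable entirely.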

Why not just remove it? I don't think you can reduce the "contribution" of one variable; that variable just adds noise to your prediction. When an independent variable is added to a model, R² increases no matter what. If that variable has no real relationship to the dependent variable (which is your situation), only a small amount of "contribution" is added to R². If the independent variable (x) and the dependent variable (y) do have a relationship, then a larger increase in R² is achieved, which means you can explain y with x better.

Related

Is there a way to force the coefficient of the independent variable to be a positive coefficient in the linear regression model used in R?

In lm(y ~ x1 + x2 + x3 + ... + xn), not all of the estimated coefficients are positive.
For example, we know that x1 to x5 must have positive coefficients and x6 to x10 must have negative coefficients.
However, when lm(y ~ x1 + x2 + x3 + ... + x10) is run in R, some of x1–x5 get negative coefficients and some of x6–x10 get positive coefficients in the fitted result.
I want to control this within a linear regression approach; is there a good way to do it?
The sign of a coefficient may change depending on its correlation with the other regressors. As @TarJae noted, this looks like an example of (or a counterpart to) Simpson's paradox, which describes cases where the sign of a correlation can reverse depending on whether we condition on another variable.
Here's a concrete example in which I've made two independent variables, x1 and x2, which are both highly correlated to y, but when they are combined the coefficient for x2 reverses sign:
# specially chosen seed; most seeds' result isn't as dramatic
set.seed(410)
df1 <- data.frame(y = 1:10,
                  x1 = rnorm(10, 1:10),
                  x2 = rnorm(10, 1:10))
lm(y ~ ., df1)
Call:
lm(formula = y ~ ., data = df1)
Coefficients:
(Intercept) x1 x2
-0.2634 1.3990 -0.4792
This result is not incorrect, but arises here (I think) because the prediction errors from x1 happen to be correlated with the prediction errors from x2, such that a better prediction is created by subtracting some of x2.
EDIT, additional analysis:
The more independent series you have, the more likely you are to see this phenomenon. For my example with just two series, only 2.4% of the integer seeds from 1 to 1000 produce it, i.e. one of the series gets a negative regression coefficient. This increases to 16% with three series, 64% with five series, and 99.9% with ten series.
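A rough sketch (mine, not the original answer's) of how those percentages can be estimated: loop over seeds, refit the regression, and count how often at least one slope comes out negative. The function name and the use of replicate are my own choices.
sign_flip_rate <- function(k, seeds = 1:1000) {
  flips <- vapply(seeds, function(s) {
    set.seed(s)
    df <- data.frame(y = 1:10, x = replicate(k, rnorm(10, 1:10)))
    any(coef(lm(y ~ ., df))[-1] < 0)      # TRUE if any slope is negative
  }, logical(1))
  mean(flips)
}
sign_flip_rate(2)    # reported above as about 2.4%
sign_flip_rate(5)    # reported above as about 64%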
Constraints
Possibilities include using:
nls with algorithm = "port", in which case upper and lower bounds can be specified.
nnnpls in the nnls package, which supports constraining individual coefficients to be non-negative or non-positive, or nnls in the same package if all coefficients should be non-negative.
bvls (bounded-variable least squares) in the bvls package, specifying the bounds.
There is an example of performing non-negative least squares in the vignette of the CVXR package.
Reformulate it as a quadratic programming problem (see Wikipedia for the formulation) and use the quadprog package.
nnls in the limSolve package; negate the columns that should have negative coefficients to convert the problem to non-negative least squares.
These packages mostly do not have a formula interface but instead require that a model matrix and dependent variable be passed as separate arguments. If df is a data frame containing the data and if the first column is the dependent variable then the model matrix can be calculated using:
A <- model.matrix(~., df[-1])
and the dependent variable is
df[[1]]
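For example, a minimal sketch with the nnls package, assuming a data frame df whose first column is the dependent variable as described above (to force some coefficients to be negative instead, negate those columns first, as in the limSolve item):
library(nnls)
A <- model.matrix(~ ., df[-1])       # includes an intercept column, which is also constrained here
b <- df[[1]]
fit <- nnls(A, b)
fit$x                                # constrained coefficients, all >= 0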
Penalties
Another approach is to add a penalty to the least squares objective function, i.e. the objective becomes the sum of squared residuals plus one or more additional terms that are functions of the coefficients and tuning parameters. Although this does not impose hard constraints that guarantee the desired signs, it may result in the correct signs anyway. It is particularly useful if the problem is ill-conditioned or if there are more predictors than observations.
linearRidge in the ridge package will minimize the sum of the square of the residuals plus a penalty equal to lambda times the sum of the squares of the coefficients. lambda is a scalar tuning parameter which the software can automatically determine. It reduces to least squares when lambda is 0. The software has a formula method which along with the automatic tuning makes it particularly easy to use.
glmnet adds penalty terms involving two tuning parameters. It includes least squares and ridge regression as special cases, and it also supports bounds on the coefficients. There are facilities to set the two tuning parameters automatically, but it does not have a formula method and the procedure is not as straightforward as in the ridge package. Read the vignettes that come with it for more information.
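To illustrate the coefficient bounds just mentioned, here is a rough sketch (simulated placeholder data, not the question's variables) using glmnet's lower.limits and upper.limits to force the first five coefficients to be non-negative and the last five non-positive:
library(glmnet)
set.seed(1)
X <- matrix(rnorm(100 * 10), 100, 10)
y <- rnorm(100)
lo <- c(rep(0, 5), rep(-Inf, 5))     # per-coefficient lower bounds
up <- c(rep(Inf, 5), rep(0, 5))      # per-coefficient upper bounds
fit <- glmnet(X, y, lower.limits = lo, upper.limits = up)
coef(fit, s = 0.01)                  # signs respect the bounds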
1. One way is to set up an optimization problem and minimize the mean squared error subject to constraints and bounds (nlminb, optim, etc.).
2. Another is to use the "lavaan" package, as described here:
https://stats.stackexchange.com/questions/96245/linear-regression-with-upper-and-or-lower-limits-in-r

LASSO-type regressions with a non-negative continuous dependent variable

I am using "glmnet" package (in R) mostly to perform regularized linear regression.
However, I am wondering if it can perform LASSO-type regressions with a non-negative, continuous (dependent) outcome variable.
I could use family = poisson, but the outcome variable is not really a count variable; it is just a continuous variable with a lower limit of 0.
I am aware of the lower.limits argument, but I believe that applies to the covariates (independent variables). (Please correct me if my understanding of it is wrong.)
I look forward to hearing from you all! Thanks :-)
You are right that setting lower limits in glmnet applies to the covariates. A Poisson model effectively enforces a lower limit of zero because you exponentiate the linear predictor to get back the "counts".
Along those lines, it will most likely work if you transform your response variable. One quick way is to take the log of your response, do the fit, and transform the predictions back; the back-transformation guarantees positive values, although you then have to deal with zeros in the response.
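A minimal sketch of that log idea (my own; X and y are placeholders for your design matrix and non-negative response, and the offset eps is one common way to handle exact zeros):
library(glmnet)
set.seed(1)
X   <- matrix(rnorm(100 * 5), 100, 5)
y   <- pmax(0, rnorm(100, mean = 2))             # toy non-negative response with a few zeros
eps <- 1e-3                                      # offset so log() is defined at zero
fit  <- cv.glmnet(X, log(y + eps))
pred <- pmax(exp(predict(fit, X, s = "lambda.min")) - eps, 0)   # clamp removes the tiny -eps slack
range(pred)                                      # no negative predictions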
An alternative is a power transformation. There's a lot to think about here, and since you did not provide your data I can only demonstrate a two-parameter Box-Cox transformation on an example dataset:
library(glmnet)
library(mlbench)
library(geoR)
data(BostonHousing)
data = BostonHousing
data$chas=as.numeric(data$chas)
# change it to min 0 and max 1
data$medv = (data$medv-min(data$medv))/diff(range(data$medv))
Then I use a quick approximation via PCA (rather than fitting all the variables) to get suitable lambda1 and lambda2 values:
bcfit = boxcoxfit(object = data[,14],
                  xmat = prcomp(data[,-14], scale = TRUE, center = TRUE)$x[,1:2],
                  lambda2 = TRUE)
bcfit
Fitted parameters:
lambda lambda2 beta0 beta1 beta2 sigmasq
0.42696313 0.00001000 -0.83074178 -0.09876102 0.08970137 0.05655903
Convergence code returned by optim: 0
Check lambda2: it's the one that's critical for deciding whether you get a negative value, and it should be rather small.
Create the functions to power transform:
bct = function(y,l1,l2){((y+l2)^l1 -1)/l1}
bctinverse = function(y,l1,l2){(y*l1+1)^(1/l1) -l2}
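As a quick sanity check (mine, not part of the original answer), the two functions should invert each other:
all.equal(bctinverse(bct(data$medv, bcfit$lambda[1], bcfit$lambda[2]),
                     bcfit$lambda[1], bcfit$lambda[2]),
          data$medv)    # should be TRUE up to numerical tolerance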
Now we transform the response:
data$medv_trans = bct(data$medv,bcfit$lambda[1],bcfit$lambda[2])
And fit glmnet:
fit = glmnet(x=as.matrix(data[,1:13]),y=data$medv_trans,nlambda=500)
Get predictions over all lambdas; you can see there are no negative predictions once you transform back:
pred = predict(fit,as.matrix(data[,1:13]))
range(bctinverse(pred,bcfit$lambda[1],bcfit$lambda[2]))
[1] 0.006690685 0.918473356
And let's say we do a fit with cv:
fit = cv.glmnet(x=as.matrix(data[,1:13]),y=data$medv_trans)
pred = predict(fit,as.matrix(data[,1:13]))
pred_transformed = bctinverse(pred, bcfit$lambda[1], bcfit$lambda[2])
plot(data$medv,pred_transformed,xlab="orig response",ylab="predictions")

Why does survey weight change R SQUARED?

library(survival)
library(survminer)
library(dplyr)
ovarian=ovarian
ovarian$weighting = sample(1:100,26,replace=T)
fitWEIGHT <- coxph(Surv(futime, fustat) ~ age + rx,data=ovarian,weight=weighting)
fitNOWEIGHT <- coxph(Surv(futime, fustat) ~ age + rx,data=ovarian)
In the example above, the R-squared for fitWEIGHT equals 1, yet the same model without the fake sample weights has an R-squared of less than half that (about 0.5). Why is this happening?
Weighting here effectively repeats the observations. You're assigning each of the 26 rows a random integer weight between 1 and 100 (ovarian$weighting = sample(1:100,26,replace=T)), so the model behaves as if each row had been observed that many times. Re-observing the data points according to such large weights inflates the apparent information in the sample and biases the fit toward a near-perfect association between your dependent and independent variables. It's probably not perfectly correlated, but the 1:100 range likely pushes the statistic beyond the default number of significant digits, so it rounds to 1. If you change the sample to 1:10 or 40:50 or similar, the same bias would remain, but the R-squared would likely come out as nearly 1 rather than the rounded-to-1 value you're seeing under the current weighting strategy.
For additional discussion of weights in this function, see the documentation linked below, and make sure the weights you're specifying are the kind of weights you intend for this analysis: they effectively weight the observation counts (i.e., a form of over-/re-sampling of the observations you assign the weights to). https://www.rdocumentation.org/packages/survival/versions/2.43-3/topics/coxph
Where it states:
Case weights: Case weights are treated as replication weights, i.e., a case weight of 2 is equivalent to having 2 copies of that subject's observation. When computers were much smaller, grouping like subjects together was a common trick used to conserve memory. Setting all weights to 2 for instance will give the same coefficient estimate but halve the variance. When the Efron approximation for ties (the default) is employed, replication of the data will not give exactly the same coefficients as the weights option, and in this case the weighted fit is arguably the correct one.
When the model includes a cluster term or the robust=TRUE option, the computed variance treats any weights as sampling weights; setting all weights to 2 will in this case give the same variance as weights of 1.
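A small check (my own sketch) of the "replication weights" point: with integer weights, physically repeating the rows gives roughly the same coefficients as passing weights; per the quoted documentation, they will not match exactly under the default Efron tie approximation.
library(survival)
set.seed(1)
ov <- ovarian
ov$w <- sample(1:5, nrow(ov), replace = TRUE)
fit_w   <- coxph(Surv(futime, fustat) ~ age + rx, data = ov, weights = w)
fit_rep <- coxph(Surv(futime, fustat) ~ age + rx, data = ov[rep(1:nrow(ov), ov$w), ])
coef(fit_w)     # close to coef(fit_rep), but not identical
coef(fit_rep)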

How do I calculate AUC from two continuous variables in R?

I have the following data:
# actual value:
a <- c(26.77814,29.34224,10.39203,29.66659,20.79306,20.73860,22.71488,29.93678,10.14384,32.63233,24.82544,38.14778,25.12343,23.07767,14.60789)
# predicted value
p <- c(27.238142,27.492240,13.542026,32.266587,20.473063,20.508603,21.414882,28.536775,18.313844,32.082333,24.545438,30.877776,25.703430,22.397666,15.627892)
I already calculated MSE and RMSE for these two, but they're asking for AUC and ROC curve. How can I calculate it from this data using R? I thought AUC is for classification problems, was I mistaken? Can we still calculate AUC for numeric values like above?
Question:
I thought AUC is for classification problems, was I mistaken?
You are not mistaken. The area under the receiver operating characteristic curve can't be computed for two numeric vectors like those in your example. It's used to determine how well your binary classifier performs against a gold-standard binary classification. You need a vector of cases vs. controls, or labels for the a vector that put each value into one of two categories.
Here's an example of how you'd do this with the pROC package:
library(pROC)
# actual value
a <- c(26.77814,29.34224,10.39203,29.66659,20.79306,20.73860,22.71488,29.93678,10.14384,32.63233,24.82544,38.14778,25.12343,23.07767,14.60789)
# predicted value
p <- c(27.238142,27.492240,13.542026,32.266587,20.473063,20.508603,21.414882,28.536775,18.313844,32.082333,24.545438,30.877776,25.703430,22.397666,15.627892)
df <- data.frame(a = a, p = p)
# order the data frame according to the actual values
odf <- df[order(df$a),]
# convert the actual values to an ordered binary classification
odf$a <- odf$a > 12 # arbitrarily decided to use 12 as the threshold
# construct the roc object
roc_obj <- roc(odf$a, odf$p)
auc(roc_obj)
# Area under the curve: 0.9615
Here, we have arbitrarily decided that the threshold for the gold standard (a) is 12. If that's the case, then observations with a value lower than 12 are controls. The prediction (p) classifies very well, with an AUC of 0.9615. We don't have to decide on a threshold for our prediction classifier in order to determine the AUC, because the AUC is independent of that decision. We can slide the prediction threshold up and down depending on whether it's more important to find cases or to avoid misclassifying controls.
Important Note
I completely made up the threshold for the gold standard classifier. If you choose a different threshold (for the gold standard), you'll get a different AUC. For example, if we chose 28, the AUC would be 1. The AUC is independent of the threshold for the predictor, but absolutely depends on the threshold for the gold standard.
EDIT
To clarify the above note, which was apparently misunderstood: you were not mistaken. This kind of analysis is for classification problems, and you cannot use it here without more information. To do it, you need a threshold for your a vector, which you don't have. You can't make one up and expect to get a meaningful result for the AUC. Because the AUC depends on the threshold for the gold-standard classifier, if you just make up that threshold, as we did in the exercise above, you are also just making up the AUC.
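To see the threshold dependence concretely, here is a small sketch (my own, using the same a and p vectors as above) that recomputes the AUC for several candidate gold-standard thresholds:
library(pROC)
for (thr in c(12, 20, 25, 28)) {
  r <- suppressMessages(roc(as.numeric(a > thr), p))
  cat("threshold =", thr, " AUC =", round(as.numeric(auc(r)), 4), "\n")
}
# the answer above reports an AUC of about 0.96 at threshold 12 and 1 at threshold 28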

Scale back linear regression coefficients in R from scaled and centered data

I'm fitting a linear model using OLS and have scaled my regressors with the function scale in R because of the different units of measure between variables. Then I fit the model with the lm command and get the coefficients of the fitted model. As far as I know, the coefficients of the fitted model are not in the same units as the original regressors and therefore must be scaled back before they can be interpreted. I have been searching for a direct way to do it but couldn't find anything. Does anyone know how to do that?
Please have a look at the code; could you help me implement what you proposed?
library(zoo)
filename="DataReg4.csv"
filepath=paste("C:/Reg/",filename, sep="")
separator=";"
readfile=read.zoo(filepath, sep=separator, header=T, format = "%m/%d/%Y", dec=".")
readfile=as.data.frame(readfile)
str(readfile)
DF=readfile
DF=as.data.frame(scale(DF))
fm=lm(USD_EUR~diff_int+GDP_US+Net.exports.Eur,data=DF)
summary(fm)
plot(fm)
Sorry, here is the data:
http://www.mediafire.com/?hmcp7urt0ag8187
If you used the scale function with default arguments then your regressors will be centered (subtracting their mean) and divided by their standard deviations. You can interpret the coefficients without transforming them back to the original units:
Holding everything else constant, on average, a one standard deviation change in one of the regressors is associated with a change in the dependent variable corresponding to the coefficient of that regressor.
If you have included an intercept term in your model keep in mind that the interpretation of the intercept will change. The estimated intercept now represents the average level of the dependent variable when all of the regressors are at their average levels. This is a result of subtracting the mean from each variable.
To interpret a coefficient in the regressor's original units rather than in standard deviations, divide the coefficient by the standard deviation of that regressor.
To de-scale (back-transform) regression coefficients from a regression fitted with scaled predictor variable(s) and an unscaled response variable, the intercept and slope should be calculated as:
A = As - Bs*Xmean/sdx
B = Bs/sdx
thus the regression is,
Y = As - Bs*Xmean/sdx + Bs/sdx * X
where
As = intercept from the scaled regression
Bs = slope from the scaled regression
Xmean = the mean of the original (unscaled) predictor variable
sdx = the standard deviation of the predictor variable
This can be adjusted if Y was also scaled but it appears you decided not to do that ultimately with your dataset.
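As a quick numeric illustration of those formulas (my own sketch on simulated data, not the question's dataset):
set.seed(1)
x  <- rnorm(100, mean = 5, sd = 2)
y  <- 3 + 1.5 * x + rnorm(100)
xs <- scale(x)                        # centered and scaled predictor
fit_scaled <- lm(y ~ xs)
As <- coef(fit_scaled)[1]
Bs <- coef(fit_scaled)[2]
B  <- Bs / sd(x)                      # slope on the original scale
A  <- As - Bs * mean(x) / sd(x)       # intercept on the original scale
coef(lm(y ~ x))                       # matches c(A, B)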
If I understand your description (which is unfortunately code-free at the moment), you are getting standardized regression coefficients for Ys ~ As + Bs*Xs, where all the "s" items are scaled variables. The coefficients are then the predicted change, on a standard-deviation scale of Y, associated with a one-standard-deviation change in X. The scale function records the means and standard deviations as attributes of the scaled object; if not, you will have those estimates somewhere in your console log. The estimated change dY for a change dX in X should be: dY*(1/sdY) = Bs*dX*(1/sdX). Predictions should be something along these lines:
Yest = Ymean + sdY*(As + Bs*Xs)
You probably should not have needed to standardize the Y values, and I'm hoping that you didn't, because it makes dealing with the adjustment for the means of the X's easier. Put some code and example data in the question if you want implemented and checked answers. I think @DanielGerlance is correct in saying to multiply rather than divide by the SDs.

Resources