Is there a way to force the coefficient of the independent variable to be a positive coefficient in the linear regression model used in R? - r

In lm(y ~ x1 + x2+ x3 +...+ xn) , not all independent variables are positive.
For example, we know that x1 to x5 must have positive coefficients and x6 to x10 must have negative coefficients.
However, when lm(y ~ x1 + x2+ x3 +...+ x10) is performed using R, some of x1 ~ x5 have negative coefficients and some of x6 ~ x10 have positive coefficients. is the data analysis result.
I want to control this using a linear regression method, is there any good way?

The sign of a coefficient may change depending upon its correlation with other coefficients. As #TarJae noted, this seems like an example of (or counterpart to?) Simpson's Paradox, which describes cases where the sign of a correlation might reverse depending on if we condition on another variable.
Here's a concrete example in which I've made two independent variables, x1 and x2, which are both highly correlated to y, but when they are combined the coefficient for x2 reverses sign:
# specially chosen seed; most seeds' result isn't as dramatic
set.seed(410)
df1 <- data.frame(y = 1:10,
x1 = rnorm(10, 1:10),
x2 = rnorm(10, 1:10))
lm(y ~ ., df1)
Call:
lm(formula = y ~ ., data = df1)
Coefficients:
(Intercept) x1 x2
-0.2634 1.3990 -0.4792
This result is not incorrect, but arises here (I think) because the prediction errors from x1 happen to be correlated with the prediction errors from x2, such that a better prediction is created by subtracting some of x2.
EDIT, additional analysis:
The more independent series you have, the more likely you are to see this phenomenon arise. For my example with just two series, only 2.4% of the integer seeds from 1 to 1000 produce this phenomenon, where one of the series produces a negative regression coefficient. This increases to 16% with three series, 64% of the time with five series, and 99.9% of the time with 10 series.

Constraints
Possibilities include using:
nls with algorithm = "port" in which case upper and lower bounds can be specified.
nnnpls in the nnls package which supports upper and lower 0 bounds or use nnls in the same package if all coefficients should be non-negative.
bvls (bounded value least squares) in the bvls package and specify the bounds.
there is an example of performing non-negative least squares in the vignette of the CVXR package.
reformulate it as a quadratic programming problem (see Wikipedia for the formulation) and use quadprog package.
nnls in the limSolve package. Negate the columns that should have negative coefficients to convert it to a non-negative least squares problem.
These packages mostly do not have a formula interface but instead require that a model matrix and dependent variable be passed as separate arguments. If df is a data frame containing the data and if the first column is the dependent variable then the model matrix can be calculated using:
A <- model.matrix(~., df[-1])
and the dependent variable is
df[[1]]
Penalties
Another approach is to add a penalty to the least squares objective function, i.e. the objective function becomes the sum of the squares of the residuals plus one or more additional terms that are functions of the coefficients and tuning parameters. Although doing this does not impose any hard constraints to guarantee the desired signs it may result in the correct signs anyways. This is particularly useful if the problem is ill conditioned or if there are more predictors than observations.
linearRidge in the ridge package will minimize the sum of the square of the residuals plus a penalty equal to lambda times the sum of the squares of the coefficients. lambda is a scalar tuning parameter which the software can automatically determine. It reduces to least squares when lambda is 0. The software has a formula method which along with the automatic tuning makes it particularly easy to use.
glmnet adds penalty terms containing two tuning parameters. It includes least squares and ridge regression as a special cases. It also supports bounds on the coefficients. There are facilities to automatically set the two tuning parameters but it does not have a formula method and the procedure is not as straight forward as in the ridge package. Read the vignettes that come with it for more information.

1- one way is to define an optimization program and minimize the mean square error by constraints and limits. (nlminb, optim, etc.)
2- Another one is using a library called "lavaan" as follow:
https://stats.stackexchange.com/questions/96245/linear-regression-with-upper-and-or-lower-limits-in-r

Related

Model Syntax for Simple Moderation Model in Lavaan (with bootstrapping)

I am a social scientist currently running a simple moderation model in R, in the form of y ~ x + m + m * x. My moderator is a binary categorical variable (two separate groups).
I started out with lm(), bootstrapped estimates with boot() and obtained bca confidence intervals with boot.ci. Since there is no automated way of doing this for all parameters (at my coding level at least), this is bit tedious. Howver, I now saw that the lavaan package offer bootstrapping as part of the regular sem() function, and also bca CIs as part of parameterEstimates(). So, I was wondering (since I am using lavaan in other analyses) whether I could just replace lm() with lavaan for the sake of keeping my work more consistent.
Doing this, I was wondering about what the equivalent model for lavaan would be to test for moderation in the same way. I saw this post where Jeremy Miles proposes the code below, which I follow mostly.
mod.1 <- "
y ~ c(a, b) * x
y ~~ c(v1, v1) * y # This step needed for exact equivalence
y ~ c(int1, int2) * 1
modEff := a - b
mEff := int1 - int2"
But it would be great if you could help me figure out some final things.
1) What does the y ~~ c(v1, v1) * y part mean and and why is it needed for "exact equivalence" to the lm model? From the output it seems this constrics variances of the outcome for both groups to the same value?
2) From the post, am I right to understand that either including the interaction effect as calculated above OR constraining (only) the slope between models and looking at model fit with anova()would be the same test for moderation?
3) The lavaan page says that adding test = "bootstrap" to the sem() function allows for boostrap adjusted p-values. However, I read a lot about p-values conflicting with the bca-CIs at times, and this has happened to me. Searching around, I understand that this conflict comes from the assumptions for the distribution of the data under the H0 for p-values, but not for CIs (which just give the range of most likely values). I was therefore wondering what it exactly means that the p-values given here are "bootstrap-adjusted"? Is it technically more true to report these for my SEM models than the CIs?
Many questions, but I would be very grateful for any help you can provide.
Best,
Alex
I think I can answer at least Nr. 1 and 2 of your questions but it is probably easier to not use SEM and instead program a function that conveniently gives you CIs for all coefficients of your model.
So first, to answer your questions:
What is proposed in the code you gave is called multigroup comparison. Essentially this means that you fit the same SEM to two different groups of cases in your dataset. It is equivalent to a moderated regression with binary moderator because in both cases you get two slopes (often called „simple slopes“) for the scalar predictor, one slope per group of the moderator.
Now, in your lavaan code you only see the scalar predictor x. The binary moderator is implied by group="m" when you fit the model with fit.1 <- sem(mod.1, data = df, group = "m") (took this from the page you linked).
The two-element vectors (c( , )) in the lavaan code specify named parameters for the first and second group, respectively. By y ~~ c(v1, v1) * y , the residual variances of y are set equal in both groups because they have the same name. In contrast, the slopes c(a, b) and the intercepts c(int1, int2) are allowed to vary between groups.
Yes. If you use the SEM, you would fit the model a second time adding a == b and compare the model this to the first version where the slopes can differ. This is the same as comparing lm() models with and without a:b (or a*b) in the formula.
Here I cannot provide a direct answer to your question. I suspect if you want BCa CIs as you would get from applying boot.ci to an lm model fit, this might not be implemented. In the lavaan documentation BCa confidence intervals are only mentioned once: In the section about the parameterEstimates function, which can also perform bootstrap (see p. 89). However, it does not produce actual BCa (bias-corrected and accelerated) CIs but only bias-corrected ones.
As mentioned above, I guess the simplest solution would be to use lm() and either repeat the boot.ci procedure for each coefficient or write a wrapper function that does this for you. I suggest this also because a reviewer may be quite puzzled to see you do multigroup SEM instead of a simple moderated regression, which is much more common.
You probably did something like this already:
lm_fit <- function(dat, idx) coef( lm(y ~ x*m, data=dat[idx, ]) )
bs_out <- boot::boot(mydata, statistic=lm_fit, R=1000)
ci_out <- boot::boot.ci(bs_out, conf=.95, type="bca", index=1)
Now, either you repeat the last line for each coefficient, i.e., varying index from 1 to 4. Or you get fancy and let R do the repeating with a function like this:
all_ci <- function(bs) {
est <- bs$t0
lower <- vector("numeric", length(bs$t0))
upper <- lower
for (i in 1:length(bs$t0)) {
ci <- tail(boot::boot.ci(bs, type="bca", index=i)$bca[1,], 2)
lower[i] <- ci[1]
upper[i] <- ci[2]
}
cbind(est, lower, upper)
}
all_ci(bs_out)
I am sure this could be written more concisely but it should work fine for bootstraps of simple lm() models.

reducing the weighting of one of the x variables in glmnet

I can create a model using glmnet.
However if I know beforehand that the contribution of one variable is weak (say x[,1]) how can I reduce its contribution in the glmnet function?
library(glmnet)
x <- matrix(rnorm(100*20),100,20)
y <- rnorm(100)
fit1 <- glmnet(x,y)
print(fit1)
coef(fit1,s=0.01) # extract coefficients at a single value of lambda
predict(fit1,newx=x[1:10,],s=c(0.01,0.005)) # make predictions
This is a bit of a hack, but because penalized regression approaches assume that the variables are all scaled equally, shrinking the scale of a predictor variable should require a larger value of the parameter to characterize the same magnitude of effect, and hence a greater penalty, e.g.
x2 <- scale(x)
x2[,1] <- x2[,1]/100
fit2 <- glmnet(x2,y,standardize=FALSE)
coef(fit2,s=0.01)
shows that the first variable has been eliminated. (You have to force glmnet not to re-standardize the x-variables internally; you should make sure you've scaled the predictors yourself ...)
Why just not remove it? I don't think you can the reduce the "contribution" of one variable. That variable makes noise to your prediction. When an independent variable added to a model the R² is increased, no matter what. If that variable does not have any relationship to dependent variable( which is your situation ) smaller amount of "contribution" is added to R². If dependent(x) and independent(y) does have a relationship then the greater amount of R² is achieved which means you can explain the y with x better.

Covariance matrix of the estimated parameteric coefficients and estimated smoothing parameters in a GAM (package: mgcv)?

Consider the code below to fit a generalized additive model including two terms x0 which is linear and x1 which is nonlinear:
library(mgcv)
set.seed(2) ## simulate some data...
dat <- gamSim(1,n=400,dist="normal",scale=2, method="REML")
b <- gam(y~x1+s(x2, k=5),data=dat)
The model b estimates 3 parameters: an intercept, one parametric coefficient for x1, and one smoothing parameter for x2. How can I extract the estimated covariance matrix of these 3 parameters? I have used vcov(b) which gives the following results:
(Intercept) x0 s(x1).1 s(x1).2 s(x1).3 s(x1).4
(Intercept) 0.104672470 -0.155791753 0.002356237 0.001136459 0.001611635 0.001522158
x0 -0.155791753 0.322528093 -0.004878003 -0.002352757 -0.003336490 -0.003151250
s(x1).1 0.002356237 -0.004878003 0.178914602 0.047701707 0.078393786 0.165195739
s(x1).2 0.001136459 -0.002352757 0.047701707 0.479869768 0.606310668 0.010704075
s(x1).3 0.001611635 -0.003336490 0.078393786 0.606310668 0.933905535 0.025816649
s(x1).4 0.001522158 -0.003151250 0.165195739 0.010704075 0.025816649 0.184471259
It seems vcov(b) gives the covariance related to each knot of the smooth term s(x1), as the results contain s(x1).1, s(x1).2, s(x1).3, s(x1).4 (That's what I guess). I need the covariance between the estimated smoothing parameter and other parametric coefficients, which should be just one for (Intercept) and just one for x0. Is it available at all?
Edit: I set the method of estimation to REML in the code. I agree that I might have used incorrect phrases to explain my idea as said by Gavin Simpson, and I understand all he said. Yet the idea of calculating the covariance between the parametric coefficients (intercept and coefficient of x1) and them smoothing parameter comes from the method of estimation. If we set it to ML or REML, then there could be the covariance I guess. In this case, the estimated covariance matrix for the log smoothing parameter estimates are provided by sp.vcov. So I think such value could exist similarly for the parametric coefficients and the smoothing parameter.
Your statement
The model b estimates 3 parameters: an intercept, one parametric coefficient for x1, and one smoothing parameter for x2.
is incorrect.
The model estimates many more coefficients that these three. Note also that it is confusing to speak of a smoothing parameter for x2 as the model also estimates one of those, but I doubt this is what you mean by that phrase. The smoothing parameter estimated for x2 is the value that controls the wiggliness of the fitted spline. It is also estimated alongside the other coefficients you see, although it isn't typically considered as part of the main model estimated parameters because what you see in the VCOV are actually the variances and covariances of the model coefficients conditional upon this value of the smoothness parameter.
The GAM fitted here is one in which the effect of x2 is represented by a spline basis expansion of x2. For the basis used and the identifiability constraints applied to the basis, this means that the true effect of x2, f(x2), is estimated via a k-1 basis functions. This is a function hat(f(x2)) = \sum \beta_i b_i(x2) estimated by summing up the weighted (by beta_i, the model coefficients for the ith basis function, b) basis functions evaluated at the observed values of x2 (b_i(x2)).
Hence once the basis is chosen and once we have a smoothness parameter (my version, the one controlling the wiggliness), this model is simply a GLM with x1 and the 4 basis functions evaluated at x2. Hence it is parametric and there isn't a single element in the VCOV that relates to the smooth f(x2) - the model just doesn't work that way.

Mixed model starting values for lme4

I am trying to fit a mixed model using the lmer function from the lme4 package. However, I do not understand what should be input to the start parameter.
My purpose is to use a simple linear regression to use the coefficients estimated there as starting values to the mixed model.
Lets say that my model is the following:
linear_model = lm(y ~ x1 + x2 + x3, data = data)
coef = summary(linear_model)$coefficients[- 1, 1] #I remove the intercept
result = lmer(y ~ x1 + x2 + x3 | x1 + x2 + x3, data = data, start = coef)
This example is an oversimplified version of what I am doing since I won't be able to share my data.
Then I get the following kind of error:
Error during wrapup: incorrect number of theta components (!=105) #105 is the value I get from the real regression I am trying to fit.
I have tried many different solutions, trying to provide a list and name those values theta like I saw suggested on some forums.
Also the Github code test whether the length is appropriate but I cant find to what it refers to:
# Assign the start value to theta
if (is.numeric(start)) {
theta <- start
}
# Check the length of theta
length(theta)!=length(pred$theta)
However I can't find where pred$theta is defined and so I don't understand where that value 105 is coming from.
Any help ?
A few points:
lmer doesn't in fact fit any of the fixed-effect coefficients explicitly; these are profiled out so that they are solved for implicitly at each step of the nonlinear estimation process. The estimation involves only a nonlinear search over the variance-covariance parameters. This is detailed (rather technically) in one of the lme4 vignettes (eqs. 30-31, p. 15). Thus providing starting values for the fixed-effect coefficients is impossible, and useless ...
glmer does fit fixed-effects coefficients explicitly as part of the nonlinear optimization (as #G.Grothendieck discusses in comments), if nAGQ>0 ...
it's admittedly rather obscure, but the starting values for the theta parameters (the only ones that are explicitly optimized in lmer fits) are 0 for the off-diagonal elements of the Cholesky factor, 1 for the diagonal elements: this is coded here
ll$theta[] <- is.finite(ll$lower) # initial values of theta are 0 off-diagonal, 1 on
... where you need to know further that, upstream, the values of the lower vector have been coded so that elements of the theta vector corresponding to diagonal elements have a lower bound of 0, off-diagonal elements have a lower bound of -Inf; this is equivalent to starting with an identity matrix for the scaled variance-covariance matrix (i.e., the variance-covariance matrix of the random-effects parameters divided by the residual variance), or a random-effects variance-covariance matrix of (sigma^2 I).
If you have several random effects and big variance-covariance matrices for each, things can get a little hairy. If you want to recover the starting values that lmer will use by default you can use lFormula() as follows:
library(lme4)
ff <- lFormula(Reaction~Days+(Days|Subject),sleepstudy)
(lwr <- ff$reTrms$lower)
## [1] 0 -Inf 0
ifelse(lwr==0,1,0) ## starting values
## [1] 1 0 1
For this model, we have a single 2x2 random-effects variance-covariance matrix. The theta parameters correspond to the lower-triangle Cholesky factor of this matrix, in column-wise order, so the first and third elements are diagonal, and the second element is off-diagonal.
The fact that you have 105 theta parameters worries me; fitting such a large random-effects model will be extremely slow and take an enormous amount of data to fit reliably. (If you know your model makes sense and you have enough data you might want to look into faster options, such as using Doug Bates's MixedModels package for Julia or possibly glmmTMB, which might scale better than lme4 for problems with large theta vectors ...)
your model formula, y ~ x1 + x2 + x3 | x1 + x2 + x3, seems very odd. I can't figure out any context in which it would make sense to have the same variables as random-effect terms and grouping variables in the same model!

Warning message using robust regression with lmRob: Denominator smaller than tl= 1e-100 in test for bias

I'm doing robust regressions in R using lmRob from the robust package. I have four response variables (R1-R4) and eight predictors (X1-X8), thus doing four separate robust regressions à la:
`library(robust)
mod.rob <- lmRob(R1 ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8, data=dataset)
summary(mod.rob)`
... and so on for the four response variables.
R1 to R3 works perfect, but when I try to do R4 I get two error messages. First:
Warning message:
In lmRob.fit.compute(x, y, x1.idx = x1.idx, nrep = nrep, robust.control = robust.control, :
Max iteration for refinement reached.
... and after summary(mod.rob):
`Warning messages:
1: In test.lmRob(object) :
Denominator smaller than tl= 1e-100 in test for bias.
2: In test.lmRob(object) :
Denominator smaller than tl= 1e-100 in test for bias.`
In the resulting model all t-values and estimates are zero and all p-values are 1, except for the intercept. The test for bias M-estimate has NaN for both statistic and p-value, which I suspect is where the model fails for some reason. If i change the parameter mxr to 200 I get rid of the first warning, but not the second ones.
I have tried to modify some other parameters; initial.alg, tl, tlo, tua, but to no avail. Doing an ordinary LS with mod.lm <- lm(R4 ~ X1 ... X8) works just fine. I guess my statistics knowledge is lacking here, because I don't really understand what is wrong.
Edit:
I have uploaded the data here: http://jmp.sh/QVV6O9w
My goal is to fit regression models using these data. My statistics background is a lot more lacking than I thought, so when diving deeper into regression modelling I found it a bit more complicated than I first had hoped. In short, I just want to construct models that are as accurate as possible. The predictors are all based on previous literature, and from what I have read, it is more justified to keep them all in the model rather than using some stepwise method. I have so far tried to learn various kinds of robust techniques, GAMs, and bootstrapping, since the OLS linear models violate the normality of residuals assumption. I am at a loss of how to proceed, really.

Resources