Force certain parameters to have positive coefficients in lm() - r

I would like to know how to constrain certain parameters in lm() to have positive coefficients. There are a few packages or functions (e.g. display) that can force all coefficients, including the intercept, to be positive.
For instance, in this example, I would like to force only x1 and x2 to have positive coefficients.
x1=c(NA,rnorm(99)*10)
x2=c(NA,NA,rnorm(98)*10)
x3=rnorm(100)*10
y=sin(x1)+cos(x2)-x3+rnorm(100)
lm(y~x1+x2+x3)
Call:
lm(formula = y ~ x1 + x2 + x3)
Coefficients:
(Intercept)           x1           x2           x3  
   -0.06278      0.02261     -0.02233     -0.99626  
I have tried the function nnnpls() in package nnls; it can control the coefficient signs easily. Unfortunately I can't use it because the function doesn't allow NAs, and my data contain them.
I saw that the function glmc() can be used to apply constraints, but I couldn't get it working.
Could someone let me know what I should do?

You could use the package colf for this. It currently offers two least-squares non-linear optimizers, nls and nlxb:
library(colf)
DF <- data.frame(x1, x2, x3, y)  # collect the question's variables in a data frame
colf_nlxb(y ~ x1 + x2 + x3, data = DF, lower = c(-Inf, 0, 0, -Inf))
#nlmrt class object: x
#residual sumsquares = 169.53 on 98 observations
# after 3 Jacobian and 3 function evaluations
#  name               coeff      SEs tstat pval  gradient JSingval
#1 param_X.Intercept. -0.0066952  NA    NA   NA    3.8118 103.3941
#2 param_x1            0.0000000  NA    NA   NA  103.7644  88.7017
#3 param_x2            0.0000000  NA    NA   NA    0.0000   9.8032
#4 param_x3           -0.9487088  NA    NA   NA  330.7776   0.0000
colf_nls(y ~ x1 + x2 + x3, data = DF, lower = c(-Inf, 0, 0, -Inf))
#Nonlinear regression model
# model: y ~ param_X.Intercept. * X.Intercept. + param_x1 * x1 + param_x2 *
# x2 + param_x3 * x3
# data: model_ingredients$model_data
#param_X.Intercept.           param_x1           param_x2           param_x3 
#           -0.0392             0.0000             0.0000            -0.9801 
# residual sum-of-squares: 159
#
#Algorithm "port", convergence message: both X-convergence and relative convergence (5)
You can set the lower and/or upper bounds to specify the limits as you like for each one of the coefficients.
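For instance, a small sketch on the same DF (the upper-bound values here are arbitrary, just to show the syntax):
# same constraints as above, but additionally capping the x1 and x2 coefficients at 5;
# the bounds are given in the order intercept, x1, x2, x3
colf_nls(y ~ x1 + x2 + x3, data = DF,
         lower = c(-Inf, 0, 0, -Inf),
         upper = c(Inf, 5, 5, Inf))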

You can use package penalized:
set.seed(1)
x1=c(NA,rnorm(99)*10)
x2=c(NA,NA,rnorm(98)*10)
x3=rnorm(100)*10
y=sin(x1)+cos(x2)-x3+rnorm(100)
DF <- data.frame(x1,x2,x3,y)
lm(y~x1+x2+x3, data=DF)
#Call:
#lm(formula = y ~ x1 + x2 + x3, data = DF)
#
#Coefficients:
#(Intercept)           x1           x2           x3  
#   -0.02438     -0.01735     -0.02030     -0.98203  
Running penalized with no penalty and no positivity constraints gives the same result:
library(penalized)
mod1 <- penalized(y, ~ x1 + x2 + x3, ~1,
                  lambda1 = 0, lambda2 = 0, positive = FALSE, data = na.omit(DF))
coef(mod1)
#(Intercept)          x1          x2          x3 
#-0.02438357 -0.01734856 -0.02030120 -0.98202831 
If you constrain the coefficients of x1 and x2 to be positive, they become zero (as expected):
mod2 <- penalized(y, ~ x1 + x2 + x3, ~1,
                  lambda1 = 0, lambda2 = 0, positive = c(T, T, F), data = na.omit(DF))
coef(mod2)
#(Intercept)          x3 
#-0.03922266 -0.98011223 

With the ConsReg package (https://cran.r-project.org/web/packages/ConsReg/index.html) you can deal with this kind of problem.
You can set bound limits (lower and upper) and also restrictions between coefficients, like beta1 > beta2, which in some cases can be very useful.
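I haven't run it on the example above, but based on the package documentation the call would look roughly like the sketch below; the argument names (family, optimizer, constraints) are assumptions to verify against ?ConsReg:
library(ConsReg)
# rough sketch only -- check ?ConsReg for the exact argument names and defaults
fit <- ConsReg(y ~ x1 + x2 + x3, data = na.omit(DF),
               family = "gaussian", optimizer = "solnp",
               constraints = "x1 > 0, x2 > 0")
coef(fit)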

Related

Is it ok to run a plm fixed effect model and add a factor dummy variable (three-way fixed effects)?

Is it ok to run a "plm" fixed effect model and add a factor dummy variable in R as below?
The three factors "Time", "Firm" and "Country" are all separate indices which I want to fix all together.
Instead of making two indices in total by combining "Firm" and "Country", I find the specification below works much better for my case.
Is this an acceptable format?
plm(y ~ lag(x1, 1) + x2 + x3 + x4 + x5 + factor(Country), data=DATA,
    index=c("Firm","Time"), model="within")
It is okay to add additional factors. We can prove this by calculating an LSDV model. As a preliminary note, you will of course need robust standard errors, usually clustered at the highest aggregate level, i.e. country in this case.
Note: R >= 4.1 is used in the following.
LSDV
fit1 <-
  lm(y ~ d + x1 + x2 + x3 + x4 + factor(id) + factor(time) + factor(country),
     dat)
## country-clustered SEs; the last step drops the intercept and dummy rows from the printout
lmtest::coeftest(
  fit1, vcov.=sandwich::vcovCL(fit1, cluster=dat$country, type='HC0')) |>
  {\(.) .[!grepl('\\(|factor', rownames(.)), ]}()
#      Estimate Std. Error    t value      Pr(>|t|)
# d  10.1398727  0.3181993 31.8664223 4.518874e-191
# x1  1.1217514  1.6509390  0.6794627  4.968995e-01
# x2  3.4913273  2.7782157  1.2566797  2.089718e-01
# x3  0.6257981  3.3162148  0.1887085  8.503346e-01
# x4  0.1942742  0.8998307  0.2159008  8.290804e-01
After adding factor(country), the estimators we get with plm::plm are identical to LSDV:
plm::plm
fit2 <- plm::plm(y ~ d + x1 + x2 + x3 + x4 + factor(country),
                 index=c('id', 'time'), model='within', effect='twoways', dat)
summary(fit2, vcov=plm::vcovHC(fit2, cluster='group', type='HC1'))$coe
#      Estimate Std. Error    t-value      Pr(>|t|)
# d  10.1398727  0.3232850 31.3651179 5.836597e-186
# x1  1.1217514  1.9440165  0.5770277  5.639660e-01
# x2  3.4913273  3.2646905  1.0694206  2.849701e-01
# x3  0.6257981  3.1189939  0.2006410  8.409935e-01
# x4  0.1942742  0.9250759  0.2100089  8.336756e-01
However, cluster='group' will refer to "id" and not to "country", so the standard errors are wrong. It seems that clustering by the additional factor is currently not possible with plm, at least I am not aware of a way to do it.
Alternatively, you may use lfe::felm, which clusters by country directly and is also immensely faster than LSDV:
lfe::felm
summary(lfe::felm(y ~ d + x1 + x2 + x3 + x4 | id + time + country | 0 | country,
                  dat))$coe
#      Estimate Cluster s.e.    t value     Pr(>|t|)
# d  10.1398727    0.3184067 31.8456637 1.826374e-33
# x1  1.1217514    1.6520151  0.6790201 5.004554e-01
# x2  3.4913273    2.7800267  1.2558611 2.153737e-01
# x3  0.6257981    3.3183765  0.1885856 8.512296e-01
# x4  0.1942742    0.9004173  0.2157602 8.301083e-01
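Another option, not used above, is fixest::feols, which likewise absorbs all three sets of fixed effects and lets you cluster directly by country (a sketch on the same simulated dat):
# coefficients should match LSDV / felm, with country-clustered standard errors
fixest::feols(y ~ d + x1 + x2 + x3 + x4 | id + time + country,
              data = dat, cluster = ~ country)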
For comparison, here is what Stata computes; the standard errors closely resemble those of LSDV and lfe::felm:
Stata
. reghdfe y d x1 x2 x3 x4, absorb(country time id) vce(cluster country)
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           d |   10.13987   .3185313    31.83   0.000      9.49907    10.78068
          x1 |   1.121751   1.652662     0.68   0.501    -2.202975    4.446478
          x2 |   3.491327   2.781115     1.26   0.216    -2.103554    9.086209
          x3 |   .6257981   3.319675     0.19   0.851    -6.052528    7.304124
          x4 |   .1942742   .9007698     0.22   0.830    -1.617841    2.006389
       _cons |   14.26801   23.65769     0.60   0.549    -33.32511    61.86114
Simulated Panel Data:
n1 <- 20; t1 <- 4; n2 <- 48
dat <- expand.grid(id=1:n1, time=1:t1, country=1:n2)
set.seed(42)
dat <- within(dat, {
id <- as.vector(apply(matrix(1:(n1*n2), n1), 2, rep, t1))
d <- runif(nrow(dat), 70, 80)
x1 <- sample(0:1, nrow(dat), replace=TRUE)
x2 <- runif(nrow(dat))
x3 <- runif(nrow(dat))
x4 <- rnorm(nrow(dat))
y <-
10*d + ## treatment effect
as.vector(replicate(n2, rep(runif(n1, 2, 5), t1))) + ## id FE
rep(runif(n1, 10, 12), each=t1) + ## time FE
rep(runif(n2, 10, 12), each=n1*t1) + ## country FE
- .7*x1 + 1.3*x2 + 2.4*x3 +
.5 * x4 + rnorm(nrow(dat), 0, 50)
})
readstata13::save.dta13(dat, 'panel.dta') ## for Stata

KFAS: Negative variances and huge Std Errors

I am trying to fit a Marketing Mix Model and run into the following problem:
Warning message:
In KFS(model) :
Possible error in diffuse filtering: Negative variances in Pinf, check the model or try changing the tolerance parameter tol or P1/P1inf of the model.
Below is a reproducible sample code with more detail.
The goal is to develop a tool to optimize the marketing mix.
Below, x1(t), x2(t), x3(t) are investments into 3 marketing channels at time t
y(t) is the sales at time t
We want to use Kalman Filter approach:
y(t) = alpha + lambda * y(t-1) + beta1 * x1(t) + beta2 * x2(t) + beta3 * x3(t) +
beta12 * x1(t) * x2(t) + beta13 * x1(t) * x3(t) + beta23 * x2(t) * x3(t) + N(0, sigma)
In order to:
i. deduce sales attributed to each channel x1(t), x2(t), x3(t)
ii. their synergies x1 * x2, x1 * x3, x2 * x3
iii. carry over sales yCO(t) = y(t-1)
We use KFAS package
https://cran.r-project.org/web/packages/KFAS/index.html
Below is a simple reproducible example where we:
1. Simulate x1(t), x2(t), x3(t)
2. Set the input parameter values used for the sales simulation:
b1, b2, b12, ..., b23, lambda0, and sigma
3. Use the simulated y(t), x1(t), x2(t), x3(t) to fit the model
4. Compare the estimated coefficients with the input values b1, b2, ...
library(KFAS)
library(dplyr)
sigma<-50
set.seed(1)
x1<-1000 + rnorm(n = 100,mean = 0,sd = 100) + rnorm(100, 0, sigma)
x2<-rep(0, 100)
x2[sort(which(1:100%%6==0))]<-500
x3<-300+100*sin(1:100%%12/12*pi) + rnorm(100, 0, sigma)
#Operationalize with SQRT
x1<-sqrt(x1)
x2<-sqrt(x2)
x3<-sqrt(x3)
#Set input parameters for simulation
lambda0<-0.5
b1 <- 3
b2 <- 4
b3 <- 5
b12 <- 0.3
b13 <- 0.2
b23 <- 0.1
y_s <-
b1 * x1 +
b2 * x2 +
b3 * x3 +
b12 * x1 * x2 +
b13 * x1 * x3 +
b23 * x2 * x3 +
rnorm(100, sd = sigma)
# function to account for carry over term
# y(t) = lambda * y(t-1) + y_s(t) , where
# y_s(t) = b1 * x1 + b2 * x2 + b3 * x3 + Synergy terms
getCarryOver<-function(t,
y_s,
lambda)
{
if (t==1) return(y_s[1])
else lambda*getCarryOver(t-1,y_s,lambda) + y_s[t]
}
# Add Carry Over term
y<-vector('numeric',100)
for (i in 1:100){
y[i]<-getCarryOver(i,y_s,lambda0)
}
yCO=dplyr::lag(y)
yCO[1]=0
if (!identical(y[-1],y_s[-1]+lambda0*yCO[-1]))
stop('identical(y,y_s+lambda0*yCO)')
model <- SSModel(y ~ SSMregression(~ x1 + x2 + x3 +
                                     x1*x2 + x1*x3 + x2*x3 +
                                     yCO,
                                   Q = diag(NA, 1)),
                 H = NA)
fit <- fitSSM(model, inits = c(0,0,0,0,0,0,0,0), method = "BFGS")
model <- fit$model
model$Q
model$H
out <- KFS(model)
print(out)
This doesn't fully answer your question but this is too long for a comment...
You are not creating the same model that you formulated before the code. By defining Q = NA you are actually stating that the first coefficient, x1, should be time-varying with unknown variance. And in the fitSSM call you are giving too many initial values, so you don't notice the error (only Q and H need numerical estimation by fitSSM; the coefficients for x1 etc. are estimated directly by the Kalman filter). I admit there are probably a few checks missing here which would warn the user accordingly. The SSMregression function is only needed if you have time-varying regression coefficients or complex multivariate models; here you can just write SSModel(y ~ x1*x2 + x1*x3 + x2*x3 + yCO, H = NA) (the main effects are automatically included, as in lm).
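For illustration, if a time-varying (random walk) coefficient on x1 really were intended, it would be written along these lines (a sketch, not what your formulation above asks for):
# time-varying coefficient for x1 only, all other coefficients fixed;
# fitSSM then needs initial values for the two NA variances (Q and H)
model_tv <- SSModel(y ~ x2 + x3 + yCO + SSMregression(~ x1, Q = NA), H = NA)
fit_tv <- fitSSM(model_tv, inits = c(0, 0), method = "BFGS")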
I would also check the carryover term calculations, just to be sure that you actually generate your data correctly.
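A quick way to do that, using the objects already defined in the question, is to rebuild y with a recursive filter and compare:
# y(t) = lambda0 * y(t-1) + y_s(t), starting from zero, as a recursive filter
y_check <- as.numeric(stats::filter(y_s, lambda0, method = "recursive"))
all.equal(y, y_check)  # TRUE if getCarryOver() generated the data as intended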
If you still get errors, it could be that you have such high multicollinearity that you need to modify the prior for the first time step, i.e. set model$P1inf[] <- 0 (which removes the diffuse initialization) and set diag(model$P1) to something moderate like 100 (the prior variance of the coefficients).
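In code, that adjustment would look like this (using the simplified formula from above):
model <- SSModel(y ~ x1*x2 + x1*x3 + x2*x3 + yCO, H = NA)
model$P1inf[] <- 0      # remove the diffuse initialization
diag(model$P1) <- 100   # moderate prior variance for the coefficients
fit <- fitSSM(model, inits = 0, method = "BFGS")
out <- KFS(fit$model)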
Actually, if you are using the formulation you suggest (all x's and yCO as simple explanatory variables), then you should get identical results with lm. And running your code with lm, I get the same apparently wrong results:
> model <- SSModel(y~ x1*x2 + x1*x3 + x2*x3 + yCO, H=NA)
>
> fit <- fitSSM(model, inits = 0, method = "BFGS")
> out <- KFS(fit$model)
Warning message:
In KFS(fit$model) :
Possible error in diffuse filtering: Negative variances in Pinf, check the model or try changing the tolerance parameter tol or P1/P1inf of the model.
> out
Smoothed values of states and standard errors at time n = 100:
              Estimate  Std. Error
(Intercept) -1.171e+03   1.300e+03
x1           3.782e+01   4.102e+01
x2          -4.395e+00   1.235e+01
x3           7.287e+01   6.844e+01
yCO          5.244e-01   3.396e-02
x1:x2        5.215e-01   3.979e-01
x1:x3       -1.853e+00   2.167e+00
x2:x3        1.671e-01   3.471e-01
> summary(lm(y~ x1*x2 + x1*x3 + x2*x3 + yCO))
Call:
lm(formula = y ~ x1 * x2 + x1 * x3 + x2 * x3 + yCO)
Residuals:
      Min        1Q    Median        3Q       Max 
 -137.297   -29.870    -2.214    35.178    87.578 
Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.171e+03  1.300e+03  -0.901    0.370    
x1           3.782e+01  4.102e+01   0.922    0.359    
x2          -4.395e+00  1.235e+01  -0.356    0.723    
x3           7.287e+01  6.844e+01   1.065    0.290    
yCO          5.244e-01  3.396e-02  15.445   <2e-16 ***
x1:x2        5.215e-01  3.979e-01   1.311    0.193    
x1:x3       -1.853e+00  2.167e+00  -0.855    0.395    
x2:x3        1.671e-01  3.471e-01   0.481    0.631    
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 50.25 on 92 degrees of freedom
Multiple R-squared: 0.88, Adjusted R-squared: 0.8709
F-statistic: 96.43 on 7 and 92 DF, p-value: < 2.2e-16
So I think you have some issues with generating your example data, or in the model formulation.

Panel regression with ar1 model

Excuse my naiveté. I'm not sure what this type of model is called -- perhaps panel regression.
Imagine I have the following data:
n <- 100
x1 <- rnorm(n)
y1 <- x1 * 0.5 + rnorm(n)/2
x2 <- rnorm(n)
y2 <- x2 * 0.5 + rnorm(n)/2
x3 <- rnorm(n)
y3 <- x3 * 0.25 + rnorm(n)/2
x4 <- rnorm(n)
y4 <- x4 * 0 + rnorm(n)/2
x5 <- rnorm(n)
y5 <- x5 * -0.25 + rnorm(n)/2
x6 <- rnorm(n)
y6 <- x6 * -0.5 + rnorm(n) + rnorm(n)/2
x7 <- rnorm(n)
y7 <- x7 * -0.75 + rnorm(n)/2
foo <- data.frame(s = rep(1:100, times = 7),
                  y = c(y1, y2, y3, y4, y5, y6, y7),
                  x = c(x1, x2, x3, x4, x5, x6, x7),
                  i = rep(1:7, each = n))
Where y and x are individual AR1 time series measured over 100 seconds (I use 's' instead of 't' for the time variable) divided equally into groups (i). I wish to model these as:
y_t = b_0 + b_1 * y_{t-1} + b_2 * x_t + e_t
but while taking the group (i) into account:
y_{it} = b_0 + b_1 * y_{i,t-1} + b_2 * x_{it} + e_{it}
I wish to know if b_2 (the coef on x) is a good predictor of y and how that coef varies with group. I also want to know the R2 and RMSE by group and to predict y_i given x_i and i. The grouping variable can be discrete or continuous.
I gather that this type of problem is called panel regression but it is not a term that is familiar to me. Is using plm in R a good approach to investigate this problem?
Based on the comment below, I guess this is a simple start:
require(dplyr)
require(broom)
fitted_models <- foo %>% group_by(i) %>% do(model = lm(y ~ x, data = .))  # the grouping column in foo is i
fitted_models %>% tidy(model)
fitted_models %>% glance(model)
Since you don't include fixed or random effects in the model, we are dealing with the pooled OLS (POLS) which can be estimated using lm or plm.
Let's construct example data of 100 groups and 100 observations for each:
df <- data.frame(x = rnorm(100 * 100), y = rnorm(100 * 100),
                 group = factor(rep(1:100, each = 100)))
df$ly <- unlist(tapply(df$y, df$group, function(x) c(NA, head(x, -1))))
head(df, 2)
#           x          y group        ly
# 1 1.7893855  1.2694873     1        NA
# 2 0.8671304 -0.9538848     1 1.2694873
Then
m1 <- lm(y ~ ly + x:group, data = df)
is a model with a common autoregressive coefficient and a group-dependent effect of x:
head(coef(m1)[-1:-2], 5)
#   x:group1    x:group2    x:group3    x:group4    x:group5 
#-0.02057244  0.06779381  0.04628942 -0.11384630  0.06377069 
This allows you to plot them, etc. I suppose one thing that you will want to do is to test whether those coefficients are equal. That can be done as follows:
m2 <- lm(y ~ ly + x, data = df)
library(lmtest)
lrtest(m1, m2)
# Likelihood ratio test
#
# Model 1: y ~ ly + x:group
# Model 2: y ~ ly + x
#   #Df LogLik  Df  Chisq Pr(>Chisq)
# 1 103 -14093
# 2   4 -14148 -99 110.48     0.2024
Hence, we cannot reject that the effects of x are the same, as expected.
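As for the R2 and RMSE by group that you also asked about, the dplyr/broom approach sketched in the question can be extended along these lines (assuming the grouping column is i, as in foo):
library(dplyr)
library(broom)
# per-group fit statistics: r.squared and sigma (the residual SD, a proxy for RMSE)
foo %>%
  group_by(i) %>%
  group_modify(~ glance(lm(y ~ x, data = .x))) %>%
  select(i, r.squared, sigma)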

R fitting a polynomial on data

I have some data synthetically generated from a function which is shown below.
fn <- function(w1,w2){
f= -(0.1 + 1.3*w1 + 0.4*w2 - 1.8*w1*w1 - 1.8*w2*w2)
return(f)
}
Next I create a data frame with the values as shown below
x = data.frame(
  yval = fn(seq(0.1, 0.9, by = 0.01), seq(1.1, 0.3, by = -0.01)),
  x1 = seq(0.1, 0.9, by = 0.01),
  x2 = seq(1.1, 0.3, by = -0.01)
)
I want to see if I can recreate the coefficients of the polynomial in fn by using a polynomial fit which I attempt as shown below
fit = lm(yval ~ x1 + x2 + I(x1^2) + I(x2^2),data=x)
coef(fit)
However when I run the above code, I get the following
(Intercept)          x1          x2     I(x1^2)     I(x2^2) 
      2.012      -5.220          NA       3.600          NA 
It appears that the term x2 was never "detected". Would anybody know what I could be doing wrong? I know that if I create synthetic linear data and try to re-create the coefficients using lm, I would get back the coefficients fairly accurately. Thanks in advance.
Nothing is wrong with lm() here: in your data frame x1 and x2 lie on a single line (x2 = 1.2 - x1), so x2 and I(x2^2) are exact linear combinations of the intercept, x1 and I(x1^2), and lm() drops them and reports NA. To recover all the coefficients, the two predictors have to vary independently; if you're fitting to a grid of 2 predictors, you want expand.grid.
x <- expand.grid(x1=seq(0.1, 0.9, by=0.01), x2=seq(1.1, 0.3, by=-0.01))
x$yval <- with(x, fn(x1, x2))
fit = lm(yval ~ x1 + x2 + I(x1^2) + I(x2^2),data=x)
coef(fit)
(Intercept)          x1          x2     I(x1^2)     I(x2^2) 
       -0.1        -1.3        -0.4         1.8         1.8 

Replace lm coefficients in R

Is it possible to replace coefficients in lm object?
I thought the following would work
# sample data
set.seed(2157010)
x1 <- 1998:2011
x2 <- x1 + rnorm(length(x1))
y <- 3*x1 + rnorm(length(x1))
fit <- lm( y ~ x1 + x2)
# view original coefficients
coef(fit)
# replace coefficients with new values
fit$coef(fit$coef[2:3]) <- c(5, 1)
# view new coefficients
coef(fit)
Any assistance would be greatly appreciated
Your code is not reproducible, as there are a few errors in it. Here's a corrected version, which also shows your mistake:
set.seed(2157010) #forgot set.
x1 <- 1998:2011
x2 <- x1 + rnorm(length(x1))
y <- 3*x2 + rnorm(length(x1)) #you had x, not x1 or x2
fit <- lm( y ~ x1 + x2)
# view original coefficients
coef(fit)
 (Intercept)           x1           x2 
260.55645444  -0.04276353   2.91272272 
# replace coefficients with new values; use the full name, which is 'coefficients':
fit$coefficients[2:3] <- c(5, 1)
# view new coefficients
coef(fit)
(Intercept)          x1          x2 
   260.5565      5.0000      1.0000 
So the problem was that you were using fit$coef, although the name of the component in the lm output is really coefficients. The abbreviated version works for getting the values (thanks to partial matching in $), but not for setting: the assignment created a new component named coef, while the coef() function still extracted the values of fit$coefficients.
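Worth noting (a quick check, not part of the original answer): predict() on new data uses the replaced coefficients, while stored quantities such as fitted(fit) and residuals(fit) still come from the original least-squares fit.
# the model matrix for the new data is multiplied by fit$coefficients,
# so the manually set values 5 and 1 are used here
newdat <- data.frame(x1 = c(2012, 2013), x2 = c(2013, 2014))
predict(fit, newdata = newdat)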
