Constrained weighted linear regression in R

I am trying to set up a constrained weighted linear regression. That is to say, I have a dataset of i observations and three different x variables, and each observation has a weight. I want to perform a weighted multiple linear regression with the restrictions that the weighted mean of each x variable has to be zero and its weighted standard deviation has to be one.
Since I am new and have no reputation yet, I can't post images with LaTeX formulas, so I have to write them down this way.
First restriction $\sum_{i} w_{i} X_{i,k} = 0$ for k = 1,2,3.
Second one: $\sum_{i} w_{i} X_{i,k}^2 = 1$ for k = 1,2,3.
This is an example dataset:
y <- rnorm(10)
w <- rep(0.1, 10)
x1 <- rnorm(10)
x2 <- rnorm(10)
x3 <- rnorm(10)
data <- cbind(y, x1, x2, x3, w)
lm(y ~ x1 + x2 + x3, data = data, weights = data$w)
The weights do not have to be equal for each observation but have to add up to one.
I would like to include these restrictions into the regression. Is there a way to do that?

You could perhaps use the Generalised Linear Model:
glm(y ~ x1 + x2 + x3, weights = w, data=data)
Note that data needs to be a data.frame(...); the cbind() call above produces a matrix.
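The two restrictions are conditions on the predictor columns themselves, so one way to satisfy them is to weight-standardize each column before fitting. A minimal sketch, where the helper w_std and the data frame dat are illustrative names rather than anything from the question:
# weighted-standardize a column so that sum(w * x) = 0 and
# sum(w * x^2) = 1 (the weights already sum to one)
w_std <- function(x, w) {
  x <- x - sum(w * x)       # weighted mean of zero
  x / sqrt(sum(w * x^2))    # weighted second moment of one
}

dat <- data.frame(y, x1, x2, x3, w)   # lm() wants a data.frame, not a cbind() matrix
dat[c("x1", "x2", "x3")] <- lapply(dat[c("x1", "x2", "x3")], w_std, w = dat$w)

# check: both restrictions hold for every column (up to floating point)
sapply(dat[c("x1", "x2", "x3")], function(x) c(sum(dat$w * x), sum(dat$w * x^2)))

fit <- lm(y ~ x1 + x2 + x3, data = dat, weights = w)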

Related

Use linear model for propensity score matching (R MatchIt)

I want to use the propensity score fitted from a linear model to match observations using the MatchIt library.
For example, suppose df is a dataframe. If I were to balance its treatment column in terms of x1, x2 and x3 using the propensity score from a logit, I would set distance = 'glm' and link = 'logit':
m <- matchit(formula = treatment ~ x1 + x2 + x3,
             data = df,
             method = 'nearest',
             distance = 'glm',
             link = 'logit')
How can I do the same with a linear model instead of a logit?
As per the documentation:
When link is prepended by "linear.", the linear predictor is used instead of the predicted probabilities.
Hence, I tried:
m <- matchit(formula = treatment ~ x1 + x2 + x3,
             data = df,
             method = 'nearest',
             distance = 'glm',
             link = 'linear.logit')
I'm afraid that doing this (link = 'linear.logit') would still use the logit model, just with its log-odds (the linear predictor) as the score.
Is there any way I can just use a linear model instead of a generalized linear model?
You cannot do this from within matchit() (and, in general, you shouldn't, which is why it is not allowed). But you can always estimate the propensity scores outside matchit() however you want and then supply them to matchit(). To use a linear probability model, you would run the following:
fit <- lm(treatment ~ x1 + x2 + x3, data = df)
ps <- fitted(fit)
m <- matchit(treatment ~ x1 + x2 + x3,
             data = df,
             method = 'nearest',
             distance = ps)
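As a usage note, assuming m from the code above: distance accepts any precomputed numeric vector, so this pattern works for scores from whatever model you like; balance and the matched sample can then be inspected as usual.
summary(m)                 # balance diagnostics on the matched sample
matched <- match.data(m)   # matched dataset for the outcome analysis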

How to get simple slopes for particular values in emtrends?

I have a mixed model with two categorical predictors (X1, X2) and one continuous predictor (X3).
model <- lmer(z ~ x1 * x2 * x3 + (1|group), data = data)
I am interested in comparing high and low traits of my continuous predictor.
My plan is to contrast the simple slopes for X3 (at -1 SD, M, +1 SD).
As far as I understand this can be done using the emtrends() function from emmeans like so:
sd1 <- mean(data$x3, na.rm = TRUE) + sd(data$x3, na.rm = TRUE)
m   <- mean(data$x3, na.rm = TRUE)
sd2 <- mean(data$x3, na.rm = TRUE) - sd(data$x3, na.rm = TRUE)
mylist <- list(x3 = c(sd1, m, sd2))
emtrends(model, ~ x1 * x2 | x3,
         var = "x3",
         at = mylist)
However, the coefficients provided are the same for the three values of X3.
Can anyone shed light on this?
The simple_slopes() function in the reghelper package could be an alternative to emmeans in this specific case. The following simulation probes simple slopes at the values -1, 0 and 1 of x3 (which was simulated with mean = 0 and sd = 1), but you can of course use any values. For the categorical predictors, all combinations are calculated, as well as the slope of x3 at the "mean" values of both categorical predictors:
library(lme4)
library(reghelper)
set.seed(42143)
# sample size
n <- 1000
# generate the factor variables (dummy coded)
x1 <- sample(x = c(0, 1), size = n, replace = TRUE, prob = c(.90, .10))
x2 <- sample(x = c(0, 1), size = n, replace = TRUE, prob = c(.90, .10))
# generate the continuous variable
x3 <- rnorm(n, mean = 0, sd = 1)
# generate the group variable
group <- sample(letters, size = n, replace = TRUE)
# generate the dependent variable with some main-effect and three-way interaction weights
z <- 0.2*x1 - 0.3*x2 + 0.4*x3 + 0.2*x1*x2*x3 + rnorm(n)
# collect the variables into a data frame
dat <- data.frame(x1, x2, x3, group, z)
# run the model
model <- lmer(z ~ x1 * x2 * x3 + (1|group), data = dat)
# explore simple slopes at the specified variable values;
# "sstest" gives the slopes
simple_slopes(model, levels = list(x1 = c(0, 1, mean(dat$x1), "sstest"),
                                   x2 = c(0, 1, mean(dat$x2), "sstest"),
                                   x3 = c(-1, 0, 1, "sstest")))
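On the original emtrends() puzzle: the slopes are identical across the three x3 values by construction, because the model is linear in x3, so the slope of x3 depends on x1 and x2 but not on x3 itself. If the goal is the x3 slope within each combination of the categorical predictors, a sketch along these lines (using the dummy-coded simulation above; with factor predictors the at = part would not be needed) may be closer to what was intended:
library(emmeans)
# slope of x3 for every x1 by x2 combination; there is no need to condition on x3
emtrends(model, ~ x1 * x2, var = "x3",
         at = list(x1 = c(0, 1), x2 = c(0, 1)))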

R: Plotting Prediction Results for a multiple regression

I want to observe the effect of a treatment variable on my outcome Y. I did a multiple regression, fit <- lm(Y ~ x1 + x2 + x3), where x1 is the treatment variable and x2, x3 are the control variables. I used the predict function holding x2 and x3 at their means and plotted these predictions.
Now I would like to add a line to my plot, similar to a simple-regression abline, but I do not know how to do this.
I think I have to use lines(x, y), where y is the vector of predictions and x is a sequence of values for my variable x1, but R tells me the lengths of y and x differ.
I think you are looking for termplot:
## simulate some data
set.seed(0)
x1 <- runif(100)
x2 <- runif(100)
x3 <- runif(100)
y <- cbind(1,x1,x2,x3) %*% runif(4) + rnorm(100, sd = 0.1)
## fit a model
fit <- lm(y ~ x1 + x2 + x3)
termplot(fit, se = TRUE, terms = "x1")
termplot uses predict.lm(, type = "terms") for term-wise prediction. If the model has an intercept (as above), predict.lm will centre each term (see "What does predict.glm(, type = "terms") actually do?"). In this way, each term is predicted to be 0 at the mean of its covariate, and the standard error at the mean is 0 (hence the confidence interval intersects the line at the mean).
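If you would rather stay with the predict()-based approach from the question, a minimal sketch reusing fit and the simulated data above (the names newdat and pred are just illustrative) is to predict over a grid of x1 while holding x2 and x3 at their means, then draw the line with lines():
# predict over a grid of x1, holding x2 and x3 at their means
newdat <- data.frame(x1 = seq(min(x1), max(x1), length.out = 100),
                     x2 = mean(x2),
                     x3 = mean(x3))
pred <- predict(fit, newdata = newdat)

plot(x1, y)                           # raw data
lines(newdat$x1, pred, col = "blue")  # effect of x1 with x2, x3 held at their means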

optimal predictor value for multivariate regression in R

Suppose I have 1 response variable Y and 2 predictors X1 and X2, such as the following
Y    X1    X2
2.3  1.1   1.2
2.5  1.24  1.17
...
Assuming I have a strong belief the following model works well
fit <- lm(Y ~ poly(X1,2) + X2)
in other words, there is a quadratic relation between Y and X1, a linear relationship between Y and X2.
Now here are my questions:
How do I find the value of (x1, x2) at which the fitted model reaches its maximum?
Now assuming X2 has to be fixed at some particular value, how do I find the optimal x1 such that the fitted value is maximized?
So here is an empirical way to do this:
# create some random data...
set.seed(1)
X1 <- 1:100
X2 <- sin(2*pi/100 * (1:100))
df <- data.frame(Y = 3 + 5*X1 - 0.2*X1^2 + 100*X2 + rnorm(100, 0, 5), X1, X2)
fit <- lm(Y ~ poly(X1, 2, raw = TRUE) + X2, data = df)
# X1 and X2 unconstrained
df$pred <- predict(fit)
result <- with(df, df[pred == max(pred), ])
result
#           Y X1        X2     pred
#  19 122.8838 19 0.9297765 119.2087
# max(Y | X2 = 0)
newdf <- data.frame(Y = df$Y, X1 = df$X1, X2 = 0)
newdf$pred2 <- predict(fit, newdata = newdf)
result2 <- with(newdf, newdf[pred2 == max(pred2), ])
result2
#           Y X1 X2    pred2
#  12 104.6039 12  0 35.09141
So in this example, when X1 and X2 are unconstrained, the maximum fitted value is 119.2 and occurs at (X1, X2) = (19, 0.930). When X2 is constrained to 0, the maximum fitted value is 35.1 and occurs at (X1, X2) = (12, 0).
There are a couple of things to consider:
These are global maxima in the space of your data. In other words if your real data has a large number of variables there might be local maxima that you will not find this way.
This method has resolution only as great as your dataset. So if the true maximum occurs at a point between your data points, you will not find it this way.
This technique is restricted to the bounds of your dataset. So if the true maximum is outside those bounds, you will not find it. On the other hand, using a model outside the bounds of your data is, IMHO, the definition of reckless.
Finally, you should be aware that poly(...) produces orthogonal polynomials. These will generate a fit, but the coefficients are very difficult to interpret. If you really want a quadratic fit of the form a + b*x + c*x^2, you are better off specifying it explicitly with Y ~ X1 + I(X1^2) + X2, or using raw = TRUE in the call to poly(...).
Credit to @sashkello.
Basically, I have to extract the coefficients from the lm object and multiply them with the corresponding terms to form the formula, then proceed from there.
I think this is not very efficient. What if this were a regression with hundreds of predictors?
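If the grid-resolution limitation mentioned above is a concern, the fitted surface can also be optimized directly rather than only evaluated at the observed points. A minimal sketch with optim(), restricted to the observed range and reusing fit and df from above (neg_pred is an illustrative helper):
# maximize the fitted surface over a continuous (X1, X2) region
neg_pred <- function(par) {
  -predict(fit, newdata = data.frame(X1 = par[1], X2 = par[2]))
}
opt <- optim(c(mean(df$X1), mean(df$X2)), neg_pred,
             method = "L-BFGS-B",
             lower = c(min(df$X1), min(df$X2)),
             upper = c(max(df$X1), max(df$X2)))
opt$par     # (X1, X2) at the maximum of the fitted surface
-opt$value  # the maximum fitted value itself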

Isolating coefficients from polynomial fit in r

I have fitted a simple 2nd order polynomial to time series data in the following form:
polyfit <- lm(y ~ poly(x,2))
I wish to extract the respective coefficients (A, B and C) from the fitted polynomial of the form y = Ax^2 + Bx + C. I naturally thought the answer would be found in polyfit$coefficients, but these coefficients are not correct. I have tried some very simple data sets and compared with Excel: while the polynomial curve fits are identical in R and Excel, the A, B and C coefficients obtained from Excel are correct, but those obtained from the polyfit object aren't. Have I extracted the wrong information from the polyfit object? It would be more convenient for me to extract the coefficients directly in R. Can anyone help?
By default poly fits orthogonal polynomials. Essentially it uses the following ideas...
x <- 1:10
# in this case same as x-mean(x)
y1 <- residuals(lm(x ~ 1))
# normalize to have unit norm
y1 <- y1/sqrt(sum(y1^2))
# quadratic component, made orthogonal to the constant and to y1
y2 <- residuals(lm(y1^2 ~ y1))
y2 <- y2/sqrt(sum(y2^2))
# cubic component, made orthogonal to the constant, y1 and y2
y3 <- residuals(lm(x^3 ~ y2 + y1))
y3 <- y3/sqrt(sum(y3^2))
cbind(y1, y2, y3)
poly(x, 3)
to construct a set of orthonormal vectors that still produce the same predictions. If you just want to get a polynomial in the regular way then you want to specify raw=TRUE as a parameter.
y <- rnorm(20)
x <- 1:20
o <- lm(y ~ poly(x, 2, raw = TRUE))
# alternatively do it 'by hand'
o.byhand <- lm(y ~ x + I(x^2))
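As a short follow-up using o and o.byhand from above: with raw = TRUE the coefficients map directly onto y = Ax^2 + Bx + C, and the orthogonal and raw parameterisations give the same fitted values, which is why the curve looked right even though the orthogonal coefficients did not match Excel.
coef(o)          # intercept = C, linear term = B, quadratic term = A
coef(o.byhand)   # same numbers from the explicit x + I(x^2) formula
# both parameterisations describe the same fitted curve
all.equal(unname(fitted(o)), unname(fitted(lm(y ~ poly(x, 2)))))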
