Suppose I have 1 response variable Y and 2 predictors X1 and X2, such as the following
Y X1 X2
2.3 1.1 1.2
2.5 1.24 1.17
......
Assuming I have a strong belief the following model works well
fit <- lm(Y ~ poly(X1,2) + X2)
in other words, there is a quadratic relation between Y and X1, a linear relationship between Y and X2.
Now here are my questions:
How do I find the pair (x1, x2) at which the fitted model reaches its maximum value?
Now, assuming X2 has to be fixed at some particular value, how do I find the x1 that maximizes the fitted value?
So here is an empirical way to do this:
# create some random data...
set.seed(1)
X1 <- 1:100
X2 <- sin(2*pi/100*(1:100))
df <- data.frame(Y=3 + 5*X1 -0.2 * X1^2 + 100*X2 + rnorm(100,0,5),X1,X2)
fit <- lm(Y ~ poly(X1,2,raw=TRUE) + X2, data=df)
# X1 and X2 unconstrained
df$pred <- predict(fit)
result <- with(df,df[pred==max(pred),])
result
# Y X1 X2 pred
# 19 122.8838 19 0.9297765 119.2087
# max(Y|X2=0)
newdf <- data.frame(Y=df$Y, X1=df$X1, X2=0)
newdf$pred2 <- predict(fit,newdata=newdf)
result2 <- with(newdf,newdf[pred2==max(pred2),])
result2
# Y X1 X2 pred2
#12 104.6039 12 0 35.09141
So in this example, when X1 and X2 are unconstrained, the maximum fitted value is 119.2 and occurs at (X1,X2) = (19,0.930). When X2 is constrained to 0, the maximum fitted value is 35.1 and occurs at (X1,X2) = (12,0).
There are a couple of things to consider:
These are global maxima within the space of your data. In other words, if your real data has a large number of variables, there might be local maxima that you will not find this way.
This method has resolution only as great as your dataset. So if the true maximum occurs at a point between your data points, you will not find it this way.
This technique is restricted to the bounds of your dataset. So if the true maximum is outside those bounds, you will not find it. On the other hand, using a model outside the bounds of your data is, IMHO, the definition of reckless.
Finally, you should be aware that poly(...) produces orthogonal polynomials by default, which will generate a fit, but the coefficients will be very difficult to interpret. If you really want a quadratic fit, e.g. a + b × x + c × x^2, you are better off doing that explicitly with Y~X1+I(X1^2)+X2, or using raw=TRUE in the call to poly(...).
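One way around the resolution limit mentioned above is to treat the fitted model as a continuous function of X1 and let an optimizer search within the range of the data. The following is only a sketch under that assumption, not part of the empirical method itself: it rebuilds the same simulated data, holds X2 at 0, and passes predict() to base R's optimize(). The helper name pred_at is made up for the example.

```r
set.seed(1)
X1 <- 1:100
X2 <- sin(2 * pi / 100 * (1:100))
df <- data.frame(Y = 3 + 5 * X1 - 0.2 * X1^2 + 100 * X2 + rnorm(100, 0, 5), X1, X2)
fit <- lm(Y ~ poly(X1, 2, raw = TRUE) + X2, data = df)

# Fitted value as a function of X1, with X2 held at a fixed value
# (pred_at is a hypothetical helper, not an existing R function)
pred_at <- function(x1, x2_fixed) {
  as.numeric(predict(fit, newdata = data.frame(X1 = x1, X2 = x2_fixed)))
}

# Search the observed range of X1 only, so the model is never extrapolated
opt <- optimize(pred_at, interval = range(df$X1), x2_fixed = 0, maximum = TRUE)
opt$maximum    # X1 at which the fitted value is largest
opt$objective  # fitted value at that point

# Cross-check against the vertex of the fitted parabola, -b/(2c)
cf <- unname(coef(fit))
vertex <- -cf[2] / (2 * cf[3])
vertex
```

Because the search interval is range(df$X1), the result can fall between observed points but never outside the bounds of the data.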
Credit to @sashkello.
Basically, I have to extract the coefficients from the lm object and multiply them by the corresponding terms to build the formula before proceeding.
I think this is not very efficient. What if this is a regression with hundreds of predictors?
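For models with many predictors, one option that avoids extracting coefficients by hand is to pass predict() directly to a general-purpose optimizer. The sketch below is an illustration rather than a recommended recipe. It assumes that X1 and X2 can be varied independently within their observed ranges (in the simulated data X2 is actually a function of X1, so the optimum may be a combination that never occurs in the data). The function name neg_pred is made up for the example.

```r
set.seed(1)
X1 <- 1:100
X2 <- sin(2 * pi / 100 * (1:100))
df <- data.frame(Y = 3 + 5 * X1 - 0.2 * X1^2 + 100 * X2 + rnorm(100, 0, 5), X1, X2)
fit <- lm(Y ~ poly(X1, 2, raw = TRUE) + X2, data = df)

# optim() minimizes, so return the negative of the fitted value
# (neg_pred is a hypothetical helper; par = c(X1, X2))
neg_pred <- function(par) {
  -as.numeric(predict(fit, newdata = data.frame(X1 = par[1], X2 = par[2])))
}

# Box constraints keep the search inside the bounds of the data
res <- optim(
  par    = c(mean(df$X1), mean(df$X2)),
  fn     = neg_pred,
  method = "L-BFGS-B",
  lower  = c(min(df$X1), min(df$X2)),
  upper  = c(max(df$X1), max(df$X2))
)
res$par     # (X1, X2) at the maximum of the fitted surface within the box
-res$value  # fitted value at that point
```

The same pattern extends to more predictors by lengthening par and the newdata data frame, although with hundreds of predictors the choice of starting point and bounds would need more care.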
Related
I have the following data in R and I want to fit the linear regression model y~x1+x2+x3+x4 using weighted least squares, with the sample variance as the basis for the weights.
Schubert et al. (1992) conducted an experiment with a catapult to determine the effects of hook (x1), arm length (x2), start angle (x3), and stop angle (x4) on the distance (y) that the catapult throws a ball. They throw the ball three times for each setting of the factors. The following table summarizes the experimental results, where the factor levels of x1, x2, x3 and x4 are standardized
x1<-c(-1,-1,-1,-1,1,1,1,1)
x2<-c(-1,-1,1,1,-1,-1,1,1)
x3<-c(-1,1,-1,1,-1,1,-1,1)
x4<-c(-1,1,1,-1,1,-1,-1,1)
y<-c(28,46.3,21.9,52.9,75,127.7,86.2,195)
How do I write this code in R? I tried the following, but I do not know how to set the weights in R. This code does not work.
l=lm(y~x1+x2+x3+x4, weights=1/x)
Notwithstanding the lack of clarifications and details around your model and data (see below), I think you're after something like this:
# Store sample data in `data.frame`
df <- data.frame(y, x1, x2, x3, x4)
# First: "Vanilla" OLS estimation
fit_OLS <- lm(y ~ ., data = df)
# Second: Weights
weights <- 1 / fitted(lm(abs(residuals(fit_OLS)) ~ fitted(fit_OLS))) ^ 2
# Third: Weighted least squares estimation
fit_WLS <- lm(y ~ ., data = df, weights = weights)
# Compare coefficients
coef(fit_OLS)
#(Intercept) x1 x2 x3 x4
# 79.125 41.850 9.875 26.350 5.425
coef(fit_WLS)
#(Intercept) x1 x2 x3 x4
# 79.125 41.850 9.875 26.350 5.425
Not wanting to reiterate poorly what others have explained much better, I refer you to this post, detailing the rationale behind calculating the weights.
As you can see, parameter estimates of the OLS and WLS routines are the same, since the estimated weights are identical.
This loops back to my original question (see comments): The values for those x1, x2, x3, x4 predictors seem odd. If they denote hook, arm length, start angle and stop angle of a catapult, why do they only take on values -1 and 1? Did you discretise original data somehow? This is not clear but important for assessing the model fit.
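If the three individual throws per setting were available, weights based on the sample variance could be computed directly. The raw throws are not shown in the question, so the sketch below simulates stand-in replicates around the reported y values purely for illustration. The object reps and the 5% noise level are assumptions, not data from the experiment.

```r
x1 <- c(-1, -1, -1, -1, 1, 1, 1, 1)
x2 <- c(-1, -1, 1, 1, -1, -1, 1, 1)
x3 <- c(-1, 1, -1, 1, -1, 1, -1, 1)
x4 <- c(-1, 1, 1, -1, 1, -1, -1, 1)
y  <- c(28, 46.3, 21.9, 52.9, 75, 127.7, 86.2, 195)

# HYPOTHETICAL replicates: 3 simulated throws per setting (3 x 8 matrix),
# standing in for the real throws, which the question does not list
set.seed(42)
reps <- sapply(y, function(m) rnorm(3, mean = m, sd = 0.05 * m))

ybar <- colMeans(reps)        # mean distance per setting
s2   <- apply(reps, 2, var)   # sample variance per setting
w    <- 1 / s2                # weights inversely proportional to the variance

fit_wls <- lm(ybar ~ x1 + x2 + x3 + x4, weights = w)
coef(fit_wls)
```

With the real data, reps would be replaced by the observed throws; the rest of the code would stay the same.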
I want to get coefficients for a linear model related to synergism/antagonism between different chemicals.
Chemicals X, Y, Z. Coefficients b0...b7.
0 = b0 + b1x + b2y + b3z + b4xy + b5xz + b6yz + b7xyz
Some combination of X, Y, Z will kill 50% of cells (b0 represents this total effectiveness), and the sign/magnitude of the higher order terms represents interactions between chemicals.
Given real datapoints, I want to fit this model.
EDIT: I've gotten rid of the trivial solution by adding a forcing value at the start. Test data:
x1 <- c(0,1,2,3,4)
y1 <- c(0,2,1,5,4)
z1 <- c(0,1,-0.66667,-6,-7.25)
q <- c(-1,0,0,0,0)
model <- lm(q ~ x1*y1*z1)
This set has coefficients: -30, 12, 6, 4, 1, -1, 0, 0.5
EDIT: Progress made from before, will try some more data points. The first four coefficients look good (multiply by 30):
Coefficients:
(Intercept) x1 y1 z1 x1:y1 x1:z1 y1:z1 x1:y1:z1
-1.00000 0.47826 0.24943 0.13730 -0.05721 NA NA NA
EDIT: Adding more data points hasn't been successful so far, not sure if I need to have a certain minimum amount to be accurate.
Am I setting things up correctly? Once I have the coefficients, I want to solve for z so that I can plot a 3D surface. Thanks!
I was able to get the coefficients using just 16 arbitrary data points, and appending a point to exclude the trivial answer:
x1 <- c(0,1,2,3,4,1,2,3,4,1,2,3,4,5,6,7,8)
y1 <- c(0,2,1,5,4,3,7,5,8,6,2,1,5,5,3,5,7)
z1 <- c(0,1,-0.66667,-6,-7.25,-0.66667,-5.55556,-6,-6.125,-4,-2.5,-6,-6.8,-7.3913,-11.1429,-8.2069,-6.83333)
q <- c(-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
model <- lm(q ~ x1*y1*z1)
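For the follow-up step of solving for z, the model equation can be rearranged by collecting the terms that contain z: z = -(b0 + b1x + b2y + b4xy) / (b3 + b5x + b6y + b7xy). The sketch below applies this to the coefficients stated in the question (-30, 12, 6, 4, 1, -1, 0, 0.5) and checks it against four of the test points. The function name solve_z and the grid ranges are made up for the example.

```r
# Coefficients b0..b7 as stated for the test data set
b <- c(b0 = -30, b1 = 12, b2 = 6, b3 = 4, b4 = 1, b5 = -1, b6 = 0, b7 = 0.5)

# Solve 0 = b0 + b1*x + b2*y + b3*z + b4*x*y + b5*x*z + b6*y*z + b7*x*y*z for z
# (solve_z is a hypothetical helper; it is undefined where the denominator is 0)
solve_z <- function(x, y, b) {
  num <- b[["b0"]] + b[["b1"]] * x + b[["b2"]] * y + b[["b4"]] * x * y
  den <- b[["b3"]] + b[["b5"]] * x + b[["b6"]] * y + b[["b7"]] * x * y
  -num / den
}

# Reproduces the z values of the test points (1,2), (2,1), (3,5), (4,4)
z_check <- solve_z(c(1, 2, 3, 4), c(2, 1, 5, 4), b)
z_check

# Evaluate on a grid for a surface plot
xs <- seq(1, 4, by = 0.5)
ys <- seq(1, 8, by = 0.5)
zs <- outer(xs, ys, solve_z, b = b)
# persp(xs, ys, zs, theta = 30, phi = 25)  # uncomment to draw the surface
```

With a fitted model, b would be taken from coef(model) after rescaling so that the intercept matches, as described in the question.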
Ok, so here is the code that demonstrates the problem I am referring to:
library(car) # provides hccm() and linearHypothesis()
x1 <- c(0.001, 0.002, 0.003, 0.0003)
x2 <- c(15000893, 23034340, 3034300, 232332242)
x3 <- c(1,3,5,6)
y <- rnorm( 4 )
model=lm( y ~ x1 + x2 )
model2=lm( y ~ x1 + x3 )
type <- "hc0"
V <- hccm(model, type=type)
sumry <- summary(model)
table <- coef(sumry)
table[,2] <- sqrt(diag(V))
table[,3] <- table[,1]/table[,2]
table[,4] <- 2*pt(abs(table[,3]), df.residual(model), lower.tail=FALSE)
sumry$coefficients <- table
p <- nrow(table)
hyp <- cbind(0, diag(p - 1))
linearHypothesis(model, hyp, white.adjust=type)
Note that this is not caused by perfect multicollinearity.
As you can see, I deliberately set the value of x2 to be very large and the value of x1 to be very small. When this happens, I cannot perform a linearHypothesis test of model=lm( y ~ x1 + x2 ) on all coefficients being 0: linearHypothesis(model, hyp, white.adjust=type). R will throw the following error:
> linearHypothesis(model, hyp, white.adjust=type)
Error in solve.default(vcov.hyp) :
system is computationally singular: reciprocal condition number = 2.31795e-23
However, when I use model2=lm( y ~ x1 + x3 ) instead, whose x3 is not too large compared to x1, the linearHypothesis test succeeds:
> linearHypothesis(model2, hyp, white.adjust=type)
Linear hypothesis test
Hypothesis:
x1 = 0
x3 = 0
Model 1: restricted model
Model 2: y ~ x1 + x3
Note: Coefficient covariance matrix supplied.
Res.Df Df F Pr(>F)
1 3
2 1 2 11.596 0.2033
I am aware that this might be caused by the fact that R refuses to invert a matrix whose reciprocal condition number falls below a certain threshold (here it is 2.31795e-23). However, is there a way to circumvent that? Is this a limitation of R or of the underlying C++?
What is good practice here? The only method I can think of is to rescale the variables so that they are on the same scale. But I am also concerned about the amount of information I will lose by dividing everything by their standard errors.
In fact, I have 200 variables that are percentages, and 10 variables (including dependent variables) that are large (potentially to the 10^6 scale). It might be troubling to scale them one by one.
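On the worry about losing information, a small experiment may help. Rescaling a predictor is a linear change of units, so the slope t statistics should be unchanged while the covariance matrix becomes far better conditioned. The sketch below uses only base R and the ordinary vcov() rather than hccm(), which is a simplification. The names z1, z2, m_raw and m_scaled are made up for the example.

```r
set.seed(123)
x1 <- c(0.001, 0.002, 0.003, 0.0003)
x2 <- c(15000893, 23034340, 3034300, 232332242)
y  <- rnorm(4)

# Standardized copies (mean 0, sd 1) of the predictors
z1 <- as.numeric(scale(x1))
z2 <- as.numeric(scale(x2))

m_raw    <- lm(y ~ x1 + x2)
m_scaled <- lm(y ~ z1 + z2)

# Slope t statistics (and hence p-values) do not change under rescaling
t_raw    <- coef(summary(m_raw))[-1, "t value"]
t_scaled <- coef(summary(m_scaled))[-1, "t value"]
all.equal(unname(t_raw), unname(t_scaled))

# Reciprocal condition number of the slope covariance block,
# i.e. the kind of matrix a joint test has to invert
rc_raw    <- rcond(vcov(m_raw)[-1, -1])
rc_scaled <- rcond(vcov(m_scaled)[-1, -1])
c(raw = rc_raw, scaled = rc_scaled)
```

For many columns at once, scale() also accepts a whole numeric matrix or data frame, so the variables do not have to be rescaled one by one.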
I want to observe the effect of a treatment variable on my outcome Y. I did a multiple regression: fit <- lm (Y ~ x1 + x2 + x3). x1 is the treatment variable and x2, x3 are the control variables. I used the predict function holding x2 and x3 to their means. I plotted this predict function.
Now I would like to add a line to my plot similar to a simple regression abline but I do not know how to do this.
I think I have to use line(x,y) where y = predict and x is a sequence of values for my variable x1. But R tells me the lengths of y and x differ.
I think you are looking for termplot:
## simulate some data
set.seed(0)
x1 <- runif(100)
x2 <- runif(100)
x3 <- runif(100)
y <- cbind(1,x1,x2,x3) %*% runif(4) + rnorm(100, sd = 0.1)
## fit a model
fit <- lm(y ~ x1 + x2 + x3)
termplot(fit, se = TRUE, terms = "x1")
termplot uses predict.lm(, type = "terms") for term-wise prediction. If a model has an intercept (like above), predict.lm will centre each term (What does predict.glm(, type=“terms”) actually do?). In this way, each term is predicted to be 0 at the mean of the covariate, and the standard error at the mean is 0 (hence the confidence interval intersects the line at the mean).
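If the goal is a line on the original scale of y rather than a centred term, an alternative is to predict on a grid of x1 values with x2 and x3 held at their means. The "lengths differ" error typically arises when predict() is called without newdata, which returns one value per original observation rather than one per grid point. This is a sketch on the simulated data above; the data frame name grid is made up for the example.

```r
set.seed(0)
x1 <- runif(100)
x2 <- runif(100)
x3 <- runif(100)
y  <- as.numeric(cbind(1, x1, x2, x3) %*% runif(4) + rnorm(100, sd = 0.1))
fit <- lm(y ~ x1 + x2 + x3)

# Grid over x1, with the control variables fixed at their sample means
grid <- data.frame(
  x1 = seq(min(x1), max(x1), length.out = 50),
  x2 = mean(x2),
  x3 = mean(x3)
)
grid$pred <- predict(fit, newdata = grid)  # one prediction per grid row

# Scatter plot with the adjusted regression line on top
plot(x1, y)
lines(grid$x1, grid$pred, col = "red", lwd = 2)
```

Because x and y both come from grid, they have the same length by construction.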
I have a data set that I am exploring using multiple regression in R. My model is as follows:
model<-lm(Trait~Noise+PC1+PC2)
where Noise, PC1, and PC2 are continuous covariates that predict a particular Trait that is also continuous.
The summary(model) call shows that both Noise and PC1 significantly affect changes in Trait, just in opposite ways. Trait increases as 'Noise' increases, but decreases as PC1 increases.
To tease apart this relationship, I want to create simulated data sets based on the sample size (45) of my original data set and by manipulating Noise and PC1 within the parameters seen in my data set, so: high levels of both, low levels of both, high of one and low of the other, etc...
Can someone offer up some advice on how to do this? I am not overly familiar with R, so I apologize if this question is overly simple.
Thank you for your time.
It's a bit unclear what you're looking for (this should probably be on Cross Validated), but here's a start and an approximate description of linear regression.
Let's say I have some datapoints that are 3 dimensional (Noise, PC1, PC2), and you say there's 45 of them.
x=data.frame(matrix(rnorm(3*45),ncol=3))
names(x)<-c('Noise','PC1','PC2')
These data are randomly distributed around this 3 dimensional space. Now we imagine there's another variable that we're particularly interested in called Trait. We think that the variations in each of Noise, PC1, and PC2 can explain some of the variation observed in Trait. In particular, we think that each of those variables is linearly proportional to Trait, so it's just the basic old y=mx+b linear relationship you've seen before, but there's a different slope m for each of the variables. So in total we imagine Trait = m1*Noise + m2*PC1 + m3*PC2 +b plus some added noise (it's a shame one of your variables is named Noise, that's confusing).
So going back to simulating some data, we'll just pick some values for these slopes and put them in a vector called beta.
beta<-c(-3,3,.1) # these are the regression coefficients
So the model Trait = m1 Noise + m2 PC1 + m3 PC2 +b might also be expressed with simple matrix multiplication, and we can do it in R with,
trait<- as.matrix(x)%*%beta + rnorm(nrow(x),0,1)
where we've added Gaussian noise of standard deviation equal to 1.
So this is the 'simulated data' underlying a linear regression model. Just as a sanity check, let's try
l<-lm(trait~Noise+PC1+PC2,data=x)
summary(l)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.13876 0.11159 1.243 0.221
Noise -3.08264 0.12441 -24.779 <2e-16 ***
PC1 2.94918 0.11746 25.108 <2e-16 ***
PC2 -0.01098 0.10005 -0.110 0.913
So notice that the slope we picked for PC2 was so small (0.1) relative to the overall variability in the data that it isn't detected as a statistically significant predictor. And the other two variables have opposite effects on Trait. So in simulating data, you might adjust the observed ranges of the variables, as well as the magnitudes of the regression coefficients beta.
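To look at the "high/low" combinations mentioned in the question, one possibility is to predict from the fitted model at the corners of a small grid and then draw simulated samples of size 45 around each prediction. The sketch below assumes that "low" and "high" mean the 10th and 90th percentiles of the observed values, which is an arbitrary choice. The names scenarios and sims are made up for the example.

```r
set.seed(1)
x <- data.frame(matrix(rnorm(3 * 45), ncol = 3))
names(x) <- c("Noise", "PC1", "PC2")
beta <- c(-3, 3, 0.1)  # regression coefficients, as above
x$trait <- as.numeric(as.matrix(x) %*% beta + rnorm(nrow(x), 0, 1))
l <- lm(trait ~ Noise + PC1 + PC2, data = x)

# Low/high levels of Noise and PC1 (10th and 90th percentiles), PC2 at its mean
scenarios <- expand.grid(
  Noise = unname(quantile(x$Noise, c(0.1, 0.9))),
  PC1   = unname(quantile(x$PC1, c(0.1, 0.9))),
  PC2   = mean(x$PC2)
)
scenarios$pred <- predict(l, newdata = scenarios)
scenarios

# 45 simulated Trait values per scenario, using the residual sd of the fit
sims <- sapply(scenarios$pred, function(m) rnorm(45, mean = m, sd = sigma(l)))
colMeans(sims)
```

Each column of sims corresponds to one row of scenarios, so the four combinations can be compared side by side, for example with boxplot(sims).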
Here is a simple simulation and then fitting. I am not sure whether this answers your question, but it's a very simple way to simulate.
# create a random matrix X
N <- 500 # obs = 500
p <- 20 # 20 predictors
X <- matrix(rnorm(N*p), ncol = p) # design matrix
X.scaled <- scale(X) # scale the columns to make mean 0 and variance 1
X <- cbind(matrix(1, nrow = N), X.scaled) # add intercept
# create coeff matrix
b <- matrix(0, nrow = p+1)
b[1, ] <- 5 # intercept
b[2:6, ] <- 3 # first 5 predictors are 3
b[7:11, ] <- -3 # next 5 predictors are -3
# create noise
eps <- matrix(rnorm(N), nrow = N)
# generate the response
y = X%*%b + eps # response vector
#--------------------------------------------
# fit the model
X <- X[, -1] # remove the column of ones before fitting
colnames(X) <- paste0("x", seq_len(p)) # name the columns
colnames(y) <- "y" # name the response
data <- data.frame(cbind(y, X)) # make a dataframe
lm_res <- lm(y~., data) # fit with lm()
# the output
coef(lm_res)
# (Intercept) x1 x2 x3 x4 x5
# 4.982574286 2.917753373 3.021987926 3.067855616 3.135165773 2.997906784
# x6 x7 x8 x9 x10 x11
#-2.997272333 -2.927680633 -2.944796765 -3.070785884 -2.910920487 -0.051975284
# x12 x13 x14 x15 x16 x17
# 0.085147066 -0.040739293 0.054283243 0.009348675 -0.021794971 0.005577802
# x18 x19 x20
# 0.079043493 -0.024066912 -0.007653293
#
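As a quick sanity check on a simulation like this, the estimates can be lined up against the true coefficients together with their confidence intervals. The sketch below rebuilds a compact version of the simulation with a fixed seed. The names dat and comparison, and the seed value, are made up for the example.

```r
set.seed(2)
N <- 500  # observations
p <- 20   # predictors
X <- scale(matrix(rnorm(N * p), ncol = p))
colnames(X) <- paste0("x", seq_len(p))

# True coefficients: intercept 5, five slopes of 3, five of -3, ten zeros
b <- c(5, rep(3, 5), rep(-3, 5), rep(0, 10))
y <- as.numeric(cbind(1, X) %*% b + rnorm(N))

dat <- data.frame(y = y, X)
lm_res <- lm(y ~ ., data = dat)

# Side-by-side table of truth, estimate and 95% confidence interval
ci <- confint(lm_res)
comparison <- data.frame(
  true     = b,
  estimate = unname(coef(lm_res)),
  lower    = ci[, 1],
  upper    = ci[, 2]
)
comparison$covered <- comparison$true >= comparison$lower &
  comparison$true <= comparison$upper
comparison
```

With 95% intervals, roughly one in twenty true values can be expected to fall outside its interval by chance, so an occasional FALSE in the covered column is not a sign of a problem.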