3-variable linear model in R

I want to get coefficients for a linear model related to synergism/antagonism between different chemicals.
Chemicals X, Y, Z. Coefficients b0...b7.
0 = b0 + b1x + b2y + b3z + b4xy + b5xz + b6yz + b7xyz
Some combination of X, Y, Z will kill 50% of cells (b0 represents this total effectiveness), and the sign/magnitude of the higher order terms represents interactions between chemicals.
Given real datapoints, I want to fit this model.
EDIT: I've gotten rid of the trivial solution by adding a forcing value at the start. Test data:
x1 <- c(0,1,2,3,4)
y1 <- c(0,2,1,5,4)
z1 <- c(0,1,-0.66667,-6,-7.25)
q <- c(-1,0,0,0,0)
model <- lm(q ~ x1*y1*z1)
This set has coefficients: -30, 12, 6, 4, 1, -1, 0, 0.5
EDIT: Progress made from before, will try some more data points. The first four coefficients look good (multiply by 30):
Coefficients:
(Intercept) x1 y1 z1 x1:y1 x1:z1 y1:z1 x1:y1:z1
-1.00000 0.47826 0.24943 0.13730 -0.05721 NA NA NA
EDIT: Adding more data points hasn't been successful so far; I'm not sure whether I need a certain minimum number of points to be accurate.
Am I setting things up correctly? Once I have the coefficients, I want to solve for z so that I can plot a 3D surface. Thanks!

I was able to get the coefficients using just 16 arbitrary data points, after adding a point to exclude the trivial answer:
x1 <- c(0,1,2,3,4,1,2,3,4,1,2,3,4,5,6,7,8)
y1 <- c(0,2,1,5,4,3,7,5,8,6,2,1,5,5,3,5,7)
z1 <- c(0,1,-0.66667,-6,-7.25,-0.66667,-5.55556,-6,-6.125,-4,-2.5,-6,-6.8,-7.3913,-11.1429,-8.2069,-6.83333)
q <- c(-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
model <- lm(q ~ x1*y1*z1)
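To solve for z once the coefficients are available (as asked above), one possibility is the following sketch, which assumes the full model above fits without NA coefficients. The equation 0 = b0 + b1x + b2y + b3z + b4xy + b5xz + b6yz + b7xyz is linear in z, so z = -(b0 + b1x + b2y + b4xy) / (b3 + b5x + b6y + b7xy), and the surface can be drawn with persp():
b <- coef(model)  # order: (Intercept), x1, y1, z1, x1:y1, x1:z1, y1:z1, x1:y1:z1
zsurf <- function(x, y) {
  num <- b[1] + b[2] * x + b[3] * y + b[5] * x * y   # terms not involving z
  den <- b[4] + b[6] * x + b[7] * y + b[8] * x * y   # terms multiplying z
  -num / den
}
xg <- seq(0, 8, length.out = 30)
yg <- seq(0, 8, length.out = 30)
zg <- outer(xg, yg, zsurf)
persp(xg, yg, zg, xlab = "x", ylab = "y", zlab = "z", theta = 30, phi = 20)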

Related

Nelson Aalen estimate in mice

So I am dealing with the imputation of a data set containing time-to-event data. Several papers suggest that using the Nelson-Aalen estimate (as an approximation to the baseline hazard) will give better imputation results. Is there a way to find the Nelson-Aalen estimate and bind it to my data set in R? I have found a function named nelsonaalen(data, time, event) in the mice package, but I am afraid it will cause an error since I can only include one time variable (failure time) in it. The variables in my data are as follows:
N = 1000
xt=runif(N, 0, 50)
x1=rnorm(N, 2, 1)
x2=rnorm(N, -2, 1)
x3 <- rnorm(N, 0.5*x1 + 0.5*x2, 2)
x4 <- rnorm(N, 0.3333*x1 + 0.3333*x2 + 0.3333*x3, 2 )
lp <- 0.05*x1 + 0.2*x2 + 0.1*x3 + 0.02*x4
T <- qweibull(runif(N,pweibull(xt,shape = 7.5, scale = 84*exp(-lp/7.5)),1), shape=7.5, scale=84*exp(-lp/7.5))
Cens1 <- 100
time_M <- pmin(T,Cens1)
event_M <- time_M == T   # event indicator: TRUE when the failure time was observed (not censored)
Here xt denotes the starting time, T denotes the failure time, and x1 to x4 are my covariates; I'll create missing values in two of them (x3 and x4).
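A minimal sketch of how nelsonaalen() could be applied to the simulated data above, assuming that the single observed time time_M and the event indicator event_M are enough (i.e. ignoring the start time xt); nelsonaalen() takes the data frame and the (unquoted) names of the time and status columns, and the resulting cumulative-hazard estimate can be bound to the data before imputation:
library(mice)
library(survival)   # nelsonaalen() in mice relies on the survival package

dat <- data.frame(time = time_M, event = as.integer(event_M), x1, x2, x3, x4)
dat$na_hat <- nelsonaalen(dat, time, event)   # Nelson-Aalen cumulative hazard at each subject's time

# after introducing missingness in x3 and x4, impute with na_hat and event in the predictor set, e.g.
# imp <- mice(dat[, c("na_hat", "event", "x1", "x2", "x3", "x4")], m = 5)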

Is it possible to fit specific slopes to best fit segments of data in R?

Background: I am analyzing oil production data where I plot daily oil rate on the y-axis and a diagnostic "time" factor on the x-axis. This combination tends to exhibit a certain trend depending on the flow regime, where there is typically a half slope or quarter slope followed by a unit slope. It is very basic, but the approach is archaic and everything is done manually.
I was wondering whether there is a way in R to find the segment of the data that best fits a specific slope and to fit the associated line over that segment, perhaps up to an R^2 criterion, on a log-log plot. Also, is there a way to get the point where that slope changes?
[image: example of what the raw data looks like]
[image: example of desired end result]
What about using a scatterplot?
scatter.smooth(x=data$x, y=data$y, main="y ~ x") # scatterplot
In the future please provide your data in reproducible form so we can work with it. This time I have provided some sample data in the Note at the end.
Let kvalues be the possible indexes of x of the change point. We do not include ones near the ends to avoid numeric problems. Then for each kvalue we perform the regression defined in the regr function and compute the residual sum of squares using deviance. Take the least of those and display that regression. No packages are used.
(If you want to fix the slopes then remove the slope parameters from the formula and starting values and replace them with the fixed values in the formula; a sketch of this appears after the Note below.)
kvalues <- 5:45
st <- list(a1 = 1, b1 = 1, a2 = 2, b2 = 2)
regr <- function(k) try(nls(y ~ ifelse(x < k, a1 + b1 * x, a2 + b2 * x), start = st))
i <- which.min(sapply(kvalues, function(k) deviance(regr(k))))
k <- kvalues[i]
k; x[k]
## [1] 26
## [1] 26
fm <- regr(k)
fm
## Nonlinear regression model
## model: y ~ ifelse(x < k, a1 + b1 * x, a2 + b2 * x)
## data: parent.frame()
## a1 b1 a2 b2
## 1.507 -1.042 1.173 -2.002
## residual sum-of-squares: 39.52
##
## Number of iterations to convergence: 1
## Achieved convergence tolerance: 2.917e-09
plot(y ~ x)
lines(fitted(fm) ~ x)
abline(v = x[k])
Note
set.seed(123)
x <- 1:50
y <- 1 - rep(1:2, each = 25) * x + rnorm(50)
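A sketch of the fixed-slope variant mentioned above, using the slopes -1 and -2 that generated the data in the Note (for the log-log case in the question you would substitute the half/quarter/unit slopes); only the intercepts a1 and a2 and the break point are estimated:
st2 <- list(a1 = 1, a2 = 2)
regr2 <- function(k) try(nls(y ~ ifelse(x < k, a1 - 1 * x, a2 - 2 * x), start = st2))
i2 <- which.min(sapply(kvalues, function(k) deviance(regr2(k))))
fm2 <- regr2(kvalues[i2])
coef(fm2)          # estimated intercepts of the two segments
x[kvalues[i2]]     # estimated change point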

Constrained weighted linear regression in R

I am trying to set up a constrained weighted linear regression. That is to say, I have a dataset of observations indexed by i and three different x variables. Each observation has a weight. I want to perform a weighted multiple linear regression using the restrictions that the weighted mean of each x variable has to be zero and the weighted standard deviation should be one.
Since I am new and have no reputation yet, I can't post images with LaTeX formulas, so I have to write them down this way.
First restriction $\sum_{i} w_{i} X_{i,k} = 0$ for k = 1,2,3.
Second one: $\sum_{i} w_{i} X_{i,k}^2 = 1$ for k = 1,2,3.
This is an example dataset:
y <- rnorm(10)
w <- rep(0.1, 10)
x1 <- rnorm(10)
x2 <- rnorm(10)
x3 <- rnorm(10)
data <- cbind(y, x1, x2, x3, w)
lm(y ~ x1 + x2 + x3, data = data, weights = data$w)
The weights do not have to be equal for each observation but have to add up to one.
I would like to include these restrictions into the regression. Is there a way to do that?
You could perhaps use the Generalised Linear Model:
glm(y ~ x1 + x2 + x3, weights = w, data=data)
Data needs to be a data.frame(...).
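If the restrictions are meant as conditions on the predictors themselves (weighted mean zero, weighted second moment one for each x), they can be imposed by standardizing each x column with its weighted statistics before the weighted fit. A minimal sketch using the example data above, assuming weights that sum to one as stated:
# weighted standardization so that sum(w * x) = 0 and sum(w * x^2) = 1
wstd <- function(x, w) {
  x <- x - sum(w * x)          # weighted mean zero (w sums to one)
  x / sqrt(sum(w * x^2))       # weighted second moment one
}
d <- data.frame(y, w, x1 = wstd(x1, w), x2 = wstd(x2, w), x3 = wstd(x3, w))
fit <- lm(y ~ x1 + x2 + x3, data = d, weights = w)
summary(fit)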

Linear regression with constraints on the coefficients

I am trying to perform linear regression, for a model like this:
Y = aX1 + bX2 + c
So, Y ~ X1 + X2
Suppose I have the following response vector:
set.seed(1)
Y <- runif(100, -1.0, 1.0)
And the following matrix of predictors:
X1 <- runif(100, 0.4, 1.0)
X2 <- sample(rep(0:1,each=50))
X <- cbind(X1, X2)
I want to use the following constraints on the coefficients:
a + c >= 0
c >= 0
So no constraint on b.
I know that the glmc package can be used to apply constraints, but I was not able to determine how to apply it for my constraints. I also know that contr.sum can be used so that all coefficients sum to 0, for example, but that is not what I want to do. solve.QP() seems like another possibility, where setting meq=0 can be used so that all coefficients are >=0 (again, not my goal here).
Note: The solution must be able to handle NA values in the response vector Y, for example with:
Y <- runif(100, -1.0, 1.0)
Y[c(2,5,17,56,37,56,34,78)] <- NA
solve.QP can be passed arbitrary linear constraints, so it can certainly be used to model your constraints a+c >= 0 and c >= 0.
First, we can add a column of 1's to X to capture the intercept term, and then we can replicate standard linear regression with solve.QP:
X2 <- cbind(X, 1)
library(quadprog)
solve.QP(t(X2) %*% X2, t(Y) %*% X2, matrix(0, 3, 0), c())$solution
# [1] 0.08614041 0.21433372 -0.13267403
With the sample data from the question, neither constraint is met using standard linear regression.
By modifying both the Amat and bvec parameters, we can add our two constraints:
solve.QP(t(X2) %*% X2, t(Y) %*% X2, cbind(c(1, 0, 1), c(0, 0, 1)), c(0, 0))$solution
# [1] 0.0000000 0.1422207 0.0000000
Subject to these constraints, the squared residuals are minimized by setting the a and c coefficients to both equal 0.
You can handle missing values in Y or X2 just as the lm function does, by removing the offending observations. You might do something like the following as a pre-processing step:
has.missing <- rowSums(is.na(cbind(Y, X2))) > 0
Y <- Y[!has.missing]
X2 <- X2[!has.missing,]
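Putting the pieces together, after this pre-processing step the constrained fit from above can simply be re-run on the filtered Y and X2 (a sketch reusing the objects defined in this answer):
Amat <- cbind(c(1, 0, 1), c(0, 0, 1))   # one column per inequality: a + c >= 0, c >= 0
bvec <- c(0, 0)
solve.QP(t(X2) %*% X2, t(Y) %*% X2, Amat, bvec)$solution   # order: a (X1), b (X2), c (intercept)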

Optimal predictor value for multivariate regression in R

Suppose I have 1 response variable Y and 2 predictors X1 and X2, such as the following
Y X1 X2
2.3 1.1 1.2
2.5 1.24 1.17
......
Assuming I have a strong belief the following model works well
fit <- lm(Y ~ poly(X1,2) + X2)
in other words, there is a quadratic relation between Y and X1, a linear relationship between Y and X2.
Now here are my questions:
How do I find the optimal value of (x1, x2) such that the fitted model reaches its maximal value at this pair of values?
Now assuming X2 has to be fixed at some particular value, how do I find the optimal x1 such that the fitted value is maximized?
So here is an empirical way to do this:
# create some random data...
set.seed(1)
X1 <- 1:100
X2 <- sin(2*pi/100*(1:100))
df <- data.frame(Y=3 + 5*X1 -0.2 * X1^2 + 100*X2 + rnorm(100,0,5),X1,X2)
fit <- lm(Y ~ poly(X1,2,raw=T) + X2, data=df)
# X1 and X2 unconstrained
df$pred <- predict(fit)
result <- with(df,df[pred==max(pred),])
result
# Y X1 X2 pred
# 19 122.8838 19 0.9297765 119.2087
# max(Y|X2=0)
newdf <- data.frame(Y=df$Y, X1=df$X1, X2=0)
newdf$pred2 <- predict(fit,newdata=newdf)
result2 <- with(newdf,newdf[pred2==max(pred2),])
result2
# Y X1 X2 pred2
#12 104.6039 12 0 35.09141
So in this example, when X1 and X2 are unconstrained, the maximum fitted value of Y = 119.2 and occurs at (X1,X2) = (19, 0.930). When X2 is constrained to 0, the maximum fitted value of Y = 35.1 and occurs at (X1,X2) = (12, 0).
There are a couple of things to consider:
These are global maxima in the space of your data. In other words if your real data has a large number of variables there might be local maxima that you will not find this way.
This method has resolution only as great as your dataset. So if the true maximum occurs at a point between your data points, you will not find it this way.
This technique is restricted to the bounds of your dataset. So if the true maximum is outside those bounds, you will not find it. On the other hand, using a model outside the bounds of your data is, IMHO, the definition of reckless.
Finally, you should be aware that poly(...) produces orthogonal polynomials, which will generate a fit, but the coefficients will be very difficult to interpret. If you really want a quadratic fit, e.g. a + b*x + c*x^2, you are better off doing that explicitly with Y ~ X1 + I(X1^2) + X2, or using raw=T in the call to poly(...).
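For instance, a small sketch with the df from above, where the raw parameterization gives coefficients that can be read directly as the a, b, c of the quadratic (plus the X2 term):
fit_raw <- lm(Y ~ X1 + I(X1^2) + X2, data = df)
coef(fit_raw)   # (Intercept), X1, I(X1^2), X2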
credit to #sashkello
Basically, I have to extract the coefficients from the lm object and multiply them with the corresponding terms to form the formula before I can proceed.
I think this is not very efficient. What if this is a regression with hundreds of predictors?
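If the goal is to avoid writing the formula out by hand, one option (a sketch reusing fit and df from the answer above) is to let predict() evaluate the fitted surface and hand that to a numerical optimizer; this scales to many predictors and respects the bounds of the data:
# X1 and X2 both free, restricted to the observed ranges
pred_at <- function(p) -predict(fit, newdata = data.frame(X1 = p[1], X2 = p[2]))  # negate: optim minimizes
opt <- optim(c(X1 = 50, X2 = 0), pred_at, method = "L-BFGS-B",
             lower = c(min(df$X1), min(df$X2)),
             upper = c(max(df$X1), max(df$X2)))
opt$par      # (X1, X2) at the maximum of the fitted surface
-opt$value   # maximum fitted value

# X2 fixed at 0: one-dimensional search over X1
opt1 <- optimize(function(x1) predict(fit, data.frame(X1 = x1, X2 = 0)),
                 interval = range(df$X1), maximum = TRUE)
opt1$maximum   # optimal X1 when X2 = 0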
