Use of offset in lm regression - R

I have this code
dens <- read.table('DensPiu.csv', header = FALSE)
fl <- read.table('FluxPiu.csv', header = FALSE)
mydata <- data.frame(c(dens),c(fl))
dat = subset(mydata, dens>=3.15)
colnames(dat) <- c("x", "y")
attach(dat)
and I would like to do a least-squares regression on the data contained in dat. The model has the form
y ~ a + b*x
and I want the regression line to pass through a specific point P(x0,y0) (which is not the origin).
I'm trying to do it like this
x0 <- 3.15
y0 <- 283.56
regression <- lm(y ~ I(x-x0)-1, offset=y0)
(I think that data = dat is not necessary in this case), but I get this error:
Error in model.frame.default(formula = y ~ I(x - x0) - 1, : variable
lengths differ (found for '(offset)').
I don't know why. I guess I haven't defined the offset value correctly, but I couldn't find any example online.
Could someone explain to me how offset works, please?

Your offset term has to be a variable with one value per observation, like x and y, not a single numeric constant. So you need to create a column in your dataset with the appropriate values.
dat$o <- 283.56
lm(y ~ I(x - x0) - 1, data=dat, offset=o)

In fact, the real issue is that offset must be a vector whose length equals the number of observations (the number of rows of your data, or the length of the response if the data are plain vectors). The following code will do the job as expected:
regression <- lm(y ~ I(x-x0)-1, offset = rep(y0, length(y)))
Here is a good explanation for those who are interested:
http://rfunction.com/archives/223
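If you want to see the whole mechanism end to end, here is a minimal, self-contained sketch with simulated data (only x0 = 3.15 and y0 = 283.56 come from the question; the data are made up). The offset is a term whose coefficient is fixed at 1, so lm() effectively regresses y - y0 on (x - x0) without an intercept, which forces the fitted line through P(x0, y0):
set.seed(1)
x0 <- 3.15
y0 <- 283.56

# simulated data scattered around a line through P(x0, y0)
dat <- data.frame(x = seq(3.2, 10, length.out = 50))
dat$y <- y0 + 12 * (dat$x - x0) + rnorm(50)

# re-centre x at x0, drop the intercept, and supply y0 as a constant offset
fit <- lm(y ~ I(x - x0) - 1, data = dat, offset = rep(y0, nrow(dat)))

b <- unname(coef(fit))  # the single estimated slope
# the fitted relationship is y = y0 + b * (x - x0), so it passes through P by construction
all.equal(unname(fitted(fit)), y0 + b * (dat$x - x0))  # should be TRUE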

Related

Iteratively adding variables to an lm() function in R?

Simple question, but I'm finding myself boggled.
I'm looking to make a loop that will continuously add variables to the IV (independent variable) part of an lm() call. I would test the result of each lm until a condition is met. I'm just having trouble finding a way to dynamically add variables to the IV part of the regression, one at a time.
The 1st iteration would look like:
lm(Y ~ X, data = data)
The second iteration like:
lm(Y ~ X + X2, data = data)
The third iteration like:
lm(Y ~ X + X2 + X3, data = data)
And so on...
If any of you could point me in the right direction, I'd appreciate it very much.
Thanks!
An alternative way is to use Y ~ . as the formula and provide the subset of data as required. Here, . means "all columns not otherwise in the formula" (see ?formula). Using mtcars as an example:
Y <- 'mpg'
Xs <- names(mtcars)[-1]
fits <- lapply(seq_along(Xs), function(x){
    lm(paste(Y, '~ .'), data = mtcars[, c(Y, Xs[1:x])])
})
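Since the goal is to test each fit against a condition, you can then pull a summary statistic out of every element of fits; adjusted R-squared is used here purely as a placeholder criterion:
# one adjusted R-squared per successively larger model
adj_r2 <- sapply(fits, function(m) summary(m)$adj.r.squared)
adj_r2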
We can use reformulate to create the formula after passing the independent variables as a list
out <- lapply(list("X", c("X", "X2"), c("X", "X2", "X3")),
              function(x) lm(reformulate(x, response = "Y"), data = data))
Or make it automated
Xs <- setdiff(names(data), "Y")
ind <- sequence(seq_along(Xs))
lapply(split(Xs[ind], cumsum(ind == 1)), function(x)
    lm(reformulate(x, response = "Y"), data = data))
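If you would rather stop as soon as the condition is met instead of fitting every model, a plain loop over reformulate works too. This is only a sketch: the adjusted R-squared threshold of 0.9 is an arbitrary placeholder for whatever test you actually have in mind:
Xs <- setdiff(names(data), "Y")
fit <- NULL
for (k in seq_along(Xs)) {
  fit <- lm(reformulate(Xs[1:k], response = "Y"), data = data)
  if (summary(fit)$adj.r.squared > 0.9) break  # placeholder stopping rule
}
fit  # first model meeting the condition, or the full model if none did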

How to predict gam model with random effect in R?

I am trying to predict from a gam model with a random effect in order to produce a 3D surface plot with plot_ly.
Here is my code:
library(mgcv)  # for gam()

x <- runif(100)
y <- runif(100)
z <- x^2 + y + rnorm(100)
r <- rep(1, times = 100) # random effect
r[51:100] <- 2 # replace 1 with 2, making two groups
df <- data.frame(x, y, z, r)
gam_fit <- gam(z ~ s(x) + s(y) + s(r,bs="re"), data = df) # fit
#create matrix data for `add_surface` function in `plot_ly`
newx <- seq(0, 1, len=20)
newy <- seq(0, 1, len=30)
newxy <- expand.grid(x = newx, y = newy)
z <- matrix(predict(gam_fit, newdata = newxy), 20, 30) # predict data as matrix
However, the last line results in an error:
Error in model.frame.default(ff, data = newdata, na.action = na.act) :
variable lengths differ (found for 'r')
In addition: Warning message:
In predict.gam(gam_fit, newdata = newxy) :
not all required variables have been supplied in newdata!
Thanks to a previous answer, I am sure that the above code works without the random effect.
How can I predict from gam models that include a random effect?
Assuming you want the surface conditional upon the random effects (but not for a specific level of the random effect), there are two ways.
The first is to provide a level for the random effect but exclude that term from the predicted values using the exclude argument to predict.gam(). The second is to again use exclude but this time to not provide any data for the random effect and instead stop predict.gam() from checking the newdata using the argument newdata.guaranteed = TRUE.
Option 1:
newxy1 <- with(df, expand.grid(x = newx, y = newy, r = 2))
z1 <- predict(gam_fit, newdata = newxy1, exclude = 's(r)')
z1 <- matrix(z1, 20, 30)
Option 2:
z2 <- predict(gam_fit, newdata = newxy, exclude = 's(r)',
              newdata.guaranteed = TRUE)
z2 <- matrix(z2, 20, 30)
These produce the same result:
> all.equal(z1, z2)
[1] TRUE
A couple of notes:
Which you use will depend on how complex the rest of your model is. I would generally use the first option as it provides an extra check against me doing something stupid when creating the data. But in this instance, with a simple model and set of covariates, it seems safe enough to trust that newdata is OK.
Your example uses a random slope (was that intended?), not a random intercept, as r is not a factor. If your real example uses a factor random effect then you'll need to be a little more careful when creating the newdata, as you need to get the levels of the factor right. For example:
expand.grid(x = newx, y = newy,
            r = with(df, factor(2, levels = levels(r))))
should get the right set-up for a factor r.
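To close the loop on the original goal (the 3D surface in plot_ly), here is a minimal sketch. One assumption to flag: surface traces expect the z matrix with rows running along y and columns along x, hence the transpose of the 20 x 30 prediction matrix; swap it if your axes come out reversed.
library(plotly)

plot_ly(x = newx, y = newy, z = t(z1)) %>%
  add_surface()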

force given coefficients in lm()

I am currently trying to fit a polynomial model to measurement data using lm().
fit_poly4 <- lm(y ~ poly(x, degree = 4, raw = T), weights = w)
with x as the independent variable, y as the dependent variable, and w = 1/variance of the measurements.
I want to try a polynomial with given coefficients instead of the ones determined by R. Specifically I want my polynomial to be
y = -3.3583*x^4 + 43*x^3 - 191.14*x^2 + 328.2*x - 137.7
I tried to enter it as
fit_poly4 <- lm(y ~ 328.2*x - 191.14*I(x^2) + 43*I(x^3) - 3.3583*I(x^4) - 137.3,
                weights = w)
but this just returns an error:
Error in terms.formula(formula, data = data) : invalid model formula in ExtractVars
Is there a way to specify the coefficients myself in lm(), and how would one do this?
I'm not sure why you want to do this, but you can use an offset term:
set.seed(101)
dd <- data.frame(x=rnorm(1000),y=rnorm(1000), w = rlnorm(1000))
fit_poly4 <- lm(y ~ -1 +
                  offset(328.2*x - 191.14*I(x^2) + 43*I(x^3) - 3.3583*I(x^4) - 137.3),
                data = dd,
                weights = w)
The -1 suppresses the usual intercept term.
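With only an offset and no intercept there is nothing left to estimate, so a quick sanity check on the simulated dd above is that coef() comes back empty and the fitted values reproduce the fixed polynomial exactly:
coef(fit_poly4)  # numeric(0): no coefficients were estimated
all.equal(unname(fitted(fit_poly4)),
          with(dd, 328.2*x - 191.14*x^2 + 43*x^3 - 3.3583*x^4 - 137.3))  # should be TRUE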

How to correctly pass formulas associated with a variable name with random effects into fitted regression models in `R`?

I currently have a problem in that I have to pre-specify my formulas before sending them into a regression function. For example, using the stan_gamm4 function from the rstanarm package, we have the following example:
library(rstanarm)  # provides stan_gamm4()

dat <- mgcv::gamSim(1, n = 400, scale = 2) ## simulate 4 term additive truth
## Now add 20 level random effect `fac'...
dat$fac <- fac <- as.factor(sample(1:20, 400, replace = TRUE))
dat$y <- dat$y + model.matrix(~ fac - 1) %*% rnorm(20) * .5
br <- stan_gamm4(y ~ s(x0) + x1 + s(x2), data = dat, random = ~ (1 | fac),
                 chains = 1, iter = 200) # for example speed
Now, because the formula and random formula were specified explicitly, then if we call:
br$call$random
#> ~(1 | fac)
We are able to retrieve the form of the random effects.
NOW, let us then leave everything the same, BUT use an expression for the random part:
formula.rand <- as.formula( '~(1|fac)' )
Then, if we do the same thing as before, but with formula.rand taking its place, we have:
br <- stan_gamm4(y ~ s(x0) + x1 + s(x2), data = dat, random = formula.rand,
                 chains = 1, iter = 200) # for example speed
BUT NOW we have:
br$call$random
#> formula.rand
Instead of the original. A lot of Bayesian packages rely on accessing br$call$random, so is there a way to use a variable for the formula, pass it in, AND retain the original relation when calling br$call$random? Thanks.
While I haven't used Stan, this is a problem inherent in the way that R handles storing calls. You can see it happening with lm, for example:
model <- function(formula)
{
    lm(formula, data = mtcars)
}
m <- model(mpg ~ disp)
m$call$formula
# formula
The simplest solution is to construct the call using substitute to insert the actual values you want to keep, not the symbol name. In the case of lm, this would be something like
model2 <- function(formula)
{
    call <- substitute(lm(formula = .f, data = mtcars), list(.f = formula))
    eval(call)
}
m2 <- model2(mpg ~ disp)
m2$call$formula
# mpg ~ disp
For Stan, you can do
stan_call <- substitute(br <- stan_gamm4(y ~ s(x0) + x1 + s(x2), data = dat, random = .rf,
                                         chains = 1, iter = 200),
                        list(.rf = formula.rand))
br <- eval(stan_call)
If I understand correctly, your problem is not that stan_gamm4 could be computing incorrect results (which is not the case, from what I gather), but only that br$call$random refers to the variable name and not the formula. This seems to be problematic for further post-processing of the model.
Since stan_gamm4 uses match.call inside to find the call, I don't know of a way to specify the model differently to obtain a "correct" br$call$random up front. But you can simply modify it after the fact via:
br <- stan_gamm4(y ~ s(x0) + x1 + s(x2), data = dat, random = formula.rand)
br$call$random <- formula.rand
br$call$random
#> ~(1 | fac)
and then continue with whatever you are doing.
IMHO, this is not a problem with stan_gamm4. In your second example, if you then do
class(br$call$random)
you will see that it is of class "name". So, it is not as if $call is just some list with stuff in it. In order to access it programmatically in general, you need to evaluate it with
eval(br$call$random)
in order to obtain ~(1 | fac), which is of class "formula".
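Here is a minimal illustration of that point which runs without Stan, using lm() and a hypothetical workspace variable f_global (not part of the original question): the stored call keeps the symbol, and eval() resolves it back to the formula object it names.
f_global <- y ~ x  # a formula saved in the workspace
m <- lm(f_global, data = data.frame(x = 1:10, y = rnorm(10)))

class(m$call$formula)  # "name": the call stores the symbol f_global, not the formula
eval(m$call$formula)   # evaluating the symbol recovers y ~ x, of class "formula"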

Formula from Data.frame Columns

I want to create a regression model of a vector (IC50) against a number of different molecular descriptors (A, B, C, D, etc.).
I want to use,
model <- lm (IC50 ~ A + B + C + D)
The molecular descriptors are found in the columns of a data.frame. I would like to use a function that takes the IC50 vector and the appropriately subsetted data.frame as inputs.
My problem is that I can't convert the columns into a formula for the model.
Can anyone help?
Sample data and a feeble attempt:
IC50 <- c(0.1, 0.2, 0.55, 0.63, 0.005)
descs <- data.frame(A = c(0.002, 0.2, 0.654, 0.851, 0.654),
                    B = c(56, 25, 89, 55, 60),
                    C = c(0.005, 0.006, 0.004, 0.009, 0.007),
                    D = c(189, 202, 199, 175, 220))
model <- function(x = IC50, y = descs) {
    a <- lm(x ~ y)
    return(a)
}
I went down the substitute/deparse route but this didn't import the data.
You can simply do
model <- function(x = IC50, y = descs)
    lm(x ~ ., data = y)
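With the sample data above, calling it looks like this (a sketch; the coefficient names pick up the column names of descs because . expands to all of its columns):
fit <- model(IC50, descs)
coef(fit)  # "(Intercept)" plus one coefficient per descriptor column: A, B, C, D
           # (with only 5 observations and 4 descriptors this toy fit is saturated)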
