Passing model parameters to R's predict() function robustly

I am trying to use R to fit a linear model and make predictions. My model includes some constant side parameters that are not in the data frame. Here's a simplified version of what I'm doing:
dat <- data.frame(x=1:5,y=3*(1:5))
b <- 1
mdl <- lm(y~I(b*x),data=dat)
Unfortunately the model object now suffers from a dangerous scoping issue: lm() does not save b as part of mdl, so when predict() is called, it has to look b up again in the environment where the formula was created. Thus, if subsequent code changes the value of b, the predicted value changes too:
y1 <- predict(mdl,newdata=data.frame(x=3)) # y1 == 9
b <- 5
y2 <- predict(mdl,newdata=data.frame(x=3)) # y2 == 45
How can I force predict() to use the original b value instead of the changed one? Alternatively, is there some way to control where predict() looks for the variable, so I can ensure it gets the desired value? In practice I cannot include b as part of the newdata data frame, because in my application, b is a vector of parameters that does not have the same size as the data frame of new observations.
Please note that I have greatly simplified this relative to my actual use case, so I need a robust general solution and not just ad-hoc hacking.

Use eval(substitute(...)) to substitute the current value of b into the quoted expression before it is evaluated:
mdl <- eval(substitute(lm(y~I(b*x),data=dat), list(b=b)))
mdl
# Call:
# lm(formula = y ~ I(1 * x), data = dat)
# ...
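As a quick check of the substitute() approach, here is a minimal sketch that reuses the question's own data and repeats the two predict() calls. Because the value of b is baked into the formula, changing b afterwards should no longer affect the prediction.

```r
dat <- data.frame(x = 1:5, y = 3 * (1:5))
b <- 1

# Substitute the current value of b into the call, then evaluate it
mdl <- eval(substitute(lm(y ~ I(b * x), data = dat), list(b = b)))

y1 <- predict(mdl, newdata = data.frame(x = 3))
b <- 5  # changing b later no longer matters
y2 <- predict(mdl, newdata = data.frame(x = 3))

all.equal(y1, y2)
```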

We could also use bquote
mdl <- eval(bquote(lm(y~I(.(b)*x), data=dat)))
mdl
#Call:
#lm(formula = y ~ I(1 * x), data = dat)
#Coefficients:
#(Intercept) I(1 * x)
# 9.533e-15 3.000e+00
According to the description in ?bquote: 'bquote' quotes its argument except that terms wrapped in '.()' are evaluated in the specified 'where' environment.
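If you would rather control where predict() looks for b than rewrite the formula, another option is to create the formula inside a function, so that the formula's environment holds its own copy of b. This is only a sketch: the wrapper name fit_fixed_b is made up for illustration, and it assumes the response and predictor are called y and x as in the question.

```r
# Hypothetical helper: the formula is created inside the function, so its
# environment is the function's frame, which holds a private copy of b.
fit_fixed_b <- function(b, data) {
  force(b)  # evaluate b now rather than lazily
  lm(y ~ I(b * x), data = data)
}

dat <- data.frame(x = 1:5, y = 3 * (1:5))
b <- 1
mdl <- fit_fixed_b(b, dat)

b <- 5  # the global b changes, the copy captured by the model does not
y_after <- predict(mdl, newdata = data.frame(x = 3))

# The value predict() will use can be inspected directly
captured_b <- get("b", envir = environment(formula(mdl)))
```

One trade-off of this design: the stored call refers to the function's local argument names, so update(mdl) from the global environment may not find them. The substitute()/bquote() versions do not have that limitation.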

Related

Using variable as function argument in r

I want to control a linear regression model with a specific variable. If I set the variable "c" to 0, I want to estimate the model:
lm(y ~ x) # model with intercept
and if I set "c" to 1, I want to estimate:
lm(y ~ x - 1) # model without intercept
I tried the code below
c <- 1
lm(y ~ x - c)
but it didn't work. c is 1, but I can't use this variable inside the formula passed to lm. How can I use a variable to add or remove the intercept?
I don't think you can do that with a simple variable. Rather than conditionally setting the value of c, you can conditionally remove the intercept. Something like
myformula <- y~x
if(TRUE) {
myformula <- update(myformula, ~.-1)
}
myformula
# y ~ x - 1
Formula objects don’t evaluate their arguments (otherwise they fundamentally wouldn’t work). So you need to find a way of interpolating an evaluated value into the unevaluated formula expression.
Like all problems of computer science, this can be solved by one more layer of indirection.
Create an unevaluated expression that creates your formula, and evaluate it after interpolating the variable:
formula = eval(bquote(y ~ x - .(c)))
lm(formula)
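Another way to toggle the intercept without quoting tricks is stats::reformulate(), which builds a formula from strings and has an intercept argument. A minimal sketch, with a made-up data frame d and a logical flag drop_intercept standing in for the question's c:

```r
set.seed(1)
d <- data.frame(x = 1:10)          # illustrative data, not from the question
d$y <- 2 + 3 * d$x + rnorm(10)

drop_intercept <- TRUE             # plays the role of c == 1

# Build y ~ x or y ~ x - 1 depending on the flag
f <- reformulate("x", response = "y", intercept = !drop_intercept)
fit <- lm(f, data = d)

f_with <- reformulate("x", response = "y", intercept = TRUE)
fit_with <- lm(f_with, data = d)

names(coef(fit))       # only "x"
names(coef(fit_with))  # "(Intercept)" and "x"
```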

Do lm object coefficients always list intercept first?

In coef(l), where l is an object of class "lm", is (Intercept) always listed first?
R's source code for lm() is not so straightforward. lm() appears to call lm.fit(), which gets coefficients by calling a C function with .Call(C_Cdqrls, x, y, tol, FALSE), which ultimately calls a least squares fitting routine in FORTRAN according to this informative blog post. I'm not really familiar enough with R internals or actual code to do least squares regression to answer my question.
No, only when the model has an intercept. The intercept is implicit in the formula, but you can specify a model without it using - 1 or 0 +:
x <- rnorm(20)
y <- rnorm(20, 10)
coef(lm(y ~ x + I(x^2)))
# (Intercept)           x      I(x^2)
#  10.3035412  -0.1506304  -0.3092836
coef(lm(y ~ I(x^3) + x - 1))
#     I(x^3)          x
# -0.5094851 -0.6598634
The coefficients are listed in the order of the model's terms, which for main effects is the order they appear in the formula (R places higher-order interaction terms after the main effects). If there is an intercept, it comes first. But as in many other situations in R, if you need the value of a specific component (the intercept or any other), it is good practice to extract it by its name. The result is NA if the object doesn't have it:
intercept <- coef(model)["(Intercept)"]
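To make the by-name lookup concrete, here is a small sketch using the same two formulas as above on simulated data (the seed and object names are arbitrary). It also shows the "intercept" attribute of the terms object, which is a direct way to ask whether a fitted model has an intercept.

```r
set.seed(42)
x <- rnorm(20)
y <- rnorm(20, 10)

with_int <- lm(y ~ x + I(x^2))
no_int   <- lm(y ~ I(x^3) + x - 1)

coef(with_int)["(Intercept)"]  # the fitted intercept
coef(no_int)["(Intercept)"]    # NA: no coefficient has that name

# 1 if the model has an intercept, 0 otherwise
has_int_1 <- attr(terms(with_int), "intercept")
has_int_2 <- attr(terms(no_int), "intercept")
```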

linear regression function creating a list instead of a model

I'm trying to fit an lm model using R. However, for some reason this code creates a list of the data instead of the usual regression model.
The code I use is this one
lm(Soc_vote ~ Inclusivity + Gini + Un_den, data = general_inclusivity_sweden )
But instead of the usual coefficients, the title of the variable appears mixed with the data in this way:
(Intercept) Inclusivity0.631 Inclusivity0.681 Inclusivity0.716 Inclusivity0.9
      35.00            -4.00            -6.74            -4.30           4.90
Does anybody know why this happened and how it can be fixed?
What you are seeing is called a named num (a numeric vector with names). You can do the following:
Model <- lm(Soc_vote ~ Inclusivity + Gini + Un_den, data = general_inclusivity_sweden) # Assign the model to an object called Model
summary(Model) # Summary table
Model$coefficients # See only the coefficients as a named numeric vector
Model$coefficients[[1]] # See the first coefficient without name
If you want all the coefficients without names (so just a numeric vector), try:
unname(coef(Model))
It would be good if you could provide a sample of your data but I'm guessing that the key problem is that the numeric data in Inclusivity is stored as a factor. e.g.,
library(tidyverse)
x <- tibble(incl = as.factor(c(0.631, 0.681, 0.716)),
soc_vote=1:3)
lm(soc_vote ~ incl, x)
Call:
lm(formula = soc_vote ~ incl, data = x)
Coefficients:
(Intercept) incl0.681 incl0.716
1 1 2
Whereas, if you first convert the Inclusivity column to double, you get
y <- x %>% mutate(incl = as.double(as.character(incl)))
lm(soc_vote ~ incl, y)
Call:
lm(formula = soc_vote ~ incl, data = y)
Coefficients:
(Intercept) incl
-13.74 23.29
Note that I needed to convert to character first, since converting a factor directly to a number returns the integer level codes (1, 2, 3, ...) rather than the values shown in the labels.
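The difference between the two conversions is easy to see in base R. A minimal sketch with a made-up factor f holding the same kind of values as the Inclusivity column:

```r
f <- factor(c(0.631, 0.681, 0.716, 0.9))

codes  <- as.numeric(f)                # integer level codes, not the values
values <- as.numeric(as.character(f))  # the numbers shown in the labels

# Equivalent and slightly more efficient for long vectors:
values2 <- as.numeric(levels(f))[f]

codes
values
```

Checking str(general_inclusivity_sweden) before fitting would show whether Inclusivity is stored as a factor or character column in the first place.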

How to correctly pass formulas associated with a variable name with random effects into fitted regression models in `R`?

I currently have a problem in that I have to pre-specify my formulas before sending them into a regression function. For example, using the stan_gamm4 function in R, we have the following example:
dat <- mgcv::gamSim(1, n = 400, scale = 2) ## simulate 4 term additive truth
## Now add 20 level random effect `fac'...
dat$fac <- fac <- as.factor(sample(1:20, 400, replace = TRUE))
dat$y <- dat$y + model.matrix(~ fac - 1) %*% rnorm(20) * .5
br <- stan_gamm4(y ~ s(x0) + x1 + s(x2), data = dat, random = ~ (1 | fac),
chains = 1, iter = 200) # for example speed
Now, because the formula and random formula were specified explicitly, then if we call:
br$call$random
> ~(1 | fac)
We are able to retrieve the form of the random effects.
NOW, let us then leave everything the same, BUT use an expression for the random part:
formula.rand <- as.formula( '~(1|fac)' )
Then, if we fit the same model as before, but with formula.rand in place of the explicit random formula, we have:
br <- stan_gamm4(y ~ s(x0) + x1 + s(x2), data = dat, random = formula.rand,
chains = 1, iter = 200) # for example speed
BUT NOW: we have that:
br$call$random
> formula.rand
instead of the original formula. A lot of Bayesian packages rely on accessing br$call$random, so is there a way to pass the formula in through a variable AND still have br$call$random return the original formula? Thanks.
While I haven't used Stan, this is a problem inherent in the way that R handles storing calls. You can see it happening with lm, for example:
model <- function(formula)
{
lm(formula, data=mtcars)
}
m <- model(mpg ~ disp)
m$call$formula
# formula
The simplest solution is to construct the call using substitute to insert the actual values you want to keep, not the symbol name. In the case of lm, this would be something like
model2 <- function(formula)
{
call <- substitute(lm(formula=.f, data=mtcars), list(.f=formula))
eval(call)
}
m2 <- model2(mpg ~ disp)
m2$call$formula
# mpg ~ disp
For Stan, you can do
stan_call <- substitute(stan_gamm4(y ~ s(x0) + x1 + s(x2), data=dat, random=.rf,
chains=1, iter=200),
list(.rf=formula.rand))
br <- eval(stan_call)
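A related way to get the evaluated formula into the stored call is do.call(), which builds the call from already-evaluated arguments. The sketch below uses lm and mtcars so that it runs without Stan; the wrapper name model3 is made up, and I have not checked how stan_gamm4 behaves when called this way.

```r
model3 <- function(formula) {
  # The formula is passed as a value; quote(mtcars) keeps the data argument
  # as a symbol so the stored call stays short and readable.
  do.call("lm", list(formula = formula, data = quote(mtcars)))
}

m3 <- model3(mpg ~ disp)
m3$call$formula  # the formula itself, not the symbol `formula`
m3$call$data     # still the symbol mtcars
```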
If I understand correctly, your problem is not that stan_gamm4 computes incorrect results (it does not, from what I gather), but only that br$call$random refers to the variable name and not the formula. This seems to be problematic for further post-processing of the model.
Since stan_gamm4 uses match.call inside to find the call, I don't know of a way to specify the model differently to obtain a "correct" br$call$random up front. But you can simply modify it after the fact via:
br <- stan_gamm4(y ~ s(x0) + x1 + s(x2), data = dat, random = formula.rand)
br$call$random <- formula.rand
br$call$random
#> ~(1 | fac)
and then continue with whatever you are doing.
IMHO, this is not a problem with stan_gamm4. In your second example, if you then do
class(br$call$random)
you will see that it is of class "name". So $call is not just a list with stuff in it. To access it programmatically in general, you need to evaluate it with
eval(br$call$random)
in order to obtain ~(1 | fac), which is of class "formula".
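The same behaviour can be reproduced with lm, which makes it easy to experiment without fitting a Stan model. A minimal sketch, assuming the formula is stored in a variable f at top level:

```r
f <- mpg ~ disp
m <- lm(f, data = mtcars)

class(m$call$formula)           # "name": the call stores the symbol f
recovered <- eval(m$call$formula)  # looks f up again, so f must still exist
class(recovered)                # "formula"

# For lm objects the fitted formula is also available without touching $call
fitted_formula <- formula(m)

# Patching the stored call after the fact, as suggested above
m$call$formula <- f
```

Note that eval() on the stored symbol only gives the right answer while the variable is still in scope and unchanged, which is why patching the call or building it with substitute() is more robust.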

Updating data in lm() calls

Is there an equivalent to update for the data part of an lm call object?
For example, say I have the following model:
dd = data.frame(y=rnorm(100),x1=rnorm(100))
Model_all <- lm(formula = y ~ x1, data = dd)
Is there a way of operating on the lm object to have the equivalent effect of:
Model_1t50 <- lm(formula = y ~ x1, data = dd[1:50,])
I am trying to construct some pseudo out-of-sample forecast tests, and it would be very convenient to have a single lm object and to simply roll the data.
I'm fairly certain that update actually does what you want!
example(lm)
dat1 <- data.frame(group,weight)
lm1 <- lm(weight ~ group, data=dat1)
dat2 <- data.frame(group,weight=2*weight)
lm2 <- update(lm1,data=dat2)
coef(lm1)
##(Intercept) groupTrt
## 5.032 -0.371
coef(lm2)
## (Intercept) groupTrt
## 10.064 -0.742
If you're hoping for an efficiency gain from this, you'll be disappointed: R just substitutes the new arguments and re-evaluates the call (see the code of update.default). But it does make the code a lot cleaner ...
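Applied to the data from the question, rolling the estimation window could look like the sketch below. The window boundaries are arbitrary choices for illustration.

```r
set.seed(1)
dd <- data.frame(y = rnorm(100), x1 = rnorm(100))
model_all <- lm(y ~ x1, data = dd)

# Refit on the first 50 rows by swapping only the data argument
model_1t50 <- update(model_all, data = dd[1:50, ])
direct     <- lm(y ~ x1, data = dd[1:50, ])
all.equal(coef(model_1t50), coef(direct))

# Rolling windows: one refit per window (each is a full re-estimation)
windows <- list(1:50, 26:75, 51:100)
fits <- lapply(windows, function(idx) update(model_all, data = dd[idx, ]))
n_used <- sapply(fits, nobs)
```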
biglm objects can be updated to include more data, but not less. So you could do this in the opposite order, starting with less data and adding more. See http://cran.r-project.org/web/packages/biglm/biglm.pdf
However, I suspect you're interested in parameters estimated for subpopulations (i.e., rows 1:50 correspond to level "a" of a factor variable factrvar). In this case, you should use an interaction in your formula (~factrvar*x1) rather than subsetting to data[1:50,]. An interaction of this type gives different effect estimates for each level of factrvar. This is more efficient than estimating each parameter separately and constrains any additional parameters (e.g., x2 in ~factrvar*x1 + x2) to be the same across values of factrvar. If you instead fitted the same model separately to different subsets, x2 would receive a separate parameter estimate each time.
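A small simulated sketch of the interaction idea, with a made-up two-level factor g in place of factrvar. With a full interaction and no extra covariates, the slope for the baseline level matches what a fit on that subset alone would give.

```r
set.seed(2)
d <- data.frame(g = factor(rep(c("a", "b"), each = 50)), x1 = rnorm(100))
d$y <- ifelse(d$g == "a", 1, -1) * d$x1 + rnorm(100)

fit_int <- lm(y ~ g * x1, data = d)            # one model, slope varies by g
fit_a   <- lm(y ~ x1, data = d[d$g == "a", ])  # subset fit for comparison

slope_a_int <- unname(coef(fit_int)["x1"])            # slope for level "a"
slope_b_int <- unname(sum(coef(fit_int)[c("x1", "gb:x1")]))  # slope for "b"
slope_a_sub <- unname(coef(fit_a)["x1"])

all.equal(slope_a_int, slope_a_sub)
```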
