Passing offset inside custom regression function - r

I am trying to create a custom regression function for running multiple models (simplified here for clarity). However, I am unable to pass an offset into this function. I am aware that this can be done inside the formula, but for this particular use case it must be an optional parameter.
Here is what I have:
fit_glm <- function(formula, df,
                    model_offset = NULL,
                    family = quasipoisson) {
  fit <- glm(formula, data = df,
             offset = model_offset,
             family = family)
  return(fit)
}
data(mtcars)
fit_glm(mpg ~ hp, df = mtcars)
Whenever I run this I am met with Error in eval(extras, data, env) : object 'model_offset' not found. Perhaps I am missing something very simple.
Performing this call directly:
glm(mpg ~ hp, data = mtcars, family = quasipoisson, offset = NULL)
works perfectly fine. I want the function to work both with and without an offset; currently neither works.
Any help is much appreciated, thanks in advance.
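One common workaround is to assemble the argument list and splice it into the call with do.call(): glm() then receives the already-evaluated offset vector instead of a symbol it has to look up via non-standard evaluation. A minimal sketch along those lines (fit_glm2 is just an illustrative name, and log(mtcars$wt) an arbitrary example offset):
fit_glm2 <- function(formula, df,
                     model_offset = NULL,
                     family = quasipoisson) {
  args <- list(formula = formula, data = df, family = family)
  if (!is.null(model_offset)) args$offset <- model_offset  # include only when supplied
  do.call(glm, args)
}
fit_glm2(mpg ~ hp, df = mtcars)                                  # without an offset
fit_glm2(mpg ~ hp, df = mtcars, model_offset = log(mtcars$wt))   # with an offset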

Related

Error when fitting a GLM: Error in eval(family$initialize)

I am trying to fit the generalized linear model defined below.
It should be noted that the response variable Var1, as well as the regressor Var2, contain zero values, so a constant has been added to avoid problems when applying the log.
model <- glm(Var1 + 2 ~ log(Var2 + 2) + offset(log(Var3/Var4)),
             family = gaussian(link = "log"), data = data2)
However, I am facing an error when drawing the diagnostic half-normal plot with the hnp function:
library(hnp)
hnp(model)
Gaussian model (glm object)
Error in eval(family$initialize) :
  cannot find valid starting values: please specify some
To get around this, I tried a manual implementation to construct the plot, but the error message is still present.
dfun <- function(obj) resid(obj)
sfun <- function(n, obj) simulate(obj)[[1]]
ffun <- function(resp) glm(resp ~ log(Var2 + 2) + offset(log(Var3/Var4)),
                           family = gaussian(link = "log"), data = data2)
hnp(model, newclass = TRUE, diagfun = dfun, simfun = sfun, fitfun = ffun)
Error in eval(family$initialize) :
  cannot find valid starting values: please specify some
I also followed some guidance I found, such as supplying initial values for the estimation algorithm (both on the linear predictor scale and for the means), but these were not enough to solve the problem; see the routine below:
fit <- lm(Var1 + 2 ~ log(Var2 + 2) + offset(log(Var3/Var4)), data = data2)
coefficients(fit)
(Intercept) log(Var2+2)
  32.961103   -8.283306
model <- glm(Var1 + 2 ~ log(Var2 + 2) + offset(log(Var3/Var4)),
             family = gaussian(link = "log"), start = c(32.96, -8.28),
             data = data2)
hnp(model)
Error in eval(family$initialize) :
  cannot find valid starting values: please specify some
The error persists even when trying to implement the half-normal plot manually.
dfun <- function(obj) resid(obj)
sfun <- function(n, obj) simulate(obj)[[1]]
ffun <- function(resp) glm(resp ~ log(Var2 + 2) + offset(log(Var3/Var4)),
                           family = gaussian(link = "log"), data = data2,
                           start = c(32.96, -8.28))
hnp(model, newclass = TRUE, diagfun = dfun, simfun = sfun, fitfun = ffun)
Error in eval(family$initialize) :
  cannot find valid starting values: please specify some
I also tried refitting the model after removing the zeros from the data, but the problem still persists.
I suspect what you meant to fit is a log-transformed response variable against your predictors. You can read more about the difference between a log-link GLM and a log-transformed response variable. Essentially, when you use a log link you are assuming the errors are on the exponential scale. I am not so familiar with hnp, but my guess is that there are problems simulating the response variable: simulated Gaussian responses can be non-positive, and gaussian(link = "log") cannot find starting values for such data.
If I run your regression like this using the data provided, it looks OK:
data2$Y <- with(data2, log((Var1 + 2) / (Var3/Var4)))  # fold the offset into the response
model <- glm(Y ~ log(Var2 + 2), data = data2)
hnp(model)
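To see why the simulations trip up the refits, here is a minimal stand-alone sketch (x, y_ok and y_bad are made-up names): gaussian(link = "log") refuses to produce starting values whenever the response contains non-positive values, and simulated Gaussian responses easily do.
set.seed(1)
x <- runif(50)
y_ok <- rexp(50) + 1   # strictly positive response: fits fine
y_bad <- rnorm(50)     # contains negative values, as simulate() output can
glm(y_ok ~ x, family = gaussian(link = "log"))
try(glm(y_bad ~ x, family = gaussian(link = "log")))
# Error in eval(family$initialize) :
#   cannot find valid starting values: please specify some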

Building a model within a function returns "variable lengths differ (found for 'var')"

I am trying to build a function to run a model based on a list of variables.
The code works fine outside a function
library(survival)   # the ovarian data set lives here
library(questionr)  # one package providing odds.ratio()
data <- ovarian
model <- glm(fustat ~ rx, family = binomial(), data = data)
odds.ratio(model)
but when I try to do this in a function it returns an error
OR <- function(var) {
  model <- glm(fustat ~ var, family = binomial(), data = data)
  return(odds.ratio(model))
}
OR("rx")
Error in model.frame.default(formula = fustat ~ var, data = data, drop.unused.levels = TRUE) :
  variable lengths differ (found for 'var')
Any idea how to get around this?
Thanks
You are passing a string as input to the function; convert it into a formula, or something that can be coerced to a formula. One way is to use reformulate().
OR <- function(var) {
  model <- glm(reformulate(var, "fustat"), family = binomial(), data = data)
  return(odds.ratio(model))
}
OR("rx")

Call to weights in lm() within function doesn't evaluate properly

I'm writing a function that requires a weighted regression. I've repeatedly been getting an error with the weights parameter, and I've created a minimal reproducible example, shown below:
wt_reg <- function(form, data, wts) {
  lm(formula = as.formula(form), data = data,
     weights = wts)
}
wt_reg(mpg ~ cyl, data = mtcars, wts = 1:nrow(mtcars))
This returns
Error in eval(extras, data, env) : object 'wts' not found
If you run this all separately, it works fine. I've dug into lm, and it appears the issue is a call to eval(mf, parent.frame()). Even though wts is in the parent.frame(), it doesn't appear to be evaluated correctly within the call. Here's a little more detail:
mf is assigned such that it's the same as
stats::model.frame(formula = as.formula(form), data = data, weights = wts,
                   drop.unused.levels = TRUE)
When I run
parent.frame()$wts
it does return a numeric vector. But when I run
eval(stats::model.frame(formula = as.formula(form), data = data, weights = wts,
                        drop.unused.levels = TRUE), parent.frame())
it doesn't.
I can run
stats::model.frame(formula = as.formula(parent.frame()$form),
                   data = parent.frame()$data, weights = parent.frame()$wts,
                   drop.unused.levels = TRUE)
and it works. You can test this yourself if you want using the example from the top.
Any thoughts? I really have no idea what's going on here...
Formulas are special in R in that they not only keep track of symbol/variable names, they also keep track of the environment where they were created. Check out:
ff <- mpg ~ cyl
environment(ff)
# <environment: R_GlobalEnv>
foo <- function() {
  ff <- mpg ~ cyl
  environment(ff)
}
foo()
# <environment: 0x0000026172e505d8> private function environment (different each time)
The problem is that lm() will try to use the environment where the formula was created to look up variables, rather than the parent frame. Since you create the formula in the call to wt_reg(), the formula holds on to the global scope, but wts only exists in the function scope. If you alter your function to re-point the formula's environment to the local function environment, everything should work:
wt_reg <- function(form, data, wts) {
  ff <- as.formula(form)
  environment(ff) <- environment()  # the formula now looks up variables in this function
  lm(formula = ff, data = data,
     weights = wts)
}
wt_reg(mpg ~ cyl, data = mtcars, wts = 1:nrow(mtcars))
The eval(mf, parent.frame()) you are referring to in lm() is calling model.frame() with your formula. And from the description on the ?model.frame help page: "All the variables in formula, subset and in ... are looked for first in data and then in the environment of formula (see the help for formula() for further details) and collected into a data frame". So it again is looking in the environment of the formula, not the calling frame.
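Here is a minimal stand-alone sketch of that lookup rule (f, g and w are made-up names): model.frame() resolves extra arguments such as weights in the formula's environment, so a formula created at top level cannot see function-local variables.
f <- mpg ~ cyl   # environment(f) is the global environment
g <- function() {
  w <- rep(1, nrow(mtcars))
  model.frame(f, data = mtcars, weights = w)  # w is looked up in globalenv(), not here
}
try(g())
# Error in eval(extras, data, env) : object 'w' not found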

Issue running glmnet() for mtcars dataset

Whenever I run glmnet(mpg ~ ., data = mtcars, alpha=1) (from the glmnet package) I get the following error:
"Error in glmnet(mpg ~ ., data = mtcars, alpha = 1) : unused argument (data = mtcars)"
Any ideas for how to deal with this?
I think it's because the glmnet() function is supposed to take x and y as separate arguments. If I need separate x and y arguments, how would I write the formula so that glmnet::glmnet() runs for all variables of mtcars?
As the commenter suggests, you need to call glmnet() with x and y supplied separately, like so:
fit <- glmnet(as.matrix(mtcars[-1]), mtcars$mpg, alpha = 1)  # x: every column except mpg; y: mpg
plot(fit)
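If you would rather start from a formula (for example, to expand factors into dummy columns), one common pattern, sketched here, is to let model.matrix() build the x matrix:
library(glmnet)
x <- model.matrix(mpg ~ ., data = mtcars)[, -1]  # drop the intercept column
y <- mtcars$mpg
fit <- glmnet(x, y, alpha = 1)  # lasso, as in the question
plot(fit)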

R - model.frame() and non-standard evaluation

I am puzzled by the behaviour of a function that I am trying to write. My example comes from the survival package, but I think the question is more general than that. Basically, the following code
library(survival)
data(bladder)  ## this will load "bladder", "bladder1" and "bladder2"
mod_init <- coxph(Surv(start, stop, event) ~ rx + number,
                  data = bladder2, method = "breslow")
survfit(mod_init)
will yield an object that I am interested in. However, when I wrap it in a function,
my_function <- function(formula, data) {
  mod_init <- coxph(formula = formula, data = data, method = "breslow")
  survfit(mod_init)
}
my_function(Surv(start, stop, event) ~ rx + number, data = bladder2)
the function will return an error at the last line:
Error in eval(predvars, data, env) :
  invalid 'envir' argument of type 'closure'
10 eval(predvars, data, env)
9 model.frame.default(formula = Surv(start, stop, event) ~ rx + number, data = data)
8 stats::model.frame(formula = Surv(start, stop, event) ~ rx + number, data = data)
7 eval(expr, envir, enclos)
6 eval(temp, environment(formula$terms), parent.frame())
5 model.frame.coxph(object)
4 stats::model.frame(object)
3 survfit.coxph(mod_init)
2 survfit(mod_init)
1 my_function(Surv(start, stop, event) ~ rx + number, data = bladder2)
I am curious whether there is something obvious that I am missing or whether such behaviour is normal. I find it strange, since the environment of my_function should contain the same objects as the global environment does when running the first portion of the code.
Edit: I also received useful input from Terry Therneau, the author of the survival package. This is his answer:
This is a problem that stems from the non-standard evaluation done by model.frame. The only way out of it that I have found is to add model.frame=TRUE to the original coxph call. I consider it a serious design flaw in R. Non-standard evaluation is like the dark side -- a tempting and easy path that always ends badly.
Terry T.
Diagnosis
From the error message:
2 survfit(mod_init)
1 my_function(Surv(start, stop, event) ~ rx + number, data = bladder2)
the problem is clearly not with coxph during model fitting, but with survfit.
And from this message:
10 eval(predvars, data, env)
9 model.frame.default(formula = Surv(start, stop, event) ~ rx + number, data = data)
I can tell that the problem is that, at an early stage of survfit, the function model.frame.default() cannot find a model frame containing the relevant data used in the formula Surv(start, stop, event) ~ rx + number. Hence it complains.
What is a model frame?
A model frame is formed from the data argument passed to a fitting routine like lm(), glm() or mgcv::gam(). It is a data frame with the same number of rows as data, but:
- dropping all variables not referenced by the formula;
- adding many attributes, the most important of which is the environment.
Most model-fitting routines, like lm(), glm() and mgcv::gam(), keep the model frame in their fitted object by default. This has the advantage that if we later call predict() without providing newdata, the data are found in this model frame for evaluation. A clear disadvantage, however, is that it substantially increases the size of the fitted object.
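As a quick sketch of what gets stored, using lm() on a built-in data set:
fit <- lm(mpg ~ cyl, data = mtcars)
names(fit$model)                        # the retained model frame: "mpg" "cyl"
environment(attr(fit$model, "terms"))   # the environment attribute travels with it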
However, survival::coxph() is an exception: by default it does not retain the model frame in its fitted object. Clearly this makes the resulting fitted object much smaller, but it exposes you to the problem you have encountered. If we want coxph() to keep this model frame, we use its model = TRUE argument.
Test with survival::coxph()
library(survival); data(bladder)
my_function <- function(myformula, mydata, keep.mf = TRUE) {
  fit <- coxph(myformula, mydata, method = "breslow", model = keep.mf)
  survfit(fit)
}
Now, this function call will fail, as you have seen:
my_function(Surv(start, stop, event) ~ rx + number, bladder2, keep.mf = FALSE)
but this function call will succeed:
my_function(Surv(start, stop, event) ~ rx + number, bladder2, keep.mf = TRUE)
Same behaviour for lm()
We can actually demonstrate the same behaviour in lm():
## generate some toy data
foo <- data.frame(x = seq(0, 1, length = 20),
                  y = seq(0, 1, length = 20) + rnorm(20, 0, 0.15))
## a wrapper function
bar <- function(myformula, mydata, keep.mf = TRUE) {
  fit <- lm(myformula, mydata, model = keep.mf)
  predict.lm(fit)
}
Now this will succeed, because the model frame is kept:
bar(y ~ x - 1, foo, keep.mf = TRUE)
while this will fail, because the model frame is discarded:
bar(y ~ x - 1, foo, keep.mf = FALSE)
Using argument newdata?
Note that my example for lm() is slightly artificial, because we can actually use the newdata argument of predict.lm() to get around this problem:
bar1 <- function(myformula, mydata, keep.mf = TRUE) {
  fit <- lm(myformula, mydata, model = keep.mf)
  predict.lm(fit, newdata = lapply(mydata, mean))
}
Now, whether or not we keep the model frame, both calls succeed:
bar1(y ~ x - 1, foo, keep.mf = TRUE)
bar1(y ~ x - 1, foo, keep.mf = FALSE)
Then you may wonder: can we do the same for survfit()?
survfit() is a generic function; in your code, you are really calling survfit.coxph(). There is indeed a newdata argument for this function. The documentation reads:
newdata:
a data frame with the same variable names as those that appear in the
‘coxph’ formula. ... ... Default is the mean of the covariates used in the
‘coxph’ fit.
So, let's try:
my_function1 <- function(myformula, mydata) {
  fit <- coxph(myformula, mydata, method = "breslow")
  survival:::survfit.coxph(fit, newdata = lapply(mydata, mean))
}
and we hope this works:
my_function1(Surv(start, stop, event) ~ rx + number, bladder2)
But:
Error in is.data.frame(data) (from #5) : object 'mydata' not found
1: my_function1(Surv(start, stop, event) ~ rx + number, bladder2)
2: #5: survival:::survfit.coxph(fit, lapply(mydata, mean))
3: stats::model.frame(object)
4: model.frame.coxph(object)
5: eval(temp, environment(formula$terms), parent.frame())
6: eval(expr, envir, enclos)
7: stats::model.frame(formula = Surv(start, stop, event) ~ rx + number, data =
8: model.frame.default(formula = Surv(start, stop, event) ~ rx + number, data
9: is.data.frame(data)
Note that although we pass in newdata, it is not used in the construction of the model frame:
3: stats::model.frame(object)
Only object, a copy of the fitted model, is passed to model.frame.default().
This is very different from what happens in predict.lm(), predict.glm() and mgcv:::predict.gam(). In those routines, newdata is passed to model.frame.default(). For example, in predict.lm() there is:
m <- model.frame(Terms, newdata, na.action = na.action, xlev = object$xlevels)
I don't use the survival package, so I am not sure how newdata works there. I think we really need an expert to explain this.
I think it might be that when your
Surv(start, stop, event) ~ rx + number
is passed in as a parameter, it does not get created properly. Try putting
is.Surv(formula)
as the first line in your function. I suspect it won't work; in that case I would suggest using the apply family of functions.
