I am beginning to write a function that builds linear mixed models with nlme. I am encountering an error: Error in eval(expr, envir, enclos) : object 'value' not found, which I believe is due to R not knowing where to find the data frame variables (e.g., value). If this is, in fact, why the error is occurring, how do I tell the function that value and timepoint belong to the variables in Dat in the (reproducible) code below?
require(nlme)
Dat <- data.frame(
id = sample(10:19),
Time = sample(c("one", "two"), 10, replace = T),
Value = sample(1:10)
)
nlme_rct_lmm <- function (data, value, timepoint,
ID) {
#base_level intercept only model
bl_int_only <- gls(value ~ 1,
data = data,
method = "ML",
na.action="na.omit")
#vary intercept across participants
randomIntercept <- lme(value ~ 1,
data = data,
random = ~1|ID,
method = "ML",
na.action = "na.omit")
#add timepoint as a fixed effect
timeFE <- lme(value ~ timepoint,
data = data,
random = ~1|ID,
method = "ML",
na.action = "na.omit")
}
nlme_rct_lmm(Dat, Value, Time, id)
This isn't (as you and I both expected) a problem with evaluation within different frames; rather, it's an issue of consistency between the variable names in the formula and those in the data. R is case-sensitive, so it matters whether you use value or Value, id or ID, etc. Furthermore, formula interpretation uses non-standard evaluation (NSE), so if you have a variable value equal to the symbol Value, value ~ 1 does not magically get transmuted to Value ~ 1. What I've outlined below works by passing the names of the response, time, and ID variables to the function as strings, because it's the easiest approach. It's a little bit more elegant for the end user if you use non-standard evaluation, but that's a bit harder to program (and therefore to understand, debug, etc.).
Below the easy/boneheaded approach, I also discuss how to implement the NSE approach (scroll all the way down ...)
Note that your example doesn't explicitly return anything; in R, that means only the value of the final assignment (timeFE) is returned, invisibly, and the other fitted models are discarded when the function finishes. You might want to return the results as a list (or perhaps your real function will do other stuff with the fitted models, such as a series of model tests, and return those answers as the results ...)
library(nlme)
Dat <- data.frame(
  ID = sample(10:19),
  Time = sample(c("one", "two"), 10, replace = TRUE),
  Value = sample(1:10)
)
nlme_rct_lmm <- function(data, value, timepoint, ID) {
  # build the formulas from the column names passed in as strings
  nullmodel <- reformulate("1", response = value)
  fullmodel <- reformulate(c("1", timepoint), response = value)
  remodel <- reformulate(paste("1", ID, sep = "|"))
  # base-level intercept-only model
  bl_int_only <- gls(nullmodel,
                     data = data,
                     method = "ML",
                     na.action = "na.omit")
  # vary intercept across participants
  randomIntercept <- lme(nullmodel,
                         data = data,
                         random = remodel,
                         method = "ML",
                         na.action = "na.omit")
  # add timepoint as a fixed effect
  timeFE <- lme(fullmodel,
                data = data,
                random = remodel,
                method = "ML",
                na.action = "na.omit")
  # return all three fits as a list so none of them is discarded
  list(bl_int_only = bl_int_only,
       randomIntercept = randomIntercept,
       timeFE = timeFE)
}
nlme_rct_lmm(Dat, "Value", "Time", "ID")
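To see what the reformulate() calls inside the function actually build, here is a minimal sketch in base R (no nlme needed). The three string variables simply stand in for the function arguments; nothing is fitted.

```r
# Column names arrive as strings, as in nlme_rct_lmm(Dat, "Value", "Time", "ID")
value <- "Value"
timepoint <- "Time"
ID <- "ID"

# A literal formula keeps the symbol `value`; it does not look inside the variable
literal <- value ~ 1
all.vars(literal)

# reformulate() builds the formulas from the strings instead
nullmodel <- reformulate("1", response = value)
fullmodel <- reformulate(c("1", timepoint), response = value)
remodel <- reformulate(paste("1", ID, sep = "|"))

# These now refer to the columns that exist in Dat
all.vars(nullmodel)
all.vars(fullmodel)
all.vars(remodel)
```

Printing nullmodel, fullmodel and remodel at the console is a quick way to check the formulas before handing them to gls() or lme().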
If you want something a bit more elegant (but internally obscure) you can substitute the following lines for defining the models. The inner substitute() calls retrieve the symbols that were passed to the function as arguments; the outer substitute() calls insert those symbols into the formula.
nullmodel <- formula(substitute(v~1,list(v=substitute(value))))
fullmodel <- formula(substitute(v~t,list(v=substitute(value),
t=substitute(timepoint))))
remodel <- formula(substitute(~1|i,list(i=substitute(ID))))
Now this would work, without specifying the variables as strings, as you expected: nlme_rct_lmm(Dat, Value, Time, ID)
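As a rough check of the NSE variant, the same three lines can be wrapped in a small helper that only returns the formulas. build_formulas is a hypothetical name used for illustration here, not part of nlme, and the arguments are never evaluated, so the columns do not need to exist as objects.

```r
# Hypothetical helper: returns the formulas instead of fitting any models
build_formulas <- function(value, timepoint, ID) {
  list(
    # inner substitute() grabs the unevaluated argument,
    # outer substitute() splices it into the formula template
    nullmodel = formula(substitute(v ~ 1, list(v = substitute(value)))),
    fullmodel = formula(substitute(v ~ t, list(v = substitute(value),
                                               t = substitute(timepoint)))),
    remodel = formula(substitute(~ 1 | i, list(i = substitute(ID))))
  )
}

# Unquoted column names, as in nlme_rct_lmm(Dat, Value, Time, ID)
f <- build_formulas(Value, Time, ID)
f$fullmodel
```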
I am having trouble producing the desired output with the awesome gtsummary's tbl_uvregression.
The data I am using can be reproduced with this helpful function from http://www.cookbook-r.com/Manipulating_data/Converting_between_data_frames_and_contingency_tables/ which produces a long data frame from 2x2 tables. I wanted to use "real" data as there might be something in there that triggers the behavior I do not understand:
ExampleDF <- data.frame(
Exposure = factor(c("exposed","exposed","control","control")),
Outcome = factor(c("positive","negative","positive","negative")),
Freq = c(590,90,11,5)
)
###
countsToCases <- function(x, countcol = "Freq") {
# Get the row indices to pull from x
idx <- rep.int(seq_len(nrow(x)), x[[countcol]])
# Drop count column
x[[countcol]] <- NULL
# Get the rows from x
x[idx, ]
}
ExampleDFlong <- countsToCases(ExampleDF)
I found that in my big univariate regression table with the whole data set, produced with
tbl_uvregression(
method = glm,
y = Outcome,
method.args = list(family = binomial),
exponentiate = TRUE
)
the ORs and CI reported in the output are "off" (CI: 0.92 to 8.40; P = 0.066) compared to when I calculate them by using the formula like here: https://www.medcalc.org/calc/odds_ratio.php (CI: 1.01 to 8.78; P = 0.048). The applied methods here are, as I understand, "Wald" and "LRT", respectively.
I found this workaround that produces the CIs I was expecting: Wald confidence interval in gtsummary. But the P-Value I am getting is still the "LRT"-one and I have not been able to get the "Wald"-P displayed.
But what's really confusing to me is that when I don't use my original data set but the minimal example from above I get "LRT"-CIs and the "Wald"-P. Something in the data has to be different so that different methods are applied but I cannot see what that might be.
ExampleDFlong %>% tbl_uvregression(
method = glm,
y = Outcome,
method.args = list(family = binomial),
exponentiate = TRUE,
pvalue_fun = ~style_pvalue(.x, digits = 3)
)
To cut a long story short, I am looking for an approach that allows me to choose the methods applied and displayed so that it's easier for me to discuss my results with my biometrician, and that I can be sure I understand my output.
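For reference, here is a minimal base-R sketch, independent of gtsummary, that computes both kinds of interval and both kinds of p-value for the 2x2 example above. It assumes the usual behaviour of stats: confint.default() gives Wald intervals, while confint() on a glm fit gives profile-likelihood intervals (on older R versions this relies on the MASS package being installed). It does not show how to make tbl_uvregression display them.

```r
# Rebuild the long data frame from the 2x2 counts given above
ExampleDF <- data.frame(
  Exposure = factor(c("exposed", "exposed", "control", "control")),
  Outcome  = factor(c("positive", "negative", "positive", "negative")),
  Freq     = c(590, 90, 11, 5)
)
ExampleDFlong <- ExampleDF[rep.int(seq_len(nrow(ExampleDF)), ExampleDF$Freq),
                           c("Exposure", "Outcome")]

fit <- glm(Outcome ~ Exposure, data = ExampleDFlong, family = binomial)

# Wald interval: estimate +/- normal quantile * standard error, on the OR scale
wald_ci <- exp(confint.default(fit))["Exposureexposed", ]

# Profile-likelihood interval from the confint() method for glm fits
profile_ci <- exp(confint(fit))["Exposureexposed", ]

# Wald p-value, taken from the coefficient table
wald_p <- coef(summary(fit))["Exposureexposed", "Pr(>|z|)"]

# Likelihood-ratio test p-value for adding Exposure to the null model
lrt_p <- anova(fit, test = "LRT")[2, "Pr(>Chi)"]

rbind(wald = wald_ci, profile = profile_ci)
c(wald = wald_p, lrt = lrt_p)
```

The Wald row should agree with the hand formula used by the online calculator, since both are based on the log odds ratio and its standard error.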
I am trying to de-clutter some scripts by creating functions to complete repetitive tasks in R. One task I complete repeatedly is fitting a linear model to a set of data and creating predictions from that linear model fit. The data I am working with is concentration and flow data from streams, and flow is always the explanatory variable but the response variable changes and therefore I would like to include it as a function input. However, I receive a "non-numeric argument to mathematical function" error when I run the function. I have tried both with and without quotes, since the lm() call does not require quotes, but that results in the classic "object 'myobject' not found". Here's a simple example.
Update
flows <- seq(0,7,0.01)
dat <- tibble(flow=sample(flows,30),
parameter1_conc=rnorm(30,15,4),
parameter2_conc=rnorm(30,50,8))
regr_func <- function(modeldata,parameter,pred_maxflow,pred_flowint) {
mod <- lm(as.formula(paste('log(', parameter, ') ~ log(', flow, ')')), data=modeldata)
newflow <- data.frame(flow = seq(0, pred_maxflow, pred_flowint))
preds <<- predict(mod, newdata = newflow,
interval = 'prediction')
}
regr_func(modeldata = dat,
parameter = 'parameter1_conc',
pred_maxflow = 20,
pred_flowint = 0.001)
Original Example Error
flows <- seq(0,7,0.01)
dat <- tibble(flow=sample(flows,30),
parameter1_conc=rnorm(30,15,4),
parameter2_conc=rnorm(30,50,8))
regr_func <- function(modeldata,parameter,pred_maxflow,pred_flowint) {
mod <- lm(log(parameter)~log(flow), data = modeldata)
newflow <- data.frame(flow = seq(0, maxflow, flowint))
preds <<- predict(mod, newdata = newflow,
interval = 'prediction')
}
regr_func(modeldata = dat,
parameter = 'parameter1_conc',
pred_maxflow = 20,
pred_flowint = 0.001)
There are 3 issues here. The main one is that the parameter in log(parameter) in your lm formula is not replaced by the column name you passed in as parameter. That means lm is literally looking for a column called parameter in your data, which doesn't exist; it then falls back to the function argument parameter, which is a character string, and taking log() of a string produces the "non-numeric argument to mathematical function" error. You can fix this by creating a formula with the name substituted in. Although building the formula from strings is the most common way to do this, it is a bit more efficient and safer to use substitute. This also allows you to pass your column name without quotes.
The second issue is that the arguments maxflow and flowint should probably be pred_maxflow and pred_flowint to match your function parameters.
Thirdly, using the <<- operator to write to a variable outside the function (in an enclosing environment, which usually means the global environment) is bad practice. R users expect functions not to have such side effects, and know to store the output of function calls to variables under their control. Only in very rare circumstances should this be done within the function.
Putting all this together, we have:
regr_func <- function(modeldata, parameter, pred_maxflow, pred_flowint) {
f <- `[[<-`(x ~ log(flow), 2, substitute(log(parameter)))
mod <- lm(f, data = modeldata)
newflow <- data.frame(flow = seq(0, pred_maxflow, pred_flowint))
predict(mod, newdata = newflow, interval = 'prediction')
}
And we would call the function like this:
preds <- regr_func(modeldata = dat,
parameter = parameter1_conc,
pred_maxflow = 20,
pred_flowint = 0.001)
resulting in:
head(preds)
#> fit lwr upr
#> 1 Inf NaN NaN
#> 2 3.365491 2.188942 4.542041
#> 3 3.312636 2.219223 4.406049
#> 4 3.281717 2.236294 4.327140
#> 5 3.259780 2.248073 4.271488
#> 6 3.242765 2.256998 4.228531
Created on 2022-06-03 by the reprex package (v2.0.1)
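For comparison, the string-based route mentioned above might look like the sketch below. regr_func_str is a hypothetical name, the data are simulated with a plain data.frame rather than a tibble, and the prediction grid starts at pred_flowint instead of 0 (an assumption made here so that log(flow) stays finite).

```r
regr_func_str <- function(modeldata, parameter, pred_maxflow, pred_flowint) {
  # paste the column name into the formula text, then convert it to a formula
  f <- as.formula(paste0("log(", parameter, ") ~ log(flow)"))
  mod <- lm(f, data = modeldata)
  # start the grid above zero so that log(flow) is finite
  newflow <- data.frame(flow = seq(pred_flowint, pred_maxflow, by = pred_flowint))
  predict(mod, newdata = newflow, interval = "prediction")
}

set.seed(1)
dat <- data.frame(flow = runif(30, 0.1, 7),
                  parameter1_conc = runif(30, 5, 25),
                  parameter2_conc = runif(30, 30, 70))

# the response column is given as a quoted string here
preds <- regr_func_str(dat, "parameter1_conc",
                       pred_maxflow = 20, pred_flowint = 0.01)
head(preds)
```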
I have a large Generalized Additive Model (GAM) made from 10K observations with ~ 100 variables. Building the model with forward stepwise selection results in an object of class "NULL". Why might this be and how do I resolve it?
library(gam)
load(url("https://github.com/cornejom/DataSets/raw/master/mydata.Rdata"))
load(url("https://github.com/cornejom/DataSets/raw/master/mygam.Rdata"))
myscope <- gam.scope(mydata, response = 3, arg = "df=4") #Target var in 3rd col.
mygam.step <- step.gam(mygam, myscope, direction = "forward")
mygam.step
NULL
The code that was used to fit mygam from mydata is:
library(gam)
#Identify numerical variables, but exclude the integer response.
numbers = sapply(mydata, class) %in% c("integer", "numeric")
numbers[match("Response", names(mydata))] = FALSE
#Identify factor variables.
factors = sapply(mydata, class) == "factor"
#Create a formula to feed into gam function.
myformula = paste0(paste0("Response ~ ",
paste0("s(", names(mydata)[numbers], ", df=4)", collapse = " + ")
),
" + ",
paste0(paste0(names(mydata)[factors], collapse = " + ")))
mygam = gam(as.formula(myformula), family = "binomial", mydata)
I suspect the issue is with the mygam object.
Explanation
If you read help(step.gam), you will find this paragraph in the description of the scope argument:
The supplied model ‘object’ is used as the starting model,
and hence there is the requirement that one term from each of
the term formulas be present in ‘formula(object)’. This also
implies that any terms in ‘formula(object)’ not contained
in any of the term formulas will be forced to be present in
every model considered. The function ‘gam.scope’ is helpful
for generating the scope argument for a large model.
In essence this says that the first argument passed to step.gam function (mygam in this case) will have a formula and that formula will be used as a starting model for the stepwise procedure.
Since here we have forward stepwise - it cannot start from the full model, because in that case there is nothing left to add.
Exploring The Code
This idea is reinforced if we look at the code. The code of step.gam function has this loop that runs in case of forward selection.
if (forward) {
trial <- items
trial[i] <- trial[i] + 1
if (trial[i] <= term.lengths[i] && !get.visit(trial,
visited)) {
visited <- cbind(visited, trial)
tform.vector <- form.vector
tform.vector[i] <- scope[[i]][trial[i]]
form.list = c(form.list, list(list(trial = trial,
form.vector = tform.vector, which = i)))
}
}
Notice that a candidate model is added to form.list only when the inner if condition is TRUE. That condition checks whether the scope still offers a term (term.lengths) beyond the ones already in your model (items, trial). If it doesn't, the step is skipped.
Since in your case no candidate is ever added, the function never builds a return object and the procedure returns NULL.
The Solution
Given all the above - the solution is to not start with the complete formula when using forward selection method. Here for the demonstration I will be using the intercept-only model as a starting model:
library(gam)
load(url("https://github.com/cornejom/DataSets/raw/master/mydata.Rdata"))
mygam <- gam(Response ~ 1, family = "binomial", mydata)
The last line is the only change that needs to be made. Everything else is the same as in the original post:
myscope <- gam.scope(mydata, response = 3, arg = "df=4")
mygam.step <- step.gam(mygam, myscope, direction = "forward")
And now the procedure works.
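The same principle can be illustrated without the gam package or the remote data. The sketch below swaps in stats::step() on lm fits to mtcars, so it is only an analogy: step() hands back the starting model unchanged rather than NULL, but the logic is the same, namely that a forward search can only add terms the starting model does not already contain.

```r
full <- lm(mpg ~ wt + hp, data = mtcars)
null <- lm(mpg ~ 1, data = mtcars)

# Forward search from the full model: the scope offers nothing new to add
from_full <- step(full, scope = ~ wt + hp, direction = "forward", trace = 0)

# Forward search from the intercept-only model: terms can be added
from_null <- step(null, scope = ~ wt + hp, direction = "forward", trace = 0)

# Compare the terms in each result
attr(terms(from_full), "term.labels")
attr(terms(from_null), "term.labels")
```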
I would like to write a function that calls different sub-functions with parameters specified by strings, such as:
genericModel <- function(model, dat, y, x, ...) {
fit <- get(model)(get(y) ~ get(x), data = dat, ...)
return(fit)
}
I am able to get it to work with simple cases:
> d <- data.frame(x.var = rnorm(10), y.var = rnorm(10), w = rep(1, 10))
> genericModel('lm', d, 'y.var', 'x.var')
Call:
get(model)(formula = get(y) ~ get(x), data = dat)
Coefficients:
(Intercept) get(x)
-0.04242 -0.31619
However, I have not been successful in terms of passing other optional arguments by string:
> genericModel('lm', d, 'y.var', 'x.var', weights = 'w')
Error in model.frame.default(formula = get(y) ~ get(x), data = dat, weights = "w", :
variable lengths differ (found for '(weights)')
I know I can do genericModel('lm', d, 'y.var', 'x.var', weights = d$w), but that defeats the purpose of creating a flexible function where I can specify the model and column names by string.
Also, I can foresee complications where the optional parameters include both column names of the data.frame (e.g. weights = w) and generic options for the sub-function (e.g. na.action = na.pass).
EDIT:
Just to clarify, what I am hoping to achieve is:
genericModel('lm', d, 'y.var', 'x.var', weights = 'w')
genericModel('glm', d, 'y.var', 'x.var', family = 'binomial')
To run linear regression and logistic regression, respectively. I need some way to pass the optional arguments when calling genericModel.
Does anyone know how to deal with this? Thanks.
One suggestion: rather than fiddling with strings to specify analysis variables, what you should do is pass the formula. This is also much more flexible, since you'll be able to pass complicated model formulas directly to the underlying functions without any parsing.
If you do this, then obtaining what you want is simple with some language hacking. Get the call to the function, then manipulate it to call the model-fitting function instead.
genericModel <- function(mod, formula, data, ...)
{
cl <- match.call(expand.dots=TRUE)
cl[[1]] <- cl$mod
cl$mod <- NULL
eval(cl, parent.frame())
}
genericModel(lm, mpg ~ hp, data=mtcars, weights=gear)
genericModel(glm, Volume ~ Girth + Height, data=trees, family=Gamma(link=log))
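To make the call manipulation easier to follow, here is a small sketch that performs the same steps but returns the rewritten call instead of evaluating it. showCall is a hypothetical name used only for this illustration.

```r
# Hypothetical variant of genericModel that stops before eval()
showCall <- function(mod, formula, data, ...) {
  cl <- match.call(expand.dots = TRUE)  # showCall(mod = lm, formula = ..., ...)
  cl[[1]] <- cl$mod                     # replace the function being called
  cl$mod <- NULL                        # drop the now redundant mod argument
  cl
}

cl <- showCall(lm, mpg ~ hp, data = mtcars, weights = gear)
cl

# Evaluating the call gives the same fit as calling lm() directly
fit <- eval(cl)
direct <- lm(mpg ~ hp, data = mtcars, weights = gear)
all.equal(coef(fit), coef(direct))
```

Because weights = gear stays unevaluated inside the call, lm() resolves it against data in the usual way, which is why column names can be passed without quotes.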
I have a large set of model specifications to test, which share a dv but have unique IVs. In the following example
foo <- data.frame(dv = sample(c(0,1), 100, replace=T),
x1 = runif(100),
x2 = runif(100))
I want the first model to only include x1, the second x2, the third both, and the fourth their interaction. So I thought a sensible way would be to build a list of formula statements:
bar <- list("x1",
"x2",
"x1+x2",
"x1*x2")
which I would then use in a llply call from the plyr package to obtain a list of model objects.
require(plyr)
res <- llply(bar, function(i) glm(dv ~ i, data = foo, family = binomial()))
Unfortunately I'm told
Error in model.frame.default(formula = dv ~ i, data = foo, drop.unused.levels = TRUE):variable lengths differ (found for 'i')
Obviously I'm mixing up something fundamental--do I need to manipulate the original foo list in some fashion?
Your problem is with how you are specifying the formula, since inside the function i is a variable. This would work:
glm(paste("dv ~", i), data = foo, family = binomial())
The problem is that dv ~ i is a formula whose right-hand side is the symbol i itself, not the text stored in i. When glm builds the model frame it evaluates i, finds a length-one character vector, and so reports that the variable lengths differ.
Try this:
bar <- list("dv~x1",
"dv~x2",
"dv~x1+x2",
"dv~x1*x2")
res <- llply(bar, function(i) glm(i, data = foo, family = binomial()))
But setting statistical issues aside, it might be easier to use something like ?step (in stats) or ?stepAIC (in the MASS package) for tasks like this.
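Returning to the formula-building fix, here is a self-contained sketch using base lapply() instead of plyr::llply(). It assumes simulated data of the same shape as foo above and converts each right-hand side to a formula explicitly with as.formula().

```r
set.seed(42)
foo <- data.frame(dv = sample(c(0, 1), 100, replace = TRUE),
                  x1 = runif(100),
                  x2 = runif(100))

# right-hand sides only; the shared response is pasted in below
rhs <- c("x1", "x2", "x1 + x2", "x1 * x2")

res <- lapply(rhs, function(r) {
  f <- as.formula(paste("dv ~", r))  # e.g. dv ~ x1 * x2
  glm(f, data = foo, family = binomial())
})
names(res) <- rhs

# number of estimated coefficients per model, as a quick sanity check
n_coef <- sapply(res, function(m) length(coef(m)))
n_coef
```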