Suppose we have the following data:
# simulate data to fit
set.seed(21)
y = rnorm(100)
x = .5*y + rnorm(100, 0, sqrt(.75))
Let's also suppose the user has fit a model:
# user fits a lm
mod = lm(y~x)
Now suppose I have an R package designed to perform several operations on the object mod. Just for simplicity, suppose we have two functions: one that plots the data and one that computes the coefficients. However, as an intermediate step, suppose we want to perform some operation on the data (in this example, add ten).
Example:
# function that adds ten to all scores
add_ten = function(model) {
data = model$model
data = data + 10
return(data)
}
# functions I defined that do something to the "add_ten" dataset
plot_ten = function(model) {
  new_data = data.frame(add_ten(model))
  x = all.vars(formula(model))[2]
  y = all.vars(formula(model))[1]
  ggplot2::ggplot(new_data, ggplot2::aes_string(x = x, y = y)) +
    ggplot2::geom_point() +
    ggplot2::geom_smooth()
}
coefs_ten = function(model) {
new_data = data.frame(add_ten(model))
coef(lm(formula(model), new_data))
}
(Obviously, this is pretty silly to do. In actuality, the operation I want to perform is multiple imputation, which is computationally intensive).
Notice in the above example I have to call the add_ten function twice, once for plot_ten and once for coefs_ten. This is inefficient.
So, now to my question, what is the best way to create a reusable object within a function?
I could, of course, create an object to be placed in the user's global environment:
add_ten = function(model) {
# check for add_ten_data in the global environment
if (exists("add_ten_data", where = .GlobalEnv)) return(get("add_ten_data", envir = .GlobalEnv))
data = model$model
data = data + 10
# assign add_ten_data to the global environment
assign('add_ten_data', data, envir = .GlobalEnv)
return(data)
}
I'm happy to do so, but worry about the "netiquette" of putting something in the user's environment. There's also a potential problem if users happen to have an object called "add_ten_data" in their environment.
So, what is the best way of accomplishing this?
Thanks in advance!
You should certainly avoid writing an object to the global environment. If you find that you have to repeat the same computationally expensive task at the top of a number of different functions, it means you are carrying out that task too late.
For example, you could create an S3 class that holds the components needed to produce a "cheap" plot and a "cheap" extraction of the coefficients. It even has the benefit of generic dispatch:
add_ten <- function(model) model$model + 10
lm_tens <- function(formula, data)
{
model <- if(missing(data)) lm(formula) else lm(formula, data = data)
structure(list(data = data.frame(add_ten(model)), model = model),
class = "tens")
}
plot.tens <- function(tens) {
  x <- all.vars(formula(tens$model))[2]
  y <- all.vars(formula(tens$model))[1]
  ggplot2::ggplot(tens$data, ggplot2::aes(x = .data[[x]], y = .data[[y]])) +
    ggplot2::geom_point() +
    ggplot2::geom_smooth()
}
coef.tens = function(tens) {
coef(lm(formula(tens$model), data = tens$data))
}
So now we just need to do:
set.seed(21)
y = rnorm(100)
x = .5*y + rnorm(100, 0, sqrt(.75))
mod <- lm_tens(y ~ x)
coef(mod)
#> (Intercept) x
#> 4.3269914 0.5775404
plot(mod)
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Note that we only need to call add_ten once here.
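As an aside (not part of the original answer): if restructuring into a class is not an option, the recomputation can also be avoided without touching the global environment, either by memoising the expensive function or by caching in an environment the package itself owns. A minimal sketch:
# Option 1: memoise the expensive step, so repeated calls with the same
# model reuse the cached result (requires the memoise package)
add_ten <- memoise::memoise(function(model) model$model + 10)
# Option 2: cache in an environment your package owns instead of .GlobalEnv
# (keyed on a fixed name here for brevity; a real package might key per model)
.cache <- new.env(parent = emptyenv())
add_ten2 <- function(model) {
  if (!exists("add_ten_data", envir = .cache, inherits = FALSE))
    assign("add_ten_data", model$model + 10, envir = .cache)
  get("add_ten_data", envir = .cache, inherits = FALSE)
}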
Related
I have a function that takes a dataset, extracts different variables, and then makes linear models from those variables (it expects the response in the last column). I want the data argument of the calls for these models to use objects in the global environment so that I can manipulate them with other functions outside this function. The following gives the expected behavior when provided with a single dataset.
make_mods <- function(dataset) {
make_mod <- function(x){
response <- names(dataset)[length(dataset)]
form <- paste0(response, " ~ ", x)
form <- as.formula(form)
bquote( lm(.(form), data = .(d_sub)) ) # Unevaluated to show output
}
d_sub <- substitute(dataset)
vars <- names(dataset)[-length(dataset)]
mods <- lapply(vars, make_mod)
return(mods)
}
# Make some different datasets
ex1 <- ex2 <- ex3 <- mtcars[c(3,4,6,1)]
new_data <- function(x) {
x + rnorm(length(x), mean = 0, sd = sd(x))
}
ex2[-length(ex2)] <- lapply(ex2[-length(ex2)], new_data)
ex3[-length(ex3)] <- lapply(ex3[-length(ex3)], new_data)
make_mods(ex1)
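Called from the global environment like this, it returns unevaluated calls of the form:
[[1]]
lm(mpg ~ disp, data = ex1)

[[2]]
lm(mpg ~ hp, data = ex1)

[[3]]
lm(mpg ~ wt, data = ex1)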
I also want to be able to use this function within lapply:
# List of datasets for testing function with lapply
ex_l <- mget(c("ex1", "ex2", "ex3"))
lapply(ex_l, make_mods)
But here the model calls end up looking like this: lm(mpg ~ disp, data = X[[i]]), and, of course, that call doesn't evaluate in the global environment (my actual function evaluates the model call inside the function). The desired output is a list of lists of models with valid calls like lm(mpg ~ disp, data = ex_l[["ex1"]]), i.e., data arguments that reference data frames in the global environment.
I've experimented with passing names to lapply and with different wrapper functions for calling make_mods from lapply, but it seems that my function, because it uses substitute, only gives the expected behavior when called from the global environment. I'm new to working with scoping and environments. How can I get my function to give the desired lm call both when passed a data frame from the global environment and when passed data frames from within lapply?
The only thing I could think of was to add an if statement to my make_mods function that tests whether the input is a call. If it is, the function expects it to be a call for a dataset in the global environment.
make_mods <- function(dataset) {
make_mod <- function(x){
response <- names(dataset)[length(dataset)]
form <- paste0(response, " ~ ", x)
form <- as.formula(form)
bquote( lm(.(form), data = .(d_sub)) )
}
if(is.call(dataset)) {
d_sub <- dataset
dataset <- eval(dataset)
} else {
d_sub <- substitute(dataset)
}
vars <- names(dataset)[-length(dataset)]
mods <- lapply(vars, make_mod)
return(mods)
}
Then I can use lapply like this:
out <- lapply(names(ex_l), function(x){
g <- bquote(ex_l[[.(x)]])
make_mods(g)
})
names(out) <- names(ex_l)
which gives me this:
$ex1
$ex1[[1]]
lm(mpg ~ disp, data = ex_l[["ex1"]])
$ex1[[2]]
lm(mpg ~ hp, data = ex_l[["ex1"]])
$ex1[[3]]
lm(mpg ~ wt, data = ex_l[["ex1"]])
<<output truncated>>
Maybe not an elegant solution, but it's working.
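A quick usage note: because the stored calls reference ex_l by name in the global environment, they can be evaluated directly to fit the models, e.g.:
# evaluate the saved calls to actually fit the models
fits_ex1 <- lapply(out$ex1, eval)
coef(fits_ex1[[1]])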
I am trying to call a function that expects a string as one of the arguments. However, attempting to substitute a variable containing the string throws an error.
library(jtools)
# Fit linear model
fitiris <- lm(Petal.Length ~ Petal.Width * Species, data = iris)
# Plot interaction effect: works!
interact_plot(fitiris, pred = "Petal.Width", modx = "Species")
# Substitute variable name for string: doesn't work!
predictor <- "Petal.Width"
interact_plot(fitiris, pred = predictor, modx = "Species")
Error in names(modxvals2) <- modx.labels :
attempt to set an attribute on NULL
{jtools} uses non-standard evaluation so you can specify unquoted column names, e.g.
library(jtools)
fitiris <- lm(Petal.Length ~ Petal.Width * Species, data = iris)
interact_plot(fitiris, pred = Petal.Width, modx = Species)
...but it's not robustly implemented, so the (common!) case you've run into breaks it. If you really need it to work, you can use bquote to restructure the call (with .(...) around what you want substituted), and then run it with eval:
predictor <- "Petal.Width"
eval(bquote(interact_plot(fitiris, pred = .(predictor), modx = "Species")))
...but this is diving pretty deep into R. A better approach is to make the plot yourself using an ordinary plotting library like {ggplot2}.
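For instance, here is a minimal hand-rolled version of that interaction plot (a sketch, assuming the fitiris model from above; the grid of 100 prediction points is arbitrary):
library(ggplot2)
# predict Petal.Length over a grid of Petal.Width values for each Species
newdat <- expand.grid(
  Petal.Width = seq(min(iris$Petal.Width), max(iris$Petal.Width), length.out = 100),
  Species = levels(iris$Species)
)
newdat$Petal.Length <- predict(fitiris, newdata = newdat)
ggplot(newdat, aes(Petal.Width, Petal.Length, colour = Species)) +
  geom_line() +                          # fitted lines, one per species
  geom_point(data = iris, alpha = 0.5)   # raw data for context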
I'm the developer of this package.
A short note: this function has just been moved to a new package, called interactions, which is in the process of being added to CRAN. If you want to install it before it reaches CRAN (I expect this to happen within the week), you'll need to use this code to download it from GitHub:
if (!requireNamespace("remotes") {
install.packages("remotes")
}
remotes::install_github("jacob-long/interactions")
In this new version, I've changed the non-standard evaluation to follow the tidyeval model. This means it should be more straightforward to write a function that plugs in arguments to pred, modx, and/or mod2.
For example:
library(interactions)
plot_wrapper <- function(my_model, my_pred, my_modx) {
interact_plot(my_model, pred = !! my_pred, modx = !! my_modx)
}
fiti <- lm(Income ~ Frost + Murder * Illiteracy, data = as.data.frame(state.x77))
plot_wrapper(fiti, my_pred = "Murder", my_modx = "Illiteracy") # Works
pred_var <- "Murder"
modx_var <- "Illiteracy"
plot_wrapper(fiti, my_pred = pred_var, my_modx = modx_var) # Works
Or just to give an example of using variables in a loop...
variables <- c("Murder", "Illiteracy")
for (var in variables) {
print(interact_plot(fiti, pred = !! var, modx = !! (variables[variables != var])))
}
I am trying to run a glmer by looping through the columns of my dataset that contain response variables (dat_prob). The code I am using is as follows, adapted from another Stack Overflow question (Looping through columns in R).
Their code:
dat_y<-(dat[,c(2:1130)])
dat_x<-(dat[,c(1)])
models <- list()
#
for(i in names(dat_y)){
y <- dat_y[i]
model[[i]] = lm( y~dat_x )
}
My code:
dat_prob<-(probs[,c(108:188)])
dat_age<-(probs[,c(12)])
dat_dist<-(probs[,c(20)])
fyearcap=(probs[,c(25)])
fstation=(probs[,c(22)])
fnetnum=(probs[,c(23)])
fdepth=(probs[,c(24)])
models <- list()
#
for(i in names(dat_prob)){
y <- dat_prob[i]
y2=as.vector(y)
model[[i]] = glmer( y ~ dat_age * dat_dist + (1|fyearcap) + (1|fstation)+
(1|fnetnum)+ (1|fdepth),family=binomial,REML=TRUE )
}
And I receive this error, similar to the error received in the hyperlinked question:
Error in model.frame.default(drop.unused.levels = TRUE, formula = y ~ :
invalid type (list) for variable 'y'
I have been working through this for hours and now can't see the forest for the trees.
Any help is appreciated.
y <- dat_prob[i] makes y a list (or rather a data frame, which is a kind of list). Lists are vectors (try is.vector(list())), so even y2 = as.vector(y) is still a list/data frame (and you never actually use y2).
class(as.vector(mtcars[1]))
# [1] "data.frame"
To extract a numeric vector from a data frame, use [[: y <- dat_prob[[i]].
class(mtcars[[1]])
# [1] "numeric"
Though I agree with Roman - using formulas is probably a nicer way to go. Try something like this:
for (i in names(dat_prob)) {
  my_formula = as.formula(paste(i,
    "~ dat_age * dat_dist + (1|fyearcap) + (1|fstation) + (1|fnetnum) + (1|fdepth)"
  ))
  # note: glmer() has no REML argument (that is an lmer() argument), so it is dropped here
  models[[i]] = glmer(my_formula, family = binomial)
}
I'm also pretty skeptical of whatever you're doing trying 80 different response variables, but that's not your question...
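As an aside (not part of the original answer): glmer() resolves bare names like dat_age in the environment of the formula, which is fragile; collecting everything into one data frame and passing it via data= is usually safer. A sketch, with dat_all as a hypothetical combined frame:
library(lme4)
# combine the response columns and predictors into a single data frame
dat_all <- data.frame(dat_prob, dat_age, dat_dist, fyearcap, fstation, fnetnum, fdepth)
models <- list()
for (i in names(dat_prob)) {
  f <- as.formula(paste(i,
    "~ dat_age * dat_dist + (1|fyearcap) + (1|fstation) + (1|fnetnum) + (1|fdepth)"))
  models[[i]] <- glmer(f, data = dat_all, family = binomial)
}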
I'm writing a function that I want to be able to take both a vector and a formula as a first argument. If it is a vector, I do some single variable calculations, if it is a formula, I analyze the first variable by the second variable (the second variable would always be a factor).
Here is my current code:
fun = function(formula,data) {
if (class(with(data,formula))=="formula") {
mod = model.frame(formula,data)
n.group=names(mod)[2]
group <- eval(parse(text=paste("mod$",n.group,sep=""))) #x
response <- model.response(mod) # y
return(table(response,group))
}
else {
return(table(with(data,formula)))
}
}
data(iris)
fun(Sepal.Length~Species,iris) # works correctly
fun(Sepal.Length,iris) # returns an error
The return value is just for illustration.
Cheers!
Try this:
fun.formula <- function(formula, data) {
  mod <- model.frame(formula, data)
  n.group <- names(mod)[2]
  group <- mod[[n.group]] # x
  response <- model.response(mod) # y
  table(response, group)
}
fun <- function(formula, data) {
  ret <- try(table(eval(substitute(formula), data)), silent = TRUE)
  if (inherits(ret, "try-error")) fun.formula(formula, data) else ret
}
# tests
fun(Sepal.Length ~ Species, iris)
fun(Sepal.Length, iris)
That said, this is a rather unusual interface; it might be better to handle the case where formula is a variable by passing its name as a character string, in which case a more usual S3 implementation is possible:
fun2 <- function(formula, data, ...) UseMethod("fun2")
fun2.formula <- fun.formula
fun2.character <- function(formula, data) table(data[[formula]])
# tests
fun2(Sepal.Length ~ Species, iris)
fun2("Sepal.Length", iris) # with this approach use a character string
REVISED: now uses try and adds an S3 approach.
Ideally, I would have solved this using an S3 approach, but I couldn't figure out how to do that. The following got the job done:
fun <- function(x, data) {
  mod = try(model.frame(x, data), silent = TRUE)
  if (inherits(mod, "try-error")) {
    x = data[, deparse(substitute(x))]
    return(table(x))
  }
  else {
    n.group = names(mod)[2]
    group <- mod[[n.group]] # x
    response <- model.response(mod) # y
    return(table(response, group))
  }
}
fun(Sepal.Length~Species,iris) # works correctly
fun(Sepal.Length,iris) # works!
I often run JAGS models on simulated data with known parameters. I like the default plot method for mcmc objects. However, I would like to add an abline(v=TRUE_VALUE) for each parameter that is modelled. This would give me a quick check for whether the posterior is reasonable.
Of course I could do this manually, or presumably reinvent the wheel and write my own function. But I was wondering if there is an elegant way that builds on the existing plot method.
Here's a worked example:
require(rjags)
require(coda)
# simulate data
set.seed(4444)
N <- 100
Mu <- 100
Sigma <- 15
y <- rnorm(n=N, mean=Mu, sd=Sigma)
jagsdata <- list(y=y)
jags.script <- "
model {
for (i in 1:length(y)) {
y[i] ~ dnorm(mu, tau)
}
mu ~ dnorm(0, 0.001)
sigma ~ dunif(0, 1000)
tau <- 1/sigma^2
}"
mod1 <- jags.model(textConnection(jags.script), data=jagsdata, n.chains=4,
n.adapt=1000)
update(mod1, 200) # burn in
mod1.samples <- coda.samples(model=mod1,
variable.names=c('mu', 'sigma'),
n.iter=1000)
plot(mod1.samples)
I just want to run something like abline(v=100) for mu and abline(v=15) for sigma. Of course in many other examples, I would have 5, 10, 20 or more parameters of interest. Thus, I'm interested in being able to supply a vector of true values for named parameters.
I've had a look at getAnywhere(plot.mcmc). Would modifying that be a good way to go?
Okay. So I modified plot.mcmc to look like this:
my.plot.mcmc <- function (x, trace = TRUE, density = TRUE, smooth = FALSE, bwf,
auto.layout = TRUE, ask = FALSE, parameters, ...)
{
oldpar <- NULL
on.exit(par(oldpar))
if (auto.layout) {
mfrow <- coda:::set.mfrow(Nchains = nchain(x), Nparms = nvar(x),
nplots = trace + density)
oldpar <- par(mfrow = mfrow)
}
for (i in 1:nvar(x)) {
y <- mcmc(as.matrix(x)[, i, drop = FALSE], start(x),
end(x), thin(x))
if (trace)
traceplot(y, smooth = smooth, ...)
if (density) {
if (missing(bwf)) {
densplot(y, ...); abline(v=parameters[i])
} else densplot(y, bwf = bwf, ...)
}
if (i == 1)
oldpar <- c(oldpar, par(ask = ask))
}
}
Then running the command
my.plot.mcmc(mod1.samples, parameters=c(Mu, Sigma))
produces the usual coda trace and density plots, with a vertical line at the supplied true value drawn on each density plot.
Note that parameters must be a vector of values in the same sort order as JAGS sorts variables, which seems to be alphabetically and then numerically for vectors.
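A refinement that addresses this caveat (a sketch, not in my version above): accept a named vector and look each value up by variable name with coda::varnames(), replacing the abline(v = parameters[i]) inside the loop:
# inside the density branch, instead of abline(v = parameters[i]):
truth <- parameters[coda::varnames(x)[i]]   # look up the true value by variable name
if (!is.na(truth)) abline(v = truth)        # skip parameters without a supplied value

# which allows a named call, in any order:
# my.plot.mcmc(mod1.samples, parameters = c(sigma = Sigma, mu = Mu))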
Lessons learnt
Simply writing a new plot.mcmc didn't work by default, presumably because of namespaces, so I just created a new function.
I had to change set.mfrow to coda:::set.mfrow, presumably also because of namespaces.
I changed ask to ask = FALSE, because RStudio permits browsing through figures.
I'd be happy to hear any suggestions about better ways of overriding or adapting existing S3 methods.