Create a function with an argument used in a formula - r

I'm a beginner at creating functions and I'm having trouble with something that's probably basic.
I'd like to create a function that takes a data.frame and the name of a variable as arguments, and returns the linear regression of that variable on the others (there's no real point in doing this; I'm just trying to learn how to write functions):
my_lm <- function(df, var) lm(var~., data = df)
my_lm(diamonds, price)
But I get this error:
Error in eval(predvars, data, env) : object 'price' not found
Thanks for your help, and sorry for my bad English.

One solution is to pass price as a character string and use formula() to convert the string into the proper object for lm():
my_lm <- function(df, var) {
  f = formula(paste0(var, "~.")) # this creates "price ~ ." in the example
  lm(f, data = df)
}
my_lm(diamonds, var="price")
Or, if you need to pass price unquoted (not as a string), you need non-standard evaluation (NSE):
my_lm <- function(df, var) {
  var = substitute(var)
  f = formula(paste0(var, "~."))
  lm(f, data = df)
}
my_lm(diamonds, var=price)
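A related option, sketched here as a suggestion rather than part of the original answers: stats::reformulate() builds the formula object from character pieces, which avoids pasting the "~" by hand. my_lm2 is an illustrative name, not from the answers above.
my_lm2 <- function(df, var) {
  var <- deparse(substitute(var))        # accept an unquoted column name
  f <- reformulate(".", response = var)  # builds price ~ . in the example
  lm(f, data = df)
}
my_lm2(diamonds, price)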

Related

Control scoping of arguments supplied to lm() from within a function called by lapply

I have a function that takes a dataset, extracts different variables, and then makes linear models from those variables (it expects the response in the last column). I want the data argument of the calls for these models to use objects in the global environment so that I can manipulate them with other functions outside this function. The following gives the expected behavior when provided with a single dataset.
make_mods <- function(dataset) {
  make_mod <- function(x){
    response <- names(dataset)[length(dataset)]
    form <- paste0(response, " ~ ", x)
    form <- as.formula(form)
    bquote( lm(.(form), data = .(d_sub)) ) # Unevaluated to show output
  }
  d_sub <- substitute(dataset)
  vars <- names(dataset)[-length(dataset)]
  mods <- lapply(vars, make_mod)
  return(mods)
}
# Make some different datasets
ex1 <- ex2 <- ex3 <- mtcars[c(3,4,6,1)]
new_data <- function(x) {
  x + rnorm(length(x), mean = 0, sd = sd(x))
}
ex2[-length(ex2)] <- lapply(ex2[-length(ex2)], new_data)
ex3[-length(ex3)] <- lapply(ex3[-length(ex3)], new_data)
make_mods(ex1)
I also want to be able to use this function within lapply:
# List of datasets for testing function with lapply
ex_l <- mget(c("ex1", "ex2", "ex3"))
lapply(ex_l, make_mods)
But here the model calls end up looking like this: lm(mpg ~ disp, data = X[[i]]) and, of course, this model call doesn't evaluate in the default environment (the actual function evaluates the model call in the function). The desired output is a list of lists of models that look like: lm(mpg ~ disp, data = ex_l[["ex1"]]), i.e., they have valid calls with data arguments that reference data frames in the global environment.
I've experimented with passing names to lapply and with different wrapper functions for calling make_mods from lapply, but it seems that my function, because it uses substitute, only gives the expected behavior when called from the global environment. I'm new to working with scoping and environments. How can I get my function to give the desired lm call both when passed a data frame from the global environment and when passed data frames from within lapply?
The only thing I could think of was to add an if statement to my make_mods function that tests whether the input is a call. If it is a call, the function expects it to be a call for a dataset in the global environment.
make_mods <- function(dataset) {
  make_mod <- function(x){
    response <- names(dataset)[length(dataset)]
    form <- paste0(response, " ~ ", x)
    form <- as.formula(form)
    bquote( lm(.(form), data = .(d_sub)) )
  }
  if(is.call(dataset)) {
    d_sub <- dataset
    dataset <- eval(dataset)
  } else {
    d_sub <- substitute(dataset)
  }
  vars <- names(dataset)[-length(dataset)]
  mods <- lapply(vars, make_mod)
  return(mods)
}
Then I can use lapply like this:
out <- lapply(names(ex_l), function(x){
  g <- bquote(ex_l[[.(x)]])
  make_mods(g)
})
names(out) <- names(ex_l)
which gives me this:
$ex1
$ex1[[1]]
lm(mpg ~ disp, data = ex_l[["ex1"]])
$ex1[[2]]
lm(mpg ~ hp, data = ex_l[["ex1"]])
$ex1[[3]]
lm(mpg ~ wt, data = ex_l[["ex1"]])
<<output truncated>>
Maybe not an elegant solution, but it works.
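An alternative sketch, offered as an assumption rather than something from the original answer: pass the list's name and the element's name as strings and build the data reference with bquote(), so no substitute()/is.call() branching is needed. make_mods_by_name is an illustrative helper name.
make_mods_by_name <- function(list_name, elem) {
  d_sub <- bquote(.(as.name(list_name))[[.(elem)]])  # e.g. ex_l[["ex1"]]
  dataset <- eval(d_sub, envir = globalenv())        # assumes the list lives in the global environment
  response <- names(dataset)[length(dataset)]
  vars <- names(dataset)[-length(dataset)]
  lapply(vars, function(x) {
    form <- as.formula(paste0(response, " ~ ", x))
    bquote(lm(.(form), data = .(d_sub)))
  })
}
out2 <- lapply(names(ex_l), make_mods_by_name, list_name = "ex_l")
names(out2) <- names(ex_l)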

Linear regression function malfunction

variables.null.model <- paste('utalter', 'lcsex', 'utcigreg', 'utbmi', 'month', sep = '+')
variables.full.model <- paste('utalter', 'lcsex', 'utcigreg', 'utbmi', 'month', 'ltedyrs','occ_status', 'marital_status', 'social_cat','GC_linc125_07', 'GC_linc250_07', 'GC_linc500_07', 'GC_linc1000_07', 'GC_linc5000_07', 'GC_pop500_08','utalkkon', 'activity', 'utpyrs', 'cvd', 'utmstati', 'utmfibra', 'utantihy', 'utmeddia', 'utmadins','utwhrat','ul_choln', sep='+')
pollutants_3 <- c('GC_PM10_09', 'GC_PM25_09', 'GC_Coarse_09', 'GC_BS25_09', 'GC_NOX_09', '$GC_NO2_09')
null <- paste(variables.null.model, pollutants_3, sep='+')
full <- paste(variables.full.model, pollutants_3, sep='+')
fun.model.summary <- function(x) {
  formula <- as.formula(paste("log_sfrp5 ~", x))
  lm <- lm(formula, data = kalonji.na)
  coef(summary(lm))
}
lm.summary <- lapply(full, fun.model.summary)
I am working on some air pollution data and would like to run a linear regression and summarize the coefficients. I have the code above, but I am getting this error:
Error in parse(text = x, keep.source = FALSE) :
:1:269: unexpected '$'
Any ideas how I can fix this?
Your last pollutant is '$GC_NO2_09'. Note the stray $ sign.
But as I said in a comment, I strongly recommend against using character strings here¹. Construct the formula directly from R objects by transforming the strings into R identifiers via as.name.
You can combine a list of names into a sum by the use of Reduce and call. E.g.:
make_addition = function (lhs, rhs)
  call('+', lhs, rhs)

variables_null_model = c('utalter', 'lcsex', 'utcigreg', 'utbmi', 'month')
interaction_terms_full_model = Reduce(make_addition, lapply(variables_null_model, as.name))

fun_model_summary = function (x) {
  formula = call('~', quote(log_sfrp5), call('+', interaction_terms_full_model, as.name(x)))
  lm = lm(formula, data = kalonji_na)
  coef(summary(lm))
}

lm_summary = lapply(pollutants_3, fun_model_summary)
¹ For a bit of background, using strings here subverts the type system and replaces proper, distinct types by untyped strings. This is known as stringly typing and it's an anti-pattern because it hides bugs. Your question is an example of such a bug.
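If strings are acceptable as input, a lighter variant is stats::reformulate(), which turns character term labels into a proper formula object. This is a sketch under the assumptions that the stray '$' in pollutants_3 has been removed and that the answer's renaming to kalonji_na is used; fun_model_summary2 is an illustrative name.
fun_model_summary2 = function (x) {
  formula = reformulate(c(variables_null_model, x), response = "log_sfrp5")
  lm = lm(formula, data = kalonji_na)
  coef(summary(lm))
}
lm_summary2 = lapply(pollutants_3, fun_model_summary2)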

How to stop referencing a data frame in a function

I want to build a function in such a way that, once I supply data = 'name of data frame', there is no need to write variable = data$variable; just writing the variable name from the supplied data frame will serve the purpose.
myfunction <- function(variable, data)
{
  result = sum(data)/sum(variable)
  return(result)
}
For example, I have a data frame df:
df<-data.frame(x=1:5,y=2:6,z=3:7,u=4:8)
I want to provide the following input:
myfunction(variable=x,data=df)
instead of the input below:
myfunction(variable=df$x,data=df)
We can use non-standard evaluation:
myfunction <- function(variable, data) {
  var <- eval(substitute(variable), data)
  result = sum(data)/sum(var)
  return(result)
}
# Test
myfunction(variable = x, data = df)
#[1] 6
The with or attach functions can also help here; see the ?with and ?attach documentation (a with() example follows the code below). Alternatively, you can supply the variable name as a character string and use it in the function body, i.e. you can do something like this:
myfunction2 <- function(variable, data) {
  result <- sum(data)/sum(data[[variable]])
  return(result)
}
df <- data.frame(x=1:5,y=2:6,z=3:7,u=4:8)
myfunction2("x", df)
#[1] 6
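As a small, hedged illustration of the with() route mentioned above: evaluating the call inside the data frame's scope lets the bare column name x resolve there, so even the question's original myfunction works unchanged.
with(df, myfunction(variable = x, data = df))
#[1] 6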
Yet another option is to use non-standard evaluation. A small example of this:
myfunction3 <- function(variable, data) {
  var.name <- deparse(substitute(variable))
  result <- sum(data)/sum(data[[var.name]])
  return(result)
}
myfunction3(variable = x, data = df)
#[1] 6

How to write an R function that can take either a vector or a formula as its first argument?

I'm writing a function that I want to be able to take either a vector or a formula as its first argument. If it is a vector, I do some single-variable calculations; if it is a formula, I analyze the first variable by the second variable (the second variable would always be a factor).
Here is my current code:
fun = function(formula, data) {
  if (class(with(data, formula)) == "formula") {
    mod = model.frame(formula, data)
    n.group = names(mod)[2]
    group <- eval(parse(text=paste("mod$", n.group, sep=""))) #x
    response <- model.response(mod) # y
    return(table(response, group))
  }
  else {
    return(table(with(data, formula)))
  }
}
data(iris)
fun(Sepal.Length~Species,iris) # works correctly
fun(Sepal.Length,iris) # returns an error
The return value is just for illustration.
Cheers!
Try this:
fun.formula <- function(formula, data) {
  mod = model.frame(formula, data)
  n.group <- names(mod)[2]
  group <- eval(parse(text=paste("mod$", n.group, sep=""))) #x
  response <- model.response(mod) # y
  table(response, group)
}
fun <- function(formula, data) {
  ret <- try(table(eval(substitute(formula), data)), silent = TRUE)
  if (inherits(ret, "try-error")) fun.formula(formula, data) else ret
}
# tests
fun(Sepal.Length ~ Species, iris)
fun(Sepal.Length, iris)
That said, this is a rather unusual interface. Instead, it might be better to handle the case where formula is a variable by passing its name as a character string, in which case a more usual S3 implementation is possible:
fun2 <- function(formula, data, ...) UseMethod("fun2")
fun2.formula <- fun.formula
fun2.character <- function(formula, data) table(data[[formula]])
# tests
fun2(Sepal.Length ~ Species, iris)
fun2("Sepal.Length", iris) # with this approach use a character string
REVISED: we now use try and have added an S3 approach.
Ideally, I would have solved this using an S3 approach, but I couldn't figure out how to do that. The following got the job done:
fun <- function(x, data) {
  mod = try(model.frame(x, data), silent = TRUE)
  if (inherits(mod, "try-error")) {
    x = data[, deparse(substitute(x))]
    return(table(x))
  }
  else {
    mod = model.frame(x, data)
    n.group = names(mod)[2]
    group <- eval(parse(text=paste("mod$", n.group, sep=""))) #x
    response <- model.response(mod) # y
    return(table(response, group))
  }
}
fun(Sepal.Length~Species,iris) # works correctly
fun(Sepal.Length,iris) # works!
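Another hedged sketch (an assumption, not from the original answers): inspect the unevaluated argument instead of relying on try(). If it is a bare name, treat it as a column; otherwise assume a formula. fun_v2 is an illustrative name.
fun_v2 <- function(x, data) {
  expr <- substitute(x)
  if (is.name(expr)) {                 # bare column name, e.g. Sepal.Length
    return(table(data[[deparse(expr)]]))
  }
  mod <- model.frame(x, data)          # otherwise expect a formula
  table(model.response(mod), mod[[2]])
}
fun_v2(Sepal.Length ~ Species, iris)
fun_v2(Sepal.Length, iris)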

I do not understand error "object not found" inside the function

I have roughly this function:
plot_pca_models <- function(models, id) {
  library(lattice)
  splom(models, groups=id)
}
and I'm calling it like this:
plot_pca_models(data.pca, log$id)
which results in this error:
Error in eval(expr, envir, enclos) : object 'id' not found
when I call it without the wrapping function:
splom(data.pca, groups=log$id)
it raises this error:
Error in log$id : object of type 'special' is not subsettable
but when I do this:
id <- log$id
splom(models, groups=id)
it behaves as expected.
Can anybody please explain why it behaves like this and how to correct it? Thanks.
btw:
I'm aware of similar questions here, eg:
Help understand the error in a function I defined in R
Object not found error with ddply inside a function
Object disappears from namespace in function
but none of them helped me.
edit:
As requested, here is the full plot_pca_models function:
plot_pca_models <- function(data, id, sel=c(1:4), comp=1) {
  # 'data' ... princomp objects
  # 'id'   ... list of samples id (classes)
  # 'sel'  ... list of models to compare
  # 'comp' ... which pca component to compare
  library(lattice)
  models <- c()
  models.size <- 1:length(data)
  for(model in models.size) {
    models <- c(models, list(data[[model]]$scores[,comp]))
  }
  names(models) <- 1:length(data)
  models <- do.call(cbind, models[sel])
  splom(models, groups=id)
}
edit2:
I've managed to make the problem reproducible.
require(lattice)
my.data <- data.frame(pca1 = rnorm(100), pca2 = rnorm(100), pca3 = rnorm(100))
my.id <- data.frame(id = sample(letters[1:4], 100, replace = TRUE))
plot_pca_models2 <- function(x, ajdi) {
  splom(x, group = ajdi)
}
plot_pca_models2(x = my.data, ajdi = my.id$id)
which produces the same error as above.
The problem is that splom evaluates its groups argument in a nonstandard way. A quick fix is to rewrite your function so that it constructs the call with the appropriate syntax:
f <- function(data, id)
  eval(substitute(splom(data, groups=.id), list(.id=id)))
# test it
ir <- iris[-5]
sp <- iris[, 5]
f(ir, sp)
log is a function in base R. Good practice is not to name objects after functions; it can create confusion. Type log$test into a clean R session and you'll see what's happening:
object of type 'special' is not subsettable
Here's a modification of Hong Oi's answer. First, I would recommend including id in the main data frame, i.e.
my.data <- data.frame(pca1 = rnorm(100), pca2 = rnorm(100), pca3 = rnorm(100), id = sample(letters[1:4], 100, replace = TRUE))
... and then:
plot_pca_models2 <- function(x, ajdi) {
  Call <- bquote(splom(x, group = x[[.(ajdi)]]))
  eval(Call)
}
plot_pca_models2(x = my.data, ajdi = "id")
The cause of the confusion is the following line in lattice:::splom.formula:
groups <- eval(substitute(groups), data, environment(formula))
... whose only point is to be able to specify groups without quotation marks, that is,
# instead of
splom(DATA, groups="ID")
# you can now be much shorter, thanks to eval and substitute:
splom(DATA, groups=ID)
But of course, this makes splom (and other functions that use "nonstandard evaluation", e.g. substitute) harder to use from within other functions, and goes against the philosophy that is "mostly" followed in the rest of R.
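For completeness, a minimal hedged sketch (an assumption, not from the original answers) of forwarding the caller's unevaluated groups expression from a wrapper, mirroring the eval/substitute trick splom itself uses; plot_pca_models3 is an illustrative name.
plot_pca_models3 <- function(x, ajdi) {
  # capture the wrapper's own call with the caller's expressions substituted in
  call <- substitute(lattice::splom(x, groups = ajdi))
  eval(call, envir = parent.frame())   # evaluate where the caller's objects live
}
plot_pca_models3(x = my.data, ajdi = my.id$id)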

Resources