I am building a wrapper around lm to do some additional calculations. I'd like the wrapper to pass ... to lm, but I am getting into trouble with lm's weights argument.
LmWrapper <- function(df, fmla, ...) {
est <- lm(fmla, df, ...)
list(model = est)
}
If I call the wrapper with a weights argument,
data(airquality)
LmWrapper(airquality, Ozone ~ Wind, weights = Temp)
R does not know where to look for the weights:
Error in eval(expr, envir, enclos) :
..1 used in an incorrect context, no ... to look in
The lm help page says
All of weights, subset and offset are evaluated in the same way as variables in formula, that is first in data and then in the environment of formula.
but the wrapper seems to change things.
How do I fix this?
The traceback() for the above error looks like this:
8: eval(expr, envir, enclos)
7: eval(extras, data, env)
6: model.frame.default(formula = fmla, data = df, weights = ..1,
drop.unused.levels = TRUE)
5: stats::model.frame(formula = fmla, data = df, weights = ..1,
drop.unused.levels = TRUE)
4: eval(expr, envir, enclos)
3: eval(mf, parent.frame())
2: lm(fmla, df, ...) at #2
1: LmWrapper(diamonds, price ~ carat, weights = depth)
Calling lm directly, works just fine:
lm(Ozone ~ Wind, airquality, weights = Temp)
So the problem is that lm normally looks up those names in argument data but somehow scoping goes wrong. You can fix that by looking up column references and passing them on manually.
LmWrapper <- function(df, fmla, ...) {
# get names of stuff in ...
argNames = sapply(substitute(list(...))[-1L], deparse)
# look for identical names in df
m = match(names(df), argNames, 0L)
# store other arguments from ... in a list
args = list(eval(parse(text = argNames[-m])))
# name the list
names(args) = names(argNames[-m])
# store complete values in args, instead of just references to columns
# the unlist code is rather ugly, the goal is to create a list where every
# element is a column of interest
args[names(argNames)[m]] = unlist(apply(df[, as.logical(m), drop = FALSE],
2, list), recursive = FALSE)
# also put other stuff in there
args$formula = fmla
args$data = df
# do lm
est = do.call(lm, args)
list(model = est)
}
data(airquality)
airquality$subset = airquality$Solar.R > 200
LmWrapper(airquality, Ozone ~ Wind, weights = Temp, subset = subset,
method = 'qr')
The code above is not the most beautiful, but it works for both subset and weights. Alternatively, you could just handle weights and subset as exceptions.
Thanks for this answer #Vandenman. I just implemented it with two changes that I wanted to share, in case someone else comes across this thread:
1) If there are no arguments in ... apart from columns in the data, the code as above creates an NA element in the list, which throws a warning - I added a condition below to get around that.
2) The model object returned by the code above has a very very long call since it includes each weight etc. If you don't care about the ability to use update(), it might be more insightful to replace it by the actual function call to record what happened, as implemented below.
run_lm <- function(df, formula, ...) {
# get names of stuff in ...
argNames = sapply(substitute(list(...))[-1L], deparse)
# look for identical names in df
m = match(names(df), argNames, 0L)
# store other arguments from ... in a list, if any
dot_args <- eval(parse(text = argNames[-m]))
if (is.null(dot_args)) {args <- list()
} else {
args <- list(dot_args)
# name the list
names(args) = names(argNames[-m])
}
# store complete values in args, instead of just references to columns
# the unlist code is rather ugly, the goal is to create a list where every
# element is a column of interest
args[names(argNames)[m]] = unlist(apply(df[, as.logical(m), drop = FALSE],
2, list), recursive = FALSE)
# also put other stuff in there
args$formula = formula
args$data = df
# do lm
mod <- do.call(lm, args)
mod$call <- sys.call()
mod
}
Related
This question already has answers here:
Grouping functions (tapply, by, aggregate) and the *apply family
(10 answers)
Closed 1 year ago.
I have a function I would like to keep everything fixed apart form a single argument.
ls <- score_model_compound(data, pred, tmp$Prediction, score= "log")
bs <- score_model_compound(data, pred, tmp$Prediction, score="Brier")
ss <- score_model_compound(data, pred, score="spherical")
what I would like is something like
ls = data.frame()
ls <- score_model_compound(data, pred, score= c("log", "Brier", "spherical"))
is there a function I can use, like apply(), which lets me do this?
Thank you
You can create some kind of wrapping function with only the first argument being the one you want to vary and then pass it to lapply:
## Creating the wrapping function
my.wrapping.function <- function(score, data, pred, tmp) {
return(score_model_compound(data = data,
pred = pred,
tmp = tmp,
score = score))
}
## Making the list of variables
my_variables <- as.list(c("log", "Brier", "spherical"))
## Running the function for all the variables (with the set specific arguments)
result_list <- lapply(my_variables,
my.wrapping.function,
data = data, pred = pred, tmp = tmp$Prediction)
And finally, to transform it into a data.frame (or matrix), you can use the do.call(cbind, ...) function on the results:
## Combining the results into a table
my_results_table <- do.call(cbind, result_list)
Does that answer your question?
mapply() to the rescue:
score_v = c('spherical', 'log', 'Brier')
l = mapply(
score_model_compound, # function
score = score_v, # variable argument vector
MoreArgs = list(data = data, # fixed arguments
pred = pred),
SIMPLIFY = FALSE # don't simplify
)
You probably have to tweak it a little yourself, since you didn't provide a reproducible example. mapply(SIMPLIFY = FALSE) will output you a list. If the function returns data.frame's the resulting list of data.frame's can subsequently be bound with e.g. data.table::rbindlidst().
Alternatively you could just use a loop:
l = list()
for (i in seq_along(score_v)) {
sc = score_v[i]
message('Modeling: ', sc)
smc = score_model_compound(data, pred, score = sc)
l[[i]] = smc
names(l)[i] = sc
}
I am using rpart() inside a function myFunction(). rpart() accepts several parameters which are handled using the missing() function:
rpart(formula, data, weights, subset, na.action = na.rpart, method, model = FALSE, x = FALSE, y = TRUE, parms, control, cost, ...)
For example, the parameter method can be left unspecified, and is handled inside rpart() using the following code:
if (missing(method)) method <- "whatever default"
How can I pass the argument method as a parameter for myFunction() in the most simple and efficient way so that it handles the default missing argument?
If I do something like
myFunction(foo = 0, method){# somecode; rpart(y ~ x, data = data, method = method)}
then this throws an error,
argument "method" is missing, with no default
I have also tried with functions like rlang::missing() with no success whatsoever.
Of course an option is doing something like passing myFunction(method = NULL) and then using if-else statements to either pass or not pass this argument, but then I have to code each possibility (for 4 arguments that would be 16 calls) and is very clumsy.
Note that I would also like to avoid using the ellipsis, as I want to specifically name my arguments.
MINIMAL REPRODUCIBLE EXAMPLE:
y <- c(0,0.1,0.1,-0.1, 100, 101, 99)
x <- c(1,2,3,4, 100,101,102)
myFunction <- function(x, y,
method,
weights,
subset,
parms){
rpart(formula = y ~ .,
data = data.frame(y, x),
weights = weights,
subset = subset,
parms = parms)
}
myFunction(x,y)
Error in eval(extras, data, env) : argument "weights" is missing,
with no default
Here's a solution using match.call. This kind of pattern is seen quite often inside base R functions.
Consider the following function which we might find inside a package, with optional arguments:
package_fun <- function(x, method1, method2, method3)
{
if(missing(method1)) method1 <- "Unspecified"
if(missing(method2)) method2 <- "Unspecified"
if(missing(method3)) method3 <- "Unspecified"
data.frame(x, method1, method2, method3)
}
Inside our own function, we can build a call to package_fun that swaps in our own optional parameters, swaps out any we don't want to pass, and adds any additional ones we choose. We are left with a single call to package_fun, and don't need to worry about combinatorial explosion:
myFunction <- function(foo = 0, method1, method2, method3)
{
mc <- match.call()
mc[[1]] <- quote(package_fun)
mc <- mc[-which(names(mc) == "foo")]
mc$x <- foo
eval(mc, env = parent.frame())
}
So now we can do:
myFunction(foo = 1, method1 = "Specified", method3 = "Specified")
#> x method1 method2 method3
#> 1 1 Specified Unspecified Specified
From the point of view of your reproducible example, this would look like:
myFunction <- function(x, y,
method,
weights,
subset,
parms){
mc <- match.call()
mc[[1]] <- quote(rpart)
mc$formula <- y ~ .
mc$data <- data.frame(y, x)
mc$x <- NULL
mc$y <- NULL
eval(mc, envir =parent.frame())
}
So we would have:
myFunction(x,y)
#> n= 7
#>
#> node), split, n, deviance, yval
#> * denotes terminal node
#>
#> 1) root 7 17136.31 42.87143 *
EDIT: Ok, it has something to do with the data.all.filtered datatype.
The filtered datatype gets created from data.all.raw which works fine with any lapply below. The weird thing is that I can't find out how do the two differ...
data.selectedFeatures <- sapply(data.train.raw, FUN = sf.getGoodFeaturesVector, treshold = 5)
data.train.filtered <- lapply(seq(1, 8), FUN = function(i) sf.filterFeatures(data.train.raw[[i]], data.selectedFeatures[[i]]))
st.testFeature <- function(featureVector, treshold) {
if(!is.numeric(featureVector)) {return(T)}
numberOfNonZero <- sum(featureVector > 0)
numberOfZero <- length(featureVector) - numberOfNonZero
return(min(numberOfNonZero, numberOfZero) >= treshold)
}
sf.getGoodFeaturesVector <- function(data, treshold) {
selectedFeatures <- sapply(data, FUN = st.testFeature, treshold <- treshold)
whitelistedFeatures <- names(data) %in% c("id", "tp")
return(selectedFeatures | whitelistedFeatures)
}
sf.filterFeatures <- function(data, selectedFeatures) {
return(data[, selectedFeatures])
}
Any idea what am I doing wrong when manipulating the data that causes subsequent lapply to not to work?
Original post:
I have a list of datasets called data.train.filtered and want to get a list of models (for predicting a feature called tp) trained by rplot on them. The easiest solution I could think of was using lapply but it doesn't work for some reason.
lapply(data.train.filtered, function(dta) rpart(tp ~ ., data = dta))
Error in terms.formula(formula, data = data) :
'.' in formula and no 'data' argument
The problem is probably not in the data as using it just for one (any) dataset works fine:
rpart(tp ~ ., data = data.train.filtered[[1]])
Even though accessing just one dataset via index works fine (as shown above) using lapply trough indexes fails just the same way the first example did.
lapply(1:8, function(i) rpart(tp ~ ., data = data.train.filtered[[i]]))
Error in terms.formula(formula, data = data) :
'.' in formula and no 'data' argument
The traceback for the index version is following:
10 terms.formula(formula, data = data)
9 terms(formula, data = data)
8 model.frame.default(formula = tp ~ ., data = data.train.filtered[[i]],
na.action = function (x)
{
Terms <- attr(x, "terms") ...
7 stats::model.frame(formula = tp ~ ., data = data.train.filtered[[i]],
na.action = function (x)
{
Terms <- attr(x, "terms") ...
6 eval(expr, envir, enclos)
5 eval(expr, p)
4 eval.parent(temp)
3 rpart(tp ~ ., data = data.train.filtered[[i]])
2 FUN(X[[i]], ...)
1 lapply(1:8, function(i) rpart(tp ~ ., data = data.train.filtered[[i]]))
I'm quite sure I'm missing something extremely trivial here but being quite new to R I just can't find the problem.
PS: I know that I could iterate trough all the datasets via for loop but that feels really dirty and I'd prefer an R idiomatic solution.
The trick is to use lapply() on the original list, not on an index vector. For example:
# toy data:
data.train.filtered <- list()
# create 10 different length data frames:
for(i in 1:10){
n <- rpois(1, 15)
x = rnorm(n)
data.train.filtered[[i]] <- data.frame(x =x,
tp = 3 + 2 * x + rnorm(n)
)
}
library(rpart)
lapply(data.train.filtered, function(dta){rpart(tp ~ ., data = dta)})
using data(iris) and purrr::map:
datas <- split(iris, rep(sample(c(1,2,3)), length.out = nrow(iris))
models <- purrr::map(datas, ~ rpart(Species ~ ., data = .x)) # a better syntax
Ok, I finally managed to find the answer. The problem was that data.train.all was actually not what I thought it was. I had an error in the filtering process which corrupted (silently, thanks R) everything.
The fix was to use:
data.selectedFeatures <- lapply(data.train.raw, FUN = sf.getGoodFeaturesVector, treshold = 5)
instead of
data.selectedFeatures <- sapply(data.train.raw, FUN = sf.getGoodFeaturesVector, treshold = 5)
Thanks for all the other answers, though.
Gross Loss is a column in DatasetOne but also is a column in DatasetTwo, Three, etc
Age is a column in DatasetOne, but also is a column in DatasetTwo, Three, etc
This is the function I used
ag <- function (x,y,z,d)
{
aggregate(x ~ y, FUN=z, data=d)
}
once i make this function, i do
sample <- ag(GrossLoss, Age, mean, DatasetOne)
It says "Error in eval(expr, envir, enclos) : object 'GrossLoss' not found"
which is fixed if i use attach(DatasetOne)
So i changed the function to
ag <- function (x,y,z,d)
{
attach(d)
aggregate(x ~ y, FUN=z, data=d)
detach(d)
}
But instead of creating a data, it creates a value/environement
How do i solve this? It seems like it is the "data=" bit that is not functioning correctly the same thing happens when i try to make a function for
plot (.., data=d)
as well.
We need to use paste and the formula
ag <- function (x,y,z,d){
aggregate(as.formula(paste(x, y, sep="~")), data=d, FUN = z)
}
and then call the function
ag('GrossLoss', 'Age', mean, DatasetOne)
You could also do this. match.call would evaluate all the arguments inside the function. Then you can eval to use the data frame, d.
ag <- function (x,y,z,d)
{
arguments <- as.list(match.call())
x = eval(arguments$x, d)
y = eval(arguments$y, d)
aggregate(x ~ y, FUN = z, d)
}
Then Call would be:
ag(x, y, mean, d)
I have a simple and a small data set for which I wish to apply a set of functions to each column or variable of the data frame using the sapply function. Below is the code from R Blogs
multi.sapply <- function(...) {
arglist <- match.call(expand.dots = FALSE)$...
var.names <- sapply(arglist, deparse)
has.name <- (names(arglist) != "")
var.names[has.name] <- names(arglist)[has.name]
arglist <- lapply(arglist, eval.parent, n = 2)
x <- arglist[[1]]
arglist[[1]] <- NULL
result <- sapply(arglist, function (FUN, x) sapply(x, FUN), x)
colnames(result) <- var.names[-1]
return(result)
}
Since I am a novice user of R, I would like to know, how can you modify the above code when the data has missing or NA values? So for example:
multi.sapply(mydata,mean, median, min, max)
Works fine but yields NA values for variables that has missing values
The following code however gives me the following error message:
multi.sapply(mydata,mean, median, valid.n, min, max, na.rm = TRUE)
Error in get(as.character(FUN), mode = "function", envir = envir) :
object 'FUN' of mode 'function' was not found
Your help would me much appreciated!