Gross Loss is a column in DatasetOne but also is a column in DatasetTwo, Three, etc
Age is a column in DatasetOne, but also is a column in DatasetTwo, Three, etc
This is the function I used
ag <- function (x,y,z,d)
{
aggregate(x ~ y, FUN=z, data=d)
}
once i make this function, i do
sample <- ag(GrossLoss, Age, mean, DatasetOne)
It says "Error in eval(expr, envir, enclos) : object 'GrossLoss' not found"
which is fixed if i use attach(DatasetOne)
So i changed the function to
ag <- function (x,y,z,d)
{
attach(d)
aggregate(x ~ y, FUN=z, data=d)
detach(d)
}
But instead of creating a data, it creates a value/environement
How do i solve this? It seems like it is the "data=" bit that is not functioning correctly the same thing happens when i try to make a function for
plot (.., data=d)
as well.
We need to use paste and the formula
ag <- function (x,y,z,d){
aggregate(as.formula(paste(x, y, sep="~")), data=d, FUN = z)
}
and then call the function
ag('GrossLoss', 'Age', mean, DatasetOne)
You could also do this. match.call would evaluate all the arguments inside the function. Then you can eval to use the data frame, d.
ag <- function (x,y,z,d)
{
arguments <- as.list(match.call())
x = eval(arguments$x, d)
y = eval(arguments$y, d)
aggregate(x ~ y, FUN = z, d)
}
Then Call would be:
ag(x, y, mean, d)
Related
I am using rpart() inside a function myFunction(). rpart() accepts several parameters which are handled using the missing() function:
rpart(formula, data, weights, subset, na.action = na.rpart, method, model = FALSE, x = FALSE, y = TRUE, parms, control, cost, ...)
For example, the parameter method can be left unspecified, and is handled inside rpart() using the following code:
if (missing(method)) method <- "whatever default"
How can I pass the argument method as a parameter for myFunction() in the most simple and efficient way so that it handles the default missing argument?
If I do something like
myFunction(foo = 0, method){# somecode; rpart(y ~ x, data = data, method = method)}
then this throws an error,
argument "method" is missing, with no default
I have also tried with functions like rlang::missing() with no success whatsoever.
Of course an option is doing something like passing myFunction(method = NULL) and then using if-else statements to either pass or not pass this argument, but then I have to code each possibility (for 4 arguments that would be 16 calls) and is very clumsy.
Note that I would also like to avoid using the ellipsis, as I want to specifically name my arguments.
MINIMAL REPRODUCIBLE EXAMPLE:
y <- c(0,0.1,0.1,-0.1, 100, 101, 99)
x <- c(1,2,3,4, 100,101,102)
myFunction <- function(x, y,
method,
weights,
subset,
parms){
rpart(formula = y ~ .,
data = data.frame(y, x),
weights = weights,
subset = subset,
parms = parms)
}
myFunction(x,y)
Error in eval(extras, data, env) : argument "weights" is missing,
with no default
Here's a solution using match.call. This kind of pattern is seen quite often inside base R functions.
Consider the following function which we might find inside a package, with optional arguments:
package_fun <- function(x, method1, method2, method3)
{
if(missing(method1)) method1 <- "Unspecified"
if(missing(method2)) method2 <- "Unspecified"
if(missing(method3)) method3 <- "Unspecified"
data.frame(x, method1, method2, method3)
}
Inside our own function, we can build a call to package_fun that swaps in our own optional parameters, swaps out any we don't want to pass, and adds any additional ones we choose. We are left with a single call to package_fun, and don't need to worry about combinatorial explosion:
myFunction <- function(foo = 0, method1, method2, method3)
{
mc <- match.call()
mc[[1]] <- quote(package_fun)
mc <- mc[-which(names(mc) == "foo")]
mc$x <- foo
eval(mc, env = parent.frame())
}
So now we can do:
myFunction(foo = 1, method1 = "Specified", method3 = "Specified")
#> x method1 method2 method3
#> 1 1 Specified Unspecified Specified
From the point of view of your reproducible example, this would look like:
myFunction <- function(x, y,
method,
weights,
subset,
parms){
mc <- match.call()
mc[[1]] <- quote(rpart)
mc$formula <- y ~ .
mc$data <- data.frame(y, x)
mc$x <- NULL
mc$y <- NULL
eval(mc, envir =parent.frame())
}
So we would have:
myFunction(x,y)
#> n= 7
#>
#> node), split, n, deviance, yval
#> * denotes terminal node
#>
#> 1) root 7 17136.31 42.87143 *
I am looking for an elegant (and safe!) way to evaluate an amended call in the parent frame. By "amended" I mean I modified the call in such a way that it refers to something not included in parent frame but in another frame. I guess one could also say: "send something up but only for evaluation".
It is clarified what I want by the example below which works in some circumstances, but not all. The update function (stats:::update.default) uses eval and I added the weights argument with something (res) that is not in the same environment as the evaluation takes place. So I used get("res", pos = -1L) and I hope it is a safe way to refer to the environment res lives in. For models estimated with a variable as formula, both defined methods fail:
mod <- lm(mpg ~ cyl, data = mtcars)
form <- mpg ~ cyl
mod2 <- lm(form, data = mtcars)
wls1 <- function(x) {
res <- residuals(x)^2 # example
result <- update(x, weights = 1/get("res", pos = -1L))
return(result)
}
wls2 <- function(x) {
res <- residuals(x)^2 # example
result <- update(x, weights = 1/res)
return(result)
}
wls3 <- function(x) {
data(ChickWeight)
ChickWeight$cyl <- ChickWeight$weight
ChickWeight$mpg <- ChickWeight$Time
result <- update(x, data = ChickWeight)
return(result)
}
wls1(mod) # works
wls1(mod2) # errors
wls2(mod) # works
wls2(mod2) # erros
wls3(mod) # works
wls3(mod2) # works
How can this be solved in general in a safe way?
I was looking for a function that gives the current environment (something like a fictious this.environment() function) so avoid the pos argument and use the envir of get (I know I can create my own temporary environment and have res associated to it to use something like envir = my.eny).
We can do this by creating a quoted 'language' object for the formula and then update the call of the model
form <- quote(mpg ~ cyl)
wlsN <- function(x, formula) {
x$call$formula <- formula
res <- residuals(x)^2
update(x, weights = 1/res) # it is in the same environment. No need for get
}
wlsN(mod2, form)
#Call:
#lm(formula = mpg ~ cyl, data = mtcars, weights = 1/res)
#Coefficients:
#(Intercept) cyl
# 37.705 -2.841
-checking with other formula
form1 <- quote(disp ~ cyl + vs)
form2 <- quote(mpq ~ gear + carb)
mod1 <- lm(form1, data = mtcars)
mod2 <- lm(form2, data = mtcars)
wlsN(mod1, form1) # works
wlsN(mod2, form2) # works
It's hard to work around the fact that R looks for the value of weights in either data or the environment of the formula - which in the case of the variable named form in your example, is the global environment.
An alternative that riffs on the same theme as akrun's answer:
wls3 <- function(x) {
environment(x$call$formula) <- environment()
res <- residuals(x)^2
result <- update(x, weights=1/res)
}
I can see how this could get ugly in less trivial uses of this workaround such as when the formula of x already has an environment that does not enclose (potentially wrong use of the term) the environment in the call to wls3().
Another alternative (not recommended) is to use assign, e.g.
wls4 <- function(x) {
assign('res', residuals(x)^2, envir=environment(formula(x)))
result <- update(x, weights=1/res)
}
however this has the unintended consequence of leaving the variable res in the global environment.
EDIT: Ok, it has something to do with the data.all.filtered datatype.
The filtered datatype gets created from data.all.raw which works fine with any lapply below. The weird thing is that I can't find out how do the two differ...
data.selectedFeatures <- sapply(data.train.raw, FUN = sf.getGoodFeaturesVector, treshold = 5)
data.train.filtered <- lapply(seq(1, 8), FUN = function(i) sf.filterFeatures(data.train.raw[[i]], data.selectedFeatures[[i]]))
st.testFeature <- function(featureVector, treshold) {
if(!is.numeric(featureVector)) {return(T)}
numberOfNonZero <- sum(featureVector > 0)
numberOfZero <- length(featureVector) - numberOfNonZero
return(min(numberOfNonZero, numberOfZero) >= treshold)
}
sf.getGoodFeaturesVector <- function(data, treshold) {
selectedFeatures <- sapply(data, FUN = st.testFeature, treshold <- treshold)
whitelistedFeatures <- names(data) %in% c("id", "tp")
return(selectedFeatures | whitelistedFeatures)
}
sf.filterFeatures <- function(data, selectedFeatures) {
return(data[, selectedFeatures])
}
Any idea what am I doing wrong when manipulating the data that causes subsequent lapply to not to work?
Original post:
I have a list of datasets called data.train.filtered and want to get a list of models (for predicting a feature called tp) trained by rplot on them. The easiest solution I could think of was using lapply but it doesn't work for some reason.
lapply(data.train.filtered, function(dta) rpart(tp ~ ., data = dta))
Error in terms.formula(formula, data = data) :
'.' in formula and no 'data' argument
The problem is probably not in the data as using it just for one (any) dataset works fine:
rpart(tp ~ ., data = data.train.filtered[[1]])
Even though accessing just one dataset via index works fine (as shown above) using lapply trough indexes fails just the same way the first example did.
lapply(1:8, function(i) rpart(tp ~ ., data = data.train.filtered[[i]]))
Error in terms.formula(formula, data = data) :
'.' in formula and no 'data' argument
The traceback for the index version is following:
10 terms.formula(formula, data = data)
9 terms(formula, data = data)
8 model.frame.default(formula = tp ~ ., data = data.train.filtered[[i]],
na.action = function (x)
{
Terms <- attr(x, "terms") ...
7 stats::model.frame(formula = tp ~ ., data = data.train.filtered[[i]],
na.action = function (x)
{
Terms <- attr(x, "terms") ...
6 eval(expr, envir, enclos)
5 eval(expr, p)
4 eval.parent(temp)
3 rpart(tp ~ ., data = data.train.filtered[[i]])
2 FUN(X[[i]], ...)
1 lapply(1:8, function(i) rpart(tp ~ ., data = data.train.filtered[[i]]))
I'm quite sure I'm missing something extremely trivial here but being quite new to R I just can't find the problem.
PS: I know that I could iterate trough all the datasets via for loop but that feels really dirty and I'd prefer an R idiomatic solution.
The trick is to use lapply() on the original list, not on an index vector. For example:
# toy data:
data.train.filtered <- list()
# create 10 different length data frames:
for(i in 1:10){
n <- rpois(1, 15)
x = rnorm(n)
data.train.filtered[[i]] <- data.frame(x =x,
tp = 3 + 2 * x + rnorm(n)
)
}
library(rpart)
lapply(data.train.filtered, function(dta){rpart(tp ~ ., data = dta)})
using data(iris) and purrr::map:
datas <- split(iris, rep(sample(c(1,2,3)), length.out = nrow(iris))
models <- purrr::map(datas, ~ rpart(Species ~ ., data = .x)) # a better syntax
Ok, I finally managed to find the answer. The problem was that data.train.all was actually not what I thought it was. I had an error in the filtering process which corrupted (silently, thanks R) everything.
The fix was to use:
data.selectedFeatures <- lapply(data.train.raw, FUN = sf.getGoodFeaturesVector, treshold = 5)
instead of
data.selectedFeatures <- sapply(data.train.raw, FUN = sf.getGoodFeaturesVector, treshold = 5)
Thanks for all the other answers, though.
I am building a wrapper around lm to do some additional calculations. I'd like the wrapper to pass ... to lm, but I am getting into trouble with lm's weights argument.
LmWrapper <- function(df, fmla, ...) {
est <- lm(fmla, df, ...)
list(model = est)
}
If I call the wrapper with a weights argument,
data(airquality)
LmWrapper(airquality, Ozone ~ Wind, weights = Temp)
R does not know where to look for the weights:
Error in eval(expr, envir, enclos) :
..1 used in an incorrect context, no ... to look in
The lm help page says
All of weights, subset and offset are evaluated in the same way as variables in formula, that is first in data and then in the environment of formula.
but the wrapper seems to change things.
How do I fix this?
The traceback() for the above error looks like this:
8: eval(expr, envir, enclos)
7: eval(extras, data, env)
6: model.frame.default(formula = fmla, data = df, weights = ..1,
drop.unused.levels = TRUE)
5: stats::model.frame(formula = fmla, data = df, weights = ..1,
drop.unused.levels = TRUE)
4: eval(expr, envir, enclos)
3: eval(mf, parent.frame())
2: lm(fmla, df, ...) at #2
1: LmWrapper(diamonds, price ~ carat, weights = depth)
Calling lm directly, works just fine:
lm(Ozone ~ Wind, airquality, weights = Temp)
So the problem is that lm normally looks up those names in argument data but somehow scoping goes wrong. You can fix that by looking up column references and passing them on manually.
LmWrapper <- function(df, fmla, ...) {
# get names of stuff in ...
argNames = sapply(substitute(list(...))[-1L], deparse)
# look for identical names in df
m = match(names(df), argNames, 0L)
# store other arguments from ... in a list
args = list(eval(parse(text = argNames[-m])))
# name the list
names(args) = names(argNames[-m])
# store complete values in args, instead of just references to columns
# the unlist code is rather ugly, the goal is to create a list where every
# element is a column of interest
args[names(argNames)[m]] = unlist(apply(df[, as.logical(m), drop = FALSE],
2, list), recursive = FALSE)
# also put other stuff in there
args$formula = fmla
args$data = df
# do lm
est = do.call(lm, args)
list(model = est)
}
data(airquality)
airquality$subset = airquality$Solar.R > 200
LmWrapper(airquality, Ozone ~ Wind, weights = Temp, subset = subset,
method = 'qr')
The code above is not the most beautiful, but it works for both subset and weights. Alternatively, you could just handle weights and subset as exceptions.
Thanks for this answer #Vandenman. I just implemented it with two changes that I wanted to share, in case someone else comes across this thread:
1) If there are no arguments in ... apart from columns in the data, the code as above creates an NA element in the list, which throws a warning - I added a condition below to get around that.
2) The model object returned by the code above has a very very long call since it includes each weight etc. If you don't care about the ability to use update(), it might be more insightful to replace it by the actual function call to record what happened, as implemented below.
run_lm <- function(df, formula, ...) {
# get names of stuff in ...
argNames = sapply(substitute(list(...))[-1L], deparse)
# look for identical names in df
m = match(names(df), argNames, 0L)
# store other arguments from ... in a list, if any
dot_args <- eval(parse(text = argNames[-m]))
if (is.null(dot_args)) {args <- list()
} else {
args <- list(dot_args)
# name the list
names(args) = names(argNames[-m])
}
# store complete values in args, instead of just references to columns
# the unlist code is rather ugly, the goal is to create a list where every
# element is a column of interest
args[names(argNames)[m]] = unlist(apply(df[, as.logical(m), drop = FALSE],
2, list), recursive = FALSE)
# also put other stuff in there
args$formula = formula
args$data = df
# do lm
mod <- do.call(lm, args)
mod$call <- sys.call()
mod
}
I have a function that I use to get a "quick look" at a data.frame... I deal with a lot of survey data and this acts as a quick tool to see what's what.
f.table <- function(x) {
if (is.factor(x[[1]])) {
frequency <- function(x) {
x <- round(length(x)/n, digits=2)
}
x <- na.omit(melt(x,c()))
x <- cast(x, variable ~ value, frequency)
x <- cbind(x,top2=x[,ncol(x)]+x[,ncol(x)-1], bottom=x[,2])
}
if (is.numeric(x[[1]])) {
frequency <- function(x) {
x[x > 1] <- 1
x[is.na(x)] <- 0
x <- round(sum(x)/n, digits=2)
}
x <- na.omit(melt(x))
x <- cast(x, variable ~ ., c(frequency, mean, sd, min, max))
x <- transform(x, variable=reorder(variable, frequency))
}
return(x)
}
What I find happens is that if I don't define "frequency" outside of the function, it returns wonky results for data frames with continuous variables. It doesn't seem to matter which definition I use outside of the function, so long as I do.
try:
n <- 100
x <- data.frame(a=c(1:25),b=rnorm(100),c=rnorm(100))
x[x > 20] <- NA
Now, select either one of the frequency functions and paste them in and try it again:
frequency <- function(x) {
x <- round(length(x)/n, digits=2)
}
f.table(x)
Why is that?
Crucially, I think this is where your problem is. cast() is evaluating those functions without reference to the function it was called from. Inside cast() it evaluates fun.aggregate via funstofun and, although I don't really follow what it is doing, is getting stats:::frequency and not your local one.
Hence my comment to your Q. What do you wan the function to do? At the moment it would seem necessary to define a "frequency" function in the global environment so that cast() or funstofun() finds it. Give it a unique name so it is unlikely to clash with anything so it should be the only thing found, say .Frequency(). Without knowing what you want to do with the function (rather than what you thought the function [f.table] should do) it is a bit difficult to provide further guidance, but why not have .FrequencyNum() and .FrequencyFac() defined in the global workspace and rewrite your f.table() wrapper calls to cast to use the relevant one?
.FrequencyFac <- function(X, N) {
round(length(X)/N, digits=2)
}
.FrequencyNum <- function(X, N) {
X[X > 1] <- 1
X[is.na(X)] <- 0
round(sum(X)/N, digits=2)
}
f.table <- function(x, N) {
if (is.factor(x[[1]])) {
x <- na.omit(melt(x, c()))
x <- dcast(x, variable ~ value, .FrequencyFac, N = N)
x <- cbind(x,top2=x[,ncol(x)]+x[,ncol(x)-1], bottom=x[,2])
}
if (is.numeric(x[[1]])) {
x <- na.omit(melt(x))
x <- cast(x, variable ~ ., c(.FrequencyNum, mean, sd, min, max), N = N)
##x <- transform(x, variable=reorder(variable, frequency))
## left this out as I wanted to see what cast returned
}
return(x)
}
Which I thought would work, but it is not finding N, and it should be. So perhaps I am missing something here?
By the way, it is probably not a good idea to rely on function that find n (in your version) from outside the function. Always pass in the variables you need as arguments.
I don't have the package that contains melt, but there are a couple potential issues I can see:
Your frequency functions do not return anything.
It's generally bad practice to alter function inputs (x is the input and the output).
There is already a generic frequency function in stats package in base R, which may cause issues with method dispatch (I'm not sure).