I'm using the mtcars dataset in R. I have a list of data frames (mtcars dataset split into number of cylinders). I need to:
Identify the car with the min value for miles per gallon (mpg) within each cylinder type (i.e. 4,6,8).
Create a vector that stores the values of horsepower (hp) for each of the cars found in step 1 (the length of the vector will be 3).
Steps I have performed so far, as follows:
# load the data
data(mtcars)
# split cars data.frame into a list of data frames by cylinder
cars <- split(mtcars, mtcars$cyl)
# find the position within each data frame for the min values of mpg (i.e. first
# column)
positions <- sapply(cars,function(x) which.min(x[,1]))
As I see it, the next step would be to loop over each data frame to find the horsepower value at each position. I tried writing a for loop for this, but I haven't been able to make it work. Maybe there's an even better solution to this problem.
You don't need to split the data and then use sapply. There are many more efficient ways to get that output. Here's a possible data.table solution:
mtcars$Cars <- rownames(mtcars)
library(data.table)
data.table(mtcars)[, list(Car = Cars[which.min(mpg)],
HP = hp[which.min(mpg)]),
by = cyl]
# cyl Car HP
# 1: 6 Merc 280C 123
# 2: 4 Volvo 142E 109
# 3: 8 Cadillac Fleetwood 205
Or maybe using dplyr
library(dplyr)
mtcars %>%
mutate(Cars = rownames(mtcars)) %>%
group_by(cyl) %>%
summarize(Car = Cars[which.min(mpg)], HP = hp[which.min(mpg)])
# Source: local data frame [3 x 3]
#
# cyl Car HP
# 1 4 Volvo 142E 109
# 2 6 Merc 280C 123
# 3 8 Cadillac Fleetwood 205
From the pre-split cars set, you can do it this way with Map and Reduce.
Reduce(rbind,
       Map(function(x) x[which.min(x$mpg), "hp", drop = FALSE],
           cars, USE.NAMES = FALSE)
)
#                     hp
# Volvo 142E         109
# Merc 280C          123
# Cadillac Fleetwood 205
If you want a vector, assign the result of the code above to a variable, say rr, and do
setNames(rr[,1], rownames(rr))
# Volvo 142E Merc 280C Cadillac Fleetwood
# 109 123 205
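Staying closer to the question's own approach, the hp values can also be pulled straight from the split list in base R. This is only a sketch; hp_by_position and hp_direct are names made up for the illustration.

```r
data(mtcars)
cars <- split(mtcars, mtcars$cyl)
positions <- sapply(cars, function(x) which.min(x$mpg))

# Pair each data frame with its position and pull out hp at that row
hp_by_position <- mapply(function(df, i) df$hp[i], cars, positions)

# Or skip the intermediate positions vector entirely
hp_direct <- vapply(cars, function(x) x$hp[which.min(x$mpg)], numeric(1))

hp_by_position
#   4   6   8
# 109 123 205
```

Both results are named by cylinder count, so unname() gives the bare length-3 vector if that is what you need.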
This is really easy if you use the plyr library. Here ya go:
library(plyr)
data(mtcars)
mpMins <- ddply(mtcars, .(cyl),summarize, min = min(mpg), .drop = FALSE)
mpMins
cyl min
1 4 21.4
2 6 17.8
3 8 10.4
This only gives you the minimum mpg, though; you want the horsepower too:
hpMins <- (merge(mpMins, mtcars, by.x = c("min","cyl"), by.y = c("mpg","cyl" )))$hp
hpMins
[1] 205 215 123 109
Strange, there are four values. You said you wanted three. If you go back and check the data, though, there are two cars with the minimum value of 10.4 in the 8 cylinder category. Remember to be careful when going from summary values (like minimums) back to individual observations.
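If you actually want every tied row, a base R sketch with ave() shows them directly. The names group_min and tied are only for illustration.

```r
data(mtcars)

# ave() returns, for every row, the minimum mpg of that row's cylinder group
group_min <- ave(mtcars$mpg, mtcars$cyl, FUN = min)

# Keep every row that sits at its group's minimum, ties included
tied <- mtcars[mtcars$mpg == group_min, c("cyl", "mpg", "hp")]
tied
table(tied$cyl)
```

The 8 cylinder group contributes two rows here, which is where the fourth horsepower value comes from.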
I want to randomly sample a dataset. If I already have that dataset loaded, I can do something like this:
library(dplyr)
set.seed(-1)
mtcars %>% slice_sample(n = 3)
# mpg cyl disp hp drat wt qsec vs am gear carb
# AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.3 0 0 3 2
# Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
# Merc 240D 24.4 4 146.7 62 3.69 3.190 20.0 1 0 4 2
But my dataset is stored as a parquet file. As an example, I'll create a parquet from mtcars:
library(arrow)
# Create parquet file
write_dataset(mtcars, "~/mtcars", format = "parquet")
open_dataset("~/mtcars") %>%
slice_sample(n = 3) %>%
collect()
# Error in UseMethod("slice_sample") :
# no applicable method for 'slice_sample' applied to an object of class "c('FileSystemDataset', 'Dataset', 'ArrowObject', 'R6')"
Clearly, slice_sample isn't implemented for parquet files and neither is slice:
open_dataset("~/mtcars") %>% nrow() -> n
subsample <- sample(1:n, 3)
open_dataset("~/mtcars") %>%
slice(subsample) %>%
collect()
# Error in UseMethod("slice") :
# no applicable method for 'slice' applied to an object of class "c('FileSystemDataset', 'Dataset', 'ArrowObject', 'R6')"
Now, I know filter is implemented, so I tried that:
open_dataset("~/mtcars") %>%
filter(row_number() %in% subsample) %>%
collect()
# Error: Filter expression not supported for Arrow Datasets: row_number() %in% subsample
# Call collect() first to pull data into R.
(This also doesn't work if I create a filtering vector first, e.g., foo <- rep(FALSE, n); foo[subsample] <- TRUE and use that in filter.)
This error offers some helpful advice, though: collect the data and then subsample. The issue is that the file is ginormous. So much so, that it crashes my session.
Question: is there a way to randomly subsample a parquet file before loading it with collect?
It turns out that there is an example in the documentation that pretty much fulfils my goal. That example is a smidge dated, as it uses sample_frac, which has been superseded, rather than slice_sample, but the general principle holds, so I've updated it here. As I don't know how many batches there will be, I show how it can be done with proportions, like Pace suggested, instead of pulling a fixed number of rows.
One issue with this approach is that (as far as I understand) it still requires the entire dataset to be read in; it just does so in batches rather than in one go.
open_dataset("~/mtcars") %>%
map_batches(~ as_record_batch(slice_sample(as.data.frame(.), prop = 0.1))) %>%
collect()
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
# 2 14.3 8 360 245 3.21 3.570 15.84 0 0 3 4
# 3 15.8 8 351 264 4.22 3.170 14.50 0 1 5 4
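To make the per-batch idea concrete without arrow, here is a base R sketch that fakes the batches by splitting mtcars into chunks. The batch size of 10 and the helper sample_batch are assumptions for illustration only; real batch sizes depend on how the file was written.

```r
data(mtcars)
set.seed(1)

batch_size <- 10  # hypothetical; not something arrow guarantees
batch_id <- ceiling(seq_len(nrow(mtcars)) / batch_size)
batches <- split(mtcars, batch_id)  # batches of 10, 10, 10 and 2 rows

# Sample a fixed proportion of rows from one batch, rounding down
sample_batch <- function(b, prop) {
  b[sample(nrow(b), floor(nrow(b) * prop)), , drop = FALSE]
}

sampled <- do.call(rbind, lapply(batches, sample_batch, prop = 0.5))
nrow(sampled)  # 5 + 5 + 5 + 1 = 16 rows
```

In this sketch a very small batch combined with a small proportion contributes no rows at all (the 2-row batch yields nothing at prop = 0.1), so the overall fraction is only approximate.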
I have a table temp1 that has 2 columns "Hospital.Name" and "heart attack" and 1 variable called "colname"
colname <- "heart attack"
Hospital.Name heart attack
ROUND ROCK MEDICAL CENTER 14.9
CYPRESS FAIRBANKS MEDICAL CENTER 12.0
I am trying to fetch the record with the lowest "heart attack" number, but my formula returns nothing. This is what I have:
temp1[which(temp1[[colname1]] == min(as.numeric(temp1[[colname1]]))),]
[1] Hospital.Name heart attack
<0 rows> (or 0-length row.names)
It returns no rows, but I know the right-hand side of the comparison is correct, because when I run
min(as.numeric(temp1[[colname1]]))
[1] 12
I get the min result of the "heart attack" column
Please help me with my formula:
temp1[which(temp1[[colname1]] == min(as.numeric(temp1[[colname1]]))),]
If I understood you correctly then you want all the information against a row for which one of the variables has minimum value.
You can try which.min if this is what you want to do.
Using the mtcars data set that ships with R:
mtcars[which.min(mtcars$mpg),]
The line above fetches the record (row) that has the minimum value of the mpg field in mtcars.
#> mtcars[which.min(mtcars$mpg),]
# mpg cyl disp hp drat wt qsec vs am gear carb
#Cadillac Fleetwood 10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
Now if you use which the way you did in your code (here with colname1 <- "mpg"), you get something like this:
mtcars[which(mtcars[[colname1]] == min(mtcars[[colname1]])),]
This will produce two records like below:
#> mtcars[which(mtcars[[colname1]] == min(mtcars[[colname1]])),]
# mpg cyl disp hp drat wt qsec vs am gear carb
#Cadillac Fleetwood 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4
#Lincoln Continental 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4
Moral of the story: which.min returns only the first position of the minimum, whereas which with an equality test returns every matching position when several records share the same minimum value.
From Documentation:
Determines the location, i.e., index of the (first) minimum or maximum
of a numeric (or logical) vector.
In your case it might be something like:
temp1[which.min(temp1[,colname]) ,]
If the column is not numeric, then rather than doing a lot of things in one step, break it up for simplicity.
temp1[,colname] <- as.numeric(temp1[,colname]) ##numeric conversion
temp1[which.min(temp1[,colname]) ,]
where colname = "heart attack", as per your question.
If you use the code below you can get multiple records. It also seems you wrote the right code; it is not working because of a typo: you define colname but use colname1.
temp1[which(temp1[[colname]] == min(temp1[[colname]])),]
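Here is a small reproducible sketch of the whole thing. The toy temp1 below is my reconstruction, and storing the numbers as text is purely an assumption on my part; if your column really is text, that alone would also explain the zero-row result, independent of the typo.

```r
# Toy reconstruction of temp1; storing the numbers as text is an assumption
temp1 <- data.frame(
  Hospital.Name = c("ROUND ROCK MEDICAL CENTER",
                    "CYPRESS FAIRBANKS MEDICAL CENTER"),
  `heart attack` = c("14.9", "12.0"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)
colname <- "heart attack"

# Comparing text with a number converts the number to text first,
# so "12.0" is compared with "12" and nothing matches
no_match <- temp1[which(temp1[[colname]] == min(as.numeric(temp1[[colname]]))), ]
nrow(no_match)  # 0

# Converting once up front makes both approaches behave
temp1[[colname]] <- as.numeric(temp1[[colname]])
lowest <- temp1[which.min(temp1[[colname]]), ]
lowest$Hospital.Name  # "CYPRESS FAIRBANKS MEDICAL CENTER"
```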
I've got a mixed data set (categorical and continuous variables) and I'd like to do hierarchical clustering using Gower distance.
I base my code on an example from https://www.r-bloggers.com/hierarchical-clustering-in-r-2/, which uses base R dist() for Euclidean distance. Since dist() doesn't compute Gower distance, I've tried using philentropy::distance() to compute it but it doesn't work.
Thanks for any help!
# Data
data("mtcars")
mtcars$cyl <- as.factor(mtcars$cyl)
# Hierarchical clustering with Euclidean distance - works
clusters <- hclust(dist(mtcars[, 1:2]))
plot(clusters)
# Hierarchical clustering with Gower distance - doesn't work
library(philentropy)
clusters <- hclust(distance(mtcars[, 1:2], method = "gower"))
plot(clusters)
The error is in the distance function itself.
I don't know if it's intentional or not, but the current implementation of philentropy::distance with the "gower" method cannot handle any mixed data types, since the first operation is to transpose the data.frame, producing a character matrix which then throws the typing error when passed to the DistMatrixWithoutUnit function.
You might try using the daisy function from cluster instead.
library(cluster)
x <- mtcars[,1:2]
x$cyl <- as.factor(x$cyl)
dist <- daisy(x, metric = "gower")
cls <- hclust(dist)
plot(cls)
EDIT: For future reference, it seems philentropy will be updated to include better type handling in the next version. From the vignette:
In future versions of philentropy I will optimize the distance()
function so that internal checks for data type correctness and correct
input data will take less termination time than the base dist()
function.
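If it helps to see what a Gower distance computes on mixed columns, here is a hand-rolled base R sketch of the textbook definition: range-scaled absolute differences for numeric columns, simple mismatch for factors, averaged over columns. gower_manual is a name I made up, and the sketch assumes no missing values and equal weights; it is not the code of daisy or any other package.

```r
data(mtcars)
x <- mtcars[, c("mpg", "cyl")]
x$cyl <- factor(x$cyl)

gower_manual <- function(df) {
  n <- nrow(df)
  total <- matrix(0, n, n, dimnames = list(rownames(df), rownames(df)))
  for (col in names(df)) {
    v <- df[[col]]
    if (is.numeric(v)) {
      # numeric column: absolute difference scaled by the column's range
      part <- abs(outer(v, v, "-")) / diff(range(v))
    } else {
      # categorical column: 0 if the levels match, 1 otherwise
      part <- outer(as.character(v), as.character(v), "!=") * 1
    }
    total <- total + part
  }
  # average the per-column contributions
  as.dist(total / ncol(df))
}

d <- gower_manual(x)
cls <- hclust(d)
groups <- cutree(cls, k = 3)  # k = 3 is an arbitrary choice for the example
```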
Sorry, my English isn't good enough to explain this properly, so here is my attempt.
The code works, though ;-)
library(philentropy)
clusters <- hclust(
as.dist(
distance(mtcars[, 1:2], method = "gower")))
plot(clusters)
Good luck
You can do it pretty efficiently with the gower package
library(gower)
d <- sapply(1:nrow(mtcars), function(i) gower_dist(mtcars[i,],mtcars))
d <- as.dist(d)
h <- hclust(d)
plot(h)
Many thanks for this great question and thanks to all of you who provided excellent answers.
Just to resolve the issue for future readers:
# import example data
data("mtcars")
# store example subset with correct data type
mtcars_subset <- tibble::tibble(mpg = as.numeric(as.vector(mtcars$mpg)),
cyl = as.numeric(as.vector(mtcars$cyl)),
disp = as.numeric(as.vector(mtcars$disp)))
# transpose data.frame to conform to the philentropy input format
mtcars_subset <- t(mtcars_subset)
# cluster
clusters <- hclust(as.dist(philentropy::distance(mtcars_subset, method = "gower")))
plot(clusters)
# When using the developer version on GitHub you can also specify 'use.row.names = TRUE'
clusters <- hclust(as.dist(philentropy::distance(mtcars_subset, method = "gower",
use.row.names = TRUE)))
plot(clusters)
As you can see, clustering works perfectly fine now.
The problem is that in the example dataset the column cyl stores factor values and not double values as required by the philentropy::distance() function. Since the underlying code is written in Rcpp, non-conforming data types will cause problems. As noted correctly by Esther, I will implement a better way to check type safety in future versions of the package.
head(tibble::as_tibble(mtcars))
# A tibble: 6 x 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
To overcome this limitation, I stored the columns of interest from the mtcars dataset in a separate data.frame/tibble and converted all columns to double values via as.numeric(as.vector(mtcars$mpg)).
The resulting subset data.frame now stores only double values as required.
mtcars_subset
# A tibble: 32 x 3
mpg cyl disp
<dbl> <dbl> <dbl>
1 21 6 160
2 21 6 160
3 22.8 4 108
4 21.4 6 258
5 18.7 8 360
6 18.1 6 225
7 14.3 8 360
8 24.4 4 147.
9 22.8 4 141.
10 19.2 6 168.
# … with 22 more rows
Please also note that if you provide the philentropy::distance() function only 2 input vectors, then only one distance value will be returned and the hclust() function won't be able to compute any clusters with one value. Hence, I added a third column disp to enable visualization of the clusters.
I hope this helps.
One of the quirks of subsetting a data frame is that you have to repeatedly type the name of that data frame when mentioning columns. For example, the data frame cars is mentioned 3 times here:
cars[cars$speed == 4 & cars$dist < 10, ]
## speed dist
## 1 4 2
The data.table package solves this.
library(data.table)
dt_cars <- as.data.table(cars)
dt_cars[speed == 4 & dist < 10]
As does dplyr.
library(dplyr)
cars %>% filter(speed == 4, dist < 10)
I'd like to know if a solution exists for standard-issue data.frames (that is, not resorting to data.table or dplyr).
I think I'm looking for something like
cars[MAGIC(speed == 4 & dist < 10), ]
or
MAGIC(cars[speed == 4 & dist < 10, ])
where MAGIC is to be determined.
I tried the following, but it gave me an error.
library(rlang)
cars[locally(speed == 4 & dist < 10), ]
# Error in locally(speed == 4 & dist < 10) : object 'speed' not found
1) subset This only requires that cars be mentioned once. No packages are used.
subset(cars, speed == 4 & dist < 10)
## speed dist
## 1 4 2
2) sqldf This uses a package but does not use dplyr or data.table which were the only two packages excluded by the question:
library(sqldf)
sqldf("select * from cars where speed = 4 and dist < 10")
## speed dist
## 1 4 2
3) assignment Not sure if this counts but you could assign cars to some other variable name such as . and then use that. In that case cars would only be mentioned once. This uses no packages.
. <- cars
.[.$speed == 4 & .$dist < 10, ]
## speed dist
## 1 4 2
or
. <- cars
with(., .[speed == 4 & dist < 10, ])
## speed dist
## 1 4 2
With respect to these two solutions you might want to check out this article on the Bizarro Pipe: http://www.win-vector.com/blog/2017/01/using-the-bizarro-pipe-to-debug-magrittr-pipelines-in-r/
4) magrittr This could also be expressed in magrittr and that package was not excluded by the question. Note we are using the magrittr %$% operator:
library(magrittr)
cars %$% .[speed == 4 & dist < 10, ]
## speed dist
## 1 4 2
subset is the base function which solves this problem. However, like all base R functions which use non-standard evaluation, subset does not perform fully hygienic code expansion, so subset() evaluates the wrong variable when used within non-global scopes (such as in lapply loops).
As an example, here we define the variable var in two places, first in the global scope with value 40, then in a local scope with value 30. The use of local() here is for simplicity, however this would behave equivalently inside a function. Intuitively, we would expect subset to use the value 30 in the evaluation. However upon executing the following code we see instead the value 40 is used (so no rows are returned).
var <- 40
local({
var <- 30
dfs <- list(mtcars, mtcars)
lapply(dfs, subset, mpg > var)
})
#> [[1]]
#> [1] mpg cyl disp hp drat wt qsec vs am gear carb
#> <0 rows> (or 0-length row.names)
#>
#> [[2]]
#> [1] mpg cyl disp hp drat wt qsec vs am gear carb
#> <0 rows> (or 0-length row.names)
This happens because the parent.frame() used in subset() is the environment within the body of lapply() rather than the local block. Because all environments eventually inherit from the global environment the variable var is found there with value 40.
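One base R way around this, short of rewriting subset, is to wrap the call in an anonymous function, so that the caller of subset() is a function defined inside the local block. A minimal sketch (res is just a name picked for the example):

```r
data(mtcars)
var <- 40

res <- local({
  var <- 30
  dfs <- list(mtcars, mtcars)
  # The anonymous function is created inside local(), so when subset()
  # evaluates the condition in its caller's frame it finds var = 30
  lapply(dfs, function(d) subset(d, mpg > var))
})

sapply(res, nrow)
# [1] 4 4
```

This works but is easy to forget, which is the motivation for the hygienic variant that follows.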
Hygienic variable expansion via quasiquotation (as implemented in the rlang package) solves this problem. We can define a variant of subset using tidy evaluation that works properly in all contexts. The code is derived from and largely identical to that of base::subset.data.frame().
subset2 <- function (x, subset, select, drop = FALSE, ...) {
r <- if (missing(subset))
rep_len(TRUE, nrow(x))
else {
r <- rlang::eval_tidy(rlang::enquo(subset), x)
if (!is.logical(r))
stop("'subset' must be logical")
r & !is.na(r)
}
vars <- if (missing(select))
TRUE
else {
nl <- as.list(seq_along(x))
names(nl) <- names(x)
rlang::eval_tidy(rlang::enquo(select), nl)
}
x[r, vars, drop = drop]
}
This version of subset behaves identically to base::subset.data.frame().
subset2(mtcars, gear > 4, disp:wt)
#> disp hp drat wt
#> Porsche 914-2 120.3 91 4.43 2.140
#> Lotus Europa 95.1 113 3.77 1.513
#> Ford Pantera L 351.0 264 4.22 3.170
#> Ferrari Dino 145.0 175 3.62 2.770
#> Maserati Bora 301.0 335 3.54 3.570
However subset2() does not suffer the scoping issues of subset. In our previous example the value 30 is used for var, as we would expect from lexical scoping rules.
local({
var <- 30
dfs <- list(mtcars, mtcars)
lapply(dfs, subset2, mpg > var)
})
#> [[1]]
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
#> Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
#> Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
#> Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
#>
#> [[2]]
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
#> Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
#> Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
#> Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
This allows non-standard evaluation to be used robustly in all contexts, not just in top level contexts as with previous approaches.
This makes functions which use non-standard evaluation much more useful. Previously, while such functions were nice to have for interactive use, you needed more verbose standard-evaluation functions when writing functions and packages. Now the same function can be used in all contexts without needing to modify the code!
For more details on non-standard evaluation please see Lionel Henry's Tidy evaluation (hygienic fexprs) presentation, the rlang vignette on tidy evaluation and the programming with dplyr vignette.
I understand I'm totally cheating, but technically it works :):
with(cars, data.frame(speed=speed,dist=dist)[speed == 4 & dist < 10,])
# speed dist
# 1 4 2
More horror:
`[` <- function(x,i,j){
rm(`[`,envir = parent.frame())
eval(parse(text=paste0("with(x,x[",deparse(substitute(i)),",])")))
}
cars[speed == 4 & dist < 10, ]
# speed dist
# 1 4 2
A solution that overrides the [ method for data.frame. In the new method we check the class of the i argument, and if it is an expression or formula we evaluate it in the context of the data frame.
##### override subsetting method
`[.data.frame` = function (x, i, j, ...) {
if(!missing(i) && (is.language(i) || is.symbol(i) || inherits(i, "formula"))) {
if(inherits(i, "formula")) i = as.list(i)[[2]]
i = eval(i, x, enclos = baseenv())
}
base::`[.data.frame`(x, i, j, ...)
}
#####
data(cars)
cars[cars$speed == 4 & cars$dist < 10, ]
# speed dist
# 1 4 2
# cars[speed == 4 & dist < 10, ] # error
cars[quote(speed == 4 & dist < 10),]
# speed dist
# 1 4 2
# or
cars[~ speed == 4 & dist < 10,]
# speed dist
# 1 4 2
Another solution with more magic. Please restart your R session first to avoid interference with the previous solution:
locally = function(expr){
curr_call = as.list(sys.call(1))
if(as.character(curr_call[[1]])=="["){
possibly_df = eval(curr_call[[2]], parent.frame())
if(is.data.frame(possibly_df)){
expr = substitute(expr)
expr = eval(expr, possibly_df, enclos = baseenv())
}
}
expr
}
cars[locally(speed == 4 & dist < 10), ]
# speed dist
# 1 4 2
Using attach()
attach(cars)
cars[speed == 4 & dist < 10,]
# speed dist
# 1 4 2
I was very early on in my R learning dissuaded from using attach(), but as long as you're careful not to introduce name conflicts I think it should be OK.
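To show the kind of name conflict I mean, here is a small sketch. The global speed variable is made up for the example; with() is shown as the scoped alternative.

```r
speed <- 100  # an unrelated variable that happens to share a column name

attach(cars)
# The variable 'speed' is found before the attached column, so nothing matches
masked <- cars[speed == 4 & dist < 10, ]
detach(cars)

# with() looks in the data frame first and leaves the search path alone
scoped <- with(cars, cars[speed == 4 & dist < 10, ])

nrow(masked)  # 0
nrow(scoped)  # 1
```

R does print a message about the masked object when attach() is called, but it is easy to miss in a long script.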
As the subset() manual states:
Warning: This is a convenience function intended for use interactively
I learned from this great article not only the secret behind this warning, but a good understanding of substitute(), match.call(), eval(), quote(), call, promise and other related R subjects, that are a little bit complicated.
Now I understand what's the warning above for. A super-simple implementation of subset() could be as follows:
subset = function(x, condition) x[eval(substitute(condition), envir=x),]
While subset(mtcars, cyl==4) returns the rows of mtcars that satisfy cyl==4, wrapping subset() in another function fails:
sub = function(x, condition) subset(x, condition)
sub(mtcars, cyl == 4)
# Error in eval(expr, envir, enclos) : object 'cyl' not found
The original version of subset() produces exactly the same error. This is due to a limitation of the substitute()-eval() pair: it works fine while condition is cyl==4, but when the condition is passed through the wrapping function sub(), the condition argument of subset() is no longer cyl==4 but the symbol condition from the body of sub(), and eval() fails. It's a bit complicated.
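A stripped-down sketch of what substitute() sees in each case; capture and wrapper are throwaway names for the illustration:

```r
# Returns the unevaluated expression supplied for 'condition'
capture <- function(condition) substitute(condition)

# Passes its own argument straight through, like sub() above
wrapper <- function(condition) capture(condition)

direct <- capture(cyl == 4)
nested <- wrapper(cyl == 4)

direct  # cyl == 4
nested  # condition
```

In the nested case only the symbol condition is captured, so the later eval() has to chase it back through the caller and ends up looking for cyl outside the data frame.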
But is there any other implementation of subset() with exactly the same arguments that would be programming-safe, i.e. able to evaluate its condition when it is called by another function?
The [ function is what you're looking for; see ?"[". mtcars[mtcars$cyl == 4,] is equivalent to the subset command and is "programming" safe.
sub = function(x, condition) {
x[condition,]
}
sub(mtcars, mtcars$cyl==4)
This does what you're asking, without the implicit with() in the function call. The specifics are complicated; however, a function like:
sub = function(x, quoted_condition) {
x[with(x, eval(parse(text=quoted_condition))),]
}
sub(mtcars, 'cyl==4')
sort of does what you're looking for, but there are edge cases where this will have unexpected results.
Using data.table and the [ subset function, you can get the implicit with(...) you're looking for.
library(data.table)
MT = data.table(mtcars)
MT[cyl==4]
There are better, faster ways to do this subsetting in data.table, but this illustrates the point well.
Using data.table you can also construct expressions to be evaluated later:
cond = expression(cyl==4)
MT[eval(cond)]
These two can now be passed through functions:
wrapper = function(DT, condition) {
DT[eval(condition)]
}
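The same "build the expression first, evaluate it later" idea can be sketched in base R without data.table. sub_quoted is a hypothetical helper, not part of any package:

```r
data(mtcars)

# Evaluate an already-quoted condition inside the data frame; names that are
# not columns are looked up in the caller's environment
sub_quoted <- function(df, condition) {
  rows <- eval(condition, envir = df, enclos = parent.frame())
  df[rows & !is.na(rows), , drop = FALSE]
}

cond <- quote(cyl == 4)
four_cyl <- sub_quoted(mtcars, cond)
nrow(four_cyl)  # 11
```

Because the caller quotes the condition explicitly, nothing depends on substitute(), so the helper can be nested inside other functions without the capture problem.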
Here's an alternative version of subset() which continues to work even when it's nested -- at least as long as the logical subsetting expression (e.g. cyl == 4) is supplied to the top-level function call.
It works by climbing up the call stack, substitute()ing at each step to ultimately capture the logical subsetting expression passed in by the user. In the call to sub2() below, for example, the for loop works up the call stack from expr to x to AA and finally to cyl == 4.
SUBSET <- function(`_dat`, expr) {
ff <- sys.frames()
ex <- substitute(expr)
ii <- rev(seq_along(ff))
for(i in ii) {
ex <- eval(substitute(substitute(x, env=sys.frames()[[n]]),
env = list(x = ex, n=i)))
}
`_dat`[eval(ex, envir = `_dat`),]
}
## Define test functions that nest SUBSET() more and more deeply
sub <- function(x, condition) SUBSET(x, condition)
sub2 <- function(AA, BB) sub(AA, BB)
## Show that it works, at least when the top-level function call
## contains the logical subsetting expression
a <- SUBSET(mtcars, cyl == 4) ## Direct call to SUBSET()
b <- sub(mtcars, cyl == 4) ## SUBSET() called one level down
c <- sub2(mtcars, cyl == 4) ## SUBSET() called two levels down
identical(a,b)
# [1] TRUE
identical(a,c)
# [1] TRUE
a[1:5,]
# mpg cyl disp hp drat wt qsec vs am gear carb
# Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
# Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
# Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
# Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
# Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
** For some explanation of the construct inside the for loop, see Section 6.2, paragraph 6 of the R Language Definition manual.
Just because it's such mind-bending fun (??), here is a slightly different solution that addresses a problem Hadley pointed to in comments to my accepted solution.
Hadley posted a gist demonstrating a situation in which my accepted function goes awry. The twist in that example (copied below) is that a symbol passed to SUBSET() is defined in the body (rather than the arguments) of one of the calling functions; it thus gets captured by substitute() instead of the intended global variable. Confusing stuff, I know.
f <- function() {
cyl <- 4
g()
}
g <- function() {
SUBSET(mtcars, cyl == 4)$cyl
}
f()
Here is a better function that will only substitute the values of symbols found in calling functions' argument lists. It works in all of the situations that Hadley or I have so far proposed.
SUBSET <- function(`_dat`, expr) {
ff <- sys.frames()
n <- length(ff)
ex <- substitute(expr)
ii <- seq_len(n)
for(i in ii) {
## 'which' is the frame number, and 'n' is # of frames to go back.
margs <- as.list(match.call(definition = sys.function(n - i),
call = sys.call(sys.parent(i))))[-1]
ex <- eval(substitute(substitute(x, env = ll),
env = list(x = ex, ll = margs)))
}
`_dat`[eval(ex, envir = `_dat`),]
}
## Works in Hadley's counterexample ...
f()
# [1] 4 4 4 4 4 4 4 4 4 4 4
## ... and in my original test cases.
sub <- function(x, condition) SUBSET(x, condition)
sub2 <- function(AA, BB) sub(AA, BB)
a <- SUBSET(mtcars, cyl == 4) ## Direct call to SUBSET()
b <- sub(mtcars, cyl == 4) ## SUBSET() called one level down
c <- sub2(mtcars, cyl == 4)
all(identical(a, b), identical(b, c))
# [1] TRUE
IMPORTANT: Please note that this still is not (nor can it be made into) a generally useful function. There's simply no way for the function to know which symbols you want it to use in all of the substitutions it performs as it works up the call stack. There are many situations in which users would want it to use the values of symbols assigned to within function bodies, but this function will always ignore those.
Update:
Here is a new version which fixes two problems:
a) The previous version simply traversed sys.frames() backwards. This version follows parent.frames() until it reaches .GlobalEnv. This is important in, e.g., subscramble, where scramble's frame should be ignored.
b) This version has a single substitute per level. This prevents the second substitute call from substituting symbols from one level higher that were introduced by the first substitute call.
subset <- function(x, condition) {
call <- substitute(condition)
frames <- sys.frames()
parents <- sys.parents()
# starting one frame up, keep climbing until we get to .GlobalEnv
i <- tail(parents, 1)
while(i != 0) {
f <- sys.frames()[[i]]
# copy x into f, except for variable with conflicting names.
xnames <- setdiff(ls(x), ls(f))
for (n in xnames) assign(n, x[[n]], envir=f)
call <- eval(substitute(substitute(expr, f), list(expr=call)))
# leave f the way we found it
rm(list=xnames, envir=f)
i <- parents[i]
}
r <- eval(call, x, .GlobalEnv)
x[r, ]
}
This version passes @hadley's test from the comments:
mtcars $ condition <- 4; subscramble(mtcars, cyl == 4)
Unfortunately the following two examples now behave differently:
cyl <- 6; subset(mtcars, cyl==4)
local({cyl <- 6; subset(mtcars, cyl==4)})
This is a slight modification of Josh's first function. At each frame in the stack, we substitute from x before substituting from the frame. This means that symbols in the data frame take precedence at every step. We can avoid pseudo-gensyms like _dat by skipping subset's frame in the for loop.
subset <- function(x, condition) {
call <- substitute(condition)
frames <- rev(sys.frames())[-1]
for(f in frames) {
call <- eval(substitute(substitute(expr, x), list(expr=call)))
call <- eval(substitute(substitute(expr, f), list(expr=call)))
}
r <- eval(call, x, .GlobalEnv)
x[r, ]
}
This version works in the simple case (it's worth checking that we haven't had a regression):
subset(mtcars, cyl == 4)
# mpg cyl disp hp drat wt qsec vs am gear carb
# Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
# Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
# Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
# Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
# Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
# Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
# Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
# Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
# Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
# Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
# Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
It also works with subscramble and f:
scramble <- function(x) x[sample(nrow(x)), ]
subscramble <- function(x, condition) scramble(subset(x, condition))
subscramble(mtcars, cyl == 4) $ cyl
# [1] 4 4 4 4 4 4 4 4 4 4 4
f <- function() {cyl <- 4; g()}
g <- function() subset(mtcars, cyl == 4) $ cyl
f()
# [1] 4 4 4 4 4 4 4 4 4 4 4
And even works in some trickier situations:
gear5 <- function(z, condition) {
x <- 5
subset(z, condition & (gear == x))
}
x <- 4
gear5(mtcars, cyl == x)
# mpg cyl disp hp drat wt qsec vs am gear carb
# Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
# Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
The lines inside the for loop might require some explanation. Suppose call is assigned as follows:
call <- quote(y == x)
str(call)
# language y == x
We want to substitute the value 4 for x in call. But the straightforward way doesn't work, since we want the contents of call, not the symbol call.
substitute(call, list(x=4))
# call
So we build the expression we need, using another substitute call.
substitute(substitute(expr, list(x=4)), list(expr=call))
# substitute(y == x, list(x = 4))
Now we have a language object that describes what we want to do. All that's left is to actually do it:
eval(substitute(substitute(expr, list(x=4)), list(expr=call)))
# y == 4
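For what it's worth, the same result can be reached with two other base R idioms, which some people find easier to read. This is just a sketch of alternatives, not what the function above uses:

```r
call <- quote(y == x)

# do.call() evaluates 'call' before building the call to substitute()
via_do_call <- do.call("substitute", list(call, list(x = 4)))

# bquote() splices the contents of 'call' in where .(call) appears
via_bquote <- eval(bquote(substitute(.(call), list(x = 4))))

deparse(via_do_call)  # "y == 4"
deparse(via_bquote)   # "y == 4"
```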