Hierarchical clustering with Gower distance: hclust() and philentropy::distance() in R

I've got a mixed data set (categorical and continuous variables) and I'd like to do hierarchical clustering using Gower distance.
I base my code on an example from https://www.r-bloggers.com/hierarchical-clustering-in-r-2/, which uses base R dist() for Euclidean distance. Since dist() doesn't compute Gower distance, I've tried using philentropy::distance() to compute it but it doesn't work.
Thanks for any help!
# Data
data("mtcars")
mtcars$cyl <- as.factor(mtcars$cyl)
# Hierarchical clustering with Euclidean distance - works
clusters <- hclust(dist(mtcars[, 1:2]))
plot(clusters)
# Hierarchical clustering with Gower distance - doesn't work
library(philentropy)
clusters <- hclust(distance(mtcars[, 1:2], method = "gower"))
plot(clusters)

The error is in the distance function itself.
I don't know whether it's intentional, but the current implementation of philentropy::distance() with the "gower" method cannot handle mixed data types. Its first operation is to transpose the data.frame, which produces a character matrix, and that matrix then triggers the type error when it is passed to the DistMatrixWithoutUnit function.
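To see the coercion on its own, without involving philentropy at all, here is a minimal base R sketch. It only illustrates what t() does to a data.frame that contains a factor column; it is not taken from the philentropy source.

```r
# Mixed-type subset: mpg is numeric, cyl is a factor
x <- mtcars[, 1:2]
x$cyl <- as.factor(x$cyl)

# t() on a data.frame goes through as.matrix(), which falls back to
# character as soon as one column is not numeric
m <- t(x)
typeof(m)      # "character"
is.numeric(m)  # FALSE
```

Any routine that expects a numeric matrix will reject m at this point.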
You might try using the daisy function from cluster instead.
library(cluster)
x <- mtcars[,1:2]
x$cyl <- as.factor(x$cyl)
dist <- daisy(x, metric = "gower")
cls <- hclust(dist)
plot(cls)
EDIT: For future reference, it seems that philentropy will be updated to include better type handling in the next version. From the vignette:
In future versions of philentropy I will optimize the distance()
function so that internal checks for data type correctness and correct
input data will take less termination time than the base dist()
function.
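For readers wondering what a Gower distance actually does with mixed columns, here is a rough base R sketch. gower_manual() is a hypothetical helper written for this illustration, and it assumes no missing values, equal weights for all columns, and unordered factors. Numeric columns contribute a range-normalised absolute difference, factor columns contribute a 0/1 mismatch, and the per-column contributions are averaged. For a simple case like this one it should agree with daisy(), but that is worth checking on your own data.

```r
# Hypothetical helper: Gower-style dissimilarity for numeric + factor columns
gower_manual <- function(df) {
  n <- nrow(df)
  total <- matrix(0, n, n, dimnames = list(rownames(df), rownames(df)))
  for (col in names(df)) {
    v <- df[[col]]
    if (is.numeric(v)) {
      # absolute difference scaled by the range of the column
      part <- abs(outer(v, v, "-")) / diff(range(v))
    } else {
      # simple mismatch: 0 if the categories are equal, 1 otherwise
      part <- outer(as.character(v), as.character(v), "!=") * 1
    }
    total <- total + part
  }
  # average over columns and convert to a "dist" object for hclust()
  as.dist(total / ncol(df))
}

x <- mtcars[, c("mpg", "cyl")]
x$cyl <- as.factor(x$cyl)
d <- gower_manual(x)
h <- hclust(d)
```

plot(h) then draws the dendrogram in the same way as in the daisy() example.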

Sorry, my English is not good enough to explain this properly, so this is just an attempt.
But the code is good ;-)
library(philentropy)
clusters <- hclust(
as.dist(
distance(mtcars[, 1:2], method = "gower")))
plot(clusters)
Good luck

You can do it pretty efficiently with the gower package
library(gower)
d <- sapply(1:nrow(mtcars), function(i) gower_dist(mtcars[i,],mtcars))
d <- as.dist(d)
h <- hclust(d)
plot(h)

Many thanks for this great question and thanks to all of you who provided excellent answers.
Just to resolve the issue for future readers:
# import example data
data("mtcars")
# store example subset with correct data type
mtcars_subset <- tibble::tibble(mpg = as.numeric(as.vector(mtcars$mpg)),
cyl = as.numeric(as.vector(mtcars$cyl)),
disp = as.numeric(as.vector(mtcars$disp)))
# transpose the data.frame to conform to the philentropy input format
mtcars_subset <- t(mtcars_subset)
# cluster
clusters <- hclust(as.dist(philentropy::distance(mtcars_subset, method = "gower")))
plot(clusters)
# When using the developer version on GitHub you can also specify 'use.row.names = TRUE'
clusters <- hclust(as.dist(philentropy::distance(mtcars_subset, method = "gower",
use.row.names = TRUE)))
plot(clusters)
As you can see, clustering works perfectly fine now.
The problem is that in the example dataset the column cyl stores factor values and not double values, as the philentropy::distance() function requires. Since the underlying code is written in Rcpp, non-conforming data types will cause problems. As Esther correctly noted, I will implement a better way to check type safety in future versions of the package.
head(tibble::as_tibble(mtcars))
# A tibble: 6 x 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
To overcome this limitation, I stored the columns of interest from the mtcars dataset in a separate data.frame/tibble and converted all columns to double values via as.numeric(as.vector(mtcars$mpg)).
The resulting subset data.frame now stores only double values as required.
mtcars_subset
# A tibble: 32 x 3
mpg cyl disp
<dbl> <dbl> <dbl>
1 21 6 160
2 21 6 160
3 22.8 4 108
4 21.4 6 258
5 18.7 8 360
6 18.1 6 225
7 14.3 8 360
8 24.4 4 147.
9 22.8 4 141.
10 19.2 6 168.
# … with 22 more rows
Please also note that if you provide the philentropy::distance() function only 2 input vectors, then only one distance value will be returned and the hclust() function won't be able to compute any clusters with one value. Hence, I added a third column disp to enable visualization of the clusters.
I hope this helps.
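One thing worth checking in the code above is which dimension ends up being clustered. As the answer describes it, distance() compares the input vectors stored in the rows, so after t() the objects are the three variables rather than the 32 cars. The sketch below uses base dist() purely as a stand-in (an assumption on my part; it also compares rows) to show how the number of objects changes with the orientation.

```r
vars <- as.matrix(mtcars[, c("mpg", "cyl", "disp")])

d_cars <- dist(vars)     # rows are cars
d_vars <- dist(t(vars))  # rows are variables after transposing

attr(d_cars, "Size")     # 32 objects
attr(d_vars, "Size")     # 3 objects

# The leaves of the resulting tree are the variable names
hclust(d_vars)$labels
```

If the goal is to cluster cars, the matrix passed to the distance function needs one car per row.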

Related

How to input data in xgb.train function in R?

I'm going to run xgboost in R using the xgb.train function.
I know that, in order to use xgb.train, the input data must first be converted with the xgb.DMatrix function.
But when I used this function on my data set, I got an error message:
Error in xgb.DMatrix(data = as.matrix(train)) :
[09:01:01] amalgamation/../dmlc-core/src/io/local_filesys.cc:66: LocalFileSystem.GetPathInfo 1 Error:No such file or directory
My full R code follows. How should I transform the input data so that it can be used?
credit<-read.csv("http://freakonometrics.free.fr/german_credit.csv", header=TRUE)
F=c(1,2,4,5,7,8,9,10,11,12,13,15,16,17,18,19,20,21)
for(i in F) credit[,i]=as.factor(credit[,i])
str(credit)
library(caret)
set.seed(1000)
intrain<-createDataPartition(y=credit$Creditability, p=0.7, list=FALSE)
train<-credit[intrain, ]
test<-credit[-intrain, ]
d_train<-xgb.DMatrix(data=as.matrix(train))
If you still want to use factors you should use the model.matrix() function to convert your factors to dummy variables.
For example:
my.dat <- mtcars[c("mpg","cyl","disp")]
my.dat$cyl <- as.factor(my.dat$cyl)
# Convert data frame to X matrix
x.train <- model.matrix(mpg~.,data=my.dat)
head(x.train)
Output:
(Intercept) cyl6 cyl8 disp
Mazda RX4 1 1 0 160
Mazda RX4 Wag 1 1 0 160
Datsun 710 1 0 0 108
Hornet 4 Drive 1 1 0 258
Hornet Sportabout 1 0 1 360
Valiant 1 1 0 225
This creates dummy variables cyl6 and cyl8 where 4 cylinder vehicles would be the base group (where cyl6=0 and cyl8=0).
Then you can pass this matrix into the xgb.DMatrix function:
d_train<-xgb.DMatrix(x.train,label=my.dat$mpg)
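A likely reason for the original error (this is my reading, not something confirmed by the xgboost documentation quoted here) is that as.matrix() on a data frame with factor columns returns a character matrix, which xgb.DMatrix cannot treat as numeric data. The base R sketch below shows the difference between the two conversions, and also drops the intercept column, which is usually not wanted as a feature.

```r
my.dat <- mtcars[c("mpg", "cyl", "disp")]
my.dat$cyl <- as.factor(my.dat$cyl)

# as.matrix() falls back to character because of the factor column
typeof(as.matrix(my.dat))   # "character"

# model.matrix() expands the factor into numeric dummy columns;
# [, -1] removes the "(Intercept)" column
x.train <- model.matrix(mpg ~ ., data = my.dat)[, -1]
typeof(x.train)             # "double"
colnames(x.train)           # "cyl6" "cyl8" "disp"
```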

For loop over a List of Data frames

I'm using the mtcars dataset in R. I have a list of data frames (mtcars dataset split into number of cylinders). I need to:
1. Identify the car with the minimum value of miles per gallon (mpg) within each cylinder type (i.e. 4, 6, 8).
2. Create a vector that stores the horsepower (hp) of each of the cars found in step 1 (the length of the vector will be 3).
Steps I have performed so far, as follows:
# load the data
data(mtcars)
# split cars data.frame into a list of data frames by cylinder
cars <- split(mtcars, mtcars$cyl)
# find the position within each data frame for the min values of mpg (i.e. first
# column)
positions <- sapply(cars,function(x) which.min(x[,1]))
As I see it, the next step would be to make a loop over each data frame to find the horsepower value for each position. I have tried to make a For loop for this, but I haven't been able to make it work. Maybe there's even a better solution for this problem.
You don't need to split the data and then use sapply; there are more efficient ways to get that output. Here's a possible data.table solution:
mtcars$Cars <- rownames(mtcars)
library(data.table)
data.table(mtcars)[, list(Car = Cars[which.min(mpg)],
HP = hp[which.min(mpg)]),
by = cyl]
# cyl Car HP
# 1: 6 Merc 280C 123
# 2: 4 Volvo 142E 109
# 3: 8 Cadillac Fleetwood 205
Or maybe using dplyr
library(dplyr)
mtcars %>%
mutate(Cars = rownames(mtcars)) %>%
group_by(cyl) %>%
summarize(Car = Cars[which.min(mpg)], HP = hp[which.min(mpg)])
# Source: local data frame [3 x 3]
#
# cyl Car HP
# 1 4 Volvo 142E 109
# 2 6 Merc 280C 123
# 3 8 Cadillac Fleetwood 205
From the pre-split cars set, you can do it this way with Map and Reduce.
Reduce(rbind,
       Map(function(x) x[which.min(x$mpg), "hp", drop = FALSE],
           cars, USE.NAMES = FALSE)
)
#                     hp
# Volvo 142E 109
# Merc 280C 123
# Cadillac Fleetwood 205
If you wanted a vector, you can assign the above code to a variable, say rr, and do
setNames(rr[,1], rownames(rr))
# Volvo 142E Merc 280C Cadillac Fleetwood
# 109 123 205
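Since the question asked specifically for a loop over the list of data frames, here is a minimal base R sketch of that approach. hp_at_min is just an illustrative name; the loop fills one element per cylinder group.

```r
# split cars into a list of data frames by cylinder
cars <- split(mtcars, mtcars$cyl)

# pre-allocate a named result vector, one slot per group
hp_at_min <- numeric(length(cars))
names(hp_at_min) <- names(cars)

for (k in names(cars)) {
  pos <- which.min(cars[[k]]$mpg)        # row with the lowest mpg in this group
  hp_at_min[k] <- cars[[k]]$hp[pos]      # horsepower of that car
}

hp_at_min
#   4   6   8
# 109 123 205
```

The same result can be had in one line with sapply(cars, function(x) x$hp[which.min(x$mpg)]).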
This is really easy if you use the plyr library. Here ya go:
library(plyr)
data(mtcars)
mpMins <- ddply(mtcars, .(cyl),summarize, min = min(mpg), .drop = FALSE)
mpMins
cyl min
1 4 21.4
2 6 17.8
3 8 10.4
This only gives you the minimum value of mpg, though; you want the horsepower too:
hpMins <- (merge(mpMins, mtcars, by.x = c("min","cyl"), by.y = c("mpg","cyl" )))$hp
hpMins
[1] 205 215 123 109
Strange, there are four values, and you said you wanted three. If you go back and check the data, though, there are two cars with the minimum value of 10.4 in the 8 cylinder category. Remember to be careful when going from summary values (like minimums) back to individual observations.
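If you want to keep the ties visible per group rather than as a flat vector, a small base R sketch along these lines may help. It returns a list, so each cylinder group can hold more than one horsepower value.

```r
cars <- split(mtcars, mtcars$cyl)

# keep every car that reaches the group minimum, not only the first one
hp_all_min <- lapply(cars, function(x) x$hp[x$mpg == min(x$mpg)])

lengths(hp_all_min)   # number of cars tied at the minimum in each group
hp_all_min[["8"]]     # 205 215
```

which.min(), used in the other answers, silently returns only the first of the tied rows.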

Subsets of a dataset as separate dendrograms, but in the same plot

I know I can plot a dendrogram as follows
library(cluster)
d <- mtcars
d[,8:11] <- lapply(d[,8:11], as.factor)
gdist <- daisy(d, metric = c("gower"), stand = FALSE)
dendro <- hclust(gdist, method = "average")
plot(as.dendrogram(dendro))
However I have some groups identified (eg. by an iterative classification method), given as the last column in d
G <- c(1,2,3,3,4,4,5,5,5,5,1,2,1,1,2,4,1,3,4,5,1,7,4,3,3,2,1,1,1,3,5,6)
d$Group <- G
head(d)
mpg cyl disp hp drat wt qsec vs am gear carb Group
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 1
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 2
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 3
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 3
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 4
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 4
I am trying to plot all the dendrograms together on the same plot with the same scale. The groups with only a single member (groups 6 and 7) also need to be plotted.
I am able to plot individual dendrograms for subsets of the data, except when a group has only one member. But I don't think this is the right approach.
layout(matrix(1:9, 3,3,byrow=TRUE))
gdist <- as.matrix(gdist)
N <- max(G)
for (i in 1:N){
rc_tokeep <- row.names(subset(d, G==i))
dis <- as.dist(gdist[rc_tokeep, rc_tokeep])
dend <- hclust(dis, method = "average")
plot(as.dendrogram(dend))
}
The loop gives this error for the last two groups (6 and 7), which have only a single member.
Error in hclust(dis, method = "average") :
must have n >= 2 objects to cluster
Essentially I want to reproduce this type of plot, in which clusters with a single member are also plotted.
If you want to mimic the last few graphs, you can do something like this:
N <- max(G)
layout(matrix(c(0,1:N,0),nc=1))
gdist <- as.matrix(gdist)
for (i in 1:N){
par(mar=c(0,3,0,7))
rc_tokeep <- row.names(subset(d, G==i))
if(length(rc_tokeep)>1){ #The idea is to catch the groups with one single element to plot them differently
dis <- as.dist(gdist[rc_tokeep, rc_tokeep])
dend <- hclust(dis, method = "average")
plot(as.dendrogram(dend),horiz=TRUE,
xlim=c(.8,0),axes=FALSE) # giving the same xlim will scale all of them, here i used 0.8 to fit your data but you can change it to whatever
}else{
plot(NA,xlim=c(.8,0),ylim=c(0,1),axes=F,ann=F)
segments(0,.5,.1,.5) #I don't know how you intend to compute the length of the branch in a group of 1 element, you might want to change that
text(0,.5, pos=4,rc_tokeep,xpd=TRUE)
}
}
With your example it gives:
If you want to add the scale you can add a grid in all graphs and a scale in the last one:
N <- max(G)
layout(matrix(c(0,1:N,0),nc=1))
gdist <- as.matrix(gdist)
for (i in 1:N){
par(mar=c(0,3,0,7))
rc_tokeep <- row.names(subset(d, G==i))
if(length(rc_tokeep)>1){
dis <- as.dist(gdist[rc_tokeep, rc_tokeep])
dend <- hclust(dis, method = "average")
plot(as.dendrogram(dend),horiz=TRUE,xlim=c(.8,0),xaxt="n",yaxt="n")
abline(v=seq(0,.8,.1),lty=3) #Here the grid
}else{
plot(NA,xlim=c(.8,0),ylim=c(0,1),axes=F,ann=F)
segments(0,.5,.1,.5)
text(0,.5, pos=4,rc_tokeep,xpd=TRUE)
abline(v=seq(0,.8,.1),lty=3) #Here the grid
}
}
axis(1,at=seq(0,.8,.1)) #Here the axis
And finally if you want to even the spaces between the different branches in the resulting plot, you can use table(d$Group) to get the number of members of each group and use it as a height for each subplot:
N <- max(G)
layout(matrix(c(0,1:N,0),ncol=1), heights=c(3,table(d$Group),3)) #Plus the height of the empty spaces.
gdist <- as.matrix(gdist)
for (i in 1:N){
par(mar=c(0,3,0,7))
rc_tokeep <- row.names(subset(d, G==i))
if(length(rc_tokeep)>1){
dis <- as.dist(gdist[rc_tokeep, rc_tokeep])
dend <- hclust(dis, method = "average")
plot(as.dendrogram(dend),horiz=TRUE,xlim=c(.8,0),xaxt="n",yaxt="n")
abline(v=seq(0,.8,.1),lty=3)
}else{
plot(NA,xlim=c(.8,0),ylim=c(0,1),axes=F,ann=F)
segments(0,.5,.1,.5)
text(0,.5, pos=4,rc_tokeep,xpd=TRUE)
abline(v=seq(0,.8,.1),lty=3)
}
}
axis(1,at=seq(0,.8,.1))
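The 0.8 used for xlim above was chosen by eye. A possible way to derive the common scale from the data is sketched below. As an assumption, it uses dist(scale(mtcars)) as a stand-in for the Gower matrix so that it runs with base R only; trees, max_height and xlim_common are illustrative names.

```r
G <- c(1,2,3,3,4,4,5,5,5,5,1,2,1,1,2,4,1,3,4,5,1,7,4,3,3,2,1,1,1,3,5,6)

# stand-in distance matrix (swap in as.matrix(gdist) for the Gower version)
dmat <- as.matrix(dist(scale(mtcars)))

# one hclust tree per group; singleton groups get NULL because
# hclust() needs at least two objects
trees <- lapply(split(rownames(mtcars), G), function(ids) {
  if (length(ids) < 2) return(NULL)
  hclust(as.dist(dmat[ids, ids]), method = "average")
})

# tallest merge height over all groups defines the shared axis
max_height <- max(vapply(trees,
                         function(tr) if (is.null(tr)) 0 else max(tr$height),
                         numeric(1)))
xlim_common <- c(max_height, 0)
```

Passing xlim = xlim_common to every plot() call in the loop keeps all panels on the same scale.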

R flag cases with missingness from regression analysis

When running a regression analysis in R (using glm) cases are removed due to 'missingness' of the data. Is there any way to flag which cases have been removed? I would ideally like to remove these from my original dataframe.
Many thanks
The model fit object returned by glm() records the row numbers of the data that it excludes for their incompleteness. They are a bit buried but you can retrieve them like this:
## Example data.frame with some missing data
df <- mtcars[1:6, 1:5]
df[cbind(1:5,1:5)] <- NA
df
# mpg cyl disp hp drat
# Mazda RX4 NA 6 160 110 3.90
# Mazda RX4 Wag 21.0 NA 160 110 3.90
# Datsun 710 22.8 4 NA 93 3.85
# Hornet 4 Drive 21.4 6 258 NA 3.08
# Hornet Sportabout 18.7 8 360 175 NA
# Valiant 18.1 6 225 105 2.76
## Fit an example model, and learn which rows it excluded
f <- glm(mpg~drat, weights=disp, data=df)
as.numeric(na.action(f))
# [1] 1 3 5
Alternatively, to get the row indices without having to fit the model, use the same strategy with the output of model.frame():
as.numeric(na.action(model.frame(mpg~drat, weights=disp, data=df)))
# [1] 1 3 5
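To go from those indices to a flag in the original data frame, something like the following sketch should work. used_in_model and df_used are illustrative names, not anything glm() creates for you.

```r
df <- mtcars[1:6, 1:5]
df[cbind(1:5, 1:5)] <- NA

f <- glm(mpg ~ drat, weights = disp, data = df)

dropped <- na.action(f)   # integer positions, named by row name
names(dropped)            # row names of the excluded cases

# flag the cases and keep only those that entered the model
df$used_in_model <- !(seq_len(nrow(df)) %in% dropped)
df_used <- df[df$used_in_model, ]
nrow(df_used)             # 3
```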
Without a reproducible example I can't provide code tailored to your problem, but here's a generic method that should work. Assume your data frame is called df and your variables are called y, x1, x2, etc. And assume you want y, x1, x3, and x6 in your model.
# Make a vector of the variables that you want to include in your glm model
# (Be sure to include any weighting or subsetting variables as well, per Josh's comment)
glm.vars = c("y","x1","x3","x6")
# Create a new data frame that includes only those rows with no missing values
# for the variables that are in your model
df.glm = df[complete.cases(df[ , glm.vars]), ]
Also, if you want to see just the rows that have at least one missing value, do the following (note the addition of ! (the "not" operator)):
df[!complete.cases(df[ , glm.vars]), ]
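If you would rather not type the variable names by hand, all.vars() can pull them out of the model formula. This is only a sketch: the weights variable (disp here, reusing the example data from the other answer) is not part of the formula, so it has to be added explicitly.

```r
df <- mtcars[1:6, 1:5]
df[cbind(1:5, 1:5)] <- NA

form <- mpg ~ drat
model_vars <- c(all.vars(form), "disp")   # formula variables plus the weights column

incomplete <- !complete.cases(df[, model_vars])
which(incomplete)   # 1 3 5
```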

Programming-safe version of subset - to evaluate its condition while called from another function

As subset() manual states:
Warning: This is a convenience function intended for use interactively
From this great article I learned not only the reason behind this warning, but also gained a good understanding of substitute(), match.call(), eval(), quote(), call, promise and other related R subjects, which are a little bit complicated.
Now I understand what the warning above is for. A super-simple implementation of subset() could be as follows:
subset = function(x, condition) x[eval(substitute(condition), envir=x),]
While subset(mtcars, cyl==4) returns the rows of mtcars that satisfy cyl==4, wrapping subset() in another function fails:
sub = function(x, condition) subset(x, condition)
sub(mtcars, cyl == 4)
# Error in eval(expr, envir, enclos) : object 'cyl' not found
The original version of subset() produces exactly the same error. This is a limitation of the substitute()/eval() pair: it works fine when the condition is written directly as cyl==4, but when the condition is passed through the wrapper sub(), the condition argument that subset() sees is no longer cyl==4 but the symbol condition from the body of sub(). Evaluating that symbol forces the original promise in the caller's environment, where cyl does not exist as a variable, so the call fails.
Is there any other implementation of subset() with exactly the same arguments that is programming-safe, i.e. able to evaluate its condition when it is called from another function?
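The failure described above can be seen directly by looking at what substitute() captures at each level. show_expr() and wrapper() are throwaway names for this sketch; neither evaluates the condition, they only return the captured expression.

```r
# returns the unevaluated expression supplied for 'condition'
show_expr <- function(x, condition) substitute(condition)

# same signature, but the condition is passed through one more level
wrapper <- function(x, condition) show_expr(x, condition)

show_expr(mtcars, cyl == 4)   # cyl == 4
wrapper(mtcars, cyl == 4)     # condition
```

In the nested call, all that reaches substitute() is the symbol condition, which is why evaluating it inside the data frame no longer finds cyl.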
The [ function is what you're looking for. ?"[". mtcars[mtcars$cyl == 4,] is equivalent to the subset command and is "programming" safe.
sub = function(x, condition) {
x[condition,]
}
sub(mtcars, mtcars$cyl==4)
This does what you're asking without the implicit with() in the function call. The specifics are complicated; however, a function like:
sub = function(x, quoted_condition) {
x[with(x, eval(parse(text=quoted_condition))),]
}
sub(mtcars, 'cyl==4')
sort of does what you're looking for, but there are edge cases where this will have unexpected results.
Using data.table and the [ subset function you can get the implicit with(...) you're looking for.
library(data.table)
MT = data.table(mtcars)
MT[cyl==4]
There are better, faster ways to do this subsetting in data.table, but this illustrates the point well.
Using data.table you can also construct expressions to be evaluated later:
cond = expression(cyl==4)
MT[eval(cond)]
These two can now be passed through functions:
wrapper = function(DT, condition) {
DT[eval(condition)]
}
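The same idea of passing an already-quoted condition also works in base R without data.table. This is a minimal sketch; sub_quoted() and outer_fun() are hypothetical names, and the caller is responsible for wrapping the condition in quote().

```r
# evaluate a quoted condition inside the data frame, falling back to the
# caller's environment for anything that is not a column
sub_quoted <- function(x, cond) {
  rows <- eval(cond, envir = x, enclos = parent.frame())
  x[rows, , drop = FALSE]
}

# nesting is harmless because 'cond' is an ordinary value by now
outer_fun <- function(x, cond) sub_quoted(x, cond)

four <- outer_fun(mtcars, quote(cyl == 4))
nrow(four)   # 11
```

The trade-off is the explicit quote() at the call site, which is less convenient interactively but behaves predictably inside other functions.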
Here's an alternative version of subset() which continues to work even when it's nested -- at least as long as the logical subsetting expression (e.g. cyl == 4) is supplied to the top-level function call.
It works by climbing up the call stack, substitute()ing at each step to ultimately capture the logical subsetting expression passed in by the user. In the call to sub2() below, for example, the for loop works up the call stack from expr to condition to BB and finally to cyl == 4.
SUBSET <- function(`_dat`, expr) {
ff <- sys.frames()
ex <- substitute(expr)
ii <- rev(seq_along(ff))
for(i in ii) {
ex <- eval(substitute(substitute(x, env=sys.frames()[[n]]),
env = list(x = ex, n=i)))
}
`_dat`[eval(ex, envir = `_dat`),]
}
## Define test functions that nest SUBSET() more and more deeply
sub <- function(x, condition) SUBSET(x, condition)
sub2 <- function(AA, BB) sub(AA, BB)
## Show that it works, at least when the top-level function call
## contains the logical subsetting expression
a <- SUBSET(mtcars, cyl == 4) ## Direct call to SUBSET()
b <- sub(mtcars, cyl == 4) ## SUBSET() called one level down
c <- sub2(mtcars, cyl == 4) ## SUBSET() called two levels down
identical(a,b)
# [1] TRUE
identical(a,c)
# [1] TRUE
a[1:5,]
# mpg cyl disp hp drat wt qsec vs am gear carb
# Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
# Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
# Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
# Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
# Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
** For some explanation of the construct inside the for loop, see Section 6.2, paragraph 6 of the R Language Definition manual.
Just because it's such mind-bending fun (??), here is a slightly different solution that addresses a problem Hadley pointed to in comments to my accepted solution.
Hadley posted a gist demonstrating a situation in which my accepted function goes awry. The twist in that example (copied below) is that a symbol passed to SUBSET() is defined in the body (rather than the arguments) of one of the calling functions; it thus gets captured by substitute() instead of the intended global variable. Confusing stuff, I know.
f <- function() {
cyl <- 4
g()
}
g <- function() {
SUBSET(mtcars, cyl == 4)$cyl
}
f()
Here is a better function that will only substitute the values of symbols found in calling functions' argument lists. It works in all of the situations that Hadley or I have so far proposed.
SUBSET <- function(`_dat`, expr) {
ff <- sys.frames()
n <- length(ff)
ex <- substitute(expr)
ii <- seq_len(n)
for(i in ii) {
## 'which' is the frame number, and 'n' is # of frames to go back.
margs <- as.list(match.call(definition = sys.function(n - i),
call = sys.call(sys.parent(i))))[-1]
ex <- eval(substitute(substitute(x, env = ll),
env = list(x = ex, ll = margs)))
}
`_dat`[eval(ex, envir = `_dat`),]
}
## Works in Hadley's counterexample ...
f()
# [1] 4 4 4 4 4 4 4 4 4 4 4
## ... and in my original test cases.
sub <- function(x, condition) SUBSET(x, condition)
sub2 <- function(AA, BB) sub(AA, BB)
a <- SUBSET(mtcars, cyl == 4) ## Direct call to SUBSET()
b <- sub(mtcars, cyl == 4) ## SUBSET() called one level down
c <- sub2(mtcars, cyl == 4)
all(identical(a, b), identical(b, c))
# [1] TRUE
IMPORTANT: Please note that this still is not (nor can it be made into) a generally useful function. There's simply no way for the function to know which symbols you want it to use in all of the substitutions it performs as it works up the call stack. There are many situations in which users would want it to use the values of symbols assigned to within function bodies, but this function will always ignore those.
Update:
Here is a new version which fixes two problems:
a) The previous version simply traversed sys.frames() backwards. This version follows parent.frames() until it reaches .GlobalEnv. This is important in, e.g., subscramble, where scramble's frame should be ignored.
b) This version has a single substitute per level. This prevents the second substitute call from substituting symbols from one level higher that were introduced by the first substitute call.
subset <- function(x, condition) {
call <- substitute(condition)
frames <- sys.frames()
parents <- sys.parents()
# starting one frame up, keep climbing until we get to .GlobalEnv
i <- tail(parents, 1)
while(i != 0) {
f <- sys.frames()[[i]]
# copy x into f, except for variable with conflicting names.
xnames <- setdiff(ls(x), ls(f))
for (n in xnames) assign(n, x[[n]], envir=f)
call <- eval(substitute(substitute(expr, f), list(expr=call)))
# leave f the way we found it
rm(list=xnames, envir=f)
i <- parents[i]
}
r <- eval(call, x, .GlobalEnv)
x[r, ]
}
This version passes @hadley's test from the comments:
mtcars $ condition <- 4; subscramble(mtcars, cyl == 4)
Unfortunately the following two examples now behave differently:
cyl <- 6; subset(mtcars, cyl==4)
local({cyl <- 6; subset(mtcars, cyl==4)})
This is a slight modification of Josh's first function. At each frame in the stack, we substitute from x before substituting from the frame. This means that symbols in the data frame take precedence at every step. We can avoid pseudo-gensyms like _dat by skipping subset's frame in the for loop.
subset <- function(x, condition) {
call <- substitute(condition)
frames <- rev(sys.frames())[-1]
for(f in frames) {
call <- eval(substitute(substitute(expr, x), list(expr=call)))
call <- eval(substitute(substitute(expr, f), list(expr=call)))
}
r <- eval(call, x, .GlobalEnv)
x[r, ]
}
This version works in the simple case (it's worth checking that we haven't had a regression):
subset(mtcars, cyl == 4)
# mpg cyl disp hp drat wt qsec vs am gear carb
# Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
# Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
# Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
# Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
# Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
# Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
# Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
# Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
# Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
# Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
# Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
It also works with subscramble and f:
scramble <- function(x) x[sample(nrow(x)), ]
subscramble <- function(x, condition) scramble(subset(x, condition))
subscramble(mtcars, cyl == 4) $ cyl
# [1] 4 4 4 4 4 4 4 4 4 4 4
f <- function() {cyl <- 4; g()}
g <- function() subset(mtcars, cyl == 4) $ cyl
f()
# [1] 4 4 4 4 4 4 4 4 4 4 4
And even works in some trickier situations:
gear5 <- function(z, condition) {
x <- 5
subset(z, condition & (gear == x))
}
x <- 4
gear5(mtcars, cyl == x)
# mpg cyl disp hp drat wt qsec vs am gear carb
# Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
# Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
The lines inside the for loop might require some explanation. Suppose call is assigned as follows:
call <- quote(y == x)
str(call)
# language y == x
We want to substitute the value 4 for x in call. But the straightforward way doesn't work, since we want the contents of call, not the symbol call.
substitute(call, list(x=4))
# call
So we build the expression we need, using another substitute call.
substitute(substitute(expr, list(x=4)), list(expr=call))
# substitute(y == x, list(x = 4))
Now we have a language object that describes what we want to do. All that's left is to actually do it:
eval(substitute(substitute(expr, list(x=4)), list(expr=call)))
# y == 4
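For comparison, the same substitution into the contents of call can be written with do.call(), which evaluates its argument list before building the call to substitute(). This is offered only as an equivalent idiom for the toy example, not as a change to the functions above.

```r
call <- quote(y == x)

# do.call() passes the *value* of 'call' (the expression y == x) to substitute()
res <- do.call("substitute", list(call, list(x = 4)))
res
# y == 4

# same result as the nested substitute() form
identical(res, eval(substitute(substitute(expr, list(x = 4)), list(expr = call))))
```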
