Bad idea? ggplotting an S3 class object - r

Many R objects have S3 methods to plot associated with them. For instance, every R regression tutorial contains something like this:
dat <- data.frame(x=runif(10))
dat$y <- dat$x+runif(10)
my.lm <- lm( y~x, dat )
plot(my.lm)
This displays the regression diagnostics.
Similarly, I have an S3 object for a package which consists of a list which basically holds a few time series. I have a plot.myobject method for it which reaches into the list, yanks out the time series, and plots them on the same graph. I would like to rewrite this as a ggplot2 function so that it will be prettier and perhaps more extensible as well.
Because this package is intended to get people without much R experience up and running quickly, I'd like this to be a one-liner with one argument, as in plot(myobject), ggplot(myobject), or whatever the appropriate version might be. Then once they get hooked, they can learn more about ggplot2 and customize the graph to their heart's content.
My initial temptation was to simply replace the internals of the plot.myobject method to use ggplot2. This, however, seems like it might lose me major style points.
Is this a bad idea, and if so why and what alternative should I use?
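To make the setup concrete, here is a hypothetical sketch of the kind of object and base-graphics method described above (the names, structure, and data are illustrative only, not from an actual package):
myobject <- structure(
  list(observed = ts(rnorm(24), frequency = 12),
       fitted   = ts(rnorm(24), frequency = 12)),
  class = "myobject"
)
# Reach into the list, pull out the series, and plot them on the same graph
plot.myobject <- function(x, ...) {
  ts.plot(x$observed, x$fitted, gpars = list(col = c("black", "red")))
}
plot(myobject)  # the one-liner end users see; dispatches to plot.myobject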

There is an existing idiom in ggplot2 to do exactly what you propose. It is called fortify. It takes an object and produces a version of the object in a form that ggplot can work with, i.e. a data.frame. Section 9.3 in Hadley's ggplot2 book describes how to do this, using the S3 object class lm as an example. To see this in action, type fortify.lm into your console to get the following code:
function (model, data = model$model, ...)
{
    infl <- influence(model, do.coef = FALSE)
    data$.hat <- infl$hat
    data$.sigma <- infl$sigma
    data$.cooksd <- cooks.distance(model, infl)
    data$.fitted <- predict(model)
    data$.resid <- resid(model)
    data$.stdresid <- rstandard(model, infl)
    data
}
<environment: namespace:ggplot2>
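Once a fortify method exists, ggplot() can be handed the model object directly and fortify() is applied behind the scenes. A minimal sketch (reusing my.lm from the question) of a residuals-versus-fitted plot built from those fortified columns:
library(ggplot2)
# fortify() turns my.lm into a data.frame with .fitted, .resid, etc.
ggplot(my.lm, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed")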
Here is my own example of writing a fortify method for tree, originally published on the ggplot2 mailing list:
fortify.tree <- function(model, data, ...){
  require(tree)
  # Uses tree:::treeco to extract data frame of plot locations
  xy <- tree:::treeco(model)
  n <- model$frame$n
  # Lines copied from tree:::treepl
  x <- xy$x
  y <- xy$y
  node <- as.numeric(row.names(model$frame))
  parent <- match((node %/% 2), node)
  sibling <- match(ifelse(node %% 2, node - 1L, node + 1L), node)
  linev <- data.frame(x = x, y = y, xend = x, yend = y[parent], n = n)
  lineh <- data.frame(x = x[parent], y = y[parent], xend = x,
                      yend = y[parent], n = n)
  rbind(linev[-1, ], lineh[-1, ])
}
# Note: this uses the pre-0.9.2 ggplot2 theming API; in current ggplot2,
# opts() has become theme() and theme_blank() has become element_blank().
theme_null <- opts(
  panel.grid.major = theme_blank(),
  panel.grid.minor = theme_blank(),
  axis.text.x = theme_blank(),
  axis.text.y = theme_blank(),
  axis.ticks = theme_blank(),
  axis.title.x = theme_blank(),
  axis.title.y = theme_blank(),
  legend.position = "none"
)
And the plot code. Notice that the data passed to ggplot is not a data.frame but a tree object.
library(ggplot2)
library(tree)

data(cpus, package = "MASS")
cpus.ltr <- tree(log10(perf) ~ syct + mmin + mmax + cach + chmin + chmax, cpus)

p <- ggplot(data = cpus.ltr) +
  geom_segment(aes(x = x, y = y, xend = xend, yend = yend, size = n),
               colour = "blue", alpha = 0.5) +
  scale_size("n", to = c(0, 3)) +  # `to` is the old spelling of what is now `range`
  theme_null
print(p)

As per Hadley's suggestion in comments, I have submitted a generic S3 autoplot() to the ggplot2 Github repository. So if it's accepted and checks out, there should be an autoplot available for this use in the future.
Update
autoplot is now available in ggplot2.
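As a rough illustration (not from the package in the question), an autoplot method for the hypothetical myobject class sketched earlier could look something like this, giving new users a pretty one-liner:
library(ggplot2)

autoplot.myobject <- function(object, ...) {
  # Assumes `object` is a list of equal-length time series, as in the question
  df <- data.frame(
    time   = rep(as.numeric(time(object$observed)), 2),
    series = rep(c("observed", "fitted"), each = length(object$observed)),
    value  = c(as.numeric(object$observed), as.numeric(object$fitted))
  )
  ggplot(df, aes(x = time, y = value, colour = series)) + geom_line()
}

autoplot(myobject)  # one argument, ready for further ggplot2 customisation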

Using plot.myobject is easy to remember and execute. However, if you're talking about myobjects that already have plot.myobject functions, you may have to worry about conflicting versions in different namespaces. But if it's just for your own myobjects, you don't lose any style points with me. The nlme package, for one, does this extensively, though with lattice graphs instead of ggplot.
Using ggplot.myobject is an alternative; you shouldn't have to worry about other versions, unless other people start doing the same thing. However, as you note, it does break the ggplot usage paradigm.
Another alternative is to use a new name, say, gsk3plot; you never have to worry about other versions, it's not too hard to remember, and you can make alternatives to plot to your heart's content without having to worry about conflicts. This is probably what I'd choose: it makes clear to the audience that these plots are customizable, that this is a function which makes the plot the way you prefer, and that, if they are so inclined, they could dig in and do the same thing.

ggplot and ggplot2 methods generally expect the data to come to them in melt()-ed (long) form. So your methods may need to melt the data first (melt() is in the reshape2 package, not plyr) and then map the resulting column names to aesthetics in the ggplot calls.
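For example, a minimal sketch (column and object names are made up) of reshaping a couple of series held in wide form into the long form ggplot prefers:
library(reshape2)
library(ggplot2)

# Hypothetical wide data: one column per series
wide <- data.frame(time = 1:24,
                   observed = rnorm(24),
                   fitted   = rnorm(24))

# melt() to long form: one row per (time, series) pair
long <- melt(wide, id.vars = "time",
             variable.name = "series", value.name = "value")

ggplot(long, aes(x = time, y = value, colour = series)) + geom_line()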

Related

R function: removing objects from workspace

In my function, I need to store some objects in the workspace using e.g.
matrix <<- mean(matrix)
as I am referring to that object in other functions nested within the "global function". However, I would like to delete these objects at the end of the "global function". How can this be achieved? rm() does not work...
UPDATE:
I outsourced the plotting part of my function and am sourcing this within the "higher order function".
The plotting function sort of looks like this
plot_minaverage <- function(minaverage){
  for_minaverage_plot.time <- rep(seq(1, 1440), 2)
  seq <- seq(start.time * 60, length.out = 1440)
  minaverage_plot_time <- for_minaverage_plot.time[seq]
  minaverage_plot_df <- data.frame(minaverage_plot_time, minaverage)
  pp <- ggplot(minaverage_plot_df, aes(x = minaverage_plot_time, y = minaverage)) +
    geom_bar(stat = "identity", width = 1, position = position_dodge(width = 0.5)) +
    theme_bw()
  print(pp)
}
The problem I have is that minaverage is computed in the "higher order" function, and when I do not store it in the workspace as mentioned above, the plotting function cannot access it.
Thanks so much for your thoughts on this!
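(A sketch of the usual alternative, with hypothetical helper names: compute minaverage inside the higher-order function and pass it down as an argument, so nothing has to be written to, or removed from, the workspace.)
higher_order <- function(...) {
  minaverage <- compute_minaverage(...)  # hypothetical helper that builds minaverage
  plot_minaverage(minaverage)            # passed as an argument, no <<- required
  invisible(NULL)                        # nothing is left behind in the global workspace
}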

How to programmatically overlap arbitrary stat_functions in ggplot?

I am looking for a way to automatically plot an arbitrary number of stat_function objects in a single ggplot, each one with a different set of parameters, and coloring them.
Initially I thought of having one big data.table with a large number of samples from each distribution, each set associated with an index, and using geom_density, grouping and coloring by the index.
This is, however, very inefficient. There is, in my opinion, no need to spend time and memory to produce and keep large sets of values if we already have parameters that perfectly describe each distribution.
I present my initial solution below, but is there a more elegant and/or practical way of doing this?
library(data.table)
library(ggplot2)

distrData.dt <- data.table(Shape = c(2.1, 2.2, 2.3), Scale = c(1.1, 1.2, 1.3), time = c(1, 2, 3))

ggplot(data.table(x = c(0:15)), aes(x)) +
  apply(distrData.dt, 1, FUN = function(x)
    stat_function(fun = dgamma,
                  arg = list(shape = as.numeric(x[1]), scale = as.numeric(x[2])),
                  mapping = aes_string(color = x[3]))) +
  scale_colour_gradient("Time Step", low = "blue", high = "red", space = "Lab")
This is the current result (plot not reproduced here): one gamma density curve per parameter set, coloured by time step.
It produces the main result, that is, it will plot as many "perfect" densities as the number of parameter sets you give it. However, I am not using aesthetics to pass parameters from the column names ("Shape" and "Scale") or to get the color of each line. As far as I understand, that is not possible, but is there another way?
First of all, your solution is absolutely fine by me: it does the job, and it does it elegantly. I just wanted both to expand on @joran's comment and to show one useful trick called a "function factory", which is perfectly suited to a case like yours.
So I'm building a function that returns a function with fixed parameters. Note that using force() prevents shape and scale from being lazily evaluated, which is necessary since we'll be using a for loop.
I'm using data.frame instead of data.table, but there shouldn't be a significant difference. The vector("list", n) construction preallocates space for a list, as described in ?list. I don't think it's obligatory in this particular case (significant overhead would only appear for lengths of, say, more than 100, unlikely here), but it's always better to avoid iteratively growing objects; that's a bad practice.
As a last remark, look at the stat_function call: it seems reasonably readable, and at least you can see which part is the mapping and which part relates to the dgamma parameters.
dgamma_factory <- function(shape, scale) {
  force(shape)
  force(scale)
  function(x) dgamma(x, shape = shape, scale = scale)
}

l <- vector("list", nrow(distrData.dt))
for (i in seq.int(nrow(distrData.dt))) {
  params <- distrData.dt[i, ]
  l[[i]] <- stat_function(
    fun = dgamma_factory(params$Shape, params$Scale),
    mapping = aes_string(color = params$time))
}

ggplot(data.frame(x = c(0:15)), aes(x)) +
  l +
  scale_colour_gradient("Time Step", low = "blue", high = "red", space = "Lab")
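To see why force() matters here, a small sketch (not part of the original answer) of the lazy-evaluation pitfall it guards against:
# Without force(), the inner function keeps an unevaluated promise for `s`;
# by the time the function is called, `s` has reached its final loop value.
make_bad <- function(shape) function(x) dgamma(x, shape = shape, scale = 1)
fs <- vector("list", 3)
for (s in 1:3) fs[[s]] <- make_bad(s)
fs[[1]](1)  # uses shape = 3, not shape = 1, because `s` is only looked up now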

Cannot save plots as pdf when ggplot function is called inside a function

I am plotting a boxplot from a 4-column matrix pl1 using ggplot, with dots overlaid on each box. The plotting call looks like this:
p1 <- ggplot(pl1, aes(x = factor(Edge_n), y = get(make.names(y_label)),
                      ymax = max(get(make.names(y_label))) * 1.05)) +
  geom_boxplot(aes(fill = method), outlier.shape = NA) +
  theme(text = element_text(size = 20), aspect.ratio = 1) +
  xlab("Number of edges") +
  ylab(y_label) +
  scale_fill_manual(values = color_box) +
  geom_point(aes(x = factor(Edge_n), y = get(make.names(true_des)),
                 ymax = max(get(make.names(true_des))) * 1.05, color = method),
             position = position_dodge(width = 0.75)) +
  scale_color_manual(values = color_pnt)
Then, I use print(p1) to print it on an opened pdf. However, this does not work for me and I get the below error:
Error in make.names(true_des) : object 'true_des' not found
Can anyone help?
Your example is not very clear: you give a call, but you don't show the values of your variables, so it's hard to figure out what you're trying to do. For instance, is method the name of a column in the data frame pl1, or is it a variable? And if it's a variable, what is its type: a string, a name?
Nonetheless, here's an example that should help set you on the way to doing what you want. Try something like this:
pl1 <- data.frame(Edge_n = sample(5, 20, TRUE), foo = rnorm(20), bar = rnorm(20))
y_label <- 'foo'

ax <- do.call(aes, list(
  x = quote(factor(Edge_n)),
  y = as.name(y_label),
  ymax = substitute(max(y) * 1.05, list(y = as.name(y_label)))))

p1 <- ggplot(pl1) + geom_boxplot(ax)
print(p1)
This should get you started to figuring out the rest of what you're trying to do.
Alternatively, a different interpretation of your question is that you may be running into a problem with the environment in which aes evaluates its arguments. See https://github.com/hadley/ggplot2/issues/743 for details. If this is the issue, then the answer might be to override the default value of the environment argument to aes, for instance: aes(x=factor(Edge_n), y=get(make.names(y_label)), ymax=max(get(make.names(y_label)))*1.05, environment=environment())

How to deal with a lot of plots in R

I have a for loop which produces 60 plots. I would like to save all these plots in only one file.
If I set par(mfrow=c(10,6)) it says: Error in plot.new() : figure margins too large
What can I do?
My code is as follows:
pdf(file = "figure.pdf")
par(mfrow = c(10, 6))
for (i in 1:60) {
  x <- rnorm(100)
  y <- rnorm(100)
  plot(x, y)
}
dev.off()
Your default plot, as produced in the loop, does not use the space very effectively. If you look at just a single plot, you can see it has large margins, both between the axis and the page edge and between the plot area and the axis text. In effect, a lot of the page is hogged by whitespace.
Secondly, the default pdf() function creates small pages, 7 by 7 inches. That is not a large sheet to plot on.
Trying to fit a 10 x 6 or 12 x 5 grid onto 7 by 7 inches therefore means squeezing a lot of whitespace into very little space.
For it to succeed, you must look into the margin options of par, which are mar, mai, oma and omi, and probably some more. Consult the documentation with the command
?par
In addition to this, you could consider not displaying axis-text, tick-marks, tick-labels and titles for every one of the 60 sub-plots, as this too will save you space.
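Putting those pieces together, a rough sketch (not from the original answer) of a larger PDF page with squeezed margins and suppressed axis annotation:
pdf(file = "figure.pdf", width = 12, height = 20)  # larger page than the 7 x 7 inch default
par(mfrow = c(10, 6),
    mar = c(1, 1, 0.5, 0.5),  # small inner margins around each sub-plot (in lines)
    oma = c(2, 2, 1, 1))      # one outer margin for the whole page
for (i in 1:60) {
  x <- rnorm(100)
  y <- rnorm(100)
  plot(x, y, axes = FALSE, ann = FALSE)  # drop axis text and titles to save space
  box()
}
dev.off()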
But somebody has already gone through some of this trouble for you. Look into the lattice-package or ggplot2, which has some excellent methods for making table-like subplots.
But there is another pressing issue: What are you trying to display with 60 subplots?
Update
Seeing what you are trying to do, here is a small example of faceting in ggplot2. It uses the Tufte-theme from jrnold's ggthemes, which is copied into here and then modified slightly in the line after the function.
library(ggplot2)
library(scales)
library(grid)   # for unit()

#### Set up the `theme` for the plot, i.e. the appearance of background, lines, margins, etc.
## This function returns a theme object, which ggplot2 uses to control the appearance.
theme_tufte <- function(ticks = TRUE, base_family = "serif", base_size = 11) {
  ret <- theme_bw(base_family = base_family, base_size = base_size) +
    theme(
      legend.background = element_blank(),
      legend.key = element_blank(),
      panel.background = element_blank(),
      panel.border = element_blank(),
      strip.background = element_blank(),
      plot.background = element_blank(),
      axis.line = element_blank(),
      panel.grid = element_blank())
  if (!ticks) {
    ret <- ret + theme(axis.ticks = element_blank())
  }
  ret
}

## Here I modify the theme returned from the function,
theme <- theme_tufte() +
  theme(panel.margin = unit(c(0, 0, 0, 0), 'lines'),
        panel.border = element_rect(colour = 'grey', fill = NA))
## and instruct ggplot2 to use this theme as default.
theme_set(theme)

#### Some data generation.
size <- 60 * 30
data <- data.frame(x = runif(size), y = rexp(size) + rnorm(size), mdl = sample(60, size, replace = TRUE))

#### Main plotting routine.
ggplot(data, aes(x, y, group = mdl)) +  ## base state of the plot, used on all layers: which data and which mappings (x uses the x variable, y uses the y variable)
  geom_point() +                        ## a layer that renders the data as points, creating the scatterplot
  stat_quantile(formula = y ~ x) +      ## another layer that adds some statistics, here the 25%, 50% and 75% quantile lines
  facet_wrap(~ mdl, ncol = 6)           ## without this, all groups would be displayed in one large plot; this breaks it up by the `mdl` variable
The usual challenge in using ggplot2 is restructuring all your data into data.frames. For this task, the reshape2 and plyr packages might be useful.
In your case, I would imagine that the function creating each subplot both calculates the estimate and creates the plot. This means you would have to split it up: calculate the estimate, return it as a data.frame, then collate the results and pass them to ggplot.
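A minimal sketch of that split, with a made-up estimation step standing in for whatever your real function computes:
library(ggplot2)

# Hypothetical: compute one model's estimate and return it as a data.frame
estimate_one <- function(id) {
  x <- runif(30)
  data.frame(mdl = id, x = x, y = 2 * x + rnorm(30))
}

# Collate the per-model results into one long data.frame, then facet
results <- do.call(rbind, lapply(1:60, estimate_one))
ggplot(results, aes(x, y)) +
  geom_point() +
  facet_wrap(~ mdl, ncol = 6)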
Output the plots to a pdf:
X <- matrix(rnorm(60 * 100), ncol = 60)
Y <- matrix(rnorm(60 * 100), ncol = 60)
pdf(file = "fileName.pdf")
for (j in 1:60) {
  plot(X[, j], Y[, j])
}
dev.off()
For placing many plots on a page or document (and I have created images with literally thousands of plots in them), it is convenient to separate the work between R--which creates the plots individually--and other software which is better suited for arranging arrays of things. If this reminds you of spreadsheets or word processing tables, then we are thinking alike.
This page (a screenshot from a PDF file, not reproduced here) contains over 200 statistical graphics. Although it has been greatly reduced (to 40% nominal size) in order to obscure proprietary data, the original has all the detail of the original R graphics and can be zoomed to 1600% without problem.
Two mechanisms have worked reasonably well. For up to several hundred plots, a little macro to import and re-sequence a set of bitmapped image files (.emf or .wmf) into a Word document does fine. For better control, I turn to a comparable Excel macro. It is driven by a sheet that is empty of everything except a row with column headers and a column with row headers. (You can see them at the left and top of the figure.) The macro deletes everything else on that sheet (except for formatting), then munges each possible combination of row and column header into a file name and if it finds that file, it imports it into the corresponding cell. The whole operation takes just a few seconds for several thousand images.
Obviously this communication mechanism between R and the other software is primitive, consisting of a collection of image files having a standard naming convention. But the code needed to implement it all is brief (albeit customized to each situation) and it works reliably. For example, if you encapsulate the plotting code within a function, then it will be called within a loop to create many similar plots. At the end of that function add a few lines to save the plot to a file, something like this:
path <- "W: <whatever>/"                  # Folder for the output files
ext <- "wmf"  # or "emf" or "png" or ...  # Format (and extension) of the output
...
if (save) {
  outfile <- paste(path, paste(munge(well), munge(parm), sep = "_"), sep = "/")
  outfile <- paste(outfile, ext, sep = ".")
  savePlot(filename = outfile, type = ext)
}
In this case each plot is identified by two loop variables, well and parm, both of which are strings (they correspond to the column and row headers). The function for creating acceptable filenames merely strips out punctuation, replacing it by an anodyne placeholder:
munge <- function(s) gsub("[[:punct:]]", "_", s)
Once those images have been imported into Word, Excel, or wherever you like, it's fairly easy to reorganize them, place other material around them, etc., and then print the result in PDF format.
There is an art to creating these very large "small multiples" (in Tufte's terminology). To the extent possible, it helps to follow Tufte's principle of increasing the data:ink ratio by erasing inessential material. That makes graphical patterns clear even when the tableau has been greatly reduced in size in order to comprehend all its rows and columns at once. Although the preceding figure is a poor example--the individual plots had to have axes, gridlines, labels, and so on so that they can be read in detail when zoomed--the power of this method to reveal patterns is clear even at this scale. It is crucial to make the plots comparable to one another. In this example, which consists of time series, every plot has the same range on the x-axis; within each row (which corresponds to a different type of observation), the ranges on the y-axes are the same; and all color schemes and methods of symbolization are the same throughout.
You could also use knitr. I didn't instantly get this working with base graphics (and I've got to run now), but using ggplot works easily.
\documentclass{article}
\begin{document}

<<echo = FALSE, fig.keep='high', fig.height=3, fig.width=4>>=
require(ggplot2)
for (i in 1:10) print(ggplot(mtcars, aes(x = disp, y = mpg)) + geom_point())
@

\end{document}
The above code will produce a nice multi-page pdf with all the graphs.
For a very simple solution to this type of issue, I found that opening a large windows() device makes the window big enough for many uses.
windows(50, 50)
par(mfrow = c(10, 6))
for (i in 1:60) {
  x <- rnorm(100)
  y <- rnorm(100)
  plot(x, y)
}
Or in my case,
windows(20,20)
plot(Plotting_I_Need_In_Rows_of_4, mfrow=c(4,4))

How can I handle R CMD check "no visible binding for global variable" notes when my ggplot2 syntax is sensible?

EDIT: Hadley Wickham points out that I misspoke. R CMD check is throwing NOTES, not Warnings. I'm terribly sorry for the confusion. It was my oversight.
The short version
R CMD check throws this note every time I use sensible plot-creation syntax in ggplot2:
no visible binding for global variable [variable name]
I understand why R CMD check does that, but it seems to be criminalizing an entire vein of otherwise sensible syntax. I'm not sure what steps to take to get my package to pass R CMD check and get admitted to CRAN.
The background
Sascha Epskamp previously posted on essentially the same issue. The difference, I think, is that subset()'s manpage says it's designed for interactive use.
In my case, the issue is not over subset() but over a core feature of ggplot2: the data = argument.
An example of code I write that generates these notes
Here's a sub-function in my package that adds points to a plot:
JitteredResponsesByContrast <- function (data) {
  return(
    geom_point(
      aes(
        x = x.values,
        y = y.values
      ),
      data = data,
      position = position_jitter(height = 0, width = GetDegreeOfJitter(jj))
    )
  )
}
R CMD check, on parsing this code, will say
granovagg.contr : JitteredResponsesByContrast: no visible binding for
global variable 'x.values'
granovagg.contr : JitteredResponsesByContrast: no visible binding for
global variable 'y.values'
Why R CMD check is right
The check is technically correct: x.values and y.values aren't defined locally in the function JitteredResponsesByContrast(), nor are they pre-defined in the form x.values <- [something], either globally or in the caller. Instead, they're variables within a data frame that gets defined earlier and passed into the function JitteredResponsesByContrast().
Why ggplot2 makes it difficult to appease R CMD check
ggplot2 seems to encourage the use of a data argument. The data argument, presumably, is why this code will execute
library(ggplot2)
p <- ggplot(aes(x = hwy, y = cty), data = mpg)
p + geom_point()
but this code will produce an object-not-found error:
library(ggplot2)
hwy # a variable in the mpg dataset
Two work-arounds, and why I'm happy with neither
The NULLing out strategy
Matthew Dowle recommends setting the problematic variables to NULL first, which in my case would look like this:
JitteredResponsesByContrast <- function (data) {
  x.values <- y.values <- NULL # Setting the variables to NULL first
  return(
    geom_point(
      aes(
        x = x.values,
        y = y.values
      ),
      data = data,
      position = position_jitter(height = 0, width = GetDegreeOfJitter(jj))
    )
  )
}
I appreciate this solution, but I dislike it for three reasons:
1. It serves no additional purpose beyond appeasing R CMD check.
2. It doesn't reflect intent. It raises the expectation that the aes() call will see our now-NULL variables (it won't), while obscuring the real purpose (making R CMD check aware of variables it apparently wouldn't otherwise know were bound).
3. The problems of 1 and 2 multiply because every time you write a function that returns a plot element, you have to add a confusing NULLing statement.
The with() strategy
You can use with() to explicitly signal that the variables in question can be found inside some larger environment. In my case, using with() looks like this:
JitteredResponsesByContrast <- function (data) {
with(data, {
geom_point(
aes(
x = x.values,
y = y.values
),
data = data,
position = position_jitter(height = 0, width = GetDegreeOfJitter(jj))
)
}
)
}
This solution works. But, I don't like this solution because it doesn't even work the way I would expect it to. If with() were really solving the problem of pointing the interpreter to where the variables are, then I shouldn't even need the data = argument. But, with() doesn't work that way:
library(ggplot2)
p <- ggplot()
p <- p + with(mpg, geom_point(aes(x = hwy, y = cty)))
p # will generate an error saying `hwy` is not found
So, again, I think this solution has similar flaws to the NULLing strategy:
1. I still have to go through every plot element function and wrap the logic in a with() call.
2. The with() call is misleading. I still need to supply a data = argument; all with() is doing is appeasing R CMD check.
Conclusion
The way I see it, there are three options I could take:
1. Lobby CRAN to ignore the notes by arguing that they're "spurious" (pursuant to CRAN policy), and do that every time I submit a package.
2. Fix my code with one of two undesirable strategies (NULLing or with() blocks).
3. Hum really loudly and hope the problem goes away.
None of the three make me happy, and I'm wondering what people suggest I (and other package developers wanting to tap into ggplot2) should do.
You have two solutions:
1. Rewrite your code to avoid non-standard evaluation. For ggplot2, this means using aes_string() instead of aes() (as described by Harlan).
2. Add a call to globalVariables(c("x.values", "y.values")) somewhere in the top level of your package.
You should strive for 0 NOTES in your package when submitting to CRAN, even if you have to do something slightly hacky. This makes life easier for CRAN, and easier for you.
(Updated 2014-12-31 to reflect my latest thoughts on this)
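For instance, a sketch of the first option applied to the helper from the question (GetDegreeOfJitter() and jj are carried over from the original code):
JitteredResponsesByContrast <- function(data) {
  geom_point(
    aes_string(x = "x.values", y = "y.values"),  # column names as strings, so no unbound symbols
    data = data,
    position = position_jitter(height = 0, width = GetDegreeOfJitter(jj))
  )
}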
Have you tried aes_string instead of aes? This should work, although I haven't tried it:
aes_string(x = 'x.values', y = 'y.values')
This question has been asked and answered a while ago but just for your information, since version 2.1.0 there is another way to get around the notes: aes_(x=~x.values,y=~y.values).
In 2019, the best way to get around this is to use the .data prefix from the rlang package, which is also re-exported by ggplot2. This tells R to treat x.values and y.values as columns in a data.frame (so it won't complain about undefined variables).
Note: this works best if you have predefined column names that you know will exist in your data input.
#' @importFrom ggplot2 .data
my_func <- function(data) {
  ggplot(data, aes(x = .data$x, y = .data$y))
}
EDIT: Updated to import .data from ggplot2 instead of rlang, based on @Noah's comment.
If getRversion() >= "3.1.0", you can add a call at the top level of the package:
utils::suppressForeignCheck(c("x.values", "y.values"))
(from help("suppressForeignCheck"))
Add this line of code to the file in which you provide package-level documentation:
if(getRversion() >= "2.15.1") utils::globalVariables(c("."))
Example here
How about using get()?
geom_point(
  aes(
    x = get('x.values'),
    y = get('y.values')
  ),
  data = data,
  position = position_jitter(height = 0, width = GetDegreeOfJitter(jj))
)
Because the manual for ?aes_string says:
"All these functions are soft-deprecated. Please use tidy evaluation idioms instead (see the quasiquotation section in aes() documentation)."
So I read that page, and came up with this pattern:
ggplot2::aes(x = !!quote(x.values),
y = !!quote(y.values))
It is about as fugly as an IIFE, and mixes base expressions with tidy bang-bangs. But it does not require the global-variables workaround either, and doesn't use anything that is deprecated, as far as I can tell. It seems like it also works with calculations in aesthetics and with derived variables like ..count..
