Using Reduce to add layers to a ggplot - r

I have a question similar to this one about using multiple dataframes when plotting with ggplot. I would like to create a base plot and then add data using a list of dataframes (rationale/use case described below).
library(ggplot2)
# generate some data and put it in a list
df1 <- data.frame(p=c(10,8,7,3,2,6,7,8),v=c(100,300,150,400,450,250,150,400))
df2 <- data.frame(p=c(10,8,6,4), v=c(150,250,350,400))
df3 <- data.frame(p=c(9,7,5,3), v=c(170,200,340,490))
l <- list(df1,df2,df3)
#create a layer-adding function
addlayer <-function(df,plt=p){
plt <- plt + geom_point(data=df, aes(x=p,y=v))
plt
}
#for loop works
p <- ggplot()
for(i in l){
p <- addlayer(i)
}
#Reduce throws an error
p <- ggplot()
gg <- Reduce(addlayer,l)
Error in as.vector(x, mode) :
cannot coerce type 'environment' to vector of type 'any'
Called from: as.vector(e2)
In writing out this example I realize that the for loop is not a bad option but wouldn't mind the conciseness of Reduce, especially if I want to chain several functions together.
For those who are interested my use case is to draw a number of unconnected lines between points on a map. From a reference dataframe I figured the most concise way to map was to generate a list of subsetted dataframes, each of which corresponds to a single line. I don't want them connected so geom_path is no good.

This seems to work,
addlayer <-function(a, b){
a + geom_point(data=b, aes(x=p,y=v))
}
Reduce(addlayer, l, init=ggplot())
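A small sketch of why the init argument matters here, using strings instead of plots so it runs in base R. The fold function f and the labels are made up for illustration. Reduce calls f(accumulator, element), and without init the first list element becomes the accumulator, which is why the original call ended up trying to add a layer to a data frame.

```r
# Hypothetical fold function: records the order in which Reduce combines things
f <- function(acc, x) paste0("(", acc, "+", x, ")")

items <- list("df1", "df2", "df3")

# Without init, the first element ("df1") is used as the starting accumulator
no_init <- Reduce(f, items)

# With init, the accumulator starts as "plot" and every element is added to it
with_init <- Reduce(f, items, "plot")

print(no_init)
print(with_init)
```

In the ggplot case the accumulator has to be the plot, so the plot goes in as init and the layer-adding function takes the plot as its first argument.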
Note that you can also use a list of layers,
ggplot() + lapply(l, geom_point, mapping = aes(x=p,y=v))
However, neither of those two strategies is to be recommended; ggplot2 is perfectly capable of drawing multiple unconnected lines in a single layer (using e.g. the group argument). It is more efficient, and cleaner code.
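library(plyr) # provides ldply()
names(l) <- 1:3
m <- ldply(l, I) # stack the data frames; the list names end up in the .id column
ggplot(m, aes(p, v, group=.id)) + geom_line()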
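If plyr is not available, the same single-layer idea can be sketched in base R. This is only an illustration: the column name id and the object names are assumptions, not part of the answer above.

```r
library(ggplot2)

df1 <- data.frame(p = c(10, 8, 7, 3, 2, 6, 7, 8),
                  v = c(100, 300, 150, 400, 450, 250, 150, 400))
df2 <- data.frame(p = c(10, 8, 6, 4), v = c(150, 250, 350, 400))
df3 <- data.frame(p = c(9, 7, 5, 3), v = c(170, 200, 340, 490))
l <- list(df1, df2, df3)

# Tag each data frame with its position in the list, then stack them
tagged <- Map(function(d, id) cbind(d, id = id), l, seq_along(l))
m <- do.call(rbind, tagged)
m$id <- factor(m$id)

# One layer; the group aesthetic keeps the three lines unconnected
gg <- ggplot(m, aes(p, v, group = id)) + geom_line()
```

geom_path accepts the same group aesthetic if the points should be joined in row order rather than sorted by x.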

Related

R GGPLOT2 lapply and function not finding object?

I hope I can get a contextual clue as to what may be wrong here without providing the data frame (I can provide it if necessary). Ultimately I want to use lapply to create multiple boxplots across multiple Ys and the same X, but I get the following error, even though Termed is definitely in my CMrecruitdat data.frame:
Error in aes_string(x = Termed, y = RecVar, fill = Termed) :
object 'Termed' not found
RecVar <- CMrecruitdat[,c("Req.Open.To.System.Entry", "Req.Open.To.Hire", "Tenure")]
BP <- function (RecVar){
require(ggplot2)
ggplot(CMrecruitdat, aes_string(x=Termed, y=RecVar, fill=Termed))+
geom_boxplot()+
guides(fill=false)
}
lapply(RecVar, FUN=BP)
If you use aes_string, you should pass column names as strings rather than the column vectors themselves, and use strings for all your fields.
library(ggplot2)
# column names to plot on the y axis (names, not the columns themselves)
RecVar <- c("Req.Open.To.System.Entry", "Req.Open.To.Hire", "Tenure")
BP <- function (yvar){
ggplot(CMrecruitdat, aes_string(x="Termed", y=yvar, fill="Termed"))+
geom_boxplot()+
guides(fill="none")
}
lapply(RecVar, FUN=BP)
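A hedged alternative sketch, assuming ggplot2 3.0.0 or later, where the .data pronoun can be used inside aes() to look up a column by its name. The data frame dat below is a made-up stand-in for CMrecruitdat, and bp is a hypothetical helper name.

```r
library(ggplot2)

# Made-up stand-in for CMrecruitdat (the real data is not shown in the question)
set.seed(42)
dat <- data.frame(
  Termed = rep(c("Yes", "No"), each = 10),
  Req.Open.To.Hire = runif(20, 10, 60),
  Tenure = runif(20, 0, 5)
)

# Build one boxplot for the column named by yvar
bp <- function(yvar) {
  ggplot(dat, aes(x = Termed, y = .data[[yvar]], fill = Termed)) +
    geom_boxplot() +
    labs(y = yvar) +
    guides(fill = "none")
}

yvars <- c("Req.Open.To.Hire", "Tenure")
plots <- setNames(lapply(yvars, bp), yvars)
```

Naming the list by column makes it easy to pull out one plot later, for example plots[["Tenure"]].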

Adding to a list object in a mapping function in R

I am creating a scatter plot matrix using GGally::ggpairs. I am using a custom function (below called my_fn) to create the bottom-left non-diagonal subplots. In the process of calling that custom function, there is information about each of these subplots that is calculated, and that I would like to store for later.
In the example below, each h@cID is an int[] structure with 100 values. In total, it is created 10 times in my_fn (once for each of the 10 bottom-left non-diagonal subplots). I am trying to store all 10 of these h@cID structures in the listCID list object.
I have not had success with this approach, and I have tried a few other variants (such as trying to put listCID as an input parameter to my_fn, or trying to return it in the end).
Is it possible for me to store the ten h@cID structures efficiently through my_fn to be used later? I feel there are several syntax issues that I am not entirely familiar with that may explain why I am stuck, and likewise I would be happy to change the title of this question if I am not using appropriate terminology. Thank you!
library(hexbin)
library(GGally)
library(ggplot2)
set.seed(1)
bindata <- data.frame(
ID = paste0("ID", 1:100),
A = rnorm(100), B = rnorm(100), C = rnorm(100),
D = rnorm(100), E = rnorm(100))
bindata$ID <- as.character(bindata$ID)
maxVal <- max(abs(bindata[ ,2:6]))
maxRange <- c(-1 * maxVal, maxVal)
listCID <- c()
my_fn <- function(data, mapping, ...){
x <- data[ ,c(as.character(mapping$x))]
y <- data[ ,c(as.character(mapping$y))]
h <- hexbin(x=x, y=y, xbins=5, shape=1, IDs=TRUE,
xbnds=maxRange, ybnds=maxRange)
hexdf <- data.frame(hcell2xy(h), hexID=h@cell, counts=h@count)
listCID <- c(listCID, h@cID)
print(listCID)
p <- ggplot(hexdf, aes(x=x, y=y, fill=counts, hexID=hexID)) +
geom_hex(stat="identity")
p
}
p <- ggpairs(bindata[ ,2:6], lower=list(continuous=my_fn))
p
If I understand your problem correctly this is quite easily, albeit inelegantly, achieved using the <<- operator.
With it you may assign something like a global variable from inside the scope of your function.
Set listCID <- NULL before executing the function and use listCID <<- c(listCID, h@cID) inside the function.
listCID = NULL
my_fn <- function(data, mapping, ...){
x = data[,c(as.character(mapping$x))]
y = data[,c(as.character(mapping$y))]
h <- hexbin(x=x, y=y, xbins=5, shape=1, IDs=TRUE, xbnds=maxRange, ybnds=maxRange)
hexdf <- data.frame(hcell2xy(h), hexID = h@cell, counts = h@count)
if(exists("listCID")) listCID <<- c(listCID, h@cID)
print(listCID)
p <- ggplot(hexdf, aes(x=x, y=y, fill = counts, hexID=hexID)) + geom_hex(stat="identity")
p
}
For more on scoping, please refer to Hadley Wickham's excellent Advanced R: http://adv-r.had.co.nz/Environments.html
In general it is not a good practice to try to return two different results with one function. In your case, you want to return the plot and the result of a calculation (the hexbin cIDs).
Better would be to calculate your results in steps. Each step would be a separate function. The result of the first function (calculating the hexbins) can then be used as an input for multiple follow-up functions (finding the cIDs and creating the plot). Next follows one of the many ways in which you could refactor your code:
calc_hexbins() in which you generate all the hexbins. This function could return a named list of hexbins (e.g. list(AB = h1, AC = h2, BC = h3)). This is achieved by enumerating all the possible combinations of your list (A, B, C, D and E). The drawback is that you are duplicating some of the logic that is already in ggpairs().
gen_cids() takes the hexbins as an input and generates all the cIDs. This is a simple operation where you loop (or lapply) over all the elements in your list and take the cID.
create_plot() also takes the hexbins as an input and this is the function in which you actually generate the plot. Here you can add an extra parameter for the list of hexbins (there is a function wrap() in your package GGally to do this). Instead of calculating the hexbins, you can look them up in the named list that you've generated earlier by combining the A and the B in a string.
This avoids hacky methods such as working with attributes or using global variables. These work of course, but are often a headache when maintaining code. Unfortunately, this will also make your code a little longer, but this can be a good thing.
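The first two steps above could be sketched roughly as follows. This is only an illustration under assumptions: the function bodies, the three-column data frame dat, and the "AB"-style names are made up here, and the x/y order within each pair may need to be swapped to match how ggpairs lays out its lower panels.

```r
library(hexbin)

set.seed(1)
# Smaller made-up data set with three columns instead of five
dat <- data.frame(A = rnorm(100), B = rnorm(100), C = rnorm(100))
max_val <- max(abs(dat))
max_range <- c(-1 * max_val, max_val)

# Step 1: one hexbin object per pair of columns, stored in a named list
calc_hexbins <- function(data, bounds) {
  pairs <- combn(names(data), 2, simplify = FALSE)
  hb <- lapply(pairs, function(pr) {
    hexbin(x = data[[pr[1]]], y = data[[pr[2]]], xbins = 5, shape = 1,
           IDs = TRUE, xbnds = bounds, ybnds = bounds)
  })
  names(hb) <- vapply(pairs, paste, character(1), collapse = "")
  hb
}

# Step 2: pull the cell IDs out of each hexbin object
gen_cids <- function(hexbins) lapply(hexbins, function(h) h@cID)

hexbins <- calc_hexbins(dat, max_range)
cids <- gen_cids(hexbins)
```

A plotting function can then look up hexbins[["AB"]] by name instead of recomputing the bins, so nothing needs to be written to a global variable.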

for loop with ggplots produces graphs with identical values but different headings

I have read lots of posts about using loops for ggplot to generate lots of graphs, but cannot find any that explain my problem...
I have a dataframe and am trying to loop over 92 columns, creating a new graph for each column. I want to save each plot as a separate object. When I run my loop (code below) and print the graphs, all the graphs are correct. However, when I replace the print() command with assign(), the graphs are not correct. The titles change as they should, but the plotted values are all identical (they are all the values for the final graph). I found this out because when I used plot_grid() to generate a figure of 10 plots, the graph titles and axis labels were all correct, but the values were identical!
My data set is large, so I have provided a small data set for illustration below.
Sample dataframe:
library(ggplot2)
library(cowplot)
df <- as.data.frame(cbind(group=c(rep("A", 4), rep("B", 4)), a=sample(1:100, 8), b=sample(100:200, 8), c=sample(300:400, 8))) #make data frame
cols <- 2:4 #define columns for plots
for(i in 1:length(cols)){
df[,cols[i]] <- as.numeric(as.character(df[,cols[i]]))
} #convert columns to numeric
Plots:
for (i in 1:length(cols)){
g <- ggplot(df, aes(x=group, y=df[,cols[i]])) +
geom_boxplot() +
ggtitle(colnames(df)[cols[i]])
print(g)
assign(colnames(df)[cols[i]], g) #generate an object for each plot
}
plot_grid(a, b, c)
I am thinking that when ggplot makes the plot, it only renders the data from the final value of i? Or something like that? Is there a way around this?
I wish to do it like this, as there are a lot of graphs I wish to make and then I want to mix and match plots for figures.
Thanks!
I have cleaned up how you generated your sample data frame.
library(ggplot2)
library(cowplot)
df <- data.frame(group=c(rep("A", 4), rep("B", 4)),
a=sample(1:100, 8),
b=sample(100:200, 8),
c=sample(300:400, 8)) #make data frame
Just using data.frame() will suffice. This makes your code clearer and avoids the post-processing 'for loop' that converts the columns back to numeric. The conversion was only needed because cbind() on a mix of character and numeric vectors builds a character matrix, which as.data.frame() then turns into factors (or into strings, with stringsAsFactors = FALSE). Using cbind.data.frame() rather than cbind() would also have avoided the problem.
I have also refactored the 'for loop' that generates your plots. You create a vector of integers called 'cols' (cols <- 2:4) and then iterate over it to generate a plot from each column of data. This is unnecessary: we can put the range directly in the for statement, 'for (i in 2:ncol(df))', which iterates from 2 to 4 (the number of columns in your dataframe). Starting from 2 skips column 1, which contains metadata. This is preferable because:
i) When reviewing your code the condition used is immediately apparent without searching through the rest of your code
ii) R has a number of functions/parameters similarly named to your variable 'cols' and it is best to avoid confusion.
With the code cleaned up we can now try to locate the cause of the bug:
library(ggplot2)
library(cowplot)
df <- data.frame(group=c(rep("A", 4), rep("B", 4)),
a=sample(1:100, 8),
b=sample(100:200, 8),
c=sample(300:400, 8)) #make data frame
for (i in 2:ncol(df)){
g <- ggplot(df, aes(x=group, y=df[,i])) +
geom_boxplot() +
ggtitle(colnames(df)[i])
print(g)
assign(colnames(df)[i], g) #generate an object for each plot
}
It's not immediately obvious why your code doesn't work. The suggestion by Imo has merit: saving your plots to a list would prevent your environment from getting cluttered with objects. However, it would not solve this bug. The cause is unintuitive: the expression df[,i] inside aes() is not evaluated when the plot object is created, only when the plot is drawn, so every stored plot looks up i at drawing time and finds its final value. See the answer provided here by Konrad Rudolph. The following should work and retains the style of your original code. As Konrad suggests in his answer, it might be more "R" like to use lapply. Note that we have given the body of the for loop local scope and that we now re-define i locally, so each plot keeps its own copy of i. Previously the last value of i generated in the loop was being used by every object created via the assign() function. Note the use of <<- to assign g to the global environment.
for (i in 2:ncol(df))
local({
i <- i
g <<- ggplot(df, aes(x=group, y=df[,i])) +
geom_boxplot() +
ggtitle(colnames(df)[i])
print(i)
print(g)
assign(colnames(df)[i], g, pos =1) #generate an object for each plot
})
plot_grid(a, b, c)
You owe me a drink.
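The late-lookup behaviour behind this bug can be reproduced without ggplot2. The following base R sketch (all names are illustrative) stores functions that refer to the loop variable, which is a reasonable analogy for a plot that stores the unevaluated expression df[,i].

```r
# Functions created in a for loop all see the same variable i
late <- list()
for (i in 1:3) late[[i]] <- function() i
late_vals <- vapply(late, function(f) f(), integer(1))

# With lapply, each call gets its own j, so each function keeps its own value
eager <- lapply(1:3, function(j) {
  force(j)        # evaluate j now rather than when the function is called
  function() j
})
eager_vals <- vapply(eager, function(f) f(), integer(1))

print(late_vals)
print(eager_vals)
```

After the loop finishes, i is 3, so every function in late returns 3, while the functions in eager return 1, 2 and 3. Wrapping the loop body in local() with i <- i achieves the same separation as the lapply version.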
There are two standard ways to deal with this problem:
1- Work with a long-format data.frame
2- Use aes_string to refer to variable names in the wide format data.frame
Here's an illustration of possible strategies.
library(ggplot2)
library(gridExtra)
# data from other answer
df <- data.frame(group=c(rep("A", 4), rep("B", 4)),
a=sample(1:100, 8),
b=sample(100:200, 8),
c=sample(300:400, 8))
## first method: long format
m <- reshape2::melt(df, id = "group")
p <- ggplot(m, aes(x=group, y=value)) +
geom_boxplot()
pl <- plyr::dlply(m, "variable", function(.d) p %+% .d + ggtitle(unique(.d$variable)))
grid.arrange(grobs=pl)
## second method: keep wide format
one_plot <- function(col = "a") ggplot(df, aes_string(x="group", y=col)) + geom_boxplot() + ggtitle(col)
pl <- plyr::llply(colnames(df)[-1], one_plot)
grid.arrange(grobs=pl)
## third method: more explicit looping
pl <- vector("list", length = ncol(df)-1)
for(ii in seq_along(pl)){
.col <- colnames(df)[-1][ii]
.p <- ggplot(df, aes_string(x="group", y=.col)) + geom_boxplot() + ggtitle(.col)
pl[[ii]] <- .p
}
grid.arrange(grobs=pl)
Sometimes, when wrapping a ggplot call inside a function or for loop, one faces issues with local variables (not the case here if aes_string is used). In such cases one can define a local environment.
Note that using a construct like aes(y=df[,i]) may appear to work, but can produce very wrong results. Consider a faceted plot: the data.frame will be split into different groups for each panel, and this subsetting can fail miserably to group the right data if numeric values are passed directly to aes() instead of variable names.

Plotting inside function: subset(df,id_==...) gives wrong plot, df[df$id_==...,] is right

I have a df with multiple y-series which I want to plot individually, so I wrote a fn that selects one particular series, assigns to a local variable dat, then plots it. However ggplot/geom_step when called inside the fn doesn't treat it properly like a single series. I don't see how this can be a scoping issue, since if dat wasn't visible, surely ggplot would fail?
You can verify the code is correct when executed from the toplevel environment, but not inside the function. This is not a duplicate question. I understand the problem (this is a recurring issue with ggplot), but I've read all the other answers; this is not a duplicate and they do not give the solution.
set.seed(1234)
require(ggplot2)
require(scales)
N = 10
df <- data.frame(x = 1:N,
id_ = c(rep(20,N), rep(25,N), rep(33,N)),
y = c(runif(N, 1.2e6, 2.9e6), runif(N, 5.8e5, 8.9e5) ,runif(N, 2.4e5, 3.3e5)),
row.names=NULL)
plot_series <- function(id_, envir=environment()) {
dat <- subset(df,id_==id_)
p <- ggplot(data=dat, mapping=aes(x,y), color='red') + geom_step()
# Unsuccessfully trying the approach from http://stackoverflow.com/questions/22287498/scoping-of-variables-in-aes-inside-a-function-in-ggplot
p$plot_env <- envir
plot(p)
# Displays wrongly whether we do the plot here inside fn, or return the object to parent environment
return(p)
}
# BAD: doesn't plot geom_step!
plot_series(20)
# GOOD! but what's causing the difference?
ggplot(data=subset(df,id_==20), mapping=aes(x,y), color='red') + geom_step()
#plot_series(25)
#plot_series(33)
This works fine:
plot_series <- function(id_) {
dat <- df[df$id_ == id_,]
p <- ggplot(data=dat, mapping=aes(x,y), color='red') + geom_step()
return(p)
}
print(plot_series(20))
If you simply step through the original function using debug, you'll quickly see that the subset line did not actually subset the data frame at all: it returned all rows!
Why? Because subset uses non-standard evaluation and you used the same name for both the column name and the function argument. As jlhoward demonstrates above, it would have worked (but probably not been advisable) to have simply used different names for the two.
The reason is that subset evaluates with the data frame first. So all it sees in the logical expression is the always true id_ == id_ within that data frame.
One way to think about it is to play dumb (like a computer) and ask yourself when presented with the condition id_ == id_ how do you know what exactly each symbol refers to. It's ambiguous, and subset makes a consistent choice: use what's in the data frame.
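A minimal base R sketch of the name clash, using a small made-up data frame (df_demo and the function names are illustrative only):

```r
# Small made-up data frame with the same column name as in the question
df_demo <- data.frame(id_ = c(20, 20, 25, 33), y = 1:4)

# Both sides of == resolve to the column, so the condition is always TRUE
subset_clash <- function(id_) subset(df_demo, id_ == id_)

# Standard evaluation: df_demo$id_ is the column, id_ is the function argument
bracket_ok <- function(id_) df_demo[df_demo$id_ == id_, ]

n_clash <- nrow(subset_clash(20))
n_ok <- nrow(bracket_ok(20))
print(c(clash = n_clash, ok = n_ok))
```

Here subset_clash(20) returns all 4 rows, while bracket_ok(20) returns only the 2 rows where id_ is 20.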
Notwithstanding the comments, this works:
plot_series <- function(z, envir=environment()) {
dat <- subset(df,id_==z)
p <- ggplot(data=dat, mapping=aes(x,y), color='red') + geom_step()
p$plot_env <- envir
plot(p)
# Displays wrongly whether we do the plot here inside fn, or return the object to parent environment
return(p)
}
plot_series(20)
The problem seems to be that subset interprets id_ on the RHS of the == as identical to the LHS, so this is equivalent to subsetting on TRUE, which of course includes all the rows of df. That's the plot you are seeing.

ggploting multiple graphs from a data list

I would like to do something along the lines of this post: R: saving ggplot2 plots in a list
The problem is I can't get it to work. I seem to be able to get the individual graphs but the facet_wrap throws out an error. I would be content with just outputting all the graphs and then saving them to disk as a jpg or something, so I can scroll through them later.
for(n in 1:5){
pdata <- data.frame(mt1[n])
library(ggplot2)
p <-ggplot(pdata, aes(x=variable, y=value, color=Legend, group=Legend))+ geom_line()+ facet_wrap(~ color)
}
Link to a dput of the data : mt1
Edit:
Added the whole correct file, its a bit long
If we omit the facet error due to a missing variable in your data frames, you can generate and save your plots in different files this way using ggsave :
for(n in 1:5){
pdata <- data.frame(mt1[n]) # better to use mt1[[n]]
p <-ggplot(pdata, aes(x=variable, y=value, color=Legend, group=Legend))+ geom_line()
ggsave(paste0("plot",n,".jpg"), p)
}
Some suggestions for improvement:
First, as @Dason points out, your library(ggplot2) call should be outside your loop.
Second, if you access an element of list by [.], then the result will still be a list. You should do instead: [[.]] which will render the data.frame(.) call unnecessary (as commented above in the code).
Third is a suggestion to use *apply family of functions. Here, using lapply.
To summarise all these points in code:
require(ggplot2) # load package outside once
o <- lapply(seq_along(mt1), function(idx) {
p <- ggplot(mt1[[idx]], aes(x = variable, y = value,
color = Legend, group = Legend))+ geom_line()
ggsave(paste0("plot",idx,".jpg"), p)
})
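The second suggestion, the difference between [ and [[ on a list, can be checked with a tiny base R sketch (lst is a made-up list, not the mt1 data):

```r
# Made-up list of two data frames
lst <- list(data.frame(a = 1:2), data.frame(a = 3:4))

single <- lst[1]    # still a list, of length 1
double <- lst[[1]]  # the data frame itself

print(class(single))
print(class(double))
```

Because lst[1] is still a list, it has to be converted with data.frame() before plotting, whereas lst[[1]] can be passed to ggplot() directly.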
