Using foreach() in R to speed up loop for ggplot2 - r

I would like to create a PDF file containing hundreds of plots in a certain order.
My strategy was using foreach() and storing each ggplot2 object into the output list, and then printing each ggplot2 object to the output file.
For example, I would like to plot a histogram of prices for each distinct value of "carat" in the diamonds dataset:
library(ggplot2)
library(plyr)
library(foreach) # for parallelization
library(doParallel) # for parallelization
#setup parallel backend to use 4 processors
cl<-makeCluster(4)
registerDoParallel(cl)
# use diamonds dataset
carats.summary <- ddply(diamonds, .(carat), summarise, count = length(carat))
m.list <- foreach(i = 1:length(carats.summary$carat),
.packages = "ggplot2") %dopar% {
jcarat = carats.summary$carat[i]
m <- ggplot(subset(diamonds, carat == jcarat), aes(x = price)) +
geom_histogram()
print(m)
}
With this code, I am hoping to create a list of ggplot2 objects which I can then save into a single pdf file (for example using pdf()) in an ordered manner (for example, in ascending carats).
However, running this results in an error message:
Error in serialize(data, node$con) : error writing to connection
I suspect this is due to the fact that if I tried to append the ggplot2 object to a list, I would get a warning message like this:
lst <- vector(mode = "list")
lst[1] <- m
Warning message:
In lst[1] <- m :
number of items to replace is not a multiple of replacement length
Although this is pure speculation and I could be wrong.
Does anybody have an idea how to use foreach() to save ggplot2 objects onto a list? Or some way to parallelize for loops involving ggplot2?
Thanks in advance.

You shouldn't be printing the object inside the loop, just create the ggplot object. Only print when you have the graphic device open that you want.
m.list <- foreach(i = 1:length(carats.summary$carat),
.packages = "ggplot2") %dopar% {
jcarat = carats.summary$carat[i]
ggplot(subset(diamonds, carat == jcarat), aes(x = price)) +
geom_histogram()
}
then you can get at them with
m.list[[1]]
etc...
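A minimal sketch of the final step, writing the stored plots to one PDF in list order. To keep it short, the list is built sequentially with lapply() for the three smallest carat values only, as a stand-in for the list returned by foreach(); the output file name is hypothetical.

```r
library(ggplot2)

# Stand-in for the foreach() result: one histogram per carat value,
# in ascending carat order (only the first three values here).
carat_values <- sort(unique(diamonds$carat))[1:3]
m.list <- lapply(carat_values, function(jcarat) {
  ggplot(subset(diamonds, carat == jcarat), aes(x = price)) +
    geom_histogram(bins = 30) +
    ggtitle(paste("carat =", jcarat))
})

# Open the PDF device once, print every stored plot (one page each),
# then close the device so the file is written.
out_file <- file.path(tempdir(), "carat_histograms.pdf")  # hypothetical name
pdf(out_file)
for (p in m.list) print(p)
dev.off()
```

Because the plots are printed in list order, sorting the carat values before building the list is what fixes the page order.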

Related

R - Defining a function which recognises arguments not as objects, but as being part of the call

I'm trying to define a function which returns a graphical object in R. The idea is that I can then call this function with different arguments multiple times using a for loop or lapply, and then plot the list of grobs with gridExtra::grid.arrange. However, I have not got that far yet. I'm having trouble with R recognising the arguments as being part of the call. I've made some code to show you my problem. I have tried quoting and unquoting the arguments, using unquote() in the function ("Object not found" error within a user defined function, eval() function?), using eval(parse()) (R - how to filter data with a list of arguments to produce multiple data frames and graphs), using !!, etc. However, I can't seem to get it to work. Does anyone know how I should handle this?
library(survminer)
library(survival)
data_km <- data.frame(Duration1 = c(1,2,3,4,5,6,7,8,9,10),
Event1 = c(1,1,0,1,1,0,1,1,1,1),
Duration2 = c(1,1,2,2,3,3,4,4,5,5),
Event2 = c(1,0,1,0,1,1,1,0,1,1),
Duration3 = c(11,12,13,14,15,16,17,18,19,20),
Event3 = c(1,1,0,1,1,0,1,1,0,1),
Area = c(1,1,1,1,1,2,2,2,2,2))
# this is working perfectly
ggsurvplot(survfit(Surv(Duration1, Event1) ~ Area, data = data_km))
ggsurvplot(survfit(Surv(Duration2, Event2) ~ Area, data = data_km))
ggsurvplot(survfit(Surv(Duration3, Event3) ~ Area, data = data_km))
myfun <- function(TimeVar, EventVar){
ggsurvplot(survfit(Surv(eval(parse(text = TimeVar)), eval(parse(text = EventVar))) ~ Area, data = data_km))
}
x <- myfun("Duration1", "Event1")
plot(x)
You need to study some tutorials about computing on the language. I like doing it with base R, e.g., using bquote.
myfun <- function(TimeVar, EventVar){
TimeVar <- as.name(TimeVar)
EventVar <- as.name(EventVar)
fit <- eval(bquote(survfit(Surv(.(TimeVar), .(EventVar)) ~ Area, data = data_km)))
ggsurvplot(fit)
}
x <- myfun("Duration1", "Event1")
print(x)
#works
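The same bquote() pattern can be tried without survival or survminer. The sketch below uses lm() and the built-in mtcars data purely as an illustration; fit_lm is a hypothetical helper name, not part of the answer above.

```r
# Build the model call with the column names spliced in as symbols,
# so the stored call reads as if it had been typed by hand.
fit_lm <- function(y_var, x_var) {
  y <- as.name(y_var)
  x <- as.name(x_var)
  model_call <- bquote(lm(.(y) ~ .(x), data = mtcars))
  eval(model_call)
}

fit <- fit_lm("mpg", "wt")
fit$call  # the recorded call contains mpg ~ wt, not y_var / x_var
```

Functions that later re-evaluate the stored call, as ggsurvplot does with a survfit object, then see real column names rather than variables that only existed inside the wrapper.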

Save plots as R objects and displaying in grid

In the following reproducible example I try to create a function for a ggplot distribution plot and saving it as an R object, with the intention of displaying two plots in a grid.
ggplothist<- function(dat,var1)
{
if (is.character(var1)) {
var1 <- which(names(dat) == var1)
}
distribution <- ggplot(data=dat, aes(dat[,var1]))
distribution <- distribution + geom_histogram(aes(y=..density..),binwidth=0.1,colour="black", fill="white")
output<-list(distribution,var1,dat)
return(output)
}
Call to function:
set.seed(100)
df <- data.frame(x = rnorm(100, mean=10),y =rep(1,100))
output1 <- ggplothist(dat=df,var1='x')
output1[1]
All fine until now.
Then I want to make a second plot (note mean=100 instead of the previous 10):
df2 <- data.frame(x = rep(1,1000),y = rnorm(1000, mean=100))
output2 <- ggplothist(dat=df2,var1='y')
output2[1]
Then I try to replot the first distribution, with mean 10.
output1[1]
I get the same distribution as in the previous plot (mean 100)?
If, however, I take the information returned by the function and reset it as global variables, it works.
var1=as.numeric(output1[2]);dat=as.data.frame(output1[3]);p1 <- output1[1]
p1
If anyone can explain why this happens, I would like to know. It seems that in order to draw the intended distribution I have to reset the data.frame and variable to what was used to draw the plot. Is there a way to save the plot as an object without having to do this? Luckily I can replot the first distribution,
but I can't plot them both at the same time:
var1=as.numeric(output2[2]);dat=as.data.frame(output2[3]);p2 <- output2[1]
grid.arrange(p1,p2)
ERROR: Error in gList(list(list(data = list(x = c(9.66707664902549, 11.3631137069225, :
only 'grobs' allowed in "gList"
The answer to "Grid of multiple ggplot2 plots which have been made in a for loop" suggests using a list to contain the plots:
ggplothist<- function(dat,var1)
{
if (is.character(var1)) {
var1 <- which(names(dat) == var1)
}
distribution <- ggplot(data=dat, aes(dat[,var1]))
distribution <- distribution + geom_histogram(aes(y=..density..),binwidth=0.1,colour="black", fill="white")
plot(distribution)
pltlist <- list()
pltlist[["plot"]] <- distribution
output<-list(pltlist,var1,dat)
return(output)
}
output1 <- ggplothist(dat=df,var1='x')
p1<-output1[1]
output2 <- ggplothist(dat=df2,var1='y')
p2<-output2[1]
output1[1]
will produce the distribution with mean=100 again instead of mean=10,
and:
grid.arrange(p1,p2)
will produce the same error:
Error in gList(list(list(plot = list(data = list(x = c(9.66707664902549, :
only 'grobs' allowed in "gList"
As a last attempt I tried to use recordPlot() to record everything about the plot into an object. The following is now inside the function:
ggplothist<- function(dat,var1)
{
if (is.character(var1)) {
var1 <- which(names(dat) == var1)
}
distribution <- ggplot(data=dat, aes(dat[,var1]))
distribution <- distribution + geom_histogram(aes(y=..density..),binwidth=0.1,colour="black", fill="white")
plot(distribution)
distribution<-recordPlot()
output<-list(distribution,var1,dat)
return(output)
}
This function produces the same problems as before: redrawing depends on resetting the dat and var1 variables to what is needed for drawing the distribution, and the result similarly can't be put inside a grid.
I've tried similar things like arrangeGrob() from the question "R saving multiple ggplot2 plots as R-object in list and re-displaying in grid", but with no luck.
I would really like a solution that creates an R object containing the plot, that can be redrawn by itself and can be used inside a grid without having to reset the variables used to draw the plot each time. I would also like to understand why this is happening, as I don't consider it intuitive at all.
The only solution I can think of is to draw the plot as a png file saved somewhere, and then have the function return the path so that it can be reused. Is that what other people are doing?
Thanks for reading, and sorry for the long question.
Found a solution in the question
"How can I reference the local environment within a function, in R?"
by inserting
localenv <- environment()
and referencing that in the ggplot call:
distribution <- ggplot(data=dat, aes(dat[,var1]),environment = localenv)
That made it all work, even with grid.arrange!
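An alternative sketch that avoids referring to dat and var1 from inside aes() at all. It assumes a ggplot2 version that provides the .data pronoun and after_stat(), which is an assumption about the installed version, not something stated in the question. The function returns the plot itself, so no [1] or [[1]] extraction is needed afterwards.

```r
library(ggplot2)

# Look the column up by name inside the plot's own data, so the plot
# does not depend on variables that exist only while the function runs.
ggplothist <- function(dat, var1) {
  ggplot(dat, aes(x = .data[[var1]])) +
    geom_histogram(aes(y = after_stat(density)), binwidth = 0.1,
                   colour = "black", fill = "white")
}

set.seed(100)
df  <- data.frame(x = rnorm(100, mean = 10), y = rep(1, 100))
df2 <- data.frame(x = rep(1, 1000), y = rnorm(1000, mean = 100))

p1 <- ggplothist(df, "x")
p2 <- ggplothist(df2, "y")
# Both objects can now be redrawn independently, for example with
# gridExtra::grid.arrange(p1, p2) if gridExtra is installed.
```

Note also that in the original attempts output1[1] is a one-element list rather than the plot; output1[[1]] extracts the ggplot object, which is what grid.arrange expects.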

Use of recordPlot() and replayPlot() in Parallel in R to save plot in the same PDF

I would like to plot data in parallel using foreach in R but I didn't find any way to get all my plots in the same pdf file. I thought of using recordPlot to save my plots in a list and then print them in a pdf device but it doesn't work.
I have the following error :
Error in replayPlot(x) : loading snapshot from a different session
I tried with ggplot as well, but it is too slow with my large dataset.
Here is a piece of code showing my problem :
# Creating a dataframe : df
df=as.data.frame(matrix(nrow=1, ncol=10))
df=apply(df, 2, function(x) runif(100))
# Plotting function
par.plot=function(dat){
plot(dat)
p=recordPlot()
return(p)}
#Applying the function in parallel
library("parallel")
library("foreach")
library("doParallel")
cl <- makeCluster(detectCores())
registerDoParallel(cl, cores = detectCores())
plot.lst = foreach(i = 1:nrow(df)) %dopar% {
par.plot(df[i,])
}
# Trying to get 1st plot
plot.lst[[1]]
Error in replayPlot(x) : loading snapshot from a different session
Replacing %dopar% with %do% works when I try to get my plots, because they seem to have been generated in the same environment.
I know I can call a pdf device inside the loop to generate a file for each iteration, but I would like to know if there is a way to get one file for all my plots at the output of my function.
Or do you know an easy way to merge my pdf files afterwards ?
Thanks for your help.
Charles
In my opinion your question can be divided into two distinct parts:
1. Using the replayPlot function with %dopar% without getting the weird error
2. Somehow getting 1 file at the end
The first question is easy to answer. You get this error because R records, at the OS level, which process generated the plot. You can see the same effect in RStudio Server by trying to replay recorded plots a couple of hours after closing the browser tab. In brief, R stores the PID of the process that generated the plot (I don't know why, though):
# generate a plot
plot(iris[, 1:2])
# record the plot
myplot <- recordPlot()
# check the PID
attr(x = myplot, which = "pid")
The good thing is that you can overwrite this by assigning your current PID:
attr(x = myplot, which = "pid") <- Sys.getpid()
So you only need to change the last line of your code to the following:
# close any open devices first, then open the PDF device
graphics.off()
pdf(file = "plot.lst.pdf")
lapply(plot.lst, function(x){
attr(x = x, which = "pid") <- Sys.getpid()
replayPlot(x)})
# close the PDF device so the file is written
graphics.off()
The part above entirely solves your problem, but in case you are interested in merging PDF files, follow this discussion:
Merging existing PDF files using R

Exclude Node in semPaths {semPlot}

I'm trying to plot a SEM path diagram with R.
I'm using an .out file produced by Mplus with semPaths {semPlot}.
It seems to work, but I want to remove some latent variables and I don't know how.
I am using the following syntax:
Output file from Mplus: https://www.dropbox.com/s/vo3oa5fqp7wydlg/questedMOD2.out?dl=0
outfile1 <- "questedMOD.out"
semPaths(outfile1,what="est", intercepts=FALSE, rotation=4, edge.color="black", sizeMan=5, esize=TRUE, structural="TRUE", layout="tree2", nCharNodes=0, intStyle="multi" )
There may be an easier way to do this (and ignoring if it is sensible to do it) - one way you can do this is by removing nodes from the object prior to plotting.
Using the Mplus example from your question Rotate Edges in semPaths/qgraph
library(qgraph)
library(semPlot)
library(MplusAutomation)
# This downloads an output file from Mplus examples
download.file("http://www.statmodel.com/usersguide/chap5/ex5.8.out",
outfile <- tempfile(fileext = ".out"))
# Unadjusted plot
s <- semPaths(outfile, intercepts = FALSE)
In the above call to semPaths, outfile is of class character, so the line (near the start of code for semPaths)
if (!"semPlotModel" %in% class(object))
object <- do.call(semPlotModel, c(list(object), modelOpts))
returns the object from semPlot:::semPlotModel.mplus.model(outfile). This is of class "semPlotModel".
So the idea is to create this object first, amend it and then pass this object to semPaths.
# Call semPlotModel on your Mplus file
obj <- semPlot:::semPlotModel.mplus.model(outfile)
# obj <- do.call(semPlotModel, list(outfile)) # this is more general / not just for Mplus
# Remove one factor (F1) from object@Pars - need to check lhs and rhs columns
idx <- apply(obj@Pars[c("lhs", "rhs")], 1, function(i) any(grepl("F1", i)))
obj@Pars <- obj@Pars[!idx, ]
class(obj)
obj is now of class "semPlotModel" and can be passed directly to semPaths
s <- semPaths(obj, intercepts = FALSE)
You can use str(s) to see the structure of this returned object.
Assuming that you use the following semPaths code to print your SEM (the %>% pipe comes from the magrittr package, which must be loaded):
semPaths(obj, intercepts = FALSE)%>%
plot()
you can use the following function to remove any node by its label:
remove_nodes_and_edges <- function(semPaths_obj,node_tbrm_vec){
relevent_nodes_index <- semPaths_obj$graphAttributes$Nodes$names %in% node_tbrm_vec
semPaths_obj$graphAttributes$Nodes$width[relevent_nodes_index]=0
semPaths_obj$graphAttributes$Nodes$height[relevent_nodes_index]=0
semPaths_obj$graphAttributes$Nodes$labels[relevent_nodes_index]=""
return(semPaths_obj)
}
and use this function in the plotting pipe in the following way
semPaths(obj, intercepts = FALSE) %>%
remove_nodes_and_edges(c("Y1","Y2","Y3")) %>%
plot()

How can I suppress the creation of a plot while calling a function in R?

I am using a function in R (specifically limma::plotMDS) that produces a plot and also returns a useful value. I want to get the returned value without producing the plot. Is there an easy way to call the function but suppress the plot that it creates?
You can wrap the function call like this :
plotMDS.invisible <- function(...){
ff <- tempfile()
png(filename=ff)
res <- plotMDS(...)
dev.off()
unlink(ff)
res
}
An example call:
x <- matrix(rnorm(1000*6,sd=0.5),1000,6)
rownames(x) <- paste("Gene",1:1000)
x[1:50,4:6] <- x[1:50,4:6] + 2
# without labels, indexes of samples are plotted.
mds <- plotMDS.invisible(x, col=c(rep("black",3), rep("red",3)) )
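The same idea can be sketched generically, without a temporary file, by drawing on a null PDF device. The helper name quietly_plot is hypothetical, and hist() is used here only as a stand-in for any function that both plots and returns a value.

```r
# Run a plotting function on a throwaway device and keep its return value.
quietly_plot <- function(fun, ...) {
  pdf(NULL)           # file = NULL: no external file is created
  on.exit(dev.off())  # always close the throwaway device, even on error
  fun(...)
}

set.seed(1)
h <- quietly_plot(hist, rnorm(100))
h$counts  # the histogram object is available, but nothing was drawn on screen
```

Using on.exit() means the throwaway device is closed even if the wrapped function fails, so no stray device is left open.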
