Memory leakage when using `ggplot` on large binned datasets - r

I am making various ggplots from a very large dataset (much larger than the example below). I wrote a binning function that bins the data on both the x- and y-axes so that such a large dataset can be plotted at all.
In the following example, memory.size() is recorded at the start. The large dataset is then simulated as dt, and dt's x2 is plotted against x1 with binning. Plotting is repeated with different subsets of dt. The size of the plot objects is checked with object.size() and stored. After the plot objects have been created, rm(dt) is executed, followed by a double gc(). At this point memory.size() is recorded again, compared with the value from the beginning, and both are printed.
Given the small size of the plot objects, I expected memory.size() at the end to be similar to that at the beginning. But it is not: memory.size() does not go down again until I start a new R session.
REPRODUCIBLE EXAMPLE
library(data.table)
library(ggplot2)
library(magrittr)
# The binning function
# x = column name for the x-axis (character)
# y = column name for the y-axis (character)
# xNItv = number of bins for the x-axis
# yNItv = number of bins for the y-axis
# Value: a binned data.table
tab_by_bin_idxy <- function(dt, x, y, xNItv, yNItv) {
  # Binning
  xBreaks = dt[, seq(min(get(x), na.rm = T), max(get(x), na.rm = T), length.out = xNItv + 1)]
  yBreaks = dt[, seq(min(get(y), na.rm = T), max(get(y), na.rm = T), length.out = yNItv + 1)]
  xbinCode = dt[, .bincode(get(x), breaks = xBreaks, include.lowest = T)]
  xbinMid = sapply(seq(xNItv), function(i) {return(mean(xBreaks[c(i, i+1)]))})[xbinCode]
  ybinCode = dt[, .bincode(get(y), breaks = yBreaks, include.lowest = T)]
  ybinMid = sapply(seq(yNItv), function(i) {return(mean(yBreaks[c(i, i+1)]))})[ybinCode]
  # Creating the table
  tab_match = CJ(xbinCode = seq(xNItv), ybinCode = seq(yNItv))
  tab_plot = data.table(xbinCode, xbinMid, ybinCode, ybinMid)[
    tab_match, .(xbinMid = xbinMid[1], ybinMid = ybinMid[1], N = .N), keyby = .EACHI, on = c("xbinCode", "ybinCode")
  ]
  # Returning the table
  return(tab_plot)
}
before.mem.size <- memory.size()
# Simulation of dataset
nrow <- 6e5
ncol <- 60
dt <- do.call(data.table, lapply(seq(ncol), function(i) {return(runif(nrow))}) %>% set_names(paste0("x", seq(ncol))))
# Graph plotting
dummyEnv <- new.env()
with(dummyEnv, {
  fcn <- function(tab) {
    binned.dt <- tab_by_bin_idxy(dt = tab, x = "x1", y = "x2", xNItv = 50, yNItv = 50)
    plot <- ggplot(binned.dt, aes(x = xbinMid, y = ybinMid)) + geom_point(aes(size = N))
    return(plot)
  }
  lst_plots <- list(
    plot1 = fcn(dt),
    plot2 = fcn(dt[x1 <= 0.7]),
    plot3 = fcn(dt[x5 <= 0.3])
  )
  assign("size.of.plots", object.size(lst_plots), envir = .GlobalEnv)
})
rm(dummyEnv)
# After use, remove and clean up of dataset
rm(dt)
gc();gc()
after.mem.size <- memory.size()
# Memory reports
print(paste0("before.mem.size = ", before.mem.size))
print(paste0("after.mem.size = ", after.mem.size))
print(paste0("plot.objs.size = ", size.of.plots / 1000000))
I have tried the following modifications to the code:
- Inside fcn, removing ggplot and returning NULL instead of a plot object: the memory leakage is gone entirely. But this is not a solution; I need the plot.
- The fewer plots requested, and the fewer columns/rows passed to fcn, the smaller the memory leakage.
- The memory leakage also occurs if I do not subset at all and create only one plot object (in the example, I created 3).
- After the process, even calling rm(list = ls()) does not recover the memory.
I would like to know why this happens and how to get rid of it, without giving up binned plots or subsetting dt to make different plots.
Thanks for your attention!
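One thing worth checking (a sketch, not part of the original question): a ggplot object keeps a reference to the environment it was built in (its plot_env component), and object.size() does not follow environments, which would explain why the reported plot size stays small while session memory does not come down. Assuming p is one of the plot objects returned by fcn, something like this shows what the plot still keeps reachable:
# Sketch only: p stands for one of the saved plot objects, e.g. lst_plots$plot1 before dummyEnv is removed
ls(p$plot_env)                 # names still reachable from the plot's captured environment
format(object.size(p))         # small, because object.size() does not descend into environments
length(serialize(p, NULL))     # bytes needed to serialize the plot and everything it references
If the captured environment turns out to be the culprit, a workaround sometimes used is to build the plot inside a small throwaway environment that holds only binned.dt, though I have not verified that this removes the leak in this exact setup.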

Related

ggarrange generates an empty pdf file

I am dealing with a function that takes a big data frame (36 rows and 194 columns), performs a Principal Component Analysis, and then generates a list of plots for the pairwise combinations of 26 principal components (325 combinations in total), built with expand.grid.
My problem is that when I use ggarrange() from ggpubr to merge all the plots into a single PDF file, the file is empty.
My code:
a = 26
row.pairs = 325
PC.Graph <- function(df, col1, col2, tag, id){
  df1 <- df[,-c(col1:col2)]
  pca <- prcomp(df1, scale. = T)
  pc.summ <- summary(pca)
  a <- sum(pc.summ$importance[3,] < 0.975)
  b <- c(1:a)
  pc.grid <- expand.grid(b, b)
  pc.pairs <- pc.grid[pc.grid$Var1 < pc.grid$Var2,]
  row.pairs <- nrow(pc.pairs)
  components <- c(1:row.pairs)
  S.apply.FUN <- function(x){
    c <- sapply(pc.pairs, "[", x, simplify = F)
    pcx <- c$Var1
    pcy <- c$Var2
    df2 <- df
    row.names(df2) <- df[, tag]
    name = paste("PCA_", pcx, "_vs_", pcy)
    autoplot(pca, data = df2, colour = id, label = T, label.repel = T, main = name,
             x = pcx, y = pcy)
  }
  all.plots <- Map(S.apply.FUN, components)
  pdf(file = "All_PC.pdf", width = 50, height = 70)
  print(ggarrange(all.plots))
  dev.off()
}
PC.Graph(Final_DF, col1 = 1, col2 = 5, tag = "Sample", id = "Maturation")
You would have to pass a plotlist to ggarrange (i.e. ggarrange(plotlist = all.plots)), but I am not sure you would get any useful plot out of that single plot area in the PDF file, so I would advise splitting the plotlist into chunks (e.g. of 20) and plotting these to multiple pages.
Specifically, I would return all.plots from your PC.Graph function (and remove the code that writes the PDF inside it).
I would also change expand.grid(b, b) to t(combn(b, 2)), since you don't need to plot each PC combination twice (see the short illustration after the code below).
Then I would do something like this:
# export the full list of plots
plots <- PC.Graph(Final_DF, col1 = 1, col2 = 5, tag = "Sample", id = "Maturation")
# split the plotlist
splitPlots <- split(plots, ceiling(seq_along(plots)/20))
plotPlots <- function(x){
  out <- cowplot::plot_grid(plotlist = x, ncol = 5, nrow = 4)
  plot(out)
}
pdf(file = "All_PC.pdf", width = 50, height = 45)
lapply(splitPlots, plotPlots)
dev.off()
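To illustrate the expand.grid(b, b) versus t(combn(b, 2)) suggestion above, a small sketch with a toy b of four components:
b <- 1:4
nrow(expand.grid(b, b))   # 16 ordered pairs, including (i, i) and both (i, j) and (j, i)
t(combn(b, 2))            # 6 unordered pairs, one row per distinct PC combination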

Creating a function that prints out multiple data frames in R

I have a function that I wrote to create a simulation that demonstrates the central limit theorem. I'm not sure if it's possible, or if I am better off just making separate functions, but currently it only stores the data frame containing the mean values of all the trials.
# create function to perform CLT simulation
# where n = sample size, t = number of trials, pop = the population being sampled,
# popmean = population mean, popsd = population standard deviation, poptitle = population label for the plot title
cltsim <- function(n, t, pop, popmean, popsd, poptitle){
  popsim <- data.frame()
  # Run the simulation
  for(i in n) { # for each value of n
    col <- c()
    for(j in t) { # loop through each value of t
      trial <- 1:j
      counter <- j # set up an egg timer based on whichever t value we're on
      value <- c()
      while(counter > 0) { # draw samples of size i from the population
        bucket <- sample(pop, i, replace = TRUE)
        xbar <- mean(bucket) # calculate the sample mean
        value <- c(value, xbar) # and add it to a vector
        counter <- counter - 1 # egg timer counts down and loops back until it hits 0
      }
      sbar <- sd(value) # calculate the sample standard deviation
      col <- cbind(trial, value, sbar, i, j) # merge all info together
      popsim <- rbind(popsim, col) # attach it to the empty data frame
    }
  }
  # clean up so just the finished data frame is left
  rm(col, bucket, value, counter, i, j, n, sbar, t, xbar, trial)
  # tidy up data frame in order to graph it
  names(popsim) <- c("trial#", "value", "sdev", "samples", "trials")
  # return the popsim data frame
  popsim
}
When I try to add any more code that creates data tables, the function doesn't store them. Below are the blocks of code I wish to add to the function:
g1 <- ggplot(popsim, aes(x = value)) + geom_density(fill = "#09AB30") +
  facet_grid(samples ~ trials, labeller = label_both) +
  ggtitle(paste("Demonstrating The Central Limit Theorem with Simulation using", poptitle)) +
  geom_vline(xintercept = popmean, linetype = "dashed")
g1
and
# create data frame of simulated sample standard deviations
sdmatrix <- matrix(unique(popsim$sdev), nrow = 4, ncol = 4)
sdf <- as.data.frame(sdmatrix, row.names = c("t10", "t100", "t1000", "t10000"))
names(sdf) <- c("s1", "s10", "s30", "s50")
sdf <- t(sdf)
rm(sdmatrix)
sdf
exvals <- pop1sd/sqrt(c(1, 10, 30, 50))
dfex <- as.data.frame(exvals, row.names = c("s1", "s10", "s30", "s50"))
names(dfex) <- "Predicted Standard Deviations"
dfex
I've had a look around and I can't find a solution anywhere. Am I better off just writing separate functions for them? Any advice or input on how to make this code more effective/efficient would be greatly appreciated.
Thanks in advance.
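Not from the original post, but one common pattern for this kind of problem is to return everything the function builds as a named list rather than printing the pieces inside the function. A sketch, assuming n and t each contain four values as in the original code, and substituting the popsd argument for the global pop1sd:
library(ggplot2)
# Sketch: reuse cltsim() for the data frame, then bundle the plot and the
# summary tables into one named list so nothing is lost when the function returns
cltsim_report <- function(n, t, pop, popmean, popsd, poptitle) {
  popsim <- cltsim(n, t, pop, popmean, popsd, poptitle)  # data frame from the existing function
  g1 <- ggplot(popsim, aes(x = value)) + geom_density(fill = "#09AB30") +
    facet_grid(samples ~ trials, labeller = label_both) +
    ggtitle(paste("Demonstrating The Central Limit Theorem with Simulation using", poptitle)) +
    geom_vline(xintercept = popmean, linetype = "dashed")
  # simulated sample standard deviations: one row per sample size, one column per trial count
  sdf <- as.data.frame(t(matrix(unique(popsim$sdev), nrow = length(t), ncol = length(n),
                                dimnames = list(paste0("t", t), paste0("s", n)))))
  # predicted standard deviations (popsd used here in place of the global pop1sd)
  dfex <- data.frame("Predicted Standard Deviations" = popsd / sqrt(n),
                     row.names = paste0("s", n), check.names = FALSE)
  list(popsim = popsim, plot = g1, sdf = sdf, dfex = dfex)
}
# usage sketch (pop1 stands for whatever population vector you simulate from):
# res <- cltsim_report(n = c(1, 10, 30, 50), t = c(10, 100, 1000, 10000),
#                      pop = pop1, popmean = mean(pop1), popsd = sd(pop1), poptitle = "Population 1")
# res$plot; res$sdf; res$dfex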

How to create multiple boxplots in one plot by only choosing certain rows from a data frame

What I would like to do is create several boxplots (all displayed in a single plot) using only certain values of my original data frame.
My data frame looks as follows:
[screenshot of the data frame]
So now I want R to visualise Parameter ~ Station (the parameters are all the variables coloured green in the screenshot, and Station is the station id).
Is there a way to tell R that I want all my parameters on the x-axis ONLY for BB0028, for example, which would mean only taking the first 6 values of mean_area, mean_area_exc, esd, feret, min and max into account in the boxplot?
That would look like this:
[sketch of the desired combined boxplot]
I tried a very complicated way of adding single boxplots one by one, but I am sure there must be a simpler way.
This is what I tried:
bb28 <- df[c(1:6),]
bb28area <- boxplot(bb28$mean_area ~ bb28$BBnr)
bb28area_exc <- boxplot(bb28$mean_area_exc ~ bb28$BBnr)
bb28esd <- boxplot(bb28$mean_esd ~ bb28$BBnr)
bb28feret <- boxplot(bb28$mean_feret ~ bb28$BBnr)
bb28min <- boxplot(bb28$mean_min ~ bb28$BBnr)
bb28max <- boxplot(bb28$mean_max ~ bb28$BBnr)
boxplot(bb28$mean_area ~ bb28$BBnr)
boxplot(bb28$mean_area_exc ~ bb28$BBnr, add=TRUE, at = 1:1+0.45)
Also, it doesn't look very nice because the x-axis does not adjust to the added boxplot, which is then cut off:
[plot showing the added boxplot cut off at the right edge]
I hope you can help me with a simple and proper way to get my plot.
Thank you!
Cheers, Merle
Maybe the function multi.boxplot below is what you are looking for. It uses base R only.
Data.
First, make up a dataset, since you have not provided us with one in a copy&paste friendly format.
set.seed(1234)
n <- 50
BBnr <- sort(sprintf("BB%04d", sample(28:30, n, TRUE)))
bb28 <- data.frame(col1 = 1:n, col2 = n:1, BBnr = BBnr)
tmp <- matrix(runif(3*n), ncol = 3)
colnames(tmp) <- paste("mean", c("this", "that", "other"), sep = "_")
bb28 <- cbind(bb28, tmp)
rm(BBnr, tmp)
Code.
multi.boxplot <- function(x, by, col = 0, ...){
  x <- as.data.frame(x)
  uniq.by <- unique(by)
  len <- length(uniq.by) - 1
  n <- ncol(x)
  n1 <- n + 1
  col <- rep(col, n)[seq_len(n)]
  boxplot(x[[1]] ~ by, at = 0:len*n1 + 1,
          xlim = c(0, (len + 1)*n1), ylim = range(unlist(x)), xaxt = "n", col = col[1], ...)
  for(i in seq_len(n)[-1])
    boxplot(x[[i]] ~ by, at = 0:len*n1 + i, xaxt = "n", add = TRUE, col = col[i], ...)
  axis(1, at = 0:len*n1 + n1/2, labels = uniq.by, tick = TRUE)
}
inx <- grep("mean", names(bb28))
multi.boxplot(bb28[inx], by = bb28$BBnr, col = rainbow(length(inx)))
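If you also want a key showing which colour belongs to which variable, a small follow-up sketch (base R, reusing inx and the same colours as the call above):
# add a legend mapping each fill colour to its variable on the open plot
legend("topright", legend = names(bb28)[inx], fill = rainbow(length(inx)), bty = "n")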

Display multiple time series with rCharts hPlot

Using a simple data frame to illustrate this problem:
df <- data.frame(x=c(1,2,3), y1=c(1,2,3), y2=c(3,4,5))
Plotting a single time series is easy:
hPlot(y="y1", x="x", data=df)
I cannot figure out how to plot both y1 and y2 together. I tried this, but it returns an error:
> hPlot(x='x', y=c('y1','y2'), data=df)
Error in .subset2(x, i, exact = exact) : subscript out of bounds
Looking at the code of hPlot, it uses [[ to extract a single column from the input data.frame; does that mean it only works for a single time series?
hPlot <- highchartPlot <- function(..., radius = 3, title = NULL, subtitle = NULL, group.na = NULL){
  rChart <- Highcharts$new()
  # Get layers
  d <- getLayer(...)
  data <- data.frame(
    x = d$data[[d$x]],
    y = d$data[[d$y]]
  )
  # (rest of the function omitted)
Try using long-format data with group:
hPlot(x = "x", y = "value", group = "variable", data = reshape2::melt(df, id.vars = "x"))

Quantiles by factor levels in R

I have a data frame and I'm trying to create a new variable in the data frame that has the quantiles of a continuous variable var1, for each level of a factor strata.
# some data
set.seed(472)
dat <- data.frame(var1 = rnorm(50, 10, 3)^2,
                  strata = factor(sample(LETTERS[1:5], size = 50, replace = TRUE))
)
# function to get quantiles
qfun <- function(x, q = 5) {
  quantile <- cut(x, breaks = quantile(x, probs = 0:q/q),
                  include.lowest = TRUE, labels = 1:q)
  quantile
}
I tried two methods, neither of which produces a usable result. First, I tried using aggregate to apply qfun to each level of strata:
qdat <- with(dat, aggregate(var1, list(strata), FUN = qfun))
This returns the quantiles by factor level, but the output is hard to coerce back into a data frame (e.g., using unlist does not line the new variable values up with the correct rows in the data frame).
A second approach was to do this in steps:
tmp1 <- with(dat, split(var1, strata))
tmp2 <- lapply(tmp1, qfun)
tmp3 <- unlist(tmp2)
dat$quintiles <- tmp3
Again, this calculates the quantiles correctly for each factor level but, as with aggregate, they aren't in the correct order in the data frame. We can check this by putting the quantile "bins" into the data frame.
# get quantile bins
qfun2 <- function(x, q = 5) {
  quantile <- cut(x, breaks = quantile(x, probs = 0:q/q),
                  include.lowest = TRUE)
  quantile
}
tmp11 <- with(dat, split(var1, strata))
tmp22 <- lapply(tmp11, qfun2)
tmp33 <- unlist(tmp22)
dat$quintiles2 <- tmp33
Many of the values of var1 fall outside the bins in quintiles2. I feel like I'm missing something simple. Any suggestions would be greatly appreciated.
I think your issue is that you don't really want aggregate, but ave (or data.table or plyr):
qdat <- transform(dat, qq = ave(var1, strata, FUN = qfun))

# using plyr
library(plyr)
qdat <- ddply(dat, .(strata), mutate, qq = qfun(var1))

# using data.table (my preference); dat must be a data.table first, e.g. setDT(dat)
dat[, qq := qfun(var1), by = strata]
aggregate usually implies returning an object that is smaller than the original (in this case you were getting a data.frame where x was a list with one element for each stratum).
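A quick way to see the difference in shape (a sketch using the dat and qfun defined above):
# aggregate() collapses to one row per stratum; its x column holds the per-group
# quantile codes, which is why it is awkward to merge back onto dat
str(with(dat, aggregate(var1, list(strata), FUN = qfun)))
# ave() returns a vector as long as the input, already in the original row order
length(ave(dat$var1, dat$strata, FUN = qfun))   # 50, the same as nrow(dat)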
Use ave on your dat data frame. Full example with your simulated data and qfun function:
# some data
set.seed(472)
dat <- data.frame(var1 = rnorm(50, 10, 3)^2,
                  strata = factor(sample(LETTERS[1:5], size = 50, replace = TRUE))
)
# function to get quantiles
qfun <- function(x, q = 5) {
  quantile <- cut(x, breaks = quantile(x, probs = 0:q/q),
                  include.lowest = TRUE, labels = 1:q)
  quantile
}
And my addition...
dat$q <- ave(dat$var1,dat$strata,FUN=qfun)
