How to find byte sizes of R figures on pages?

I would like to monitor the basic quality of the figures produced in R on individual pages, for example the byte size of each page.
At the moment I can only do quality assurance on the average page; see the section about that approach below.
I think there must be something built in for this task that goes beyond average measures.
The code below produces 4 pages in Rplots.pdf; I would like to know the byte size of each page of that output, and any other per-page statistics are also welcome.
You can do basic memory monitoring of the R objects involved, but I would like the numbers to correspond to the pages of the PDF output.
# https://stat.ethz.ch/R-manual/R-devel/library/graphics/html/plot.html
require(stats) # for lowess, rpois, rnorm
plot(cars)
lines(lowess(cars))
plot(sin, -pi, 2*pi) # see ?plot.function
## Discrete Distribution Plot:
plot(table(rpois(100, 5)), type = "h", col = "red", lwd = 10,
     main = "rpois(100, lambda = 5)")
## Simple quantiles/ECDF, see ecdf() {library(stats)} for a better one:
plot(x <- sort(rnorm(47)), type = "s", main = "plot(x, type = \"s\")")
points(x, cex = .5, col = "dark red")
## TODO: summarise here the byte sizes of the figures (1-4)
# Output: Rplots.pdf with 4 pages; I want to know the size of each page in bytes
I currently do this basic quality assurance on the command line, but I would like to move some of it into R so I can spot bugs faster.
Expected output: byte size per page, for instance like the 4th column of ls -l.
To get the byte size of the average individual page in an output document
Limitations
Homogeneity of the data across pages is required: this method only works if all pages come from the same kind of sample. Otherwise it is of limited use, because it yields only an average and does not describe the individual pages.
Other possible weaknesses
PDF elements and metadata. The method considers the PDF file as a whole, not the graphic objects themselves. This limits the use of the absolute value, because the file size also contains headers and other metadata that are not part of the graphic objects.
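A quick way to see how large that fixed overhead is, is to measure an essentially empty page (a small sketch, not part of the workflow below):

# An almost empty one-page PDF is still a few kilobytes,
# because the file also carries headers and metadata.
pdf("empty.pdf")
plot.new()
dev.off()
file.size("empty.pdf")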
Code
filename <- "main.pdf"
filesize <- file.size(filename)
# http://unix.stackexchange.com/q/331175/16920
pages <- Rpoppler::PDF_info(filename)$Pages
# print page size (= filesize / pages)
pagesize <- filesize / pages
## Example values (str() output) for the 62-page example file:
## filesize: num 7350960
## pages:    int 62
## pagesize: num 118564
Input: just any 62-page document
Output: average individual page size (118564 bytes)
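The steps above can be wrapped into a small helper, a sketch that assumes Rpoppler is installed:

# Average page size in bytes: total file size divided by the page count
average_page_size <- function(filename) {
  file.size(filename) / Rpoppler::PDF_info(filename)$Pages
}
average_page_size("main.pdf")   # about 118564 for the 62-page example above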
Testing the answer below that writes each page to a separate file
Output (note that you cannot easily change the input to an existing PDF file of your own):
files size_bytes
[1,] "./test_page_size_pdf/page01.pdf" "4,123,942"
[2,] "./test_page_size_pdf/page02.pdf" " 4,971"
[3,] "./test_page_size_pdf/page03.pdf" " 4,672"
[4,] "./test_page_size_pdf/page04.pdf" " 5,370"
Input: just any 64-page document
Expected output: 67 (= 64 + 3) pages analysed, not only 4
R: 3.3.2
OS: Debian 8.5

Download and install the pdftk utility if it is not already on your system and then try one of the following alternatives from within R.
1) This will return a data frame with the per-page file sizes in bytes and other information.
myfile <- "Rplots.pdf"
system(paste("pdftk", myfile, "burst"))
file.info(Sys.glob("pg_*.pdf"))
It will also generate a file doc_data.txt with some miscellaneous information that may or may not be of interest.
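If only the byte sizes are of interest, they can be pulled out of the data frame that file.info() returns, for example:

# Named vector of per-page sizes in bytes, one entry per burst file
info <- file.info(Sys.glob("pg_*.pdf"))
setNames(info$size, basename(rownames(info)))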
1a) This alternative will not generate any files. It will simply return the byte sizes of the pages as a numeric vector.
myfile <- "Rplots.pdf"
pages <- as.numeric(read.dcf(pipe(paste("pdftk", myfile, "dump_data")))[, "NumberOfPages"])
cmds <- sprintf("pdftk %s cat %d output - | wc -c", myfile, seq_len(pages))
unname(sapply(cmds, function(cmd) scan(pipe(cmd), quiet = TRUE)))
The above should work if pdftk and wc are on your path. Note that on Windows you can find wc in the Rtools distribution; it is typically at "C:\\Rtools\\bin\\wc" once Rtools is installed.
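If wc is not already on the PATH, one way to make it visible for the current session is to prepend the Rtools bin directory (the path below is the usual default, so adjust it to your installation):

# Prepend the Rtools bin directory so wc (and pdftk, if it lives there too)
# can be found by the shell commands above; Windows only.
Sys.setenv(PATH = paste("C:\\Rtools\\bin", Sys.getenv("PATH"),
                        sep = .Platform$path.sep))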
2) This alternative is similar to (1) but uses the animation package:
library(animation)
ani.options(pdftk = "/path/to/pdftk")
pdftk("Rplots.pdf", "burst", "pg_%04d.pdf", "")
file.info(Sys.glob("pg_*.pdf"))

To measure the size of each page in a PDF file I suggest this:
test_size <- TRUE
pdf_name  <- "masterpiece"

if (test_size) {
  dir.create("test_page_size_pdf")
  pdf_address <- paste0("./test_page_size_pdf/page%02d.pdf")
} else {
  pdf_address <- paste0("./", pdf_name, ".pdf")
}

pdf(pdf_address, width = 10, height = 6, onefile = !test_size)
par(mar = c(1, 1, 1, 1), oma = c(1, 1, 1, 1))
plot(rnorm(10^6, 100, 5), type = "l")
plot(sin, -pi, 2*pi)
plot(table(rpois(100, 5)), type = "h", col = "red", lwd = 10,
     main = "rpois(100, lambda = 5)")
plot(x <- sort(rnorm(47)), type = "s", main = "plot(x, type = \"s\")")
points(x, cex = .5, col = "dark red")
dev.off()

if (test_size) {
  files <- paste0("./test_page_size_pdf/", list.files("./test_page_size_pdf/"))
  size_bytes <- format(file.size(files), big.mark = ",")
  file.remove(files)
  file.remove("test_page_size_pdf")   # remove the (now empty) temporary directory
  cbind(files, size_bytes)
}
The size of a PDF page produced by R depends on three things: the content of the plot() calls, the options used in the pdf() function, and the plotting options, which are defined here in par().
All of this is difficult to estimate up front. You also mention that you would like something similar to the shell command ls, which operates on files. So in this solution I create a temporary folder with dir.create(), in which every page of the PDF is saved as a separate file; this is controlled through the onefile option. When the plotting is finished, every per-page PDF file as well as the temporary folder is deleted, and you can see the result in the console.
If you are finished with testing and want the result in a single file, you only have to change the variable in the first line of this script to test_size <- FALSE. By the way, I have some doubt that the size of a page is a good proxy for the quality of an image. PDF is a vector format, so the size corresponds to the number of elements: see the size of the first page in my example, where I plot one million points.
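That last point is easy to check directly. Below is a small sketch (separate from the code above) that writes two single-page PDFs and compares their sizes:

# More plotted elements -> a larger page, because PDF stores them as vector objects
pdf("few_points.pdf");  plot(rnorm(10));               dev.off()
pdf("many_points.pdf"); plot(rnorm(10^5), type = "l"); dev.off()
file.size(c("few_points.pdf", "many_points.pdf"))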

Related

Suppress graph output of a function [duplicate]

I am trying to turn off the display of plots in R.
I read Disable GUI, graphics devices in R, but the only solution given is to write the plot to a file.
What if I don't want to pollute the workspace, and what if I don't have write permission?
I tried options(device = NULL) but it didn't work.
The context is the package NbClust: I want what NbClust() returns, but I do not want to display the plot it draws.
Thanks in advance!
Edit: here is a reproducible example using data from the rattle package :)
data(wine, package="rattle")
df <- scale (wine[-1])
library(NbClust)
# This produces a graph output which I don't want
nc <- NbClust(df, min.nc=2, max.nc=15, method="kmeans")
# This is the plot I want ;)
barplot(table(nc$Best.n[1,]),
        xlab="Number of Clusters", ylab="Number of Criteria",
        main="Number of Clusters Chosen by 26 Criteria")
You can wrap the call in
pdf(file = NULL)
and
dev.off()
This sends all the graphical output to a null device, which effectively hides it.
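For instance, applied to the NbClust call from the question (a sketch, using the wine data and df defined above):

pdf(file = NULL)   # open a null device: plots are swallowed, nothing is written
nc <- NbClust(df, min.nc = 2, max.nc = 15, method = "kmeans")
dev.off()          # close the null device again
barplot(table(nc$Best.n[1, ]))   # the plot you do want still works afterwards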
Luckily it seems that NbClust is one giant messy function with some other functions in it and lots of icky looking code. The plotting is done in one of two places.
Create a copy of NbClust:
> MyNbClust = NbClust
and then edit this function. Change the header to:
MyNbClust <-
function (data, diss = "NULL", distance = "euclidean", min.nc = 2,
max.nc = 15, method = "ward", index = "all", alphaBeale = 0.1, plotetc=FALSE)
{
and then wrap the plotting code in if blocks. Around line 1588:
if(plotetc){
par(mfrow = c(1, 2))
[etc]
cat(paste(...
}
and similarly around line 1610. Save. Now use:
nc = MyNbClust(...etc....)
and you see no plots unless you add plotetc=TRUE.
Then ask the devs to include your patch.

Save automatically produced plots in R

I'm using a function in R that analyses my data and produces several plots.
The function is snpzip() from the adegenet package.
I would like to automatically save the three plots that the function produces as part of its output. Do you have any suggestion on how to do it?
I want to point out that I know how to save a single plot, for instance with png() or pdf() followed by dev.off(). My problem is that when I run snpzip(snps, phen, method = "centroid"), the outcome is three plots (which I would like to save).
I reproduce here the same example as in the adegenet package:
library(adegenet)
simpop <- glSim(100, 10000, n.snp.struc = 10, grp.size = c(0.3, 0.7),
                LD = FALSE, alpha = 0.4, k = 4)
snps <- as.matrix(simpop)
phen <- simpop@pop
outcome <- snpzip(snps, phen, method = "centroid")
If you use a filename with a C integer format in it, then R will substitute the page number for that part of the name, generating multiple files. For example,
png("page%d.png")
plot(1)
plot(2)
plot(3)
dev.off()
will generate 3 files, page1.png, page2.png, and page3.png. For pdf(), you also need onefile=FALSE:
pdf("page%d.pdf", onefile = FALSE)
plot(1)
plot(2)
plot(3)
dev.off()
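Applied to the snpzip() call from the question, that could look like the following sketch (the file name pattern is my own choice, not something adegenet prescribes):

png("snpzip_plot%d.png")   # %d is replaced by the page number
outcome <- snpzip(snps, phen, method = "centroid")
dev.off()
# -> snpzip_plot1.png, snpzip_plot2.png, snpzip_plot3.png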

Can we change the resolution of image within seqtreedisplay() function call?

I am generating tree images with the seqtreedisplay() function from the R package TraMineR, but the default resolution is 72 dpi. I need to create a 300 dpi image. Is it possible to do this within the seqtreedisplay() function call, for example with a "res" argument?
Thanks for the help.
You can control the resolution of the output file produced by the seqtreedisplay function by passing a device.args argument (which will be treated as an element of the ... list).
The device.args argument should be a list of arguments that will be passed to the device used (jpeg when image.format="jpg", and png otherwise).
To get a 300 dpi resolution, you need to set res=300, but also to increase the width and height.
I illustrate with the mvad data:
library(TraMineR)
data(mvad)
## Defining a state sequence object
mvad.seq <- seqdef(mvad[, 17:86])
## Growing a seqtree using Hamming distances:
seqt <- seqtree(mvad.seq ~ male + Grammar + funemp + gcse5eq + fmpr + livboth,
                data = mvad, R = 1000, pval = 0.05, seqdist.arg = list(method = "HAM"))
## Generating the plot as a 300 dpi image in mytree.jpg
seqtreedisplay(seqt, filename = "mytree.jpg", type = "d", border = NA, image.format = "jpg",
               device.args = list(width = 480*300/72, height = 480*300/72, res = 300))
Below is my previous answer, which does not work because seqtreedisplay internally first generates the texts and plots in a bitmap format before saving them in the requested image.format.
A solution would be to select a vector format (e.g. pdf or eps) for the outcome of seqtreedisplay and then to convert this vector file into a raster format with the desired resolution.
Assuming you have installed ImageMagick (and Ghostscript, on which ImageMagick relies to convert to/from pdf or eps), you could use the convert.g function of TraMineRextras for this conversion. I illustrate below using the mvad data:
## Drawing the tree as a pdf file and converting into jpeg
seqtreedisplay(seqt, filename = "mytree.pdf", type = "d", border = NA, image.format = "pdf")
path <- getwd() ## retrieve the path
convert.g(path = path, fileroot = "mytree", from = "pdf", to = "jpeg",
          options = "-units PixelsPerInch -density 300x300")
The resulting jpeg file will be in a jpeg subdirectory of the current folder.

KnitR HTML output showing incorrect/strange results. Inline code and modifying options not yielding the correct output

I'm creating a report on the statistical analysis of several distributions; more specifically, random populations and how their samples differ from them, with the samples adhering to properties of normal distributions while the larger populations remain skewed in most cases.
Although I'm more than satisfied with the rest of the output, I can't figure out why certain numeric values and their visualisations differ from the ones produced at the command line. Here is some of the reproduced code for the discrepancy (first I generate 1000 random exponentials):
set.seed(1000)
pop <- rexp(1000, 0.2)
When extracting, say, the mean of pop, I get the correct result at the console, which is 4.76475. This is the value I should be getting in the markdown output, but instead knitr displays it as 5.015616.
mean(pop)
[1] 4.76475
```{r, echo = T}
mean(pop)
```
[1] 5.015616
It's not just the mean: almost all of the other required statistics for the population as well as the sample are affected. In addition, I also get wrong visualisations in the knitted output:
Original/correct plot
Knitted plot
The plots themselves are displayed differently because of the incorrect results. I thought this was a problem with the digits setting, but options(digits) isn't really solving it, and neither is the default scipen = 0 setting. I've tried inserting inline code, but it still shows me the incorrect values. I referred to knitr's manual in case a chunk setting was missing, but couldn't find a fault there. Is there something missing here, or is this a bug related to random distributions?
EDIT: I noticed another peculiar property. I created a new markdown file to see if the results varied with each new output. Let's call it test.Rmd; it contains the same commands reproduced here, with the same seed. And now I'm getting a totally different result, still different from the original value from the command-line session.
EDIT: Roman's point seems to be working. The knitted results are now closer to the original values but still do not match exactly. A seed of 357 gave me a mean(pop) of 4.881604, which is only a decimal place away from the original value. But why is the seed the game changer here? I thought it had to be 1000.
EDIT: Here's some of the code from the .Rmd file as requested by Phil.
# Load packages
library(ggplot2)
library(knitr)
library(gridExtra)
# Generate random exponentials
set.seed(357)
pop = rexp(1000, 0.2) # lambda is 0.2 with n = 1000
pop.table <- as.data.frame(pop)
# Take a sample simulating 1000 averages of 40 exponentials
sample.exp = NULL
for (i in 1:1000){
  sample.exp = c(sample.exp, mean(rexp(40, 0.2))) # n = 40 here
}
sample.df <- as.data.frame(sample.exp)
# Generate means and compare
mean(pop) # 4.881604
mean(sample.exp) # 4.992426
# Generate variances and compare
var(pop) # 26.07005
var(sample.exp) # 0.6562298
# Some plots
plot.means.pop <- ggplot(pop.table, aes(pop.table$pop)) + geom_histogram(binwidth = 0.9, fill = 'white', colour = 'black') + geom_vline(aes(xintercept = mean(pop.table$pop), colour = 'red')) + labs(title = 'Population Mean', x = 'Exponential', y = 'Frequency') + theme(legend.position = 'none') +theme(plot.title = element_text(hjust = 0.5))
plot.means.sample <- ggplot(sample.df, aes(sample.df$sample.exp)) + geom_histogram(binwidth = 0.2, fill = 'white', colour = 'black') + geom_vline(aes(xintercept = mean(sample.df$sample.exp)), colour = 'red', size = 0.8) + labs(title = 'Sample Mean', x = 'Exponential', y = 'Frequency') + guides(fill = F) + theme(plot.title = element_text(hjust = 0.5))
grid.arrange(plot.means.sample, plot.means.pop, ncol = 2, nrow = 1)
So that's pretty much the main portion of the file that is giving me 'close' values, if not errors or the exact results from the command line. Note: the values annotated above are the new values after setting the seed to 357, and I've set the same seed for the global environment. The values that I'm getting at the console are:
4.76475 for the population mean
5.00238 for the sample mean
21.80913 for the population variance
0.6492991 for the sample variance
When asking a question on Stack Overflow it's essential to provide a minimal reproducible example; have a good read of the advice linked there, as it will guide you through the process.
I think we've all struggled to help you (and we want to!) because we can't reproduce your issue. Compare the following R and Rmd code when run or knitted, respectively:
# Generate random exponentials
set.seed(1000)
pop = rexp(1000, 0.2) # lambs is 0.2 with n = 1000
mean(pop)
## [1] 5.015616
var(pop)
## [1] 26.07005
and the Rmd:
---
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
echo = TRUE,
message = TRUE,
warning = TRUE
)
```
```{r}
# Generate random exponentials
set.seed(1000)
pop = rexp(1000, 0.2) # lambs is 0.2 with n = 1000
mean(pop)
var(pop)
```
Which produces the following output:
# Generate random exponentials
set.seed(1000)
pop = rexp(1000, 0.2) # lambs is 0.2 with n = 1000
mean(pop)
## [1] 5.015616
var(pop)
## [1] 26.07005
As you can see, the results are identical from a clean R session and a clean knitr session. This is as expected, because set.seed(), given the same seed, should produce the same results every time (see the set.seed man page). When you change the seed to 357, both results change together:

|               | mean    | var      |
|---------------|---------|----------|
| console (`R`) | 4.88... | 22.88... |
| knitr (`Rmd`) | 4.88... | 22.88... |
In your second code block your knitr chunk result is correct for the 1000 seed, but the console result of 4.76 is incorrect, suggesting to me your console is producing the incorrect output. This could be for one of a few reasons:
You forgot to set the seed in the console before running the rexp() function; if you run this line without setting the seed, the result will vary every time (see the short demonstration after this list). Ensure you run set.seed(1000) first, or use an R script and source() it to ensure the steps are run in order.
There's something in your global R environment that is affecting your results. This is less likely because you cleared your R environment, but this is one of the reasons it's important to create a new session from time to time, either by closing and re-opening RStudio or pressing CTRL + Shift + F10
There might be something set in your Rprofile.site or .Rprofile that is setting an option on startup and affecting your results. Have a look at Customizing startup to open and check your startup options, and if necessary correct them.
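As a short demonstration of the first point (the 5.015616 value is the one from the chunk output above):

set.seed(1000)
mean(rexp(1000, 0.2))   # 5.015616, reproducible
mean(rexp(1000, 0.2))   # a different value: the seed was consumed, not reset
set.seed(1000)
mean(rexp(1000, 0.2))   # 5.015616 again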
The output you're seeing isn't because of scipen because there are no numbers in scientific/engineering notation, and it's not digits because the differences you're seeing are more than differences in rounding.
If these suggestions still don't solve your issue, post the minimal reproducible example and try on other computers.

Why can't the PDF file created by gage (an R package) be opened?

I am trying to use the gage package in R to analyze my RNA-seq data. I followed the tutorial and got my data.kegg.p object, and I used the following script to generate the heatmap for the top gene set:
for (gs in rownames(data.kegg.p$greater)[1]) {
  outname = gsub(" |:|/", "_", substr(gs, 10, 100))
  geneData(genes = kegg.gs[[gs]], exprs = essData, ref = 1,
           samp = 2, outname = outname, txt = T, heatmap = T,
           Colv = F, Rowv = F, dendrogram = "none", limit = 3, scatterplot = T)
}
I did get a pdf file named "NOD-like_receptor_signaling_pathway.geneData.heatmap.pdf", but when I open this file with Acrobat Reader or Photoshop, it gives an error saying the file is corrupted and cannot be repaired. Could anyone help check this file (https://www.dropbox.com/s/wrsml6n1pbrztnm/NOD-like_receptor_signaling_pathway.geneData.heatmap.pdf?dl=0) to see whether it is really corrupted, and is there a way to recover it?
I also attached the R workspace file (https://www.dropbox.com/s/6n5m9x5hyk38ff1/A549.RData?dl=0). The object "a4" is the data in the format ready for gage analysis. It contains the data of the reference sample (nc) and the treated sample (a549). It is accepted by gage for analysis but produces the heatmap pdf file that cannot be opened (above). Would you mind helping me check whether these data can be properly used to generate the correct gage result?
Best regards.
I'm running into a similar problem myself. I'm not 100% sure, but I think this problem occurs when there is no heatmap to plot. In my case, I was doing an as.group comparison with ref and sample selections. I think the software treats this circumstance as a sample size of 1 and can't really show a differential heatmap. When I tried the 1ongroup setting instead, I was able to view the pdf file.
