ggplot2 - printing plot balloons memory - r

Is it expected that printing a large-ish ggplot to PDF will cause the RSession memory to balloon? I have a ggplot2 object that is around 72 MB. My RSession grows to over 2 GB when printing to PDF. Are there ways to optimize performance? I also find that the resulting PDFs are huge (~25 MB) and I have to use an external program to shrink them down (to 50 KB with no visual loss!). Is there a way to print to PDF with lower quality graphics? Or perhaps some parameter to print or ggplot that I haven't considered?

For large data sets, I find it helpful to pre-process the data before putting together the ggplot (even if ggplot offers the same calculations).
ggplot has to be very general: it cannot predict what stat or geom you want to add later on, so it is very difficult to optimize things there (the split-apply-combine strategy can lead to exploding intermediate memory requirements). OTOH, you know what you want and can pre-calculate accordingly.
The large pdf indicates that you either have a lot of overplotting or you produce objects that are too small to be seen. In both cases, you could gain a lot by applying appropriate summary statistics (e.g. hexbin or boxplot instead of scatterplot).
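As a rough sketch of the pre-summarising idea (the simulated data and the bin width are made up for illustration, not taken from the question): bin the points in base R first, so that the plot object holds one row per occupied bin instead of one row per observation. ggplot2's own geom_bin2d() or geom_hex() would do the binning for you, but then the full data set still travels inside the plot object.

```r
set.seed(1)
n <- 1e5
raw <- data.frame(x = rnorm(n), y = rnorm(n))   # hypothetical data set

bin_width <- 0.1                                # assumed bin size
raw$xbin  <- round(raw$x / bin_width) * bin_width
raw$ybin  <- round(raw$y / bin_width) * bin_width
raw$count <- 1L

# One row per occupied bin, with the number of points that fell into it
binned <- aggregate(count ~ xbin + ybin, data = raw, FUN = sum)

# Plot the summary rather than the raw points (only if ggplot2 is installed)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(binned, aes(xbin, ybin, fill = count)) + geom_tile()
}
```

Comparing object.size(p) with the size of the equivalent geom_point() plot built on raw gives a quick feel for how much the pre-aggregation saves.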
I think we cannot tell you more without details of what you are doing. So please create a minimal example and/or upload the compressed plot you are producing.

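Addressing the second part of your question, R makes no attempt to optimize PDFs. If you are overplotting a lot of points, every one of them is still written to the file, hidden or not, so the file grows with the number of points drawn rather than with what is visible. You can use qpdf to post-process the PDF.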
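One way to try this without leaving R is tools::compactPDF(), which ships with base R and calls an external qpdf (and optionally Ghostscript) if it can find one. The sketch below uses a throwaway file and simulated points; how much it saves depends on the plot, and it does nothing at all if neither tool is installed.

```r
pdf_file <- file.path(tempdir(), "overplotted.pdf")

# A heavily overplotted scatterplot written with the standard pdf() device
set.seed(1)
pdf(pdf_file)
plot(rnorm(5e4), rnorm(5e4), pch = 16)
dev.off()

size_before <- file.size(pdf_file)

# Rewrites the file in place only when the compacted version is smaller.
# Passing gs_quality = "ebook" additionally routes it through Ghostscript, if available.
res <- tools::compactPDF(pdf_file)

size_after <- file.size(pdf_file)
c(before = size_before, after = size_after)
```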
Addressing the first question anecdotally, it does seem that plots on medium-sized datasets take up a lot of memory, but that is merely my experience. Others may have more opinions as to why or more facts as to whether this is so.

Saving in a bitmap format like png can reduce the file size considerably. Note that this is only appropriate for certain uses of the final image; in particular, it can't be zoomed in as far as a pdf can. But if the final image size is known, it can be a useful method.
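A minimal sketch with the base png() device, assuming a final size of 6 x 4 inches at 300 dpi (both numbers are placeholders to adjust for the target medium):

```r
png_file <- file.path(tempdir(), "scatter.png")

# Fix the physical size and resolution up front, since a bitmap cannot be rescaled later
png(png_file, width = 6, height = 4, units = "in", res = 300)
set.seed(1)
plot(rnorm(5e4), rnorm(5e4), pch = 16, col = rgb(0, 0, 0, 0.2))
dev.off()

file.size(png_file)  # size in bytes; depends on image dimensions, not on point count
```

For a ggplot object p, ggsave("plot.png", p, width = 6, height = 4, dpi = 300) does the same job.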

Related

raster::gridDistance() fails on medium to large raster files

I have some moderately sized raster files (max size ~190 MB) that I would like to calculate grid distances for using raster::gridDistance().
I'm finding that the operation is slow and/or R crashes for the largest of my files. Please note: I'm not seeking memory management advice (e.g. maxing out memory.limit(), breaking into smaller rasters or pursuing parallel processing methods) as these are sidestepping my issue. If grid distances should not be attempted for 190+ MB size files, then I will just break the job into more manageable pieces.
The raster::gridDistance() documentation mentions that I can try to solve "errors in the case of complex objects spread over different chunks... by varying the chunk size, see function setOptions()" and that "Additional distance measures and options (directions, cost-distance) are available in the {gdistance} package", but I have been hesitant to pursue these without better understanding the limitations/considerations.
Thanks to this question R - terra::distance() equivalent of raster::gridDistance(..., origin = x, omit = y) I understand that there is an alternative method using terra::gridDistance(), but I am not able to discern if the operation is any more efficient or suitable for my needs than raster::gridDistance()
I haven't posted a reprex or session info as my question is really as follows:
Is terra::gridDistance() (or some other alternative like those offered by {gdistance}) really a more efficient (faster) or customizable way for calculating a grid distance using moderate-large raster files?
If not, what are considerations for changing how the grid distance is calculated (varying chunk size or other means) using raster::gridDistance() and setOptions()?
If there is interest, I can reformat my question so that it better fits guidelines with a reprex etc. Also, I am posting the question here rather than Geographic Information Systems because the original linked question was posted here.
"I understand that there is an alternative method using terra::gridDistance(), but I am not able to discern if the operation is any more efficient or suitable for my needs"
Well, did you try it? That could have been more efficient than writing a long question.
The help file does not mention the limitations that raster::gridDistance has, so you should be good to go. But note that the method was renamed to terra::gridDist()
The "terra" package is the replacement for "raster" package; so "terra" is the best starting point more generally, I think.
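For anyone who wants to try it, here is a small sketch modelled on the kind of toy raster used in the package help. It assumes a terra version recent enough to have gridDist(); the cell values and the choice of target = 0 are illustrative only. Cells equal to the target are the origins, and NA cells cannot be traversed.

```r
library(terra)

# Toy raster: all cells traversable (value 1) by default
r <- rast(ncols = 10, nrows = 10, vals = 1)
r[48] <- 0        # origin cell: distances are measured to cells with this value
r[66:68] <- NA    # barrier cells that cannot be crossed

# Grid distance from every cell to the nearest cell whose value equals `target`
d <- gridDist(r, target = 0)
d
```

Timing this against raster::gridDistance() on one of the real files is probably the quickest way to answer the efficiency question.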

improving rGL HTML performance with multiple figures (mfrow3d) + rglWidgets

I am using RGL to produce a panel of multiple figures through the mfrow3d command.
For the most part, the HTML produced from the call to writeWebGL is exemplary.
The one caveat is that for multiple figures (be it 6 or 16), I have noticed a bit of lag when attempting to manipulate any one of these figures (to pan/zoom/look around).
An example can be found here: http://fluxions.dydx.ie:1338/schiz.html (warning, 100MB html file haha).
I wanted to ask people here if there is anything I can do in terms of using the "reuse" argument that may speed up performance.
Additionally, I wanted to ask if there is any benefit to using rglWidgets, and if there is a small example someone could provide of porting a writeWebGL call produced from the following:
https://johnmuschelli.com/WebGL_Interactive_Paper/supp_1/supp_1_wrap.Rmd
to rglwidgets (in hopes that the reuse argument in widgets may improve performance due to my use of mfrow3d).
I am not familiar with how to capture a multi-figure layout with multiple calls to contour3d as a scene that widgets can use.
Dr Duncan Murdoch has gotten back to me and said there probably is not a way to do this, so I guess I will close it.
He is very helpful and I thank him for his support.

OCR tables in R, tesseract and pre-processing images

I am trying to extract tables from old books using tesseract in R. Here is an example: Image
The quality of the image is quite poor and the recognition rate was quite bad at first. However, I managed to increase it with gimp: Rescaling, grey scale, auto threshold for colours, Gaussian blur and/or sharpen filters.
I also gave a shot to Fred's imageMagick scripts - textcleaner - and used imageMagick to successfully remove the black lines.
This is what I'm doing in R:
library(tesseract)
library(magick)

# Read the pre-processed scan
img <- image_read('img.png')

# OCR restricted to digits, '.' and '-', with table detection switched on
img_data <- ocr(img, engine = tesseract('eng', options = list(
  tessedit_char_whitelist = '0123456789.-',
  tessedit_pageseg_mode = 'auto',
  textord_tabfind_find_tables = '1',
  textord_tablefind_recognize_tables = '1')))
cat(img_data)
Given that I only want to deal with digits, I set tessedit_char_whitelist and, while I get better results, they are still not reliable.
What would you do in this case? Any other thoughts to improve accuracy before I try to train tesseract? I've never done it - let alone with digits only. Any idea/experience on how to do it? I've checked this out: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00 but I'm still a bit baffled.
I worked on a project that used Tesseract to read data fields off of video frames and create an indexed spreadsheet from them. What I found to work well was to crop each text field (using ffmpeg) out of each image, process it (with ImageMagick, using techniques similar to those you mentioned), OCR it, and then have Python (something similar could be done in R) create a spreadsheet from the OCR results. The benefit of this method is that Tesseract only has to deal with small, single-line text images, which in my case seemed to improve results (with the -psm 7 option). The downside is that it's quite processing intensive. Perhaps creating an image for each line of the page would help.
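In R, the crop-then-OCR idea might look roughly like the sketch below, using the magick and tesseract packages already loaded in the question. The helper name ocr_line and the geometry strings are hypothetical; the row positions would have to come from your own page layout. Page segmentation mode 7 tells Tesseract to treat the image as a single text line.

```r
library(magick)
library(tesseract)

# Engine limited to digits, '.' and '-', treating each image as one line of text
digits_engine <- tesseract("eng", options = list(
  tessedit_char_whitelist = "0123456789.-",
  tessedit_pageseg_mode   = "7"
))

# Hypothetical helper: crop one row out of the page and OCR just that strip.
# `geometry` is an ImageMagick geometry string "WIDTHxHEIGHT+XOFFSET+YOFFSET".
ocr_line <- function(img, geometry, engine = digits_engine) {
  strip <- image_crop(img, geometry)
  trimws(ocr(strip, engine = engine))
}

# Usage (file name and row offsets are placeholders):
# img  <- image_read("img.png")
# rows <- sprintf("800x40+0+%d", seq(100, 500, by = 40))
# vals <- vapply(rows, function(g) ocr_line(img, g), character(1))
```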
I did find that training Tesseract for a new font/language helped my results immensely. It can be tedious and time consuming, but it significantly improved my results, sometimes going from 0% correct to 100% correct. This site helped me understand the process. I just followed their steps and it worked, sure enough. From my experience in creating training images, it helped a lot to crop out single characters, with at least a dozen of each character to create a good training sample. And try to have a similar number of samples for each character; it seemed that if you provided many more samples of one character, Tesseract would (incorrectly) give that character as a result more often.

Huge file to print when using many plots

I am using knitr through Lyx to create a document. In this document, I use knitr to print about 20 images (through R), and 5 calls from R, along with about 20 pages of text.
I save the pdf file, and it is only 1500 KB, and I can view and recompile it easily. But as soon as I go to print, the printer reads about 200MB of information. It takes a super long time (2+ hours) to print.
I was wondering if you knew the solution for this, or even the cause. I’ve been trying to remedy it by just copying the plots and putting them in as figures, but this obviously defeats the purpose of reproducible research. When I put the plots as pictures, we get down to a pdf size of 367 KB. I am fairly certain it is knitr generated plots that are causing the increase in data. When I changed the plots to pictures, it printed in about 5 minutes (which is still a long time, but much shorter than hours).
I've had this issue before, and I believe that it has something to do with plotting of multiple chains for traceplots. Are these known to take forever to print?
Has anyone else experienced this or know the solution for it?
The default graphics device for LaTeX output is pdf, so each plot is embedded as a vector PDF. Presumably there are some effects within the PDF which are very expensive for your printer to render. I would specify an alternative graphics device such as png, either per chunk using chunk options or as the default for the whole file using opts_chunk$set. The relevant option is dev, though you may need to change dpi too.
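For example, in a setup chunk near the top of the document (the dpi value here is only a starting point, not a recommendation from the question):

```r
library(knitr)

# Render every figure in the document as a 300 dpi PNG instead of a vector PDF
opts_chunk$set(dev = "png", dpi = 300)

# To switch only the expensive traceplot chunks, set the options per chunk instead,
# e.g. in an .Rnw file:  <<traceplots, dev='png', dpi=300>>=
```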
More details on the knitr page

What is the most useful output format for graphs? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 9 years ago.
Before any of you run at the closing vote let me say that I understand that this question may be subjective, and the expected answer may begin by "it depends". Nevertheless, it is an actually relevant problem I run into, as I am creating more and more graphs, and I don't necessarily know the exact way I am going to use them, or don't have the time to test for the final use case immediately.
So I am leveraging the experience of SO R users to get good reasons to choose one over the other, between jpg(), bmp(), png(), tiff(), pdf() and possibly with which options. I don't have the experience in R and the knowledge in the different formats to choose wisely.
Potential use cases:
quick look after or during run time of algorithms
presentations (.ppt mainly)
reports (word or latex)
publication (internet)
storage (without too much loss and to transform it later for a specific use)
anything relevant I forgot
Thanks! I'm happy to make the question clearer.
To expand a little on my comment, there is no real easy answer, but my suggestions:
My first totally flexible choice would be to simply store the final raw data used in the plot(s) and a bit of R code for generating the plot(s). That way you could easily enough send the output to whatever device that suits your particular purpose. It would not be that arduous a task to set yourself up a couple of basic templates based on png()/pdf() that you could call upon.
Use the svg() device. As noted by @gung, pdf(), svg(), cairo_ps() and cairo_pdf() are your only real options for retaining scalable, vector images. I would tend to lean towards svg() rather than pdf() due to the greater editing options available using programs like Inkscape. It is also becoming a quite widely supported format for internet publication (see http://caniuse.com/svg).
If on the other hand you're a latex user, most headaches seem to be solved by going straight to pdf() - you can usually import and convert pdf files using Inkscape or command line utilities like Imagemagick if you have to format shift.
For Word/Powerpoint interaction, if you are running R on Windows, you can also export directly using win.metafile() which will give you scalable/component emf images which you can import into Word or Powerpoint directly. I have heard of people running R through Wine or using intermediary steps on Linux to get emf files out for later use. For Mac, there are roundabout pathways as well.
So, to summarise, in order of preference:
1. Don't store images at all, store code to generate images.
2. Use svg/pdf and convert formats as required.
3. Use a backup win.metafile export directly for those cases where you can't escape using Word/Powerpoint and are primarily going to be based on Windows systems.
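The "store code, not images" approach can be as small as one helper that re-runs a plotting function against whichever devices are needed. This is only a sketch: save_plot, its arguments and the example plot are made up for illustration, and svg() additionally requires an R build with cairo support.

```r
# Hypothetical template: draw the same plot to several devices from one function.
# `draw` is a zero-argument function that produces the plot; `stem` is the
# output path without extension.
save_plot <- function(draw, stem, formats = c("pdf", "png"),
                      width = 6, height = 4, res = 300) {
  paths <- character(0)
  for (fmt in formats) {
    path <- paste0(stem, ".", fmt)
    switch(fmt,
      pdf = pdf(path, width = width, height = height),
      svg = svg(path, width = width, height = height),
      png = png(path, width = width, height = height, units = "in", res = res),
      stop("unsupported format: ", fmt)
    )
    # Always close the device, even if the plotting code fails
    tryCatch(draw(), finally = dev.off())
    paths <- c(paths, path)
  }
  invisible(paths)
}

# Usage: keep the data and this function; regenerate images on demand
draw_cars <- function() plot(cars, main = "Speed vs stopping distance")
out <- save_plot(draw_cars, file.path(tempdir(), "cars"))
out
```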
So far the answers for this question have all recommended outputting plots in vector based formats. This will give you the best output, allowing you to resize your image as you need for whatever medium your image will end up in (whether that be a webpage, document, or presentation), but this comes at a computational cost.
For my own work, I often find it is much more convenient to save my plots in a raster format of sufficient resolution. You probably want to do this whenever your data takes a non-trivial amount of time to plot.
Some examples of where I find a raster format is more convenient:
Manhattan plots: A plot showing p-value significance for hundreds of thousands to millions of DNA markers across a genome.
Large Heatmaps: Clustering the top 5000 differentially expressed genes between two groups of people, one with a disease, and one healthy.
Network Rendering: When drawing a large number of nodes connected to each other by edges, redrawing the edges (as vectors) can slow down your computer.
Ultimately it comes down to a trade-off in your own sanity. What annoys you more: your computer grinding to a halt trying to redraw an image, or figuring out the exact dimensions to render an image in raster format so it doesn't look awful in your final publishing medium?
The most basic distinction to bear in mind here is raster graphics versus vector graphics. In general, vector graphics will preserve options for you later. Of the options you listed, jpeg, bmp, tiff, and png are raster formats; only pdf will give you vector graphics. Thus, that is probably the best default of your listed options.
