Huge file to print when using many plots - r

I am using knitr through LyX to create a document. In it, knitr produces about 20 R-generated plots and 5 inline R calls, along with about 20 pages of text.
The saved PDF is only about 1,500 KB, and I can view and recompile it easily. But as soon as I print, the printer receives about 200 MB of data, and the job takes a very long time (2+ hours).
Do you know the cause, or a solution? I've been working around it by copying the plots and re-inserting them as static images, but that obviously defeats the purpose of reproducible research. With the plots as static images, the PDF shrinks to 367 KB, so I am fairly certain the knitr-generated plots are what inflate the print data. With the image version, the document printed in about 5 minutes (still long, but much shorter than hours).
I've had this issue before, and I believe it has something to do with plotting multiple chains in traceplots. Are these known to take forever to print?
Has anyone else experienced this or know the solution for it?

The default graphics device for LaTeX output is PDF. Presumably some effects within those PDFs are very expensive for your printer to render. I would switch to an alternative graphics device such as png, either per chunk via chunk options or for the whole file via opts_chunk$set. The relevant option is dev, though you may need to adjust dpi as well.
More details on the knitr page
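For example, in an Rnw/LyX setup this could look like the following sketch (the chunk name, dpi value, and plotting call are illustrative):

```r
# Globally, in a setup chunk near the top of the document:
library(knitr)
opts_chunk$set(dev = 'png', dpi = 150)

# Or per chunk, in the chunk header (Rnw syntax):
# <<traceplots, dev='png', dpi=150>>=
# plot(my_chains)   # hypothetical plotting code
# @
```

Raster output trades zoomability for a bounded amount of data per plot, which is usually what a printer wants.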

Related

Ways to make printing (on real, tangible paper) an Rmd prettier?

I find that reading an Rmd rendered to HTML on screen is nice because you can scroll through and use all the interactive bits. But when I actually print an Rmd with the code visible, even short tasks take up quite a bit of space and many sheets of paper. Even the default Rmd that pops up in RStudio, which plots cars, takes up an entire page. Granted, you could make the plot a lot smaller and save some space, but with more complicated and formatted code you start to use a lot of paper quite quickly. Is there a better way to format these so they take up less paper? Keeping the colored code from the digital form would be nice too.
Yes, I agree that sending people the digital version is probably better, but some people I work with insist on printouts.

R knit - knit a section of a document

I am working on a large knitr document whose initial results take some time to calculate. Is there a way to knit just a subsection of the document, i.e., the part after the longer calculations, for development purposes? I would prefer not to have to pre-calculate the results, save them, and then read them into the document just to speed up developing the final plots and tables.
Thanks,
John
Consider breaking your document up into smaller 'child' documents; then you can process just the child you are interested in. Alternatively, as I mentioned in the comments, you might use knitr's caching features.
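A sketch of both approaches in Rnw chunk syntax (the file name and the slow computation are hypothetical):

```r
# Caching: a slow chunk is skipped on re-knit if its code is unchanged.
# <<long-model, cache=TRUE>>=
# fit <- run_long_mcmc(data)   # hypothetical long-running calculation
# @

# Child documents: the parent pulls in each part via the 'child' option,
# and each child can also be knitted on its own during development.
# <<results, child='02-plots.Rnw'>>=
# @
```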

OCR tables in R, tesseract and pre-processing images

I am trying to extract tables from old books using tesseract in R (an example image was attached to the original question).
The quality of the image is quite poor and the recognition rate was quite bad at first. However, I managed to improve it with GIMP: rescaling, grey scale, auto threshold for colours, Gaussian blur and/or sharpen filters.
I also gave Fred's ImageMagick scripts a shot - textcleaner - and used ImageMagick to successfully remove the black lines.
This is what I'm doing in R:
library(tesseract)
library(magick)

img <- image_read('img.png')
img_data <- ocr(img, engine = tesseract('eng', options = list(
  tessedit_char_whitelist = '0123456789.-',
  tessedit_pageseg_mode = 'auto',
  textord_tabfind_find_tables = '1',
  textord_tablefind_recognize_tables = '1'
)))
cat(img_data)
Given that I only want to deal with digits, I set tessedit_char_whitelist and, while I get better results, they are still not reliable.
What would you do in this case? Any other thoughts on improving accuracy before I try to train tesseract? I've never done that - let alone with digits only. Any ideas or experience on how to do it? I've checked out https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00 but I'm still a bit baffled.
I worked on a project that used Tesseract to read data fields off of video frames and create an indexed spreadsheet from them. What worked well was to crop each text field out of each image (using ffmpeg), process it (with ImageMagick, using techniques similar to those you mentioned), OCR it, and then have Python (something similar could be done in R) create a spreadsheet from the OCR results. The benefit of this method is that Tesseract only has to deal with small, single-line text images, which in my case improved results (with the -psm 7 option). The downside is that it is quite processing-intensive. Perhaps creating an image for each line of the page would help.
I did find that training Tesseract for a new font/language improved my results immensely. It can be tedious and time-consuming, but it sometimes took me from 0% correct to 100% correct. This site helped me understand the process; I just followed their steps and it worked. From my experience creating training images, it helped a lot to crop out single characters, with at least a dozen samples of each character to make a good training set. Also try to have a similar number of samples per character; it seemed that if one character had many more samples, Tesseract would (incorrectly) return that character as a result more often.
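In R, the crop-then-OCR idea might be sketched with magick like this (the crop geometry below is an assumption; you would measure it from your own scans):

```r
library(magick)
library(tesseract)

img <- image_read('img.png')

# Crop a single line/cell of digits: "width x height + x_offset + y_offset".
# The geometry here is purely illustrative.
line <- image_crop(img, geometry = '300x40+50+120')
line <- image_convert(line, colorspace = 'gray')

digits <- tesseract('eng', options = list(
  tessedit_char_whitelist = '0123456789.-',
  tessedit_pageseg_mode = '7'   # treat the image as a single text line
))
cat(ocr(line, engine = digits))
```

Looping that over each table cell, then assembling the results into a data frame, mirrors the spreadsheet step described above.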

What is the most useful output format for graphs? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 9 years ago.
Before any of you rush to vote to close, let me say that I understand this question may be subjective, and the expected answer may begin with "it depends". Nevertheless, it is a genuinely relevant problem I run into: I am creating more and more graphs, and I don't necessarily know exactly how I am going to use them, or I don't have time to test the final use case immediately.
So I am hoping to leverage the experience of SO's R users for good reasons to choose between jpeg(), bmp(), png(), tiff() and pdf(), and possibly with which options. I don't have enough experience with R, or knowledge of the different formats, to choose wisely.
Potential use cases:
quick look after or during run time of algorithms
presentations (.ppt mainly)
reports (word or latex)
publication (internet)
storage (without too much loss and to transform it later for a specific use)
anything relevant I forgot
Thanks! I'm happy to make the question clearer.
To expand a little on my comment, there is no real easy answer, but my suggestions:
My first, totally flexible choice would be to simply store the final raw data used in the plot(s), plus a bit of R code for generating the plot(s). That way you can easily send the output to whatever device suits your particular purpose. It would not be that arduous to set up a couple of basic templates based on png()/pdf() that you can call upon.
Use the svg() device. As noted by @gung, storing the output using pdf(), svg(), cairo_ps() or cairo_pdf() are your only real options for retaining scalable vector images. I would lean towards svg() rather than pdf() due to the greater editing options available in programs like Inkscape. It is also becoming a quite widely supported format for internet publication (see http://caniuse.com/svg).
If, on the other hand, you're a LaTeX user, most headaches seem to be solved by going straight to pdf() - you can usually import and convert PDF files using Inkscape or command-line utilities like ImageMagick if you have to shift formats.
For Word/PowerPoint interaction, if you are running R on Windows you can also export directly using win.metafile(), which gives you scalable emf images that you can import into Word or PowerPoint directly. I have heard of people running R through Wine, or using intermediary steps on Linux, to get emf files out for later use; for Mac there are roundabout pathways as well.
So, to summarise, in order of preference:
Don't store images at all; store the code to generate them.
Use svg/pdf and convert formats as required.
As a fallback, use win.metafile to export directly in those cases where you can't escape Word/PowerPoint and are primarily on Windows systems.
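A minimal sketch of the first option plus the device calls (the object, function, and file names are illustrative):

```r
# Store the final data and a small plotting function, not the image:
plot_fun <- function(d) plot(d$x, d$y, type = 'l')
saveRDS(plot_data, 'plot_data.rds')   # 'plot_data' is your final data frame

# Later, render to whichever device the use case requires:
d <- readRDS('plot_data.rds')
svg('figure.svg', width = 7, height = 5); plot_fun(d); dev.off()  # web, Inkscape
pdf('figure.pdf', width = 7, height = 5); plot_fun(d); dev.off()  # LaTeX
png('figure.png', width = 2100, height = 1500, res = 300)         # quick look
plot_fun(d); dev.off()
```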
So far the answers to this question have all recommended outputting plots in vector-based formats. This gives the best output, allowing you to resize the image as needed for whatever medium it ends up in (webpage, document, or presentation), but it comes at a computational cost.
For my own work, I often find it is much more convenient to save my plots in a raster format of sufficient resolution. You probably want to do this whenever your data takes a non-trivial amount of time to plot.
Some examples of where I find a raster format is more convenient:
Manhattan plots: plots showing p-value significance for hundreds of thousands to millions of DNA markers across a genome.
Large Heatmaps: Clustering the top 5000 differentially expressed genes between two groups of people, one with a disease, and one healthy.
Network Rendering: When drawing a large number of nodes connected to each other by edges, redrawing the edges (as vectors) can slow down your computer.
Ultimately it comes down to a trade-off in your own sanity. What annoys you more: your computer grinding to a halt trying to redraw an image, or figuring out the exact dimensions to render a raster image so it doesn't look awful in the final publishing medium?
The most basic distinction to bear in mind here is raster graphics versus vector graphics. In general, vector graphics preserve your options for later. Of the formats you listed, jpeg, bmp, tiff and png are raster formats; only pdf gives you vector graphics, so it is probably the best default among your listed options.

ggplot2 - printing plot balloons memory

Is it expected that printing a large-ish ggplot to PDF causes the R session's memory to balloon? I have a ggplot2 object that is around 72 MB, and my R session grows to over 2 GB when printing it to PDF. Is this expected? Are there ways to optimize performance? I also find that the resulting PDFs are huge (~25 MB) and I have to use an external program to shrink them (to 50 KB with no visual loss!). Is there a way to print to PDF with lower-quality graphics, or perhaps some parameter to print or ggplot that I haven't considered?
For large data sets, I find it helpful to pre-process the data before putting together the ggplot (even if ggplot offers the same calculations).
ggplot has to be very general: it cannot predict what stat or geom you will add later on, so it is very difficult to optimize things there (the split-apply-combine strategy can lead to exploding intermediate memory requirements). OTOH, you know what you want and can pre-calculate accordingly.
The large PDF indicates that you either have a lot of overplotting or you produce objects too small to be seen. In both cases, you could gain a lot by applying appropriate summary statistics (e.g. hexbin or a boxplot instead of a scatterplot).
I think we cannot tell you more without details of what you are doing. So please create a minimal example and/or upload the compressed plot you are producing.
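For instance, a hexbin summary keeps the PDF small where a scatterplot would embed every point. A sketch with random data as a stand-in (geom_hex requires the hexbin package to be installed):

```r
library(ggplot2)
d <- data.frame(x = rnorm(1e6), y = rnorm(1e6))  # stand-in for a large dataset

# A scatterplot would write a million point objects into the PDF;
# the hexbin summary writes only a few thousand polygons.
p <- ggplot(d, aes(x, y)) + geom_hex(bins = 50)
ggsave('hex.pdf', p, width = 7, height = 5)
```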
Addressing the second part of your question, R makes no attempt to optimize PDFs. If you are overplotting a lot of points, this results in some ridiculous behavior. You can use qpdf to post-process the PDF.
Addressing the first question anecdotally, it does seem that plots on medium-sized datasets take up a lot of memory, but that is merely my experience. Others may have more opinions as to why or more facts as to whether this is so.
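If qpdf (or Ghostscript) is installed, base R can drive that post-processing step; the file path here is illustrative:

```r
# Re-saves the PDF with compressed object streams; see ?tools::compactPDF.
tools::compactPDF('figure.pdf', gs_quality = 'ebook')

# Equivalent idea from the shell:
#   qpdf --object-streams=generate figure.pdf figure_small.pdf
```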
Saving in a bitmap format like png can reduce the file size considerably. Note that this is only appropriate for certain uses of the final image; in particular, it can't be zoomed in as far as a PDF can. But if the final image size is known, it can be a useful method.
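For example (plot, file names, and dimensions are illustrative), the format choice is just a matter of the extension and dpi passed at save time:

```r
library(ggplot2)
p <- ggplot(mpg, aes(displ, hwy)) + geom_point()  # stand-in for your plot

ggsave('plot.png', p, width = 7, height = 5, dpi = 300)  # raster: fixed resolution
ggsave('plot.pdf', p, width = 7, height = 5)             # vector: zoomable
```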
