What is the most useful output format for graphs? [closed]

Closed. This question is opinion-based. It is not currently accepting answers. Closed 9 years ago.
Before any of you rush to the close vote, let me say that I understand this question may be subjective, and the expected answer may begin with "it depends". Nevertheless, it is a genuinely relevant problem I run into, as I am creating more and more graphs and don't necessarily know exactly how I am going to use them, or don't have the time to test for the final use case immediately.
So I am leveraging the experience of SO R users to get good reasons to choose one format over another, among jpg(), bmp(), png(), tiff() and pdf(), and possibly with which options. I don't have enough experience with R, or enough knowledge of the different formats, to choose wisely.
Potential use cases:
quick look after or during run time of algorithms
presentations (.ppt mainly)
reports (word or latex)
publication (internet)
storage (without too much loss and to transform it later for a specific use)
anything relevant I forgot
Thanks! I'm happy to make the question clearer.

To expand a little on my comment, there is no real easy answer, but here are my suggestions:
My first, totally flexible choice would be to simply store the final raw data used in the plot(s) plus a bit of R code for generating the plot(s). That way you can easily send the output to whatever device suits your particular purpose. It would not be an arduous task to set yourself up a couple of basic templates based on png()/pdf() that you can call upon.
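A minimal sketch of that approach (the file names, data and plotting function here are placeholders, not from the original answer):
# Store the final raw data plus a small plotting function, then render to
# whatever device is needed later.
plot_data <- data.frame(x = 1:10, y = rnorm(10))   # stand-in for the real data
saveRDS(plot_data, "plot_data.rds")
make_plot <- function(d) plot(d$x, d$y, type = "b")
render <- function(file, d = readRDS("plot_data.rds"),
                   width = 7, height = 5, res = 300) {
  switch(tools::file_ext(file),
         pdf = pdf(file, width = width, height = height),
         png = png(file, width = width, height = height, units = "in", res = res),
         stop("unsupported extension"))
  make_plot(d)
  dev.off()
}
render("figure.pdf")   # vector, for LaTeX/print
render("figure.png")   # raster, for quick looks or slides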
Use the svg() device. As noted by @gung, storing the output using pdf(), svg(), cairo_ps() or cairo_pdf() are your only real options for retaining scalable vector images. I would tend to lean towards svg() rather than pdf() because of the greater editing options available in programs like Inkscape. It is also becoming a widely supported format for internet publication (see http://caniuse.com/svg).
If, on the other hand, you're a LaTeX user, most headaches seem to be solved by going straight to pdf(); you can usually import and convert PDF files using Inkscape or command-line utilities like ImageMagick if you have to shift formats.
For Word/PowerPoint interaction, if you are running R on Windows you can also export directly using win.metafile(), which gives you scalable EMF images that you can import into Word or PowerPoint directly. I have heard of people running R through Wine or using intermediary steps on Linux to get EMF files out for later use; for Mac, there are roundabout pathways as well.
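As a quick illustration (Windows only; the file name is arbitrary), win.metafile() is used like any other R graphics device:
win.metafile("figure.emf", width = 7, height = 5)   # EMF that Word/PowerPoint can rescale
plot(pressure, type = "b")
dev.off()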
So, to summarise, in order of preference:
Don't store images at all; store the code to generate the images.
Use svg/pdf and convert formats as required.
As a backup, export directly with win.metafile() for those cases where you can't escape Word/PowerPoint and are primarily based on Windows systems.

So far the answers to this question have all recommended outputting plots in vector-based formats. This will give you the best output, allowing you to resize the image as needed for whatever medium it ends up in (a webpage, document, or presentation), but it comes at a computational cost.
For my own work, I often find it much more convenient to save my plots in a raster format at sufficient resolution; you probably want to do this whenever your data takes a non-trivial amount of time to plot (see the sketch after the examples below).
Some examples of where I find a raster format is more convenient:
Manhattan plots: a plot showing p-value significance for hundreds of thousands to millions of DNA markers across a genome.
Large heatmaps: clustering the top 5000 differentially expressed genes between two groups of people, one with a disease and one healthy.
Network renderings: when drawing a large number of nodes connected to each other by edges, redrawing the edges (as vectors) can slow down your computer.
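A rough sketch of the raster route; the resolution, size and data below are assumptions to be matched to the target medium:
# Render a heavy plot once to a high-resolution PNG instead of a vector device.
png("manhattan.png", width = 8, height = 6, units = "in", res = 300)
plot(pos, -log10(pval), pch = 20, cex = 0.3)   # pos/pval stand in for the real marker data
dev.off()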
Ultimately it comes down to a trade-off in your own sanity: what annoys you more, your computer grinding to a halt trying to redraw an image, or figuring out the exact dimensions needed to render an image in raster format so it doesn't look awful in your final publishing medium?

The most basic distinction to bear in mind here is raster graphics versus vector graphics. In general, vector graphics preserve options for you later. Of the formats you listed, jpeg, bmp, tiff, and png are raster formats; only pdf gives you vector graphics. It is therefore probably the best default among your listed options.

Related

improving rGL HTML performance with multiple figures (mfrow3d) + rglWidgets

I am using rgl to produce a panel of multiple figures through the mfrow3d command.
For the most part, the HTML produced from the call to writeWebGL is exemplary.
The one caveat is that for multiple figures (be it 6 or 16), I have noticed a bit of lag when attempting to manipulate any one of them (to pan/zoom/look around).
An example can be found here: http://fluxions.dydx.ie:1338/schiz.html (warning, 100 MB HTML file, haha).
I wanted to ask whether there is anything I can do, in terms of using the "reuse" argument, that may speed up performance.
Additionally, I wanted to ask if there is any benefit to using rglwidget, and whether there is a small example someone could provide for porting a writeWebGL call produced from the following:
https://johnmuschelli.com/WebGL_Interactive_Paper/supp_1/supp_1_wrap.Rmd
to rglwidget (in the hope that the reuse argument in the widget may improve performance, given my use of mfrow3d).
I am not familiar with how to capture a multi-figure layout, with multiple calls to contour3d, as a scene that the widget can use.
Dr. Duncan Murdoch has gotten back to me and said there probably is not a way to do this, so I guess I will close it.
He is very helpful and I thank him for his support.

OCR tables in R, tesseract and pre-processing images

I am trying to extract tables from old books using tesseract in R. Here is an example: Image
The quality of the image is quite poor and the recognition rate was quite bad at first. However, I managed to improve it with GIMP: rescaling, greyscale, auto threshold for colours, Gaussian blur and/or sharpen filters.
I also gave Fred's ImageMagick scripts (textcleaner) a shot, and used ImageMagick to successfully remove the black lines.
This is what I'm doing in R:
library(tesseract)
library(magick)
img <- image_read('img.png')
# whitelist digits only and enable Tesseract's table detection
img_data <- ocr(img, engine = tesseract('eng', options = list(
  tessedit_char_whitelist = '0123456789.-',
  tessedit_pageseg_mode = 'auto',
  textord_tabfind_find_tables = '1',
  textord_tablefind_recognize_tables = '1')))
cat(img_data)
Given that I only want to deal with digits, I set tessedit_char_whitelist and, while I get better results, they are still not reliable.
What would you do in this case? Any other thoughts to improve accuracy before I try to train tesseract? I've never done it - let alone with digits only. Any idea/experience on how to do it? I've checked this out: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00 but I'm still a bit baffled.
I worked on a project that used Tesseract to read data fields off video frames and create an indexed spreadsheet from them. What I found to work well was to crop each text field (using ffmpeg) out of each image, process it (with ImageMagick, using techniques similar to those you mentioned), OCR it, and then have Python (something similar could be done in R) create a spreadsheet from the OCR results. The benefit of this method is that Tesseract only has to deal with small, single-line text images, which in my case seemed to improve results (with the -psm 7 option). The downside is that it's quite processing-intensive. Perhaps creating an image for each line of the page would help.
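A rough R translation of that per-field idea, using the magick and tesseract packages (the crop geometry and file name are purely illustrative, and psm 7 is the single-text-line mode mentioned above):
library(tesseract)
library(magick)
page <- image_read('img.png')
cell <- image_crop(page, "200x40+120+300")        # one table cell; geometry is a placeholder
cell <- image_convert(cell, type = 'grayscale')
cell <- image_resize(cell, "300%")                # upscale before OCR
digits_engine <- tesseract('eng', options = list(
  tessedit_char_whitelist = '0123456789.-',
  tessedit_pageseg_mode = '7'))                   # 7 = treat the image as a single text line
cat(ocr(cell, engine = digits_engine))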
I did find that training Tesseract for a new font/language helped my results immensely. It can be tedious and time-consuming, but it significantly improved my results, sometimes going from 0% correct to 100% correct. This site helped me understand the process; I just followed their steps and it worked, sure enough. From my experience creating training images, it helped a lot to crop out single characters, with at least a dozen of each character, to create a good training sample. And try to have a similar number of samples for each character; it seemed that if you included many more of one character, Tesseract would return that character (incorrectly) more often.

Speeding up ggplot2: does it make sense to pre-render plots? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers. Closed 7 years ago.
I am building an interactive function that will repeatedly build and plot reasonably complicated ggplot2 plots.
Users provide input (rotation angles for a PCA loadings matrix, actually), and I'd like to show them rotated results asap.
Unfortunately, plotting the plots with ggplot2 is quite sluggish.
Note:
There is emphatically not a lot of data (<100 data points or so), so pre-processing won't help (that's the issue on this and a lot of other SO ggplot2 performance posts).
I have to stick with ggplot2 for now. (I know, I know, ggobi etc. ...).
I do know the range of possible inputs in advance (0-360): that's a very finite number.
I have cached the ggplot-generating functions with memoise, but that doesn't seem to help much; the problem seems to be the actual plotting on the graphics device.
(I also noticed that the internal graphics device of RStudio is particularly sluggish).
So, I thought, maybe it would be an idea to somehow pre-render all the necessary plots, perhaps by saving the output of the svg() graphics device to files or something, and then to plot those cached versions as necessary.
On a scale of 1-10, how stupid of an idea is that?
Any better ideas?
Will this even speed up the plotting, or will the graphics device still be the bottleneck?
Why can't we have hardware acceleration in R :(.
Update
This is not hosted software (for now); I'm just working locally, and it should work from any number of clients and on any number of platforms.
I am aware of (the much faster) ggvis and ggobi, but these are not an option for now (development bandwidth is too small).
There are actually several relatively complicated, nested (grid.arranged) plotting functions, and those were memoised at some point, with no noticeable speed increase.
Opening pre-rendered files in an external file viewer seems to endanger cross-platform appeal, correct?
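For reference, a sketch of the pre-rendering idea described above might look like this (make_rotated_plot() is a placeholder for the actual plotting function, and the sizes/dpi are assumptions):
library(ggplot2)
library(png)
library(grid)
cache_dir <- tempfile("plot_cache_")
dir.create(cache_dir)
# Render one PNG per integer angle up front...
for (angle in 0:359) {
  ggsave(file.path(cache_dir, sprintf("%03d.png", angle)),
         make_rotated_plot(angle), width = 6, height = 6, dpi = 96)
}
# ...then displaying a cached bitmap is just a raster draw.
show_cached <- function(angle) {
  grid.newpage()
  grid.raster(readPNG(file.path(cache_dir, sprintf("%03d.png", angle %% 360))))
}
show_cached(45)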

ggplot2 - printing plot balloons memory

Is it expected that printing a large-ish ggplot to PDF will cause the R session's memory to balloon? I have a ggplot2 object that is around 72 MB, and my R session grows to over 2 GB when printing it to PDF. Is this expected? Are there ways to optimize performance? I find that the resulting PDFs are huge (~25 MB) and I have to use an external program to shrink them (to 50 KB with no visual loss!). Is there a way to print to PDF with lower-quality graphics, or perhaps some parameter to print or ggplot that I haven't considered?
For large data sets, I find it helpful to pre-process the data before putting together the ggplot (even if ggplot offers the same calculations).
ggplot has to be very general: it cannot predict what stat or geom you will want to add later on, so it is very difficult to optimize things there (the split-apply-combine strategy can lead to exploding intermediate memory requirements). OTOH, you know what you want and can pre-calculate accordingly.
The large PDF indicates that you either have a lot of overplotting or are producing objects too small to be seen. In both cases you could gain a lot by applying appropriate summary statistics (e.g. hexbin or a boxplot instead of a scatterplot).
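For example, a sketch of that kind of summarising with ggplot2 (geom_hex() requires the hexbin package to be installed; the data frame here is synthetic):
library(ggplot2)
big <- data.frame(x = rnorm(1e6), y = rnorm(1e6))   # stand-in for the real data
# Overplotted: every point is written to the device, so the PDF gets huge.
# ggplot(big, aes(x, y)) + geom_point(alpha = 0.1)
# Summarised: only the hexagon counts are drawn, which keeps the file small.
ggplot(big, aes(x, y)) + geom_hex(bins = 60)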
I think we cannot tell you more without details of what you are doing. So please create a minimal example and/or upload the compressed plot you are producing.
Addressing the second part of your question, R makes no attempt to optimize PDFs. If you are overplotting a lot of points, this results in some ridiculous behavior. You can use qpdf to post-process the PDF.
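Base R also ships a convenience wrapper around qpdf/ghostscript that can do this post-processing; a minimal example, assuming qpdf (or ghostscript) is installed and on the PATH:
tools::compactPDF("plot.pdf")                        # qpdf-based compaction
tools::compactPDF("plot.pdf", gs_quality = "ebook")  # more aggressive, via ghostscript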
Addressing the first question anecdotally, it does seem that plots on medium-sized datasets take up a lot of memory, but that is merely my experience. Others may have more opinions as to why or more facts as to whether this is so.
Saving in a bitmap format like PNG can reduce the file size considerably. Note that this is only appropriate for certain uses of the final image; in particular, it can't be zoomed in as far as a PDF can. But if the final image size is known, it can be a useful method.
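A minimal illustration, assuming p is the large ggplot object from the question:
library(ggplot2)
ggsave("plot.png", p, width = 8, height = 6, dpi = 150)   # raster output, far smaller than the PDF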

How do programs like mathematica draw graphs and how can I make such a program?

I've been wondering how programs like Mathematica, MATLAB, and so on plot graphs of functions so gracefully and quickly. Can anyone explain how they do this and, furthermore, how I can do it? Is it related to an aspect or course in computer programming or math? Which, then?
Well, with some encouragement from belisarius, here's my comment as an answer: try looking at matplotlib. From the home page:
matplotlib is a python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. matplotlib can be used in python scripts, the python and ipython shell (ala MATLAB®* or Mathematica®†), web application servers, and six graphical user interface toolkits.
It was originally inspired by MATLAB's plotting capabilities, though it's grown a lot since then. It's solid software - and it's open source, under a BSD license, so not only can you read the source, you can hack on it and use it in whatever you like.
Another place you could look is gnuplot. Its license is not one of the common open-source licenses, but it is certainly open source, with some permissions to modify and such.
Gnuplot is a portable command-line driven graphing utility for linux, OS/2, MS Windows, OSX, VMS, and many other platforms. The source code is copyrighted but freely distributed (i.e., you don't have to pay for it). It was originally created to allow scientists and students to visualize mathematical functions and data interactively, but has grown to support many non-interactive uses such as web scripting. It is also used as a plotting engine by third-party applications like Octave. Gnuplot has been supported and under active development since 1986.
It does 3D plotting as well, which matplotlib doesn't do, and it's been around a lot longer. The reason I thought of matplotlib first is that it's intended as a library for a higher-level language, not a stand-alone application, so I'm guessing it might be a bit easier for you to read.
One other suggestion, just to get an idea of the sorts of things Mathematica is doing under the hood, is to look at the documentation for Plot. In particular, if you look at the available options, you can deduce things.
MaxRecursion (default: Automatic): the maximum number of recursive subdivisions allowed
Method (default: Automatic): the method to use for refining curves
PerformanceGoal (default: $PerformanceGoal): aspects of performance to try to optimize
PlotPoints (default: Automatic): initial number of sample points
From the MaxRecursion and PlotPoints, you can see that it's doing an initial sampling then somehow deciding which regions need to be subdivided (resampled) to get an accurate view of the plot. And from there on, it's magic: there is some Method for this, and a PerformanceGoal to guide it...
For MATLAB, because of its cross-platform requirement, there is no real alternative to OpenGL. The MATLAB runtime is written in C++ and the non-axis GUI uses Java Swing, so a MATLAB plot is probably a C++/OpenGL/Swing mixture.
In reality, MATLAB graphics are much less complex than video game graphics. I think it is easier to find tutorials on video game graphics and then "downsize" them to MATLAB functionality, like drawing a single line in a single colour.
The most important concept is probably Transformation Matrix.
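As a tiny base-R illustration of that idea (not MATLAB code), here is a 2x2 transformation matrix rotating a square:
theta <- pi / 6                                   # rotate by 30 degrees
R <- matrix(c(cos(theta), sin(theta),
              -sin(theta), cos(theta)), nrow = 2)
square <- rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0))
rotated <- square %*% t(R)                        # apply the transformation to each point
plot(square, type = "l", asp = 1, xlim = c(-1, 2), ylim = c(-1, 2))
lines(rotated, col = "red")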
Basically, most programs that plot any type of graph (particularly graphs of reasonable complexity) will use some type of third-party library.
The specific library used would depend on the programming language that is being used.
For example:
For a .Net application you might use Crystal reports. http://en.wikipedia.org/wiki/Crystal_Reports
For Java you might use JFreeChart. http://www.jfree.org/jfreechart/
And so on...
You will likely find numerous libraries for whatever language you decide to code in.
If you want this functionality in your own project, I suggest using a library, especially if you are a beginner. The internal complexities of how these graph libraries are implemented are significant because of issues such as cross-platform compatibility, graphics rendering optimizations (i.e. making sure the graphics render quickly and 'prettily'), the maths associated with positioning elements on the graph, and so forth.
Lastly, I doubt you will find specific courses on this subject (or need them) since, again, excluding VERY specific cases, programmers will always use libraries that already exist.
Why code it yourself when someone has already solved the problem for you?
A good place to start is to understand that there is a grammar of graphics, and that what you want to construct upon receiving a plot command is a symbolic representation of the graph. For Mathematica, you can do something like
FullForm[Plot[Sin[x], {x, 0, 2 Pi}]]
to see the internal representation Mathematica uses. Basically, you need to describe the line segments (2D) or meshes (3D) you want to draw in terms of their colour and coordinates. There also needs to be information about the scale of the graph and how to draw tick marks, label axes, etc.
This leads us to the heart of the question: how do you determine the line segments to draw from a function and a range? If you dig around in the help file for Plot, you see a few things. First, there is a PlotPoints option and a MaxRecursion option. This leads me to believe (and this is just an educated guess, but it is how I would do it) that Mathematica plots the initial number of points at even intervals over the range to get a starting value. The next step is to identify regions where the change exceeds some threshold and then to sample more points until the "change" between any two points in a line segment is below a threshold. Mathematica does this recursively, hence the MaxRecursion option.
So far I have been pretty vague about defining rate of change. A more useful way to describe change is to take three points on your line segment. Assume a linear relationship between the 1st and 3rd points and, under that assumption, predict what the 2nd point should be. If the error of this prediction is sufficiently low, consider the next group of three points. If the error is above a threshold, sample more points in this region until the threshold is met. In this way you will need relatively few points where the curve is relatively straight and more at the "interesting" parts where it bends in new directions. The smoothness of the curve you draw will be proportional to the error you are willing to tolerate in the linear prediction of points.
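An illustrative R sketch of that refinement scheme (the tolerance, depth limit and initial point count are arbitrary choices, not Mathematica's actual method):
adaptive_sample <- function(f, lo, hi, tol = 1e-3, max_depth = 10) {
  refine <- function(x1, x3, depth) {
    x2 <- (x1 + x3) / 2
    predicted <- (f(x1) + f(x3)) / 2              # linear prediction of the midpoint
    if (depth >= max_depth || abs(f(x2) - predicted) < tol)
      return(c(x1, x2))
    c(refine(x1, x2, depth + 1), refine(x2, x3, depth + 1))
  }
  xs <- seq(lo, hi, length.out = 20)              # initial evenly spaced "PlotPoints"
  pts <- unlist(mapply(refine, xs[-length(xs)], xs[-1], MoreArgs = list(depth = 0)))
  sort(unique(c(pts, hi)))
}
x <- adaptive_sample(sin, 0, 2 * pi)
plot(x, sin(x), type = "l")                       # more samples where the curve bends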
