Implementations of "The grammar of graphics" in statistical packages - r

I am aware that Leland Wikinson's ideas, as exposed in his book "The Grammar of
Graphics" underlie ggplot2 implementation in R.
But are there other implementations of the same ideas in other statistical packages (SAS or other)?

In SPSS, the Chart Builder was built upon the same foundation. I could be wrong, but I think SPSS implemented it as "GPL." IMHO, Hadley's ggplot2 is much easier to learn and there are mounds of examples online; I haven't seen many examples of graphics built with GPL.
Not to mention, R can be obtained for the price of free.

It is not exactly the same, but SAS has (since 9.2) the ODS Graphics system in place. This graphics system is also based on the same foundation, and if you look at the examples here, you immediately notice the similarities in layout and buildup of these graphs and the ones in ggplot2.
THe idea is here to just get the graphs from the analysis, so you specify the content of the graph at the same time you specify your analysis. Then there is the template language for the ODS graphics to allow you to create your custom graphs. This is something I still miss a bit in ggplot2. But ggplot2 is quite a lot easier.

It seems IBM does some visualization tools with grammar of graphics inside.
They say their backend -- Rapidly Adaptive Visualization Engine (RAVE) -- is based on it.
And recently I found this overview-article about VizJSON -- a language to describe charts, which is apparently some variation of JSON. (I don't really know about SPSS, Many Eyes and the connection between them and other IBM's software. Probably SPSS is the back-end for Many Eyes... Probably VizJSON is the next step to their GPL... Whatever -- it is closed proprietary stuff anyway)
Also there is D3.js. It is open, BSD license. It is a javascript library. Here "javascript" does not mean "web only": you can make SVG files with your plots (and probably they will or already do support more). But it means that you need to know bunch of Web technologies: HTML, Javascript, DOM, CSS etc (+ maybe javascript's package manager..). And also people say it is quite a low-level library.
There is a more high-level tool, based on D3.js -- Vega.
I am not very savvy in these tools and cannot be totally sure about this information ;)

Python now has its own ggplot port.
Also, Tableau is a visualization system based firmly on the Grammar of Graphics (Wilkinson himself works there now). But I'm not sure if this counts, since it's not part of a pre-existing statistical package.

Related

Rcpp or rdyncall

I'm looking for some kind of references which explain the pro's en con's of using Rcpp when compared to using rdyncall.
Can someone who has used both explain the basic differences from an R package developers perspective who is interested in providing R wrappers to C++ code.
Thanks.
I think we mention rdyncall in the (brief) comparison to other approaches the intro vignette / JSS paper. It is a neat package, but aims for a much lower-level connection. As I understand it, it gives you C-level APIs with least amount of fuzz, as motivated by say, the rgl package. there is very good and detailed paper about rdyncall in a recent R Journal issue.
And unless I miss something, it does nothing for you on the C++ side. Whereas Rcpp makes use as .Call() to pass complete R objects back and forth, and manages to map a wide variety of R and C++ types automatically for you---with the possibility add your own mappers.

Are there any guidelines for when reproducible code should be included into a publication?

Given the stress toward reproducible science, I was wondering if my recent work warrants the inclusion of example code in the publication. The datasets that I am using are quite big, so it wouldn't make sense to publish those necessarity - However, the statistical methods that I apply within R are not generally known to my audience (although I would think that they should be).
I'm using empirical orthogonal function analysis (EOF) and generalized additive models (GAM) within my analysis. GAM, in particular, is widely used in ecological studies, but less so within the physical sciences - my work spans both disciplines.
I definitely refer to the R packages that I use, and it wouldn't really be difficult for a reviewer / reader to look for those references (and included examples) themselves. So, my question is, what situations are most appropriate for the inclusion of reproducible code in a publication?
Code is the most accurate representation of what you actually did. Therefore, in my view you should always aim to publish code alongside your article.
However, editor resistance to this is pretty strong. The fear is that if the reviewer had access to the code, then the journal looks pretty bad if a substantive coding mistake is later found. This is not a hypothetical fear, given the Levitt paper, etc.
Knuth has some strong views on literate programming that you should be able to cite as justification. If you can't convince the journal to accept your code as an integral piece of the publication, consider publishing it on your personal website (the approach taken by e.g. Raj Chetty for many of his papers) or publish it as an R package.
Finally, here's a note I wrote to my programming students:
Consider publishing your code. Doing so will act as a commitment
device which will encourage good habits--habits that make your own
work easier. Publishing your code also makes it easier for others to
extend your analysis, which can result in more citations of your work.
Releasing your code is good academic practice as well: it is the
truest testament to your analysis. And offering your program to the
world shows off the beautiful coding skills which you are about to
acquire.
A basic tenet of science is reproducibility. So the answer would be to "include" code required to conduct your analysis to every paper/publication that is based on data analysis.
I say "include" because you don't need to put the R code directly into the paper. Many if not most journals allow supplementary material which is an option. Alternative, supply your script to one of the many Science data archiving sites (Such as Figshare) and then (and here is the killer!) cite your own script using the DOI that Figshare gives to your deposited script. If you can post the data too, then all the better; Figshare doesn't really care too much about big data sets.
The above applies to code where you are using other packages and your R script does things like loads and formats data, calls functions from other packages and then plots or displays output/results. If you have developed new R code to implement a particular method then I would say package the code as an R package and submit that to CRAN or r-forge or something like that.
From your description, the former (deposit the analysis script in a repo) would be most appropriate.
We recently had a discussion at our research institute regarding reproducible research. The incentive came from the Nature editorial (http://arstechnica.com/science/2012/02/science-code-should-be-open-source-according-to-editorial/) which argued that all your code should be published. I whole heartedly agree with this. Even though your dataset is very big, publishing the R code that you used to create your results makes it crystal clear what you did. Often times the methods of a paper do not contain sufficient detail to reproduce the result, the code is quite a help in this case.

How to make interactive charts together with R language

I'm helping my friend make a website. He previously used R language to generate statistical charts. Now he want to generate some dynamic chart so that when users move mouse over certain part of the chart there will be some description/complementary information pops up for them to read. What kind of technology/tools/packages I can use for this purpose?
PS: I've explored some possible ways, yet none of them fits my needs. I've tried rggobi + ggobi. They can't coz they are not for web applications. iPlot can't do it coz it generates histogram only. I've thought about asking R produces some intermediate date which I can pass to some JavaScript packages like HighCharts. Yet, apparently R is much powerful than JS. R can generates some advanced type of charts which JS just can't do.
You should use R to generate the data and then export it in a format that a javascript framework for graphs can understand.
This way you could benefit from the advanced statistical analysis provided by R and the presentation layer of javascript.
Lots of solutions exist for this problem, but i've heard lots of good things about Raphael and its chart plugin, which you may want to investigate
The playwith package offers facilities to manipulate rgl graphics. A couple of links:
http://code.google.com/p/playwith/w/list
http://www.r-bloggers.com/playing-with-the-%E2%80%98playwith%E2%80%99-package/
Look at the sendplot package or the RSVGTipsDevice package.

How do programs like mathematica draw graphs and how can I make such a program?

I've been wondering how programs like mathematica and mathlab, and so on, plot graphs of functions so gracefully and fast. Can anyone explain to me how they do this and, furthermore, how I can do this? Is it related to an aspect or course in Computer Programming or Math? Which then?
Well, with some encouragement from belisarius, here's a my comment as an answer: try looking at matplotlib. From the home page:
matplotlib is a python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. matplotlib can be used in python scripts, the python and ipython shell (ala MATLAB®* or Mathematica®†), web application servers, and six graphical user interface toolkits.
It was originally inspired by MATLAB's plotting capabilities, though it's grown a lot since then. It's solid software - and it's open source, under a BSD license, so not only can you read the source, you can hack on it and use it in whatever you like.
Another place you could look is gnuplot. It's not one of the common open source licenses, but it's certainly open source, with some permissions to modify and such.
Gnuplot is a portable command-line driven graphing utility for linux, OS/2, MS Windows, OSX, VMS, and many other platforms. The source code is copyrighted but freely distributed (i.e., you don't have to pay for it). It was originally created to allow scientists and students to visualize mathematical functions and data interactively, but has grown to support many non-interactive uses such as web scripting. It is also used as a plotting engine by third-party applications like Octave. Gnuplot has been supported and under active development since 1986.
It does 3D plotting as well, which matplotlib doesn't do, and it's been around a lot longer. The reason I thought of matplotlib first is that it's intended as a library for a higher-level language, not a stand-alone application, so I'm guessing it might a bit easier for you to read.
One other suggestion, just to get an idea of the sorts of things Mathematica is doing under the hood, is to look at the documentation for Plot. In particular, if you look at the available options, you can deduce things.
MaxRecursion Automatic the maximum number of recursive subdivisions allowed
Method Automatic the method to use for refining curves
PerformanceGoal $PerformanceGoal aspects of performance to try to optimize
PlotPoints Automatic initial number of sample points
From the MaxRecursion and PlotPoints, you can see that it's doing an initial sampling then somehow deciding which regions need to be subdivided (resampled) to get an accurate view of the plot. And from there on, it's magic: there is some Method for this, and a PerformanceGoal to guide it...
For MATLAB, because of its cross-platform requirement there is no alternatives as using OpenGL. MATLAB runtime is written in C++ and non-axis GUI uses Java Swing. Therefore MATLAB Plot is probably a C++/OpenGL/Swing mixture.
In reality MATLAB graphics is much less complex then a video game graphics. I think it is easier to find tutorials on video game graphics and then "downsize" it to MATLAB functionality, like drawing a single line with the same color.
The most important concept is probably Transformation Matrix.
Basically most programs that plot any type of graph (particularly any graphs of reasonable complexity) will use some type of third party libraries.
The specific library used would depend on the programming language that is being used.
For example:
For a .Net application you might use Crystal reports. http://en.wikipedia.org/wiki/Crystal_Reports
For Java you might use JFreeChart. http://www.jfree.org/jfreechart/
And so on...
You will likely find numerious libraries for whatever language you decide to code in.
If you want to accomplish this functionality in your specific project I suggest using a library especially if you are a beginner. The internal complexities of how these graph libraries are implemented would be significant because of many issues such as cross platform compatibility, graphic rendering optimizations (ie: making sure the graphics render quickly and ‘prettily’), the maths associated with the positioning of elements on the graph and so forth.
Lastly I doubt you will find specific courses in this subject (or require them) as again excluding VERY specific cases programmers will always use libraries that already exist.
Why code it yourself when someone has
already solved the problem for you?
A good place to start is to understand that there is a grammar to graphics and what you want to construct upon receiving a plot command is a symbolic representation of the graph. For Mathematica, you can do something like
FullForm[Plot[Sin[x], {x, 0, 2 Pi}]]
to see the internal representation Mathematica uses. Basically you need to describe the line segments (2D) or meshes (3D) you want to draw in terms of their color and coordinates. Also, there needs to information about the scale of the graph and how to draw tick marks, label axes, etc.
This leads us to the heart of the question, how do you determine the line segment you want to draw from a function and a range? If you dig around in the help file for plot, you see a few things. First there is a plot points option and a MaxRecursion option. This leads me to believe (and this is just an educated guess, but it is how I would do it) that Mathematica plots the initial number of points on an even interval over the range to get a starting value. The next part is to identify regions where change exceeds some threshold and then to sample more points until the "change" between any two points in your line segment is below a threshold. Mathematica does this recursively, hence the MaxRecursion option.
So far I have been pretty vague about defining rate of change. A more useful way to describe change is to take 3 pts on your line segment. Assume a linear relationship between the 1st and 3rd point and, assuming this linear relationship, make a prediction about what the 2nd point would be. If the error of this prediction is sufficiently low, then consider the next group of three points. If the error is above a threshold, then you should sample some more points in this region until the threshold is met. In this way you will require relatively few points where the curve is relatively straight and more at the "interesting" parts where it bends in new directions. The smoothness of the curve you draw will be proportional to the error you are willing to tolerate in the linear prediction of points.

R and SPSS difference

I will be analysing vast amount of network traffic related data shortly, and will pre-process the data in order to analyse it. I have found that R and SPSS are among the most popular tools for statistical analysis. I will also be generating quite a lot of graphs and charts. Therefore, I was wondering what is the basic difference between these two softwares.
I am not asking which one is better, but just wanted to know what are the difference in terms of workflow between the two (besides the fact that SPSS has a GUI). I will be mostly working with scripts in either case anyway so I wanted to know about the other differences.
Here is something that I posted to the R-help mailing list a while back, but I think that it gives a good high level overview of the general difference in R and SPSS:
When talking about user friendlyness
of computer software I like the
analogy of cars vs. busses:
Busses are very easy to use, you just
need to know which bus to get on,
where to get on, and where to get off
(and you need to pay your fare). Cars
on the other hand require much more
work, you need to have some type of
map or directions (even if the map is
in your head), you need to put gas in
every now and then, you need to know
the rules of the road (have some type
of drivers licence). The big advantage
of the car is that it can take you a
bunch of places that the bus does not
go and it is quicker for some trips
that would require transfering between
busses.
Using this analogy programs like SPSS
are busses, easy to use for the
standard things, but very frustrating
if you want to do something that is
not already preprogrammed.
R is a 4-wheel drive SUV (though
environmentally friendly) with a bike
on the back, a kayak on top, good
walking and running shoes in the
pasenger seat, and mountain climbing
and spelunking gear in the back.
R can take you anywhere you want to go
if you take time to leard how to use
the equipment, but that is going to
take longer than learning where the
bus stops are in SPSS.
There are GUIs for R that make it a bit easier to use, but also limit the functionality that can be used that easily. SPSS does have scripting which takes it beyond being a mere bus, but the general phylosophy of SPSS steers people towards the GUI rather than the scripts.
I work at a company that uses SPSS for the majority of our data analysis, and for a variety of reasons - I have started trying to use R for more and more of my own analysis. Some of the biggest differences I have run into include:
Output of tables - SPSS has basic tables, general tables, custom tables, etc that are all output to that nifty data viewer or whatever they call it. These can relatively easily be transported to Word Documents or Excel sheets for further analysis / presentation. The equivalent function in R involves learning LaTex or using a odfWeave or Lyx or something of that nature.
Labeling of data --> SPSS does a pretty good job with the variable labels and value labels. I haven't found a robust solution for R to accomplish this same task.
You mention that you are going to be scripting most of your work, and personally I find SPSS's scripting syntax absolutely horrendous, to the point that I've stopped working with SPSS whenever possible. R syntax seems much more logical and follows programming standards more closely AND there is a very active community to rely on should you run into trouble (SO for instance). I haven't found a good SPSS community to ask questions of when I run into problems.
Others have pointed out some of the big differences in terms of cost and functionality of the programs. If you have to collaborate with others, their comfort level with SPSS or R should play a factor as you don't want to be the only one in your group that can work on or edit a script that you wrote in the future.
If you are going to be learning R, this post on the stats exchange website has a bunch of great resources for learning R: https://stats.stackexchange.com/questions/138/resources-for-learning-r
The initial workflow for SPSS involves justifying writing a big fat cheque. R is freely available.
R has a single language for 'scripting', but don't think of it like that, R is really a programming language with great data manipulation, statistics, and graphics functionality built in. SPSS has 'Syntax', 'Scripts' and is also scriptable in Python.
Another biggie is that SPSS squeezes its data into a spreadsheety table structure. Dealing with other data structures is probably very hard, but comes naturally to R. I wouldn't know where to start handling network graph type data in SPSS, but there's a package to do it for R.
Also with R you can integrate your workflow with your reporting by using Sweave - you write a document with embedded bits of R code that generate plots or tables, run the file through the system and out comes the report as a PDF. Great for when you want to do a weekly report, or you do a body of work and then the boss gives you an updated data set. Re-run, read it over, its done.
But you know, your call...
Well, are you a decent programmer? If you are, then it's worthwhile to learn R. You can do more with your data, both in terms of manipulation and statistical modeling, than you can with SPSS, and your graphs will likely be better too. On the other hand, if you've never really programmed before, or find the idea of spending several months becoming a programmer intimidating, you'll probably get more value out of SPSS. The level of stuff that you can do with R without diving into its power as a full-fledged programming language probably doesn't justify the effort.
There's another option -- collaborate. Do you know someone you can work with on your project (you don't say whether it's academic or industry, but either way...), who knows R well?
There's an interesting (and reasonably fair) comparison between a number of stats tools here
http://anyall.org/blog/2009/02/comparison-of-data-analysis-packages-r-matlab-scipy-excel-sas-spss-stata/
I work with both in a company and can say the following:
If you have a large team of different people (not all data scientists), SPSS is useful because it is plain (relatively) to understand. For example, if users are going to run a model to get an output (sales estimates, etc), SPSS is clear and easy to use.
That said, I find R better in almost every other sense:
R is faster (although, sometimes debatable)
As stated previously, the syntax in SPSS is aweful (I can't stress this enough). On the other hand, R can be painful to learn, but there are tons of resources online and in the end it pays much more because of the different things you can do.
Again, like everyone else says, the sky is the limit with R. Tons of packages, resources and more importantly: indepedence to do as you please. In my organization we have some very high level functions that get a lot done. The hard part is creating them once, but then they perform complicated tasks that SPSS would tangle in a never ending web of canvas. This is specially true for things like loops.
It is often overlooked, but R also has plenty of features to cooperate between teams (github integration with RStudio, and easy package building with devtools).
Actually, if everyone in your organization knows R, all you need is to maintain a basic package on github to share everything. This of course is not the norm, which is why I think SPSS, although a worst product, still has a market.
I have not data for it, but from my experience I can tell you one thing:
SPSS is a lot slower than R. (And with a lot, I really mean a lot)
The magnitude of the difference is probably as big as the one between C++ and R.
For example, I never have to wait longer than a couple of seconds in R. Using SPSS and similar data, I had calculations that took longer than 10 minutes.
As an unrelated side note: In my eyes, in the recent discussion on the speed of R, this point was somehow overlooked (i.e., the comparison with SPSS). Furthermore, I am astonished how this discussion popped up for a while and silently disappeared again.
There are some great responses above, but I will try to provide my 2 cents. My department completely relies on SPSS for our work, but in recent months, I have been making a conscious effort to learn R; in part, for some of the reasons itemized above (speed, vast data structures, available packages, etc.)
That said, here are a few things I have picked up along the way:
Unless you have some experience programming, I think creating summary tables in CTABLES destroys any available option in R. To date, I am unaware package that can replicate what can be created using Custom Tables.
SPSS does appear to be slower when scripting, and yes, SPSS syntax is terrible. That said, I have found that scipts in SPSS can always be improved but using the EXECUTE command sparingly.
SPSS and R can interface with each other, although it appears that it's one way (only when using R inside of SPSS, not the other way around). That said, I have found this to be of little use other than if I want to use ggplot2 or for some other advanced data management techniques. (I despise SPSS macros).
I have long felt that "reporting" work created in SPSS is far inferior to other solutions. As mentioned above, if you can leverage LaTex and Sweave, you will be very happy with your efficient workflows.
I have been able to do some advanced analysis by leveraging OMS in SPSS. Almost everything can be routed to a new dataset, but I have found that most SPSS users don't use this functionality. Also, when looking at examples in R, it just feels "easier" than using OMS.
In short, I find myself using SPSS when I can't figure it out quickly in R, but I sincerely have every intention of getting away from SPSS and using R entirely at some point in the near future.
SPSS provides a GUI to easily integrate existing R programs or develop new ones. For more info, see the SPSS Community on IBM Developer Works.
#Henrik, I did the same task you have mentioned (C++ and R) on SPSS. And it turned out that SPSS is faster compared to R on this one. In my case SPSS is aprox. 7 times faster. I am surprised about it.
Here is a code I used in SPSS.
data list free
/x (f8.3).
begin data
1
end data.
comp n = 1e6.
comp t1 = $time.
loop #rep = 1 to 10.
comp x = 1.
loop #i=1 to n.
comp x = 1/(1+x).
end loop.
end loop.
comp t2 = $time.
comp elipsed = t2 - t1.
form elipsed (f8.2).
exe.
Check out this video why is good to combine SPSS and R...
Link
http://bluemixanalytics.wordpress.com/2014/08/29/7-good-reasons-to-combine-ibm-spss-analytics-and-r/
If you have a compatible copy of R installed, you can connect to it from IBM SPSS Modeler and carry out model building and model scoring using custom R algorithms that can be deployed in IBM SPSS Modeler. You must also have a copy of IBM SPSS Modeler - Essentials for R installed. IBM SPSS Modeler - Essentials for R provides you with tools you need to start developing custom R applications for use with IBM SPSS Modeler.
The truth is: both packages are useful if you do data analysis professionally. Sure, R / RStudio has more statistical methods implemented than SPSS. But SPSS is much easier to use and gives more information per each button click. And, therefore, it is faster to exploit whenever a particular analysis is implemented in both R and SPSS.
In the modern age, neither CPU nor memory is the most valuable resource. Researcher's time is the most valuable resource. Also, tables in SPSS are more visually pleasing, in my opinion.
In summary, R and SPSS complement each other well.

Resources