Unit testing outside of a package in R

We often use R in settings where we want reproducibility in the face of modifications and where the code base is complex, but where writing a package is not the goal. testthat and the other testing packages seem geared towards unit testing package code, which makes sense, since that is the most common case where you need a lot of testing and are not in full control of all the data. Is there a good package or method for unit testing in R outside the context of writing a package?
For example, a lot of times in a package context you're testing something of the form:
foo <- function(bar) {
  bar.outcome <- bar  # placeholder: do something to bar
  return(bar.outcome)
}
and so then you're testing for expected output from the function, that things are the right type, that there is proper error handling. The way you do this is that you create a directory in your package for tests and write them there and then devtools can use load_all and testthat to run them and produce results.
One thing I would like to be able to do is run these same sort of tests outside of the context of a package, such as in a script. This is important because a lot of R code that is written in academia doesn't generalize a whole lot to different contexts or data without considerable difficulty, so that having a package doesn't make much sense, but at the same time unit testing would make it easier to extend the code in future packages. That's the easy case.
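A minimal sketch of this easy case in a plain script, assuming nothing beyond base R. The function `foo` and its behaviour (doubling a numeric vector) are hypothetical stand-ins; the checks use `stopifnot()` and `tryCatch()`. testthat's `test_that()` and `expect_*()` functions can also be called from an ordinary script, and a test file can be run with `testthat::test_file()`, so the same checks could be expressed that way.

```r
# Hypothetical function under test: doubles a numeric vector.
foo <- function(bar) {
  if (!is.numeric(bar)) stop("bar must be numeric")
  bar * 2
}

# Expected output and type.
stopifnot(identical(foo(c(1, 2, 3)), c(2, 4, 6)))
stopifnot(is.numeric(foo(1L)))

# Error handling: capture the error instead of letting it abort the script.
result <- tryCatch(foo("a"), error = function(e) conditionMessage(e))
stopifnot(identical(result, "bar must be numeric"))
```

Because `stopifnot()` halts the script at the first failed check, placing such lines right after the function definition makes a broken change visible immediately.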
The harder case is actually something you rarely do in packages, which is test things about the shape, kind, and state of the data. So, for example, I often read R code written in academia with comments like
data <- data %>% doSomething() #1023 rows
parameter_df <- ... # read file
print(parameter_df) # 5 columns
data <- data %>% doSomething(param = parameter_df)
lapply(df, class) #should be char, char, char, numeric, Date
I like the idea that "every time you want to write a print statement, write a test instead", but I actually don't have a good framework for how this should be done in R. Especially in this harder case where you're not testing a function, you're testing to make sure that the data flowing through your program is correct.
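One way to turn comments like those into checks, sketched with base R only. The data frame `df` and its columns are hypothetical stand-ins for whatever the pipeline produces; the point is that each "# N rows" or "# should be char, ..." comment becomes a `stopifnot()` call that fails loudly when the data change.

```r
# Hypothetical stand-in for a data frame produced by an earlier pipeline step.
df <- data.frame(
  id    = c("a", "b", "c"),
  group = c("x", "y", "x"),
  value = c(1.5, 2.0, 3.25),
  date  = as.Date(c("2020-01-01", "2020-01-02", "2020-01-03")),
  stringsAsFactors = FALSE
)

# Replaces a comment such as "# 3 rows" or "# 4 columns".
stopifnot(nrow(df) == 3, ncol(df) == 4)

# Replaces "lapply(df, class) # should be char, char, numeric, Date".
expected_classes <- c(id = "character", group = "character",
                      value = "numeric", date = "Date")
actual_classes <- vapply(df, function(col) class(col)[1], character(1))
stopifnot(identical(actual_classes, expected_classes))

# State of the data: no missing values, no duplicated ids.
stopifnot(!anyNA(df), !anyDuplicated(df$id))
```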
The context here is that R is used in a lot of contexts where the point of a script is replication in the scientific sense, but where there is possibly great gain from people being able to easily extend other's scripts which are released as part of replication materials for new projects, which is much harder to do, especially in complex code, whenever there are no tests and code can be very fragile and fail in nontrivial and silent ways.

The assertr package (https://github.com/ropensci/assertr) provides a framework better suited to testing data analysis workflows.

Related

Converting cosine distance function in R to Rcpp

I've been developing an R package for single cell RNA-seq analysis, and one of the functions I used repeatedly calculates the cosine dissimilarity matrix for a given matrix of m cells by n genes. The function I wrote is as follows:
CosineDist <- function(input = NULL) {
if (is.null(input)) { stop("You forgot to provide an input matrix") }
dist_mat <- as.dist(1 - input %*% t(input) / (sqrt(rowSums(input^2) %*% t(rowSums(input^2)))))
return(dist_mat)
}
This code works fine for smaller datasets, but when I run it on anything over 20,000 rows it takes forever and then crashes my R session due to memory issues. I believe that porting this to Rcpp would make it both faster and more memory efficient (I know this is a bit of a naive belief, but my knowledge of C++ in general is limited). Finally, the output of the function, though it does not have to be a distance matrix object when returned, does need to be able to be converted to that format after its generation.
How should I go about converting this function to Rcpp and then calling it as I would any of the other functions in my package? Alternatively, is this the best way to go about solving the speed / memory problem?

It's hard to help here since, as the comments pointed out, you are basically asking for an Rcpp intro.
I'll try to give you some hints, some of which I already mentioned in the comments.
In general, using C/C++ can provide a great speedup (depending on the task, of course). For loop-intensive, unoptimized code I have seen speedups of 100x or more.
Since adding C++ can be complicated and sometimes causes problems, check the following before you go this way:
1. Is your R code optimized?
You can make a lot of bad choices here (e.g. loops are slow in R). Just by optimizing your R code, speedups of 10x or much more can often be reached easily.
2. Are there better implementations in other packages?
Especially for helper functions or common functionality, other packages often have these already implemented. Benchmark different existing solutions with the 'microbenchmark' package. It is easier to use an optimized function from another R package than to do everything on your own (the other package's implementation may already be in C++). I mostly look for mainstream and popular packages, since these are better tested and are unlikely to suddenly drop from CRAN.
3. Profile your code
Take a look at exactly which parts cause the speed / memory problems. It might be that you can keep parts in R and only write a C++ function for the critical parts. Or you may find another package with an R function that is implemented in C for exactly this critical part.
In the end I'd say I prefer using Rcpp/C++ over C code; I think this is the easier way to go. For the Rcpp learning part you have to go with a dedicated tutorial (and not an SO question).
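As an illustration of point 1 applied to the function in the question, here is a sketch that stays in R. The name `cosine_dist_normalised` is hypothetical, and the approach (normalising rows first, then one `tcrossprod()`) is a swapped-in reformulation rather than anything proposed in the thread. It avoids building the separate m-by-m matrix of norm products, but the result is still m-by-m, so on its own it does not remove the memory limit for very large inputs.

```r
# Cosine dissimilarity via row normalisation.
# Assumes `input` is a numeric matrix (rows = cells) with no all-zero rows.
cosine_dist_normalised <- function(input) {
  stopifnot(is.matrix(input), is.numeric(input))
  # Dividing a matrix by a vector of length nrow() scales each row.
  normed <- input / sqrt(rowSums(input^2))
  # tcrossprod(x) computes x %*% t(x) without materialising t(x).
  as.dist(1 - tcrossprod(normed))
}

# Small random example; time it with system.time() before reaching for C++.
set.seed(1)
m <- matrix(runif(200 * 50), nrow = 200)
timing <- system.time(d <- cosine_dist_normalised(m))
```

Comparing `timing` against the original function on the same input (or using the 'microbenchmark' package mentioned above) shows whether the R-level change is enough.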

OLS in Python with Dummy Variables - Best Solution?

I have a problem I am trying to solve in Python, and I have found multiple solutions (I think) but I am trying to figure out which one is the best. I am hoping to choose libraries that will be supported fully in the future so I do not have to re-write this service.
I want to do an ordinary multivariate least squares regression with both categorical and continuous independent (explanatory) variables. The code has to be written in Python, as it is being integrated into a web service. I have been following Pandas quite a bit but never used it, so this seems to be one approach:
SOLUTION 1. https://github.com/pydata/pandas/blob/master/examples/regressions.py
Obviously, numpy/scipy are ideal, but I can't find an example that uses dummy variables (does anyone have one?). I did find this, though:
SOLUTION 2. http://www.scipy.org/Cookbook/OLS
which I could modify to support dummy variables, but I do not want to do that if someone else has done it already + I want the numbers to be very similar to R, as I have done most of my analysis offline and I can use these results for unit tests.
And in the example (2) above, I see that I could technically use rpy/rpy2, although that is not optimal because my web service requires yet another piece of technology (R). The good thing about using the interface is the numbers would be identical to my results from R.
SOLUTION 3. http://www.scipy.org/Cookbook/OLS (but using Rpy/Rpy2)
Anyway, I am interested in what everyone's approach would be out of these three solutions, whether there are any I am missing, and whether pandas is mature enough to start using in a production web service. The key thing here is that I do not want to have to support/patch bug fixes or write anything from scratch if possible. I'm too busy and probably not smart enough :)
Thanks.

You can use statsmodels, which provides many different models and result statistics
If you want to use an R-like formula interface, here are some examples; you can also look at the corresponding documentation:
http://statsmodels.sourceforge.net/devel/examples/notebooks/generated/contrasts.html
http://statsmodels.sourceforge.net/devel/examples/notebooks/generated/example_formulas.html
If you want a pure numpy version, then here is an old example that does everything from scratch
http://statsmodels.sourceforge.net/devel/examples/generated/example_ols.html#ols-with-dummy-variables
The models are integrated with pandas, and can use pandas DataFrame as the data structure for the dependent and independent variables (endog and exog in statsmodels naming convention).
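Since the question uses R results as the reference for unit tests, here is a sketch of the R side of that comparison. The data are hypothetical. It shows what "dummy variables" means in the design matrix: `lm()` expands a factor into treatment-coded 0/1 columns, and solving the least squares problem on that explicit matrix gives the same coefficients. Any Python implementation can be checked against numbers produced this way.

```r
# Hypothetical data: one continuous and one categorical regressor.
dat <- data.frame(
  y = c(1.0, 2.1, 2.9, 4.2, 5.1, 5.8),
  x = c(1, 2, 3, 4, 5, 6),
  g = factor(c("a", "b", "c", "a", "b", "c"))
)

# lm() expands the factor into dummy (treatment-coded) columns automatically.
fit <- lm(y ~ x + g, data = dat)

# The same design matrix built explicitly, then solved by least squares.
X <- model.matrix(~ x + g, data = dat)
beta <- qr.solve(X, dat$y)
```

The first level of `g` is absorbed into the intercept, which is why the design matrix has columns for only two of the three levels.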

writing functions vs. line-by-line interpretation in an R workflow

Much has been written here about developing a workflow in R for statistical projects. The most popular workflow seems to be Josh Reich's LCFD model. With a main.R containing code:
source('load.R')
source('clean.R')
source('func.R')
source('do.R')
so that a single source('main.R') runs the entire project.
Q: Is there a reason to prefer this workflow to one in which the line-by-line interpretive work done in load.R, clean.R, and do.R is replaced by functions which are called by main.R?
I can't find the link now, but I had read somewhere on SO that when programming in R one must get over the desire to write everything in terms of function calls, and that R was MEANT to be written in this line-by-line interpretive form.
Q: Really? Why?
I've been frustrated with the LCFD approach and am going to probably write everything in terms of function calls. But before doing this, I'd like to hear from the good folks of SO as to whether this is a good idea or not.
EDIT: The project I'm working on right now is to (1) read in a set of financial data, (2) clean it (quite involved), (3) Estimate some quantity associated with the data using my estimator (4) Estimate that same quantity using traditional estimators (5) Report results. My programs should be written in such a way that it's a cinch to do the work (1) for different empirical data sets, (2) for simulation data, or (3) using different estimators. ALSO, it should follow literate programming and reproducible research guidelines so that it's simple for a newcomer to the code to run the program, understand what's going on, and how to tweak it.
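A sketch of what the function-based version of such a project could look like, using base R only. Every name here (`load_data`, `clean_data`, `run_analysis`, the simulated `ret` column) is hypothetical; the idea is that each step takes its input as an argument, so the data source and the estimator can be swapped without touching the other steps.

```r
# All names below are hypothetical; each step is a function of its inputs.
load_data <- function(n = 100) {
  # Stand-in for reading a file: simulated returns plus one missing value.
  set.seed(42)
  data.frame(ret = c(rnorm(n, mean = 0.01, sd = 0.05), NA))
}

clean_data <- function(raw) {
  raw[!is.na(raw$ret), , drop = FALSE]
}

# The estimator is passed in as a function, so it can be swapped freely.
run_analysis <- function(data, estimator) {
  estimator(data$ret)
}

estimators <- list(mean = mean, median = median)
cleaned <- clean_data(load_data())
results <- vapply(estimators, function(est) run_analysis(cleaned, est),
                  numeric(1))
```

Replacing `load_data()` with a simulation function or a different file reader leaves `clean_data()` and `run_analysis()` unchanged, which matches requirements (1) to (3) in the EDIT.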

I think that any temporary objects created in source'd files won't get cleaned up. If I do:
x <- matrix(runif(big^2), big, big)
z <- sum(x)
and source that as a file, x hangs around although I don't need it. But if I do:
ff <- function(big) {
  x <- matrix(runif(big^2), big, big)
  z <- sum(x)
  return(z)
}
and instead of source, do z <- ff(big) in my script, the x matrix goes out of scope and so gets cleaned up.
Functions enable neat little reusable encapsulations and don't pollute the environment outside themselves. In general, they don't have side effects. Your line-by-line scripts could be using global variables and names tied to the data set in current use, which makes them hard to reuse.
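If a script has to stay a script, a middle ground is to evaluate it in its own environment, since `source()` accepts an environment for its `local` argument. The sketch below is an assumption-laden illustration: the script is written to a temporary file and the object names (`big_tmp`, `scratch`) are made up for the example.

```r
# Write a throwaway script to a temporary file (stand-in for a real .R file).
script <- tempfile(fileext = ".R")
writeLines(c(
  "big_tmp <- matrix(runif(100), 10, 10)",
  "z <- sum(big_tmp)"
), script)

# Evaluate the script in its own environment instead of the workspace.
scratch <- new.env()
source(script, local = scratch)

z <- get("z", envir = scratch)  # keep only the result
leaked <- exists("big_tmp", envir = globalenv(), inherits = FALSE)
rm(scratch)                     # the temporary matrix can now be collected
unlink(script)
```

This gives the clean-up benefit described above, though unlike a function it does not make the inputs of the script explicit.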
I sometimes work line-by-line, but as soon as I get more than about five lines I see that what I have really needs making into a proper reusable function, and more often than not I do end up re-using it.

I don't think there is a single answer. The best thing to do is keep the relative merits in mind and then pick an approach for that situation.
1) functions. The advantage of not using functions is that all your variables are left in the workspace and you can examine them at the end. That may help you figure out what is going on if you have problems.
On the other hand, the advantage of well designed functions is that you can unit test them. That is you can test them apart from the rest of the code making them easier to test. Also when you use a function, modulo certain lower level constructs, you know that the results of one function won't affect the others unless they are passed out and this may limit the damage that one function's erroneous processing can do to another's. You can use the debug facility in R to debug your functions and being able to single step through them is an advantage.
2) LCFD. Whether you should use a load/clean/func/do decomposition at all, regardless of whether it's done via source or functions, is a second question. The problem with this decomposition, however it's implemented, is that you need to run one step just to be able to test the next, so you can't really test them independently. From that viewpoint it's not the ideal structure.
On the other hand, it does have the advantage that you may be able to replace the load step independently of the other steps if you want to try it on different data and can replace the other steps independently of the load and clean steps if you want to try different processing.
3) Number of files. There may be a third question implicit in what you are asking: whether everything should be in one or multiple source files. The advantage of putting things in different source files is that you don't have to look at irrelevant items. In particular, if you have routines that are not being used or are not relevant to the function you are currently looking at, they won't interrupt the flow, since you can arrange for them to be in other files.
On the other hand, there may be an advantage in putting everything in one file from the viewpoint of (a) deployment, i.e. you can just send someone that single file, and (b) editing convenience as you can put the entire program in a single editor session which, for example, facilitates searching since you can search the entire program using the editor's functions as you don't have to determine which file a routine is in. Also successive undo commands will allow you to move backward across all units of your program and a single save will save the current state of all modules since there is only one. (c) speed, i.e. if you are working over a slow network it may be faster to keep a single file in your local machine and then just write it out occasionally rather than having to go back and forth to the slow remote.
Note: One other thing to think about is that using packages may be superior for your needs relative to sourcing files in the first place.

No one has mentioned an important consideration when writing functions: there's not much point in writing them unless you're repeating some action again and again. In some parts of an analysis, you'll be doing one-off operations, so there's not much point in writing a function for them. If you have to repeat something more than a few times, it's worth investing the time and effort to write a reusable function.

Workflow:
I use something very similar:
Base.r: pulls primary data, calls on other files (items 2 through 5)
Functions.r: loads functions
Plot Options.r: loads a number of general plot options I use frequently
Lists.r: loads lists, I have a lot of them because company names, statements and the like change over time
Recodes.r: most of the work is done in this file, essentially it's data cleaning and sorting
No analysis has been done up to this point. This is just for data cleaning and sorting.
At the end of Recodes.r I save the environment to be reloaded into my actual analysis.
save(list=ls(), file="Cleaned.Rdata")
With the cleaning done, functions and plot options ready, I start getting into my analysis. Again, I continue to break it up into smaller files that are focused on topics or themes, like: demographics, client requests, correlations, correspondence analysis, plots, etc. I almost always run the first 5 automatically to get my environment set up and then I run the others on a line-by-line basis to ensure accuracy and explore.
At the beginning of every file I load the cleaned data environment and prosper.
load("Cleaned.Rdata")
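A small sketch of the save/load round trip, with one variation that is my assumption rather than part of the workflow above: loading into a dedicated environment via the `envir` argument of `load()`, so restored names cannot silently overwrite objects already in the workspace. The objects and the temporary file path are hypothetical.

```r
# Hypothetical cleaned objects, named in the style described in this answer.
df.2020 <- data.frame(id = 1:3, value = c(10, 20, 30))
list.firms <- c("Acme", "Globex")

path <- tempfile(fileext = ".RData")
save(df.2020, list.firms, file = path)

# Load into a dedicated environment rather than the global workspace.
cleaned_env <- new.env()
loaded_names <- load(path, envir = cleaned_env)
unlink(path)
```

`load()` returns the names of the restored objects, which is a cheap way to check that the file contained what the analysis expects.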
Object Nomenclature:
I don't use lists, but I do use a nomenclature for my objects.
df.YYYY # Data for a certain year
demo.describe.YYYY ## Demographic data for a certain year
po.describe ## Plot option
list.describe.YYYY ## lists
f.describe ## Functions
Using a friendly mnemonic to replace "describe" in the above.
Commenting
I've been trying to get myself into the habit of using comment(x) which I've found incredibly useful. Comments in the code are helpful but oftentimes not enough.
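For readers who have not met it, a minimal sketch of base R's `comment()`; the object and the comment text are hypothetical.

```r
df.2020 <- data.frame(id = 1:3, value = c(10, 20, 30))
comment(df.2020) <- "Hypothetical example: cleaned survey data, 2020 wave"

# The comment travels with the object but is not shown by print().
note <- comment(df.2020)
```

Because the comment is stored as an attribute, it is saved and restored together with the object by `save()` and `load()`.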
Cleaning Up
Again, here, I always try to use the same object(s) for easy cleanup. tmp, tmp1, tmp2, tmp3 for example and ensuring to remove them at the end.
Functions
There has been some commentary in other posts about only writing a function for something if you're going to use it more than once. I'd like to adjust this to say, if you think there's a possibility that you may EVER use it again, you should throw it into a function. I can't even count the number of times I wished I wrote a function for a process I created on a line by line basis.
Also, BEFORE I change a function, I throw it into a file called Deprecated Functions.r, again, protecting against the "how the hell did I do that" effect.

I often divide up my code similarly to this (though I usually put Load and Clean in one file), but I never just source all the files to run the entire project; to me that defeats the purpose of dividing them up.
Like the comment from Sharpie, I think your workflow should depend a lot on the kind of work you're doing. I do mostly exploratory work, and in that context, keeping the data input (load and clean) separate from the analysis (functions and do) means that I don't have to reload and reclean when I come back the next day; I can instead save the data set after cleaning and then import it again.
I have little experience doing repetitive munging of daily data sets, but I imagine that I would find a different workflow helpful; as Hadley answers, if you're only doing something once (as I am when I load/clean my data), it may not be helpful to write a function. But if you're doing it over and over again (as it seems you would be) it might be much more helpful.
In short, I've found dividing up the code helpful for exploratory analyses, but would probably do something different for repetitive analyses, just like you're thinking about.

I've been pondering workflow tradeoffs for some time.
Here is what I do for any project involving data analysis:
Load and Clean: Create clean versions of the raw datasets for the project, as if I were building a local relational database. Thus, I structure the tables in third normal form (3NF) where possible. I perform basic munging but I do not merge or filter tables at this step; again, I'm simply creating a normalized database for a given project. I put this step in its own file and I save the objects to disk at the end using save.
Functions: I create a function script with functions for data filtering, merging and aggregation tasks. This is the most intellectually challenging part of the workflow as I'm forced to think about how to create proper abstractions so that the functions are reusable. The functions need to generalize so that I can flexibly merge and aggregate data from the load and clean step. As in the LCFD model, this script has no side effects as it only loads function definitions.
Function Tests: I create a separate script to test and optimize the performance of the functions defined in step 2. I clearly define what the output from the functions should be, so this step serves as a kind of documentation (think unit testing).
Main: I load the objects saved in step 1. If the tables are too big to fit in RAM, I can filter the tables with a SQL query, keeping with the database thinking. I then filter, merge and aggregate the tables by calling the functions defined in step 2. The tables are passed as arguments to the functions I defined. The output of the functions are data structures in a form suitable for plotting, modeling and analysis. Obviously, I may have a few extra line by line steps where it makes little sense to create a new function.
This workflow allows me to do lightning fast exploration at the Main.R step. This is because I have built clear, generalizable, and optimized functions. The main difference from the LCFD model is that I do not perform line-by-line filtering, merging or aggregating; I assume that I may want to filter, merge, or aggregate the data in different ways as part of exploration. Additionally, I don't want to pollute my global environment with lengthy line-by-line scripts; as Spacedman points out, functions help with this.

Coding practice in R : what are the advantages and disadvantages of different styles?

The recent questions regarding the use of require versus :: raised the question about which programming styles are used when programming in R, and what their advantages/disadvantages are. Browsing through the source code or browsing on the net, you see a lot of different styles displayed.
The main trends in my code :
Heavy vectorization. I play a lot with the indices (and nested indices), which sometimes results in rather obscure code but is generally a lot faster than other solutions.
e.g. x[x < 5] <- 0 instead of x <- ifelse(x < 5, 0, x)
I tend to nest functions to avoid overloading the memory with temporary objects that I need to clean up. Especially with functions manipulating large datasets this can be a real burden. e.g. y <- cbind(x, as.numeric(factor(x))) instead of y <- as.numeric(factor(x)); z <- cbind(x, y)
I write a lot of custom functions, even if I use the code only once, e.g. in an sapply. I believe it keeps the code more readable without creating objects that can remain lying around.
I avoid loops at all costs, as I consider vectorization to be a lot cleaner (and faster).
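To make the first point concrete, here are the three styles side by side on a small made-up vector: subset-and-replace, `ifelse()`, and an explicit loop. All three set values below 5 to 0 and give the same result; which reads best is the matter of taste the question is asking about.

```r
x <- c(1, 7, 3, 9, 5)

# Subset-and-replace: modifies only the elements that match.
a <- x
a[a < 5] <- 0

# ifelse(): builds a whole new vector from a test and two alternatives.
b <- ifelse(x < 5, 0, x)

# An explicit loop doing the same thing, for comparison.
l <- x
for (i in seq_along(l)) {
  if (l[i] < 5) l[i] <- 0
}
```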
Yet, I've noticed that opinions on this differ, and some people tend to back away from what they would call my "Perl" way of programming (or even "Lisp", with all those brackets flying around in my code. I wouldn't go that far though).
What do you consider good coding practice in R?
What is your programming style, and how do you see its advantages and disadvantages?

What I do will depend on why I am writing the code. If I am writing a data analysis script for my research (day job), I want something that works but that is readable and understandable months or even years later. I don't care too much about compute times. Vectorizing with lapply et al. can lead to obfuscation, which I would like to avoid.
In such cases, I would use loops for a repetitive process if lapply made me jump through hoops to construct the appropriate anonymous function for example. I would use the ifelse() in your first bullet because, to my mind at least, the intention of that call is easier to comprehend than the subset+replacement version. With my data analysis I am more concerned with getting things correct than necessarily with compute time --- there are always the weekends and nights when I'm not in the office when I can run big jobs.
For your other bullets; I would tend not to inline/nest calls unless they were very trivial. If I spell out the steps explicitly, I find the code easier to read and therefore less likely to contain bugs.
I write custom functions all the time, especially if I am going to be calling the code equivalent of the function repeatedly in a loop or similar. That way I have encapsulated the code out of the main data analysis script into its own .R file, which helps keep the intention of the analysis separate from how the analysis is done. And if the function is useful I have it for use in other projects etc.
If I am writing code for a package, I might start with the same attitude as my data analysis (familiarity) to get something I know works, and only then go for the optimisation if I want to improve compute times.
The one thing I try to avoid doing, is being too clever when I code, whatever I am coding for. Ultimately I am never as clever as I think I am at times and if I keep things simple, I tend not to fall on my face as often as I might if I were trying to be clever.

I write functions (in standalone .R files) for various chunks of code that conceptually do one thing. This keeps things short and sweet. I found debugging somewhat easier, because traceback() gives you which function produced an error.
I too tend to avoid loops, except when it's absolutely necessary. I feel somewhat dirty if I use a for() loop. :) I try really hard to do everything vectorized or with the apply family. This is not always the best practice, especially if you need to explain the code to another person who is not as fluent in apply or vectorization.
Regarding the use of require vs ::, I tend to use both. If I only need one function from a certain package I use it via ::, but if I need several functions, I load the entire package. If there's a conflict in function names between packages, I try to remember and use ::.
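A short sketch of the two styles, using only packages that ship with R so it runs without extra installs. The file name is made up, and stats is attached by default anyway; `library(stats)` appears only to show the form.

```r
# One function from a package: call it through the namespace, nothing is attached.
ext <- tools::file_ext("results/model_output.csv")

# Several functions from a package: attach it once with library().
# (stats is attached by default; it is used here only to keep the sketch runnable.)
library(stats)
fit_summary <- c(median = median(c(1, 3, 10)), sd = sd(c(1, 3, 10)))

# On a name clash, :: states explicitly which implementation is meant.
explicit <- stats::median(c(1, 3, 10))
```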
I try to find a function for every task I'm trying to achieve. I believe someone before me has thought of it and made a function that works better than anything I can come up with. This sometimes works, sometimes not so much.
I try to write my code so that I can understand it. This means I comment a lot and construct chunks of code so that they somehow follow the idea of what I'm trying to achieve. I often overwrite objects as the function progresses. I think this keeps the transparency of the task, especially if you're referring to these objects later in the function. I think about speed when computing time exceeds my patience. If a function takes so long to finish that I start browsing SO, I see if I can improve it.
I found out that a good syntax editor with code folding and syntax coloring (I use Eclipse + StatET) has saved me a lot of headaches.
Based on VitoshKa's post, I am adding that I use capitalizedWords (sensu Java) for function names and fullstop.delimited for variables. I see that I could have another style for function arguments.

Naming conventions are extremely important for the readability of the code. Inspired by R's S4 internal style here is what I use:
camelCase for global functions and objects (like doSomething, getXyyy, upperLimit)
functions start with a verb
not exported and helper functions always start with "."
local variables and functions are all lowercase and use "_" syntax (do_something, get_xyyy). This makes it easy to distinguish local from global names and therefore leads to cleaner code.

For data juggling I try to use as much SQL as possible, at least for the basic things like GROUP BY averages. I like R a lot, but sometimes it's not much fun to realize that your search strategy was not good enough to find yet another function hidden in yet another package. For my cases SQL dialects do not differ much and the code is really transparent. Most of the time the threshold (when to start using R syntax) is rather intuitive to discover, e.g.
require(RMySQL)
# con is assumed to be an open connection, e.g. from dbConnect(MySQL(), ...)
# selection of variables alongside conditions in SQL is really transparent
# even if conditional variables are not part of the selection
statement <- "SELECT id,v1,v2,v3,v4,v5 FROM mytable
WHERE this=5
AND that != 6"
mydf <- dbGetQuery(con, statement)
# some simple things get really tricky (at least in MySQL), but simple in R
# standard deviation of table rows: apply() gives one sd per row,
# whereas sd() on a whole matrix returns a single value
mydf$rowsd <- apply(mydf[, -1], 1, sd)  # drop the id column first
So I consider it good practice and really recommend to use a SQL database for your data for most use cases. I am also looking into TSdbi and saving time series in relational database, but cannot really judge that yet.
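A database-free sketch of the two operations mentioned above, so the R side can be tried without a MySQL server. The table `mydf` and its columns are hypothetical; `aggregate()` plays the role of a GROUP BY average and `apply()` computes the row-wise standard deviation.

```r
# Hypothetical table, standing in for the result of a database query.
mydf <- data.frame(
  grp = c("a", "a", "b"),
  v1  = c(1, 2, 4),
  v2  = c(3, 2, 4),
  v3  = c(5, 2, 10)
)

# R counterpart of: SELECT grp, AVG(v1) FROM mytable GROUP BY grp
group_means <- aggregate(v1 ~ grp, data = mydf, FUN = mean)

# Row-wise standard deviation across the numeric columns.
value_cols <- c("v1", "v2", "v3")
mydf$rowsd <- apply(mydf[, value_cols], 1, sd)
```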

R and SPSS difference

I will be analysing a vast amount of network-traffic data shortly, and will pre-process the data in order to analyse it. I have found that R and SPSS are among the most popular tools for statistical analysis. I will also be generating quite a lot of graphs and charts. Therefore, I was wondering what the basic differences between these two software packages are.
I am not asking which one is better, but just wanted to know what are the difference in terms of workflow between the two (besides the fact that SPSS has a GUI). I will be mostly working with scripts in either case anyway so I wanted to know about the other differences.

Here is something that I posted to the R-help mailing list a while back, but I think that it gives a good high level overview of the general difference in R and SPSS:
When talking about user friendliness of computer software I like the analogy of cars vs. buses:
Buses are very easy to use, you just need to know which bus to get on, where to get on, and where to get off (and you need to pay your fare). Cars on the other hand require much more work, you need to have some type of map or directions (even if the map is in your head), you need to put gas in every now and then, you need to know the rules of the road (have some type of driver's licence). The big advantage of the car is that it can take you a bunch of places that the bus does not go and it is quicker for some trips that would require transferring between buses.
Using this analogy programs like SPSS are buses, easy to use for the standard things, but very frustrating if you want to do something that is not already preprogrammed.
R is a 4-wheel drive SUV (though environmentally friendly) with a bike on the back, a kayak on top, good walking and running shoes in the passenger seat, and mountain climbing and spelunking gear in the back. R can take you anywhere you want to go if you take time to learn how to use the equipment, but that is going to take longer than learning where the bus stops are in SPSS.
There are GUIs for R that make it a bit easier to use, but they also limit the functionality that can be used that easily. SPSS does have scripting, which takes it beyond being a mere bus, but the general philosophy of SPSS steers people towards the GUI rather than the scripts.

I work at a company that uses SPSS for the majority of our data analysis, and for a variety of reasons - I have started trying to use R for more and more of my own analysis. Some of the biggest differences I have run into include:
Output of tables - SPSS has basic tables, general tables, custom tables, etc. that are all output to that nifty data viewer or whatever they call it. These can relatively easily be transported to Word documents or Excel sheets for further analysis / presentation. The equivalent in R involves learning LaTeX or using odfWeave, LyX, or something of that nature.
Labeling of data - SPSS does a pretty good job with the variable labels and value labels. I haven't found a robust solution for R to accomplish this same task.
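For context on the labeling point, here is a sketch of what base R offers, which covers only part of what SPSS does. Value labels map naturally onto `factor()`; a variable label can be stored as an attribute, but the attribute name `"label"` is a convention I am assuming here, and base R does not use it automatically in output. The variable and its codes are hypothetical.

```r
# Hypothetical coded survey variable: 1 = "Low", 2 = "Medium", 3 = "High".
codes <- c(2, 1, 3, 2)

# Value labels: a factor maps each code to a label.
satisfaction <- factor(codes, levels = c(1, 2, 3),
                       labels = c("Low", "Medium", "High"))

# Variable label: stored as a plain attribute; this is a convention,
# not something base R output functions pick up on their own.
attr(satisfaction, "label") <- "Overall satisfaction"

counts <- table(satisfaction)
```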
You mention that you are going to be scripting most of your work, and personally I find SPSS's scripting syntax absolutely horrendous, to the point that I've stopped working with SPSS whenever possible. R syntax seems much more logical and follows programming standards more closely AND there is a very active community to rely on should you run into trouble (SO for instance). I haven't found a good SPSS community to ask questions of when I run into problems.
Others have pointed out some of the big differences in terms of cost and functionality of the programs. If you have to collaborate with others, their comfort level with SPSS or R should play a factor as you don't want to be the only one in your group that can work on or edit a script that you wrote in the future.
If you are going to be learning R, this post on the stats exchange website has a bunch of great resources for learning R: https://stats.stackexchange.com/questions/138/resources-for-learning-r

The initial workflow for SPSS involves justifying writing a big fat cheque. R is freely available.
R has a single language for 'scripting', but don't think of it like that, R is really a programming language with great data manipulation, statistics, and graphics functionality built in. SPSS has 'Syntax', 'Scripts' and is also scriptable in Python.
Another biggie is that SPSS squeezes its data into a spreadsheety table structure. Dealing with other data structures is probably very hard, but comes naturally to R. I wouldn't know where to start handling network graph type data in SPSS, but there's a package to do it for R.
Also with R you can integrate your workflow with your reporting by using Sweave - you write a document with embedded bits of R code that generate plots or tables, run the file through the system and out comes the report as a PDF. Great for when you want to do a weekly report, or you do a body of work and then the boss gives you an updated data set. Re-run, read it over, it's done.
But you know, your call...

Well, are you a decent programmer? If you are, then it's worthwhile to learn R. You can do more with your data, both in terms of manipulation and statistical modeling, than you can with SPSS, and your graphs will likely be better too. On the other hand, if you've never really programmed before, or find the idea of spending several months becoming a programmer intimidating, you'll probably get more value out of SPSS. The level of stuff that you can do with R without diving into its power as a full-fledged programming language probably doesn't justify the effort.
There's another option -- collaborate. Do you know someone you can work with on your project (you don't say whether it's academic or industry, but either way...), who knows R well?

There's an interesting (and reasonably fair) comparison between a number of stats tools here:
http://anyall.org/blog/2009/02/comparison-of-data-analysis-packages-r-matlab-scipy-excel-sas-spss-stata/

I work with both in a company and can say the following:
If you have a large team of different people (not all data scientists), SPSS is useful because it is (relatively) plain to understand. For example, if users are going to run a model to get an output (sales estimates, etc.), SPSS is clear and easy to use.
That said, I find R better in almost every other sense:
R is faster (although this is sometimes debatable).
As stated previously, the syntax in SPSS is awful (I can't stress this enough). On the other hand, R can be painful to learn, but there are tons of resources online, and in the end it pays off much more because of the different things you can do.
Again, like everyone else says, the sky is the limit with R. Tons of packages, resources and, more importantly, independence to do as you please. In my organization we have some very high-level functions that get a lot done. The hard part is creating them once, but then they perform complicated tasks that SPSS would tangle in a never-ending web of canvas. This is especially true for things like loops.
It is often overlooked, but R also has plenty of features to cooperate between teams (github integration with RStudio, and easy package building with devtools).
Actually, if everyone in your organization knows R, all you need is to maintain a basic package on github to share everything. This of course is not the norm, which is why I think SPSS, although a worse product, still has a market.

I have no data for it, but from my experience I can tell you one thing:
SPSS is a lot slower than R. (And by a lot, I really mean a lot.)
The magnitude of the difference is probably as big as the one between C++ and R.
For example, I never have to wait longer than a couple of seconds in R. Using SPSS and similar data, I had calculations that took longer than 10 minutes.
As an unrelated side note: In my eyes, in the recent discussion on the speed of R, this point was somehow overlooked (i.e., the comparison with SPSS). Furthermore, I am astonished how this discussion popped up for a while and silently disappeared again.

There are some great responses above, but I will try to provide my 2 cents. My department completely relies on SPSS for our work, but in recent months I have been making a conscious effort to learn R, in part for some of the reasons itemized above (speed, vast data structures, available packages, etc.).
That said, here are a few things I have picked up along the way:
Unless you have some experience programming, I think creating summary tables in CTABLES destroys any available option in R. To date, I am unaware of a package that can replicate what can be created using Custom Tables.
SPSS does appear to be slower when scripting, and yes, SPSS syntax is terrible. That said, I have found that scripts in SPSS can always be improved by using the EXECUTE command sparingly.
SPSS and R can interface with each other, although it appears that it's one way (only when using R inside of SPSS, not the other way around). That said, I have found this to be of little use other than when I want to use ggplot2 or some other advanced data management techniques. (I despise SPSS macros.)
I have long felt that "reporting" work created in SPSS is far inferior to other solutions. As mentioned above, if you can leverage LaTeX and Sweave, you will be very happy with your efficient workflows.
I have been able to do some advanced analysis by leveraging OMS in SPSS. Almost everything can be routed to a new dataset, but I have found that most SPSS users don't use this functionality. Also, when looking at examples in R, it just feels "easier" than using OMS.
In short, I find myself using SPSS when I can't figure it out quickly in R, but I sincerely have every intention of getting away from SPSS and using R entirely at some point in the near future.

SPSS provides a GUI to easily integrate existing R programs or develop new ones. For more info, see the SPSS Community on IBM developerWorks.

@Henrik, I ran the same task you mentioned (C++ and R) in SPSS, and it turned out that SPSS is faster than R on this one. In my case SPSS is approx. 7 times faster, which surprised me.
Here is the code I used in SPSS.
data list free
/x (f8.3).
begin data
1
end data.
* Number of inner iterations.
comp n = 1e6.
* Record the start time.
comp t1 = $time.
loop #rep = 1 to 10.
comp x = 1.
loop #i=1 to n.
comp x = 1/(1+x).
end loop.
end loop.
* Record the end time and compute the elapsed seconds.
comp t2 = $time.
comp elapsed = t2 - t1.
form elapsed (f8.2).
exe.
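
For comparison, the same loop can be sketched in R. The original R code from the earlier comparison is not shown here, so this is only an assumption about what it may have looked like; the names `n`, `reps` and `run_loop` are illustrative.

```r
# Hypothetical R counterpart of the SPSS loop above
# (not the original poster's code).
n <- 1e6     # inner iterations, as in the SPSS script
reps <- 10   # outer repetitions, as in the SPSS script

run_loop <- function(n, reps) {
  x <- 1
  for (rep in seq_len(reps)) {
    x <- 1                    # reset before each repetition
    for (i in seq_len(n)) {
      x <- 1 / (1 + x)
    }
  }
  x
}

# Time the whole run; the elapsed value depends on the machine.
timing <- system.time(result <- run_loop(n, reps))
print(timing[["elapsed"]])
print(result)
```

Timings from a sketch like this depend heavily on the machine and the R version, so any ratio against SPSS should be measured on the same hardware.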

Check out this video on why it is good to combine SPSS and R:
http://bluemixanalytics.wordpress.com/2014/08/29/7-good-reasons-to-combine-ibm-spss-analytics-and-r/
If you have a compatible copy of R installed, you can connect to it from IBM SPSS Modeler and carry out model building and model scoring using custom R algorithms that can be deployed in IBM SPSS Modeler. You must also have a copy of IBM SPSS Modeler - Essentials for R installed. IBM SPSS Modeler - Essentials for R provides you with tools you need to start developing custom R applications for use with IBM SPSS Modeler.

The truth is: both packages are useful if you do data analysis professionally. Sure, R / RStudio has more statistical methods implemented than SPSS. But SPSS is much easier to use and gives more information per button click. Therefore, it is faster to use whenever a particular analysis is implemented in both R and SPSS.
In the modern age, neither CPU nor memory is the most valuable resource. The researcher's time is the most valuable resource. Also, tables in SPSS are more visually pleasing, in my opinion.
In summary, R and SPSS complement each other well.
