I want to look into Rcpp to improve the speed of some of my R code without having to resort to messy C++ code (I've had some success with that, but it looks like code from hell).
So, I checked the documentation provided with Rcpp, and also the bundle of documents provided at Dirk Eddelbuettel's site. I installed and looked at RcppExamples, but (at least from its documentation) most of these refer to RcppClassic. Besides that, I did some googling but that didn't result in answers to what seem like basic questions.
Do indexes in Rcpp work zero-based or one-based?
List provides both operator() and operator[], but apparently not operator[[]]. It is not clear which ones are similar to [] and [[]] in R.
Is there any support for factors in Rcpp (there does not appear to be any)?
Note: in fact I found some answers from the first example in Rcpp-introduction.pdf, but that just felt like luck.
Also, my STL is very rusty, so if anybody can provide me with a simple example where each element of a List is (e.g.) print-ed with an STL-style loop, that would be neat.
If anybody wants to call me an idiot for not finding this information: go ahead and make your day. Then make mine and point me to the docs I need :-)
As a suggestion to Mr. Eddelbuettel and other Rcpp authors (I expect some of them to read this): the class hierarchies and the like, provided by doxygen, are really neat when you are already knee-deep into Rcpp, but as a beginner (in Rcpp), I am more interested in a list of 'this method in this class does this like that function in R' rather than 'you can find the declaration of this operator in this header file'. After all, I understand one of the goals of Rcpp is to lower the threshold for using C++ in R? Note: from what I have seen and understood, I highly value the actual code of Rcpp and have the highest respect for its creators. If the lack of basic documentation is merely a result of 'lack of resources', I would be willing to become a resource (e.g.: work on 'basic' documentation once I get through it myself).
I do not quite know where to start answering this but here is a quick attempt:
The package has a website. The website lists the documentation.
The package has eight (8) vignettes. They are clearly listed. They are mostly meant to be read as documentation, some more introductory and some more advanced. Some (such as the unit testing output) are more of a quality-control initiative.
There is a vignette called Rcpp-introduction. We refer to it repeatedly. We suggest you read it. This is now also a peer-reviewed and published paper which may lend it even more credibility.
There is a vignette called Rcpp-FAQ. Its first question is "How do I get started?" which points to the aforementioned Rcpp-introduction.
There is a mailing list dedicated to the project; you could actually read the archive.
We have given numerous talks, slides are available as is a 90 minute recording of a Google Tech Talk.
Even StackOverflow has a tag for it: [rcpp]. You could read the earlier posts.
There are over two dozen packages clearly listed on the CRAN page for Rcpp as using it. You could read their source code.
All that said, Rcpp cannot be used instead of C++ so if you do not know or understand that operator[[]] cannot exist in C++ we cannot help you either. This is not a magic fairy, or R-to-C++ code compiler. Rather, its focus is to make it much easier to get to C++ code from R, and in some cases it even manages to improve on C++ practice. In essence, it tries to be "super-additive": the combination of R and C++ should be more than either in isolation.
Lastly, I do grant you that the RcppExamples package -- which by the way covers the old and new API -- could use more examples. However, its sources give good porting hints from the old ("classic") to the new and current API.
But there is only so much documentation we can write ourselves. I myself find the above bullet points quite exhaustive. You may have homed in on the weakest link of the chain, though. That is bad luck. Please do try some of the other pointers listed here.
I've done some research online but I haven't been able to come up with any answer. I know this has been asked at least thrice, as I've viewed those posts, linked here:
First Question
Second Question
Third Question
However, it's been 5, 7, and 9 years since those questions were asked, and technology is obviously rapidly evolving :) I don't know much about R, and I haven't worked with it for a long time, so I ask those of you who know better and have more experience whether you know of anything that would be useful to me.
If there's nothing that exists now, how hard would it be to create? The reason I ask is that the company I work for would like to obfuscate the proprietary code before it goes out. I would have the full 40 hours a week to work on creating it, and so time and/or difficulty isn't a major concern.
Thanks!
Found this: I'm not sure about the security, but this is definitely a deterrent and would take (I think) some fairly concentrated effort to crack. There is a byte code compiler for R based on the paper linked below. There is a method in library(compiler), which comes standard with R, that allows you to compile an R script to byte code. In the same library, you can load in the source files and use them as you'd like.
A Byte Code Compiler for R
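For anyone who wants to try it, here is a minimal sketch of that workflow, assuming only the compiler package that ships with R: cmpfile() writes a byte-code file and loadcmp() loads it back. The temporary file names and the add_one function are made up for illustration, and the sketch makes no claim about how hard the compiled file is to reverse.

```r
library(compiler)

# Write a tiny script to a temporary file (stand-in for the proprietary source)
src <- tempfile(fileext = ".R")
out <- tempfile(fileext = ".Rc")
writeLines("add_one <- function(x) x + 1", src)

# Compile the script to a byte-code file
cmpfile(src, out)

# Load the compiled file into its own environment and call the function
env <- new.env()
loadcmp(out, envir = env)
result <- env$add_one(41)
```

Only the compiled file (out here) would need to be shipped; the original script is not required to load it.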
I am new to creating my own packages and I am using roxygen2.
I am creating a package with a lot of internal helper functions and I was wondering if I have to document all of them. I understand the importance of documentation but some functions are fairly simple and are just wrapper around other functions for convenience. I have done a basic search of the web but I don't seem to be able to find a definitive answer.
Any help is appreciated.
It depends what you mean by "have to". One interpretation is, "Do I have to document these functions to pass checks?" The answer to that question is no. As long as the function isn't exported from the package, R CMD check won't require that you document it.
Another interpretation is "Do I have to document it to help myself in maintaining this package?" That question is harder to answer. Some functions are so obvious that they don't really need any documentation beyond their name, e.g. a print method with no extra arguments beyond those of the generic.
Other functions aren't so obvious, or have arguments whose meaning isn't obvious. It's a good idea to document those if you plan to maintain your package for a long time, because you might forget the details between now and whenever a problem arises. And if you are releasing your package to others, you should plan on long term maintenance, because if it is useful, people will use it.
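As a rough illustration of the middle ground, here is a sketch of an internal helper documented with roxygen2 comments but kept out of the manual and the exports. The clamp function is a made-up example; @noRd is the roxygen2 tag that suppresses the .Rd file, and leaving out @export keeps the function internal.

```r
#' Clamp a numeric vector to a range (internal helper)
#'
#' @param x Numeric vector.
#' @param lo,hi Lower and upper bounds.
#' @return `x` with values outside `[lo, hi]` replaced by the nearest bound.
#' @noRd
clamp <- function(x, lo = 0, hi = 1) {
  pmin(pmax(x, lo), hi)
}

# There is no @export tag above, so roxygen2 will not export this function.
clamped <- clamp(c(-0.5, 0.25, 2))
```

The comments then serve as notes for the maintainer without adding anything for R CMD check to verify.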
Short version
For those that don't want to read through my "case", this is the essence:
What is the recommended way of minimizing the chances of new packages breaking existing code, i.e. of making the code you write as robust as possible?
What is the recommended way of making the best use of the namespace mechanism when
a) just using contributed packages (say in just some R Analysis Project)?
b) with respect to developing own packages?
How best to avoid conflicts with respect to formal classes (mostly Reference Classes in my case) as there isn't even a namespace mechanism comparable to :: for classes (AFAIU)?
The way the R universe works
This is something that's been nagging in the back of my mind for about two years now, yet I don't feel as if I have come to a satisfying solution. Plus I feel it's getting worse.
We see an ever increasing number of packages on CRAN, github, R-Forge and the like, which is simply terrific.
In such a decentralized environment, it is natural that the code base that makes up R (let's say that's base R and contributed R, for simplicity) will deviate from an ideal state with respect to robustness: people follow different conventions, there's S3, S4, S4 Reference Classes, etc. Things can't be as "aligned" as they would be if there were a "central clearing instance" that enforced conventions. That's okay.
The problem
Given the above, it can be very hard to use R to write robust code. Not everything you need will be in base R. For certain projects you will end up loading quite a few contributed packages.
IMHO, the biggest issue in that respect is the way the namespace concept is put to use in R: R allows for simply writing the name of a certain function/method without explicitly requiring its namespace (i.e. foo vs. namespace::foo).
So for the sake of simplicity, that's what everyone is doing. But that way, name clashes, broken code and the need to rewrite/refactor your code are just a matter of time (or of the number of different packages loaded).
At best, you will know about which existing functions are masked/overloaded by a newly added package. At worst, you will have no clue until your code breaks.
A couple of examples:
try loading RMySQL and RSQLite at the same time, they don't get along very well
also RMongo will overwrite certain functions of RMySQL
forecast masks a lot of stuff with respect to ARIMA-related functions
R.utils even masks the base::parse routine
(I can't recall which functions in particular were causing the problems, but am willing to look it up again if there's interest)
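The masking effect itself can be reproduced with base R alone. The sketch below is only an illustration: instead of loading two clashing packages, it defines a top-level function named sd, which plays the role of the newly attached package that hides stats::sd.

```r
# A definition with the same name as stats::sd masks it for unqualified calls
sd <- function(x) "masked"

unqualified <- sd(c(1, 2, 3))         # finds the definition above first
qualified   <- stats::sd(c(1, 2, 3))  # always the version from stats

# Remove the masking definition again
rm(sd)
```

In an interactive session, conflicts(detail = TRUE) lists the names that are defined more than once on the search path.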
Surprisingly, this doesn't seem to bother a lot of programmers out there. I tried to raise interest a couple of times at r-devel, to no significant avail.
Downsides of using the :: operator
Using the :: operator might significantly hurt efficiency in certain contexts as Dominick Samperi pointed out.
When developing your own package, you can't even use the :: operator throughout your own code as your code is no real package yet and thus there's also no namespace yet. So I would have to initially stick to the foo way, build, test and then go back to changing everything to namespace::foo. Not really.
Possible solutions to avoid these problems
Reassign each function from each package to a variable that follows certain naming conventions, e.g. namespace..foo in order to avoid the inefficiencies associated with namespace::foo (I outlined it once here). Pros: it works. Cons: it's clumsy and you double the memory used.
Simulate a namespace when developing your package. AFAIU, this is not really possible, at least I was told so back then.
Make it mandatory to use namespace::foo. IMHO, that would be the best thing to do. Sure, we would lose some extent of simplicity, but then again the R universe just isn't simple anymore (at least it's not as simple as in the early 00's).
And what about (formal) classes?
Apart from the aspects described above, the :: way works quite well for functions/methods. But what about class definitions?
Take package timeDate with its class timeDate. Say another package comes along which also has a class timeDate. I don't see how I could explicitly state that I would like a new instance of class timeDate from either of the two packages.
Something like this will not work:
new(timeDate::timeDate)
new("timeDate::timeDate")
new("timeDate", ns="timeDate")
That can be a huge problem as more and more people switch to an OOP-style for their R packages, leading to lots of class definitions. If there is a way to explicitly address the namespace of a class definition, I would very much appreciate a pointer!
Conclusion
Even though this was a bit lengthy, I hope I was able to point out the core problem/question and that I can raise more awareness here.
I think devtools and mvbutils do have some approaches that might be worth spreading, but I'm sure there's more to say.
GREAT question.
Validation
Writing robust, stable, and production-ready R code IS hard. You said: "Surprisingly, this doesn't seem to bother a lot of programmers out there". That's because most R programmers are not writing production code. They are performing one-off academic/research tasks. I would seriously question the skillset of any coder that claims that R is easy to put into production. Aside from my post on search/find mechanism which you have already linked to, I also wrote a post about the dangers of warning. The suggestions will help reduce complexity in your production code.
Tips for writing robust/production R code
Avoid packages that use Depends and favor packages that use Imports. A package with dependencies stuffed into Imports only is completely safe to use. If you absolutely must use a package that employs Depends, then email the author immediately after you call install.packages().
Here's what I tell authors: "Hi Author, I'm a fan of the XYZ package. I'd like to make a request. Could you move ABC and DEF from Depends to Imports in the next update? I cannot add your package to my own package's Imports until this happens. With R 2.14 enforcing NAMESPACE for every package, the general message from R Core is that packages should try to be "good citizens". If I have to load a Depends package, it adds a significant burden: I have to check for conflicts every time I take a dependency on a new package. With Imports, the package is free of side-effects. I understand that you might break other people's packages by doing this. I think it's the right thing to do to demonstrate a commitment to Imports and in the long run it will help people produce more robust R code."
Use importFrom. Don't add an entire package to Imports, add only those specific functions that you require. I accomplish this with Roxygen2 function documentation and roxygenize() which automatically generates the NAMESPACE file. In this way, you can import two packages that have conflicts where the conflicts aren't in the functions you actually need to use. Is this tedious? Only until it becomes a habit. The benefit: you can quickly identify all of your 3rd-party dependencies. That helps with...
Don't upgrade packages blindly. Read the changelog line-by-line and consider how the updates will affect the stability of your own package. Most of the time, the updates don't touch the functions you actually use.
Avoid S4 classes. I'm doing some hand-waving here. I find S4 to be complex and it takes enough brain power to deal with the search/find mechanism on the functional side of R. Do you really need these OO features? Managing state = managing complexity - leave that for Python or Java =)
Write unit tests. Use the testthat package.
Whenever you R CMD build/test your package, parse the output and look for NOTE, INFO, WARNING. Also, physically scan with your own eyes. There's a part of the build step that notes conflicts but doesn't attach a WARN, etc. to it.
Add assertions and invariants right after a call to a 3rd-party package. In other words, don't fully trust what someone else gives you. Probe the result a little bit and stop() if the result is unexpected. You don't have to go crazy - pick one or two assertions that imply valid/high-confidence results.
I think there's more but this has become muscle memory now =) I'll augment if more comes to me.
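To make the assertion tip concrete, here is a small sketch. fetch_prices is a hypothetical stand-in for whatever third-party call you depend on, and the two checks are just examples of "one or two assertions that imply valid results".

```r
# Hypothetical stand-in for a call into a third-party package
fetch_prices <- function() {
  data.frame(day = 1:3, price = c(10.5, 11.0, 10.8))
}

# Wrap the call and probe the result before using it
checked_prices <- function() {
  prices <- fetch_prices()
  if (!is.data.frame(prices) || nrow(prices) == 0L) {
    stop("fetch_prices() did not return a non-empty data frame")
  }
  if (anyNA(prices$price) || any(prices$price <= 0)) {
    stop("fetch_prices() returned missing or non-positive prices")
  }
  prices
}

prices <- checked_prices()
```

If an upgrade of the other package ever changes what it returns, the stop() fires at the call site rather than somewhere downstream.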
My take on it:
Summary : Flexibility comes with a price. I'm willing to pay that price.
1) I simply don't use packages that cause that kind of problems. If I really, really need a function from that package in my own packages, I use the importFrom() in my NAMESPACE file. In any case, if I have trouble with a package, I contact the package author. The problem is at their side, not R's.
2) I never use :: inside my own code. By exporting only the functions needed by the user of my package, I can keep my own functions inside the NAMESPACE without running into conflicts. Functions that are not exported won't hide functions with the same name either, so that's a double win.
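For illustration, one way to get the importFrom() and export() directives into the NAMESPACE file is through roxygen2 tags on the function itself. This is only a sketch: center_on_median and scale_to_unit are made-up names, and it assumes you generate NAMESPACE with roxygen2 rather than writing it by hand.

```r
#' Centre a numeric vector on its median
#'
#' @param x Numeric vector.
#' @return `x` minus its median.
#' @importFrom stats median
#' @export
center_on_median <- function(x) {
  x - median(x)
}

# Internal helper: no @export tag, so it stays inside the package namespace
# and cannot hide a same-named function in the user's session.
scale_to_unit <- function(x) {
  x / max(abs(x))
}

centered <- center_on_median(c(1, 2, 6))
```

Running roxygen2 on a package containing this file should add importFrom(stats,median) and export(center_on_median) to NAMESPACE, and nothing for the helper.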
A good guide to how exactly environments, namespaces and the like work can be found here:
http://blog.obeautifulcode.com/R/How-R-Searches-And-Finds-Stuff/
This is definitely a must-read for everybody writing packages and the like. After you read this, you'll realize that using :: in your package code is not necessary.
Occasionally I see small ways I could improve either R (recently the IQR command) and R documentation (just this week perhaps elaborating differences among and better interconnecting aggregate, tapply, and by). But I don't see a way to really make that contribution back. I looked into the developer site and it seems that my options are either to attempt to become a full fledged developer or create packages, neither of which fit what I wish to accomplish.
I did propose IQR changes on the R mailing list but got no response so I figure that's going nowhere.
And to clarify, I'm talking about base-R. Additional packages are another matter.
Any tips?
Send (or CC) to r-devel. Traffic is quite high on r-help, and things can be overlooked there.
File a bug under the wishlist category detailing the improvement you would like to see.
Having filed the bug, try to provide a patch against the R code and/or documentation as appropriate. I've done this before where there was a problem or infelicity in R, supplied a patch and a fix to the help files/manual and had the changes accepted (after suitable modification) by R Core.
If it is an addition to the R code base, you are going to have to show that there is a real pressing need for the addition. Basically you are asking R Core to maintain your code in perpetuity, and they are unlikely to do that unless you can demonstrate a need.
If it is an addition, look for a popular R package that does similar/related things and suggest to the package maintainer that they include your function. That way you don't need to start a whole package for something simple but contribute your code. There are several, popular, *misc packages on CRAN for example.
If you want to contribute fixes to the R documentation and/or manuals, provide patches to the sources. You can find the sources at svn.r-project.org/R
Hopefully that gives you some ideas. Patches and code always help!
How about patches to existing packages?
How about open bug reports on packages? R-Forge projects don't seem to use the issue trackers much, but some folks on the RPostgreSQL team I'm on enabled it (where it is hosted on Google Code), and it has been helpful -- see here. And we had a really useful inflow of fresh blood with a rocking new developer from Japan, probably in part because of the visibility of the project there.
In essence, try to find a project / group / team to become acquainted with and join. In that sense, this is just like any other Open Source project. The r-devel list (gmane view) is a good place for R development in general.
The R Core team, on the other hand, is a little more closed, by invitation only, and unlikely to change. So be it, for better or worse. It has worked so far, and hence I am not among those who bemoan this loudly.
I've been using R for a little over a year now and it's been a successful venture. But all too often, I find that there is something that I can't figure out for lack of knowing how to find it or an example of it.
Stackoverflow,
Could you recommend a pathway for learning R in a manner that provides one with a toolset at their disposal to solve problems of a statistical nature?
There's a wealth of knowledge on the internet, between the r-project website and the mailings lists but it seems to be "everywhere" and nowhere when you're actually looking for it.
For example, when I first started using R, I went through "Intro to R". Then I read the language definition (which obviously hasn't sunk in). But every time I ask a question on Stackoverflow I'm presented with some new badass function that is the solution to all my problems in the short term. My question is, how did you know these functions existed in the first place? And how does one go about finding them? Presumably, you read something or found some resources that detoured your learning to the exponential part of the curve. What was it?
Obviously, R's functionality as a statistical tool is broad. For my own purposes I work mostly with economic or financial data. Hence, answers with this in mind would be most helpful.
Completely biased response: learn plyr, reshape2 and ggplot2. They will cover 90% of your data manipulation and visualisation needs. All three packages have a consistent philosophy of data (which the ggplot2 book touches upon), and are designed to be consistent and easier to learn.
Rather than learning many specialised functions, I really encourage you to learn about simple functions that can be flexibly composed to solve a wide range of problems. This is what plyr strives to do for data manipulation, and what ggplot2 strives to do for visualisation. It does mean you need to invest more time up front to learn a little about the underlying theory, but it's my belief that it will pay off handsomely in the long run.
How I learned R.
R resources:
To learn R, the most important resource is Google. Search for: “TOPIC r-project”, “TOPIC filetype:r”, or “TOPIC site:nabble.com”.
Second, look at the example code provided with most packages. Go to “http://bm2.genes.nig.ac.jp/”, search for a topic and look at the example code. Run it and adapt it; this way you can often solve part of your problem.
Third: the r-help mailing list. Read the posts; the basic questions get asked over and over again. If you have a problem and you are completely stuck, ask a question on the mailing list.
Finally, look at the source code of the R packages. That’s the hardest part. If you can alter the code to your needs, you have mastered R ;-)
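For that last point you do not even need to download anything for a first look. A small sketch using only base R, with stats::sd picked as an arbitrary example:

```r
# Typing a function name without parentheses prints its definition
print(stats::sd)

# body() and formals() give you the pieces programmatically
sd_body <- paste(deparse(body(stats::sd)), collapse = " ")
sd_args <- names(formals(stats::sd))

# For S3 generics such as print, the real code lives in the methods
print_methods <- methods("print")
```

For functions that only call compiled code you will still have to go to the package sources, but for plain R functions this is often enough to start adapting them.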
Some Tips:
R has a steep learning curve. That’s a feature ;-) It is designed to solve advanced problems, and in the end you are faster than when using an alternative to R.
Know every single R package and function that is relevant to your problem. The strength of R is that there are so many packages available (around 2000, I think). There is usually a package that’s more suited or that already solves your problem. (Some help pages are badly written and hard to understand - I got used to it.)
R books are not helpful in learning R. Yes, that’s true. If you are an expert programmer and expert statistician, you don’t need any book on R (the only exception is Hadley Wickham’s ggplot2 book). If you are not, learn programming in general and/or advanced statistics.
Some R packages have known bugs, which nobody will fix (package owner left university, etc.). Just a warning: this can be tricky if you are looking for a bug in your code and the bug is in an R package.
I'll start with this:
My question is, how did you know these functions existed in the first place?
Simple - we tried to solve a similar problem and came across that function. It either suited or didn't suit our needs but we now know it's there. I haven't used R much personally but what you're describing is the learning curve for every programming language ever. Firstly, you learn the "grammar" i.e. what you can do. Then you try to do something. You find you can't.
At that stage a programmer has a number of options. What do I do personally? Depends. I'll try and look up that package/header/library/whatever's member functions to see if something suits my needs. I might Google it, because unless you're really pushing the boundaries someone somewhere has probably tried and failed to do it before and had their question answered. If you are pushing the boundaries, someone somewhere has probably tried and failed before, but got no answer. I might try a forum or two to see what happens. I personally don't use IRC much, but that's another option, as are mailing lists depending on how specialised the problem is.
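In R specifically, "looking up what a package offers" can be done from the console. A minimal sketch with base functions only; the search pattern and the choice of the stats package are arbitrary examples:

```r
# Names on the search path that match a pattern
read_funs <- apropos("^read\\.")

# Everything an attached package makes available
stats_funs <- ls("package:stats")

# Full-text search over the installed help pages; interactive use only
# help.search("linear model")   # shorthand: ??"linear model"
```

Skimming the output of ls() for a package you already use is a cheap way to stumble over functions you did not know existed.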
I also have a folder on my computer full of books which I search through depending on the problem and a small library of books I look through/learnt from, which often contain practical, not-quite-there-but-adaptable examples.
My only comment would be attempting to read the language specification is unlikely to be massively useful to you as a beginner. You won't fully understand what it means because you haven't pushed the bounds and tried things yet. For example, a novice in C might try this:
char c = '7';
int x = (int) c;
to convert the character '7' into an integer form. It's not a bad thought process until you understand how characters and ASCII work, then you see why the above doesn't give you what you want.
In short, I think this is going to be part of the learning process and I don't think you can cut it any shorter. The consolation is like any research, the more you do it the more you'll know where to look and what questions to ask on various communities.
One of the things I do is follow the RSS feed of R questions on SO (https://stackoverflow.com/feeds/tag/r). Then I can browse what other people have asked/answered.
Often I will favourite a particular question/answer if I think I'll use it, or jot down the salient points into my notebook software (OneNote). Occasionally I'll even try the question/answer out myself.
EDIT:
I'd also recommend Patrick Burns's book The R Inferno. It's not so much a training book as a description of all the gotchas and oooh moments Patrick has found (so far).
There's a free book you might be interested in: Introduction to Probability and Statistics Using R
Here is a good list of resources for learning R:
https://stats.stackexchange.com/questions/138/resources-for-learning-r
Also, that website in general is a good resource.
In general I would say that following a mailing list, or a help list is the best way I have found for learning new things. (That and the "R magazine": http://www.r-bloggers.com )
Learning the RODBC package to interact directly with Oracle data made a big impact at my job. My boss was amazed when I pulled Oracle data directly into R and cranked out a plot in only a few lines of code. Try doing that in Excel!
Moral of the story, learn how to pull in data and manipulate it within R. Then move to some of the cooler stuff like ggplot.
I can recommend Penn University's Introductory Course on R.
The ggplot chapter alone is worth reading - I found ggplot very confusing but this is a great explanation.
The book that helped my learning the most was The Art of R Programming. A lot of programming books can be dry. Since R is commonly an entry point to programming, it's important for the voice of the materials to resonate with the student. That book did just that with me. The voice felt very casual and I liked that.
Some interesting links:
Intro, links and examples: http://manuals.bioinformatics.ucr.edu/home/programming-in-r
A lot of documentation: https://en.wikibooks.org/wiki/R_Programming
R forum: http://r.789695.n4.nabble.com/
The [R] tag FAQ, right here on Stackoverflow, https://stackoverflow.com/questions/tagged/r?sort=frequent provides numerous reproducible examples that one can use to "learn by doing".
Most of the problems are very common and will eventually be something that you will have to look up as a beginner. The FAQ also provides highly literate (and experienced) examples of usage for a diverse range of functions and useful packages.
If you're new to R, and you prefer a more hands on approach to learning, the FAQ should not be overlooked as a potential resource for learning. Many of the questions also provide useful discussion surrounding paradigms of the language itself (vectorization, workflow, debugging are just a few examples).
Nearly every question in the FAQ is worth studying as a new user as it touches on elements that, speaking for myself, I wish I had been pointed to when I asked this question originally.
Just a few examples:
How to make a great R reproducible example
Grouping functions (tapply, by, aggregate) and the *apply family
Workflow for statistical analysis and report writing
How to sort a dataframe by multiple column(s)?
What is your favorite R debugging trick?