I am new to creating my own packages and I am using roxygen2.
I am creating a package with a lot of internal helper functions and I was wondering if I have to document all of them. I understand the importance of documentation but some functions are fairly simple and are just wrapper around other functions for convenience. I have done a basic search of the web but I don't seem to be able to find a definitive answer.
Any help is appreciated.
It depends what you mean by "have to". One interpretation is, "Do I have to document these functions to pass checks?" The answer to that question is no. As long as the function isn't exported from the package, R CMD check won't require that you document it.
Another interpretation is "Do I have to document it to help myself in maintaining this package?" That question is harder to answer. Some functions are so obvious that they don't really need any documentation beyond their name, e.g. a print method with no extra arguments beyond those of the generic.
Other functions aren't so obvious, or have arguments whose meaning isn't obvious. It's a good idea to document those if you plan to maintain your package for a long time, because you might forget the details between now and whenever a problem arises. And if you are releasing your package to others, you should plan on long term maintenance, because if it is useful, people will use it.
Related
Maybe due to language barrier, I am not quite sure about how/when to use callNextMethod even after reading documents. I think it is for passing arguments, but sometimes I see a R script with only one method still uses callNextMethod. I am confused.
Can anyone explain it in a more explicit way? Better with examples.
I would like to provide an additional reminder / incentive to my colleagues to write tests for new functions that they contribute to a package I have developed.
I am not sure if this is even a good idea, or if there is a better approach, but I would like to know if there is a way to use the automated testing of files in /tests and /inst/tests to encourage developers to write tests for new functions.
For example, is there a test that will throw an error when the package is checked, built, or tested if there are any functions in a package without an associated test?
I recognize that there are many potential problems with this. It would be difficult to determine if the test is any good, and it might be a waste of time to test a trivial function. But it would still serve as a reminder that a test should be written, and perhaps it could enforce one or two very basic rules.
Is there a test that I can write that will prevent the package from compiling if there is a function without a test? I am using the testthat package in R.
It doesn't need to be terribly robust, but the ability to exclude functions by name would be useful.
It sounds like you're looking for "code coverage" functionality. The first step here would be to figure out which code is covered by tests and which isn't. Check out section 2.2 of the RUnit Vignette for information on how to do that.
You may end up having to modify the inspect function in order to throw alerts or handle custom events like that, but I think they have the bulk of what you're trying to do already implemented.
Short version
For those that don't want to read through my "case", this is the essence:
What is the recommended way of minimizing the chances of new packages breaking existing code, i.e. of making the code you write as robust as possible?
What is the recommended way of making the best use of the namespace mechanism when
a) just using contributed packages (say in just some R Analysis Project)?
b) with respect to developing own packages?
How best to avoid conflicts with respect to formal classes (mostly Reference Classes in my case) as there isn't even a namespace mechanism comparable to :: for classes (AFAIU)?
The way the R universe works
This is something that's been nagging in the back of my mind for about two years now, yet I don't feel as if I have come to a satisfying solution. Plus I feel it's getting worse.
We see an ever increasing number of packages on CRAN, github, R-Forge and the like, which is simply terrific.
In such a decentralized environment, it is natural that the code base that makes up R (let's say that's base R and contributed R, for simplicity) will deviate from an ideal state with respect to robustness: people follow different conventions, there's S3, S4, S4 Reference Classes, etc. Things can't be as "aligned" as they would be if there were a "central clearing instance" that enforced conventions. That's okay.
The problem
Given the above, it can be very hard to use R to write robust code. Not everything you need will be in base R. For certain projects you will end up loading quite a few contributed packages.
IMHO, the biggest issue in that respect is the way the namespace concept is put to use in R: R allows for simply writing the name of a certain function/method without explicitly requiring it's namespace (i.e. foo vs. namespace::foo).
So for the sake of simplicity, that's what everyone is doing. But that way, name clashes, broken code and the need to rewrite/refactor your code are just a matter of time (or of the number of different packages loaded).
At best, you will know about which existing functions are masked/overloaded by a newly added package. At worst, you will have no clue until your code breaks.
A couple of examples:
try loading RMySQL and RSQLite at the same time, they don't go along very well
also RMongo will overwrite certain functions of RMySQL
forecast masks a lot of stuff with respect to ARIMA-related functions
R.utils even masks the base::parse routine
(I can't recall which functions in particular were causing the problems, but am willing to look it up again if there's interest)
Surprisingly, this doesn't seem to bother a lot of programmers out there. I tried to raise interest a couple of times at r-devel, to no significant avail.
Downsides of using the :: operator
Using the :: operator might significantly hurt efficiency in certain contexts as Dominick Samperi pointed out.
When developing your own package, you can't even use the :: operator throughout your own code as your code is no real package yet and thus there's also no namespace yet. So I would have to initially stick to the foo way, build, test and then go back to changing everything to namespace::foo. Not really.
Possible solutions to avoid these problems
Reassign each function from each package to a variable that follows certain naming conventions, e.g. namespace..foo in order to avoid the inefficiencies associated with namespace::foo (I outlined it once here). Pros: it works. Cons: it's clumsy and you double the memory used.
Simulate a namespace when developing your package. AFAIU, this is not really possible, at least I was told so back then.
Make it mandatory to use namespace::foo. IMHO, that would be the best thing to do. Sure, we would lose some extend of simplicity, but then again the R universe just isn't simple anymore (at least it's not as simple as in the early 00's).
And what about (formal) classes?
Apart from the aspects described above, :: way works quite well for functions/methods. But what about class definitions?
Take package timeDate with it's class timeDate. Say another package comes along which also has a class timeDate. I don't see how I could explicitly state that I would like a new instance of class timeDate from either of the two packages.
Something like this will not work:
new(timeDate::timeDate)
new("timeDate::timeDate")
new("timeDate", ns="timeDate")
That can be a huge problem as more and more people switch to an OOP-style for their R packages, leading to lots of class definitions. If there is a way to explicitly address the namespace of a class definition, I would very much appreciate a pointer!
Conclusion
Even though this was a bit lengthy, I hope I was able to point out the core problem/question and that I can raise more awareness here.
I think devtools and mvbutils do have some approaches that might be worth spreading, but I'm sure there's more to say.
GREAT question.
Validation
Writing robust, stable, and production-ready R code IS hard. You said: "Surprisingly, this doesn't seem to bother a lot of programmers out there". That's because most R programmers are not writing production code. They are performing one-off academic/research tasks. I would seriously question the skillset of any coder that claims that R is easy to put into production. Aside from my post on search/find mechanism which you have already linked to, I also wrote a post about the dangers of warning. The suggestions will help reduce complexity in your production code.
Tips for writing robust/production R code
Avoid packages that use Depends and favor packages that use Imports. A package with dependencies stuffed into Imports only is completely safe to use. If you absolutely must use a package that employs Depends, then email the author immediately after you call install.packages().
Here's what I tell authors: "Hi Author, I'm a fan of the XYZ package. I'd like to make a request. Could you move ABC and DEF from Depends to Imports in the next update? I cannot add your package to my own package's Imports until this happens. With R 2.14 enforcing NAMESPACE for every package, the general message from R Core is that packages should try to be "good citizens". If I have to load a Depends package, it adds a significant burden: I have to check for conflicts every time I take a dependency on a new package. With Imports, the package is free of side-effects. I understand that you might break other people's packages by doing this. I think its the right thing to do to demonstrate a commitment to Imports and in the long-run it will help people produce more robust R code."
Use importFrom. Don't add an entire package to Imports, add only those specific functions that you require. I accomplish this with Roxygen2 function documentation and roxygenize() which automatically generates the NAMESPACE file. In this way, you can import two packages that have conflicts where the conflicts aren't in the functions you actually need to use. Is this tedious? Only until it becomes a habit. The benefit: you can quickly identify all of your 3rd-party dependencies. That helps with...
Don't upgrade packages blindly. Read the changelog line-by-line and consider how the updates will affect the stability of your own package. Most of the time, the updates don't touch the functions you actually use.
Avoid S4 classes. I'm doing some hand-waving here. I find S4 to be complex and it takes enough brain power to deal with the search/find mechanism on the functional side of R. Do you really need these OO feature? Managing state = managing complexity - leave that for Python or Java =)
Write unit tests. Use the testthat package.
Whenever you R CMD build/test your package, parse the output and look for NOTE, INFO, WARNING. Also, physically scan with your own eyes. There's a part of the build step that notes conflicts but doesn't attach a WARN, etc. to it.
Add assertions and invariants right after a call to a 3rd-party package. In other words, don't fully trust what someone else gives you. Probe the result a little bit and stop() if the result is unexpected. You don't have to go crazy - pick one or two assertions that imply valid/high-confidence results.
I think there's more but this has become muscle memory now =) I'll augment if more comes to me.
My take on it :
Summary : Flexibility comes with a price. I'm willing to pay that price.
1) I simply don't use packages that cause that kind of problems. If I really, really need a function from that package in my own packages, I use the importFrom() in my NAMESPACE file. In any case, if I have trouble with a package, I contact the package author. The problem is at their side, not R's.
2) I never use :: inside my own code. By exporting only the functions needed by the user of my package, I can keep my own functions inside the NAMESPACE without running into conflicts. Functions that are not exported won't hide functions with the same name either, so that's a double win.
A good guide on how exactly environments, namespaces and the likes work you find here:
http://blog.obeautifulcode.com/R/How-R-Searches-And-Finds-Stuff/
This definitely is a must-read for everybody writing packages and the likes. After you read this, you'll realize that using :: in your package code is not necessary.
I'm developing an major upgrade to the R package, and as part of the changes I want to start using the S3 methods so I can use the generic plot, summary and print functions. But I think I'm not totally sure I understand why and when to use generic functions in general.
For example, I currently have a function called logLikSSM, which computes the log-likelihood of a state space model. Instead of using this functions, I could make function logLik.SSM or something like that, as there is generic function logLik in R. The benefit of this would be that logLik is shorter to write than logLikSSM, but is there really any other point in this?
Similar case, there is a generic function called simulate in stats package, so in theory I could use that instead of simulateSSM. But now the description of the simulate function tells that function is used to "Simulate Responses", but my function actually simulates the hidden states, so it really doesn't fit into the description of the simulate function. So probably in this case I shouldn't use the generic function right?
I apologize if this question is too vague for here.
The advantages of creating methods for generics from the core of R include:
Ease of Use. Users of your package already familiar with those generics will have less to remember making it easier to use your package. They might even be able to do a certain amount without reading the documentation. If you come up with your own names then they must discover and remember new names which is an added cognitive burden.
Leverage Existing Functionality. Also any other functions that make use of generics you create methods for can then automatically use yours as well; otherwise, they would have to be changed. For example, AIC uses logLik.
A disadvantage is that the generic involves the extra level of dispatch and if logLik is in the inner loop of an optimization there could be an impact (although possibly not material). In that case you could check the performance of calling the generic vs. calling the method directly and use the latter if it makes a significant difference.
Regarding the case that your function has a completely different purpose than the generic in the core of R, then it might be more confusing than helpful so you might, in that case, not create a method but have your own function name.
You might want to read the zoo Design manual (see link to zoo Design under Vignettes near the bottom of that page) which discusses the design ideas that went into the zoo package. These include the idea being discussed here.
EDIT: Added disadvantates.
good question.
I'll split your Question into two parts; here's the first one:
i]s there really any other point in [making functions generic]?
Well, this pattern is usually invoked when the develper doesn't know the object class for every object he/she expects a user to pass in to the method under consideration.
And because of this uncertainty, this design pattern (which is called overloading in many other languages) is invokved, and which requires R to evaluate the object class, then dispatch that object to the appropriate method given the object type.
The second part of your Question: [i]n this case I shouldn't use [the generic function] right?
To try to give you an answer useful beyond the detail of your Question, consider what happens to the original method when you call setGeneric, passing that method in.
the original function body is replaced with code for performing a top-level dispatch based on type of object passed in. This replaces the original function body, which just slides down one level so that it becomes the default method that the top level (generic) function dispatches to.
showMethods() will let you see all of those methods which are called by the newly created dispatch function (generic function).
And now for one huge disadvantage:
Ease of MISUse:
Users of your package already familiar with those generics might do a certain amount without reading the documentation.
And therein lies the fallacy that components, reusable objects, services, etc are an easy panacea for all software challenges.
And why the overwhelming majority of software is buggy, bloated, and operates inconsistently with little hope of tech support being able to diagnose your problem.
There WAS a reason for static linking and small executables back in the day. But this generation of code now, get paid now, debug later if ever, before the layoffs/IPO come, has no memory of the days when code actually worked very reliably and installation/integration didn't require 200$/hr Big 4 consultants or hackers who spend a week trying to get some "simple" open source product installed and productively running.
But if you want to continue the tradition of writing ever shorter function/method names, be my guest.
I want to look into rcpp to improve the speed of some of my R code without having to resort to messy C++ code (I've had some success with that, but it looks like code from hell).
So, I checked the documentation provided with Rcpp, and also the bundle of documents provided at Dirk Eddelbuettel's site. I installed and looked at RcppExamples, but (at least from its documentation) most of these refer to RcppClassic?. Besides that, I did some googling but that didn't result in answers to what seem like basic questions.
Do indexes in Rcpp work zero-based or one-based
List provides both operator() and
operator[], but apparently not
operator[[]]. It is not clear which
ones are similar to [] and [[]] in R.
Is there any support for factors in Rcpp (there does not appear to be any)?
Note: in fact I found some answers from the first example in Rcpp-introduction.pdf, but that just felt like luck.
Also, my stl is very rusty, so if anybody can provide me with a simple example where each element of a List is (e.g.) print-ed with an stl-style loop, that would be neat.
If anybody wants to call me an idiot for not finding this information: go ahead and make your day. Then make mine and point me to the docs I need :-)
As a suggestions to Mr. Eddelbuettel and other Rcpp authors (I expect some of them to read this): the class hierarchies and the like, provided by doxygen, are really neat when you are already kneedeep into Rcpp, but for a beginner (in Rcpp), I am more interested in a list of 'this method in this class does this like that function in R' rather than 'you can find the declaration of this operator in this header file'. After all, I understand one of the goals of Rcpp is to lower the threshold for using C++ in R? Note: from what I have seen and understood, I highly value the actual code of Rcpp and have the highest respect for its creators. If the lack of basic documentation is merely a result of 'lack of resources', I would be willing to become a resource (e.g.: work on 'basic' documentation once I get through it myself).
I do not quite know where to start answering this but here is a quick attempt:
The package has a website. The website lists the documentation.
The package has eight (8) vignettes. They are clearly listed. They are mostly meant to be read as documentation, some more introductory and some more advanced. Some (such as the unit testing output) are more of a quality-control iniative.
There is a vignette called Rcpp-introduction. We refer to it repeatedly. We suggest you read it. This is now also a peer-reviewed and published paper which may lend it even more credibility.
There is a vignette called Rcpp-FAQ. It's first question is "How do I get started?" which points to the aforementioned Rcpp-introduction.
There is a mailing list dedicated to project, you could actually read the archive.
We have given numerous talks, slides are available as is a 90 minute recording of a Google Tech Talk.
Even StackOverflow has a tag for it: [rcpp]. You could read the earlier posts.
There are over two dozen packages clearly listed on the CRAN page for Rcpp as using it. You could read their source code.
All that said, Rcpp cannot be used instead of C++ so if you do not know or understand that operator[[]] cannot exist in C++ we cannot help you either. This is not a magic fairy, or R-to-C++ code compiler. Rather, its focus is to make it much easier to get to C++ code from R, and in some cases even manages to improve on C++ practice. In essence, it tries to be "super-additive": the combination R and C++ should be more than either in isolation.
Lastly, I do grant you that the RcppExamples packages -- which by the way covers the old and new API -- could use more examples. However, its sourecs give good porting hints from old ("classic") to the new and current API.
But there is only so much documentation we can write ourselves. I myself find the above bullet points quite exhaustive. You may have honed in on the weakest element part of the chain though. That is bad luck. Please do try some of other pointers listed here.