Why and how to effectively test beta versions of R as a normal user?

This question is inspired by a remark of Duncan Murdoch on the r-devel mailing list, in response to a bug report about Sweave:
This is fixed in R-patched. (It would
have been fixed in 2.12.0 if more
people tested the betas...).
Honestly, I've stayed away from beta (aka development) versions for a number of reasons, and I hear the same reasons from other people:
- I am a bit horrified that it would somehow cause conflicts with my current R distribution. As I need it for work, having to repair it regularly would be a loss of time I can't explain to my boss.
- I wouldn't have a clue how to test efficiently. I reckon every test I could come up with has already been run by the development team.
- I still find it difficult to figure out when something is a bug, and when (most often) it is my own stupidity kicking in.
But as I understand it, testing would be a valuable contribution to the R community, and I'm willing to do my bit if I can fit it into my own work somehow. I was thinking of keeping the beta on the side and running my scripts through it as well, as a checkup. Saving the constructed objects allows a quick and easy all.equal() to see if something is wrong.
Does anybody have more or better ideas on how I could help with testing with a minimum amount of effort and a maximum amount of efficiency?
I'd also like to promote this a bit more in our department. Apart from "it's time to give back to the community", are there any other good reasons why testing betas is worth the effort? How can I counter the arguments given above?
Edit:
As Dirk Eddelbuettel pointed out in the comments, part of the deal is sorting out the path variables on Windows. I have some ideas on that, but pointers on how to practically organize your computer for testing R-devel versions are greatly appreciated as well.
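For reference, the side-by-side checkup described above (run the scripts under both versions, then compare saved objects) could look something like this sketch; the file names and the use of saveRDS() are my own suggestion, not an established procedure:

    ## Under the release version of R: run the script and save what it produces.
    source("my_analysis.R")                    # hypothetical script that builds 'results'
    saveRDS(results, "results_release.rds")

    ## Under the beta / patched R: run the same script and compare to the baseline.
    source("my_analysis.R")
    baseline <- readRDS("results_release.rds")
    all.equal(baseline, results)               # TRUE, or a description of the differences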

I fear you misunderstand. This may not be straightforward or obvious at first so maybe this helps:
"patched" is not "beta". Patched is what R 2.12.1 will be.
There is no conflict. It drops in for 2.12.0.
It is a separate download, and a nightly build available from here.
This is not r-devel but r-patched.
It is our duty as users to test pre-releases as well. So if anything, in an ideal world you would have R-patched installed --- as well as R-devel!
Testing can be as easy as installing another version, keeping it outside your path and then adjusting PATH and R_HOME dynamically from a script. Testing means running it on your code and data, to prevent you from getting bitten by bugs once the new code is released.
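A minimal sketch of that approach on Windows, assuming the patched build was simply installed next to the regular one (the paths below are assumptions, not an official layout):

    ## Run an existing script under a separately installed R-patched, without
    ## touching the default installation on the PATH.
    patched_rscript <- "C:/R/R-patched/bin/Rscript.exe"   # hypothetical location

    ## Keep the patched library tree separate so the two installations never mix;
    ## the child process started below inherits this setting.
    Sys.setenv(R_LIBS_USER = "C:/R/R-patched/library")

    status <- system2(patched_rscript, args = shQuote("my_analysis.R"))
    if (status != 0) warning("my_analysis.R failed under R-patched")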

I wouldn't have a clue how to test efficiently. I reckon every test I could come up with has already been run by the development team.
I still find it difficult to figure out when something is a bug, and when (most often) it is my own stupidity kicking in.
The problem is, software is not (or not only) going to be used by developers. It is going to be used by people who may have no programming knowledge at all (I'm speaking generally; this is valid for R as well as for any other software).
If the help, the interface or the general way the software is built does not give you enough information on how to do something, well, that is maybe not a bug, but it is something that can be improved (and pointed out to the devs).
Also, remember that the developers wrote the software. They know how to use it, and often they will be biased towards testing it mainly by using it correctly and seeing whether it gives the right result, rather than by "trying to break it".
By using it in YOUR way (which may well be "incorrect"), you are effectively running tests that may have escaped the developers, simply because they were not thinking of using it the way you did.

Related

Is it possible to compile R scripts into a binary?

I've done some research online but I haven't been able to come up with any answer. I know this has been asked at least thrice, as I've viewed those posts, linked here:
First Question
Second Question
Third Question
However, it's been 5, 7, and 9 years since those questions have been asked, and technology is obviously rapidly evolving :) I don't know much about R, and I haven't worked with it for a long time, and so I ask those of you who know better and have more experience if you know of anything that would be useful to me.
If there's nothing that exists now, how hard would it be to create? The reason I ask is that the company I work for would like to obfuscate the proprietary code before it goes out. I would have the full 40 hours a week to work on creating it, and so time and/or difficulty isn't a major concern.
Thanks!
Found this: I'm not sure about the security, but it is definitely a deterrent and would take (I think) some fairly concentrated effort to crack. There is a byte code compiler for R, based on the paper linked below. The compiler package, which comes standard with R, has a function that compiles an R script to byte code. The same package can then load the compiled files so you can use them as you'd like.
A Byte Code Compiler for R
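Concretely, the byte-code route mentioned above might look like the following sketch (file names are made up; note that this only obscures the source, it is not encryption):

    library(compiler)

    ## Compile the plain R source file into a byte-code (.Rc) file.
    cmpfile("proprietary.R", "proprietary.Rc")

    ## Later, or on the client's machine, load and evaluate the compiled file
    ## without shipping the readable source.
    loadcmp("proprietary.Rc")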

What is the Purpose of Deprecated Code?

I was changing out some PHP code the other day because it was deprecated, and no longer worked. I understand the meaning of deprecated code based on an answer I found here: https://stackoverflow.com/a/8111799/1810777
But several questions came to mind:
- What is the purpose of deprecating code?
- Why not just leave it in use, instead of recommending others to use new alternatives?
- Does it slow stuff down?
I couldn't find anything else online that talked about it. I'm just really wondering why code that used to work well isn't useful anymore. Thanks!
It means that in a future release it's going to be removed.
This allows an API developer to give people time to migrate to the new version / method of doing whatever rather than just pulling the rug out from under them. Both the new and the old versions are available for a limited time.
As for why not leave it there forever... because there's a new, better way to do it. You can't support legacy code forever (if you value your sanity and your budget). All support has a cost (be that tech-support hours, bug fixes, regression testing, etc.).
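In R, for example, that grace period is typically signalled with base R's .Deprecated() helper; the function names below are invented for illustration:

    ## The old entry point keeps working for a release or two, but warns callers
    ## to migrate before it is eventually removed.
    old_summary <- function(x) {
      .Deprecated("new_summary")   # emits a deprecation warning
      new_summary(x)
    }

    new_summary <- function(x) summary(x)

    old_summary(rnorm(10))  # still returns a result, plus a warning to switch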

R Quality Assurance Techniques

Could you provide some insight into the techniques you use to ensure the quality of your solutions? For example, sometimes I like to test my result using stopifnot() to ensure I'm not receiving ridiculous results. Are there any other techniques or functions that you use in data processing to ensure that you're getting the solution you meant to?
Note: I realize that this is a broad question and perhaps a candidate for community wiki or even closure, but rather than voting to close, perhaps assist me by adding comments to direct the conversation.
Just a few things that come to mind (in random order):
This page has very interesting links for debugging in R (OK, this is during production, but still related to your issue, I think).
You can use exceptions, as explained in this discussion (and the links therein).
You can write tests with known results (both for success and failure) and see that they actually do what they are supposed to do. Be sure to pass some weird data to the functions and see how they behave in a "not-so-normal" situation.
Don't just rely on automated tests: give your functions to a fairly computer-illiterate person at work (not so illiterate that he/she can't use R, though!) and let him/her do some beta testing. You'll be amazed at the number of errors he/she will come up with! :)
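As a sketch of the "stopifnot() plus weird data" idea from the question and the points above (the function is invented for illustration):

    ## Make the assumptions about the input explicit, so bad data stops the run
    ## instead of silently producing a ridiculous result.
    safe_mean_ratio <- function(x, y) {
      stopifnot(is.numeric(x), is.numeric(y), length(y) > 0, all(y != 0))
      mean(x) / mean(y)
    }

    ## Feed it "not-so-normal" data and check that it fails the way you expect.
    tryCatch(safe_mean_ratio(1:5, character(0)),
             error = function(e) message("caught as intended: ", conditionMessage(e)))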
Quality in software engineering is quite a massive area, and most of it applies to code written in R as much as code written in Cobol or C#, so my first answer would be 'it depends'.
For me, I come from the pharmaceutical industry, where what we do is regulated by government agencies like the FDA and the MHRA. For us, quality is something we think about throughout the process, so I would list the following as visible artifacts of quality:
We have a software development process, that's written down and repeatable (traditionally in this kind of industry this is a waterfall style, but more and more agile / prototyping style methodologies are being used)
We have a system that ensures every person involved knows what they should be doing (job descriptions) and is suitably qualified to do that job (training)
We start by defining what is required in some way, hopefully in some way that can be tested
We have some way of documenting our development process, where we've been and how (a combination of good documentation and Source Control)
We do testing wherever possible, and as early as possible (so, automated if possible)
We have people who are responsible for overseeing quality, separate from the people doing the work, to prevent conflicts of interest
We control the software environment that is used for development, testing and production (read; change control)
We control and manage software once it is in use, tracking issues and managing them (Issue Tracking)
We keep records, so that even if every person involved went under a bus / won the lottery the new people could still defend and prove everything above to a government inspector.
However, that's a big list, and I imagine there are lots of industries that don't do all of them (finance, education) and probably some who do more (building nuclear reactors, saving lives, NASA).
More specifically to what I assume you're getting at: before you code, you should be able to define some specific starting inputs and the answers you should get out, and I recommend you use something like RUnit or testthat to build these in.
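For example, the "known inputs, known answers" approach with testthat could look like this sketch (the function and the expected values are made up):

    library(testthat)

    ## The function under test: a toy summary statistic.
    iqr_ratio <- function(x) IQR(x) / diff(range(x))

    test_that("iqr_ratio behaves on known input", {
      expect_equal(iqr_ratio(1:5), 0.5)   # IQR(1:5) is 2, the range width is 4
      expect_error(iqr_ratio("a"))        # non-numeric input should fail, not return nonsense
    })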

How can I contribute to base R in small ways?

Occasionally I see small ways I could improve either R itself (recently the IQR command) or the R documentation (just this week, perhaps elaborating the differences among, and better interconnecting, aggregate, tapply, and by). But I don't see a way to really make that contribution back. I looked into the developer site, and it seems that my options are either to attempt to become a full-fledged developer or to create packages, neither of which fits what I wish to accomplish.
I did propose IQR changes on the R mailing list but got no response so I figure that's going nowhere.
And to clarify, I'm talking about base-R. Additional packages are another matter.
Any tips?
Send (or CC) to r-devel. Traffic is quite high on r-help, and things can be overlooked there.
File a bug under the wishlist category detailing the improvement you would like to see.
Having filed the bug, try to provide a patch against the R code and or documentation as appropriate. I've done this before where there was a problem or infelicity in R, supplied a patch and a fix to the help files/manual and had the changes accepted (after suitable modification) by R Core.
If it is an addition to the R code base, you are going to have to show that there is a real pressing need for the addition. Basically you are asking R Core to maintain your code in perpetuity, and they are unlikely to do that unless you can demonstrate a need.
If it is an addition, look for a popular R package that does similar/related things and suggest to the package maintainer that they include your function. That way you don't need to start a whole package for something simple, but you can still contribute your code. There are several popular *misc packages on CRAN, for example.
If you want to contribute fixes to the R documentation and/or manuals, provide patches to the sources. You can find the sources at svn.r-project.org/R
Hopefully that gives you some ideas. Patches and code always help!
How about patches to existing packages?
How about open bug reports on packages? R-Forge projects don't seem to use the issue trackers much, but some folks on the RPostgreSQL team I'm on enabled it (where it is hosted on Google Code), and it has been helpful -- see here. And we had a really useful inflow of fresh blood with a rocking new developer from Japan, probably in part because of the visibility of the project there.
In essence, try to find a project / group / team to become acquainted with and join. In that sense, this is just like any other Open Source project. The r-devel list (gmane view) is a good place for R development in general.
The R Core team, on the other hand, is a little more closed, by invitation only, and unlikely to change. So be it, for better or worse. It has worked so far, and hence I am not among those who bemoan this loudly.

Is it good practice to update R packages often? [closed]

I started to use R a little while ago and am not sure how often to update the installed packages (at this time, I'm using mostly ggplot2 and rattle). On one hand there's the typical geek impulse to have the latest version :-) On the other, updates can break functionality, and as an R beginner I don't want to waste time looking into package incompatibilities and reinstalling libraries; it's almost certain I wouldn't notice any difference with an improved package.
With other applications I have a sense developed from experience on how often to upgrade, how much to wait between the release of an upgrade and installing it and so on. But I'm in the dark with regards to R.
And to be clear: I'm not talking about R itself, but its libraries.
Thanks.
Here is my philosophy: the naïve user never updates. The sophisticated user always updates. The power user updates often, but carefully.
Mindless updating is not always beneficial. Bugs work their way into updated versions of R libraries (or R itself!), and you could break your existing code by updating without reading the change log or commit history. For example, R 2.11 broke lme4 on OS X... it pays to update carefully and run the demos of packages between releases. It really sucks to update to a new library or R release and realize something broke when you have a deadline.
Yes it is.
Why exactly would you want to hang on to old bugs and missing features?
The question is already answered, but I'll offer my 2 cents. In an organization, updating R should be treated like updating gcc or Java: with warnings, with a staging area, a rollback plan, etc. Others' work and results may be affected. [See update #2]
I am more impulsive about updating R packages. As long as you can reproduce the state of your system at any point in time, there's little to worry about. Ensuring that nightly backups occur should be the domain of your sysadmin.
The basic idea is that you should be able to reproduce everything. Actually testing that your earlier results are reproduced is dependent on whether you want to disprove your assumption that there are no bugs or changes that will affect later results. :)
Update 1. As has been mentioned in the comments and above, updating in a production environment, or in any environment where stability is essential (e.g. bugs are either known or not significant), should be done quite carefully, because it can introduce new bugs, new dependencies, different output, or any variety of other software regressions. Moreover, where you're updating from matters a lot: updating from R-Forge is more likely to expose you to the newest bugs than updating from CRAN. Even so, I have found and reported bugs that persisted through 3+ versions of a package on CRAN, as well as other regressions that just magically appeared. I test a lot, but updating, finding new bugs, and debugging is an effort that I don't always want to (or have time to) undertake.
I am reminded of this question after finding a new bug in a new version of a package that I use a lot. Just to check, I reverted to an earlier version - no more crashes, though tracking down the cause took a couple of hours, because I assumed it was not arising in this package. I'll send a note to the maintainer before long, so others won't have to stumble on the same bug.
Update 2. In an organization, I have to say that the answer is no. In fact, in any case where there may be two or more concurrent instances of R, it is very unwise to blindly update the packages while another may be using them. There are likely to be good methods for hot-swapping packages, I just don't yet know them. Keep in mind that the two instances need only share libraries (i.e. where the packages are stored), and, AFAIK, need not run concurrently on the same machine. Thus, if libraries are placed on shared systems, e.g. over NFS, one may not know where else R is running at the same time, using those libraries. Accidentally killing another R process is not usually a good thing.
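One lightweight way to make the "reproduce the state of your system" idea concrete before updating is to record what is installed first; this is a suggestion, not an established convention:

    ## Record the currently installed package versions so a bad update can be
    ## traced and, if needed, rolled back later.
    snapshot <- installed.packages()[, c("Package", "Version")]
    write.csv(snapshot, sprintf("pkg-versions-%s.csv", Sys.Date()), row.names = FALSE)

    ## Review what would change before committing to it.
    old.packages()                # installed packages that have a newer version on CRAN
    update.packages(ask = TRUE)   # confirm each update interactively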
Yes, unless you have a good reason not to (see my comment to Dirk)
Although some of the following has been mentioned in previous answers, I think it might be beneficial to make a few things explicit. As a developer, I think that updating packages often (and R-devel, for that matter) is good practice. You definitely want to stick with the latest out there. If your package imports/depends/suggests... interacts with other packages, you want to ensure interoperability on a day-to-day basis, and not face the 'bugs' just before release, when time is short.
On the other hand, some environments will put special emphasis on exact reproducibility. In that case, one may want to adopt a more careful strategy with updating.
But it is worth emphasising that these two behaviours are not exclusive. It is possible to install different versions of R and maintain different libraries, to benefit from a bleeding edge development environment and a more stable one for production.
Hope this helps.
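A minimal sketch of the "two libraries side by side" setup described above; the directory name is an assumption:

    ## Keep a bleeding-edge library next to the stable one and switch per session.
    dir.create("~/R/library-devel", recursive = TRUE, showWarnings = FALSE)
    .libPaths("~/R/library-devel")   # put the development library first for this session
    install.packages("ggplot2")      # installs into ~/R/library-devel only
    .libPaths()                      # confirm which library trees are active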
I'd be inclined to respond as often as you need to, and never when you're in a hurry!
Firstly, debate the chances that you're labouring under a bug of which you are unaware. I would moot that is quite rare. If you're suffering under a bug and there's a newer version in which the bug is fixed, plan an upgrade. If you want a new feature, plan an upgrade. If it's your first day back after Christmas and the biggest overhead is trying to remember what you were actually doing last, then the overhead of messing about with some new dependency requirements (which may include system components outside of R) is probably relatively small, so consider seeing what updates are available (guess what I did today) ;-)
The golden rule is probably that there isn't a single, recommended schedule other than what makes sense for your use; daily updates will inevitably result in fewer updates each time and thus minimize the pain of the actual update, but it's not worth it if you constantly get different numerical results from one day to the next because of some change to how a function does sampling (different numerical results have plagued Coursera students using caret). Don't underestimate the value of a stable system that allows you to just get on with productive work rather than faffing.

Resources