What is the idiomatic way to check and document function preconditions and postconditions in R?

I would consider Eiffel's built-in require and ensure constructs or D's in and out blocks state of the art here, but most languages don't have these baked in, so they have developed best practices to approximate them.
By checking preconditions, I mean throwing some kind of runtime exception if data that doesn't meet the function's assumptions is passed in. I currently have a stopifnot statement for every precondition I can think of at the start of the function. The same applies to postconditions, but with respect to the return value rather than the parameters.
Furthermore, is there a standard way of documenting what these preconditions and postconditions are? For example, it is pretty standard to spell these out in JavaDoc comments in Java.
What are the best practices in R in this respect?

S4 classes support validity checking: the checks are stored with the class definition via 'setValidity'. See for example:
http://www.r-project.org/conferences/useR-2004/Keynotes/Leisch.pdf

See ?stopifnot or, for friendlier error messages at the cost of more verbose code, use if (condition) stop("...message...").

In terms of documentation I would recommend you take a look at the roxygen2 package. It is comparable to JavaDoc and Doxygen in that it stores the documentation in the source file together with the code. There are a number of items that need to be defined, e.g.:
What are the input arguments
What does the function return
But this does not stop you from creating your own items that document the pre- and postconditions. For more info on roxygen2, see CRAN or StackOverflow.

Related

Documentation of internal helper function using `roxygen2`

I am new to creating my own packages and I am using roxygen2.
I am creating a package with a lot of internal helper functions and I was wondering if I have to document all of them. I understand the importance of documentation, but some functions are fairly simple and are just wrappers around other functions for convenience. I have done a basic search of the web but I don't seem to be able to find a definitive answer.
Any help is appreciated.
It depends what you mean by "have to". One interpretation is, "Do I have to document these functions to pass checks?" The answer to that question is no. As long as the function isn't exported from the package, R CMD check won't require that you document it.
Another interpretation is "Do I have to document it to help myself in maintaining this package?" That question is harder to answer. Some functions are so obvious that they don't really need any documentation beyond their name, e.g. a print method with no extra arguments beyond those of the generic.
Other functions aren't so obvious, or have arguments whose meaning isn't obvious. It's a good idea to document those if you plan to maintain your package for a long time, because you might forget the details between now and whenever a problem arises. And if you are releasing your package to others, you should plan on long term maintenance, because if it is useful, people will use it.

No Global Contract available for procedure / function

I've got a procedure within a SPARK module that calls the standard Ada.Text_IO.Put_Line.
During proving I get the following warning: no Global contract available for "Put_Line".
I already know how to add the respective data dependency contract to procedures and functions written by myself, but how do I add them to procedures / functions written by others where I can't edit the source files?
I looked through sections 5.2 and 7.4 of the AdaCore SPARK 2014 user's guide but didn't find an example with a solution to my problem.
This means that the analyzer cannot "see" whether global variables might be affected when this function is called. It therefore assumes this call is not modifying anything (otherwise all other proofs could be refuted immediately). This is likely a valid assumption for your specific example, but it might not be valid on an embedded system, where a custom implementation of Put_Line might do anything.
There are two ways to convey the missing information:
The verifier can examine the source code of the function and then try to generate global contracts itself.
Global contracts are specified explicitly; see RM 6.1.4 (http://docs.adacore.com/spark2014-docs/html/lrm/subprograms.html#global-aspects)
In this case, the procedure you are calling is part of the run-time system (RTS), and therefore the source is not visible, and you probably cannot/should not change it.
What to do in practice?
Suppressing warnings is almost never a good idea, especially not when you are working on something safety-critical. Usually the code has to be changed until the warning goes away, or some justification process has to start.
If you are serious about the analysis results, I recommend not using such subprograms. If you really need output there, either write your own procedure that replaces the RTS subprogram, or ensure that the subprogram really has no side effects. This is further backed up by what Frédéric has linked: even if the callee has no side effects, you don't know whether it raises an exception for specific inputs (e.g., very long strings).
If you are not so serious about the results, then you can consider this specific one as a warning that you could live with.
Wrapper packages for use in development of SPARK applications may be found here:
https://github.com/joakim-strandberg/aida_2012
I think you just can't add SPARK contracts to code you don't own, especially code from the Ada standard.
About Text_IO, I found something that may be valuable to you in the reference manual.
EDIT
Another solution, in addition to what Martin said and according to the book "Building High Integrity Applications with SPARK", is to create a wrapper package.
As SPARK requires you to deal with SPARK packages but allows you to depend on a SPARK spec with an Ada body, the solution is to build a SPARK package wrapping your Ada.Text_IO calls.
It might be tedious, as you will have to wrap possible exceptions, possibly define specific types, and so on, but this way you'll be able to discharge VCs on your full SPARK package.

When to use Pragma Pure/Preelaborate

Is there a set of general rules/guidelines that can help to understand when to prefer pragma Pure, pragma Preelaborate, or something else entirely? The rules and definitions presented in the standard (Ada 2012), are a little heavy-going and I'd be grateful to read something that's a little more clear and geared towards the average case.
If I wanted to be thorough without fully understanding the "why" of it, can I simply try:
Mark the package spec with pragma Pure;
If it doesn't compile, try pragma Preelaborate;
If that fails, then I've done something tricky and either need to pragma Elaborate units on a with-by-with basis, or rethink the package layout.
This might work (does it?), since it's recommended to mark a package as Pure whenever possible (likewise with Preelaborate), but it seems a bit brain-damaged and I'd prefer to understand the process a bit better.
pragma Pure
You should use this on any package which does not have an internal state. It tells the user of the package that calls to any subprograms cannot have side effects, because there is no internal state they could change. So a function declared at library level inside a pure package will always return the same result when called with the same parameters.
The Ada implementation is allowed to cache return values of functions of a pure package, and to omit calls to subroutines if their return values won't be used because of these requirements. However, you can violate the constraints by calling imported subroutines (e.g. from a C library) inside your pure package (these may change some internal state which the Ada compiler doesn't know of). If you're evil, you can even import Ada subroutines from other parts of the software with pragma Import to bypass the requirements of pragma Pure. Needless to say: If you're doing anything like this, don't use pragma Pure.
Edit: To clarify the circumstances when calls may be omitted, let me quote the ARM:
If a library unit is declared pure, then the implementation is permitted to omit a call on a library-level subprogram of the library unit if the results are not needed after the call. Similarly, it may omit such a call and simply reuse the results produced by an earlier call on the same subprogram, provided that none of the parameters are of a limited type, and the addresses and values of all by-reference actual parameters, and the values of all by-copy-in actual parameters, are the same as they were at the earlier call. This permission applies even if the subprogram produces other side effects when called.
GNAT, for example, additionally defines that any subroutines that take a parameter of type System.Address or a type derived from it are not considered pure even if they are defined in a pure package, because the location the address points to may be altered, but GNAT does not know what kind of structure the address points to and therefore cannot run any checks about whether the referenced value of the parameter has been changed.
pragma Preelaborate
This tells the compiler that the package won't execute any code at elaboration time (i.e. before the main procedure starts executing). At elaboration time, the following constructs will execute:
Initialization of library-level variables (this can be a function call)
Initialization of tasks declared at library level (they may start executing before the main procedure does)
Statements in a begin ... end block at library level
You generally should avoid these things if you don't need them. Use pragma Preelaborate wherever possible; it tells callers that they can safely use the package without executing anything at elaboration time.
If something doesn't compile with one of these pragmas when you think it should, look into why it doesn't compile. It may help you discover problems with your package implementation or structure. Don't just drop the pragma when it doesn't compile. As the constraint affects possible constraints on any packages that depend on yours, you should always choose the strictest applicable pragma.
Elaboration Order Handling in GNAT is a helpful guide. Ideally, the standard rules will suffice for most programs. The pragmas tell the compiler to substitute your elaboration order. They should be applied to solve specific problems, rather than used empirically.
Addendum: @ajb underscores an important distinction among the pragmas. The article cited agrees with the approach outlined in the question (bullets one and two): "Consequently a good rule is to mark units as Pure or Preelaborate if possible, and if this is not possible, mark them as Elaborate_Body if possible." It goes on to discuss situations (bullet three) "where neither of these three pragmas can be used."

Using generic functions of R, when and why?

I'm developing a major upgrade to the R package, and as part of the changes I want to start using S3 methods so I can use the generic plot, summary and print functions. But I'm not totally sure I understand why and when to use generic functions in general.
For example, I currently have a function called logLikSSM, which computes the log-likelihood of a state space model. Instead of using this function, I could write a function logLik.SSM or something like that, as there is a generic function logLik in R. The benefit of this would be that logLik is shorter to write than logLikSSM, but is there really any other point in this?
In a similar case, there is a generic function called simulate in the stats package, so in theory I could use that instead of simulateSSM. But the description of the simulate function says that the function is used to "Simulate Responses", whereas my function actually simulates the hidden states, so it really doesn't fit the description of the simulate function. So probably in this case I shouldn't use the generic function, right?
I apologize if this question is too vague for here.
The advantages of creating methods for generics from the core of R include:
Ease of Use. Users of your package already familiar with those generics will have less to remember making it easier to use your package. They might even be able to do a certain amount without reading the documentation. If you come up with your own names then they must discover and remember new names which is an added cognitive burden.
Leverage Existing Functionality. Also any other functions that make use of generics you create methods for can then automatically use yours as well; otherwise, they would have to be changed. For example, AIC uses logLik.
A disadvantage is that the generic involves the extra level of dispatch and if logLik is in the inner loop of an optimization there could be an impact (although possibly not material). In that case you could check the performance of calling the generic vs. calling the method directly and use the latter if it makes a significant difference.
Regarding the case that your function has a completely different purpose than the generic in the core of R, then it might be more confusing than helpful so you might, in that case, not create a method but have your own function name.
You might want to read the zoo Design manual (see link to zoo Design under Vignettes near the bottom of that page) which discusses the design ideas that went into the zoo package. These include the idea being discussed here.
EDIT: Added disadvantages.
good question.
I'll split your Question into two parts; here's the first one:
[i]s there really any other point in [making functions generic]?
Well, this pattern is usually invoked when the developer doesn't know the object class for every object he/she expects a user to pass in to the method under consideration.
And because of this uncertainty, this design pattern (which is called overloading in many other languages) is invoked, which requires R to evaluate the object class and then dispatch that object to the appropriate method given the object type.
The second part of your Question: [i]n this case I shouldn't use [the generic function] right?
To try to give you an answer useful beyond the detail of your Question, consider what happens to the original method when you call setGeneric, passing that method in.
The original function body is replaced with code that performs a top-level dispatch based on the type of object passed in. The original body just slides down one level, so that it becomes the default method that the top-level (generic) function dispatches to.
showMethods() will let you see all of those methods which are called by the newly created dispatch function (generic function).
And now for one huge disadvantage:
Ease of MISUse:
Users of your package already familiar with those generics might do a certain amount without reading the documentation.
And therein lies the fallacy that components, reusable objects, services, etc are an easy panacea for all software challenges.
And why the overwhelming majority of software is buggy, bloated, and operates inconsistently with little hope of tech support being able to diagnose your problem.
There WAS a reason for static linking and small executables back in the day. But this generation of "code now, get paid now, debug later if ever, before the layoffs/IPO come" has no memory of the days when code actually worked very reliably and installation/integration didn't require $200/hr Big 4 consultants or hackers who spend a week trying to get some "simple" open source product installed and productively running.
But if you want to continue the tradition of writing ever shorter function/method names, be my guest.

How to hand over variables to a function? With an array or variables?

When I refactor my functions for new needs, I stumble from time to time upon the crucial question:
Shall I add another variable with a default value? Or shall I use only one array, where I'm able to add an additional variable without breaking the API?
Unless you need to support a flexible number of variables, I think it's best to explicitly identify each parameter. In most cases you can add an overloaded method that has a different signature to support the extra parameter while still supporting the original method signature. If you use an array for passing variables it just makes it too confusing for users of your API. Obviously there are some inputs that lend themselves to an array (a list of points in a polygon, a list of account IDs you wish to perform an action on, etc.) but if it's not a variable that you would reasonably expect to be an array or list, you should pass it into the method as a separate parameter.
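As a minimal sketch of the explicit-parameter approach, here it is in Python, which has default values rather than overloads, so the new parameter is added with a default instead of a second signature. The function name polygon_perimeter and the closed flag are made up for illustration; the list of points is one of the cases where a collection is the natural input.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]  # hypothetical alias, for illustration only


def polygon_perimeter(points: Sequence[Point], closed: bool = True) -> float:
    """Sum the edge lengths of a polyline or polygon.

    `points` is naturally a list, so it stays a single collection argument.
    `closed` is the later addition: its default keeps the original behaviour,
    so existing calls of the form polygon_perimeter(points) are unaffected.
    """
    # Lengths of the edges between consecutive points.
    total = sum(math.dist(a, b) for a, b in zip(points, points[1:]))
    # Optionally add the closing edge from the last point back to the first.
    if closed and len(points) > 2:
        total += math.dist(points[-1], points[0])
    return total
```

Old callers keep writing polygon_perimeter(points); only callers that care about open paths pass closed=False.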
Just like many questions in programming, the right answer is "it depends".
To take Javascript/jQuery as an example, one good rule of thumb is whether the parameter will be required each time the function is called or whether it is optional. For example, the main jQuery function itself requires an expression to determine what element(s) the operation will affect:
jQuery(expression)
It makes no sense to try to pass this parameter as part of an array as it will be required every time this function is called.
On the other hand, many jQuery plugins require several miscellaneous parameters that may be optional. By convention, these are passed as parameters via an 'options' array. As you said, this provides a nice interface as new parameters can be added without affecting the existing API. This makes the API clean as well since the user can ignore those options that are not applicable.
In general, when several parameters are involved, passing them as an array is a nice convention, as many of them are almost certainly going to be optional. This would have helped clean up many Win32 APIs, although it is more difficult to deal with arrays in C/C++ than in Javascript.
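The required-versus-optional split described above can be sketched roughly like this. It is written in Python with a dict standing in for the 'options' array; the function animate and every option name are invented for the example and are not part of jQuery or any real plugin.

```python
# Hypothetical defaults for an imaginary plugin; none of these names come from jQuery.
DEFAULT_OPTIONS = {"duration_ms": 400, "easing": "swing", "loop": False}


def animate(selector, options=None):
    """Return the effective settings for `selector`.

    `selector` is needed on every call, so it is a plain positional parameter.
    Everything optional travels in `options`, and new keys can be added to
    DEFAULT_OPTIONS later without touching existing callers.
    """
    options = dict(options or {})
    unknown = set(options) - set(DEFAULT_OPTIONS)
    if unknown:
        # Reject typos early instead of silently ignoring them.
        raise ValueError(f"unknown option(s): {sorted(unknown)}")
    # Caller-supplied values override the defaults.
    return {"selector": selector, **DEFAULT_OPTIONS, **options}
```

The check for unknown keys is a design choice of this sketch: an options container is easy to extend, but it also makes misspelled keys easy to miss unless something validates them.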
It depends on the programming language used.
If you have a run-of-the-mill OO language, you should use an object that you can easily extend, if you are really concerned about API consistency.
If that doesn't matter that much, there is the option of changing the method signature and overloading the method with more / different parameters.
If your language doesn't support either and you want the API to be binary stable, use an array.
There are several considerations that must be made.
Where is the function used? - Only in code you created? One place or hundreds of places? The amount of work that will need to be done to maintain existing code is important. Remember to include the amount of time it will take to communicate to other programmers that may currently be using your function.
How critical is the new parameter? - Do you want to require it to be used? If it has a default value, will that default value break existing use of the function in any subtle ways?
Ease of comprehension - How many parameters are already passed into the function? The larger the number, the more confusing and error prone it will be. Code Complete recommends that you restrict the number of parameters to 7 or fewer. If you need more than that, you should try to abstract some or all of the related parameters into one object.
Other special considerations - Do you want to optimize your efforts for any special conditions such as code speed or size? Are there any special considerations that must be taken into account for your execution environment? Keep in mind your goals for the project and make sure you aren't working against them with whatever design choice you make.
In his book Code Complete, Steve McConnell decrees that a function should never have more than 7 arguments, and rarely even that many. He presents compelling arguments - that I can't cite from memory, alas.
Clean Code, more recently, advocates even fewer arguments.
So unless the number of things to pass is really small, they should be passed in an enveloping structure. If they're homogeneous, an array. If not, then a reasonably lightweight object should be built for the purpose.
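A "reasonably lightweight object" could look something like the following Python sketch. The class ReportRequest, its fields, and describe_report are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportRequest:
    """Hypothetical envelope for parameters that belong together."""
    account_id: str
    start: str
    end: str
    # Fields added later get defaults, so existing constructions keep working.
    currency: str = "USD"
    include_pending: bool = False


def describe_report(request: ReportRequest) -> str:
    """Take one structured argument instead of five loose ones."""
    pending = "with" if request.include_pending else "without"
    return (f"{request.account_id}: {request.start}..{request.end} "
            f"in {request.currency}, {pending} pending")
```

Unlike a plain array, the fields here are heterogeneous and each one has a name, so the call site stays readable as the structure grows.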
You should do neither. Just add the parameter and change all callers to supply the proper default value. The reason is that parameters with default values can only be at the end, so you will not be able to add any more required parameters anywhere in the parameter list without risking misinterpretation.
These are the critical steps to disaster:
1. add one or two parameters with defaults
2. some callers will supply it, and some will rely on defaults.
[half a year passed]
3. add a required parameter (before them)
4. change all callers to supply the required parameter
5. get a phone call, or some other event, which makes you forget to change one of the call sites from step 2
6. now your program compiles perfectly, but is invalid.
Unfortunately, in function call semantics we usually don't have a chance to say, by name, which value goes where.
An array is also not a proper solution. An array should be used as a collection of similar objects on which a uniform operation is performed. As they say here, if it's worth refactoring, it's worth refactoring now.
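The failure sequence listed above can be reproduced in a few lines. The sketch below uses Python, and all function and parameter names (send_v1, send_v2, send_safe, priority, retries, timeout) are invented for the example. Python happens to be one of the languages that does let a call say by name which value goes where, so the last function shows how keyword-only parameters turn the silent bug into a loud error; in a language without named arguments that option is not available.

```python
def send_v1(recipient, retries=3, timeout=10):
    """Step 1: the original function with two defaulted parameters."""
    return {"recipient": recipient, "retries": retries, "timeout": timeout}


def send_v2(recipient, priority, retries=3, timeout=10):
    """Step 3: a required parameter is inserted before the defaults."""
    return {"recipient": recipient, "priority": priority,
            "retries": retries, "timeout": timeout}


def send_safe(recipient, *, priority, retries=3, timeout=10):
    """Same parameters, but everything after `*` must be passed by name."""
    return {"recipient": recipient, "priority": priority,
            "retries": retries, "timeout": timeout}


# Step 2: a caller that supplied the optional value positionally.
old_call = send_v1("ops", 5)  # meant retries=5

# Steps 5 and 6: the same call, never updated, still runs against send_v2,
# but 5 is now interpreted as the priority and retries falls back to 3.
stale_call = send_v2("ops", 5)

# With keyword-only parameters the stale call fails loudly instead.
try:
    send_safe("ops", 5)
    stale_call_rejected = False
except TypeError:
    stale_call_rejected = True
```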
