Hybrid handler for dplyr - r

I'm writing an hybrid handler for dplyr, and I'm wondering two things about the code in dplyr.cpp:
The option na.rm is used as a template and not passed as a parameter to the classes Sd, Var, Sum etc. What's the reason?
What does the line TAG(arg3) == R_NaRmSymbol (line 54) mean?

Although I'm not the author of the code, here's my best guesses at answers to your questions:
The option na.rm is used as a template and not passed as a parameter to the classes Sd, Var, Sum etc. What's the reason?
Likely for run-time efficiency -- dplyr tries to move computation from run-time to compile-time when possible. This is often accomplished through template usage.
What does the line TAG(arg3) == R_NaRmSymbol (line 54) mean?
Nodes in an R pairlist have a TAG attribute, which usually refers to the name of the formal. Hence, dplyr uses that to find the formal with the name na.rm. R caches many of the often-used symbols in src/main/names.c -- you should see NaRmSymbol in there.
So, effectively, the code finds the actual argument value associated with the formal na.rm, and does stuff based on its truthiness.

Related

Pipe with additional Arguments

I read in several places that pipes in Julia only work with functions that take only one argument. This is not true, since I can do the following:
function power(a, b = 2) a^b end
3 |> power
> 9
and it works fine.
However, I but can't completely get my head around the pipe. E.g. why is this not working?? :
3 |> power()
> MethodError: no method matching power()
What I would actually like to do is using a pipe and define additional arguments, e.g. keyword arguments so that it is actually clear which argument to pass when piping (namely the only positional one):
function power(a; b = 2) a^b end
3 |> power(b = 3)
Is there any way to do something like this?
I know I could do a work-around with the Pipe package, but to honest it feels kind of clunky to write #pipe at the start of half of the lines.
In R the magritrr package has convincing logic (in my opinion): it passes what's left of the pipe by default as the first argument to the function on the right - I'm looking for something similar.
power as defined in the first snippet has two methods. One with one argument, one with two. So the point about |> working only with one-argument methods still holds.
The kind of thing you want to do is called "partial application", and very common in functional languages. You can always write
3 |> (a -> power(a, 3))
but that gets clunky quickly. Other language have syntax like power(%1, 3) to denote that lambda. There's discussion to add something similar to Julia, but it's difficult to get right. Pipe is exactly the macro-based fix for it.
If you have control over the defined method, you can also implement methods with an interface that return partially applied versions as you like -- many predicates in Base do this already, e.g., ==(1). There's also the option of Base.Fix2(power, 3), but that's not really an improvement, if you ask me (apart from maybe being nicer to the compiler).
And note that magrittrs pipes are also "macro"-based. The difference is that argument passing in R is way more complicated, and you can't see from outside whether an argument is used as a value or as an expression (essentially, R passes a thunk containing the expression and a pointer to the parent environment, and automatically evaluates and caches it if you use it as a value; see substitute)

What does the jq notation <function>/<number> mean?

In various web pages, I see references to jq functions with a slash and a number following them. For example:
walk/1
I found the above notation used on a stackoverflow page.
I could not find in the jq Manual page a definition as to what this notation means. I'm guessing it might indicate that the walk function that takes 1 argument. If so, I wonder why a more meaningful notation isn't used such as is used with signatures in C++, Java, and other languages:
<function>(type1, type2, ..., typeN)
Can anyone confirm what the notation <function>/<number> means? Are other variants used?
The notation name/arity gives the name and arity of the function. "arity" is the number of arguments (i.e., parameters), so for example explode/0 means you'd just write explode without any arguments, and map/1 means you'd write something like map(f).
The fact that 0-arity functions are invoked by name, without any parentheses, makes the notation especially handy. The fact that a function name can have multiple definitions at any one time (each definition having a distinct arity) makes it easy to distinguish between them.
This notation is not used in jq programs, but it is used in the output of the (new) built-in filter, builtins/0.
By contrast, in some other programming languages, it (or some close variant, e.g. module:name/arity in Erlang) is also part of the language.
Why?
There are various difficulties which typically arise when attempting to graft a notation that's suitable for languages in which method-dispatch is based on types onto ones in which dispatch is based solely on arity.
The first, as already noted, has to do with 0-arity functions. This is especially problematic for jq as 0-arity functions are invoked in jq without parentheses.
The second is that, in general, jq functions do not require their arguments to be any one jq type. Having to write something like nth(string+number) rather than just nth/1 would be tedious at best.
This is why the manual strenuously avoids using "name(type)"-style notation. Thus we see, for example, startswith(str), rather than startswith(string). That is, the parameter names in the documentation are clearly just names, though of course they often give strong type hints.
If you're wondering why the 'name/arity' convention isn't documented in the manual, it's probably largely because the documentation was mostly written before jq supported multi-arity functions.
In summary -- any notational scheme can be made to work, but name/arity is (1) concise; (2) precise in the jq context; (3) easy-to-learn; and (4) widely in use for arity-oriented languages, at least on this planet.

How to restore a over-written built-in function in Julia

The question may apply for other languages, too.
If I use a built-in function name as a variable name,
I can restore the function by doing:
all = 123
all = Base.all
But If I define, for example, a custom function sum() and then I do,
sum = Base.sum
I got an error saying "invalid redefinition of constant sum"
Is there a way to restore a built-in function if I over-wrote it? Or is that impossible by design?
For this example, you could just redefine sum as Base.sum:
sum(x) = Base.sum(x)
Is that what you would like?
NB. this may not "overwrite" your definition of sum. If it uses type parameters (e.g. sum(x::Vector) it may still be dispatched in preference to the general sum(x), in which case you would need to repeat the above for those specific methods.
If this is simply a problem for you when you are working in the REPL, and you don't mind losing all your other definitions, you could do workspace() to reset Main.

How to not fall into R's 'lazy evaluation trap'

"R passes promises, not values. The promise is forced when it is first evaluated, not when it is passed.", see this answer by G. Grothendieck. Also see this question referring to Hadley's book.
In simple examples such as
> funs <- lapply(1:10, function(i) function() print(i))
> funs[[1]]()
[1] 10
> funs[[2]]()
[1] 10
it is possible to take such unintuitive behaviour into account.
However, I find myself frequently falling into this trap during daily development. I follow a rather functional programming style, which means that I often have a function A returning a function B, where B is in some way depending on the parameters with which A was called. The dependency is not as easy to see as in the above example, since calculations are complex and there are multiple parameters.
Overlooking such an issue leads to difficult to debug problems, since all calculations run smoothly - except that the result is incorrect. Only an explicit validation of the results reveals the problem.
What comes on top is that even if I have noticed such a problem, I am never really sure which variables I need to force and which I don't.
How can I make sure not to fall into this trap? Are there any programming patterns that prevent this or that at least make sure that I notice that there is a problem?
You are creating functions with implicit parameters, which isn't necessarily best practice. In your example, the implicit parameter is i. Another way to rework it would be:
library(functional)
myprint <- function(x) print(x)
funs <- lapply(1:10, function(i) Curry(myprint, i))
funs[[1]]()
# [1] 1
funs[[2]]()
# [1] 2
Here, we explicitly specify the parameters to the function by using Curry. Note we could have curried print directly but didn't here for illustrative purposes.
Curry creates a new version of the function with parameters pre-specified. This makes the parameter specification explicit and avoids the potential issues you are running into because Curry forces evaluations (there is a version that doesn't, but it wouldn't help here).
Another option is to capture the entire environment of the parent function, copy it, and make it the parent env of your new function:
funs2 <- lapply(
1:10, function(i) {
fun.res <- function() print(i)
environment(fun.res) <- list2env(as.list(environment())) # force parent env copy
fun.res
}
)
funs2[[1]]()
# [1] 1
funs2[[2]]()
# [1] 2
but I don't recommend this since you will be potentially copying a whole bunch of variables you may not even need. Worse, this gets a lot more complicated if you have nested layers of functions that create functions. The only benefit of this approach is that you can continue your implicit parameter specification, but again, that seems like bad practice to me.
As others pointed out, this might not be the best style of programming in R. But, one simple option is to just get into the habit of forcing everything. If you do this, realize you don't need to actually call force, just evaluating the symbol will do it. To make it less ugly, you could make it a practice to start functions like this:
myfun<-function(x,y,z){
x;y;z;
## code
}
There is some work in progress to improve R's higher order functions like the apply functions, Reduce, and such in handling situations like these. Whether this makes into R 3.2.0 to be released in a few weeks depend on how disruptive the changes turn out to be. Should become clear in a week or so.
R has a function that helps safeguard against lazy evaluation, in situations like closure creation: forceAndCall().
From the online R help documentation:
forceAndCall is intended to help defining higher order functions like apply to behave more reasonably when the result returned by the function applied is a closure that captured its arguments.

prolog recursion

am making a function that will send me a list of all possible elemnts .. in each iteration its giving me the last answer .. but after the recursion am only getting the last answer back .. how can i make it give back every single answer ..
thank you
the problem is that am trying to find all possible distributions for a list into other lists .. the code
addIn(_,[],Result,Result).
addIn(C,[Element|Rest],[F|R],Result):-
member( Members , [F|R]),
sumlist( Members, Sum),
sumlist([Element],ElementLength),
Cap is Sum + ElementLength,
(Cap =< Ca,
append([Element], Members,New)....
by calling test .. am getting back all the list of possible answers .. now if i tried to do something that will fail like
bp(3,11,[8,2,4,6,1,8,4],Answer).
it will just enter a while loop .. more over if i changed the
bp(NB,C,OL,A):-
addIn(C,OL,[[],[],[]],A);
bp(NB,C,_,A).
to and instead of Or .. i get error :
ERROR: is/2: Arguments are not
sufficiently instantiated
appreciate the help ..
Thanks alot #hardmath
It sounds like you are trying to write your own version of findall/3, perhaps limited to a special case of an underlying goal. Doing it generally (constructing a list of all solutions to a given goal) in a user-defined Prolog predicate is not possible without resorting to side-effects with assert/retract.
However a number of useful special cases can be implemented without such "tricks". So it would be helpful to know what predicate defines your "all possible elements". [It may also be helpful to state which Prolog implementation you are using, if only so that responses may include links to documentation for that version.]
One important special case is where the "universe" of potential candidates already exists as a list. In that case we are really asking to find the sublist of "all possible elements" that satisfy a particular goal.
findSublist([ ],_,[ ]).
findSublist([H|T],Goal,[H|S]) :-
Goal(H),
!,
findSublist(T,Goal,S).
findSublist([_|T],Goal,S) :-
findSublist(T,Goal,S).
Many Prologs will allow you to pass the name of a predicate Goal around as an "atom", but if you have a specific goal in mind, you can leave out the middle argument and just hardcode your particular condition into the middle clause of a similar implementation.
Added in response to code posted:
I think I have a glimmer of what you are trying to do. It's hard to grasp because you are not going about it in the right way. Your predicate bp/4 has a single recursive clause, variously attempted using either AND or OR syntax to relate a call to addIn/4 to a call to bp/4 itself.
Apparently you expect wrapping bp/4 around addIn/4 in this way will somehow cause addIn/4 to accumulate or iterate over its solutions. It won't. It might help you to see this if we analyze what happens to the arguments of bp/4.
You are calling the formal arguments bp(NB,C,OL,A) with simple integers bound to NB and C, with a list of integers bound to OL, and with A as an unbound "output" Answer. Note that nothing is ever done with the value NB, as it is not passed to addIn/4 and is passed unchanged to the recursive call to bp/4.
Based on the variable names used by addIn/4 and supporting predicate insert/4, my guess is that NB was intended to mean "number of bins". For one thing you set NB = 3 in your test/0 clause, and later you "hardcode" three empty lists in the third argument in calling addIn/4. Whatever Answer you get from bp/4 comes from what addIn/4 is able to do with its first two arguments passed in, C and OL, from bp/4. As we noted, C is an integer and OL a list of integers (at least in the way test/0 calls bp/4).
So let's try to state just what addIn/4 is supposed to do with those arguments. Superficially addIn/4 seems to be structured for self-recursion in a sensible way. Its first clause is a simple termination condition that when the second argument becomes an empty list, unify the third and fourth arguments and that gives "answer" A to its caller.
The second clause for addIn/4 seems to coordinate with that approach. As written it takes the "head" Element off the list in the second argument and tries to find a "bin" in the third argument that Element can be inserted into while keeping the sum of that bin under the "cap" given by C. If everything goes well, eventually all the numbers from OL get assigned to a bin, all the bins have totals under the cap C, and the answer A gets passed back to the caller. The way addIn/4 is written leaves a lot of room for improvement just in basic clarity, but it may be doing what you need it to do.
Which brings us back to the question of how you should collect the answers produced by addIn/4. Perhaps you are happy to print them out one at a time. Perhaps you meant to collect all the solutions produced by addIn/4 into a single list. To finish up the exercise I'll need you to clarify what you really want to do with the Answers from addIn/4.
Let's say you want to print them all out and then stop, with a special case being to print nothing if the arguments being passed in don't allow a solution. Then you'd probably want something of this nature:
newtest :-
addIn(12,[7, 3, 5, 4, 6, 4, 5, 2], Answer),
format("Answer = ~w\n",[Answer]),
fail.
newtest.
This is a standard way of getting predicate addIn/4 to try all possible solutions, and then stop with the "fall-through" success of the second clause of newtest/0.
(Added) Suggestions about coding addIn/4:
It will make the code more readable and maintainable if the variable names are clear. I'd suggest using Cap instead of C as the first argument to addIn/4 and BinSum when you take the sum of items assigned to a "bin". Likewise Bin would be better where you used Members. In the third argument to addIn/4 (in the head of the second clause) you don't need an explicit list structure [F|R] since you never refer to either part F or R by itself. So there I'd use Bins.
Some of your predicate calls don't accomplish much that you cannot do more easily. For example, your second call to sumlist/2 involves a list with one item. Thus the sum is just the same as that item, i.e. ElementLength is the same as Element. Here you could just replace both calls to sumlist/2 with one such call:
sumlist([Element|Bin],BinSum)
and then do your test comparing BinSum with Cap. Similarly your call to append/3 just adjoins the single item Element to the front of the list (I'm calling) Bin, so you could just replace what you have called New with [Element|Bin].
You have used an extra pair of parentheses around the last four subgoals (in the second clause for addIn/4). Since AND is implied for all the subgoals of this clause, using the extra pair of parentheses is unnecessary.
The code for insert/4 isn't shown now, but it could be a source of some unintended "backtracking" in special cases. The better approach would be to have the first call (currently to member/2) be your only point of indeterminacy, i.e. when you choose one of the bins, do it by replacing it with a free variable that gets unified with [Element|Bin] at the next to last step.

Resources