what data structure does model formula operator in R create? - r

The R Language Definition makes several mentions of the model formula operator, but fails to define or explain anything about the formula class.
I am having a hard time finding anything that documents the semantics of the ~ operator from either official or unofficial sources.
In particular, I am not interested in information like is provided in the formula function documentation ("An expression of the form y ~ model is interpreted as a specification that the response y is modelled by a linear predictor specified symbolically by model.") or usage scenarios, I'd like to understand what kind of data structure I am creating when using it and how I could inspect and dissect it on the REPL.

Don't know if this helps, but: it's a language object — i.e. R parses the input but doesn't try to evaluate any of the components — with class "formula"
> f <- a ~ b + (c + d)
> str(f)
Class 'formula' language a ~ b + (c + d)
..- attr(*, ".Environment")=<environment: R_GlobalEnv>
If you want to work with these objects, you need to know that it is essentially stored as a tree, where the parent node, an operator or function (~, +, () , can be extracted as the first element and the child nodes (as many as the 'arity' of the function/operator) can be extracted as elements 2..n, i.e.
f[[1]] is ~
f[[2]] is a (the first argument, i.e. the LHS of the formula)
f[[3]] is b + (c+d)
f[[3]][[1]] is +
f[[3]][[2]] is b
... and so on.
The chapter on Expressions in Hadley Wickham's Advanced R gives a more complete description.
This is also discussed (more opaquely) in the R Language Manual, e.g.
Expression objects
Direct manipulation of language objects
#user2554330 points out that formulas also typically have associated environments; that is, they carry along a pointer to the environment in which they were created

Related

R: How can I find the source code for the R formula parser or interpreter?

I am looking for the code of the base R formula parser or interpreter, that translates the formula the user types into the variables and transformations used to bridge the data to the model matrix. A number of packages have their own formula interpreters that supplement or replace the base R interpreter, e.g. rmutil, gamlss.nl, ttBulk.
At a minimum, the following symbols have a distinct meaning in the formula context. I am looking for the code that implements that meaning.
~, 1, 0, +, -, *, /, :, ^, ., |, I, %in%
In addition, the functions below seem to be used mainly within the formula context, but I am not sure if they operate in a distinct way in that context. Some may have meaning only in model-fitting functions beyond lm or from particular packages. In some cases I am not sure that they have a meaning outside of the formula context.
C
||
poly
offset
strata
cluster
contrasts
ns
lo
bs
s
What I really want is an expository piece or tutorial at a level of detail that would let me figure out, e.g. which of the operations above commute, which are distributive over which other ones, which ones have inverse operations. But I gather that no such exposition exists.
I'd also like to get a complete list of functions that mean something different inside a formula, if such can be had. There is nothing in the R Language Definition or in R Internals about these special meanings, and, e.g., methods("|") gives me methods for hex and octal. The best discussion I have seen is still Statistical Models in S, Chap. 2, sect. 2.3.1, but I believe this is incomplete and maybe also not currant.

Define new operator for tilde / formula

I'm trying to code a new operator, a double tilde ~~, to denote a different kind of formula to be passed onto another function (e.g., mirroring the functionality of ~~ in the lavaan package lavaan syntax).
The issue is y ~~ x returns y ~ ~x, where the second ~ is returned with the predictors.
I am at a total loss here. It seems ~ is a primitive function .Primitive("~") with no methods, unlike, say, +. So existing tutorials for S3 methods are useless.
Is this a dead end and am I doing something really against the programming language? Or is there an easy solution I am missing?
I guess, if you accept the comment, I can make an answer out of it:
~ is an operator in R like +,-, /,*. Although it is possible to use many kinds of characters for your variables using ticks `xxx` and qoute "xxx" you also need to access them with ticks (see ?Reserved). (I'm gonna use quotes instead of ticks here, but consider using ticks for a more accepted style guide.)
R is a functional programming language and therefore you can access every single language statement as a function, e.g. a + b is the same as "+"(a, b). When you write a + b it is just syntactic sugar - language-wise it is translated into a primitive function call with two arguments.
To complicate things, there is an order of evaluation. So if you write a~~b it gets translated into "~"(a, ~b). It is because ~ is a primitive operator desiged as a sigle character. You still can define the function "~~" <- function(a,b) {a + b}, but you can only call it by "~~"(a,b) directly for it to work.
On the other hand, you need to be able to specify how a binary operator looks like. Having defined a function "asdf" <- function(a,b) {a + b} is not enough and this will not work: a asdf b
R has something to define binary operators (R: What are operators like %in% called and how can I learn about them?), see large portion of binary operators used like in magrittr's %>% or doParallel's %dopar%. Thus it is better to stick to the binary operator syntax using %, i.e. <tick>%~~%<tick> <- function(a,b) {a+b}. Then you can easily access it by using syntactic sugar a %~~% b.
Strange stuff, I agree. As for magic tricks: try this at home "for"(a, 1:10, {print(a)}). Bonus question: why is a visible in the parent frame ?

R attribute ".Environment" consuming large amounts of RAM in nnet package

I have a piece of code that that is using the nnet package and I am interested in calculating a number of different neural network models & then saving all the models to disk (with save() ).
The issue that I am running into is that the "terms" elements in the neural network has an attribute ".Environment" that ends up being hundreds of megabytes whereas the rest of the model is only a few kilobytes. (once the fitted values & residuals are deleted)
Further, deleting the ".Environment" attribute doesn't appear to cause a problem in terms of using the model with 'predict'.
Does anyone have any idea what either R or nnet is doing with this attribute? Has anyone seen anything like this?
tl;dr: this is OK, except for some very special cases
Background
The .Environment attribute in R contains a reference to the context in which an R closure (usually a formula or a function) was defined. An R environment is a store holding values of variables, similarly to a list. This allows the formula to refer to these variables, for example:
> f = function(g) return(y ~ g(x))
> form = f(exp)
> lm(form, list(y=1:10, x=log(1:10)))
...
Coefficients:
(Intercept) g(x)
3.37e-15 1.00e+00
In this example, the formula form if defined as y~exp(x), by giving g the value of exp. In order to be able to find the value of g (which is an argument to function f), the formula needs to hold a reference to the environment constructed inside the call to function f.
You can see the enviroment attached to a formula by using the attributes() or environment() functions as follows:
> attributes(form)
$class
[1] "formula"
$.Environment
<environment: R_GlobalEnv>
> environment(form)
<environment: R_GlobalEnv>
Your question
I believe you are using the nnet() function variant with a formula (rather than matrices), i.e.
> nnet(y ~ x1 + x2, ...)
Unfortunately, R keeps the entire environment (including all the variables defined where your formula is defined) allocated, even if your formula does not refer to any of it. There is no way to the language to easily tell what you may or may not be using from the environment.
One solution is to explicitly retain only the required parts of the environment. In particular, if your formula does not refer to anything in the environment (which is the most common case), it is safe to remove it.
I would suggest removing the environment from your formula before you call nnet, something like this:
form = y~x + z
environment(form) = NULL
...
result = nnet(form, ...)

How are functions curried?

I understand what the concept of currying is, and know how to use it. These are not my questions, rather I am curious as to how this is actually implemented at some lower level than, say, Haskell code.
For example, when (+) 2 4 is curried, is a pointer to the 2 maintained until the 4 is passed in? Does Gandalf bend space-time? What is this magic?
Short answer: yes a pointer is maintained to the 2 until the 4 is passed in.
Longer than necessary answer:
Conceptually, you're supposed to think about Haskell being defined in terms of the lambda calculus and term rewriting. Lets say you have the following definition:
f x y = x + y
This definition for f comes out in lambda calculus as something like the following, where I've explicitly put parentheses around the lambda bodies:
\x -> (\y -> (x + y))
If you're not familiar with the lambda calculus, this basically says "a function of an argument x that returns (a function of an argument y that returns (x + y))". In the lambda calculus, when we apply a function like this to some value, we can replace the application of the function by a copy of the body of the function with the value substituted for the function's parameter.
So then the expression f 1 2 is evaluated by the following sequence of rewrites:
(\x -> (\y -> (x + y))) 1 2
(\y -> (1 + y)) 2 # substituted 1 for x
(1 + 2) # substituted 2 for y
3
So you can see here that if we'd only supplied a single argument to f, we would have stopped at \y -> (1 + y). So we've got a whole term that is just a function for adding 1 to something, entirely separate from our original term, which may still be in use somewhere (for other references to f).
The key point is that if we implement functions like this, every function has only one argument but some return functions (and some return functions which return functions which return ...). Every time we apply a function we create a new term that "hard-codes" the first argument into the body of the function (including the bodies of any functions this one returns). This is how you get currying and closures.
Now, that's not how Haskell is directly implemented, obviously. Once upon a time, Haskell (or possibly one of its predecessors; I'm not exactly sure on the history) was implemented by Graph reduction. This is a technique for doing something equivalent to the term reduction I described above, that automatically brings along lazy evaluation and a fair amount of data sharing.
In graph reduction, everything is references to nodes in a graph. I won't go into too much detail, but when the evaluation engine reduces the application of a function to a value, it copies the sub-graph corresponding to the body of the function, with the necessary substitution of the argument value for the function's parameter (but shares references to graph nodes where they are unaffected by the substitution). So essentially, yes partially applying a function creates a new structure in memory that has a reference to the supplied argument (i.e. "a pointer to the 2), and your program can pass around references to that structure (and even share it and apply it multiple times), until more arguments are supplied and it can actually be reduced. However it's not like it's just remembering the function and accumulating arguments until it gets all of them; the evaluation engine actually does some of the work each time it's applied to a new argument. In fact the graph reduction engine can't even tell the difference between an application that returns a function and still needs more arguments, and one that has just got its last argument.
I can't tell you much more about the current implementation of Haskell. I believe it's a distant mutant descendant of graph reduction, with loads of clever short-cuts and go-faster stripes. But I might be wrong about that; maybe they've found a completely different execution strategy that isn't anything at all like graph reduction anymore. But I'm 90% sure it'll still end up passing around data structures that hold on to references to the partial arguments, and it probably still does something equivalent to factoring in the arguments partially, as it seems pretty essential to how lazy evaluation works. I'm also fairly sure it'll do lots of optimisations and short cuts, so if you straightforwardly call a function of 5 arguments like f 1 2 3 4 5 it won't go through all the hassle of copying the body of f 5 times with successively more "hard-coding".
Try it out with GHC:
ghc -C Test.hs
This will generate C code in Test.hc
I wrote the following function:
f = (+) 16777217
And GHC generated this:
R1.p[1] = (W_)Hp-4;
*R1.p = (W_)&stg_IND_STATIC_info;
Sp[-2] = (W_)&stg_upd_frame_info;
Sp[-1] = (W_)Hp-4;
R1.w = (W_)&integerzmgmp_GHCziInteger_smallInteger_closure;
Sp[-3] = 0x1000001U;
Sp=Sp-3;
JMP_((W_)&stg_ap_n_fast);
The thing to remember is that in Haskell, partially applying is not an unusual case. There's technically no "last argument" to any function. As you can see here, Haskell is jumping to stg_ap_n_fast which will expect an argument to be available in Sp.
The stg here stands for "Spineless Tagless G-Machine". There is a really good paper on it, by Simon Peyton-Jones. If you're curious about how the Haskell runtime is implemented, go read that first.

Derivative Calculator

I'm interested in building a derivative calculator. I've racked my brains over solving the problem, but I haven't found a right solution at all. May you have a hint how to start? Thanks
I'm sorry! I clearly want to make symbolic differentiation.
Let's say you have the function f(x) = x^3 + 2x^2 + x
I want to display the derivative, in this case f'(x) = 3x^2 + 4x + 1
I'd like to implement it in objective-c for the iPhone.
I assume that you're trying to find the exact derivative of a function. (Symbolic differentiation)
You need to parse the mathematical expression and store the individual operations in the function in a tree structure.
For example, x + sin²(x) would be stored as a + operation, applied to the expression x and a ^ (exponentiation) operation of sin(x) and 2.
You can then recursively differentiate the tree by applying the rules of differentiation to each node. For example, a + node would become the u' + v', and a * node would become uv' + vu'.
you need to remember your calculus. basically you need two things: table of derivatives of basic functions and rules of how to derivate compound expressions (like d(f + g)/dx = df/dx + dg/dx). Then take expressions parser and recursively go other the tree. (http://www.sosmath.com/tables/derivative/derivative.html)
Parse your string into an S-expression (even though this is usually taken in Lisp context, you can do an equivalent thing in pretty much any language), easiest with lex/yacc or equivalent, then write a recursive "derive" function. In OCaml-ish dialect, something like this:
let rec derive var = function
| Const(_) -> Const(0)
| Var(x) -> if x = var then Const(1) else Deriv(Var(x), Var(var))
| Add(x, y) -> Add(derive var x, derive var y)
| Mul(a, b) -> Add(Mul(a, derive var b), Mul(derive var a, b))
...
(If you don't know OCaml syntax - derive is two-parameter recursive function, with first parameter the variable name, and the second being mathched in successive lines; for example, if this parameter is a structure of form Add(x, y), return the structure Add built from two fields, with values of derived x and derived y; and similarly for other cases of what derive might receive as a parameter; _ in the first pattern means "match anything")
After this you might have some clean-up function to tidy up the resultant expression (reducing fractions etc.) but this gets complicated, and is not necessary for derivation itself (i.e. what you get without it is still a correct answer).
When your transformation of the s-exp is done, reconvert the resultant s-exp into string form, again with a recursive function
SLaks already described the procedure for symbolic differentiation. I'd just like to add a few things:
Symbolic math is mostly parsing and tree transformations. ANTLR is a great tool for both. I'd suggest starting with this great book Language implementation patterns
There are open-source programs that do what you want (e.g. Maxima). Dissecting such a program might be interesting, too (but it's probably easier to understand what's going on if you tried to write it yourself, first)
Probably, you also want some kind of simplification for the output. For example, just applying the basic derivative rules to the expression 2 * x would yield 2 + 0*x. This can also be done by tree processing (e.g. by transforming 0 * [...] to 0 and [...] + 0 to [...] and so on)
For what kinds of operations are you wanting to compute a derivative? If you allow trigonometric functions like sine, cosine and tangent, these are probably best stored in a table while others like polynomials may be much easier to do. Are you allowing for functions to have multiple inputs,e.g. f(x,y) rather than just f(x)?
Polynomials in a single variable would be my suggestion and then consider adding in trigonometric, logarithmic, exponential and other advanced functions to compute derivatives which may be harder to do.
Symbolic differentiation over common functions (+, -, *, /, ^, sin, cos, etc.) ignoring regions where the function or its derivative is undefined is easy. What's difficult, perhaps counterintuitively, is simplifying the result afterward.
To do the differentiation, store the operations in a tree (or even just in Polish notation) and make a table of the derivative of each of the elementary operations. Then repeatedly apply the chain rule and the elementary derivatives, together with setting the derivative of a constant to 0. This is fast and easy to implement.

Resources