I'm trying to code a new operator, a double tilde ~~, to denote a different kind of formula to be passed on to another function (e.g., mirroring the functionality of ~~ in lavaan syntax from the lavaan package).
The issue is that y ~~ x returns y ~ ~x, where the second ~ ends up attached to the predictor side.
I am at a total loss here. It seems ~ is a primitive function .Primitive("~") with no methods, unlike, say, +. So existing tutorials for S3 methods are useless.
Is this a dead end? Am I doing something that fundamentally goes against the language, or is there an easy solution I am missing?
I guess, if you accept the comment, I can make an answer out of it:
~ is an operator in R, like +, -, /, *. Although you can use many kinds of characters in variable names by wrapping them in backticks `xxx` or quotes "xxx", you then also need to access them with backticks (see ?Reserved). (I'm going to use quotes instead of backticks here, but consider using backticks for a more widely accepted style.)
R is a functional programming language, and therefore you can access every single language statement as a function; e.g. a + b is the same as "+"(a, b). When you write a + b it is just syntactic sugar: language-wise it is translated into a primitive function call with two arguments.
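A quick check at the console (plain base R, nothing package-specific):

identical("+"(2, 3), 2 + 3)   # TRUE: the function form and the sugar are the same call
quote(a + b)                  # a + b, stored internally as the call "+"(a, b)
as.list(quote(a + b))         # [[1]] `+`   [[2]] a   [[3]] b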
To complicate things, parsing happens first. If you write a ~~ b, it gets parsed as "~"(a, ~b), because ~ is a primitive operator defined as a single character (and a unary ~ is also valid). You can still define the function "~~" <- function(a, b) {a + b}, but you can only call it directly as "~~"(a, b) for it to work.
On the other hand, you also need a way to tell R what your binary operator looks like. Having defined a function "asdf" <- function(a, b) {a + b} is not enough, and this will not work: a asdf b
R does have a way to define binary operators (see R: What are operators like %in% called and how can I learn about them?); a large number of binary operators are defined this way, e.g. magrittr's %>% or doParallel's %dopar%. Thus it is better to stick to the binary operator syntax using %, i.e. `%~~%` <- function(a, b) {a + b}. Then you can easily access it with the syntactic sugar a %~~% b.
Strange stuff, I agree. As for magic tricks, try this at home: "for"(a, 1:10, {print(a)}). Bonus question: why is a visible in the parent frame?
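Putting it together, a minimal sketch (what %~~% should actually return is up to you and the downstream function; the list() here is just an illustration):

## y ~~ x is parsed as "~"(y, ~x): the second ~ becomes a unary ~ on x
as.list(quote(y ~~ x))       # [[1]] `~`   [[2]] y   [[3]] ~x

## a user-defined binary operator sidesteps the parsing problem entirely
`%~~%` <- function(lhs, rhs) {
  ## capture both sides unevaluated, e.g. to build a two-sided spec later
  list(lhs = substitute(lhs), rhs = substitute(rhs))
}

y %~~% x   # $lhs is the symbol y, $rhs is the symbol x (no evaluation happens)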
The R Language Definition makes several mentions of the model formula operator, but fails to define or explain anything about the formula class.
I am having a hard time finding anything that documents the semantics of the ~ operator from either official or unofficial sources.
In particular, I am not interested in information like is provided in the formula function documentation ("An expression of the form y ~ model is interpreted as a specification that the response y is modelled by a linear predictor specified symbolically by model.") or usage scenarios, I'd like to understand what kind of data structure I am creating when using it and how I could inspect and dissect it on the REPL.
Don't know if this helps, but: it's a language object — i.e. R parses the input but doesn't try to evaluate any of the components — with class "formula"
> f <- a ~ b + (c + d)
> str(f)
Class 'formula' language a ~ b + (c + d)
..- attr(*, ".Environment")=<environment: R_GlobalEnv>
If you want to work with these objects, you need to know that a formula is essentially stored as a tree, where the parent node, an operator or function (~, +, or even ( itself), can be extracted as the first element, and the child nodes (as many as the 'arity' of the function/operator) can be extracted as elements 2..n, i.e.
f[[1]] is ~
f[[2]] is a (the first argument, i.e. the LHS of the formula)
f[[3]] is b + (c+d)
f[[3]][[1]] is +
f[[3]][[2]] is b
... and so on.
The chapter on Expressions in Hadley Wickham's Advanced R gives a more complete description.
This is also discussed (more opaquely) in the R Language Manual, e.g.
Expression objects
Direct manipulation of language objects
@user2554330 points out that formulas also typically have associated environments; that is, they carry along a pointer to the environment in which they were created.
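To see all of this at the console (just recapping the tree indexing above with base R helpers):

f <- a ~ b + (c + d)

class(f)          # "formula"
f[[1]]            # `~`
f[[2]]            # a
f[[3]]            # b + (c + d)
f[[3]][[1]]       # `+`
f[[3]][[3]]       # (c + d)  -- note that the call to `(` is kept in the tree

all.vars(f)       # "a" "b" "c" "d"
environment(f)    # <environment: R_GlobalEnv>, the attached environment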
I am looking for the code of the base R formula parser or interpreter, that translates the formula the user types into the variables and transformations used to bridge the data to the model matrix. A number of packages have their own formula interpreters that supplement or replace the base R interpreter, e.g. rmutil, gamlss.nl, ttBulk.
At a minimum, the following symbols have a distinct meaning in the formula context. I am looking for the code that implements that meaning.
~, 1, 0, +, -, *, /, :, ^, ., |, I, %in%
In addition, the functions below seem to be used mainly within the formula context, but I am not sure if they operate in a distinct way in that context. Some may have meaning only in model-fitting functions beyond lm or from particular packages. In some cases I am not sure that they have a meaning outside of the formula context.
C
||
poly
offset
strata
cluster
contrasts
ns
lo
bs
s
What I really want is an expository piece or tutorial at a level of detail that would let me figure out, e.g. which of the operations above commute, which are distributive over which other ones, which ones have inverse operations. But I gather that no such exposition exists.
I'd also like to get a complete list of functions that mean something different inside a formula, if such a list can be had. There is nothing in the R Language Definition or in R Internals about these special meanings, and, e.g., methods("|") only gives me methods for hex and octal. The best discussion I have seen is still Statistical Models in S, Chap. 2, sect. 2.3.1, but I believe this is incomplete and maybe also not current.
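For concreteness, terms() is as far as I have gotten on my own: its attributes show the meaning given to some of these operators, but not the code that implements that meaning.

attr(terms(y ~ a * b), "term.labels")        # "a" "b" "a:b"   (* crosses the terms)
attr(terms(y ~ (a + b)^2), "term.labels")    # "a" "b" "a:b"   (^ builds interactions up to that order)
attr(terms(y ~ a + b - 1), "intercept")      # 0               (- 1 drops the intercept)
attr(terms(y ~ I(a + b)), "term.labels")     # "I(a + b)"      (I() protects ordinary arithmetic)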
Consider the following R code:
y1 <- dataset %>% dplyr::filter(W == 1)
This works, but there seems to be some magic here. Usually, when we have an expression like foo(bar), we should be able to do this:
baz <- bar
foo(baz)
However, in the presented code snippet, we cannot evaluate W == 1 outside of dplyr::filter()! W is not a defined variable.
What's going on?
dplyr uses a concept called Non-standard Evaluation (NSE) to make columns from the data frame argument accessible to its functions without quoting or using dataframe$column syntax. Basically:
[Non-standard evaluation] is a catch-all term that means they don’t follow the usual R rules of evaluation. Instead, they capture the expression that you typed and evaluate it in a custom way.1
In this case, the custom evaluation takes the argument(s) given to dplyr::filter and parses them so that W can be used to refer to dataset$W. The reason that you can't then take that expression and use it elsewhere is that NSE is only applied within the scope of the function.
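You can see the "capture the expression" part in base R terms (capture below is just a throwaway helper for illustration, not something dplyr exports):

capture <- function(cond) substitute(cond)   # hypothetical helper: grab the argument unevaluated
capture(W == 1)                              # returns the call W == 1, no error even though W is undefined

## dplyr::filter() does something similar, then evaluates the captured call with the
## data frame's columns in scope, roughly like eval(expr, envir = dataset)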
NSE makes a trade-off: functions that capture and re-evaluate their arguments are convenient interactively, but they are less safe and harder to use when programming, i.e. when you're writing functions that call other functions:
This is an example of the general tension between functions that are designed for interactive use and functions that are safe to program with. A function that uses substitute() might reduce typing, but it can be difficult to call from another function.2
For example, if you wanted to write a function which would use the same code, but swap out W == 1 for W == 0 (or some completely different filter), NSE would make that more difficult to accomplish.
In 2017 the tidyverse started to build a solution to this in tidy evaluation.
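For example, with tidy evaluation you can write a wrapper that takes the filtering condition as an argument. A minimal sketch, assuming a reasonably recent dplyr; filter_by and the toy data frame are made-up names for illustration:

library(dplyr)

dataset <- data.frame(W = c(0, 1, 1), x = 1:3)   # toy data, just for the example

## {{ }} ("embracing") defuses the condition and re-evaluates it inside filter(),
## so the caller can pass any expression involving columns of .data
filter_by <- function(.data, cond) {
  dplyr::filter(.data, {{ cond }})
}

filter_by(dataset, W == 1)   # same result as dataset %>% filter(W == 1)
filter_by(dataset, W == 0)   # the condition is swapped without rewriting the function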
As far as I can tell, the only differences are speed and that you have to be a bit trickier in how you define lambda functions.
For instance:
map(lambda x: x + 1, range(4)) == [(lambda x: x + 1)(y) for y in range(4)]
It seems to me like the second way is more pythonic, but I am not sure why.
EDIT:
Yes, I understand that the lambda would be excluded in the second example; I was just trying to show code that was as equivalent as possible.
The right way to do this would be
[y + 1 for y in range(4)]
No need to construct a lambda function here. Your code would unnecessarily build a new function object in every single iteration of the list comprehension.
That said, you can write any call to map() as an equivalent list comprehension. If the first argument to map() is a lambda function, the list comprehension is usually preferred. If the first argument to map() is a function name, both variants are fine. Some people (including me) prefer, say,
map(str, my_list)
while others prefer
[str(x) for x in my_list]
There is no difference, but the pythonic way would be to omit the lambda completely:
[y + 1 for y in range(4)]
Note also that if your mapping function is a "built-in" (written in C) function, rather than a python function or a lambda, map will be faster.
Another pythonic, but uncommon, way (avoids unnecessary lambda) would be:
map(1 .__add__, range(4)) # thanks to SvenMarnach for this
It is usually preferable to avoid lambdas in mapping forms, because a list comprehension will always be more efficient, AND clearer. By contrast, using multi-line functions is perfectly acceptable - there is no way to write them inline, and even if you could, it would likely be less clear.
Another difference is that because map can take multiple sequences to map against, and passes them as positional parameters to the mapping function, one can avoid the zipping that would be required in a list comprehension:
[x+y for x,y in zip(range(4), range(2,6))]
#vs
from operator import add
map(add, range(4), range(2,6))
I'm interested in building a derivative calculator. I've racked my brains over the problem, but I haven't found a workable solution at all. Do you have a hint on how to start? Thanks
I'm sorry, I should have been clearer: I want to do symbolic differentiation.
Let's say you have the function f(x) = x^3 + 2x^2 + x
I want to display the derivative, in this case f'(x) = 3x^2 + 4x + 1
I'd like to implement it in objective-c for the iPhone.
I assume that you're trying to find the exact derivative of a function. (Symbolic differentiation)
You need to parse the mathematical expression and store the individual operations in the function in a tree structure.
For example, x + sin²(x) would be stored as a + operation, applied to the expression x and a ^ (exponentiation) operation of sin(x) and 2.
You can then recursively differentiate the tree by applying the rules of differentiation to each node. For example, a + node would become u' + v', and a * node would become uv' + vu' (the product rule).
You need to remember your calculus. Basically you need two things: a table of derivatives of basic functions, and rules for how to differentiate compound expressions (like d(f + g)/dx = df/dx + dg/dx). Then take an expression parser and recursively go over the tree. (http://www.sosmath.com/tables/derivative/derivative.html)
Parse your string into an S-expression (even though this is usually taken in Lisp context, you can do an equivalent thing in pretty much any language), easiest with lex/yacc or equivalent, then write a recursive "derive" function. In OCaml-ish dialect, something like this:
let rec derive var = function
| Const(_) -> Const(0)
| Var(x) -> if x = var then Const(1) else Deriv(Var(x), Var(var))
| Add(x, y) -> Add(derive var x, derive var y)
| Mul(a, b) -> Add(Mul(a, derive var b), Mul(derive var a, b))
...
(If you don't know OCaml syntax: derive is a two-parameter recursive function, with the first parameter being the variable name and the second being matched in the successive lines; for example, if this parameter is a structure of the form Add(x, y), return the structure Add built from two fields, with the values of derived x and derived y; and similarly for the other cases of what derive might receive as a parameter; _ in the first pattern means "match anything".)
After this you might want some clean-up function to tidy up the resulting expression (reducing fractions etc.), but this gets complicated, and it is not necessary for the differentiation itself (i.e. what you get without it is still a correct answer).
When your transformation of the s-exp is done, reconvert the resulting s-exp into string form, again with a recursive function.
SLaks already described the procedure for symbolic differentiation. I'd just like to add a few things:
Symbolic math is mostly parsing and tree transformations. ANTLR is a great tool for both. I'd suggest starting with this great book Language implementation patterns
There are open-source programs that do what you want (e.g. Maxima). Dissecting such a program might be interesting, too (but it's probably easier to understand what's going on if you tried to write it yourself, first)
Probably, you also want some kind of simplification for the output. For example, just applying the basic derivative rules to the expression 2 * x would yield 0*x + 2*1 rather than just 2. This can also be done by tree processing (e.g. by transforming 0 * [...] to 0, 1 * [...] to [...], [...] + 0 to [...] and so on), as sketched below.
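A minimal sketch of such a rewrite, using R's unevaluated calls as the tree (the thread targets Objective-C, but the recursion is the same for any tree type; simplify here is only a toy covering a few identities):

simplify <- function(e) {
  if (!is.call(e)) return(e)                                 # numbers and symbols stay as they are
  for (i in seq_along(e)[-1]) e[[i]] <- simplify(e[[i]])     # simplify the children first
  if (length(e) != 3 || !is.symbol(e[[1]])) return(e)        # only handle binary operators here
  op <- as.character(e[[1]]); a <- e[[2]]; b <- e[[3]]
  if (op == "*") {
    if (identical(a, 0) || identical(b, 0)) return(0)        # 0 * [...] -> 0
    if (identical(a, 1)) return(b)                           # 1 * [...] -> [...]
    if (identical(b, 1)) return(a)
  }
  if (op == "+") {
    if (identical(a, 0)) return(b)                           # [...] + 0 -> [...]
    if (identical(b, 0)) return(a)
  }
  e
}

simplify(quote(0 * x + 2 * 1))   # 2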
For what kinds of operations are you wanting to compute a derivative? If you allow trigonometric functions like sine, cosine and tangent, these are probably best stored in a table, while others like polynomials may be much easier to do. Are you allowing for functions to have multiple inputs, e.g. f(x, y) rather than just f(x)?
Polynomials in a single variable would be my suggestion to start with; then consider adding trigonometric, logarithmic, exponential and other advanced functions, whose derivatives may be harder to compute.
Symbolic differentiation over common functions (+, -, *, /, ^, sin, cos, etc.) ignoring regions where the function or its derivative is undefined is easy. What's difficult, perhaps counterintuitively, is simplifying the result afterward.
To do the differentiation, store the operations in a tree (or even just in Polish notation) and make a table of the derivative of each of the elementary operations. Then repeatedly apply the chain rule and the elementary derivatives, together with setting the derivative of a constant to 0. This is fast and easy to implement.
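To make the table-plus-recursion idea concrete, here is a minimal sketch in R, chosen only because its unevaluated calls already form the expression tree; the same recursion carries over to whatever tree type you build in Objective-C. Base R's D() does the same job and is handy for checking your own implementation.

## derive() walks the call tree: constants -> 0, the variable -> 1 (other symbols are
## treated as constants, unlike the OCaml sketch above, which keeps a symbolic Deriv node),
## sum rule for `+`, product rule for `*`, power rule for `^` with a constant exponent
derive <- function(e, var) {
  if (is.numeric(e)) return(0)
  if (is.symbol(e)) return(if (identical(e, as.symbol(var))) 1 else 0)
  op <- as.character(e[[1]])
  if (op == "+") return(call("+", derive(e[[2]], var), derive(e[[3]], var)))
  if (op == "*") return(call("+", call("*", derive(e[[2]], var), e[[3]]),
                                  call("*", e[[2]], derive(e[[3]], var))))
  if (op == "^" && is.numeric(e[[3]]))                 # d/dx u^n = n * u^(n-1) * u'
    return(call("*", call("*", e[[3]], call("^", e[[2]], e[[3]] - 1)),
                     derive(e[[2]], var)))
  stop("no rule for: ", op)                            # extend with sin, cos, exp, ...
}

derive(quote(x^3 + 2 * x^2 + x), "x")   # unsimplified tree equivalent to 3x^2 + 4x + 1
D(quote(x^3 + 2 * x^2 + x), "x")        # base R's own symbolic differentiator, for comparison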