What's preventing additions to the current set of R reserved words/symbols?

Is there a historical precedent of internal changes to the R parser, adding new reserved words or symbols?
If I remember correctly data.table uses a serendipitous := that was once defined but left unused in R internals, but I'm not aware of others. However, as the language evolves, it would sometimes seem useful to define new symbols.
An obvious case could be made for magrittr's pipe %>% which has become ubiquitous for many, but remains a pain to type (sure, there are keyboard tricks, but still). Similarly, dplyr/rlang introduce/repurpose notations for "tidy evaluation" (!!, !!!, :=, ~, etc.).
Another case I'm seeing is the verbosity of lambda functions. Would it be possible, theoretically, to define internally something like f = λ(x) x+1 instead of f = function(x) x+1, or are there character restrictions on top of other reasons?

Why add an ergonomics feature if you risk breaking a runtime that hosts a huge ecosystem? Also, once you add one feature, you are on a slippery slope and are staring straight in the face of feature bloat.
And if you say that we can be smart and judicious about what features we add, how do we structure that decision process? R does not have a "benevolent dictator" having a final word in decisions like this so you are left with design by committee with all that it entails.
The big thing with R has always been the package ecosystem, in which if you want a feature you write it yourself -- as in your magrittr example. The language itself has remained close to its S roots and has successfully served as a stable platform for all the development that has been happening.


In GHC's STG output with -O2, what's this sequence following Str=DmdType all about?

(Misleading title: it's only one of a plethora of inter-related similar questions below: these sound like asking for a full reference manual but keep in mind for this topic there is no reference manual other than the entirety of GHC's source-codes of its STG pipeline stage, and the collective accumulated experience of others/"insiders"..)
I'm exploring "transpiling" Haskell (from scratch for fun/learning, ignoring existing projects; target language/s similarly high-level / "already-fit-for-STG-machine" with existing GC + lambdas/func-values + closures) and so I'm trying to become ever more familiar with GHC's STG IR. Having repeatedly gone through the dozen-or-two online articles/videos of varying age, depth, detail that actually deal with the topic (plus the original paper, plus StgSyn.hs), and understanding many-perhaps-most basic principles, seeing -ddump-stged output still baffles me in various parts (I won't manually parse it but reuse GHC API's in-memory AST later on of course) --- mostly I think I'm stuck mapping my "roughly known" concepts to the "still-foreign" abbreviated/codified identifiers of that IR. If you know your way around STG a bit, mind looking at the following mini-sample to clarify a few open questions and help further solidify my (and future searchers') grasp?
From a most simple .hs module, I have -ddump-stged twice, first (on the left) with -O0 and then (on the right) with -O2, both captured in this diff.
Walking through everything def-by-def..
Lines L_|R5-11: so in O2, testX1 and testX2 seem to be global constants/literals for the integers 4 and 5 --- O0 doesn't have them. Curious!
Is Str=DmdType something about strictness? "Strictness is of type on-demand" or some such? But then a top-level/heap-ish/"global" constant literal can't be "lazy" can it.. (one of the things where I can't just casually Ctrl+F in StgSyn.hs --- it's not in there! which is odd in its own way, how come there's STG syntax not in StgSyn.hs)
Caf: I have a rough idea about constant applicative forms, but Unf=OtherCon? "Other constructor" (unboxed/native Type.S#-related?) ..
Line L6|R14: Surprised to still see type-class information in there (Num), is that "just info/annotation" or is this crucial for any of the built-in code-gens to set up some "dictionary" lookup machinery at runtime? (I'd sure hope by the late STG / pre-CMM stage that would be resolved and inlined already where possible at least in O2. After all GHC has also decided to type-default 4 and 5 to Integer). Generally speaking I understand STG is "untyped" other than denoting prim types, saturated cons, perhaps strings (looks like it later on at the bottom), so such "typeclass" annotations can only be.. I guess for readers to find their way around the ddump-ed *.stg. But correct me if not.
GblId probably just "global identifier" aka top-level CAF right? Arity clear.
Line L7|R18: now Str=DmdType for testX is, only in O2, followed by a freakish <S(LLC(C(S))LLLL),U(1*C1(C1(U)),A,1*C1(C1(U)),A,A,A,C(U))><L,U>! What's that, SKI calculus? ;D no seriously, LLC.. LLLL.. stack or other memory layout hints for CMM? Any idea? Must be some optimization, would like to understand which-and-how..
Line L8|R20: $dNum_sGM (left) and $dNum_sIx (right) have me a bit concerned, they don't seem to be "defined at the module level" here anywhere. Typeclass "method dispatch dictionary lookup" kind of thing? Would eg. CMM take this together with the above Num annotation to set things up? It always appears together with the input func arg.
The whole function "body" for both left and right can be seen here essentially as "3 lets with a lambda-ish form for 3 atoms, 2 of which are statically known literal-constants" --- I suppose this is standard and to be expected in the STG IR AST? For the first of these, funnily enough we could say that O0 has "inlined the global (what is testX1 or testX2 in O2) and O2 hasn't" (making the latter much shorter as that applies to both these constant literals).
I've only ever seen Occ=Once, what are the others and how to interpret? Once for one isn't even in StgSyn.hs..
Now LclId a counterpart to the earlier encountered GblId. That's denoting the scope of the identifier? Could it also be anything else, in this expression context? As in: if traversing the AST I roughly know how deep I am, I can ignore this since if I'm at the top-level it must be GblId and otherwise LclId? Hm.. maybe better take what STG gives me but then I need to be sure about the semantics and possibilities.. guys, using StgSyn.hs I have the wrong source file, right? Nothing on this in there either.. (always hopeful as its comments are quite well-done)
the rest is just metadata as string constants, OK.. oh wait, look at O2, there's Str=DmdType m1 and Str=DmdType m, what's the m/m1 about, another thing I don't see "defined anywhere at the module level" here? And it's not in O0..
still going strong? Merely a bonus question (for now), tell us about srt:SRT:[] ;)
Just a few tidbits - a full answer is quite beyond my knowledge.
The type of your function is
testX :: GHC.Num.Num a => a -> a
It’s compiled to a function with two arguments: a dictionary of the Num type class, and the actual argument.
The $d… names stand for dictionaries of type class instances. The <S(LLC(C(S))LLLL),… annotations are strictness information about the function arguments. They basically say which part of the argument will be used by your function and which not. Looks a bit weird here because it contains information about all the class instance members.
Some of this is explained here:
https://ghc.haskell.org/trac/ghc/wiki/Commentary/Compiler/Demand
The srt:SRT: is the "Static Reference Table", i.e. the list of free variables of the expression. In your case it is always [].

Replacing an ordinary function with a generic function

I'd like to use names such as elt, nth and mapcar with a new data structure that I am prototyping, but these names designate ordinary functions and so, I think, would need to be redefined as generic functions.
Presumably it's bad form to redefine these names?
Is there a way to tell defgeneric not to generate a program error and to go ahead and replace the function binding?
Is there a good reason for these not being generic functions, or is it just historic?
What's the considered wisdom and best practice here please?
If you are using SBCL or ABCL, and aren't concerned with ANSI compliance, you could investigate Extensible Sequences:
http://www.sbcl.org/manual/#Extensible-Sequences
http://www.doc.gold.ac.uk/~mas01cr/papers/ilc2007/sequences-20070301.pdf
...you can't redefine functions in the COMMON-LISP package, but you could create a new package and shadow the imports of the functions you want to redefine.
Is there a good reason for these not being generic functions, or is it just historic?
Common Lisp has some layers of language in some of its areas. Higher-level parts of the software might need to be built on lower-level constructs.
One of its goals was being fast enough for a range of applications.
Common Lisp also introduced the idea of sequences, the abstraction over lists and vectors, at a time when the language didn't have an object system. CLOS came several years after the initial Common Lisp design.
Take for example something like equality - for numbers.
Lisp has =:
(= a b)
That's the fastest way to compare numbers. = is also defined only for numbers.
Then there are eql, equal and equalp. Those work for numbers, but also for some other data types.
Now, if you need more speed, you can declare the types and tell the compiler to generate faster code:
(locally
  (declare (fixnum a b)
           (optimize (speed 3) (safety 0)))
  (= a b))
So, why is = not a CLOS generic function?
a) it was introduced when CLOS did not exist
but equally important:
b) in Common Lisp it wasn't known (and it still isn't) how to make a CLOS generic function = as fast as a non-generic function for typical usage scenarios - while preserving dynamic typing and extensibility
CLOS generic functions simply have a speed penalty. The runtime dispatch has a cost.
CLOS is best used for higher level code, which then really benefits from features like extensibility, multi-dispatch, inheritance/combinations. Generic functions should be used for defined generic behavior - not as collections of similar methods.
With better implementation technology, implementation-specific language enhancements, etc. it might be possible to increase the range of code which can be written in a performant way using CLOS. This has been tried with programming languages like Dylan and Julia.
Presumably it's bad form to redefine these names?
Common Lisp implementations don't let you replace them just like that. Be aware that your replacement functions should be implemented in a way which works consistently with the old functions. Also, old versions could be inlined in some way and not be replaceable everywhere.
Is there a way to tell defgeneric not to generate a program error and to go ahead and replace the function binding?
You would need to make sure that the replacement is working while replacing it. The code replacing the functions might itself use the functions you are replacing.
Still, implementations allow you to replace CL functions - but this is implementation specific. For example LispWorks provides the variables lispworks:*packages-for-warn-on-redefinition* and lispworks:*handle-warn-on-redefinition*. One can bind them or change them globally.
What's the considered wisdom and best practice here please?
There are two approaches:
use implementation specific ways to replace standard Common Lisp functions
This can be dangerous. Plus you need to support it for all implementations of CL you want to use...
use a language package, where you define your new language. Here this would be standard Common Lisp plus your extensions/changes. Export everything the user would use. In your software use this package instead of CL.

Why does `stringsAsFactors` use capital letters for readability in R?

Why does stringsAsFactors use capital letters to aid readability in R when most other commands seem to use . (e.g., as.factor)?
Is this an idiosyncrasy or part of a higher organization of the commands that I am not familiar with?
Is there any way to predict which commands will use capital letters and which will use .?
Thanks
It is obvious -- no standard has been established before it was too late ;-)
A lot of the idiosyncrasies arise because of the heritage from the S language and compatibility with the implementation in S-PLUS. There has been a tendency in recent years to avoid new functions with names that include a . as a separator to avoid confusion with S3 methods. This hasn't been changed retrospectively because of backwards compatibility and a desire to be faithful to functions from S/S-PLUS days.
Since _ was deprecated as an alternative to <-, some authors have used it in function names; Hadley Wickham's packages are one example, but there are plenty of others.
The lack of a strictly adhered standard can be confusing, and certainly adds to the learning curve, but is something you have to live with.
The so-called 'camelCase' is a good choice.
Besides Hadley, few recommend underscores. See for example the Google R Style Guide which says:
Don't use underscores ( _ ) or hyphens ( - ) in identifiers.
R itself does not enforce a style, but (heuristically speaking) not too many new core libraries use a dot as a separator in identifiers either, as this is also used for S3 methods.

Coding mathematical algorithms - should I use variables in the book or more descriptive ones?

I'm maintaining code for a mathematical algorithm that came from a book, with references in the comments. Is it better to have variable names that are descriptive of what the variables represent, or should the variables match what is in the book?
For a simple example, I may see this code, which reflects the variable in the book.
A_c = v*v/r
I could rewrite it as
centripetal_acceleration = velocity*velocity/radius
The advantage of the latter is that anyone looking at the code could understand it. However, the advantage of the former is that it is easier to compare the code with what is in the book. I may do this in order to double check the implementation of the algorithms, or I may want to add additional calculations.
Perhaps I am over-thinking this, and should simply use comments to describe what the variables are. I tend to favor self-documenting code however (use descriptive variable names instead of adding comments to describe what they are), but maybe this is a case where comments would be very helpful.
I know this question can be subjective, but I wondered if anyone had any guiding principles in order to make a decision, or had links to guidelines for coding math algorithms.
I would prefer to use the more descriptive variable names. You can't guarantee everyone that is going to look at the code has access to "the book". You may leave and take your copy, it may go out of print, etc. In my opinion it's better to be descriptive.
We use a lot of mathematical reference books in our work, and we reference them in comments, but we rarely use the same mathematically abbreviated variable names.
A common practise is to summarise all your variables, indexes and descriptions in a comment header before starting the code proper. eg.
// A_c = Centripetal Acceleration
// v = Velocity
// r = Radius
A_c = (v^2)/r
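One way to combine both views, sketched here in Python: descriptive names at the function boundary, the book's short names inside the body, and a legend comment tying the two together. The function name and the legend layout are illustrative assumptions, not something prescribed in this thread, and the book citation is left as a placeholder.

```python
def centripetal_acceleration(velocity, radius):
    """Return the centripetal acceleration for a given velocity and radius."""
    # Legend (book notation; put the exact book/equation reference here):
    #   A_c = centripetal acceleration
    #   v   = velocity
    #   r   = radius
    v, r = velocity, radius
    A_c = v * v / r  # written exactly as in the book: A_c = v^2 / r
    return A_c


print(centripetal_acceleration(10.0, 5.0))  # prints 20.0
```

Callers see self-documenting names, while anyone checking the formula against the book can compare the single line that uses its notation.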
I write a lot of mathematical software. IF I can insert in the comments a very specific reference to a book or a paper or (best) web site that explains the algorithm and defines the variable names, then I will use the SHORT names like a = v * v / r because it makes the formulas easier to read and write and verify visually.
IF not, then I will write very verbose code with lots of comments and long descriptive variable names. Essentially, my code becomes a paper that describes the algorithm (anyone remember Knuth's "Literate Programming" efforts, years ago? Though the technology for it never took off, I emulate the spirit of that effort). I use a LOT of ascii art in my comments, with box-and-arrow diagrams and other descriptive graphics. I use Jave.de -- the Java Ascii Vmumble Editor.
I will sometimes write my math with short, angry little variable names, easier to read and write for ME because I know the math, then use REFACTOR to replace the names with longer, more descriptive ones at the end, but only for code that is much more informal.
I think it depends almost entirely upon the audience for whom you're writing -- and don't ever mistake the compiler for the audience either. If your code is likely to be maintained by more or less "general purpose" programmers who may not/probably won't know much about physics so they won't recognize what v and r mean, then it's probably better to expand them to be recognizable for non-physicists. If they're going to be physicists (or, for another example, game programmers) for whom the textbook abbreviations are clear and obvious, then use the abbreviations. If you don't know/can't guess which, it's probably safer to err on the side of the names being longer and more descriptive.
I vote for the "book" version. 'v' and 'r' etc. are pretty well understood as abbreviations for velocity and radius, and are more compact.
How far would you take it?
Most (non-greek :-)) keyboards don't provide easy access to Δ, but it's valid as part of an identifier in some languages (e.g. C#):
int Δv;
int Δx;
Anyone coming afterwards and maintaining the code may curse you every day. Similarly for a lot of other symbols used in maths. So if you're not going to use those actual symbols (and I'd encourage you not to), I'd argue you ought to translate the rest, where it doesn't make for code that's too verbose.
In addition, what if you need to combine algorithms, and those algorithms have conflicting usage of variables?
A compromise could be to code and debug as contained in the book, and then perform a global search and replace for all of your variables towards the end of your development, so that it is easier to read. If you do this I would change the names of the variables slightly so that it is easier to change them later.
e.g. A_c# = v#*v#/r#

Smart design of a math parser?

What is the smartest way to design a math parser? What I mean is a function that takes a math string (like: "2 + 3 / 2 + (2 * 5)") and returns the calculated value. I did write one in VB6 ages ago but it ended up being way too bloated and not very portable (or smart for that matter...). General ideas, pseudocode or real code are appreciated.
A pretty good approach involves two steps. The first step converts the expression from infix to postfix notation (e.g. via Dijkstra's shunting-yard algorithm). Once that's done, it's pretty trivial to write a postfix evaluator.
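A minimal Python sketch of that two-step approach, under some simplifying assumptions: only the four binary operators +, -, *, / (all left-associative), parentheses and unsigned decimal numbers are supported, with no unary minus and no functions. The helper names (tokenize, to_postfix, eval_postfix, evaluate) are made up for this illustration.

```python
import operator

# Precedence and implementation of each supported (left-associative) operator.
OPS = {
    "+": (1, operator.add),
    "-": (1, operator.sub),
    "*": (2, operator.mul),
    "/": (2, operator.truediv),
}


def tokenize(text):
    """Split the input into floats, operator characters and parentheses."""
    tokens, i = [], 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            i += 1
        elif ch.isdigit() or ch == ".":
            j = i
            while j < len(text) and (text[j].isdigit() or text[j] == "."):
                j += 1
            tokens.append(float(text[i:j]))
            i = j
        elif ch in OPS or ch in "()":
            tokens.append(ch)
            i += 1
        else:
            raise ValueError(f"unexpected character {ch!r}")
    return tokens


def to_postfix(tokens):
    """Step 1: shunting-yard conversion from infix to postfix."""
    output, stack = [], []
    for tok in tokens:
        if isinstance(tok, float):
            output.append(tok)
        elif tok in OPS:
            # Pop operators of higher or equal precedence (left associativity).
            while stack and stack[-1] in OPS and OPS[stack[-1]][0] >= OPS[tok][0]:
                output.append(stack.pop())
            stack.append(tok)
        elif tok == "(":
            stack.append(tok)
        else:  # tok == ")"
            while stack and stack[-1] != "(":
                output.append(stack.pop())
            if not stack:
                raise ValueError("unbalanced parentheses")
            stack.pop()  # discard the matching "("
    while stack:
        op = stack.pop()
        if op == "(":
            raise ValueError("unbalanced parentheses")
        output.append(op)
    return output


def eval_postfix(postfix):
    """Step 2: evaluate the postfix sequence with a single operand stack."""
    stack = []
    for tok in postfix:
        if isinstance(tok, float):
            stack.append(tok)
        else:
            right = stack.pop()
            left = stack.pop()
            stack.append(OPS[tok][1](left, right))
    return stack.pop()


def evaluate(text):
    return eval_postfix(to_postfix(tokenize(text)))


print(evaluate("2 + 3 / 2 + (2 * 5)"))  # prints 13.5
```

Keeping the operator table in one dictionary means adding an operator is a one-line change, although right-associative operators such as exponentiation would need an extra associativity flag.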
I wrote a few blog posts about designing a math parser. There is a general introduction, basic knowledge about grammars, sample implementation written in Ruby and a test suite. Perhaps you will find these materials useful.
You have a couple of approaches. You could generate dynamic code and execute it in order to get the answer without needing to write much code. Just perform a search on runtime generated code in .NET and there are plenty of examples around.
Alternatively you could create an actual parser and generate a little parse tree that is then used to evaluate the expression. Again this is pretty simple for basic expressions. Check out codeplex as I believe they have a math parser on there. Or just look up BNF which will include examples. Any website introducing compiler concepts will include this as a basic example.
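As a rough illustration of the parse-tree route, here is a small recursive-descent parser in Python. It assumes a simple textbook grammar (expr := term (('+' | '-') term)*, term := factor (('*' | '/') factor)*, factor := NUMBER | '(' expr ')' | '-' factor). The Parser class and the tuple-based tree shape are choices made for this sketch only.

```python
import operator
import re

BINARY = {"+": operator.add, "-": operator.sub,
          "*": operator.mul, "/": operator.truediv}


class Parser:
    """Recursive-descent parser producing nested tuples as a parse tree."""

    def __init__(self, text):
        # Numbers become one token; any other non-space character is its own token.
        self.tokens = re.findall(r"\d+\.?\d*|\.\d+|\S", text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def parse(self):
        tree = self.expr()
        if self.peek() is not None:
            raise ValueError(f"unexpected token {self.peek()!r}")
        return tree

    def expr(self):
        node = self.term()
        while self.peek() in ("+", "-"):
            op = self.advance()
            node = (op, node, self.term())  # left-associative chain
        return node

    def term(self):
        node = self.factor()
        while self.peek() in ("*", "/"):
            op = self.advance()
            node = (op, node, self.factor())
        return node

    def factor(self):
        tok = self.advance()
        if tok is None:
            raise ValueError("unexpected end of input")
        if tok == "(":
            node = self.expr()
            if self.advance() != ")":
                raise ValueError("expected ')'")
            return node
        if tok == "-":
            return ("neg", self.factor())
        try:
            return float(tok)
        except ValueError:
            raise ValueError(f"unexpected token {tok!r}") from None


def evaluate(node):
    """Walk the parse tree and compute its value."""
    if isinstance(node, float):
        return node
    if node[0] == "neg":
        return -evaluate(node[1])
    op, left, right = node
    return BINARY[op](evaluate(left), evaluate(right))


tree = Parser("2 + 3 / 2 + (2 * 5)").parse()
print(evaluate(tree))  # prints 13.5
```

Each grammar rule maps to one method, so operator precedence is encoded in the call structure rather than in a precedence table.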
Codeplex Expression Evaluator
If you have an "always on" application, just post the math string to Google and parse the result. It's a simple way, though I'm not sure it's what you need, but smart in some way I guess.
I know this is old, but I came across this while trying to develop a calculator as part of a larger app and ran into some issues using the accepted answer. The links were IMMENSELY helpful in understanding and solving this problem and should not be discounted. I was writing an Android app in Java, and for each item in the expression "string" I actually stored a String in an ArrayList as the user typed on the keypad. For the infix-to-postfix conversion, I iterated through each String in the ArrayList, then evaluated the newly arranged postfix ArrayList of Strings. This was fantastic for a small number of operands/operators, but longer calculations were consistently off, especially as the expressions started evaluating to non-integers. The provided link for infix-to-postfix conversion suggests popping the stack if the scanned item is an operator and the item on top of the stack has a higher precedence. I found that this is almost correct. Popping the top item if its precedence is higher than OR EQUAL to that of the scanned operator finally made my calculations come out correct. Hopefully this will help anyone working on this problem, and thanks to Justin Poliey (and fas?) for providing some invaluable links.
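The effect of the "higher OR EQUAL" rule can be seen in a small Python sketch. It assumes whitespace-separated tokens, no parentheses, and only left-associative operators; the pop_on_equal flag is a hypothetical switch added purely to compare the two rules.

```python
import operator

PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}
APPLY = {"+": operator.add, "-": operator.sub,
         "*": operator.mul, "/": operator.truediv}


def to_postfix(tokens, pop_on_equal):
    """Infix to postfix; pop_on_equal selects the '>=' rule over the '>' rule."""
    output, stack = [], []
    for tok in tokens:
        if tok in PRECEDENCE:
            while stack and (
                PRECEDENCE[stack[-1]] > PRECEDENCE[tok]
                or (pop_on_equal and PRECEDENCE[stack[-1]] == PRECEDENCE[tok])
            ):
                output.append(stack.pop())
            stack.append(tok)
        else:
            output.append(tok)
    output.extend(reversed(stack))  # flush remaining operators, top first
    return output


def eval_postfix(postfix):
    stack = []
    for tok in postfix:
        if tok in PRECEDENCE:
            right = stack.pop()
            left = stack.pop()
            stack.append(APPLY[tok](left, right))
        else:
            stack.append(float(tok))
    return stack.pop()


tokens = "8 - 3 - 2".split()
print(eval_postfix(to_postfix(tokens, pop_on_equal=True)))   # prints 3.0
print(eval_postfix(to_postfix(tokens, pop_on_equal=False)))  # prints 7.0
```

With the strict rule the second minus stays on the stack above the first, so the expression is grouped as 8 - (3 - 2). Popping on equal precedence restores the usual left-to-right grouping (8 - 3) - 2.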
The related question Equation (expression) parser with precedence? has some good information on how to get started with this as well.
-Adam
Assuming your input is an infix expression in string format, you could convert it to postfix and, using a pair of stacks: an operator stack and an operand stack, work the solution from there. You can find general algorithm information at the Wikipedia link.
ANTLR is a very nice LL(*) parser generator. I recommend it highly.
Developers always want a clean approach and try to implement the parsing logic from the ground up, usually ending up with Dijkstra's shunting-yard algorithm. The result is neat-looking code, but possibly riddled with bugs. I have developed such an API, JMEP, that does all that, but it took me years to get stable code.
Even so, you can see from that project page that I am seriously considering switching over to JavaCC or ANTLR, despite all the work already done.
11 years into the future from when this question was asked: If you don't want to re-invent the wheel, there are many exotic math parsers out there.
There is one that I wrote years ago which supports arithmetic operations, equation solving, differential calculus, integral calculus, basic statistics, function/formula definition, graphing, etc.
It's called ParserNG and it's free.
Evaluating an expression is as simple as:
MathExpression expr = new MathExpression("(34+32)-44/(8+9(3+2))-22");
System.out.println("result: " + expr.solve());
result: 43.16981132075472
Or using variables and calculating simple expressions:
MathExpression expr = new MathExpression("r=3;P=2*pi*r;");
System.out.println("result: " + expr.getValue("P"));
Or using functions:
MathExpression expr = new MathExpression("f(x)=39*sin(x^2)+x^3*cos(x);f(3)");
System.out.println("result: " + expr.solve());
result: -10.65717648378352
Or to evaluate the derivative at a given point (note that it does symbolic differentiation, not numerical, behind the scenes, so the accuracy is not limited by the errors of numerical approximations):
MathExpression expr = new MathExpression("f(x)=x^3*ln(x); diff(f,3,1)");
System.out.println("result: " + expr.solve());
result: 38.66253179403897
Which differentiates x^3 * ln(x) once at x=3.
The number of times you can differentiate is 1 for now.
Or for numerical integration:
MathExpression expr = new MathExpression("f(x)=2*x; intg(f,1,3)");
System.out.println("result: " + expr.solve());
result: 7.999999999998261... approx: 8
This parser is decently fast and has lots of other functionality.
Work has been concluded on porting it to Swift via bindings to Objective-C, and we have used it in graphing applications amongst other iterative use cases.
DISCLAIMER: ParserNG is authored by me.
