gsub >=2 white space with tab in R [duplicate] - r

I've seen regex patterns that use explicitly numbered repetition instead of ?, * and +, i.e.:
Explicit Shorthand
(something){0,1} (something)?
(something){1} (something)
(something){0,} (something)*
(something){1,} (something)+
The questions are:
Are these two forms identical? What if you add possessive/reluctant modifiers?
If they are identical, which one is more idiomatic? More readable? Simply "better"?

To my knowledge they are identical. I think there maybe a few engines out there that don't support the numbered syntax but I'm not sure which. I vaguely recall a question on SO a few days ago where explicit notation wouldn't work in Notepad++.
The only time I would use explicitly numbered repetition is when the repetition is greater than 1:
Exactly two: {2}
Two or more: {2,}
Two to four: {2,4}
I tend to prefer these especially when the repeated pattern is more than a few characters. If you have to match 3 numbers, some people like to write: \d\d\d but I would rather write \d{3} since it emphasizes the number of repetitions involved. Furthermore, down the road if that number ever needs to change, I only need to change {3} to {n} and not re-parse the regex in my head or worry about messing it up; it requires less mental effort.
If that criteria isn't met, I prefer the shorthand. Using the "explicit" notation quickly clutters up the pattern and makes it hard to read. I've worked on a project where some developers didn't know regex too well (it's not exactly everyone's favorite topic) and I saw a lot of {1} and {0,1} occurrences. A few people would ask me to code review their pattern and that's when I would suggest changing those occurrences to shorthand notation and save space and, IMO, improve readability.

I can see how, if you have a regex that does a lot of bounded repetition, you might want to use the {n,m} form consistently for readability's sake. For example:
/^
abc{2,5}
xyz{0,1}
foo{3,12}
bar{1,}
$/x
But I can't recall ever seeing such a case in real life. When I see {0,1}, {0,} or {1,} being used in a question, it's virtually always being done out of ignorance. And in the process of answering such a question, we should also suggest that they use the ?, * or + instead.
And of course, {1} is pure clutter. Some people seem to have a vague notion that it means "one and only one"--after all, it must mean something, right? Why would such a pathologically terse language support a construct that takes up a whole three characters and does nothing at all? Its only legitimate use that I know of is to isolate a backreference that's followed by a literal digit (e.g. \1{1}0), but there are other ways to do that.

They're all identical unless you're using an exceptional regex engine. However, not all regex engines support numbered repetition, ? or +.
If all of them are available, I'd use characters rather than numbers, simply because it's more intuitive for me.

They're equivalent (and you'll find out if they're available by testing your context.)
The problem I'd anticipate is when you may not be the only person ever needing to work with your code.
Regexes are difficult enough for most people. Anytime someone uses an unusual syntax, the question
arises: "Why didn't they do it the standard way? What were they thinking that I'm missing?"

Related

What can I leverage (from R) to convert expressions with nested fractions from infix-notation to LaTeX?

I'd like something to convert pretty basic math expressions, having nested parentheses and fractions, to LaTeX notation. Like mathquill, but a function (or even the building blocks of one).
There seem to be some Lua and Haskell solutions in a pandoc/Rmarkdown context, but I can't use those, because (a) I'm scared of real languages, and (b) I'm generating PNGs (via webtex) to be featured in a flextable table, outside of a rendered document.
I'm inexperienced with regular expressions, so I don't know how to leverage something like this, but I'd appreciate any pointers if that seems like a productive path.
"Write a parser" is something best left to others, at least in my case!
Example expression below. Just a few levels of nesting, no big deal. I can bound it at, say, 5 levels, if that helps. And the input is parseable as an R expression—I can guarantee matched parentheses, for example.
(y0 - (y0/((1-y0)*exp(B*dx)+y0)))*Pop
Here's what I'd want in the case above. The \cdots are sugar; I can handle those.
\left(y0-\frac{y0}{\left(1-y0\right)\cdot\exp\left(B\cdot dx\right)+y0}\right)\cdot Pop
Visually:

How to match a minimum number of regex groups or assertions?

OK regex nerds!
I am using regex lookahead assertions for password validation that is similar to the pattern described here:
\A(?=\w{6,10}\z)(?=[^a-z]*[a-z])(?=(?:[^A-Z]*[A-Z]){3})(?=\D*\d)
However, we want to only require that any 3 of the 4 assertions be valid - not necessarily all of them. Any thoughts on how this could be done?
To shorten any kind of pattern, factorize:
\A(?:
(?=\w{6,10}\z) (?=.*[a-z]) (?: (?:.*[A-Z]){3} | .*\d )
|
(?=.*\d) (?=(?:.*[A-Z]){3}) (?: .*[a-z] | \w{6,10}\z )
)
Note that you don't need a lookahead to test the last condition.
demo
Other way, where each condition is optional and that uses a named group to count (.net only):
\A
(?<c>(?=\w{6,10}\z))?
(?<c>(?=[^a-z]*[a-z]))?
(?<c>(?=(?:[^A-Z]*[A-Z]){3}))?
(?<c>(?=\D*\d))?
(?<-c>){3} # decrement c 3 times
(?(c)|(?!$)) # conditional: force the pattern to fail if too few conditions succeed.
demo
There's no "easy" way to do this in a single regular expression. The only way would be to define all possible permutations of the "three out of four" assertions - e.g.
\A(?=\w{6,10}\z)(?=[^a-z]*[a-z])(?=(?:[^A-Z]*[A-Z]){3})| # Maybe no digit
\A(?=[^a-z]*[a-z])(?=(?:[^A-Z]*[A-Z]){3})(?=\D*\d)| # Maybe wrong length
\A(?=\w{6,10}\z)(?=(?:[^A-Z]*[A-Z]){3})(?=\D*\d)| # Maybe no lower
\A(?=\w{6,10}\z)(?=[^a-z]*[a-z])(?=\D*\d) # Maybe not enough uppers
However, this mind-melting regex is clearly not a good solution.
A better approach would be to perform the four checks separately (with regex or otherwise), and count that there is at least three passed conditions.
...However, let's take a step back here and ask: Why are you doing this?? You're implementing a password entropy check. Based on your fuzzy rules, the following passwords are valid:
AAAa1
password1
LETmein
And the following passwords are invalid:
reallylongsecurepassword8374235359232
HorseBatteryStapleCorrect
I would strongly advise against such a bizarrely restrictive policy.
Brief
The easiest method would be to have separate regular expressions and check whether 3/4 of them are successful in your code's language. The only way to do this in regex is to present all cases. That being said, this is probably the easiest method (in regex) to present all options as it allows you to edit the patterns in one location (where they are defined) rather than multiple times (more prone to bugs). The DEFINE constructs in regex are seldom supported, but PCRE regex does.
You can also have your code generate each regex permutation. See this question about generating all permutations of a list in python
I don't know why you want to do this for passwords, it's considered malpractice, but, since you're asking for it, I figured I'd give you the easiest solution possible in regex... You really should only check minimum length (and complexity if you want [based on algorithms] to show the user how secure your system finds their password to be).
Code
(?(DEFINE)
(?<w>(?=\w{6,10}\z))
(?<l>(?=[^a-z]*[a-z]))
(?<u>(?=(?:[^A-Z]*[A-Z]){3}))
(?<d>(?=\D*\d))
)
\A(?:
(?&w)(?&l)(?&u)|
(?&w)(?&l)(?&d)|
(?&w)(?&u)(?&d)|
(?&l)(?&u)(?&d)
)
Note: The regex above uses the x modifier (ignore whitespace) so that we can nicely organize the content.

matrix multiplication order PVM vs MVP in graphics programming

hi there I was wondering why most tutorials and programming code use MVP to describe the Model-View-Projection matrix. Instead of PVM which is the actual order of implementation in the code:
mat4 MVP = ProjectionMatrix * ViewMatrix * ModelMatrix;
gl_Position = MVP * VertexInModelSpace;
seems much more understandable to me to write PVM instead of MVP.
Matrices don't actually have a fixed meaning, just relations between rows and columns. The meaning is freely definable by the developers. The MVP order follows from standard mathematical conventions. But since nothing says you can not define the vectors as columns instead of rows nothing precludes this ordering.
Clarification: Since changing notation transposes the meaning. Then following applies:
MmvpT = Mpvm
Due to the definition of matrix multiplication following rule kicks in:
(AB)T = BTAT
Since B can be recursively another matrix multiplication a infinite chain of these are possible. Which means essentially that you have swapped the multiplication order, by changing notation.
Its a bit like looking at the problem from the outside or the problem from the inside. In this case your thinking as a outside observer. Whereas the other way around one would observe the thing from the standpoint of the first operator in the chain. Personally I think the notation you use may be more intuitive for this specific task, the other is just way more common. Mainly due to the fact that all mathematics books I have ever seen use this convention, so blame the mathematicians.
So better stick with the more common way, makes things more generally understandable. For example: Nothing stops me from typing the answer in Finnish but the convention of stackoverflow is to answer in English, making answers more understandable to most users. Use the more common form since others may not grasp the difference, and this leads to errors.
The other problem is that matrix multiplication is not necessarily commutative:
AB != BA
So it's a good idea to stick with the convention.

Excel-like toy-formula parsing

I would like to create a grammar for parsing a toy like formula language that resembles S-expression syntax.
I read through the "Getting Started with PyParsing" book and it included a very nice section that sort of covers a similar grammar.
Two examples of data to parse are:
sum(5,10,avg(15,20))+10
stdev(5,10)*2
Now, I have come up with a grammar that sort-of parses the formula but disregards
expanding the functions and operator precedence.
What would be the best practice to continue on with it: Should I add parseActions
for words that match oneOf the function names ( sum, avg ... ). If I build a nested
list, I could do a depth-first walking of parse results and evaluate the functions ?
It's a little difficult to advise without seeing more of your code. Still, from what you describe, it sounds like you are mostly tokenizing, to recognize the various bits of punctuation and distinguishing variable names from numeric constants from algebraic operators. nestedExpr will impart some structure, but only basic parenthetical nesting - this still leaves operator precedence handling for your post-parsing work.
If you are learning about parsing infix notation, there is a succession of pyparsing examples to look through and study (at the pyparsing wiki Examples page). Start with fourFn.py, which is actually a five function infix notation parser. Look through its BNF() method, and get an understanding of how the recursive definitions work (don't worry about the pushFirst parse actions just yet). By structuring the parser this way, operator precedence gets built right into the parsed results. If you parse 4 + 2 * 3, a mere tokenizer just gives you ['4','+','2','*','3'], and then you have to figure out how to do the 2*3 before adding the 4 to get 10, and not just brute force add 4 and 2, then multiply by 3 (which gives the wrong answer of 18). The parser in fourFn.py will give you ['4','+',['2','*','3']], which is enough structure for you to know to evaluate the 2*3 part before adding it to 4.
This whole concept of parsing infix notation with precedence of operations is so common, I wrote a helper function that does most of the hard work, called operatorPrecedence. You can see how this works in the example simpleArith.py and then move on to eval_arith.py to see the extensions need to create an evaluator of the parsed structure. simpleBool.py is another good example showing precedence for logical terms AND'ed and OR'ed together.
Finally, since you are doing something Excel-like, take a look at excelExpr.py. It tries to handle some of the crazy corner cases you get when trying to evaluate Excel cell references, including references to other sheets and other workbooks.
Good luck!

Coding mathematical algorithms - should I use variables in the book or more descriptive ones?

I'm maintaining code for a mathematical algorithm that came from a book, with references in the comments. Is it better to have variable names that are descriptive of what the variables represent, or should the variables match what is in the book?
For a simple example, I may see this code, which reflects the variable in the book.
A_c = v*v/r
I could rewrite it as
centripetal_acceleration = velocity*velocity/radius
The advantage of the latter is that anyone looking at the code could understand it. However, the advantage of the former is that it is easier to compare the code with what is in the book. I may do this in order to double check the implementation of the algorithms, or I may want to add additional calculations.
Perhaps I am over-thinking this, and should simply use comments to describe what the variables are. I tend to favor self-documenting code however (use descriptive variable names instead of adding comments to describe what they are), but maybe this is a case where comments would be very helpful.
I know this question can be subjective, but I wondered if anyone had any guiding principles in order to make a decision, or had links to guidelines for coding math algorithms.
I would prefer to use the more descriptive variable names. You can't guarantee everyone that is going to look at the code has access to "the book". You may leave and take your copy, it may go out of print, etc. In my opinion it's better to be descriptive.
We use a lot of mathematical reference books in our work, and we reference them in comments, but we rarely use the same mathematically abbreviated variable names.
A common practise is to summarise all your variables, indexes and descriptions in a comment header before starting the code proper. eg.
// A_c = Centripetal Acceleration
// v = Velocity
// r = Radius
A_c = (v^2)/r
I write a lot of mathematical software. IF I can insert in the comments a very specific reference to a book or a paper or (best) web site that explains the algorithm and defines the variable names, then I will use the SHORT names like a = v * v / r because it makes the formulas easier to read and write and verify visually.
IF not, then I will write very verbose code with lots of comments and long descriptive variable names. Essentially, my code becomes a paper that describes the algorithm (anyone remember Knuth's "Literate Programming" efforts, years ago? Though the technology for it never took off, I emulate the spirit of that effort). I use a LOT of ascii art in my comments, with box-and-arrow diagrams and other descriptive graphics. I use Jave.de -- the Java Ascii Vmumble Editor.
I will sometimes write my math with short, angry little variable names, easier to read and write for ME because I know the math, then use REFACTOR to replace the names with longer, more descriptive ones at the end, but only for code that is much more informal.
I think it depends almost entirely upon the audience for whom you're writing -- and don't ever mistake the compiler for the audience either. If your code is likely to be maintained by more or less "general purpose" programmers who may not/probably won't know much about physics so they won't recognize what v and r mean, then it's probably better to expand them to be recognizable for non-physicists. If they're going to be physicists (or, for another example, game programmers) for whom the textbook abbreviations are clear and obvious, then use the abbreviations. If you don't know/can't guess which, it's probably safer to err on the side of the names being longer and more descriptive.
I vote for the "book" version. 'v' and 'r' etc are pretty well understood as acronymns for velocity and radius and is more compact.
How far would you take it?
Most (non-greek :-)) keyboards don't provide easy access to Δ, but it's valid as part of an identifier in some languages (e.g. C#):
int Δv;
int Δx;
Anyone coming afterwards and maintaining the code may curse you every day. Similarly for a lot of other symbols used in maths. So if you're not going to use those actual symbols (and I'd encourage you not to), I'd argue you ought to translate the rest, where it doesn't make for code that's too verbose.
In addition, what if you need to combine algorithms, and those algorithms have conflicting usage of variables?
A compromise could be to code and debug as contained in the book, and then perform a global search and replace for all of your variables towards the end of your development, so that it is easier to read. If you do this I would change the names of the variables slightly so that it is easier to change them later.
e.g A_c# = v#*v#/r#

Resources