I was going to ask this as an update to my previous, similar question, but it became too long.
I was trying to understand a regex given on w3.org that matches CSS comments, and this question came up:
Why do they use
\/\*[^*]*\*+([^/*][^*]*\*+)*\/
----------------^
instead of just
\/\*[^*]*\*+([^/][^*]*\*+)*\/
?
Both are working similarly. Why do they have an extra star there?
Let's look at this part:
\*+([^/*][^*]*\*+)*
-A- --B-- -C-
The regex engine will process the A part and match all the stars until there are no more stars or there is a line break. So once A is done, the next character must be a line break or anything else that's not a star. Then why did they use [^/*] instead of [^/]?
Also look at the repeating capturing group.
([any one char that's not / or *][zero or more chars that's not *][one or more stars])
It captures groups of characters ending with one or more stars. So C will take all the stars, leaving B with no stars to match in the next round.
So the B part won't get a chance to meet any stars at all. That is why I think there's no need to put a star there.
But that regex is in w3.org so I guess my understanding may be wrong. Please explain what I'm missing.
This has already been corrected in the CSS3 Syntax module:
\/\*[^*]*\*+([^/][^*]*\*+)*\/ /* ignore comments */
Notice that the extraneous asterisk is gone, making this expression identical to what you have.
So it would seem that it was simply a mistake on their part while writing the grammar for CSS2. I'm digging the mailing list archives to see if there's any discussion there that could be relevant.
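A quick way to compare the two expressions is to run both over a few comment-like strings. This is only a sketch: it uses Python's re module (a backtracking engine) rather than the lexer the grammar was written for, the sample strings are my own, and agreeing on a handful of samples is not a proof that the two expressions describe the same language.

```python
import re

# The two expressions from the question, copied verbatim.
OLD = re.compile(r"\/\*[^*]*\*+([^/*][^*]*\*+)*\/")  # CSS2 form, with [^/*]
NEW = re.compile(r"\/\*[^*]*\*+([^/][^*]*\*+)*\/")   # CSS3 form, with [^/]

# Hypothetical sample inputs, chosen to exercise runs of stars.
samples = [
    "a { } /* plain */ b { }",
    "/***/",
    "/* stars ** inside * the body **/ x /* second */",
    "/* multi\nline * comment */",
    "/* unterminated",
]

def find_comments(pattern, text):
    """Return every substring of text that the pattern matches, left to right."""
    return [m.group(0) for m in pattern.finditer(text)]

for text in samples:
    old_hits = find_comments(OLD, text)
    new_hits = find_comments(NEW, text)
    print(repr(text), "same" if old_hits == new_hits else "DIFFERENT", old_hits)
```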
I'm trying to assign the result of a chain matrix multiplication in Maxima to a new variable. I'm not sure as a new user why line %o6 isn't the same as the previous and fully evaluate the chain. Also why when I enter the new variable name "B" I simply have "B" returned back to me and not ([32, 32], [32, 32]). Basic questions I know but I've searched the documentation for a number of hours, and tutorials, and the syntax that I'm supposed to use here to get what I guess I was expecting as output, is still unclear to me.
I can't tell for sure, but it appears that the problem is that B : A.A.A is entered holding the shift key for at least one of the spaces, and Shift+Space is interpreted as non-breaking space instead of ordinary space. This appears to be a known bug or at least a serious misfeature in wxMaxima; see: https://github.com/wxMaxima-developers/wxmaxima/issues/1031
(I say misfeature because Shift+Space --> non-breaking space is documented in the wxMaxima documentation, but it seems like a classic example of "bad affordance"; it is all too easy to do the wrong thing without knowing it. Anyway this is just my opinion.)
I built wxMaxima from current source code and it appears that Shift+Space is now not interpreted as non-breaking space in code, so B : A.A.A should have the expected effect even if shift key is held while typing space. The current version is 19.07.0-DevelopmentSnapshot. I poked through the commit log a bit, but I can't figure out which commit changed the behavior of Shift+Space, so it's possible that the problem is not fixed and it is just fortuitous that I am not encountering it.
There are two workarounds, if one doesn't want to hazard an upgrade. (1) Omit spaces. (2) Be careful to only type space without shift.
Hope this is helpful in some way.
Dear People of Stack Overflow,
While I would say that I get the job done with regular expressions most of the time, now I have a problem I cannot seem to grasp:
I have text files I need to parse (language being R, but that doesn’t seem to matter). Essentially these files are protocols of speakers and I want to extract some information. The speakers generally follow this pattern:
Mr. Paul (speaks in English): Text.
Mr. Hernandez Gabriel (speaks in Spanish): Text.
Mr. Jenchewkow (speaks in Russian, translation provided): Text.
The regex I use for these speakers is: ^(Mr\.)\s*([^\(]*?)\s*(|\(speaks.*?\)):\s*(.*)$
The problem occurs when these speakers quote somebody else or reference something like:
Mr. Puk once said: ‚Hello‘ and I want to second that.
Here, sometimes a mismatch occurs as the regex captures everything between „Mr.“ and the colon, parsing the second capture group as: „Puk once said“ and messing up the parsed document. Thus, I tried to exclude these matches with a negative lookahead, guessing the words that could occur between Mr. and the colon like „said“, „expressed“, etc.
However, a) I seem to be unable to combine the negative lookahead with the second capture group’s ([^\(]*?) and b) this approach doesn’t seem to be that universal, given that there are other mismatches like:
Mr. Peter thought it acceptable that: Some text.
So my question is twofold: How would I exclude matches that have a „said“, „expressed“, etc. after the ‚name‘? And secondly: Is there a better, more universal way of achieving this? I thought about limiting the number of words between „Mr.“ and the colon, but that doesn’t seem to solve the problem.
Thanks in advance!
Edit:
As a reaction to the very helpful answers up to this point, I should emphasize that
a) there are indeed people with multiple names in the data
and
b) there are speakers which are not followed by a "speaks in...". Thus, Mr\.\s*([^\(]*)\s\(speaks in [^\)]*\): doesn't match them. An example would be:
Mr. Paul: Hello!
The last one was an oversight on my part when giving the initial examples. Sorry!
I suggest the following more flexible but still anchored pattern:
Mr\.\s*([^\(]*)\s\(speaks in [^\)]*\):
Demo
Mr. acts as start anchor, and \s\(speaks in ... ): is used for the second part. The single \s is not absolutely required but the output becomes nicer.
Your updated requirement makes it hard to come up with a watertight solution. If there is only a limited number of speakers of the loose type you could add them as separate cases like: Mr. (Paul|Peter|Mary)(?=:) and then wrap up everything in a branch-reset group (?|...), which needs a PCRE-style engine:
(?|Mr\.\s*([^\(]*)\s\(speaks in [^\)]*\):|Mr. (Paul|Peter|Mary)(?=:))
If this is not enough you could add alternations for the cases where there is just a name (including a second first name):
(?|Mr\.\s*([^\(]*)\s\(speaks in [^\)]*\):|Mr. ([A-Z]\w+)(?=:)|Mr. ([A-Z]\w+ [A-Z]\w+)(?=:))
Demo2
This more generic regex would catch the name in each case and then any text after the colon:
^Mr\.?\s*([^\s]*)[^:]*:\s*(.+)$
Note I've put a question mark after the first period in case you occasionally have Mr without a . Remove the question mark if you always want the period matched. Also, you might consider setting case insensitivity again in case occasionally you have mr. And are there no women potentially speaking?
Forgot to say: this regex assumes there is only one surname. If you have something like "Mr. García Hernández said:" then the regex will need to be more complicated to find the name. This one will only match García in that case.
EDIT: In response to further info, I'd now write the Regex like this (in R syntax):
grepl("Mr\\.?\\s*([A-Z](?:[^\\s:]|\\s(?=[A-Z]))+)[^:]*:\\s*(.+)", subject, perl=TRUE);
The conditions for this to work are that Mr is always with a capital letter, and that names always begin with a capital letter in the ASCII range [A-Z] (otherwise how is the Regex going to know it's a name?). As a plain regex it looks like this (without the R syntax):
Mr\.?\s*([A-Z](?:[^\s:]|\s(?=[A-Z]))+)[^:]*:\s*(.+)
Note that I've removed the start-of-string ^ and end-of-string $ because it seems matching ^ and $ at the start and end of lines in a long string is not supported in R (3.1-3.4). Change that if you're dealing with single strings. It seems that the dot doesn't work multi-line either in R, so the last (.+) matches to the end of the line. You could get some false positives if there is a speaker who says "As Mr. Hernández said...", though if there are no colons after that to the end of the line it should still work. This is where ^ at the start could help, so add it back if necessary.
This will match any number of surnames before the colon so long as they begin with [A-Z]. This also has to be run in case-sensitive mode. If you want an explanation of how it works, just ask, but maybe you follow anyway.
Output of above regex by numbered capturing groups:
Mr. Paul (speaks in English): Text. -> 1. Paul -> 2. Text.
Mr. Hernandez Gabriel Theodor (speaks in Spanish): Text. -> 1. Hernandez Gabriel Theodor -> 2. Text.
Mr. Jenchewkow (speaks in Russian, translation provided): Text. -> 1. Jenchewkow -> 2. Text.
Mr. Puk once said: ‚Hello‘ and I want to second that. -> 1. Puk -> 2. ‚Hello‘ and I want to second that.
Mr. Peter thought it acceptable that: Some text. -> 1. Peter -> 2. Some text.
Mr Paul: Hello! -> 1. Paul -> 2. Hello!
FURTHER EDIT:
OK, so to exclude anything that has text other than something in parentheses before the colon, you can do this:
Mr\.?\s*([A-Z](?:[^\s:]|\s(?=[A-Z]))+)(?=[\s]*[(:])[^:]*:\s*(.+)
You can try it out and change options here: https://regex101.com/r/YzHPa0/1 - have a look at the Match Information on right-hand side of that screen to see what the capture groups match.
Note that this needs to be case sensitive. If you want to specify the text that goes in the parentheses for even more selectivity you'll have to change [^:]* to (?:\s\(speaks\sin[^:]+)?.
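To see what the last pattern does with the example lines from the question, here is a minimal sketch in Python rather than R. The helper name parse_speaker is made up for illustration, and using match() (which anchors at the start of each line) is my assumption; it plays the role of the ^ that was dropped from the pattern above.

```python
import re

# The FURTHER EDIT pattern from above, unchanged. It must stay case sensitive.
SPEAKER = re.compile(
    r"Mr\.?\s*([A-Z](?:[^\s:]|\s(?=[A-Z]))+)(?=[\s]*[(:])[^:]*:\s*(.+)"
)

def parse_speaker(line):
    """Return (name, text) for a speaker line, or None for anything else."""
    m = SPEAKER.match(line)  # match() only tries the start of the line
    return (m.group(1), m.group(2)) if m else None

lines = [
    "Mr. Paul (speaks in English): Text.",
    "Mr. Hernandez Gabriel (speaks in Spanish): Text.",
    "Mr. Jenchewkow (speaks in Russian, translation provided): Text.",
    "Mr. Puk once said: ‚Hello‘ and I want to second that.",
    "Mr. Peter thought it acceptable that: Some text.",
    "Mr. Paul: Hello!",
]

for line in lines:
    print(parse_speaker(line))
```

The two quoting lines ("Puk once said", "Peter thought it acceptable that") come back as None, because the lookahead requires the name to be followed directly by an opening parenthesis or a colon.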
I've tried to follow a tutorial to add a comment for Beyond Compare but I am still unable to mark the commented lines as unimportant differences. I would like to compare R files. This is how I configured the grammar Rules.
If possible I would like to ignore the commented line only if the content of the line is equal. In other words if by removing the comment the two lines would actually differ I would still like to have them marked as important differences.
Here is the actual result of the comparison. Strangely, when there are two comment symbols (#) the line appears as a minor difference.
Beyond Compare doesn't support what you're trying to do. The comparison for each character checks both the character itself and the grammar type of the element. For example, comparing an identifier to a string will always show the characters as completely different even if the strings themselves are identical.
In your example, since they're different grammar types, every character is considered a difference. On the left they're comments, so unimportant and normally drawn as blue differences, but you're ignoring unimportant differences so they're shown as matching/black instead. On the right, they're important text, so they're drawn as red differences.
The lines that are comments on both sides are showing as matching because (A) they're all the same character and grammar type, so, aside from the # leading character, they are treated as matches, and (B) you're ignoring unimportant differences. (B) means that you could actually have anything for the content of the comments on each side and it would still show up as matching.
Should be validating 1-280 input characters, but it hangs when more than 280 characters are input.
Clarification
I am using the above regex to validate the length of input string to be 280 characters maximum.
I am using asp:RegularExpressionValidator to do that.
There's nothing “wrong” with it per se, but it's horrendous because with most RE engines (you don't say which one you're using), when the first thing the engine tries doesn't match, it backtracks and tries loads of different possibilities (none of which can ever produce a match). So it's not a hang, but rather just a machine that's trying to execute around 2^280 operations to see if there's a match possible. Excuse me if I don't wait around for that!
Of course, it's theoretically possible for the RE compiler to merge the (.|\s) part of the RE into something it doesn't need to backtrack to deal with. Some RE engines do this (typically the more automata-theoretic ones) but many don't (the stack-based ones).
It is trying every possible combination of . and \s for each character trying to find a version of the pattern that matches the string.
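. already matches any character except a line break, so (.|\s) is almost entirely redundant: every whitespace character other than a line break is matched by both alternatives. Further, if you just want to check what the length of the string is, then just do that - why are you pulling out regexes?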
If you really want to use a regular expression, you could use ^.{1,280}$ (no space inside the braces) combined with the SingleLine option, so that the . metacharacter will match everything, including new lines (see here, Regular Expression API section).
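Both suggestions can be sketched as follows. This is Python, not ASP.NET: re.DOTALL is used here as the closest Python counterpart of the SingleLine option, and the function names are made up for illustration. Neither check has overlapping alternatives, so neither can blow up on long input.

```python
import re

MAX_LEN = 280

# One wildcard with a bounded repeat. DOTALL lets "." match line breaks too,
# so no (.|\s) alternation is needed.
LENGTH_RE = re.compile(r".{1,280}", re.DOTALL)

def valid_by_regex(text):
    """True if the whole text is between 1 and 280 characters long."""
    return LENGTH_RE.fullmatch(text) is not None

def valid_by_length(text):
    """The same check without any regex at all."""
    return 1 <= len(text) <= MAX_LEN

for candidate in ["", "hello", "two\nlines", "x" * 280, "x" * 281]:
    print(len(candidate), valid_by_regex(candidate), valid_by_length(candidate))
```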
I'd like to find patterns and sort them by number of occurrences on an HEX file I have.
I am not looking for some specific pattern, just to make some statistics of the occurrences happening there and sort them.
DB0DDAEEDAF7DAF5DB1FDB1DDB20DB1BDAFCDAFBDB1FDB18DB23DB06DB21DB15DB25DB1DDB2EDB36DB43DB59DB32DB28DB2ADB46DB6FDB32DB44DB40DB50DB87DBB0DBA1DBABDBA0DB9ADBA6DBACDBA0DB96DB95DBB7DBCFDBCBDBD6DB9CDBB5DB9DDB9FDBA3DB88DB89DB93DBA5DB9CDBC1DBC1DBC6DBC3DBC9DBB3DBB8DBB6DBC8DBA8DBB6DBA2DB98DBA9DBB9DBDBDBD5DBD9DBC3DB9BDBA2DB84DB83DB7DDB6BDB58DB4EDB42DB16DB0DDB01DB02DAFCDAE9DAE5DAD9DAE2DAB7DA9BDAA6DA9EDAAADAC9DACADAC4DA92DA90DA84DA89DA93DAA9DA8CDA7FDA62DA53DA6EDA
That's an excerpt of the HEX file, and as an example I'd like to get:
XX occurrences of BDBDBD
XX occurrences of B93D
Is there a way to mine the file to generate that output?
Sure. Use a sliding window to create the counts (The link is for Perl, but it seems general enough to understand the algorithm). Your patterns are named N-grams. You will have to limit the maximal pattern, though.
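A minimal sketch of the sliding-window idea in Python, assuming the whole file fits in memory as one string of hex characters. The function names and the default length limits are my own choices, not something taken from the linked article.

```python
from collections import Counter

def count_ngrams(hex_text, min_len=4, max_len=8):
    """Count every substring of hex_text whose length is in [min_len, max_len]."""
    counts = Counter()
    for n in range(min_len, max_len + 1):
        # Slide a window of width n across the text, one character at a time.
        for start in range(len(hex_text) - n + 1):
            counts[hex_text[start:start + n]] += 1
    return counts

def report(counts, top=10, min_count=2):
    """Format the most frequent patterns, most common first."""
    return [
        f"{count} occurrences of {pattern}"
        for pattern, count in counts.most_common(top)
        if count >= min_count
    ]

# Short slice of the excerpt from the question.
excerpt = "DB0DDAEEDAF7DAF5DB1FDB1DDB20DB1BDAFCDAFBDB1FDB18"
for line in report(count_ngrams(excerpt, min_len=4, max_len=6)):
    print(line)
```

The window moves one hex character at a time, so patterns that straddle byte boundaries are counted too. Stepping by two characters instead would restrict the counts to byte-aligned patterns.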
This is a pretty classic CS problem. The code in general is non-trivial to implement as it will require at least one full parse of the sequence, and depending on your efficiency and memory/processor constraints might require several. See here.
You will need to partition your input string in some way to ensure that you get a good subsequence across it.
If there is a specific problem we might be able to help more, but the general strategy is in the Wikipedia article above.
You can use Regular Expressions to make a pattern to search for.
The regex needed would be very simple. Just use the exact phrase you're searching for. Then there should be a regular expression function in the language you're using (you didn't specify) that can count the number of matches.
Use that to create a simple counter.
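For a pattern you already know, such a counter might look like this in Python (the language choice and the function names are assumptions, since the question does not name one). Note that a plain findall skips overlapping hits, so a lookahead variant is shown as well.

```python
import re

def count_non_overlapping(text, pattern):
    """Count matches of a literal pattern, resuming after each match."""
    return len(re.findall(re.escape(pattern), text))

def count_overlapping(text, pattern):
    """Count every position where the literal pattern starts."""
    # A lookahead consumes no characters, so matches may overlap.
    return len(re.findall(f"(?={re.escape(pattern)})", text))

fragment = "DBB9DBDBDBD5"  # taken from the excerpt in the question
print(count_non_overlapping(fragment, "DBDB"))
print(count_overlapping(fragment, "DBDB"))
```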