regex - Combining negative look-aheads with not-statements - r

Dear People of Stack Overflow,
While I would say that I get the job done with regular expressions most of the time, now I have a problem I cannot seem to grasp:
I have text files I need to parse (language being R, but that doesn’t seem to matter). Essentially these files are protocols of speakers and I want to extract some information. The speakers generally follow this pattern:
Mr. Paul (speaks in English): Text.
Mr. Hernandez Gabriel (speaks in Spanish): Text.
Mr. Jenchewkow (speaks in Russian, translation provided): Text.
The regex I use for these speakers is: ^(Mr\.)\s*([^\(]*?)\s*(|\(speaks.*?\)):\s*(.*)$
The problem occurs when these speakers quote somebody else or reference something like:
Mr. Puk once said: ‚Hello‘ and I want to second that.
Here, sometimes a mismatch occurs as the regex captures everything between „Mr.“ and the colon, parsing the second capture group as: „Puk once said“ and messing up the parsed document. Thus, I tried to exclude these matches with a negative lookahead, guessing the words that could occur between Mr. and the colon like „said“, „expressed“, etc.
However, a) I seem to be unable to combine the negative lookahead with the second capture group’s ([^\(]*?) and b) this approach doesn’t seem to be that universal, given that there are other mismatches like:
Mr. Peter thought it acceptable that: Some text.
So my question is twofold: How would I exclude matches that have a „said“, „expressed“, etc. after the ‚name‘? And secondly: Is there a better, more universal way of achieving this? I thought about limiting the number of words between „Mr.“ and the colon, but that doesn’t seem to solve the problem.
Thanks in advance!
Edit:
As a reaction to the very helpful answers up to this point, I should emphasize that
a) there are indeed people with multiple names in the data
and
b) there are speakers which are not followed by a "speaks in...". Thus, Mr\.\s*([^\(]*)\s\(speaks in [^\)]*\): doesn't match them. An example would be:
Mr. Paul: Hello!
The last one was an oversight on my part when giving the initial examples. Sorry!

I suggest the following more flexible but still anchored pattern:
Mr\.\s*([^\(]*)\s\(speaks in [^\)]*\):
Demo
Mr. acts as start anchor, and \s\(speaks in ... ): is used for the second part. The single \s is not absolutely required but the output becomes nicer.
Your updated requirement makes it hard to come up with a watertight solution. If there is only a limited number of speakers of the loose type, you could add them as separate cases like Mr. (Paul|Peter|Mary)(?=:) and then wrap everything up with:
(?|Mr\.\s*([^\(]*)\s\(speaks in [^\)]*\):|Mr. (Paul|Peter|Mary)(?=:))
If this is not enough you could add alternations for the cases where there is just a name (including a second first name):
(?|Mr\.\s*([^\(]*)\s\(speaks in [^\)]*\):|Mr. ([A-Z]\w+)(?=:)|Mr. ([A-Z]\w+ [A-Z]\w+)(?=:))
Demo2

This more generic regex would catch the name in each case and then any text after the colon:
^Mr\.?\s*([^\s]*)[^:]*:\s*(.+)$
Note I've put a question mark after the first period in case you occasionally have Mr without a . Remove the question mark if you always want the period matched. Also, you might consider setting case insensitivity again in case occasionally you have mr. And are there no women potentially speaking?
Forgot to say: this regex assumes there is only one surname. If you have something like "Mr. García Hernández said:" then the regex will need to be more complicated to find the name. This one will only match García in that case.
EDIT: In response to further info, I'd now write the Regex like this (in R syntax):
grepl("Mr\\.?\\s*([A-Z](?:[^\\s:]|\\s(?=[A-Z]))+)[^:]*:\\s*(.+)", subject, perl=TRUE);
The conditions for this to work are that Mr is always with a capital letter, and that names always begin with a capital letter in the ASCII range [A-Z] (otherwise how is the Regex going to know it's a name?). As a plain regex it looks like this (without the R syntax):
Mr\.?\s*([A-Z](?:[^\s:]|\s(?=[A-Z]))+)[^:]*:\s*(.+)
Note that I've removed the start-of-string ^ and end-of-string $ because it seems matching ^ and $ at the start and end of lines in a long string is not supported in R (3.1-3.4). Add them back if you're dealing with single strings. It seems that the dot doesn't match across lines in R either, so the last (.+) matches to the end of the line. You could get some false positives if a speaker says something like "As Mr. Hernández said...", though if there are no colons after that to the end of the line it should still work. This is where ^ at the start could help, so add it back if necessary.
This will match any number of surnames before the colon so long as they begin with [A-Z]. This also has to be run in case-sensitive mode. If you want an explanation of how it works, just ask, but maybe you follow anyway.
Output of above regex by numbered capturing groups:
Mr. Paul (speaks in English): Text. -> 1. Paul -> 2. Text.
Mr. Hernandez Gabriel Theodor (speaks in Spanish): Text. -> 1. Hernandez Gabriel Theodor -> 2. Text.
Mr. Jenchewkow (speaks in Russian, translation provided): Text. -> 1. Jenchewkow -> 2. Text.
Mr. Puk once said: ‚Hello‘ and I want to second that. -> 1. Puk -> 2. ‚Hello‘ and I want to second that.
Mr. Peter thought it acceptable that: Some text. -> 1. Peter -> 2. Some text.
Mr Paul: Hello! -> 1. Paul -> 2. Hello!
FURTHER EDIT:
OK, so to exclude anything that has text other than something in parentheses before the colon, you can do this:
Mr\.?\s*([A-Z](?:[^\s:]|\s(?=[A-Z]))+)(?=[\s]*[(:])[^:]*:\s*(.+)
You can try it out and change options here: https://regex101.com/r/YzHPa0/1 - have a look at the Match Information on right-hand side of that screen to see what the capture groups match.
Note that this needs to be case sensitive. If you want to specify the text that goes in the parentheses for even more selectivity you'll have to change [^:]* to (?:\s\(speaks\sin[^:]+)?.
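To see how the last pattern behaves on the lines from the question, here is a minimal sketch in Python rather than R (Python's re module supports the lookaheads used here). The helper name parse_speaker is made up for illustration, and the pattern itself is copied unchanged from this answer.

```python
import re

# Final pattern from the answer above, copied unchanged.
SPEAKER = re.compile(
    r"Mr\.?\s*([A-Z](?:[^\s:]|\s(?=[A-Z]))+)(?=[\s]*[(:])[^:]*:\s*(.+)"
)

# Sample lines taken from the question.
lines = [
    "Mr. Paul (speaks in English): Text.",
    "Mr. Hernandez Gabriel (speaks in Spanish): Text.",
    "Mr. Jenchewkow (speaks in Russian, translation provided): Text.",
    "Mr. Puk once said: ‚Hello‘ and I want to second that.",
    "Mr. Peter thought it acceptable that: Some text.",
    "Mr. Paul: Hello!",
]


def parse_speaker(line):
    """Return (name, text) for a speaker line, or None for anything else."""
    m = SPEAKER.search(line)
    return (m.group(1), m.group(2)) if m else None


results = [parse_speaker(line) for line in lines]
for line, result in zip(lines, results):
    print(result, "<-", line)
```

The two quoting lines ("Puk once said", "Peter thought it acceptable that") return None, because the lookahead only accepts optional whitespace followed by "(" or ":" directly after the name.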

Related

R: Regex for identifying numbers within HTML chunk

This is my first entry on Stack Overflow, so please be indulgent if my post is lacking in quality.
I want to learn some webscraping with R and started with a simple example --> Extracting a table from a Wikipedia site.
I managed to download the specific page and identified the HTML sections I am interested in:
<td style="text-align:right">511.000.000\n</td>
Now I want to extract the number from the table data by using regex. So I created a regex which, from my point of view, should match the structure of the number:
pattern<-"\\d*\\.\\d*\\.\\d*\\.\\d*\\."
I also tried other variations but none of them found the number within the HTML code. I wanted to keep the pattern open as the numbers might be hundreds, thousand, millions, billions.
My questions:
The number is within the HTML code. Might it be necessary to include some code for the non-number code (which should not be extracted)?
What would be the correct version for the pattern to identify the number correctly?
Thank you very much for your support!!
So many stars implies a lot of backtracking.
One point further, using \\d* would match more than 3 digits in any group and would also match a group with no digit.
Assuming your numbers are always integers, formatted using a . as thousand separator, you could use the following: \\d{1,3}(?:\\.\\d{3})* (note the usage of non-capturing group construct (?:...) - implying the use of perl = TRUE in arguments, as mentioned in Regular Expressions as used in R).
Look closely at your regex. You are assuming that the number will have 4 periods (\\.) in it, but in your own example there are only two periods. It's not going to match because while the asterisk marks \\d as optional (zero or more), the periods are not marked as optional. If you add a ? modifier after the 3rd and 4th period, you may find that your pattern starts matching.
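Both points can be checked quickly. The sketch below uses Python instead of R, purely for illustration, and the helper name extract_number is an assumption. It runs the original pattern and the suggested \d{1,3}(?:\.\d{3})* against the HTML fragment from the question.

```python
import re

# HTML fragment from the question.
html = '<td style="text-align:right">511.000.000\n</td>'

# Pattern from the question: it demands four literal periods.
original = re.compile(r"\d*\.\d*\.\d*\.\d*\.")

# Pattern suggested above: 1-3 digits, then any number of ".ddd" groups.
suggested = re.compile(r"\d{1,3}(?:\.\d{3})*")


def extract_number(text):
    """Return the first dot-separated integer in text as an int, or None."""
    m = suggested.search(text)
    if m is None:
        return None
    # Drop the thousand separators before converting.
    return int(m.group(0).replace(".", ""))


print(original.search(html))   # no match: only two periods in the number
print(extract_number(html))
```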

Parsing - Adding a capturing group

I am attempting to use a fairly complex REGEX expression (see REGEX101 demos below), which I amended slightly from one created by an expert on this site. It parses specific patterns of log events:
1EXE_IN1EXE_CO2CONTENT_ACCESS3CONTENT_ACCESS
These log sequences will always begin with a random selection of EXE_IN or EXE_CO events, preceded by sequence numbers. There can be any number of these, in any order. In this case, we just have two EXE events but there may be 200. Or 1. Note that there is a sequence number and we need to capture it.
The second part of the sequence will always be a series of digit-prefaced CONTENT_ACCESS events, again from 1 to infinity in length.
The following demo shows a working example and probably conveys the concept better than I can : Demo 1
It nicely captures a full match, sequence number, and event in separate groups.
I need to add a timestamp to the pattern (after the sequence number, with a preceding underscore), and then parse this event log e.g.
1_11/08/2014 23:03EXE_IN1_11/08/2014 23:03EXE_CO2_12/08/2014 09:17CONTENT_ACCESS3_13/08/2014 09:17CONTENT_ACCESS
I need to capture the timestamps as well.
I attempted to adjust the regex expression, with mixed results. Please see this demo: demo2
Ideally I'd like to see something like this for each event:
Match n
Full match 266-308 `2_12/08/2014 09:17CONTENT_ACCESS`
Group 1. 266-267 `2`
Group 2. 268-284 `12/08/2014 09:17`
Group 3. 284-308 `CONTENT_ACCESS`
I hope you can help me. REGEX101 pcre testing is sufficient (for the record, I am using perl-compatible str_match_all_perl function in R).
Many thanks in advance.
(\d+)_(.*?)(EXE_CO|EXE_IN|CONTENT_ACCESS)
https://regex101.com/r/EHHcKm/1
Due to comments it was changed to (?:\G(?!^)(?(?=\d+_\d{2}\/\d{2}\/\d{4}\s\d{2}\:\d{2}(?:EXE_CO|EXE_IN))(?<!\d_\d{2}\/\d{2}\/\d{4}\s\d{2}\:\d{2}CONTENT_ACCESS))|(?=(?:\d+_\d{2}\/\d{2}\/\d{4}\s\d{2}\:\d{2}(?:EXE_CO|EXE_IN))+(?:\d+_\d{2}\/\d{2}\/\d{4}\s\d{2}\:\d{2}CONTENT_ACCESS)+))(\d+)_(\d{2}\/\d{2}\/\d{4}\s\d{2}\:\d{2})(EXE_CO|EXE_IN|CONTENT_ACCESS)
https://regex101.com/r/EHHcKm/3
And also another version, which is shorter
(?:\G(?!^)(?(?=\d+_.{16}(?:EXE_CO|EXE_IN))(?<!\d_.{16}CONTENT_ACCESS))|(?=(?:\d+_.{16}(?:EXE_CO|EXE_IN))+(?:\d+_.{16}CONTENT_ACCESS)+))(\d+)_(.{16})(EXE_CO|EXE_IN|CONTENT_ACCESS)
https://regex101.com/r/EHHcKm/4
And even shorter (?:\G(?!^)(?(?=\d+_.{16}E)(?<!S))|(?=(?:\d+_.{16}(?:EXE_CO|EXE_IN))+\d+_.{16}C))(\d+)_(.{16})(EXE_CO|EXE_IN|CONTENT_ACCESS)
https://regex101.com/r/EHHcKm/5
And super short (?:\G|(?=\d+_.{16}E.*CON))(\d+)_(.*?)(EXE_CO|EXE_IN|CONTENT_ACCESS)
https://regex101.com/r/EHHcKm/8
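As a rough illustration of the three capture groups, here is a sketch in Python (not the R/PCRE setup from the question). It combines the simple three-group pattern with the explicit timestamp sub-pattern from the longer version. The \G and lookaround-conditional parts are left out, since Python's re module does not support them, so this sketch extracts events but does not validate the EXE-then-CONTENT_ACCESS ordering.

```python
import re

# Event log from the question, split across two literals for readability.
log = (
    "1_11/08/2014 23:03EXE_IN1_11/08/2014 23:03EXE_CO"
    "2_12/08/2014 09:17CONTENT_ACCESS3_13/08/2014 09:17CONTENT_ACCESS"
)

# Group 1: sequence number, group 2: timestamp, group 3: event name.
EVENT = re.compile(
    r"(\d+)_(\d{2}/\d{2}/\d{4}\s\d{2}:\d{2})(EXE_CO|EXE_IN|CONTENT_ACCESS)"
)

# findall returns one (sequence, timestamp, event) tuple per match.
events = EVENT.findall(log)
for seq, timestamp, name in events:
    print(seq, timestamp, name)
```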

Regular expression for x number of digits and only one hyphen?

I made the following regex:
(\d{5}|\d-\d{4}|\d{2}-\d{3}|\d{3}-\d{2}|\d{4}-\d)
And it seems to work. That is, it will match a 5 digit number or a 5 digit number with only 1 hyphen in it, but the hyphen can not be the lead or the end.
I would like a similar regex, but for a 25 digit number. If I use the same tactic as above, the regex will be very long.
Can anyone suggest a simpler regex?
Additional Notes:
I'm putting this regex into an XML file which is to be consumed by an ASP.NET application. I don't have access to the .NET backend code, but I suspect they would do something like this:
Match match = Regex.Match("Something goes here", "my regex", RegexOptions.None);
You need to use a lookahead:
^(?:\d{25}|(?=\d+-\d+$)[\d\-]{26})$
Explanation:
Either it's \d{25} from start to end, 25 digits.
Or: it is 26 characters of [\d\-] (digits or hyphen) AND it matched \d+-\d+ - meaning it has exactly one hyphen in the middle.
Working example with test cases
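A few test strings make the behaviour concrete. This is a sketch in Python rather than .NET, and the helper name is_valid is invented for the example. The pattern is the one given above, which Python's re module accepts unchanged.

```python
import re

# Pattern from the answer above: 25 digits, or 26 characters of digits and
# hyphens that (per the lookahead) contain exactly one inner hyphen.
ID_25 = re.compile(r"^(?:\d{25}|(?=\d+-\d+$)[\d\-]{26})$")


def is_valid(s):
    """True if s is 25 digits with at most one hyphen, not leading or trailing."""
    return ID_25.match(s) is not None


print(is_valid("1" * 25))                   # plain 25 digits
print(is_valid("1" * 12 + "-" + "2" * 13))  # one hyphen in the middle
print(is_valid("-" + "1" * 25))             # leading hyphen
print(is_valid("1" * 26))                   # one digit too many
```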
You could use this regex:
^[0-9](?:(?=[0-9]*-[0-9]*$)[0-9-]{24}|[0-9]{23})[0-9]$
The lookahead makes sure there's only 1 dash and the character class makes sure there are 23 numbers between the first and the last. Might be made shorter though I think.
EDIT: A 'bit' shorter xP
^(?:[0-9]{25}|(?=[^-]+-[^-]+$)[0-9-]{26})$
A bit similar to Kobi's though, I admit.
If you aren't fussy about the length at all (i.e. you only want a string of digits with an optional hyphen) you could use:
([\d]+-[\d]+){1}|\d
(You may want to add line/word boundaries to this, depending on your circumstances)
If you need to have a specific length of match, this pattern doesn't really work. Kobi's answer is probably a better fit for you.
I think the fastest way is to do a simple match and then add up the lengths of the capture buffers. Why attempt math in a regex? It makes no sense.
^(\d+)-?(\d+)$
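A minimal sketch of that idea, written in Python for illustration (the question's backend is .NET, and the function name has_n_digits is made up here): match with the simple pattern, then sum the lengths of the two captured groups.

```python
import re

# Simple pattern from above: digits, an optional hyphen, digits.
PARTS = re.compile(r"^(\d+)-?(\d+)$")


def has_n_digits(s, n=25):
    """True if s is n digits with an optional single inner hyphen."""
    m = PARTS.match(s)
    # The length check happens in code, not in the regex.
    return m is not None and len(m.group(1)) + len(m.group(2)) == n


print(has_n_digits("1" * 25))
print(has_n_digits("1" * 20 + "-" + "1" * 5))
print(has_n_digits("1" * 26))
```

This only helps when you control the calling code. In the question's setup, where only the pattern can be supplied, a regex-only answer is still needed.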
This will match 25 digits with exactly one hyphen in the middle. Note that, as written, it also accepts a run of 26 digits, because . matches any character:
^(?=(-*\d){25})\d.{24}\d$

Rhyme Dictionary from CMU pronunciation database

I'm looking for a free or open source rhyming database.
I've found the CMU pronunciation "database" and its series of apps but I can't make sense of them or figure out where the data's coming from.
A simple text file with the word and its phonemes is all I need.
Does anybody here know where I'd find one or where I would begin to derive such a list from the CMU files?
cmudict
The cmudict is a text file and its format is really simple. First, the word is listed. Then, there are two spaces. Everything following the two spaces is the pronunciation. Where a word may have two different ways of being spoken you will see two entries for the word like
word
word(1)
At the beginning of the file they've listed symbols and punctuation. The symbol is followed by the English spelling of said symbol's name with no space between them. This is then followed by the two-space divider and the ARPAbet code. Since you're only looking for rhymes you don't have to do anything special with the symbols section since you're never going to be looking for a rhyme to ...ELLIPSIS
ARPAbet
The information about how ARPAbet codes map to IPA is listed in wikipedia http://en.wikipedia.org/wiki/Arpabet and each mapping shows example words. It's pretty easy to see how the two relate to one another and that may help you to understand how to read the ARPAbet codes if you are familiar with IPA.
Summary
Basically, if you've already found the cmudict then you've already got what you asked for: a database of words and their pronunciations. To find words that rhyme you'll have to parse the flat file into a table and run a query to find words that end with the same ARPAbet code.
General Theory of Doing Stuff to Things
Part: Stuff
create a new database
create a table in the database with three fields: index, word, arpabet
read the cmudict file line by line
for each line split it into two parts where two consecutive spaces are found AND
increment the index count, then insert the index number, word, and arpabet code
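The steps in "Part: Stuff" could look roughly like the following sketch. It uses Python with the standard sqlite3 module and an in-memory database. The table name words, the column name idx (index is an SQL keyword), the function name load_cmudict, and the assumption that comment lines start with ";;;" are all choices made for this example. The sample lines stand in for the real cmudict file.

```python
import sqlite3

# Stand-in for the real file; a real run would iterate over open("cmudict-0.7b").
sample_lines = [
    ";;; comment line (assumed to start with ';;;')",
    "LOVE  L AH1 V",
    "ABOVE  AH0 B AH1 V",
    "DENEUVE  D IH0 N AH1 V",
    "DENEUVE(1)  D IY0 N AH1 V",
]


def load_cmudict(lines, conn):
    """Create the table and insert one row per dictionary line."""
    conn.execute(
        "CREATE TABLE words (idx INTEGER PRIMARY KEY, word TEXT, arpabet TEXT)"
    )
    index = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith(";;;"):
            continue  # skip blank and comment lines
        # Split at the first run of two consecutive spaces.
        word, sep, arpabet = line.partition("  ")
        if not sep:
            continue  # not a "word<two spaces>pronunciation" line
        index += 1
        conn.execute(
            "INSERT INTO words VALUES (?, ?, ?)", (index, word, arpabet.strip())
        )
    conn.commit()
    return index


conn = sqlite3.connect(":memory:")
count = load_cmudict(sample_lines, conn)
rows = conn.execute("SELECT idx, word, arpabet FROM words ORDER BY idx").fetchall()
print(count, rows)
```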
Then Umm...
Once you've got the data into whatever kind of database you choose, you can then use that database to find correlations between the arpabet codes. You could find rhymes, consonance, assonance, and other mnemonic devices. It would go something like
Part: Thing
get a word you want to find a rhyme for
query the database for the arpabet equivalent of the word
split the arpabet code into pieces by breaking it up everywhere there is a space
take the last piece of the code and query the database for words whose arpabet codes end with that piece
Do fancy things with the rhymes
Shortcuts and Spoilers
I got bored and wrote a Node.js module that covers "Part: Stuff" listed above. If you've got Node.js installed on your machine you can get the module by running npm install cmudict-to-sqlite See https://npmjs.org/package/cmudict-to-sqlite for the README or just look in the module for docs.
Rhyme Logic using CMU Pronouncing Dictionary
OK. Suppose you want to use CMU Pronouncing Dictionary data (example file: cmudict-0.7b) to build a list of all the words that rhyme with "LOVE".
Here's how you might do it:
First, you need to learn the pronunciation of "LOVE". You'll find this line in the dictionary, where "LOVE" and "L AH1 V" are separated by two spaces:
LOVE  L AH1 V
This is saying that the word LOVE is pronounced like L AH1 V.
Then, find the vowel phoneme that has primary stress. In other words, look for the number "1" in that pronunciation. The text directly to the left of the 1 is the vowel sound that has primary stress (AH). That text, and everything to the right of it are your "rhyme phonemes" (for the lack of a better term). So the rhyme phonemes for LOVE are AH1 V.
We're half done! Now we just have to find other words whose pronunciations end with AH1 V. If you're playing along in Notepad++, try a Find All In Current Document for pattern AH1 V$ using Search Mode of "Regular expression". This will match lines like:
Line 392: ABOVE  AH0 B AH1 V
Line 10266: BELOVE  B IH0 L AH1 V
Line 30204: DENEUVE  D IH0 N AH1 V
Line 30205: DENEUVE(1)  D IY0 N AH1 V
Line 34064: DOVE  D AH1 V
Line 48177: GLOVE  G L AH1 V
Line 49053: GOV  G AH1 V
... etc
Rhyming woooooords!
There are plenty of ways to implement this, and plenty of corner cases, but this is roughly the approach that many electronic rhyming dictionaries appear to take when finding perfect rhymes.
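The "rhyme phonemes" idea can be sketched in a few lines of Python. The function names and the small entries dictionary are made up for this illustration, with pronunciations copied from the examples in this answer. When a pronunciation has more than one primary stress, this sketch arbitrarily uses the last one, and it returns None when there is none.

```python
# Hypothetical mini-dictionary; pronunciations are the ones quoted in this answer.
entries = {
    "LOVE": "L AH1 V",
    "ABOVE": "AH0 B AH1 V",
    "BELOVE": "B IH0 L AH1 V",
    "DOVE": "D AH1 V",
    "GLOVE": "G L AH1 V",
    "ACCREDIT": "AH0 K R EH2 D AH0 T",
}


def rhyme_phonemes(pronunciation):
    """Return the primary-stressed vowel and everything after it, or None."""
    phonemes = pronunciation.split()
    stressed = [i for i, p in enumerate(phonemes) if p.endswith("1")]
    if not stressed:
        return None  # no primary stress marked
    # Assumption: with several primary stresses, use the last one.
    return " ".join(phonemes[stressed[-1]:])


def rhymes_with(word, entries):
    """Return the other words whose rhyme phonemes equal those of word."""
    target = rhyme_phonemes(entries[word])
    if target is None:
        return []
    return sorted(
        other
        for other, pron in entries.items()
        if other != word and rhyme_phonemes(pron) == target
    )


print(rhyme_phonemes(entries["LOVE"]))
print(rhymes_with("LOVE", entries))
```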
Hypothetical SQL approach to storing rhyme data
Obviously, performance will be a problem if you just scan the dictionary every time someone wants a rhyme. If that's a concern, you might try storing or indexing the data differently.
Although it's not the most efficient on disk space, I've had a good experience storing this stuff in a SQL table with indexed columns.
For a simple conceptual example, you could compute the "rhyme phonemes" of all words in the dictionary, then insert them into a "Rhymes" table whose columns are { WordText, RhymePhonemes }. For example, you might see records like:
{"ABOVE", "AH1 V"}
{"DOVE", "AH1 V"}
{"OUTLIVE", "IH1 V"}
{"GRADUATE", "AE1 JH AH0 W AH0 T"}
{"GRADUATE", "AE1 JH AH0 W EY2 T"}
... etc
Then, to find rhymes, you'd issue a query like:
SELECT OTHER.WordText
FROM Rhymes INPUT
INNER JOIN Rhymes OTHER ON OTHER.RhymePhonemes = INPUT.RhymePhonemes
WHERE INPUT.WordText = 'LOVE' AND
OTHER.WordText <> INPUT.WordText
ORDER BY OTHER.WordText
This also comes in handy if you're planning on printing a dictionary where all similar-sounding words are grouped together.
There are of course plenty of other ways to store/search the data of varying trade-offs, but hopefully this gets you started.
I've also had some luck storing the raw pronunciation in the database in varying "full" formats (forward and reversed strings of the pronunciation, with stress marks and without stress marks, etc) but not "chopped" into specific pieces like a rhyme-phoneme column.
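The Rhymes table and self-join described above can be tried out with Python's built-in sqlite3 module. This is only a sketch under a few assumptions: an in-memory SQLite database, lowercase table aliases inp and oth in place of INPUT and OTHER, a made-up index name, and a helper find_rhymes that upper-cases the input because the sample rows are stored in upper case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Rhymes (WordText TEXT, RhymePhonemes TEXT)")
# Index the column used in the join so lookups do not scan the whole table.
conn.execute("CREATE INDEX idx_rhymes_phonemes ON Rhymes (RhymePhonemes)")
conn.executemany(
    "INSERT INTO Rhymes VALUES (?, ?)",
    [
        ("LOVE", "AH1 V"),
        ("ABOVE", "AH1 V"),
        ("DOVE", "AH1 V"),
        ("OUTLIVE", "IH1 V"),
        ("GRADUATE", "AE1 JH AH0 W AH0 T"),
        ("GRADUATE", "AE1 JH AH0 W EY2 T"),
    ],
)

# Same self-join as the query above, with a bound parameter for the word.
QUERY = """
SELECT oth.WordText
FROM Rhymes AS inp
INNER JOIN Rhymes AS oth ON oth.RhymePhonemes = inp.RhymePhonemes
WHERE inp.WordText = ? AND oth.WordText <> inp.WordText
ORDER BY oth.WordText
"""


def find_rhymes(word):
    """Return the words sharing rhyme phonemes with word, excluding word itself."""
    return [row[0] for row in conn.execute(QUERY, (word.upper(),))]


print(find_rhymes("love"))
```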
Gotchas
Again, the original explanation with "love" will absolutely get you in the ballpark of rhyming. However, along the way you'll probably run into other gotchas to consider. Here's a heads-up:
Some words have multiple pronunciations. In the CMU dictionary, the alternate pronunciations are marked with text like (1), (2), etc following the word as in GRADUATE(2). If someone wants a rhyme of these words, you have to decide between showing rhymes of ALL matched pronunciations, or having the user choose which pronunciation they really meant.
What do you do when the pronunciation has two or more "1"s? Pick the first one? Pick the last one? If you pick the last one, you'll find more rhymes, but it might not be the most natural choice of stress.
What do you do when the pronunciation has no "1"s? It doesn't happen a lot, but it happens, like: ACCREDIT AH0 K R EH2 D AH0 T and AIKIN EY0 K IH0 N. In this case I'd pick the next best stress (e.g. pick the 2, 3, 4, etc if the 1 is absent). If they're all 0's, I don't have any good advice.
Some pronunciations are missing. It's a great start, but it doesn't have all the words or spellings of words you might want. US spelling is preferred over UK spelling.
Some pronunciations are not what you'd expect, and you may want to prune. For example there's a pronunciation of "or" that sounds like "er".
You may want to compare the "rhyme phonemes" with stress marks removed. This only matters for words whose primary stress is not on the last vowel (so you don't see the problem on the "love" example).
I'm actively working on something like this right now, using the general approach suggested by Plate, and extending it. Here's my source code. Hope it helps!
You could always use http://www.rhymezone.com/, search for a word, and put its rhyme matches into a text file if you only need a small demo subset. If you want a full database of words, you could hook up a dictionary to a zombieJS UI automation, screen-scrape the words, and put them into your own database. This would allow you to create your own rhyme database, although to be honest, that's quite an undertaking for your original request.

Unnecessary asterisk in regex that finds CSS comment

I thought to ask this as an update to my previous similar question but it became too long.
I was trying to understand a regex given in w3.org that matches css comments and got this doubt
Why do they use
\/\*[^*]*\*+([^/*][^*]*\*+)*\/
----------------^
instead of just
\/\*[^*]*\*+([^/][^*]*\*+)*\/
?
Both are working similarly. Why do they have an extra star there?
Let's look at this part:
\*+([^/*][^*]*\*+)*
-A- --B--      -C-
The regex engine will parse the A part and match all the stars until there are NO MORE stars or there is a line break. So once A is done, the next character must be a line break or anything else that's not a star. Then why did they use [^/*] instead of [^/]?
Also look at the repeating capturing group.
([any one char that's not / or *][zero or more chars that's not *][one or more stars])
It captures groups of characters ending with at least one star. So C will take all the stars, leaving B with no stars to match in the next round.
So the B part won't get a chance to meet any stars at all. That is why I think there's no need to put a star there.
But that regex is in w3.org so I guess my understanding may be wrong. Please explain what I'm missing.
This has already been corrected in the CSS3 Syntax module:
\/\*[^*]*\*+([^/][^*]*\*+)*\/ /* ignore comments */
Notice that the extraneous asterisk is gone, making this expression identical to what you have.
So it would seem that it was simply a mistake on their part while writing the grammar for CSS2. I'm digging the mailing list archives to see if there's any discussion there that could be relevant.
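For a quick empirical check, the following sketch runs both expressions over a handful of hand-picked sample strings, using Python's re module. The sample list is an assumption chosen for this example, so agreement here is a sanity check rather than a proof of equivalence.

```python
import re

# CSS2 version (with the extra asterisk) and CSS3 version (without it).
CSS2 = re.compile(r"\/\*[^*]*\*+([^/*][^*]*\*+)*\/")
CSS3 = re.compile(r"\/\*[^*]*\*+([^/][^*]*\*+)*\/")

# Hand-picked samples: valid comments, then strings that are not one comment.
samples = [
    "/**/",
    "/***/",
    "/* a */",
    "/** doc **/",
    "/* a * b */",
    "/* a ** b ***/",
    "/* a */ b */",
    "/* unterminated",
    "/* line1\n * line2\n */",
]

# For each sample, record whether each pattern matches the whole string.
agreement = [
    (CSS2.fullmatch(s) is not None, CSS3.fullmatch(s) is not None)
    for s in samples
]
for s, (old, new) in zip(samples, agreement):
    print(repr(s), old, new)
```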
