Regular Expression in R: disentangling similar possibilities - r

I have a list of names looking like this
Noms<- c("André Coin", "XXXAndré Coin", "Gabriel Péri","Léon Blum", "XXXLéon Blum")
I am trying to write a function that finds every occurrence of each of these names in a very long text, but only at the beginning of a line, right after "M" or "Mme".
My text is a vector in which each line is an element.
So at the end, a line like "M. André Coin said bla bla" would be matched; but a line like "He said bla bla bla to M. André Coin" would NOT be matched.
The final requirement is that "André Coin" can be distinguished from "XXXAndré Coin".
The solution I have found for the moment is:
findpattern <- function(name, vect) {
  x <- paste0("^.{1,3}((M\\s*)|(Mme\\s*))*\\s*", name)
  found <- grepl(x, vect)
  return(found)
}
However, when I run findpattern(Noms,txt), it cannot distinguish "André Coin" from "XXXAndré Coin", meaning that findpattern("André Coin", "M. XXXAndré Coin") returns TRUE.
Can you help me find my mistake in the writing of my regular expression?

You've missed a dot after the M, and the first three characters must be optional (i.e. from 0 to 3 characters):
"^.{0,3}((M\\.\\s*)|(Mme\\s*))*\\s*"
If the dot after M is optional:
"^.{0,3}((M\\.?\\s*)|(Mme\\s*))*\\s*"
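As a minimal sketch, the second pattern can be dropped into the findpattern function from the question. The three test lines are made up for illustration, and the names are assumed to contain no regex metacharacters.

```r
# Sketch: findpattern() from the question, using the prefix with an optional dot.
findpattern <- function(name, vect) {
  pattern <- paste0("^.{0,3}((M\\.?\\s*)|(Mme\\s*))*\\s*", name)
  grepl(pattern, vect)
}

# Hypothetical lines, one element per line of text.
txt <- c("M. André Coin said bla bla",
         "He said bla bla bla to M. André Coin",
         "M. XXXAndré Coin said bla bla")

findpattern("André Coin", txt)
#> [1]  TRUE FALSE FALSE
```

Because the leading .{0,3} accepts any characters, a line that starts directly with "XXXAndré Coin" (no "M." in front) would still match "André Coin"; dropping .{0,3} would be needed if such lines can occur.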

Related

Is there a way in R to separate sentences where whitespace is missing, i.e. "Sentence one.Sentence two"?

I've scraped chunks of text from XML files that are often missing whitespace between sentences. I've used str_split with great success to break the chunks into digestible sentences, like below:
list_of_strings <- str_split(chunk_of_text, pattern=boundary("sentence"))
This works pretty well, but it can't deal with situations where the terminal period is not followed by a space. For example, "This sentence ends.This sentence continues." It returns this as 1 sentence, not two.
Using str_split with pattern=boundary("sentence") doesn't work.
If I search and replace periods with period-space, of course that screws up numbers like 1.5 pounds.
I've explored using wildcards to detect the situation, e.g.,
str_view_all(x, "[[:alpha:]]\\.[[:alpha:]]")
but I can't figure out how to either 1) insert a space after the period so a subsequent call to str_split works correctly, or 2) split at the period.
Any advice on separating sentences when this occurs?
Newbie R programmer here, thanks for your help!
library(stringr)
x <- "This sentence ends.This sentence continues. It costs 1.5 pounds.They needed it A.S.A.P.Here's one more sentence."
str_split(x, "\\.\\s?(?=[A-Z][^\\.])")
[[1]]
[1] "This sentence ends" "This sentence continues"
[3] "It costs 1.5 pounds" "They needed it A.S.A.P"
[5] "Here's one more sentence."
Explanation:
"\\. # literal period
\\s? # optional whitespace
(?=[A-Z] # followed by a capital letter
[^\\.])" # which isn’t followed by another period
Also note this doesn’t account for every possibility. For instance, it’ll erroneously split after "Dr." for "Dr. Perez is on call.". You could handle that case by adding a negative lookbehind:
"(?<!Dr|Mr|Mrs|Ms|Mx)\\.\\s?(?=[A-Z][^\\.])"
But the specific contents, and other edge cases to handle, will depend on your data.
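For option 1 in the question (inserting a space so that a later sentence split works), one possible sketch reuses the same lookahead. It swaps in base R's gsub with perl = TRUE instead of stringr, and the variable name fixed is only illustrative.

```r
x <- "This sentence ends.This sentence continues. It costs 1.5 pounds.They needed it A.S.A.P.Here's one more sentence."

# Add a space after a period that is directly followed by a capital letter
# which is itself not followed by another period (so "1.5" and "A.S.A.P" survive).
fixed <- gsub("\\.(?=[A-Z][^.])", ". ", x, perl = TRUE)
fixed
#> [1] "This sentence ends. This sentence continues. It costs 1.5 pounds. They needed it A.S.A.P. Here's one more sentence."
```

The repaired string could then be passed to str_split(fixed, boundary("sentence")) as in the question. The same edge cases as above (abbreviations such as "Dr.") still apply.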

An alternative to scan() in R (here-document style)

I am looking for a way to read text into a vector such that each line would be a different element, all happening within an R script.
One way that I found was something like:
bla <- scan(text = "line1
line2
line3",
what = character())
Which correctly gives me:
> bla
[1] "line1" "line2" "line3"
However, there are several problems. First, it is indented. I don't have to indent it, but any auto-indentation feature (which I commonly use) will just pop it back into alignment. Second, this requires escape codes if I want to use the double quote symbol, for example.
Is there a way to do something similar to the Here-Document method (<< EOF), in R scripts?
I am using RStudio as my IDE, running on Windows. Preferably there would be a platform independent way of doing this.
EDIT
Do you need to have the text inside the R script?
Yes.
An example of what I want to do:
R script here
⋮
bla <- <SOMETHING - BEGIN>
line1
line2
line3
<SOMETHING - END>
⋮
more R script here
Where the requirement, again, is that I can type freely without worrying about auto indentation moving the lines forward, and no need to worry about escape codes when typing things like ".
Both problems can be solved with the scan function and two little tricks, I think:
scan(text = '
line1
"line2" uses quotation mark
line3
', what = character(), sep = "\n")
Read 3 items
[1] "line1" "\"line2\" uses quotation mark"
[3] "line3"
When you put the quotation marks in a line of their own, you don't have a problem with auto indentation (tested using RStudio). If you only have double quotation marks in the text, you can use single quotation marks to start and end your character object. If you have single quotation marks in the text, use double quotation marks for character. If you have both, you should probably use search and replace to make them uniform.
I also added sep = "\n", so every line is one element of the resulting character vector.
Since R version 4.0, we have raw strings (See ?Quotes)
bla <- r"(line1
line2
"line3"
'line4'
 Here is indentation
Here is a backslash \
)"
#> [1] "line1\nline2\n\"line3\"\n'line4'\n Here is indentation\nHere is a backslash \\\n"
Note though it gives one single string, not separate elements. We can split it back with strsplit:
bla <- strsplit(bla, "\n")[[1]]
#> [1] "line1"
#> [2] "line2"
#> [3] "\"line3\""
#> [4] "'line4'"
#> [5] " Here is indentation"
#> [6] "Here is a backslash \\"
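The two steps can be wrapped in a small helper so the text starts on its own line. This is only a sketch: here_doc is a hypothetical name, not an existing R function, and it assumes R 4.0 or later for the raw string.

```r
# Hypothetical helper: split a (raw) string into lines, here-document style.
here_doc <- function(text) {
  lines <- strsplit(text, "\n", fixed = TRUE)[[1]]
  # Drop the empty first element produced by the newline after the opening delimiter.
  if (length(lines) > 0 && lines[1] == "") lines <- lines[-1]
  lines
}

bla <- here_doc(r"(
line1
"line2"
'line3'
)")
bla
#> [1] "line1"     "\"line2\"" "'line3'"
```

strsplit already drops the empty piece after the final newline, so only the leading one needs handling.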
If authoring an Rmarkdown document instead of an R script is an option, we could use the knitr cat engine
---
title: "Untitled"
output: html_document
---
```{cat engine.opts=list(file='foo')}
line1
line2
"line3"
'line4'
```
```{r}
bla <- readLines("foo")
bla
```

Names string preparation for sex impute

I'm new at R and I need to prepare a column of names and then impute sex, but I'm having some problems with the preparation of the strings, specifically this is an example of what I have:
Name example:
"alberto eduardo etchegaray de la cerda ."
What I need to do is eliminate all the "de" "del" "lo" "los" "la" "las" "double white spaces" "end of string white spaces" and everything that is interfering with the names.
My code so far to clean the string is (in a second line I will eliminate the spaces):
str_replace_all('alberto eduardo etchegaray de la cerda',
'\\bdel*\\b|\\blos*\\b|\\blas*\\b|.$',
replacement=" ")
and the result:
"alberto eduardo etchegaray cerd "
The problem is that I'm getting some words cut when I need them complete.
Use this regular expression:
str_replace_all(name,'\\b(del?|los?|las?)\\b|\\.',replacement=" ")
Result:
"alberto eduardo etchegaray cerda "
You could also use the following regexp to avoid inserting double spaces:
str_replace_all(name,'\\s?\\b(del?|los?|las?)\\b|\\.',replacement="")
Result:
"alberto eduardo etchegaray cerda "
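To also finish the cleanup the question mentions (double spaces and trailing whitespace), a possible sketch in base R follows. It uses gsub with perl = TRUE and trimws in place of the stringr calls; the variable names are only illustrative.

```r
name <- "alberto eduardo etchegaray de la cerda ."

# Step 1: replace the particles and any literal period with a space.
cleaned <- gsub("\\b(del?|los?|las?)\\b|\\.", " ", name, perl = TRUE)

# Step 2: collapse runs of whitespace, then trim both ends.
cleaned <- trimws(gsub("\\s+", " ", cleaned, perl = TRUE))
cleaned
#> [1] "alberto eduardo etchegaray cerda"
```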
Others have given you better regular expressions to use, but did not explain why yours changed "cerda" to "cerd ". (I would recommend using the one by R. Schifini, as it is pretty clear.)
The problem with your regular expression is the .$ at the end. It tells the function that, if none of the other alternatives match, any single character followed by the end of the string should be replaced with the space. In your first example string there is a final ., but in the string that you pass to str_replace_all the final character is the "a" in "cerda", and that is what gets replaced. I expect that what you really want is to replace a literal . at the end of the string, so you need \\.$ or [.]$: the unescaped . is a special character that matches any single character (except a newline in some cases).
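A quick way to see the difference, sketched with base R's sub rather than str_replace_all:

```r
name <- "alberto eduardo etchegaray de la cerda"

# Unescaped dot: matches ANY last character, so the final "a" is replaced.
unescaped <- sub(".$", " ", name)
unescaped
#> [1] "alberto eduardo etchegaray de la cerd "

# Escaped dot: only a literal period at the end matches, so nothing changes here.
escaped <- sub("\\.$", " ", name)
escaped
#> [1] "alberto eduardo etchegaray de la cerda"
```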

RegEx. How to remove blank space after a period before a punctuation character

I have a question on regex. Suppose i have this string
"She gained about 55 pounds in...9 months. She was like an eating machine. ”Trump, a man who wants to be president: "
I want to remove every blank space after period and before character ” and delete character ”
For example this part of sentence
She was like an eating machine. ”Trump, a man who wants to be president:
should become
She was like an eating machine.Trump, a man who wants to be president: "
Thanks guys, regex is not easy to learn. I appreciate any help!
P.S. I'm using R, but I think it's irrelevant since regex works in every programming language.
UPDATE
I solved my problem and I'd like to share it; maybe it could help someone else. I have this dataset, downloaded from Kaggle, about Trump and Hillary tweets.
I have to do some cleaning before importing the data into Knime (a university project).
I solved all the encoding issues through gsub except this one. I finally managed to solve it by writing a CSV file in R with UTF-8 encoding and then reading that file in Knime with the same encoding.
If you need to match any number of whitespaces (1 or more) between a dot and the curly double quote, you may use
x <- "She gained about 55 pounds in...9 months. She was like an eating machine. ”Trump, a man who wants to be president: "
gsub("\\.\\s+”", ".", x)
## => [1] "She gained about 55 pounds in...9 months. She was like an eating machine.Trump, a man who wants to be president: "
The \\. matches a dot, \\s+ matches 1 or more whitespace symbols and ” matches a ”.
If there is only 1 regular space between the dot and the quote, you may use a fixed string replacement:
gsub(". ”", ".", x, fixed=TRUE)
Maybe this could help:
var str = 'She was like an eating machine. "Trump, a man who wants to be president. "New value';
str.replace(/\.\s"/g,".");
http://regexr.com/ is a great tool for learning and testing regular expressions.
The only thing I'd add to Wiktor's answer is that it won't match "machine.”Trump". To match any number of spaces after a dot and before a quote, use the * quantifier:
x <- "She gained about 55 pounds in...9 months. She was like an eating machine. ”Trump, a man who wants to be president: "
gsub("\\.\\s*”", ".", x)
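A small side-by-side sketch of the two quantifiers, on three made-up fragments with one, zero, and several spaces before the quote:

```r
x <- c("machine. ”Trump", "machine.”Trump", "machine.   ”Trump")

# \\s+ needs at least one whitespace character, so the second element is left alone.
plus <- gsub("\\.\\s+”", ".", x)
plus
#> [1] "machine.Trump"  "machine.”Trump" "machine.Trump"

# \\s* also accepts zero whitespace, so all three are cleaned.
star <- gsub("\\.\\s*”", ".", x)
star
#> [1] "machine.Trump" "machine.Trump" "machine.Trump"
```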

Regular expression whole sentence

I have problem with regular expression. Here is an example text:
"Status: matched: 10:36:08 09/03/2013 from=0.0.0.0:162 oid=1.3.6.1.4.1.11536.3.6.1000 trap= n/a specific= n/a traptime=60 days, 17:39:10.0 community=Cyber-Ark agent=192.118.37.30 version= v2c var1=italog var2= var3=03/09/2013 10:35:37 ITATS426E Safe oniya_gemel is out of space.__"
Which regular expression should I use to match everything after “var3 + out of space”. I need the whole sentence as match: “var3=03/09/2013 10:35:37 ITATS426E Safe oniya_gemel is out of space.__”
I have a regular expression tool and used
/(var3=)*(out of space)/
so far, but it matches only “out of space”.
Any input would be much appreciated!
Thank you in advance!!!!
Vesec
You need to specify that there are characters between var3= and out of space.
/(var3=).*(out of space)/
The star states "previous element repeated 0 or more times". So your regexp looked for 0 or more repetitions of "var3=" immediately followed by "out of space". I.e. the star affected the "var3=" part, rather than indicating that there should be characters between that and "out of space".
"." in regexp matches any character, so my proposal states "var3=", followed by 0 or more characters, followed by "out of space".
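The question does not say which tool is used, so the following is only a sketch in R (an assumption) on a shortened copy of the example text. It compares what the two patterns actually capture.

```r
txt <- "version= v2c var1=italog var2= var3=03/09/2013 10:35:37 ITATS426E Safe oniya_gemel is out of space.__"

# Original pattern: the star repeats "var3=" zero times, so only the tail matches.
before <- regmatches(txt, regexpr("(var3=)*(out of space)", txt))
before
#> [1] "out of space"

# Corrected pattern: ".*" spans everything between the two parts.
after <- regmatches(txt, regexpr("(var3=).*(out of space)", txt))
after
#> [1] "var3=03/09/2013 10:35:37 ITATS426E Safe oniya_gemel is out of space"
```

If the trailing ".__" should be part of the match as well, appending .* to the corrected pattern would extend it to the end of the line.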
