R - How to split text and punctuation with an exception?

I am analysing Facebook comments in R for sentiment analysis. Emojis are encoded in the text between <> symbols.
Example:
"Jesus te ama!!! <U+2764> Ou não...?<U+1F628> (fé em stand by)"
<U+2764> and <U+1F628> are emojis (heavy black heart and fearful face,
respectively).
So, I need to split words/numbers from punctuation/symbols, except inside emoji codes.
I tried this, using the gsub function:
a1 <- "([[:alpha:]])([[:punct:]])"
a2 <- "([[:punct:]])([[:alpha:]])"
b <- "\\1 \\2"
gsub(a1, b, gsub(a2, b, "Jesus te ama!!! <U+2764> Ou não...?<U+1F628> (fé em stand by)"))
...but the result, logically, also affects the emoji codes:
[1] "Jesus te ama !!! < U +2764> Ou não ...?< U +1F628> ( fé em stand by )"
The objective is to create an exception for the text between <>: split around it, but not inside it, i.e.:
[1] "Jesus te ama !!! <U+2764> Ou não ...? <U+1F628> ( fé em stand by )"
Note that:
sometimes there is no space between the sentence/word/punctuation and an emoji code (it needs to be created)
It is required that a punctuation sequence stays joined (e.g. "!!!", "...?")
How can I do it?

You may use the following regex solution:
a1 <- "(?<=<)U\\+\\w+>(*SKIP)(*F)|(?<=\\S)(?=<U\\+\\w+>)|(?<=[[:alpha:]])(?=[[:punct:]])|(?<=[[:punct:]])(?=[[:alpha:]])"
gsub(a1, " ", "Jesus te ama!!! <U+2764> Ou não...?<U+1F628> (fé em stand by)", perl=TRUE)
# => [1] "Jesus te ama !!! <U+2764> Ou não ...? <U+1F628> ( fé em stand by )"
See the online R demo
This PCRE regex (see perl=TRUE argument in the call to gsub) matches:
(?<=<)U\\+\\w+>(*SKIP)(*F) - U+, one or more word chars and > when immediately preceded by <; the PCRE verbs (*SKIP)(*F) discard this match, and the search for the next match resumes from the end of it
| - or
(?<=\\S)(?=<U\\+\\w+>) - a non-whitespace char must be present immediately to the left of the current location, and a <U+, 1+ word chars and > must be present immediately to the right of the current location
| - or
(?<=[[:alpha:]])(?=[[:punct:]]) - a letter must be present immediately to the left of the current location, and a punctuation must be present immediately to the right of the current location
| - or
(?<=[[:punct:]])(?=[[:alpha:]]) - a punctuation must be present immediately to the left of the current location, and a letter must be present immediately to the right of the current location
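To make the pattern easier to reuse, it can be wrapped in a small helper. This is only a minimal sketch, assuming the emoji codes always have the <U+HEX> shape; the function name split_keep_emoji is hypothetical and not part of the answer above.

```r
# Hypothetical helper (the name is illustrative): insert a space at every
# letter/punctuation boundary, but never inside an emoji code like <U+2764>.
split_keep_emoji <- function(x) {
  rx <- paste0(
    "(?<=<)U\\+\\w+>(*SKIP)(*F)",        # skip the inside of <U+XXXX>
    "|(?<=\\S)(?=<U\\+\\w+>)",           # position before an emoji code glued to text
    "|(?<=[[:alpha:]])(?=[[:punct:]])",  # letter followed by punctuation
    "|(?<=[[:punct:]])(?=[[:alpha:]])"   # punctuation followed by a letter
  )
  gsub(rx, " ", x, perl = TRUE)
}

res <- split_keep_emoji("Hi!!!<U+2764> ok...?<U+1F628> (end)")
print(res)
# [1] "Hi !!! <U+2764> ok ...? <U+1F628> ( end )"
```

Each alternative after the first is a zero-width position, so gsub only inserts spaces and never removes characters. Punctuation runs such as "!!!" stay joined because no alternative matches between two punctuation characters.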

> str <- "Jesus te ama!!! <U+2764> Ou não...?<U+1F628> (fé em stand by)"
> strsplit(str,"[[:space:]]|(?=[.!?])",perl=TRUE)
[[1]]
[1] "Jesus" "te" "ama" "!" "!" "!"
[7] "" "<U+2764>" "" "Ou" "não" "."
[13] "." "." "?" "<U+1F628>" "(fé" "em"
[19] "stand" "by)"

Related

Splitting a comma- and semicolon-delimited string in R

I'm trying to split a string containing two entries and each entry has a specific format:
Category (e.g. active site/region) which is followed by a :
Term (e.g. His, Glu/nucleotide-binding motif A) which is followed by a ,
Here's the string that I want to split:
string <- "active site: His, Glu,region: nucleotide-binding motif A,"
This is what I have tried so far. Except for the two empty substrings, it produces the desired output.
unlist(str_extract_all(string, ".*?(?=,(?:\\w+|$))"))
[1] "active site: His, Glu" "" "region: nucleotide-binding motif A"
[4] ""
How do I get rid of the empty substrings?
You get the empty strings because .*? can also match an empty string where this assertion (?=,(?:\\w+|$)) is true
You can exclude matching a colon or comma using a negated character class before matching :
[^:,\n]+:.*?(?=,(?:\w|$))
Explanation
[^:,\n]+ Match 1+ chars other than : , or a newline
: Match the colon
.*? Match any char, as few times as possible
(?= Positive lookahead, assert that what is directly to the right from the current position:
, Match literally
(?:\w|$) Match either a single word char, or assert the end of the string
) Close the lookahead
Regex demo | R demo
string <- "active site: His, Glu,region: nucleotide-binding motif A,"
unlist(str_extract_all(string, "[^:,\\n]+:.*?(?=,(?:\\w|$))"))
Output
[1] "active site: His, Glu" "region: nucleotide-binding motif A"
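If stringr is not available, the same pattern should also work in base R. The following is a sketch under that assumption, using regmatches() with gregexpr() and perl = TRUE; the variable names categories and terms are illustrative only.

```r
string <- "active site: His, Glu,region: nucleotide-binding motif A,"

# Same pattern as above; perl = TRUE is needed for the lazy quantifier
# combined with the lookahead.
rx <- "[^:,\\n]+:.*?(?=,(?:\\w|$))"
entries <- regmatches(string, gregexpr(rx, string, perl = TRUE))[[1]]
# entries holds "active site: His, Glu" and "region: nucleotide-binding motif A"

# Optional follow-up: separate each entry into its category and its terms.
categories <- sub(":.*$", "", entries)
terms <- trimws(sub("^[^:]*:", "", entries))
print(data.frame(category = categories, terms = terms))
```

Because the pattern requires at least one character before the colon, it cannot produce the empty matches seen with the original attempt.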
Much longer and not as elegant as @The fourth bird's answer (+1),
but it works:
library(stringr)
string2 <- strsplit(string, "([^,]+,[^,]+),", perl = TRUE)[[1]][2]
string1 <- str_replace(string, string2, "")
string <- str_replace_all(c(string1, string2), '\\,$', '')
> string
[1] "active site: His, Glu"
[2] "region: nucleotide-binding motif A"

Regex to match a pattern but not two specific cases

I want to match every case of "-", but not these ones:
[\d]-[A-Z]
[A-Z]-[\d]
I tried this pattern: ((?<![A-Z])-(?![0-9]))|((?<![0-9])-(?![A-Z])) but some results are incorrect like: "RUA VF-32 N"
Can anyone help me?
A simple approach is to use grep with your current logic and inverting the result, and then run another grep to only keep those items that have a hyphen in them:
x <- c("QUADRA 120 - ASA BRANCA","FAZENDA LAGE -RODOVIA RIO VERDE","C-15","99-B","A-A")
grep("-", grep("[A-Z]-\\d|\\d-[A-Z]", x, invert=TRUE, value=TRUE), value=TRUE, fixed=TRUE)
# => [1] "QUADRA 120 - ASA BRANCA" "FAZENDA LAGE -RODOVIA RIO VERDE"
# [3] "A-A"
Here, [A-Z]-\\d|\\d-[A-Z] matches a hyphen either between an uppercase ASCII letter and a digit, or between a digit and an uppercase ASCII letter. Because of invert=TRUE, the inner grep returns only the items that do not match; the outer grep then keeps those that contain a hyphen.
See the R demo.
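The same item-level filter can also be written with two logical vectors, which some readers find easier to follow than nested grep calls. This is a sketch of an equivalent formulation, not the answer's own code; the names has_hyphen and code_hyphen are illustrative.

```r
x <- c("QUADRA 120 - ASA BRANCA", "FAZENDA LAGE -RODOVIA RIO VERDE",
       "C-15", "99-B", "A-A")

# TRUE for items that contain a hyphen at all
has_hyphen <- grepl("-", x, fixed = TRUE)
# TRUE for items with a letter-digit or digit-letter hyphen
code_hyphen <- grepl("[A-Z]-[0-9]|[0-9]-[A-Z]", x)

kept <- x[has_hyphen & !code_hyphen]
print(kept)
```

Note that this works per item: a string containing both kinds of hyphen is dropped as a whole, which is the limitation the SKIP-FAIL pattern addresses.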
To only match - in all contexts other than in between a letter and a digit, you may use the PCRE regex based on SKIP-FAIL technique like
> grep("(?:\\d-[A-Z]|[A-Z]-\\d)(*SKIP)(*F)|-", x, perl=TRUE)
[1] 1 2 5
See this regex demo
Details
(?:\d-[A-Z]|[A-Z]-\d) - a non-capturing group that matches either a digit, - and then uppercase ASCII letter, or an uppercase ASCII letter, - and a digit
(*SKIP)(*F) - omit the current match and proceed looking for the next match at the end of the "failed" match
| - or
- - a hyphen.
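Since the SKIP-FAIL pattern matches individual hyphens rather than whole items, it can also be used for replacement. The sketch below assumes the goal is to touch only the hyphens outside letter-digit contexts; the underscore replacement and the extra "RUA VF-32 N" item (taken from the question) are only there for illustration.

```r
x <- c("QUADRA 120 - ASA BRANCA", "FAZENDA LAGE -RODOVIA RIO VERDE",
       "C-15", "99-B", "A-A", "RUA VF-32 N")

# Letter-digit and digit-letter hyphens are consumed and discarded by
# (*SKIP)(*F); only the remaining hyphens are actually matched.
rx <- "(?:\\d-[A-Z]|[A-Z]-\\d)(*SKIP)(*F)|-"

hits <- grep(rx, x, perl = TRUE)          # indices of items with a "free" hyphen
replaced <- gsub(rx, "_", x, perl = TRUE) # replace only those hyphens
print(hits)
print(replaced)
```

With this input, "C-15", "99-B" and "RUA VF-32 N" are left untouched, while "A-A" becomes "A_A".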

Remove quotes if "=" (equal) sign exists in the middle of the string. REGEX

In this string the character “=” differentiates attributes for a product, and commas distinguish variables within an attribute. However, we found that sometimes extra quotes have been added when there are no variables to put together.
The complete string is :
Uso="Protector para patas de silla,mesas,escaleras,muebles","Topes,4-Tipo=Topes,regatones",2-Familia=Ferretería y Plomería,regatones,7-Contenido="12 unidades,4-Origen=China,4-Material=Goma,2-Modelo=Goma transparente,9-Incluye=12 unidades,3-Color=Transparente"
This is right:
Uso="Protector para patas de silla,mesas,escaleras,muebles"
This is wrong:
"Topes,4-Tipo=Topes,regatones",2-Familia=Ferretería y Plomería,regatones,7-Contenido="12 unidades,4-Origen=China,4-Material=Goma,2-Modelo=Goma transparente,9-Incluye=12 unidades,3-Color=Transparente"
Categoría="Topes,4-Tipo=Topes,regatones",2-Familia=Ferretería y Plomería,regatones,7-Contenido="12 unidades,4-Origen=China,4-Material=Goma,2-Modelo=Goma transparente,9-Incluye=12 unidades,3-Color=Transparente"
I've tried "|w+=" but it selects all quotes. I don't want to select the text between the quotes; the goal is to select and remove the quotes themselves.
We want to remove the quotes that contain an equals sign between them. The quotes that are fine and need to stay are those that group comma-separated values within a single attribute.
The regex needs to detect an = between an opening and a closing quote, allowing for the text in between, and then remove those quotes, which do not need to be there.
Thanks!
I understand the quoted substring should be preceded with =. Then, you need
gsub('="([^"=]*=[^"]*)"', '=\\1', x)
See the R demo online:
x <- '10-Uso="Protector para patas de silla,mesas,escaleras,muebles",6-Características=Regaton interior 1 1/4 plástico blanco 4 unidades,1-Marca=Nagel,Tipo=Topes,5-Medidas=3 cm,3-Categoría=Topes y regatones,7-Contenido=4 unidades,4-Tipo=Regatones,2-Familia=Ferretería y Plomería,9-Incluye=4 regatones plásticos,regatones,4-Origen="Argentina,4-Material=Plástico,2-Modelo=Regatón interior 1 1/4,3-Color=Blanco"'
cat(gsub('="([^"=]*=[^"]*)"', '=\\1', x))
## => 10-Uso="Protector para patas de silla,mesas,escaleras,muebles",6-Características=Regaton interior 1 1/4 plástico blanco 4 unidades,1-Marca=Nagel,Tipo=Topes,5-Medidas=3 cm,3-Categoría=Topes y regatones,7-Contenido=4 unidades,4-Tipo=Regatones,2-Familia=Ferretería y Plomería,9-Incluye=4 regatones plásticos,regatones,4-Origen=Argentina,4-Material=Plástico,2-Modelo=Regatón interior 1 1/4,3-Color=Blanco
So, the quote after muebles is kept and quote after blanco is removed.
How does this work?
=" - matches =" substring
([^"=]*=[^"]*) - matches and captures into Group 1:
[^"=]* - zero or more chars other than " and =
= - a = sign
[^"]* - any 0+ chars other than "
" - matches ".
The replacement pattern is a = and the value stored in Group 1 memory buffer (\1, a replacement backreference).
See the regex demo.
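A shorter input makes the behaviour easier to check by eye. The sketch below reuses the same pattern on a made-up string; the helper name strip_inner_quotes and the sample data are hypothetical.

```r
# Hypothetical wrapper around the gsub call above.
strip_inner_quotes <- function(x) {
  gsub('="([^"=]*=[^"]*)"', '=\\1', x)
}

# First attribute: quotes group a comma-separated value (must stay).
# Second attribute: quotes wrongly wrap several attributes (must go).
x <- 'Uso="a,b",Origen="China,Material=Goma,Color=Rojo"'
cleaned <- strip_inner_quotes(x)
cat(cleaned, sep = "\n")
# Uso="a,b",Origen=China,Material=Goma,Color=Rojo
```

The first quoted value survives because [^"=]* must be followed by an = before the closing quote, and "a,b" contains none. As in the answer, this assumes the unwanted opening quote always comes right after an = sign.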

R regex match things other than known characters

For a text field, I would like to expose those that contain invalid characters. The list of invalid characters is unknown; I only know the list of accepted ones.
For example for French language, the accepted list is
A-z, 1-9, [:punct:], space, àéèçè, hyphen, etc.
The list of invalid characters is unknown, yet I want anything unusual to surface. For example, I would want
'This is an 2-piece à-la-carte dessert' to pass, while
'Ã this Øs an apple' shows up as an anomaly.
The 'not contain' notion in R does not behave as I would like, for example
grep("[^(abc)]",c("abcdef", "defabc", "apple") )
(those that do not contain 'abc') matches all three, while
grep("(abc)",c("abcdef", "defabc", "apple") )
behaves correctly and matches only the first two. Am I missing something?
How can we do that in R? Also, how can we include the hyphen in the list of accepted characters?
[a-z1-9[:punct:] àâæçéèêëîïôœùûüÿ-]+
The above regex matches any of the following (one or more times). Note that the parameter ignore.case=T used in the code below allows the following to also match uppercase variants of the letters.
a-z Any lowercase ASCII letter
1-9 Any digit in the range from 1 to 9 (excludes 0)
[:punct:] Any punctuation character
The space character
àâæçéèêëîïôœùûüÿ Any valid French character with a diacritic mark
- The hyphen character
See code in use here
x <- c("This is an 2-piece à-la-carte dessert", "Ã this Øs an apple")
gsub("[a-z1-9[:punct:] àâæçéèêëîïôœùûüÿ-]+", "", x, ignore.case=T)
The code above replaces all valid characters with nothing. The result is all invalid characters that exist in the string. The following is the output:
[1] "" "ÃØ"
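To flag the offending strings rather than strip the valid characters, the accepted class can be negated with ^ inside the brackets. The sketch below also shows why "[^(abc)]" from the question matches everything: inside brackets, parentheses are literal characters and ^ negates the set, so it matches any single character other than (, a, b, c or ). The accepted list here is an assumption based on the question, and the accented letters are written as \u escapes only to keep the script independent of file encoding.

```r
# Accepted characters (assumed from the question): ASCII letters, digits 1-9,
# punctuation (which already includes the hyphen), space, some French letters.
allowed <- "A-Za-z1-9[:punct:] \u00e0\u00e2\u00e7\u00e9\u00e8\u00ea\u00eb\u00ee\u00ef\u00f4\u00f9\u00fb\u00fc"
invalid_rx <- paste0("[^", allowed, "]")

x <- c("This is an 2-piece \u00e0-la-carte dessert", "\u00c3 this \u00d8s an apple")
has_invalid <- grepl(invalid_rx, x)   # TRUE where an unexpected character occurs
print(has_invalid)

# The question's example: a negated class is not "does not contain abc".
y <- c("abcdef", "defabc", "apple")
class_negation <- grepl("[^(abc)]", y)        # any char outside ( a b c )
lacks_abc <- !grepl("abc", y, fixed = TRUE)   # strings without the substring
print(class_negation)
print(lacks_abc)
```

Here has_invalid is FALSE for the first string and TRUE for the second, while lacks_abc is TRUE only for "apple".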
If by "expose the invalid characters" you mean delete the "accepted" ones, then a regex character class should be helpful. From the ?regex help page we can see that a hyphen is already part of the punctuation character vector;
[:punct:]
Punctuation characters:
! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
So the code could be:
x <- 'Ã this Øs an apple'
gsub("[A-z1-9[:punct:] àéèçè]+", "", x)
#[1] "ÃØ"
Note that regex has a predefined, locale-specific "[:alpha:]" named character class that would probably be both safer and more compact than the expression "[A-zàéèçè]" especially since the post from ctwheels suggests that you missed a few. The ?regex page indicates that "[0-9A-Za-z]" might be both locale- and encoding-specific.
If by "expose" you instead meant "identify the position within the string" then you could use the negation operator "^" within the character class formalism and apply gregexpr:
gregexpr("[^A-z1-9[:punct:] àéèçè]+", x)
[[1]]
[1] 1 8
attr(,"match.length")
[1] 1 1

Text Mining R Package & Regex to handle Replace Smart Curly Quotes

I've got a bunch of texts like the one below with different smart quotes, both single and double. With the packages I'm aware of, all I could do is remove those characters, but I want them to be replaced with normal quotes.
textclean::replace_non_ascii("You don‘t get “your” money’s worth")
Received Output: "You dont get your moneys worth"
Expected Output: "You don't get "your" money's worth"
I would also appreciate it if someone has a regex to replace all such quotes in one shot.
Thanks!
Use two gsub operations: 1) to replace double curly quotes, 2) to replace single quotes:
> gsub("[“”]", "\"", gsub("[‘’]", "'", text))
[1] "You don't get \"your\" money's worth"
See the online R demo. Tested in both Linux and Windows, and works the same.
The [“”] construct is a positive character class that matches any single char defined in the class.
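For a self-contained version of the two-step replacement, the calls can be wrapped in a function. This is a minimal sketch: the name normalize_quotes is hypothetical, and the curly quotes are written as \u escapes (U+2018, U+2019, U+201C, U+201D) so the script does not depend on the file encoding.

```r
# Hypothetical helper: map the four common curly quotes to ASCII quotes.
normalize_quotes <- function(text) {
  text <- gsub("[\u2018\u2019]", "'", text)   # single curly quotes -> '
  gsub("[\u201c\u201d]", "\"", text)          # double curly quotes -> "
}

text <- "You don\u2018t get \u201cyour\u201d money\u2019s worth"
res <- normalize_quotes(text)
cat(res, sep = "\n")
# You don't get "your" money's worth
```

The order of the two gsub calls does not matter here, since the two character classes do not overlap.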
To normalize all chars similar to double quotes, you might want to use
> sngl_quot_rx = "[ʻʼʽ٬‘’‚‛՚︐]"
> dbl_quot_rx = "[«»״“”„‟≪≫《》〝〞〟\"″‶]"
> res = gsub(dbl_quot_rx, "\"", gsub(sngl_quot_rx, "'", `Encoding<-`(text, "UTF-8")))
> cat(res, sep="\n")
You don't get "your" money's worth
Here, [«»״“”„‟≪≫《》〝〞〟"″‶] matches
« 00AB LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
» 00BB RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
״ 05F4 HEBREW PUNCTUATION GERSHAYIM
“ 201C LEFT DOUBLE QUOTATION MARK
” 201D RIGHT DOUBLE QUOTATION MARK
„ 201E DOUBLE LOW-9 QUOTATION MARK
‟ 201F DOUBLE HIGH-REVERSED-9 QUOTATION MARK
≪ 226A MUCH LESS-THAN
≫ 226B MUCH GREATER-THAN
《 300A LEFT DOUBLE ANGLE BRACKET
》 300B RIGHT DOUBLE ANGLE BRACKET
〝 301D REVERSED DOUBLE PRIME QUOTATION MARK
〞 301E DOUBLE PRIME QUOTATION MARK
〟 301F LOW DOUBLE PRIME QUOTATION MARK
" FF02 FULLWIDTH QUOTATION MARK
″ 2033 DOUBLE PRIME
‶ 2036 REVERSED DOUBLE PRIME
The [ʻʼʽ٬‘’‚‛՚︐] is used to normalize some chars similar to single quotes:
ʻ 02BB MODIFIER LETTER TURNED COMMA
ʼ 02BC MODIFIER LETTER APOSTROPHE
ʽ 02BD MODIFIER LETTER REVERSED COMMA
٬ 066C ARABIC THOUSANDS SEPARATOR
‘ 2018 LEFT SINGLE QUOTATION MARK
’ 2019 RIGHT SINGLE QUOTATION MARK
‚ 201A SINGLE LOW-9 QUOTATION MARK
‛ 201B SINGLE HIGH-REVERSED-9 QUOTATION MARK
՚ 055A ARMENIAN APOSTROPHE
︐ FE10 PRESENTATION FORM FOR VERTICAL COMMA
There's a function in {proustr} to normalize punctuation, called pr_normalize_punc() :
https://github.com/ColinFay/proustr#pr_normalize_punc
It turns :
=> ″‶« »“”`´„“ into "
=> ՚ ’ into '
=> … into ...
For example :
library(proustr)
a <- data.frame(text = "Il l՚a dit : « La ponctuation est chelou » !")
pr_normalize_punc(a, text)
# A tibble: 1 x 1
text
* <chr>
1 "Il l'a dit : \"La ponctuation est chelou\" !"
For your text :
pr_normalize_punc(data.frame( text = "You don‘t get “your” money’s worth"), text)
# A tibble: 1 x 1
text
* <chr>
1 "You don‘t get \"your\" money's worth"
We can use gsub here for a base R option. Replace each curly quoted term at a time.
text <- "You don‘t get “your” money’s worth"
new_text <- gsub("“(.*?)”", "\"\\1\"", text)
new_text <- gsub("’", "'", new_text)
new_text
[1] "You don‘t get \"your\" money's worth"
I have assumed here that your curly quotes are always balanced, i.e. they always wrap a word. If not, then you might have to do more work.
Doing a blanket replacement of opening/closing double curly quotes may not play out as intended, if you want them to remain as is when not quoting a word.