Remove quotes if "=" (equal) sign exists in the middle of the string. REGEX - r

In this string the character “=” differentiates attributes for a product, and commas distinguish variables within an attribute. However, we found that sometimes extra quotes have been added when there are no variables to put together.
The complete string is :
Uso="Protector para patas de silla,mesas,escaleras,muebles","Topes,4-Tipo=Topes,regatones",2-Familia=Ferretería y Plomería,regatones,7-Contenido="12 unidades,4-Origen=China,4-Material=Goma,2-Modelo=Goma transparente,9-Incluye=12 unidades,3-Color=Transparente"
This is right:
Uso="Protector para patas de silla,mesas,escaleras,muebles"
This is wrong:
"Topes,4-Tipo=Topes,regatones",2-Familia=Ferretería y Plomería,regatones,7-Contenido="12 unidades,4-Origen=China,4-Material=Goma,2-Modelo=Goma transparente,9-Incluye=12 unidades,3-Color=Transparente"
Categoría="Topes,4-Tipo=Topes,regatones",2-Familia=Ferretería y Plomería,regatones,7-Contenido="12 unidades,4-Origen=China,4-Material=Goma,2-Modelo=Goma transparente,9-Incluye=12 unidades,3-Color=Transparente"
I´ve tried "|w+=" but selects all quotes. I don´t want to select text between quotes, the goal is select and remove these quotes.
We want to remove those quotes that contains an equal in between. The quotes that are ok and need to stay are those used to separate commas within the string, differentiating the variables from the string.
The regex needs to detect an = contained into and opening and closing quotes, but considering text in between. And once this is detected remove those quotes, which no need to be there.
Thanks!

I understand the quoted substring should be preceded with =. Then, you need
gsub('="([^"=]*=[^"]*)"', '=\\1', x)
See the R demo online:
x <- '10-Uso="Protector para patas de silla,mesas,escaleras,muebles",6-Características=Regaton interior 1 1/4 plástico blanco 4 unidades,1-Marca=Nagel,Tipo=Topes,5-Medidas=3 cm,3-Categoría=Topes y regatones,7-Contenido=4 unidades,4-Tipo=Regatones,2-Familia=Ferretería y Plomería,9-Incluye=4 regatones plásticos,regatones,4-Origen="Argentina,4-Material=Plástico,2-Modelo=Regatón interior 1 1/4,3-Color=Blanco"'
cat(gsub('="([^"=]*=[^"]*)"', '=\\1', x))
## => 10-Uso="Protector para patas de silla,mesas,escaleras,muebles",6-Características=Regaton interior 1 1/4 plástico blanco 4 unidades,1-Marca=Nagel,Tipo=Topes,5-Medidas=3 cm,3-Categoría=Topes y regatones,7-Contenido=4 unidades,4-Tipo=Regatones,2-Familia=Ferretería y Plomería,9-Incluye=4 regatones plásticos,regatones,4-Origen=Argentina,4-Material=Plástico,2-Modelo=Regatón interior 1 1/4,3-Color=Blanco
So, the quote after muebles is kept and quote after blanco is removed.
How does this work?
=" - matches =" substring
([^"=]*=[^"]*) - matches and captures into Group 1:
[^"=]* - zero or more chars other than " and =
= - a = sign
[^"]* - any 0+ chars other than "
" - matches ".
The replacement pattern is a = and the value stored in Group 1 memory buffer (\1, a replacement backreference).
See the regex demo.

Related

gsub extracting string

My sample data is:
c("2\tNO PEMJNUM\t 2\tALTOGETHER HOW MANY JOBS\t216 - 217",
"1\tREFERENCE PERSON 2\tSPOUSE 3\tCHILD 4\tOTHER RELATIVE (PRIMARY FAMILY & UNREL) PRFAMTYP\t2\tFAMILY TYPE RECODE\t155 - 156",
"5\tUNABLE TO WORK PUBUS1\t 2\tLAST WEEK DID YOU DO ANY\t184 - 185",
"2\tNO PEIO1COW\t 2\tINDIVIDUAL CLASS OF WORKER CODE\t432 - 433"
For each line, I'm looking to extract (they are variable names):
Line 1: "PEMJNUM"
Line 2: "PRFAMTYP"
Line 3: "PUBUS1"
Line 4: "PEIO1COW"
My initial goal was to gsub remove the characters to the left and right of each variable name to leave just the variable names, but I was only able to grab everything to the right of the variable name and had issues with grabbing characters to the left. (as shown here https://regexr.com/67r6j).
Not sure if there's a better way to do this!
You can use sub in the following way:
x <- c("2\tNO PEMJNUM\t 2\tALTOGETHER HOW MANY JOBS\t216 - 217",
"1\tREFERENCE PERSON 2\tSPOUSE 3\tCHILD 4\tOTHER RELATIVE (PRIMARY FAMILY & UNREL) PRFAMTYP\t2\tFAMILY TYPE RECODE\t155 - 156",
"5\tUNABLE TO WORK PUBUS1\t 2\tLAST WEEK DID YOU DO ANY\t184 - 185",
"2\tNO PEIO1COW\t 2\tINDIVIDUAL CLASS OF WORKER CODE\t432 - 433")
sub("^(?:.*\\b)?(\\w+)\\s*\\b2\\b.*", "\\1", x, perl=TRUE)
# => [1] "PEMJNUM" "PRFAMTYP" "PUBUS1" "PEIO1COW"
See the online regex demo and the R demo.
Details:
^ - start of string
(?:.*\b)? - an optional non-capturing group that matches any zero or more chars (other than line break chars since I use perl=TRUE, if you need to match line breaks, too, add (?s) at the pattern start) as many as possible, and then a word boundary position
(\w+) - Group 1 (\1): one or more word chars
\s* - zero or more whitespaces
\b - a word boundary
2 - a 2 digit
\b - a word boundary
.* - the rest of the line/string.
If there are always whitespaces before 2, the regex can be written as "^(?:.*\\b)?(\\w+)\\s+2\\b.*".

Regex to match a pattern but not two specific cases

I want to match every cases of "-", but not these ones:
[\d]-[A-Z]
[A-Z]-[\d]
I tried this pattern: ((?<![A-Z])-(?![0-9]))|((?<![0-9])-(?![A-Z])) but some results are incorrect like: "RUA VF-32 N"
Can anyone help me?
A simple approach is to use grep with your current logic and inverting the result, and then run another grep to only keep those items that have a hyphen in them:
x <- c("QUADRA 120 - ASA BRANCA","FAZENDA LAGE -RODOVIA RIO VERDE","C-15","99-B","A-A")
grep("-", grep("[A-Z]-\\d|\\d-[A-Z]", x, invert=TRUE, value=TRUE), value=TRUE, fixed=TRUE)
# => [1] "QUADRA 120 - ASA BRANCA" "FAZENDA LAGE -RODOVIA RIO VERDE"
# [3] "A-A"
Here, [A-Z]-\\d|\\d-[A-Z] matches a hyphen either in between an uppercase ASCII etter or a digit or betweena digit and an ASCII uppercase letter. If there is a match, the result is inverted due to invert=TRUE.
See the R demo.
To only match - in all contexts other than in between a letter and a digit, you may use the PCRE regex based on SKIP-FAIL technique like
> grep("(?:\\d-[A-Z]|[A-Z]-\\d)(*SKIP)(*F)|-", x, perl=TRUE)
[1] 1 2
See this regex demo
Details
(?:\d-[A-Z]|[A-Z]-\d) - a non-capturing group that matches either a digit, - and then uppercase ASCII letter, or an uppercase ASCII letter, - and a digit
(*SKIP)(*F) - omit the current match and proceed looking for the next match at the end of the "failed" match
| - or
- - a hyphen.

R regex match things other than known characters

For a text field, I would like to expose those that contain invalid characters. The list of invalid characters is unknown; I only know the list of accepted ones.
For example for French language, the accepted list is
A-z, 1-9, [punc::], space, àéèçè, hyphen, etc.
The list of invalid charactersis unknown, yet I want anything unusual to resurface, for example, I would want
This is an 2-piece à-la-carte dessert to pass when
'Ã this Øs an apple' pumps up as an anomalie
The 'not contain' notion in R does not behave as I would like, for example
grep("[^(abc)]",c("abcdef", "defabc", "apple") )
(those that does not contain 'abc') match all three while
grep("(abc)",c("abcdef", "defabc", "apple") )
behaves correctly and match only the first two. Am I missing something
How can we do that in R ? Also, how can we put hypen together in the list of accepted characters ?
[a-z1-9[:punct:] àâæçéèêëîïôœùûüÿ-]+
The above regex matches any of the following (one or more times). Note that the parameter ignore.case=T used in the code below allows the following to also match uppercase variants of the letters.
a-z Any lowercase ASCII letter
1-9 Any digit in the range from 1 to 9 (excludes 0)
[:punct:] Any punctuation character
The space character
àâæçéèêëîïôœùûüÿ Any valid French character with a diacritic mark
- The hyphen character
See code in use here
x <- c("This is an 2-piece à-la-carte dessert", "Ã this Øs an apple")
gsub("[a-z1-9[:punct:] àâæçéèêëîïôœùûüÿ-]+", "", x, ignore.case=T)
The code above replaces all valid characters with nothing. The result is all invalid characters that exist in the string. The following is the output:
[1] "" "ÃØ"
If by "expose the invalid characters" you mean delete the "accepted" ones, then a regex character class should be helpful. From the ?regex help page we can see that a hyphen is already part of the punctuation character vector;
[:punct:]
Punctuation characters:
! " # $ % & ' ( ) * + , - . / : ; < = > ? # [ \ ] ^ _ ` { | } ~
So the code could be:
x <- 'Ã this Øs an apple'
gsub("[A-z1-9[:punct:] àéèçè]+", "", x)
#[1] "ÃØ"
Note that regex has a predefined, locale-specific "[:alpha:]" named character class that would probably be both safer and more compact than the expression "[A-zàéèçè]" especially since the post from ctwheels suggests that you missed a few. The ?regex page indicates that "[0-9A-Za-z]" might be both locale- and encoding-specific.
If by "expose" you instead meant "identify the postion within the string" then you could use the negation operator "^" within the character class formalism and apply gregexpr:
gregexpr("[^A-z1-9[:punct:] àéèçè]+", x)
[[1]]
[1] 1 8
attr(,"match.length")
[1] 1 1

Text Mining R Package & Regex to handle Replace Smart Curly Quotes

I've got a bunch of texts like this below with different smart quotes - for single and double quotes. All I could end up with the packages I'm aware of is to remove those characters but I want them to replaced with the normal quotes.
textclean::replace_non_ascii("You don‘t get “your” money’s worth")
Received Output: "You dont get your moneys worth"
Expected Output: "You don't get "your" money's worth"
Also would appreciate if someone's got the regex to replace every such quotes in one shot.
Thanks!
Use two gsub operations: 1) to replace double curly quotes, 2) to replace single quotes:
> gsub("[“”]", "\"", gsub("[‘’]", "'", text))
[1] "You don't get \"your\" money's worth"
See the online R demo. Tested in both Linux and Windows, and works the same.
The [“”] construct is a positive character class that matches any single char defined in the class.
To normalize all chars similar to double quotes, you might want to use
> sngl_quot_rx = "[ʻʼʽ٬‘’‚‛՚︐]"
> dbl_quot_rx = "[«»““”„‟≪≫《》〝〞〟\"″‶]"
> res = gsub(dbl_quot_rx, "\"", gsub(sngl_quot_rx, "'", `Encoding<-`(text, "UTF8")))
> cat(res, sep="\n")
You don't get "your" money's worth
Here, [«»““”„‟≪≫《》〝〞〟"″‶] matches
« 00AB LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
» 00BB RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
“ 05F4 HEBREW PUNCTUATION GERSHAYIM
“ 201C LEFT DOUBLE QUOTATION MARK
” 201D RIGHT DOUBLE QUOTATION MARK
„ 201E DOUBLE LOW-9 QUOTATION MARK
‟ 201F DOUBLE HIGH-REVERSED-9 QUOTATION MARK
≪ 226A MUCH LESS-THAN
≫ 226B MUCH GREATER-THAN
《 300A LEFT DOUBLE ANGLE BRACKET
》 300B RIGHT DOUBLE ANGLE BRACKET
〝 301D REVERSED DOUBLE PRIME QUOTATION MARK
〞 301E DOUBLE PRIME QUOTATION MARK
〟 301F LOW DOUBLE PRIME QUOTATION MARK
" FF02 FULLWIDTH QUOTATION MARK
″ 2033 DOUBLE PRIME
‶ 2036 REVERSED DOUBLE PRIME
The [ʻʼʽ٬‘’‚‛՚︐] is used to normalize some chars similar to single quotes:
ʻ 02BB MODIFIER LETTER TURNED COMMA
ʼ 02BC MODIFIER LETTER APOSTROPHE
ʽ 02BD MODIFIER LETTER REVERSED COMMA
٬ 066C ARABIC THOUSANDS SEPARATOR
‘ 2018 LEFT SINGLE QUOTATION MARK
’ 2019 RIGHT SINGLE QUOTATION MARK
‚ 201A SINGLE LOW-9 QUOTATION MARK
‛ 201B SINGLE HIGH-REVERSED-9 QUOTATION MARK
՚ 055A ARMENIAN APOSTROPHE
︐ FE10 PRESENTATION FORM FOR VERTICAL COMMA
There's a function in {proustr} to normalize punctuation, called pr_normalize_punc() :
https://github.com/ColinFay/proustr#pr_normalize_punc
It turns :
=> ″‶« »“”`´„“ into "
=> ՚ ’ into '
=> … into ...
For example :
library(proustr)
a <- data.frame(text = "Il l՚a dit : « La ponctuation est chelou » !")
pr_normalize_punc(a, text)
# A tibble: 1 x 1
text
* <chr>
1 "Il l'a dit : \"La ponctuation est chelou\" !"
For your text :
pr_normalize_punc(data.frame( text = "You don‘t get “your” money’s worth"), text)
# A tibble: 1 x 1
text
* <chr>
1 "You don‘t get \"your\" money's worth"
We can use gsub here for a base R option. Replace each curly quoted term at a time.
text <- "You don‘t get “your” money’s worth"
new_text <- gsub("“(.*?)”", "\"\\1\"", text)
new_text <- gsub("’", "'", new_text)
new_text
[1] "You don‘t get \"your\" money's worth"
I have assumed here that your curly quotes are always balanced, i.e. they always wrap a word. If not, then you might have to do more work.
Doing a blanket replacement of opening/closing double curly quotes may not play out as intended, if you want them to remain as is when not quoting a word.
Demo

grep formatted number using r

I have a string format that I would like to select from a character vector. The form is
123 123 1234
where the two spaces can also be a hyphen. i.e. 3 digits followed by space or hyphen, followed by 3 digits, followed by space or hyphen, followed by 4 digits
I am trying to do this by the following:
grep("^([0-9]{3}[ -.])([0-9]{3}[ -.])([0-9]{4}$)",mytext)
however this yields:
integer(0)
What am I doing wrong?
Your string has a whitespace at the end, so you can either consider that white space, like so:
grep("^([0-9]{3}[ -.])([0-9]{3}[ -.])([0-9]{4} $)",mytext)
Or remove the end of line assertion "$", like so:
grep("^([0-9]{3}[ -.])([0-9]{3}[ -.])([0-9]{4})",mytext)
Also, as pointed out by Wiktor Stribiżew, the character class [ -.] will match any character in the range between " " and ".". To match "-","." and " " you have to escape the "-" or put it at the end of the class. Like [ \-.] or [ .-]

Resources