R - Regex whitespace pattern matches excessively - r

I have what seems like a simple regular expression I want to match on and replace. I'm processing a bunch of free form text and respondents have a variety of ways of denoting a line break. One such is at least 4 sequential bits of whitespace. Another is a heavy dot. However, in R (perl=FALSE) I get some very strange behavior. The regex \\s{4,}|• replaces the whole string with one <br>, if I change the repetition to be just 4 (\\s{4}|•) then it returns 19 <br>. If I remove the |• then it works fine. If I explicitly call out 4 whitespace characters or the heavy dot, \\s\\s\\s\\s+|•, it works fine.
What is it about repeating \\s or checking for a heavy dot • that causes such erratic behavior?
x = "Call Narrative <br>11/15/2017 19:53:00 J574511 <br> <br>"
replacement = "<br>"
orig_pattern = "\\s{4,}|•"
alt1 = "\\s\\s\\s\\s+|•"
alt2 = "\\s{4,}"
alt3 = "\\s{4,}|<p>"
alt4 = "\\s{4}|•"
gsub(orig_pattern,replacement,x)
#> [1] "<br>"
gsub(alt1,replacement,x)
#> [1] "Call Narrative <br>11/15/2017<br>19:53:00<br>J574511 <br> <br>"
gsub(alt2,replacement,x)
#> [1] "Call Narrative <br>11/15/2017<br>19:53:00<br>J574511 <br> <br>"
gsub(alt3,replacement,x)
#> [1] "Call Narrative <br>11/15/2017<br>19:53:00<br>J574511 <br> <br>"
gsub(alt4,replacement,x)
#> [1] "<br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br>"
UPDATE
It seems to be associated with the OS. The problem originated on Amazon Linux 2 but works fine on Windows.

Related

How can I use R Regular Expressions to catch a Hebrew word?

I've been trying to catch the word
עונה
plus the subsequent number after it in a string such as
כל הילדים אוכלים, עונה 2 , פרק 8-לזניית ירקות וסלמון בדבש
Demonstrating it on Regex101.com was straightforward enough, with עונה(\s+\d+|\d+), but with R I came up empty.
str<-"כל הילדים אוכלים, עונה 2 , פרק 8-לזניית ירקות וסלמון בדבש"
exp<-"עונה(\\s+\\d+|\\d+)"
str_extract_all(str,exp)
Output:
[[1]]
character(0)
You can use this regex:
/[\u0590-\u05FF]/*

Remove quotes if "=" (equal) sign exists in the middle of the string. REGEX

In this string the character “=” differentiates attributes for a product, and commas distinguish variables within an attribute. However, we found that sometimes extra quotes have been added when there are no variables to put together.
The complete string is :
Uso="Protector para patas de silla,mesas,escaleras,muebles","Topes,4-Tipo=Topes,regatones",2-Familia=Ferretería y Plomería,regatones,7-Contenido="12 unidades,4-Origen=China,4-Material=Goma,2-Modelo=Goma transparente,9-Incluye=12 unidades,3-Color=Transparente"
This is right:
Uso="Protector para patas de silla,mesas,escaleras,muebles"
This is wrong:
"Topes,4-Tipo=Topes,regatones",2-Familia=Ferretería y Plomería,regatones,7-Contenido="12 unidades,4-Origen=China,4-Material=Goma,2-Modelo=Goma transparente,9-Incluye=12 unidades,3-Color=Transparente"
Categoría="Topes,4-Tipo=Topes,regatones",2-Familia=Ferretería y Plomería,regatones,7-Contenido="12 unidades,4-Origen=China,4-Material=Goma,2-Modelo=Goma transparente,9-Incluye=12 unidades,3-Color=Transparente"
I´ve tried "|w+=" but selects all quotes. I don´t want to select text between quotes, the goal is select and remove these quotes.
We want to remove those quotes that contains an equal in between. The quotes that are ok and need to stay are those used to separate commas within the string, differentiating the variables from the string.
The regex needs to detect an = contained into and opening and closing quotes, but considering text in between. And once this is detected remove those quotes, which no need to be there.
Thanks!
I understand the quoted substring should be preceded with =. Then, you need
gsub('="([^"=]*=[^"]*)"', '=\\1', x)
See the R demo online:
x <- '10-Uso="Protector para patas de silla,mesas,escaleras,muebles",6-Características=Regaton interior 1 1/4 plástico blanco 4 unidades,1-Marca=Nagel,Tipo=Topes,5-Medidas=3 cm,3-Categoría=Topes y regatones,7-Contenido=4 unidades,4-Tipo=Regatones,2-Familia=Ferretería y Plomería,9-Incluye=4 regatones plásticos,regatones,4-Origen="Argentina,4-Material=Plástico,2-Modelo=Regatón interior 1 1/4,3-Color=Blanco"'
cat(gsub('="([^"=]*=[^"]*)"', '=\\1', x))
## => 10-Uso="Protector para patas de silla,mesas,escaleras,muebles",6-Características=Regaton interior 1 1/4 plástico blanco 4 unidades,1-Marca=Nagel,Tipo=Topes,5-Medidas=3 cm,3-Categoría=Topes y regatones,7-Contenido=4 unidades,4-Tipo=Regatones,2-Familia=Ferretería y Plomería,9-Incluye=4 regatones plásticos,regatones,4-Origen=Argentina,4-Material=Plástico,2-Modelo=Regatón interior 1 1/4,3-Color=Blanco
So, the quote after muebles is kept and quote after blanco is removed.
How does this work?
=" - matches =" substring
([^"=]*=[^"]*) - matches and captures into Group 1:
[^"=]* - zero or more chars other than " and =
= - a = sign
[^"]* - any 0+ chars other than "
" - matches ".
The replacement pattern is a = and the value stored in Group 1 memory buffer (\1, a replacement backreference).
See the regex demo.

R regex match things other than known characters

For a text field, I would like to expose those that contain invalid characters. The list of invalid characters is unknown; I only know the list of accepted ones.
For example for French language, the accepted list is
A-z, 1-9, [punc::], space, àéèçè, hyphen, etc.
The list of invalid charactersis unknown, yet I want anything unusual to resurface, for example, I would want
This is an 2-piece à-la-carte dessert to pass when
'Ã this Øs an apple' pumps up as an anomalie
The 'not contain' notion in R does not behave as I would like, for example
grep("[^(abc)]",c("abcdef", "defabc", "apple") )
(those that does not contain 'abc') match all three while
grep("(abc)",c("abcdef", "defabc", "apple") )
behaves correctly and match only the first two. Am I missing something
How can we do that in R ? Also, how can we put hypen together in the list of accepted characters ?
[a-z1-9[:punct:] àâæçéèêëîïôœùûüÿ-]+
The above regex matches any of the following (one or more times). Note that the parameter ignore.case=T used in the code below allows the following to also match uppercase variants of the letters.
a-z Any lowercase ASCII letter
1-9 Any digit in the range from 1 to 9 (excludes 0)
[:punct:] Any punctuation character
The space character
àâæçéèêëîïôœùûüÿ Any valid French character with a diacritic mark
- The hyphen character
See code in use here
x <- c("This is an 2-piece à-la-carte dessert", "Ã this Øs an apple")
gsub("[a-z1-9[:punct:] àâæçéèêëîïôœùûüÿ-]+", "", x, ignore.case=T)
The code above replaces all valid characters with nothing. The result is all invalid characters that exist in the string. The following is the output:
[1] "" "ÃØ"
If by "expose the invalid characters" you mean delete the "accepted" ones, then a regex character class should be helpful. From the ?regex help page we can see that a hyphen is already part of the punctuation character vector;
[:punct:]
Punctuation characters:
! " # $ % & ' ( ) * + , - . / : ; < = > ? # [ \ ] ^ _ ` { | } ~
So the code could be:
x <- 'Ã this Øs an apple'
gsub("[A-z1-9[:punct:] àéèçè]+", "", x)
#[1] "ÃØ"
Note that regex has a predefined, locale-specific "[:alpha:]" named character class that would probably be both safer and more compact than the expression "[A-zàéèçè]" especially since the post from ctwheels suggests that you missed a few. The ?regex page indicates that "[0-9A-Za-z]" might be both locale- and encoding-specific.
If by "expose" you instead meant "identify the postion within the string" then you could use the negation operator "^" within the character class formalism and apply gregexpr:
gregexpr("[^A-z1-9[:punct:] àéèçè]+", x)
[[1]]
[1] 1 8
attr(,"match.length")
[1] 1 1

Arranging text lines in R

My data is in this format. It's a text file and the class is "character". I have posted few lines from the file. There are about 14000 lines.
"KEY: Aback"
"SYN: Backwards, rearwards, aft, abaft, astern, behind, back."
"ANT: Onwards, forwards, ahead, before, afront, beyond, afore."
"KEY: Abandon"
"SYN: Leave, forsake, desert, renounce, cease, relinquish,"
"discontinue, castoff, resign, retire, quit, forego, forswear,"
"depart_from, vacate, surrender, abjure, repudiate."
"ANT: Pursue, prosecute, undertake, seek, court, cherish, favor,"
"protect, claim, maintain, defend, advocate, retain, support, uphold,"
"occupy, haunt, hold, assert, vindicate, keep."
Line 6 and 7 is the continuation of line 5. Line 9 and 10 is the continuation of line 8. My struggle is how can I bring up line 6 and 7 to line 5 and similarly line 9 and 10 to line 8.
Any hints gratefully received.
First thing that comes to mind (your text is stored as x):
#prefix each line starter (identifies as pattern: `CAPS:`) with a newline (\n)
strsplit(gsub("([A-Z]+:)", "\n\\1", paste(x, collapse = " ")),
split = "\n")[[1L]][-1L]
# [1] "KEY: Aback "
# [2] "SYN: Backwards, rearwards, aft, abaft, astern, behind, back. "
# [3] "ANT: Onwards, forwards, ahead, before, afront, beyond, afore. "
# [4] "KEY: Abandon "
# [5] "SYN: Leave, forsake, desert, renounce, cease, relinquish, discontinue, castoff, resign, retire, quit, forego, forswear, depart_from, vacate, surrender, abjure, repudiate. "
# [6] "ANT: Pursue, prosecute, undertake, seek, court, cherish, favor, protect, claim, maintain, defend, advocate, retain, support, uphold, occupy, haunt, hold, assert, vindicate, keep."

Is there a "quote words" operator in R? [duplicate]

This question already has answers here:
Does R have quote-like operators like Perl's qw()?
(6 answers)
Closed 5 years ago.
Is there a "quote words" operator in R, analogous to qw in Perl? qw is a quoting operator that allows you to create a list of quoted items without having to quote each one individually.
Here is how you would do it without qw (i.e. using dozens of quotation marks and commas):
#!/bin/env perl
use strict;
use warnings;
my #NAM_founders = ("B97", "CML52", "CML69", "CML103", "CML228", "CML247",
"CML322", "CML333", "Hp301", "Il14H", "Ki3", "Ki11",
"M37W", "M162W", "Mo18W", "MS71", "NC350", "NC358"
"Oh7B", "P39", "Tx303", "Tzi8",
);
print(join(" ", #NAM_founders)); # Prints array, with elements separated by spaces
Here's doing the same thing, but with qw it is much cleaner:
#!/bin/env perl
use strict;
use warnings;
my #NAM_founders = qw(B97 CML52 CML69 CML103 CML228 CML247 CML277
CML322 CML333 Hp301 Il14H Ki3 Ki11 Ky21
M37W M162W Mo18W MS71 NC350 NC358 Oh43
Oh7B P39 Tx303 Tzi8
);
print(join(" ", #NAM_founders)); # Prints array, with elements separated by spaces
I have searched but not found anything.
Try using scan and a text connection:
qw=function(s){scan(textConnection(s),what="")}
NAM=qw("B97 CML52 CML69 CML103 CML228 CML247 CML277
CML322 CML333 Hp301 Il14H Ki3 Ki11 Ky21
M37W M162W Mo18W MS71 NC350 NC358 Oh43
Oh7B P39 Tx303 Tzi8")
This will always return a vector of strings even if the data in quotes is numeric:
> qw("1 2 3 4")
Read 4 items
[1] "1" "2" "3" "4"
I don't think you'll get much simpler, since space-separated bare words aren't valid syntax in R, even wrapped in curly brackets or parens. You've got to quote them.
For R, the closest thing that I can think of, or that I've found so far, is to create a single block of text and then break it up using strsplit, thus:
#!/bin/env Rscript
NAM_founders <- "B97 CML52 CML69 CML103 CML228 CML247 CML277
CML322 CML333 Hp301 Il14H Ki3 Ki11 Ky21
M37W M162W Mo18W MS71 NC350 NC358 Oh43
Oh7B P39 Tx303 Tzi8"
NAM_founders <- unlist(strsplit(NAM_founders,"[ \n]+"))
print(NAM_founders)
Which prints
[1] "B97" "CML52" "CML69" "CML103" "CML228" "CML247" "CML277" "CML322"
[9] "CML333" "Hp301" "Il14H" "Ki3" "Ki11" "Ky21" "M37W" "M162W"
[17] "Mo18W" "MS71" "NC350" "NC358" "Oh43" "Oh7B" "P39" "Tx303"
[25] "Tzi8"

Resources