I find this really odd:
pattern <- "[[:punct:][:digit:][:space:]]+"
string <- "a . , > 1 b"
gsub(pattern, " ", string)
# [1] "a b"
library(stringr)
str_replace_all(string, pattern, " ")
# [1] "a > b"
str_replace_all(string, "[[:punct:][:digit:][:space:]>]+", " ")
# [1] "a b"
Is this expected?
Still working on this, but ?"stringi-search-charclass" says:
Beware of using POSIX character classes, e.g. ‘[:punct:]’. ICU
User Guide (see below) states that in general they are not
well-defined, so may end up with something different than you
expect.
In particular, in POSIX-like regex engines, ‘[:punct:]’ stands for
the character class corresponding to the ‘ispunct()’
classification function (check out ‘man 3 ispunct’ on UNIX-like
systems). According to ISO/IEC 9899:1990 (ISO C90), the
‘ispunct()’ function tests for any printing character except for
space or a character for which ‘isalnum()’ is true. However, in a
POSIX setting, the details of what characters belong into which
class depend on the current locale. So the ‘[:punct:]’ class does
not lead to portable code (again, in POSIX-like regex engines).
So a POSIX flavor of ‘[:punct:]’ is more like ‘[\p{P}\p{S}]’ in
‘ICU’. You have been warned.
Copying from the issue posted above,
string <- "a . , > 1 b"
mypunct <- "[[\\p{P}][\\p{S}]]"
stringr::str_remove_all(string, mypunct)
I can appreciate stuff being locale-specific, but it still surprises me that [:punct:] doesn't even work in a C locale ...
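As a rough sketch of the workaround quoted above, the same idea can be applied to the original pattern from the question by swapping [:punct:] for the \p{P} and \p{S} properties. The variable name icu_pattern is only illustrative and is not part of stringr.

```r
library(stringr)

string <- "a . , > 1 b"

# In ICU, ">" is a math symbol (general category Sm), so it falls under
# \p{S} rather than \p{P}; listing both approximates POSIX [:punct:].
icu_pattern <- "[\\p{P}\\p{S}[:digit:][:space:]]+"

result <- str_replace_all(string, icu_pattern, " ")
print(result)
# [1] "a b"
```

This should give the same result as the base R gsub() call in the question, without adding ">" to the class by hand.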
I am working with the R programming language.
I have a column of data that looks something like this:
string = c("a1 123-456-7899 hh", "b 124-123-9999 b3")
I would like to remove the "phone numbers" so that the final result looks like this:
[1] "a1 hh" "b b3"
I tried to apply the answer from "Regular expression to match standard 10 digit phone number" to my question:
gsub("^(\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]\d{3}[\s.-]\d{4}$", "", string, fixed = TRUE)
But I get the following error: Error: '\+' is an unrecognized escape in character string starting ""^(\+"
Can someone please show me how to fix this?
Thanks!
Try:
library(stringr)
s <- c("a1 123-456-7899 hh", "b 124-123-9999 b3")
result <- str_replace(s, "\\d+[-]\\d+[-]\\d+\\s", "")
print(result)
OUTPUT:
[1] "a1 hh" "b b3"
This will look for:
\\d+ : one or more digits, followed by
[-] : a hyphen, followed by
\\d+ : one or more digits, followed by
[-] : a hyphen, followed by
\\d+ : one or more digits, followed by
\\s : a space
And replace it with "" - nothing
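For comparison, here is a minimal base R sketch of the same fix. It assumes the numbers always have the 3-3-4 digit layout from the question; phone_pattern and cleaned are illustrative names. The original call failed because single backslashes such as \+ are not valid escapes in an R string, and fixed = TRUE would in any case have made gsub() treat the pattern as literal text.

```r
string <- c("a1 123-456-7899 hh", "b 124-123-9999 b3")

# Backslashes are doubled: "\\d" in an R string literal is the regex \d.
# fixed = TRUE is dropped so the pattern is interpreted as a regex.
# No ^ or $ anchors, since the number sits in the middle of each string.
phone_pattern <- "\\d{3}[ .-]\\d{3}[ .-]\\d{4}\\s?"

cleaned <- gsub(phone_pattern, "", string)
print(cleaned)
# [1] "a1 hh" "b b3"
```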
I am trying to extract latitudes, longitudes, and a label from a string in R (v3.4.1). My thought is that a regular expression is the way to go, and since the stringr package has the ability to extract capturing groups, I thought this is the package to use. The problem is that I am receiving an error that I cannot interpret. Any help would be appreciated.
Here is an example of a string that I would like to extract the information from. I want to grab the last set of latitude (41.505) and longitude (-81.608333) along with the label (Adelbert Hall).
a <- "Case Western Reserve University campus41°30′18″N 81°36′30″W / 41.505°N 81.608333°W / 41.505; -81.608333 (Adelbert Hall)"
Here is the regular expression that I created to grab the fields that I am interested in.
coordRegEx <- "([\\d]*\\.\\d*)(?#Capture Latitude);\\h(-\\d*\\.\\d*)(?#Capture Longitude)\\N*\\((\\N*)(?#Capture Label)\\)"
Now, when I try to match the regular expression in the string using:
s <- str_match(a,coordRegEx)
I get the following error:
Error in stri_match_first_regex(string, pattern, opts_regex = opts(pattern)) : Incorrect Unicode property. (U_REGEX_PROPERTY_SYNTAX)
My guess is that this error has something to do with the Regex pattern, but using documentation and web searches, I have been unable to decipher it.
There are several issues with the current code:
The (?#...) constructs are comments that are only allowed when you pass an x modifier to the regex.
The \N shorthand character that matches any non-line break char is not supported by the ICU regex library (it supports \N{UNICODE CHARACTER NAME} that matches a named character). You may replace \N with . (a dot).
Here is the fixed approach:
> a <- "Case Western Reserve University campus41°30′18″N 81°36′30″W / 41.505°N 81.608333°W / 41.505; -81.608333 (Adelbert Hall)"
> coordRegEx <- "(?x)(\\d*\\.\\d*)(?#Capture Latitude);\\h(-\\d*\\.\\d*)(?#Capture Longitude).*\\((.*)(?#Capture Label)\\)"
> s <- str_match(a,coordRegEx)
> s
     [,1]                                 [,2]     [,3]         [,4]
[1,] "41.505; -81.608333 (Adelbert Hall)" "41.505" "-81.608333" "Adelbert Hall"
If we need the output as a single string:
sub(".*\\/\\s*", "", a)
#[1] "41.505; -81.608333 (Adelbert Hall)"
If we need the parts as separate elements:
strsplit(sub(".*\\/\\s*", "", a), ";\\s*|\\s*\\(|\\)")[[1]]
#[1] "41.505" "-81.608333" "Adelbert Hall"
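If capture groups are preferred over splitting, a base R sketch along the following lines should also work. The pattern here is a simplified variant written for illustration, not the asker's original, and the names coord_pattern, parts, lat, lon and label are assumptions.

```r
a <- "Case Western Reserve University campus41°30′18″N 81°36′30″W / 41.505°N 81.608333°W / 41.505; -81.608333 (Adelbert Hall)"

# Three capture groups: latitude, longitude, and the label in parentheses.
coord_pattern <- "(-?\\d+\\.\\d+);\\s*(-?\\d+\\.\\d+)\\s*\\(([^)]*)\\)"

# regexec() returns the positions of the full match and of each group;
# regmatches() turns those positions into strings.
parts <- regmatches(a, regexec(coord_pattern, a))[[1]]

lat   <- as.numeric(parts[2])
lon   <- as.numeric(parts[3])
label <- parts[4]
print(list(lat = lat, lon = lon, label = label))
```

The semicolon in the pattern is what skips the earlier "41.505°N 81.608333°W" pair, since only the last pair is separated by "; ".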
I cannot fully understand why my regular expression does not work to extract the info I want. I have an unlisted vector that looks like this:
text <- c("Senator, 1.4balbal", "rule 46.1, declares",
"Town, 24", "A Town with a Long Name, 23", "THIS IS A DOCUMENT,23")
I would like to create a regular expression to extract only the name of the "Town", even if the town has a long name as the one written in the vector ("A Town with a Long Name"). I have tried this to extract the name of the town:
reg.town <- "[[:alpha:]](.+?)+,(.+?)\\d{2}"
towns <- unlist(str_extract_all(text, reg.town))
but I extract everything around the ",".
Thanks in advance,
It looks like a town name starts with a capital letter ([[:upper:]]), ends with a comma (or continues to the end of text if there is no comma) ([^,]+) and should be at the start of the input text (^). The corresponding regex in this case would be:
^[[:upper:]][^,]+
Demo: https://regex101.com/r/QXYtyv/1
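A minimal sketch of applying this regex in base R, assuming the text vector from the question (the variable name starts_upper is illustrative):

```r
text <- c("Senator, 1.4balbal", "rule 46.1, declares",
          "Town, 24", "A Town with a Long Name, 23", "THIS IS A DOCUMENT,23")

# regexpr() finds the first match in each element; regmatches() keeps only
# the elements that matched, so "rule 46.1, declares" is dropped.
starts_upper <- regmatches(text, regexpr("^[[:upper:]][^,]+", text))
print(starts_upper)
# returns "Senator", "Town", "A Town with a Long Name", "THIS IS A DOCUMENT"
```

Note that this pattern only looks at the part before the first comma, so it also keeps "Senator" and "THIS IS A DOCUMENT"; whether those count as towns depends on what follows the comma.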
I have solved the problem thanks to @Dmitry Egorov's demo posted in the comments. The regular expression is this one: ([[:upper:]].+?, [[:digit:]])
Thanks for your quick replies!!
You may use the following regex:
> library(stringr)
> text <- c("Senator, 1.4balbal", "rule 46.1, declares", "Town, 24", "A Town with a Long Name, 23", "THIS IS A DOCUMENT,23")
> towns <- unlist(str_extract_all(text, "\\b\\p{Lu}[^,]++(?=, \\d)"))
> towns
[1] "Senator" "Town"
[3] "A Town with a Long Name"
The regex matches:
\\b - a leading word boundary
\\p{Lu} - an uppercase letter
[^,]++ - 1+ chars other than a , (possessively, due to ++ quantifier, with no backtracking into this pattern for a more efficient matching)
(?=, \\d) - a positive lookahead that requires a ,, then a space and then any digit to appear immediately after the last non-, symbol matched with [^,]++.
Note you may get the same results with base R using the same regex with a PCRE option enabled:
> towns_baseR <- unlist(regmatches(text, gregexpr("\\b\\p{Lu}[^,]++(?=, \\d)", text, perl=TRUE)))
> towns_baseR
[1] "Senator" "Town"
[3] "A Town with a Long Name"
Most stringr functions are just wrappers around corresponding stringi functions. str_replace_all is one of those. Yet my code does not work with stri_replace_all, the corresponding stringi function.
I am writing a quick regex to convert (a subset of) camel case to spaced words.
I am quite puzzled as to why this works:
str <- "thisIsCamelCase aintIt"
stringr::str_replace_all(str,
pattern="(?<=[a-z])([A-Z])",
replacement=" \\1")
# "this Is Camel Case aint It"
And this does not:
library(stringi)
stri_replace_all(str,
regex="(?<=[a-z])([A-Z])",
replacement=" \\1")
# "this 1s 1amel 1ase aint 1t"
If you look at the source for stringr::str_replace_all you'll see that it calls fix_replacement(replacement) to convert the \\# capture group references to $#. The help for stringi::stri_replace_all, on the other hand, clearly shows that you use $1, $2, etc. for the capture groups.
str <- "thisIsCamelCase aintIt"
stri_replace_all(str, regex="(?<=[a-z])([A-Z])", replacement=" $1")
## [1] "this Is Camel Case aint It"
The below option should return the same output in both cases.
pat <- "(?<=[a-z])(?=[A-Z])"
str_replace_all(str, pat, " ")
#[1] "this Is Camel Case aint It"
stri_replace_all(str, regex=pat, " ")
#[1] "this Is Camel Case aint It"
According to the help page of ?stri_replace_all, there are examples that suggest $1, $2 are used for replacement
stri_replace_all_regex('123|456|789', '(\\p{N}).(\\p{N})', '$2-$1')
So, it should work if we replace the \\1 with $1
stri_replace_all(str, regex = "(?<=[a-z])([A-Z])", " $1")
#[1] "this Is Camel Case aint It"
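For completeness, a base R sketch of the same replacement. This is not from either package discussed above; it only illustrates that base R keeps the \\1 backreference syntax (the variable name spaced is an assumption).

```r
str <- "thisIsCamelCase aintIt"

# perl = TRUE selects PCRE, which is needed for the lookbehind (?<=...).
# In base R the replacement uses \\1 for the first group, unlike stringi's $1.
spaced <- gsub("(?<=[a-z])([A-Z])", " \\1", str, perl = TRUE)
print(spaced)
# [1] "this Is Camel Case aint It"
```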
Is there a function in R that meets this requirement:
if string1 exists in string2 then remove string1 from string2
I spent a day searching for such a function, so any help would be appreciated.
Edit:
I have a dataframe. Here's a part of it:
mark      name                                 ChekMark
Caudalie  Caudalie Eau démaquillante 200ml     TRUE
Mustela   Mustela Bébé lait hydra corps 300ml  TRUE
Lierac    Lierac Phytolastil gel prévention    TRUE
I want to create a new data frame in which the mark does not appear in the product name.
That's my final goal.
You can use gsub and work with regular expressions:
gsub(" this part ", " ", "A Text where this part should be removed")
# [1] "A Text where should be removed"
gsub(" this part ", " ", "A Text where this 1 part should be removed")
# [1] "A Text where this 1 part should be removed"
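For the data frame in the question, each row has its own string to remove, so a row-wise sketch like the following may be closer to the goal. It assumes the columns are named mark and name as shown; the new column name product is hypothetical.

```r
df <- data.frame(
  mark = c("Caudalie", "Mustela", "Lierac"),
  name = c("Caudalie Eau démaquillante 200ml",
           "Mustela Bébé lait hydra corps 300ml",
           "Lierac Phytolastil gel prévention"),
  stringsAsFactors = FALSE
)

# sub() is vectorised over x but not over pattern, so mapply() pairs each
# mark with its own name. fixed = TRUE matches the mark as literal text,
# and trimws() drops the space left behind.
df$product <- mapply(
  function(mark, name) trimws(sub(mark, "", name, fixed = TRUE)),
  df$mark, df$name, USE.NAMES = FALSE
)
print(df$product)
```

Using fixed = TRUE avoids surprises if a brand name ever contains regex metacharacters such as "." or "+".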
Are you looking for string2.replace(string1, '')?
or you could:
>>> R = lambda string1, string2: string2.replace(string1, '')
>>> R('abc', 'AAAabcBBB')
'AAABBB'
>>>