Keep text between 2nd dash and first slash in R - r

I have a vector of strings that look like this:
a - bc/def_g - A/mn/us/ww
opq - rs/ts_uf - BC/wx/yza
Abc - so/dhie7u - XYZ/En/xy/jkq - QWNE
I'd like to get the text after the 2nd dash (-) but before the first slash (/), i.e. the result should look like
A
BC
XYZ
What is the best way to do it? (The vector has more than 500K rows.)
Thanks

Suppose your string is defined like this:
string <- c("a - bc/def_g - A/mn/us/ww",
"opq - rs/ts_uf - BC/wx/yza",
"Abc - so/dhie7u - XYZ/En/xy/jkq - QWNE")
Then you can use sub
> sub(".*\\-\\s+([A-Z]+)/.*", "\\1", string)
[1] "A" "BC" "XYZ"

See regex in use here
^[^-]*-[^-]*-\s*\K[^/]+
^ Assert position at the start of the line
[^-]* Match any character except - any number of times
- Match this literally
[^-]* Match any character except - any number of times
- Match this literally
\s* Match any number of whitespace characters
\K Resets the starting point of the pattern. Any previously consumed characters are no longer included in the final match
[^/]+ Match any character except / one or more times
Alternatively, as suggested by Jan in the comments below (I believe it has since been deleted), ^(?:[^-]*-){2}\s*\K[^/]+ may be used. It's shorter and easily scalable, but adds more steps.
See code in use here
x <- c("a - bc/def_g - A/mn/us/ww", "opq - rs/ts_uf - BC/wx/yza", "Abc - so/dhie7u - XYZ/En/xy/jkq - QWNE")
m <- regexpr("^[^-]*-[^-]*-\\s*\\K[^/]+", x, perl=TRUE)
regmatches(x, m)
Result: [1] "A" "BC" "XYZ"
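One thing to watch with a long vector: regmatches() silently drops elements that do not match, so the result can end up shorter than the input. A minimal sketch that keeps the result aligned with the input is shown here. It uses the shorter pattern with the {2} quantifier, and the third input string is made up for illustration.

```r
x <- c("a - bc/def_g - A/mn/us/ww",
       "opq - rs/ts_uf - BC/wx/yza",
       "no dashes here")            # hypothetical row without two dashes

# regexpr() returns -1 for elements that do not match
m <- regexpr("^(?:[^-]*-){2}\\s*\\K[^/]+", x, perl = TRUE)

# Start with NA everywhere, then fill in only the rows that matched
out <- rep(NA_character_, length(x))
out[m != -1] <- regmatches(x, m)
out
```

With this layout, out has the same length as x, and rows without a second dash come back as NA instead of disappearing.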

Related

cleaning phonenumbers using regex

I have the following composition of phonenumbers where 33 is the area code:
+331234567
+3301234567
00331234567
003301234567
0331234567
033-123-456-7
0033.1234567
where I'm expecting only 331234567
What I have tried to clean those numbers using R
stringr::str_replace_all(c("+331234567", "033-123-456-7", "0033.1234567"), pattern = "[^0-9.]", replacement = "") removing non-numeric characters
stringr::str_replace_all("0331234567", pattern = "^0", replacement = "") removing the leading 0
stringr::str_replace_all("00331234567", pattern = "^00", replacement = "") removing the leading 00
My question is how to remove the zeros in between, as in 3301234567, 003301234567, +3301234567 or 03301234567.
Appreciate any help
You can use
gsub("^(?:00?|\\+)330?|\\W", "", x, perl=TRUE)
See the regex demo. See the R demo online.
If there can be more 0s after 33 before the number you need to extract, replace 0? with 0*.
Details
^ - start of string
(?:00?|\+) - 00, 0 or +
330? - 33 or 330
| - or
\W - any non-word char.
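Note that the pattern above removes the 33 prefix as well, so the sample numbers from the question come out as 1234567. If the area code should stay, as in the expected output 331234567, one possible two-step sketch is to strip the non-digits first and then normalise the prefix. This assumes every number carries the area code 33 and that the subscriber part never starts with 0.

```r
x <- c("+331234567", "+3301234567", "00331234567", "003301234567",
       "0331234567", "033-123-456-7", "0033.1234567")

# Step 1: drop everything that is not a digit (+, -, . and so on)
digits <- gsub("[^0-9]", "", x)

# Step 2: replace leading zeros, the area code 33 and any zeros
# directly after it with a bare "33"
cleaned <- sub("^0*330*", "33", digits)
cleaned
```

All seven inputs end up as "331234567". Splitting the work into two calls keeps each pattern short and makes it easier to adjust the area code later.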
You can use ^\+?0*3*0*|[^\s\d]
Pattern explanation:
^ - match beginning of the string
\+? - match + literally, zero or one time.
0* - match zero or more 0
3* - match zero or more 3
| - alternation
[^\s\d] - negated character class - match any character other than whitespace and digit (you could remove \s if you handle one number at a time, it just prevents from matching newline in demo)
It will match unwanted parts separately. First part will clean beginning of a number if it starts with + or 0, second part will clean non-digits inside the number.

How to match phonemic transcriptions with a single vowel except if a condition applies

I have phonemic transcriptions of English words such as these:
test <- c("ˈsɜːtnli", "ˈtwɛnti", "ˈfɒksi", "kɑːnt", "ʧeɪnʤd", "vɪkˈtɔːrɪə", "wɒznt", "ðeər", "dɪdnt",
"ˈdɪzni", "ˈəʊnli", "ˈfæbrɪks", "sɪˈkjʊərɪti", "ˈnjuːzˌpeɪpər", "ɑhɑː")
I'd like to match mono-syllabic words, i.e., words that contain a single vowel. My set of phonemic vowels is this:
vowel <- "iː|aɪ|ɔː|ɔɪ|əʊ|ɛə|eɪ|aʊ|eə|uː|ɑː|ɪə|ɜː|ʊə|ə|ɪ|ɒ|ʊ|ʌ|æ|e|ɑ|ɛ|i"
Using str_count and the vector vowel as pattern, I'm able to match a fairly good set of words:
library(stringr)
test[str_count(test, vowel) == 1]
[1] "kɑːnt" "ʧeɪnʤd" "wɒznt" "ðeər" "dɪdnt"
However, wɒznt and dɪdnt can be seen as bi-syllabic (as the n sound can replace a vowel, so that nt counts as a second vowel). So the question is, how can I match mono-syllabic words except those that end in nt?
What I've tried so far is this set operation, which works well but looks clumsy:
setdiff(test[str_count(test, vowel) == 1], test[str_count(test, paste0("[^", vowel, "]nt$")) == 1])
[1] "kɑːnt" "ʧeɪnʤd" "ðeər"
I'd much rather have a single more concise regex. Any ideas?
You can use
test <- c("ˈsɜːtnli", "ˈtwɛnti", "ˈfɒksi", "kɑːnt", "ʧeɪnʤd", "vɪkˈtɔːrɪə", "wɒznt", "ðeər", "dɪdnt",
"ˈdɪzni", "ˈəʊnli", "ˈfæbrɪks", "sɪˈkjʊərɪti", "ˈnjuːzˌpeɪpər", "ɑhɑː")
vowel <- "iː|aɪ|ɔː|ɔɪ|əʊ|ɛə|eɪ|aʊ|eə|uː|ɑː|ɪə|ɜː|ʊə|ə|ɪ|ɒ|ʊ|ʌ|æ|e|ɑ|ɛ|i"
library(stringr)
p <- paste0("^(?!.*(?<!",vowel,")nt$)(?:(?!",vowel,").)*(?:",vowel,")(?:(?!",vowel,").)*$")
test[str_detect(test, p)]
## => [1] "kɑːnt" "ʧeɪnʤd" "ðeər"
See the online R demo. See the regex demo. The pattern means
^ - start of string
(?!.*(?<!",vowel,")nt$) - immediately to the right, there must not be any 0+ chars other than line break chars as many as possible followed with nt (not preceded with any of the specified vowel sound sequences) and end of string
(?:(?!",vowel,").)* - any char but a line break char, zero or more times as many as possible, that does not start a vowel char sequence
(?:",vowel,") - any of the specified vowel sound sequences
(?:(?!",vowel,").)* - any char but a line break char, zero or more times as many as possible, that does not start a vowel char sequence
$ - end of string.
This is a somewhat concise solution (thanks to @G5W for the decisive hint):
vowel_cc <- paste0(unique(unlist(strsplit(gsub("\\|", "", vowel), ""))), collapse = "")
vowel_cc
[1] "iːaɪɔəʊɛeuɑɜɒʌæ"
test[str_count(test, paste0(vowel, "|[^", vowel_cc, "]+nt$")) == 1]
[1] "kɑːnt" "ʧeɪnʤd" "ðeər"
This solution uses a vector vowel_cc consisting of all unique characters in vowels. These serve as input for a negated character class. The pattern specifies nt as one of the vowel alternatives on the condition that it be preceded by one or more non-vowel_ccs and occur at string end.
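If a single regex is not a hard requirement, the same rule can also be spelled out as two separate checks, which may be easier to read and maintain. The following is only a sketch: is_monosyllabic is a made-up helper name, and the lookbehind is the same construct used in the longer pattern earlier in this thread.

```r
library(stringr)

vowel <- "iː|aɪ|ɔː|ɔɪ|əʊ|ɛə|eɪ|aʊ|eə|uː|ɑː|ɪə|ɜː|ʊə|ə|ɪ|ɒ|ʊ|ʌ|æ|e|ɑ|ɛ|i"

# Hypothetical helper: TRUE for words with exactly one vowel sound
# that do not end in an "nt" lacking a vowel right before it
is_monosyllabic <- function(words, vowel) {
  one_vowel   <- str_count(words, vowel) == 1
  syllabic_nt <- str_detect(words, paste0("(?<!", vowel, ")nt$"))
  one_vowel & !syllabic_nt
}

test <- c("kɑːnt", "ʧeɪnʤd", "wɒznt", "ðeər", "dɪdnt", "ˈtwɛnti")
test[is_monosyllabic(test, vowel)]
```

Here kɑːnt is kept because its nt follows the vowel ɑː, while wɒznt and dɪdnt are dropped because their nt follows a consonant.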

Positive Lookbehind and Lookahead to the end of string

My string pattern looks like this:
UNB+UNOC:3+4399945681577+_GLN_Company__+180101:0050+10870 and I am trying to extract everything after the second last +, i.e. 180101:0050+10870.
Thus far, I managed to address the second last block 180101:0050 with this expression (?<=\+)[^\+]+(?=\+[^\+]*$) but fail to include the last block including the last +. Here is my sample: regex101
The expression is meant for R and I still need to escape the characters later on. This format is just for testing purposes in Regex101.
We could use a capture group based on the occurrence of + counted from the end ($) of the string.
sub(".*\\+([^+]+\\+[^+]+$)", "\\1", str1)
#[1] "180101:0050+10870"
data
str1 <- "UNB+UNOC:3+4399945681577+_GLN_Company__+180101:0050+10870"
You may use
\+\K[^+]+\+[^+]*$
Or, if you would like to use it with stringr::str_extract:
(?<=\+)[^+]+\+[^+]*$
See the regex demo. Details:
\+ - a + char
\K - match reset operator
(?<=\+) - location right after a + symbol
[^+]+ - one or more chars other than +
\+ - a +
[^+]* - zero or more chars other than +
$ - end of string.
See R demo online:
x <- "UNB+UNOC:3+4399945681577+_GLN_Company__+180101:0050+10870"
regmatches(x, regexpr("\\+\\K[^+]+\\+[^+]*$", x, perl=TRUE))
## => [1] "180101:0050+10870"
library(stringr)
str_extract(x, "(?<=\\+)[^+]+\\+[^+]*$")
## => [1] "180101:0050+10870"
Another way to do it in this case:
library(stringr)
str_extract("UNB+UNOC:3+4399945681577+_GLN_Company__+180101:0050+10870", "\\d+:\\d+\\+\\d+")
#"180101:0050+10870"
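For comparison, the same result can be reached without any regex by splitting on the literal + and gluing the last two fields back together. This is a sketch under the assumption that every string has at least two fields; last_two_fields is a made-up helper name.

```r
# Hypothetical helper: return the last two sep-separated fields of each string
last_two_fields <- function(x, sep = "+") {
  # fixed = TRUE treats "+" as a literal character, not a regex quantifier
  parts <- strsplit(x, sep, fixed = TRUE)
  vapply(parts,
         function(p) paste(utils::tail(p, 2), collapse = sep),
         character(1))
}

x <- "UNB+UNOC:3+4399945681577+_GLN_Company__+180101:0050+10870"
last_two_fields(x)
```

Because the separator is matched literally, there is nothing to escape, which sidesteps the escaping step mentioned in the question.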

Remove hashtags from beginning and end of tweets in R

I am trying to remove hashtags from the end of strings in R.
For example:
x<- "I didn't know it could be #boring. guess I need some fun #movie #lateNightThoughts"
I want to remove the hashtags at the end of string which are #lateNightThoughts and #movie. Result:
- "I didn't know it could be #boring. guess I need some fun"
I tried :
stringi::stri_replace_last_regex(x,'#\\S+',"")
but it removes only the very last hashtag.
- "I didn't know it could be #boring. guess I need some fun #movie "
Any idea how to get the expected result?
Edit:
How about removing hashtags from the beginning of the text?
eg:
x<- "#Thomas20 I didn't know it could be #boring. guess I need some fun #movie #lateNightThoughts"
You may use
> x<- "I didn't know it could be #boring. guess I need some fun #movie #lateNightThoughts"
> sub("\\s*\\B#\\w+(?:\\s*#\\w+)*\\s*$", "", x)
[1] "I didn't know it could be #boring. guess I need some fun"
Or, if you do not care about the context of the first # you want to start matching from, you may even use
sub("(?:\\s*#\\w+)+\\s*$", "", x)
See the regex demo.
Details
\s* - zero or more whitespaces
\B - right before the current location, there can be start of string or a non-word char (this is usually used to ensure you do not match # inside a "word", so if you do not need it, you may remove this non-word boundary)
# - a # char
\w+ - 1 or more word chars (letters, digits or _)
(?:\s*#\w+)* - zero or more occurrences of:
\s* - zero or more whitespaces
# - a # char
\w+ - 1+ word chars
\s* - zero or more whitespaces
$ - end of string.
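For the follow-up in the question's edit (hashtags at the beginning of the text), a mirrored pattern anchored at ^ instead of $ should do. This is a sketch rather than part of the original answer. It chains the leading and trailing removal, and uses perl = TRUE only to be explicit about the regex flavour.

```r
x <- "#Thomas20 I didn't know it could be #boring. guess I need some fun #movie #lateNightThoughts"

# Remove one or more hashtags (and the whitespace after them) at the start
no_leading <- sub("^\\s*(?:#\\w+\\s*)+", "", x, perl = TRUE)

# Then remove the run of hashtags at the end, as in the answer above
no_edges <- sub("(?:\\s*#\\w+)+\\s*$", "", no_leading, perl = TRUE)
no_edges
```

The hashtag #boring in the middle of the sentence is left alone because neither pattern can reach it from its anchor.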

Replace some text after a string with Regex and Gsub in R

It's a simple question, but I'm not good with Regex. (I tried many expressions without success)
I want to replace all the text after a pattern with nothing, i.e. remove it.
My pattern is something like this:
/canais/*/
My data is:
/canais/b3/conheca-o-pai-dos-indices-da-b3/
/canais/cpbs/cvm-abre-audiencia-publica-de-instruc
/canais/stocche-forbes/dividendo-controverso/
The desired result is:
/canais/b3/
/canais/cpbs/
/canais/stocche-forbes/
How can I do it with gsub?
Thanks
You may use the following sub:
x <- c("/canais/b3/conheca-o-pai-dos-indices-da-b3/","/canais/cpbs/cvm-abre-audiencia-publica-de-instruc","/canais/stocche-forbes/dividendo-controverso/")
sub("^(/canais/[^/]+/).*", "\\1", x)
See the online R demo
Details:
^ - start of string
(/canais/[^/]+/) - Group 1 (later referred to with \1) capturing:
/canais/ - a substring /canais/
[^/]+ - 1 or more chars other than /
/ - a slash
.* - any 0+ chars up to the end of string.
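Keep in mind that sub() returns a string unchanged when the pattern does not match, so paths outside /canais/ would pass through untouched. If those should be flagged instead, a small wrapper can return NA for them. This is a sketch; keep_channel_prefix and the /outros/ path are made up for illustration.

```r
# Hypothetical helper: keep "/canais/<name>/" or return NA if the path
# does not have that shape
keep_channel_prefix <- function(x) {
  pattern <- "^(/canais/[^/]+/).*"
  ifelse(grepl(pattern, x), sub(pattern, "\\1", x), NA_character_)
}

x <- c("/canais/b3/conheca-o-pai-dos-indices-da-b3/",
       "/canais/cpbs/cvm-abre-audiencia-publica-de-instruc",
       "/outros/b3/pagina/")        # made-up path that should not match
keep_channel_prefix(x)
```

The first two elements are reduced to /canais/b3/ and /canais/cpbs/, and the third becomes NA.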

Resources