Using R, how to use str_extract properly on this case? - r

I've learned from Ronak Shah and akrun(in this post) how to construct a regular expression to exclude every terms from a dataframe (alldata in my example) except those words,
^\BWORD1|WORD2|WORD3|WORD4|WORD5\>
but for some reasons, can't figure why it is giving me
"WORD2", "WORD3", NA
instead of
"WORD1 WORD2 WORD5", "WORD3", NA
here is my example :
library(stringr)
alldata <- data.frame(toupper(c("word1 anotherword word2 word5", "word3", "none")))
names(alldata)<-"columna"
removeex <- c("word1" , "word2" ,"word3" ,"word4", "word5")
regularexprex <- toupper(paste0("^\\b",paste0(removeex, collapse = "|"), "\\>"))
alldata$columnb <- str_extract(alldata$columna, regularexprex)
I've tried to add + or * at the end of the regular expression but without any effects.
Due to the fact i'm a beginner on regex, i surely miss something, may someone guide me on this ?
Regards,

You need to replace the last two lines in your above code to
> regularexprex <- paste0("(?i)\\s*\\b(?!(?:",paste0(removeex, collapse = "|"), ")\\b)\\w+")
## => "(?i)\\s*\\b(?!(?:word1|word2|word3|word4|word5)\\b)\\w+"
> str_replace_all(alldata$columna, regularexprex, "")
[1] "WORD1 WORD2 WORD5" "WORD3" ""
First, the toupper() turned \b to \B (non-word boundary) - you just need a case insensitive matching (I added the (?i) modifier), and the word boundaries were not applied to the group, only to the items on the both sides.
Also, what you need is a pattern to match the whole string, so .* at the start and end of the pattern.
The final regex for replacing looks like
(?i)\s*\b(?!(?:word1|word2|word3|word4|word5)\b)\w+
See the regex demo
If your entries contain newlines, you should also add s modifier: (?i) -> (?s).
Details:
(?i) - case insensitive modifier (works with PCRE and ICU regexes)
\s* - 0+ whitespaces
\b - a leading word boundary
(?!(?:word1|word2|word3|word4|word5)\b) - the word cannot equal word1, etc.
\w+ - 1+ word chars (letters, digits or underscores).

Related

Remove all words in string containing punctuation (R)

How (in R) would I remove any word in a string containing punctuation, keeping words without?
test.string <- "I am:% a test+ to& see if-* your# fun/ction works o\r not"
desired <- "I a see works not"
Here is an approach using sub which seems to work:
test.string <- "I am:% a test$ to& see if* your# fun/ction works o\r not"
gsub("[A-Za-z]*[^A-Za-z ]\\S*\\s*", "", test.string)
[1] "I a see works not"
This approach is to use the following regex pattern:
[A-Za-z]* match a leading letter zero or more times
[^A-Za-z ] then match a symbol once (not a space character or a letter)
\\S* followed by any other non whitespace character
\\s* followed by any amount of whitespace
Then, we just replace with empty string, to remove the words having one or more symbols in them.
You can use this regex
(?<=\\s|^)[a-z0-9]+(?=\\s|$)
(?<=\\s|^) - positive lookbehind, match should be preceded by space or start of string.
[a-z0-9]+ - Match alphabets and digits one or more time,
(?=\\s|$) - Match must be followed by space or end of string
Demo
Tim's edit:
This answer uses a whitelist approach, namely identify all words which the OP does want to retain in his output. We can try matching using the regex pattern given above, and then connect the vector of matches using paste:
test.string <- "I am:% a test$ to& see if* your# fun/ction works o\\r not"
result <- regmatches(test.string,gregexpr("(?<=\\s|^)[A-Za-z0-9]+(?=\\s|$)",test.string, perl=TRUE))[[1]]
paste(result, collapse=" ")
[1] "I a see works not"
Here are couple of more approaches
First approach:
str_split(test.string, " ", n=Inf) %>% # spliting the line into words
unlist %>%
.[!str_detect(., "\\W|\r")] %>% # detect words without punctuation or \r
paste(.,collapse=" ") # collapse the words to get the line
Second approach:
str_extract_all(test.string, "^\\w+|\\s\\w+\\s|\\w+$") %>%
unlist %>%
trimws() %>%
paste(., collapse=" ")
^\\w+ - words having only [a-zA-Z0-9_] and also is start of the string
\\s\\w+\\s - words with [a-zA-Z0-9_] and having space before and after the word
\\w+$ - words having [a-zA-Z0-9_] and also is the end of the string

Regular Expression - Extract Word in r

how can I extract MLA723950998 from this string?
"https://auto.mercadolibre.com.ar/MLA-723950998-peugeot-208-0km-16-active-plan-100-financiado-darc-_JM"
I was able to manage to extract MLA.
gsub('.*(M\\w+).*', '\\1', "https://auto.mercadolibre.com.ar/MLA-723950998-peugeot-208-0km-16-active-plan-100-financiado-darc-_JM")
MLA
You may use
.*/(M\w+)-(\d+).*
and replace with \1\2.
Details
.*/ - any 0+ chars, as many as possible, up to and including the last / in the string
(M\w+) - Group 1 (later referred to with \1 placeholder from the replacement pattern): M and 1+ letters, digits or/and _
- - a hyphen
(\d+) - Group 2 (later referred to with \2 placeholder from the replacement pattern): one or more digits
.* - the rest of the string.
See the regex demo
See the R demo:
x <- "https://auto.mercadolibre.com.ar/MLA-723950998-peugeot-208-0km-16-active-plan-100-financiado-darc-_JM"
gsub('.*/(M\\w+)-(\\d+).*', '\\1\\2', x)
# => [1] "MLA723950998"
Maybe this solution works for you:
library(stringi)
x = "https://auto.mercadolibre.com.ar/MLA-723950998-peugeot-208-0km-16-active-plan-100-financiado-darc-_JM"
stri_extract_last_regex(x, "(?<=/)([A-Za-z]+.\\d+)(?=[^/]+$)")
[1] "MLA-723950998"
(i) The first lookbehind finds the position of a slash, (ii) which is then followed by letters, 1 x any character and digits, (iii) which by the lookahead may only be followed by anything but a slash.

Replace all characters except expression using gsub only

Given strings:
smple_paths <- c("/path/path/path/abc22/path/path",
"/apath/apath/paath/abc11/something/path")
I would like to replace all characters excluding phrase abc\\d{2}
Attempt
gsub(
pattern = "(?!abc\\d{2})",
replacement = "",
x = smple_paths,
perl = TRUE
)
# [1] "/path/path/path/abc22/path/path"
# [2] "/apath/apath/paath/abc11/something/path"
Desired results
abc22
abc11
Notes
I'm not looking for stringr::str_extract based solution or any other solution not based on gsub
If you do not care about the abc\d{2} context, you may use
sub(".*(abc\\d{2}).*", "\\1", smple_paths)
See this regex demo and this R demo.
If you care about the context, you may match and capture abc + 2 digits after / and before / or end of the string, while matching any text before and after this pattern using
sub("^.*/(abc\\d{2})(?:/.*)?$", "\\1", smple_paths)
See the R demo and a regex demo.
Details
^ - start of the string (not necessary here, but kept for the sake of clarity)
.* - any 0+ chars, as many as possible
/ - a / char
(abc\\d{2}) - Group 1: abc and 2 digits
(?:/.*)? - an optional (1 or 0) occurrence of a / followed with any 0+ chars as many as possible
$ - end of string.
The \1 placeholder in the replacement pattern inserts the captured text back into the result.

Removing the second "|" on the last position

Here are some examples from my data:
a <-c("sp|Q9Y6W5|","sp|Q9HB90|,sp|Q9NQL2|","orf|NCBIAAYI_c_1_1023|",
"orf|NCBIACEN_c_10_906|,orf|NCBIACEO_c_5_1142|",
"orf|NCBIAAYI_c_258|,orf|aot172_c_6_302|,orf|aot180_c_2_405|")
For a: The individual strings can contain even more entries of "sp|" and "orf"
The results have to be like this:
[1] "sp|Q9Y6W5" "sp|Q9HB90,sp|Q9NQL2" "orf|NCBIAAYI_c_1_1023"
"orf|NCBIACEN_c_10_906,orf|NCBIACEO_c_5_1142"
"orf|NCBIAAYI_c_258,orf|aot172_c_6_302,orf|aot180_c_2_405"
So the aim is to remove the last "|" for each "sp|" and "orf|" entry. It seems that "|" is a special challenge because it is a metacharacter in regular expressions. Furthermore, the length and composition of the "orf|" entries varying a lot. The only things they have in common is "orf|" or "sp|" at the beginning and that "|" is on the last position. I tried different things with gsub() but also with the stringr package or regexpr() or [:punct:], but nothing really worked. Maybe it was just the wrong combination.
We can use gsub to match the | that is followed by a , or is at the end ($) of the string and replace with blank ("")
gsub("[|](?=(,|$))", "", a, perl = TRUE)
#[1] "sp|Q9Y6W5"
#[2] "sp|Q9HB90,sp|Q9NQL2"
#[3] "orf|NCBIAAYI_c_1_1023"
#[4] "orf|NCBIACEN_c_10_906,orf|NCBIACEO_c_5_1142"
#[5] "orf|NCBIAAYI_c_258,orf|aot172_c_6_302,orf|aot180_c_2_405"
Or we split by ,', remove the last character withsubstr, andpastethelist` elements together
sapply(strsplit(a, ","), function(x) paste(substr(x, 1, nchar(x)-1), collapse=","))
An easy alternative that might work. You need to escape the "|" using "\\|".
# Input
a <-c("sp|Q9Y6W5|","sp|Q9HB90|,sp|Q9NQL2|","orf|NCBIAAYI_c_1_1023|",
"orf|NCBIACEN_c_10_906|,orf|NCBIACEO_c_5_1142|",
"orf|NCBIAAYI_c_258|,orf|aot172_c_6_302|,orf|aot180_c_2_405|")
# Expected output
b <- c("sp|Q9Y6W5", "sp|Q9HB90,sp|Q9NQL2", "orf|NCBIAAYI_c_1_1023" ,
"orf|NCBIACEN_c_10_906,orf|NCBIACEO_c_5_1142" ,
"orf|NCBIAAYI_c_258,orf|aot172_c_6_302,orf|aot180_c_2_405")
res <- gsub("\\|,", ",", gsub("\\|$", "", a))
all(res == b)
#[1] TRUE
You could construct a single regex call to gsub, but this is simple and easy to understand. The inner gsub looks for | and the end of the string and removes it. The outer gsub looks for ,| and replaces with ,.
You do not have to use a PCRE regex here as all you need can be done with the default TRE regex (if you specify perl=TRUE, the pattern is compiled with a PCRE regex engine and is sometimes slower than TRE default regex engine).
Here is the single simple gsub call:
gsub("\\|(,|$)", "\\1", a)
See the online R demo. No lookarounds are really necessary, as you see.
Pattern details
\\| - a literal | symbol (because if you do not escape it or put into a bracket expression it will denote an alternation operator, see the line below)
(,|$) - a capturing group (referenced to with \1 from the replacement pattern) matching either of the two alternatives:
, - a comma
| - or (the alternation operator)
$ - end of string anchor.
The \1 in the replacement string tells the regex engine to insert the contents stored in the capturing group #1 back into the resulting string (so, the commas are restored that way where necessary).

Camel Case format conversion using regular expressions in R

I have two related questions regarding regular expressions in R:
[1]
I would like to convert sub-strings, containing punctuation followed by a letter, to an upper case letter.
Example:
Dr_dre to: DrDre
Captain.Spock to: CaptainSpock
spider-man to: spiderMan
[2]
I would like convert camel case strings to lower case strings with underscore delimiter.
Example:
EndOfFile to: End_of_file
CamelCase to: Camel_Case
ABC to: A_B_C
Thanks much,
Kamashay
We can use sub. We match one or more punctuation characters ([[:punct:]]+) followed by a single character which is captured as a group ((.)). In the replacement, the backreference for the capture group (\\1) is changed to upper case (\\U).
sub("[[:punct:]]+(.)", "\\U\\1", str1, perl = TRUE)
#[1] "DrDre" "CaptainSpock" "spiderMan"
For the second case, we use regex lookarounds i.e. match a letter ((?<=[A-Za-z])) followed by a capital letter and replace with _.
gsub("(?<=[A-Za-z])(?=[A-Z])", "_", str2, perl = TRUE)
#[1] "End_Of_File" "Camel_Case" "A_B_C"
data
str1 <- c("Dr_dre", "Captain.Spock", "spider-man")
str2 <- c("EndOfFile", "CamelCase", "ABC")

Resources