Extract a certain pattern string from the text by R - r

I have a column of text values that look like this:
str1 = "ABCID 123456789 is what I'm looking for, could you help me to check this Item's status?"
I want to use the gsub function in R to extract "ABCID 123456789" from it. The number may differ from row to row, but ABCID is constant. Does anyone know a solution for this? Thanks very much!

We can use str_extract to select the fixed word followed by space and one or more numbers (\\d+)
library(stringr)
str_extract(df1$col1, "ABCID \\d+")
If there are multiple instances, use str_extract_all
str_extract_all(df1$col1, "ABCID \\d+")
NOTE: The OP asks to extract "ABCID 123456789", so the pattern keeps the ABCID prefix rather than returning only the digits.
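If stringr is not available, the same extraction can be sketched in base R with regexpr and regmatches. The sample data frame below is made up for illustration (only the names df1 and col1 come from the answer above), and the NA padding is my own addition, since regmatches silently drops elements that have no match.

```r
# Hypothetical sample data; only the names df1 and col1 come from the answer
df1 <- data.frame(
  col1 = c("ABCID 123456789 is what I'm looking for",
           "no identifier in this row",
           "please check ABCID 42 as well"),
  stringsAsFactors = FALSE
)

# regexpr() finds the first match per element (-1 when there is none)
m <- regexpr("ABCID \\d+", df1$col1)

# regmatches() returns only the matched pieces, so pad the result with NA
ids <- rep(NA_character_, nrow(df1))
ids[m != -1] <- regmatches(df1$col1, m)
ids
```

This should give one element per row, with NA where no ID was found, which mirrors what str_extract returns.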

If the number has a constant length (9), you could use a positive lookbehind:
sub("(?<=ABCID \\d{9}).*", "", str1, perl = TRUE)
# [1] "ABCID 123456789"

Match the beginning of the string (^), the leading letters (ABCID), a space, the digits (\d+) and everything else (.*), then replace it all with the captured portion, i.e. the portion within parentheses. Note that we want to use sub, not gsub, here because there is only one substitution per string.
sub("^(ABCID \\d+).*", "\\1", str1)
## [1] "ABCID 123456789"
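One caveat with the sub approach: when the pattern does not match, sub returns the input unchanged rather than NA. Here is a minimal sketch with a made-up second string, assuming you would rather get NA for rows without an ID:

```r
# The second element is a made-up example of a row without an ID
x <- c("ABCID 123456789 is what I'm looking for",
       "nothing to extract here")

# sub() leaves non-matching strings untouched
res <- sub("^(ABCID \\d+).*", "\\1", x)

# Guard with grepl() so non-matching rows become NA instead
res_na <- ifelse(grepl("^ABCID \\d+", x), res, NA_character_)
res_na
```

Whether NA or the untouched string is the better fallback depends on what you do with the column afterwards.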

Related

Regex get string between intervals underscores

I've seen a lot of similar questions, but I wasn't able to get the desired output.
I have a string means_variab_textimput_x2_200.txt and I want to catch ONLY what is between the second and third underscores: textimput
I'm using R, stringr, I've tried many things, but none solved the issue:
my_string <- "means_variab_textimput_x2_200.txt"
str_extract(my_string, '[_]*[^_]*[_]*[^_]*[_]*[^_]*')
"means_variab_textimput"
str_extract(my_string, '^(?:([^_]+)_){4}')
"means_variab_textimput_x2_"
str_extract(my_string, '[_]*[^_]*[_]*[^_]*[_]*[^_]*\\.') ## the closer I got was this
"_textimput_x2_200."
Any ideas? Ps: I'm VERY new to Regex, so details would be much appreciated :)
additional question: can I also get only a "part" of the word? let's say, instead of textimput only text but without counting the words? It would be good to know both possibilities
Several related answers were helpful, but I couldn't get the final expected result. Thanks in advance.
stringr uses ICU-based regular expressions. One option would be regex lookarounds, but ICU requires a lookbehind to have a bounded length, and here the prefix length is not fixed, so a (?<=...) built with + or * wouldn't work. Another option is to either remove the unwanted substrings with str_remove, or use str_replace to match the whole string, capture the third word, i.e. a run of characters other than _ ([^_]+), and replace everything with the backreference (\\1) to the captured word
library(stringr)
str_replace(my_string, "^[^_]+_[^_]+_([^_]+)_.*", "\\1")
[1] "textimput"
If we need only the first four characters of that word
str_replace(my_string, "^[^_]+_[^_]+_([^_]{4}).*", "\\1")
[1] "text"
In base R, it is easier with strsplit and get the third word with indexing
strsplit(my_string, "_")[[1]][3]
# [1] "textimput"
Or use perl = TRUE in regexpr
regmatches(my_string, regexpr("^([^_]+_){2}\\K[^_]+", my_string, perl = TRUE))
# [1] "textimput"
For the substring
regmatches(my_string, regexpr("^([^_]+_){2}\\K[^_]{4}", my_string, perl = TRUE))
[1] "text"
Following up on the question asked in the comments about restricting the size of the extracted word: this can easily be achieved with a quantifier. If, for example, you want to extract only the first 4 letters:
sub("[^_]+_[^_]+_([^_]{4}).*$", "\\1", my_string)
[1] "text"
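If you need this for more than one position, the strsplit idea generalizes to a small helper. This is only a sketch: nth_field is a hypothetical name of my own, not a base R or stringr function, and the second file name is made up to show what happens when a string has too few fields.

```r
# Hypothetical helper: return the n-th separator-delimited field of each string
nth_field <- function(x, n, sep = "_") {
  parts <- strsplit(x, sep, fixed = TRUE)
  vapply(parts,
         function(p) if (length(p) >= n) p[n] else NA_character_,
         character(1))
}

files <- c("means_variab_textimput_x2_200.txt", "a_b")

word  <- nth_field(files, 3)   # whole third field, NA when there is none
short <- substr(word, 1, 4)    # only its first four characters
word
short
```

Using fixed = TRUE treats the separator literally, so no regex is involved at all in this version.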

Add symbol between the letter S and any number in a column dataframe

I am trying to add a - between letter S and any number in a column of a data frame. So, this is an example:
VariableA
TRS34
MMH22
GFSR104
GS23
RRTM55
P3
S4
My desired output is:
VariableA
TRS-34
MMH22
GFSR104
GS-23
RRTM55
P3
S-4
I was trying to use gsub:
gsub('^([a-z])-([0-9]+)$','\\1d\\2',myDF$VariableA)
but this is not working.
How can I solve this?
Thanks!
Your ^([a-z])-([0-9]+)$ regex attempts to match strings that consist of a single lowercase letter, then a -, and then one or more digits. This can't work because there are no hyphens in the strings yet: the hyphen is what you want to introduce.
You can use
gsub('(S)([0-9])', '\\1-\\2', myDF$VariableA)
The (S)([0-9]) regex matches and captures S into Group 1 (\1) and then any digit is captured into Group 2 (\2) and the replacement pattern is a concatenation of group values with a hyphen in between.
If there is only one substitution expected, replace gsub with sub.
Other variations:
gsub('(S)(\\d)', '\\1-\\2', myDF$VariableA) # \d also matches digits
gsub('(?<=S)(?=\\d)', '-', myDF$VariableA, perl=TRUE) # Lookarounds make backreferences redundant
Here is the version I like, using gsub:
myDF$VariableA <- gsub('S(\\d)', 'S-\\1', myDF$VariableA)
This requires using only one capture group.
Using stringr package
library(stringr)
str_replace_all(myDF$VariableA, 'S(\\d)', 'S-\\1')
You could also use lookbehinds if you set perl=TRUE:
gsub('(?<=S)([0-9]+)', '-\\1', myDF$VariableA, perl=TRUE)
[1] "TRS-34" "MMH22" "GFSR104" "GS-23" "RRTM55" "P3" "S-4"
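To check the variants above against the data from the question, here is a small self-contained sketch. The data frame is rebuilt from the values listed in the question, and the names expected, v1, v2 and v3 are mine.

```r
# Rebuild the example column from the question
myDF <- data.frame(
  VariableA = c("TRS34", "MMH22", "GFSR104", "GS23", "RRTM55", "P3", "S4"),
  stringsAsFactors = FALSE
)
expected <- c("TRS-34", "MMH22", "GFSR104", "GS-23", "RRTM55", "P3", "S-4")

# Two capture groups, hyphen placed between them
v1 <- gsub("(S)([0-9])", "\\1-\\2", myDF$VariableA)

# Zero-width lookarounds: insert a hyphen at the S|digit boundary
v2 <- gsub("(?<=S)(?=\\d)", "-", myDF$VariableA, perl = TRUE)

# Single capture group, first occurrence only
v3 <- sub("S(\\d)", "S-\\1", myDF$VariableA)

all(v1 == expected, v2 == expected, v3 == expected)
```

Note that GFSR104 is left alone by all three, because its S is followed by R rather than a digit.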

Using regular expressions (regex) to replace multiple patterns at the same time in R

I have a vector of strings and I want to remove -es from all strings (words) ending in either -ses or -ces at the same time. The reason I want to do it at the same time and not consecutively is that sometimes, after removing one ending, the other ending appears, and I don't want to apply this pattern to a single word twice.
I have no idea how to use two patterns at the same time, but this is the best I could do:
text <- gsub("[sc]+s$", "[sc]", text)
I know the replacement is not correct, but I wonder how can I show that I want to replace it with the letter I just detected (c or s in this case). Thank you in advance.
To remove es at the end of words, that is preceded with s or c, you may use
gsub("([sc])es\\b", "\\1", text)
gsub("(?<=[sc])es\\b", "", text, perl=TRUE)
To remove them at the end of strings, you can go on using your $ anchor:
gsub("([sc])es$", "\\1", text)
gsub("(?<=[sc])es$", "", text, perl=TRUE)
The first gsub TRE pattern is ([sc])es\b: a capturing group #1 that matches either s or c, and then es is matched, and then \b makes sure the next char is not a letter, digit or _. The \1 in the replacement is the backreference to the value stored in the capturing group #1 memory buffer.
In the second example with the PCRE regex (due to perl=TRUE), (?<=[sc]) positive lookbehind is used instead of the ([sc]) capturing group. Lookbehinds are not consuming text, the text they match does not land in the match value, and thus, there is no need to restore it anyhow. The replacement is an empty string.
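A quick sketch of how the two anchors differ, using a made-up vector (the words are mine, not from the question). The \\b version trims every matching word inside a string, while the $ version only touches the end of each string:

```r
# Made-up sample words and one multi-word string
text <- c("processes", "faces", "games", "buses and places")

# Word boundary: every word ending in ses/ces loses its "es"
by_word <- gsub("([sc])es\\b", "\\1", text)

# End-of-string anchor: only the last word of each string can match
by_string <- gsub("([sc])es$", "\\1", text)

# Lookbehind variant of the word-boundary pattern (PCRE)
by_word_pcre <- gsub("(?<=[sc])es\\b", "", text, perl = TRUE)

by_word     # "process" "fac" "games" "bus and plac"
by_string   # "process" "fac" "games" "buses and plac"
```

Each word is handled in a single pass, so "processes" becomes "process" and is not trimmed a second time.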
Strings ending with "ces" and "ses" follow the same pattern, i.e. "[cs]es$"
If I understand it correctly, then you don't need two patterns.
Example:
x = c("ces", "ses", "mes")
gsub(pattern = "([cs])es$", replacement = "\\1", x)
[1] "c" "s" "mes"
Hope it helps.

How to extract a string between a symbol and a space?

I am trying to extract usernames tagged in a text-chat, such as "#Jack #Marie Hi there!"
I am trying to do it on the combination of # and whitespace but I cannot get the regex to match non-greedy (or at least this is what I think is wrong):
library(stringr)
str_extract(string = '#This is what I want to extract', pattern = "(?<=#)(.*)(?=\\s+)")
[1] "This is what I want to"
What I would like to extract instead is only This.
You could make your regex non greedy:
(?<=#)(.*?)(?=\s+)
Or if you want to capture only "This" after the # sign, you could try it like this using only a positive lookbehind:
(?<=#)\w+
Explanation
A positive lookbehind (?<=
That asserts that what is behind is an #
Close positive lookbehind )
Match one or more word characters \w+
The central part of your regex ((.*)) is a sequence of any chars.
Instead you should look for a sequence of chars other than white space
(\S+) or word chars (\w+).
Note also that I changed * to +, as you are probably not interested
in any empty sequence of chars.
To capture also a name which has "last" position in the source
string, the last part of your regex should match not only a sequence
of whitespace chars, but also the end of the string, so change
(?=\\s+) to (?=\\s+|$).
And the last remark: Actually you don't need the parentheses around
the "central" part.
So to sum up, the whole regex can be like this:
(?<=#)\w+(?=\s+|$)
(with the global option).
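In base R the "global" part corresponds to gregexpr, and the lookarounds need perl = TRUE. A minimal sketch using the chat line from the question; the second string is made up to show why the |$ alternative matters:

```r
x <- "#Jack #Marie Hi there!"

# All usernames: word characters right after a '#', followed by space or end
users <- regmatches(x, gregexpr("(?<=#)\\w+(?=\\s+|$)", x, perl = TRUE))[[1]]
users

# Made-up example where the tag is the last thing in the string
y <- "hello #Bob"
with_end    <- regmatches(y, gregexpr("(?<=#)\\w+(?=\\s+|$)", y, perl = TRUE))[[1]]
without_end <- regmatches(y, gregexpr("(?<=#)\\w+(?=\\s+)", y, perl = TRUE))[[1]]
```

Here with_end finds the trailing name, while without_end comes back empty because nothing after "Bob" is whitespace.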
Here is a non-regex approach, or rather a minimal-regex approach, since grep still runs the detection of # through the regex engine
x <- '#This is what I want to extract'
grep('#', strsplit(x, ' ')[[1]], value = TRUE)
#[1] "#This"
Or, to avoid strsplit, we can use scan, i.e.
grep('#', scan(textConnection(x), " "), value=TRUE)
#Read 7 items
#[1] "#This"

selective removal of characters following a pattern using R

How to selectively remove characters from a string following a pattern?
I wish to remove the 7 figures and the preceding colon.
For example:
"((Northern_b:0.005926,Tropical_b:0.000000)N19:0.002950"
should become
"((Northern_b,Tropical_b)N19"
x <- "((Northern_b:0.005926,Tropical_b:0.000000)N19:0.002950"
gsub("[:]\\d{1}[.]\\d{6}", "", x)
The gsub function does string replacement and replaces all matches found in the string (see ?gsub). An alternative, if you want something with a friendlier name, is str_replace_all from the stringr package.
The regular expression makes use of \\d{n}, which matches exactly n digits. So \\d{1} matches a single digit and \\d{6} matches a run of six digits.
gsub('[:][0-9.]+','',x)
[1] "((Northern_b,Tropical_b)N19"
Another approach to solve this problem
library(stringr)
str1 <- c("((Northern_b:0.005926,Tropical_b:0.000000)N19:0.002950")
str_replace_all(str1, regex("(:\\d{1,}\\.\\d{1,})"), "")
#[1] "((Northern_b,Tropical_b)N19"
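The answers differ in how strict they are about the number format. A small sketch comparing the fixed-width pattern with the looser character class; the second tree string is made up to show branch lengths that are not exactly 7 figures:

```r
x <- "((Northern_b:0.005926,Tropical_b:0.000000)N19:0.002950"

# Strict: a colon, one digit, a dot, exactly six digits
strict_x <- gsub(":\\d\\.\\d{6}", "", x)

# Loose: a colon followed by any run of digits and dots
loose_x <- gsub(":[0-9.]+", "", x)

# Made-up tree with branch lengths of varying width
y <- "(A:12.5,B:0.25)C:3"
strict_y <- gsub(":\\d\\.\\d{6}", "", y)   # nothing matches, y is unchanged
loose_y  <- gsub(":[0-9.]+", "", y)        # all three lengths are removed
```

Both give the same result on the string from the question; the loose pattern is the safer choice if the precision of the branch lengths can vary.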
