How to substitute a character in multiple locations with R

I'm trying to split a dataframe with "," separators. However, some parts of the strings have the pattern [0-9][,][0-9]{2}, and I'd like to substitute only the comma inside, not the whole pattern, in order to preserve the numerical inputs.
I tried to solve this with stringr, but got stuck on the following kind of result:
library(stringr)
string <- '"name: John","age: 27","height: 1,73", "weight: 78,30"'
str_replace_all(string, "[0-9][,][0-9]{2}", "[0-9][;][0-9]{2}")
[1] "\"name: John\",\"age: 27\",\"height: [0-9][;][0-9]{2}\", \"weight: 7[0-9][;][0-9]{2}\""
I know it can be done with substitution by position, but the string is too big.
I'd appreciate any help. Thanks in advance.

You need to use capturing groups around the parts of the pattern you need to keep and, in the replacement pattern, refer to those submatches with backreferences:
> str_replace_all(string, "([0-9]),([0-9]{2})", "\\1;\\2")
[1] "\"name: John\",\"age: 27\",\"height: 1;73\", \"weight: 78;30\""
Or the same regex can be used with gsub:
> gsub("([0-9]),([0-9]{2})", "\\1;\\2", string)
[1] "\"name: John\",\"age: 27\",\"height: 1;73\", \"weight: 78;30\""
Details:
([0-9]) - capturing group 1, whose value is referred to using \\1 in the replacement pattern, matching a single digit
, - a comma
([0-9]{2}) - capturing group 2, whose value is referred to using \\2 in the replacement pattern, matching 2 digits.
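Since the original goal was to split on commas, here is a minimal base R sketch of the full round trip: protect the decimal commas, split, then restore them. It assumes the data contains no other semicolons and that fields are separated by a comma with optional spaces. The names protected and fields are only illustrative.

```r
string <- '"name: John","age: 27","height: 1,73", "weight: 78,30"'

# Step 1: turn decimal commas (digit, comma, two digits) into semicolons
protected <- gsub("([0-9]),([0-9]{2})", "\\1;\\2", string)

# Step 2: split on the remaining commas (plus any spaces after them)
fields <- strsplit(protected, ", *")[[1]]

# Step 3: restore the decimal commas and drop the surrounding quotes
fields <- gsub(";", ",", fields, fixed = TRUE)
fields <- gsub('"', "", fields, fixed = TRUE)

print(fields)
```

If the data may already contain semicolons, a different placeholder character that never occurs in the input would be needed.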

Related

How to create a regex expression to get a substring between 2 pipes

I have a dataset that I'm trying to work with where I need to get the text between two pipe delimiters. The length of the text is variable so I can't use length to get it. This is the string:
ENST00000000233.10|ENSG00000004059.11|OTTHUMG000
I want to get the text between the first and second pipes, that being ENSG00000004059.11. I've tried several different regex expressions, but I can't really figure out the correct syntax. What should the correct regex expression be?
Here is a regex.
x <- "ENST00000000233.10|ENSG00000004059.11|OTTHUMG000"
sub("^[^\\|]*\\|([^\\|]+)\\|.*$", "\\1", x)
#> [1] "ENSG00000004059.11"
Created on 2022-05-03 by the reprex package (v2.0.1)
Explanation:
^ beginning of string;
[^\\|]* not the pipe character zero or more times;
\\| the pipe character needs to be escaped since it's a meta-character;
^[^\\|]*\\| the 3 above combined mean to match anything but the pipe character at the beginning of the string zero or more times until a pipe character is found;
([^\\|]+) group match anything but the pipe character at least once;
\\|.*$ the second pipe plus anything until the end of the string.
Then replace the 1st (and only) group with itself, "\\1", thus removing everything else.
Another option is to get the second item after splitting the string on |.
x <- "ENST00000000233.10|ENSG00000004059.11|OTTHUMG000"
strsplit(x, "\\|")[[1]][[2]]
# strsplit(x, "[|]")[[1]][[2]]
# [1] "ENSG00000004059.11"
Or with tidyverse:
library(tidyverse)
str_split(x, "\\|") %>% map_chr(`[`, 2)
# [1] "ENSG00000004059.11"
Maybe use the regex for look ahead and look behind to extract strings that are surrounded by two "|".
Read literally, the regex matches one or more characters, as few as possible (.+?), that are preceded by "|" ((?<=\\|)) and followed by "|" ((?=\\|)).
library(stringr)
x <- "ENST00000000233.10|ENSG00000004059.11|OTTHUMG000"
str_extract(x, "(?<=\\|).+?(?=\\|)")
[1] "ENSG00000004059.11"
Try this: \|.*\| or in R \\|.*\\| since you need to escape the escape characters. (It's just escaping the first pipe followed by any character (.) repeated any number of times (*) and followed by another escaped pipe).
Then wrap in str_sub(MyString, 2, -2) to get rid of the pipes if you don't want them.
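One caveat with the \\|.*\\| approach: .* is greedy, so with more than two pipes it runs to the last pipe rather than the second one. The sketch below uses a made-up string x3 with three pipes to show the difference, and a negated character class as one possible way to stop at the next pipe. It uses base R only (regmatches, regexpr, substr).

```r
x3 <- "a|b|c|d"  # hypothetical input with three pipes

# Greedy: runs from the first pipe to the LAST pipe
greedy <- regmatches(x3, regexpr("\\|.*\\|", x3))

# Bounded: [^|]* cannot cross a pipe, so the match stops at the second pipe
bounded <- regmatches(x3, regexpr("\\|[^|]*\\|", x3))

# Drop the leading and trailing pipe from the bounded match
inner <- substr(bounded, 2, nchar(bounded) - 1)

print(greedy)
print(bounded)
print(inner)
```

With the two-pipe string from the question both patterns return the same match, so the difference only matters for longer records.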

How to extract number within but excluding brackets with str_extract() from package stringr?

There are plenty of regex questions out there but I cannot solve the following in an elegant way.
I have the following vector and would like to extract only the numbers within the square brackets, that is, excluding the brackets themselves. The numbers may be negative. The question might also be:
How to extract only the first capturing group with the function str_extract from the {stringr} package?
string <- c("[1] cate 1", "[-1] cate -1", "[2] cate 2")
stringr::str_extract(string = string, pattern = "\\[[^:digit:]+\\]")
[1] "[1]" "[-1]" "[2]"
stringr::str_extract(string = string, pattern = "\\[[^(:digit:)]+\\]")
[1] "[1]" "[-1]" "[2]"
I also tried to append \\1 to the pattern in order to extract the first group and got the following error:
stringr::str_extract(string = string, pattern = "\\[[^(?:digit:)]+\\]\\1")
Error in stri_extract_first_regex(string, pattern, opts_regex = opts(pattern)) :
Back-reference to a non-existent capture group. (U_REGEX_INVALID_BACK_REF)
I appreciate your time and apologize if this question is a duplicate.
You can use
stringr::str_extract(string, "(?<=\\[)-?\\d+(?=\\])")
If you need to match integer or float numbers, you can use
stringr::str_extract(string, "(?<=\\[)-?\\d*\\.?\\d+(?=\\])")
Details:
(?<=\[) - a positive lookbehind that matches a location immediately preceded with [
-? - an optional - char
\d+ - one or more digits
\d*\.?\d+ - matches zero or more digits, an optional . and then one or more digits
(?=\]) - a positive lookahead that matches a location immediately followed with ].
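For anyone without stringr, the same lookaround pattern should also work in base R with perl = TRUE, and a capturing group with sub is another way to get "only the first group". This is a sketch that assumes every element contains a bracketed integer at the start; regmatches silently drops elements with no match.

```r
string <- c("[1] cate 1", "[-1] cate -1", "[2] cate 2")

# Lookarounds need the PCRE engine, hence perl = TRUE
via_lookaround <- regmatches(
  string,
  regexpr("(?<=\\[)-?\\d+(?=\\])", string, perl = TRUE)
)

# Alternative: capture the number and replace the whole string with group 1
via_group <- sub("^\\[(-?[0-9]+)\\].*$", "\\1", string)

# Convert to integers once the brackets are gone
numbers <- as.integer(via_lookaround)

print(via_lookaround)
print(via_group)
print(numbers)
```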

Using regular expressions (regex) to replace multiple patterns at the same time in R

I have a vector of strings and I want to remove -es from all strings (words) ending in either -ses or -ces at the same time. The reason I want to do it at the same time and not consecutively is that sometimes, after removing one ending, the other ending appears, and I don't want to apply this pattern to a single word twice.
I have no idea how to use two patterns at the same time, but this is the best I could do:
text <- gsub("[sc]+s$", "[sc]", text)
I know the replacement is not correct, but I wonder how can I show that I want to replace it with the letter I just detected (c or s in this case). Thank you in advance.
To remove es at the end of words, that is preceded with s or c, you may use
gsub("([sc])es\\b", "\\1", text)
gsub("(?<=[sc])es\\b", "", text, perl=TRUE)
To remove them at the end of strings, you can go on using your $ anchor:
gsub("([sc])es$", "\\1", text)
gsub("(?<=[sc])es$", "", text, perl=TRUE)
The first gsub TRE pattern is ([sc])es\b: a capturing group #1 that matches either s or c, and then es is matched, and then \b makes sure the next char is not a letter, digit or _. The \1 in the replacement is the backreference to the value stored in the capturing group #1 memory buffer.
In the second example with the PCRE regex (due to perl=TRUE), (?<=[sc]) positive lookbehind is used instead of the ([sc]) capturing group. Lookbehinds are not consuming text, the text they match does not land in the match value, and thus, there is no need to restore it anyhow. The replacement is an empty string.
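To illustrate the "only once per word" concern from the question, here is a small sketch. The word list is made up ("ceses" is not a real word, it is only there because a second pass would strip it again). It shows that a single call handles both endings in one pass and that the capture-group and lookbehind variants agree.

```r
words <- c("processes", "faces", "ceses")  # "ceses" is an artificial test case

# One pass with the capturing group: the s or c is put back via \\1
one_pass <- gsub("([sc])es$", "\\1", words)

# Same result with a lookbehind, which never consumes the s or c
with_lookbehind <- gsub("(?<=[sc])es$", "", words, perl = TRUE)

# Running the pattern a second time is what the question wants to avoid:
# "ces" would be stripped again
two_pass <- gsub("([sc])es$", "\\1", one_pass)

print(one_pass)
print(with_lookbehind)
print(two_pass)
```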
Strings ending with "ces" and "ses" follow the same pattern: c or s followed by es at the end of the string, i.e. "[cs]es$".
If I understand correctly, then you don't need two patterns.
Example:
x = c("ces", "ses", "mes")
gsub(pattern = "([cs])es$", replacement = "\\1", x)
[1] "c" "s" "mes"
Hope it helps.
M

Extract a certain pattern string from the text by R

I have a column of texts that look like the one below:
str1 = "ABCID 123456789 is what I'm looking for, could you help me to check this Item's status?"
I want to use the gsub function in R to extract "ABCID 123456789" from it. The number may differ between rows, but ABCID is constant. Does anyone know a solution for this? Thanks very much!
We can use str_extract to select the fixed word followed by space and one or more numbers (\\d+)
library(stringr)
str_extract(df1$col1, "ABCID \\d+")
If there are multiple instances, use str_extract_all
str_extract_all(df1$col1, "ABCID \\d+")
NOTE: The OP wants to extract "ABCID 123456789" from the text.
If the number has a constant length (9), you could use a positive lookbehind:
sub("(?<=ABCID \\d{9}).*", "", str1, perl = TRUE)
# [1] "ABCID 123456789"
Match the beginning of the string (^), the leading letters (ABCID), a space, digits (\\d+) and everything else (.*), and replace it all with the captured portion, i.e. the portion within parentheses. Note that we want to use sub, not gsub, here because there is only one substitution.
sub("^(ABCID \\d+).*", "\\1", str1)
## [1] "ABCID 123456789"
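One thing worth checking with the sub approaches is what happens to rows that contain no ABCID at all. The sketch below adds a made-up second row without an identifier: sub returns such a row unchanged, while regmatches (a base R way of extracting rather than replacing) drops it. The vector name texts is only illustrative.

```r
texts <- c(
  "ABCID 123456789 is what I'm looking for, could you help me to check this Item's status?",
  "no identifier in this row"  # hypothetical row without a match
)

# sub(): a row without a match comes back unchanged
with_sub <- sub("^(ABCID [0-9]+).*", "\\1", texts)

# regmatches(): only the matches are returned, so the result is shorter
with_regmatches <- regmatches(texts, regexpr("ABCID [0-9]+", texts))

print(with_sub)
print(with_regmatches)
```

Which behaviour is preferable depends on whether unmatched rows should be kept, flagged or dropped.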

How to extract first occurrence of alphabets in a string in R?

I have a character column having values like "CHELSEAFC17FEB640CE", "BARCAFC17FEB1400CE". I want to extract characters "CHELSEAFC", "BARCAFC" and so on. Currently I am using
regmatches(x$symbol,regexpr("[A-z]+",x$symbol))
but getting an error:
Error in `$<-.data.frame`(`*tmp*`, "cg", value = c("CHELSEAFC", "CHELSEAFC", "TOTTENHAMFC", :
  replacement has 11366767 rows, data has 11366772
Calls: $<- -> $<-.data.frame
Execution halted
I can't seem to find the problem row. Please somebody help with debugging or suggest a better way to do this :)
Assuming that we need to extract the non-numeric part, one option is to remove the other characters by matching one or more numbers ([0-9]+) followed by other characters (.*) and replace it with ""
sub("[0-9]+.*", "", str1)
#[1] "CHELSEAFC" "BARCAFC"
Or capture the upper case letters as a group (([A-Z]+)) from the start (^) of the string and replace it with the backreference (\\1) for that group
sub("^([A-Z]+).*", "\\1", str1)
#[1] "CHELSEAFC" "BARCAFC"
data
str1 <- c( "CHELSEAFC17FEB640CE", "BARCAFC17FEB1400CE")
Instead of [A-z]+ you should use ^[A-Za-z]+. See this answer to understand why [A-z] should be avoided: https://stackoverflow.com/a/29771926/4082217
The error appears because you have some values in the input vector that do not contain letters (and some symbols that [A-z] matches). That makes regmatches return no value in case there is no match, and thus, assigning the column values becomes impossible as the number of matches does not coincide with the number of rows in the data frame.
What you may do is:
1) Use sub
> v <- c("------", "CHELSEAFC17FEB640CE", "BARCAFC17FEB1400CE")
> sub("^([a-zA-Z]+).*|.*", "\\1", v)
[1] "" "CHELSEAFC" "BARCAFC"
Applied to the data frame column from the question:
x$symbol <- sub("^([a-zA-Z]+).*|.*", "\\1", x$symbol)
The ^([a-zA-Z]+).*|.* pattern will match and capture one or more ASCII letters (replace [a-zA-Z]+ with [[:alpha:]]+ to match letters other than ASCII, too) at the start of the string (^), and .* will match the rest of the string, OR (|) the whole string will be matched by the second branch. Either way, the match is replaced with the contents of the capturing group, so the result is either the leading letters or an empty string.
2) If you want to keep NA for the values with no match, use stringr str_extract:
library(stringr)
> x$symbol <- str_extract(x$symbol, "^[A-Za-z]+")
## => 1 <NA>
## 2 CHELSEAFC
## 3 BARCAFC
Note that ^[A-Za-z]+ matches 1+ ASCII letters ([A-Za-z]+) at the start of the string only (^).
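The row-count mismatch described above can be reproduced on a small vector, and the NA-preserving behaviour can also be had in base R. This is a sketch, not the only way to do it: it pre-fills a result vector with NA and writes the matches only into the positions where regexpr found something. The names hit, found and symbol are illustrative.

```r
x <- c("------", "CHELSEAFC17FEB640CE", "BARCAFC17FEB1400CE")

# regexpr() returns -1 for elements with no match
hit <- regexpr("^[A-Za-z]+", x)
found <- hit > 0

# regmatches() drops the non-matching element, so this is shorter than x
matched <- regmatches(x, hit)

# Pre-fill with NA and assign matches only where one was found
symbol <- rep(NA_character_, length(x))
symbol[found] <- matched

print(length(matched))
print(symbol)
```

Because symbol always has the same length as x, it can be assigned back to a data frame column without the "replacement has ... rows" error.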
