R grep pattern regex with brackets

I have a problem with grep in R:
patterns= c("AB_(1)","AB_(2)")
text= c("AB_(1)","DDD","CC")
grep(patterns[1],text)
>integer(0) ????
the grep command has problem with "()" brackets, is there any as.XX(patterns[1]) that I can use??

You need to escape the parentheses with a double backslash:
> patterns= c("AB_\\(1\\)","AB_\\(2\\)")
> text= c("AB_(1)","DDD","CC")
>
> grep(patterns[1],text)
[1] 1

If there are no special pattern matching characters in the regular expression (as is the case in the example shown in the question) then use fixed=TRUE:
grep(patterns[1], text, fixed = TRUE)
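When the literal text has to be combined with real regex syntax (anchors, for instance), fixed = TRUE is no longer an option, and the metacharacters have to be escaped programmatically. The following is a minimal sketch, assuming a hypothetical helper escape_regex() (it is not a base R function) that puts a backslash in front of each metacharacter:

```r
# Hypothetical helper, not part of base R: put a backslash in front of every
# regex metacharacter so that the string is matched literally.
escape_regex <- function(x) {
  gsub("([.|()\\[\\]{}^$*+?\\\\])", "\\\\\\1", x, perl = TRUE)
}

patterns <- c("AB_(1)", "AB_(2)")
text <- c("AB_(1)", "DDD", "CC")

escaped <- escape_regex(patterns[1])                    # the regex AB_\(1\)
hit_escaped <- grep(escaped, text)                      # regex match on escaped text
hit_fixed <- grep(patterns[1], text, fixed = TRUE)      # literal match, no regex
hit_anchored <- grep(paste0("^", escaped, "$"), text)   # escaped text plus real anchors
```

All three calls should point at the first element of text; the anchored variant is the case that fixed = TRUE cannot express.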

Related

R get rid of string before/after special characters (pipe and <>) using regex

I'm trying to get rid of characters before or after special characters in a string.
My example string looks like this:
test <- c(">P01923|description", ">P19405orf|description2")
I'm trying to get the part between the > key and the | key, so that I'd be left with c("P01923", "P19405orf") only. I was trying to do this by using gsub twice, first to get rid of everything behind | and then to get rid of >.
I first tried this: gsub("|.*", "", test) but this seems to remove all the characters (not sure why?). I used the regex101.com website to check my regex and learned that | is a special character and that I need to use \| instead, and this worked in the regex101.com website, so I tried gsub("\|.*", "", test), but this gave me an error saying '\|' is an unrecognized escape in character string starting ""\|". I'm having the same problem with >.
How can I get R to recognize special characters like | and > using regex?
If you use "..." to specify a character constant, you also need to escape the \, which leads to \\. Alternatively, you can use r"(...)" (available since R 4.0.0) to specify a raw character constant, in which a single \ is enough.
gsub(".*>|\\|.*", "", test)
[1] "P01923" "P19405orf"
gsub(r"(.*>|\|.*)", "", test)
[1] "P01923" "P19405orf"
Here .*> removes everything up to and including >, \|.* removes | and everything after it, and the | in between is an alternation (or).
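To see that the two spellings describe the same regex, the strings can be compared directly. This is a small sketch; the variable names are made up for illustration, and the raw string syntax assumes R 4.0.0 or later:

```r
# Raw string constants need R >= 4.0.0.
test <- c(">P01923|description", ">P19405orf|description2")

escaped_pattern <- ".*>|\\|.*"   # ordinary string: the backslash is doubled
raw_pattern <- r"(.*>|\|.*)"     # raw string: the backslash is written once

# Both constants hold exactly the same characters.
same_pattern <- identical(escaped_pattern, raw_pattern)

# writeLines() prints the string as the regex engine receives it,
# with a single backslash.
writeLines(escaped_pattern)

ids <- gsub(raw_pattern, "", test)
```

The doubled backslash exists only in the source code; the regex engine receives a single \ in both cases.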
Alternatively regexpr and regmatches could be used like:
regmatches(test, regexpr("(?<=>)[^|]*", test, perl=TRUE))
#[1] "P01923" "P19405orf"
Here (?<=>) is a lookbehind for > and [^|]* matches any run of characters other than |.
You can extract text between > and |. Special characters can be escaped with \\.
sub('>(.*)\\|.*', '\\1', test)
#[1] "P01923" "P19405orf"
Here is a regex split option. We can split the input string on [>|], which will leave the desired substring in the second position of the output vector.
test <- c(">P01923|description", ">P19405orf|description2")
unlist(lapply(strsplit(test, "[>|]"), function(x) x[2]))
[1] "P01923" "P19405orf"
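The reason the wanted piece ends up in the second position can be checked by looking at the intermediate result of strsplit(). A short sketch (vapply() is swapped in here as a type-checked alternative to unlist(lapply(...)); it is not part of the answer above):

```r
test <- c(">P01923|description", ">P19405orf|description2")

# A match at the very start of a string leaves an empty first element,
# so the text between ">" and "|" sits in position 2.
parts <- strsplit(test, "[>|]")
first_parts <- parts[[1]]

# vapply() fails loudly if an element is not a single character string.
ids <- vapply(parts, function(x) x[2], character(1))
```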
library(stringr)
test <- c(">P01923|description", ">P19405orf|description2")
#if '>' is always the first character
str_sub(test, 2, -1) %>%
str_replace('\\|.*$', '')
#> [1] "P01923" "P19405orf"
#if not
str_replace(test, '\\>', '') %>%
str_replace('\\|.*$', '')
#> [1] "P01923" "P19405orf"
#alternative way
str_match(test, '\\>(.*)\\|')[, 2]
#> [1] "P01923" "P19405orf"
Created on 2021-06-30 by the reprex package (v2.0.0)

Erase comma and apostrophe in character R

I want to remove the comma and the apostrophe, but keep the decimal point, in the following character string, and then convert the result to numeric.
I have this:
characterExample <- "234'564,900.99"
I want 234564900.99
I tried the following, but it does not work:
result <- gsub("[:punct:].","", characterExample)
Another option is to explicitly remove the characters you want to remove:
gsub("[',]", "", characterExample)
#[1] "234564900.99"
Another option is to remove everything that is not a digit or a . by using ^ within the square brackets
gsub("[^0-9.]+","", characterExample)
#[1] "234564900.99"
Or another option is to make use of SKIP/FAIL for the ., while matching the rest of the punct
gsub("(\\.)(*SKIP)(*F)|[[:punct:]]+", "", characterExample, perl = TRUE)
#[1] "234564900.99"
NOTE: Both solutions leave the . untouched and replace every other punctuation character with an empty string ("")
The pipe symbol (alternation) can also be used:
gsub(",|'","", characterExample)
#[1] "234564900.99"
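None of the calls above performs the final step asked for in the question, the conversion to numeric. A minimal sketch of that step, using the explicit character class from the first answer:

```r
characterExample <- "234'564,900.99"

# Drop apostrophes and commas only; the decimal point stays.
cleaned <- gsub("[',]", "", characterExample)
value <- as.numeric(cleaned)

# print() shows 7 significant digits by default, so ask for more
# to see the decimals.
print(value, digits = 12)
```

If the cleaned string still contained anything that is not part of a number, as.numeric() would return NA with a warning, which makes a leftover character easy to notice.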

Extract all text after last occurrence of a special character

I have the string in R
BLCU142-09|Apodemia_mejicanus
and I would like to get the result
Apodemia_mejicanus
Using the stringr R package, I have tried
str_replace_all("BLCU142-09|Apodemia_mejicanus", "[[A-Z0-9|-]]", "")
# [1] "podemia_mejicanus"
which is almost what I need, except that the A is missing.
You can use
sub(".*\\|", "", x)
This will remove all text up to and including the last pipe char. Details:
.* - any zero or more chars as many as possible
\| - a | char (| is a special regex metacharacter that is an alternation operator, so it must be escaped, and since string literals in R can contain string escape sequences, the | is escaped with a double backslash).
For example:
x <- c("BLCU142-09|Apodemia_mejicanus", "a|b|c|BLCU142-09|Apodemia_mejicanus")
sub(".*\\|", "", x)
## => [1] "Apodemia_mejicanus" "Apodemia_mejicanus"
We can match one or more characters that are not a | ([^|]+) from the start (^) of the string followed by | in str_remove to remove that substring
library(stringr)
str_remove(str1, "^[^|]+\\|")
#[1] "Apodemia_mejicanus"
If we also put [A-Z] in the character class, it matches every upper-case letter, including the A of Apodemia, and replaces it with blank (""), as in the OP's str_replace_all
data
str1 <- "BLCU142-09|Apodemia_mejicanus"
You can always choose to _extract rather than _remove:
s <- "BLCU142-09|Apodemia_mejicanus"
stringr::str_extract(s,"[[:alpha:]_]+$")
## [1] "Apodemia_mejicanus"
Depending on how permissive you want to be, you could also use [[:alpha:]]+_[[:alpha:]]+ as your target.
I would keep it simple:
substring(my_string, regexpr("|", my_string, fixed = TRUE) + 1L)
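One caveat about the substring() approach: regexpr() reports only the first |, so it agrees with the other answers only when the string contains a single pipe. The sketch below contrasts the two cases on the multi-pipe string used earlier; the gregexpr() variant is an illustration added here, and it assumes the string contains at least one |:

```r
x <- "a|b|c|BLCU142-09|Apodemia_mejicanus"

# regexpr() reports only the first match, so this cuts after the first "|".
after_first <- substring(x, regexpr("|", x, fixed = TRUE) + 1L)

# gregexpr() reports every match; the largest position is the last "|".
# Assumes at least one "|" is present (otherwise the position is -1).
pipe_pos <- gregexpr("|", x, fixed = TRUE)[[1]]
after_last <- substring(x, max(pipe_pos) + 1L)
```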

R Regex: removing only the immediate following character after >

I have the following string in R:
string1 = "A((..A>B)A"
I would like to remove all punctuation, and the letter immediately after >, i.e. >B
Here is the output I desire:
output = "AAA"
I tried using gsub() as follows:
output = gsub("[[:punct:]]","", string1)
But this gives AABA, which keeps the immediately following character.
This works by combining your pattern with a lookbehind that matches the character immediately after >:
gsub('(?<=>).|[[:punct:]]', '', "A((..A>B)A", perl=TRUE)
## [1] "AAA"
A slightly less complex regex without the use of perl seems to work for this example as well:
gsub("[[:punct:]]|>(.)", "", "A((..A>B)A")
[1] "AAA"
You say
remove all punctuation, and the letter immediately after >
Punctuation is matched with [[:punct:]] and a letter can be matched with [[:alpha:]], thus, you may use a TRE regex with gsub:
string1 = "A((..A>B)A"
gsub(">[[:alpha:]]|[[:punct:]]", "", string1)
# => [1] "AAA"
Note that > is also a char matched with [[:punct:]], thus, you do not need any lookarounds here, just remove it with a letter after it.
Pattern details:
>[[:alpha:]] - a > and any letter
| - or
[[:punct:]] - a punctuation or symbol.
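The order of the two alternatives matters as soon as the pattern is run with perl = TRUE, because PCRE tries alternatives from left to right and stops at the first one that matches. The sketch below adds perl = TRUE purely for illustration (the answer above uses the default TRE engine):

```r
string1 <- "A((..A>B)A"

# Specific alternative first: ">" and the following letter are consumed together.
specific_first <- gsub(">[[:alpha:]]|[[:punct:]]", "", string1, perl = TRUE)

# General alternative first: PCRE tries [[:punct:]] first, removes ">" on its
# own, and the "B" that followed it survives.
general_first <- gsub("[[:punct:]]|>.", "", string1, perl = TRUE)
```

Putting the more specific alternative first keeps the pattern independent of the engine.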

R: How to extract specific digits from a string?

I want to retrieve the first number (here: 344002) from a string:
string <- '<a href="/Archiv-Suche/!344002&s=&SuchRahmen=Print/" ratiourl-ressource="344002"'
Preferably, I am looking for a regular expression that finds the number after the ! and before the &amp.
All I came up with is this but this catches the ! as well (!344002):
regmatches(string, gregexpr("\\!([[:digit:]]+)", string, perl =TRUE))
Any ideas?
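Use this regex:
(?<=\!)\d+(?=&amp)
Use this code (inside an R string literal, each backslash must be doubled):
regmatches(string, gregexpr("(?<=\\!)\\d+(?=&amp)", string, perl=TRUE))
(?<=\!) is a lookbehind, so the match starts right after the !
\d+ matches one or more digits
(?=&amp) is a lookahead, so the match ends where &amp follows; if the string contains a bare & rather than &amp, use (?=&) instead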
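Since only the first number is wanted, regexpr() (first match only) can stand in for gregexpr(). A minimal sketch, assuming the string exactly as printed in the question, where the character after the digits is a bare &:

```r
string <- '<a href="/Archiv-Suche/!344002&s=&SuchRahmen=Print/" ratiourl-ressource="344002"'

# Lookbehind for "!" and lookahead for "&"; "\\d" is the R spelling of the regex \d.
# The lookahead uses a bare "&" because that is what this copy of the string contains.
first_match <- regexpr("(?<=!)\\d+(?=&)", string, perl = TRUE)
digits <- regmatches(string, first_match)

# The match is still a character string; convert it if a number is needed.
number <- as.numeric(digits)
```

The second 344002 in the string is not picked up because it is preceded by a quote rather than by !.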
library(gsubfn)
strapplyc(string, "!(\\d+)")[[1]]
[Old answer]
Test this code.
library(stringr)
str_extract(string, "[0-9]+")
A similar question and answer is available here: Extract a regular expression match in R version 2.10
You may capture the digits (\d+) in between ! and &amp and get it with regexec/regmatches:
> string <- '<a href="/Archiv-Suche/!344002&s=&SuchRahmen=Print/" ratiourl-ressource="344002"'
> pattern = "!(\\d+)&"
> res <- unlist(regmatches(string,regexec(pattern,string)))
> res[2]
[1] "344002"

Resources