Regex: extracting matches preceding a pattern in R [duplicate] - r

This question already has answers here:
Remove part of string after "."
(6 answers)
Extract string before "|" [duplicate]
(3 answers)
Closed 1 year ago.
I'm trying to extract matches preceding a pattern in R. Lets say that I have a vector consisting of the next elements:
my_vector
> [1] "ABCC12|94160" "ABCC13|150000" "ABCC1|4363" "ACTA1|58"
[5] "ADNP2|22850" "ADNP|23394" "ARID1B|57492" "ARID2|196528"
I'm looking for a regular expression to extract all characters preceding the "|". The expected result must be something like this:
my_new_vector
> [1] "ABCC12" "ABCC13" "ABCC1" "ACTA1"
and so on.
I have already tried using stringr functions and regular expressions based on look arounds, but I failed.
I really appreciate your advices and help to solve my issue.
Thanks in advance!

We could use trimws and specify the whitespace as a regex that matches the | (metacharacter - so escape \\ followed by one or more character (.*)
trimws(my_vector, whitespace = "\\|.*")

Related

Info about the semantics in str_extract in R? (With an example) [duplicate]

This question already has answers here:
Extract number between underscore in text
(3 answers)
Closed 1 year ago.
I want to understand how to use the semantics in str_extract in the stringr package in R.
I have strings that are written like this and 11_3_S11.html"
and I would like to extract from them the value after the first underscore.
I mean, I want to remove the number 3.
files = c("11_3_S11.html")
I would appreciate it if someone can explain the logic or send me a link with all the semantics.
Thank you for your time
In base R, you can use sub to extract a number after 1st underscore.
sub('\\d+_(\\d+)_.*', '\\1', files)
#[1] "3"
where \\d+ refers to 1 or more number.
() is referred as capture group to capture the value that we are interested in.
You can use the same regex in str_match if you want to use stringr.
stringr::str_match(files, '\\d+_(\\d+)_.*')[, 2]
[1] "3"
Using look around.
str_extract("11_3_S11.html", '(?<=_)\\d(?=_)')
[1] "3"

Regex capture 1 character [duplicate]

This question already has answers here:
Complete word matching using grepl in R
(3 answers)
Closed 4 years ago.
Whenever english character of length 1 exists, I want that to be combined with the previous text.
gsub('(.*)\\s+([a-zA-Z]{1})', "\\1\\2", 'Anti-Candida a ингибинов')
Anti-Candidaa ингибинов
For the example below, it should return 'Anti-Candida am ингибинов' as 'am' is of length 2.
gsub('(.*)\\s+([a-zA-Z]{1})', "\\1\\2", 'Anti-Candida am ингибинов')
You can use this regex:
\W+([a-zA-Z])\b
replace with \\1. The trick here is to match a word boundary after the single letter.
Demo
Your regex will work as well, if you just add that \b at the end.

Difference between [A-Z] and LETTERS in grep [duplicate]

This question already has answers here:
Matching multiple patterns
(6 answers)
Closed 5 years ago.
I am trying to only keep rows whose id contains letters. And I find the following two ways give different results.
df[grep("[A-Z]",df$id),]
df[grep(LETTERS,df$id),]
It seems the second way will omit many rows that actually have letters.
Why?
If you want to grep patterns in a vector try this:
to_match <- paste(LETTERS, collapse = "|")
to_match
[1] "A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z"
and then
df[grep(to_match, df$id), ]
Explanation:
You will match any of the characters in "to_match" since they are separated by the "or" operator "|".

Remove characters in string before specific symbol(including it) [duplicate]

This question already has answers here:
My regex is matching too much. How do I make it stop? [duplicate]
(5 answers)
Use gsub remove all string before first white space in R
(4 answers)
Closed 5 years ago.
at the beginning, yes - simillar questions are present here, however the solution doesn't work as it should - at least for me.
I'd like to remove all characters, letters and numbers with any combination before first semicolon, and also remove it too.
So we have some strings:
x <- "1;ABC;GEF2"
y <- "X;EER;3DR"
Let's do so gsub() with . and * which means any symbol with occurance 0 or more:
gsub(".*;", "", x)
gsub(".*;", "", y)
And as a result i get:
[1] "GEF2"
[1] "3DR"
But I'd like to have:
[1] "ABC;GEF2"
[1] "EER;3DR"
Why did it 'catch' second occurence of semicolon instead of first?
You could use
gsub("[^;]*;(.*)", "\\1", x)
# [1] "ABC;GEF2"

what regular expression [[:space:][:digit:]]+ stands for in r [duplicate]

This question already has answers here:
What is the difference between square brackets and parentheses in a regex?
(3 answers)
How to use double brackets in a regular expression?
(2 answers)
Closed 5 years ago.
I am trying to find out what is this regular expression [[:space:][:digit:]]+ stands for.
I learn from Wikipedia that [:space:] means Whitespace characters and [:digit:] means Digits from 0 to 9.
So I think [[:space:][:digit:]]+ matches any Whitespace characters followed by a digit like ' 1' or ' 9'.
But, when I try this in r:
> txt <- c("arm","foot","lefroo", "laura ")
> i <- grep("[[:space:][:digit:]]+", txt)
> txt[i]
[1] "laura "
there is no digit in "laura ", but it sill matched.
this really confused me, any one can explain this?

Resources