Regex capture 1 character [duplicate] - r

This question already has answers here:
Complete word matching using grepl in R
(3 answers)
Closed 4 years ago.
Whenever a single English letter appears as a standalone word, I want it to be joined to the preceding text.
gsub('(.*)\\s+([a-zA-Z]{1})', "\\1\\2", 'Anti-Candida a ингибинов')
Anti-Candidaa ингибинов
The example below should be returned unchanged, as 'Anti-Candida am ингибинов', because 'am' has length 2.
gsub('(.*)\\s+([a-zA-Z]{1})', "\\1\\2", 'Anti-Candida am ингибинов')

You can use this regex:
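\W+([a-zA-Z])\b
and replace each match with \\1. The trick is the word boundary \b after the single letter: it stops the pattern from matching the first letter of a longer word. In an R string literal the backslashes of the pattern must be doubled, i.e. "\\W+([a-zA-Z])\\b".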
Your regex will work as well, if you just add that \b at the end.
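In R, the suggestion above might be wrapped up as follows. This is only a minimal sketch: the helper name join_single_letter is made up for illustration, and the ASCII test strings stand in for the Cyrillic example from the question.

```r
# Hypothetical helper (the name is not from the original post): glue a
# standalone single ASCII letter onto the text that precedes it.
join_single_letter <- function(x) {
  # \\W+        the separator(s) in front of the letter
  # ([a-zA-Z])  exactly one letter, captured as group 1
  # \\b         a word boundary, so the letter cannot start a longer word
  gsub("\\W+([a-zA-Z])\\b", "\\1", x)
}

join_single_letter("Anti-Candida a test")   # the single letter is joined
join_single_letter("Anti-Candida am test")  # the two-letter word is left alone
```

Note that \W+ removes any run of non-word characters, not only spaces; \\s+ could be used instead if only whitespace should be dropped.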

Related

Regex: extracting matches preceding a pattern in R [duplicate]

This question already has answers here:
Remove part of string after "."
(6 answers)
Extract string before "|" [duplicate]
(3 answers)
Closed 1 year ago.
I'm trying to extract matches preceding a pattern in R. Let's say that I have a vector consisting of the following elements:
my_vector
> [1] "ABCC12|94160" "ABCC13|150000" "ABCC1|4363" "ACTA1|58"
[5] "ADNP2|22850" "ADNP|23394" "ARID1B|57492" "ARID2|196528"
I'm looking for a regular expression to extract all characters preceding the "|". The expected result must be something like this:
my_new_vector
> [1] "ABCC12" "ABCC13" "ABCC1" "ACTA1"
and so on.
I have already tried using stringr functions and regular expressions based on lookarounds, but I failed.
I would really appreciate your advice and help in solving my issue.
Thanks in advance!
We could use trimws and pass, as the whitespace argument, a regex that matches the | (a metacharacter, so it is escaped with \\) followed by zero or more characters (.*). The whitespace argument is available from R 3.6.0 onwards.
trimws(my_vector, whitespace = "\\|.*")
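For comparison, the same result can probably be obtained with plain sub or strsplit. The sketch below swaps in those functions instead of trimws, and uses only the first four elements of the vector shown in the question.

```r
# First four elements of the vector from the question.
my_vector <- c("ABCC12|94160", "ABCC13|150000", "ABCC1|4363", "ACTA1|58")

# Drop the literal "|" and everything after it.
before_pipe <- sub("\\|.*", "", my_vector)
before_pipe

# Alternative without a regex: split on the fixed string "|" and keep
# the first piece of each element.
first_piece <- vapply(strsplit(my_vector, "|", fixed = TRUE), `[`, character(1), 1)
first_piece
```

With fixed = TRUE the | needs no escaping, which avoids the metacharacter issue altogether.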

Match string between ; and % [duplicate]

This question already has answers here:
Extracting a string between other two strings in R
(4 answers)
Closed 2 years ago.
I wish to extract the decimal value in the string without the % sign. So in this case, I want the numeric 0.45
x <- "document.write(GIC_annual[\"12-17 MTH\"][\"99999.99\"]);0.450%"
str_extract(x, "^;[0-9.]")
My attempt fails. Here's my thinking.
Begin the extraction at the semicolon ^;
Grab any numbers between 0 and 9.
Include the decimal point
You also have this option:
stringr::str_extract(x, "\\d\\.\\d{1,}(?=%)")
[1] "0.450"
In other words, the lookahead checks whether a % follows; if it does, the digits before it are returned.
Details
\\d a digit;
\\. a literal dot;
\\d{1,} one or more digits after the dot;
(?=%) a lookahead: the match succeeds only if a % follows, and the % itself is not part of the returned text.
Since you don't want the semicolon in the output, put it in a lookbehind:
stringr::str_extract(x, "(?<=;)[0-9]\\.[0-9]+")
#[1] "0.450"
In base R using sub :
sub('.*;([0-9]\\.[0-9]+).*', '\\1', x)
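The original attempt fails because ^ anchors the match to the start of the string (it does not mean "begin at the semicolon"), and [0-9.] without a quantifier matches a single character. Since the question asks for a numeric value, a base R sketch that also converts the result might look like this; the variable names value and matched are only illustrative.

```r
x <- "document.write(GIC_annual[\"12-17 MTH\"][\"99999.99\"]);0.450%"

# Capture the number between the ";" and the "%", then convert it.
value <- as.numeric(sub(".*;([0-9]+\\.[0-9]+)%.*", "\\1", x))
value

# The same idea with lookarounds in base R (perl = TRUE enables them):
# the ";" and the "%" are required but not returned.
matched <- regmatches(x, regexpr("(?<=;)[0-9.]+(?=%)", x, perl = TRUE))
matched
```

The first form returns a number, the second a character string that still needs as.numeric().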

Remove characters which repeat more than twice in a string [duplicate]

This question already has answers here:
remove repeated character between words
(4 answers)
Closed 3 years ago.
I have this text:
F <- "hhhappy birthhhhhhdayyy"
and I want to remove the repeated characters. I tried this code:
https://stackoverflow.com/a/11165145/10718214
It works, but I need to collapse a character only when it repeats more than twice; if it appears exactly twice, it should be kept.
So the output I expect is
"happy birthday"
Any help?
Try using gsub with the pattern (.)\\1{2,}:
F <- ("hhhappy birthhhhhhdayyy")
gsub("(.)\\1{2,}", "\\1", F)
[1] "happy birthday"
Explanation of regex:
(.) match and capture any single character
\\1{2,} then match that same character two or more additional times (three or more in a row in total)
We replace each such run with just the single captured character. In the replacement, \\1 refers to the first capture group.
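To see that doubled characters survive while longer runs are collapsed, the pattern can be tried on a second input. This is a small sketch: the name collapse_runs and the extra test string are made up for illustration, and the variable is called txt rather than F because F is also R's shorthand for FALSE.

```r
# Hypothetical helper: collapse runs of three or more identical
# characters down to a single character.
collapse_runs <- function(x) gsub("(.)\\1{2,}", "\\1", x)

txt <- "hhhappy birthhhhhhdayyy"
collapsed <- collapse_runs(txt)
collapsed

# "aa" and "cc" have length 2 and are kept; "bbb" has length 3 and is collapsed.
kept_doubles <- collapse_runs("aabbbcc")
kept_doubles
```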

Difference between [A-Z] and LETTERS in grep [duplicate]

This question already has answers here:
Matching multiple patterns
(6 answers)
Closed 5 years ago.
I am trying to keep only the rows whose id contains letters, and I find that the following two approaches give different results.
df[grep("[A-Z]",df$id),]
df[grep(LETTERS,df$id),]
It seems the second way will omit many rows that actually have letters.
Why?
grep() expects a single pattern, so to match any of several patterns held in a vector, collapse them into one first:
to_match <- paste(LETTERS, collapse = "|")
to_match
[1] "A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z"
and then
df[grep(to_match, df$id), ]
Explanation:
You will match any of the characters in "to_match" since they are separated by the "or" operator "|".
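As for the "why": when grep() is handed a pattern vector with more than one element, it uses only the first element (and warns about it), so grep(LETTERS, df$id) effectively searches for "A" alone. The sketch below illustrates this on a small made-up data frame; the id values are assumptions, not the asker's data.

```r
# Made-up example data (not from the original question).
df <- data.frame(id = c("A1", "B2", "33", "Z9"), stringsAsFactors = FALSE)

# Character class: any uppercase letter.
rows_class <- grep("[A-Z]", df$id)

# Alternation built from LETTERS: "A|B|...|Z".
to_match <- paste(LETTERS, collapse = "|")
rows_alt <- grep(to_match, df$id)

# What grep(LETTERS, df$id) effectively does: only LETTERS[1] ("A") is used.
rows_first <- grep(LETTERS[1], df$id)

rows_class
rows_alt
rows_first
```

The first two calls select the same rows, while the third finds only the ids containing "A", which is why many rows appear to be dropped.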

Remove characters in string before specific symbol(including it) [duplicate]

This question already has answers here:
My regex is matching too much. How do I make it stop? [duplicate]
(5 answers)
Use gsub remove all string before first white space in R
(4 answers)
Closed 5 years ago.
To begin with: yes, similar questions already exist here, but the solutions don't work as they should, at least for me.
I'd like to remove everything (letters, digits, or any combination of characters) before the first semicolon, and remove that semicolon too.
So we have some strings:
x <- "1;ABC;GEF2"
y <- "X;EER;3DR"
Let's use gsub() with . and *, which together mean any character occurring zero or more times:
gsub(".*;", "", x)
gsub(".*;", "", y)
And as a result I get:
[1] "GEF2"
[1] "3DR"
But I'd like to have:
[1] "ABC;GEF2"
[1] "EER;3DR"
Why did it match up to the second occurrence of the semicolon instead of the first?
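The .* is greedy: it matches as much as possible, so the match extends to the last semicolon. To stop at the first one, match only non-semicolon characters before it. You could use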
gsub("[^;]*;(.*)", "\\1", x)
# [1] "ABC;GEF2"
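The difference between the greedy pattern and two ways of stopping at the first semicolon can be sketched as follows. The variable names are illustrative, and sub is used instead of gsub because only the first match should be removed.

```r
x <- "1;ABC;GEF2"
y <- "X;EER;3DR"

# Greedy: .* runs up to the LAST semicolon.
greedy_x <- sub(".*;", "", x)

# Negated class: [^;]* cannot cross a semicolon, so the match ends at the FIRST one.
first_x <- sub("^[^;]*;", "", x)
first_y <- sub("^[^;]*;", "", y)

# Lazy quantifier: .*? takes as few characters as possible.
lazy_x <- sub(".*?;", "", x, perl = TRUE)

greedy_x
first_x
first_y
lazy_x
```

With gsub the lazy pattern would be applied repeatedly and strip every "...;" prefix in turn, which is why sub is the safer choice there.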
