Remove characters which repeat more than twice in a string [duplicate] - r

This question already has answers here:
remove repeated character between words
(4 answers)
Closed 3 years ago.
I have this text:
F <- "hhhappy birthhhhhhdayyy"
and I want to remove the repeat characters, I tried this code
https://stackoverflow.com/a/11165145/10718214
and it works, but I need to remove repeat characters if it repeats more than 2, and if it repeated 2 times keep it.
so the output that I expect is
"happy birthday"
any help?

Try using sub, with the pattern (.)\\1{2,}:
F <- ("hhhappy birthhhhhhdayyy")
gsub("(.)\\1{2,}", "\\1", F)
[1] "happy birthday"
Explanation of regex:
(.) match and capture any single character
\\1{2,} then match the same character two or more times
We replace with just the single matching character. The quantity \\1 represents the first capture group in sub.

Related

Gsub in R for hyphens and digits [duplicate]

This question already has answers here:
Trim a string to a specific number of characters in R
(3 answers)
Using gsub in R to remove values in Zip Code field
(1 answer)
Closed 2 years ago.
I'm trying to use gsub on the df$Zipcode in the following data frame:
#Sample
df <-data.frame(ID = c(1,2,3,4,5,6,7),
Zipcode =c("10001-2838", "95011", "95011", "100028018", "84321", "84321", "94011"))
df
I want to take everything after the "-" (hyphen) out and replace it with nothing. Something like:
df$Zipcode <- gsub("\-", "", df$Zipcode)
But I don't think that is quite right. I also want to take the first 5 digits of all Zipcodes that are longer than 5 digits, like observation 4. Which should just be 10002. Maybe this is correct:
df$Zipcode <- gsub("[:6:]", "", df$Zipcode)
We can capture the first 5 characters that are not a - as a group and replace with the backreference (\\1) of the captured group
df$Zipcode <- sub("^([^-]{5}).*", "\\1", df$Zipcode)
df$Zipcode
#[1] "10001" "95011" "95011" "10002" "84321" "84321" "94011"
I think what you're looking for is this:
sub("(\\d{5}).*", "\\1", df$Zipcode)
[1] "10001" "95011" "95011" "10002" "84321" "84321" "94011"
This matches the first 5 digits, puts them into a capturing group, and 'remembers' them (but not the rest) via backreference \\1 in the replacement argument to sub.

Remove part of column name post the second "_" [duplicate]

This question already has answers here:
Exclude everything after the second occurrence of a certain string
(2 answers)
Closed 3 years ago.
I have a vector which has names of the columns
group <- c("amount_bin_group", "fico_bin_group", "cltv_bin_group", "p_region_bin")
I want to replace the part after the second "_" from each element i.e. I want it to be
group <- c("amount_bin", "fico_bin", "cltv_bin", "p_region")
I can split this into two vectors and try gsub or substr. However, it would be nice to do that in vector. Any thoughts?
I checked other posts regarding the same question, but none of them has this framework
> sub("(.*)_.*$", "\\1", group)
[1] "amount_bin" "fico_bin" "cltv_bin" "p_region"

Regex capture 1 character [duplicate]

This question already has answers here:
Complete word matching using grepl in R
(3 answers)
Closed 4 years ago.
Whenever english character of length 1 exists, I want that to be combined with the previous text.
gsub('(.*)\\s+([a-zA-Z]{1})', "\\1\\2", 'Anti-Candida a ингибинов')
Anti-Candidaa ингибинов
For the example below, it should return 'Anti-Candida am ингибинов' as 'am' is of length 2.
gsub('(.*)\\s+([a-zA-Z]{1})', "\\1\\2", 'Anti-Candida am ингибинов')
You can use this regex:
\W+([a-zA-Z])\b
replace with \\1. The trick here is to match a word boundary after the single letter.
Demo
Your regex will work as well, if you just add that \b at the end.

Difference between [A-Z] and LETTERS in grep [duplicate]

This question already has answers here:
Matching multiple patterns
(6 answers)
Closed 5 years ago.
I am trying to only keep rows whose id contains letters. And I find the following two ways give different results.
df[grep("[A-Z]",df$id),]
df[grep(LETTERS,df$id),]
It seems the second way will omit many rows that actually have letters.
Why?
If you want to grep patterns in a vector try this:
to_match <- paste(LETTERS, collapse = "|")
to_match
[1] "A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z"
and then
df[grep(to_match, df$id), ]
Explanation:
You will match any of the characters in "to_match" since they are separated by the "or" operator "|".

Remove characters in string before specific symbol(including it) [duplicate]

This question already has answers here:
My regex is matching too much. How do I make it stop? [duplicate]
(5 answers)
Use gsub remove all string before first white space in R
(4 answers)
Closed 5 years ago.
at the beginning, yes - simillar questions are present here, however the solution doesn't work as it should - at least for me.
I'd like to remove all characters, letters and numbers with any combination before first semicolon, and also remove it too.
So we have some strings:
x <- "1;ABC;GEF2"
y <- "X;EER;3DR"
Let's do so gsub() with . and * which means any symbol with occurance 0 or more:
gsub(".*;", "", x)
gsub(".*;", "", y)
And as a result i get:
[1] "GEF2"
[1] "3DR"
But I'd like to have:
[1] "ABC;GEF2"
[1] "EER;3DR"
Why did it 'catch' second occurence of semicolon instead of first?
You could use
gsub("[^;]*;(.*)", "\\1", x)
# [1] "ABC;GEF2"

Resources