Remove characters in string before specific symbol(including it) [duplicate] - r

To begin with: yes, similar questions already exist here, but the solutions don't work as they should, at least for me.
I'd like to remove every character (letters, digits, in any combination) before the first semicolon, and remove the semicolon itself as well.
So we have some strings:
x <- "1;ABC;GEF2"
y <- "X;EER;3DR"
Let's use gsub() with . and *, which together mean any character occurring 0 or more times:
gsub(".*;", "", x)
gsub(".*;", "", y)
And as a result I get:
[1] "GEF2"
[1] "3DR"
But I'd like to have:
[1] "ABC;GEF2"
[1] "EER;3DR"
Why did it 'catch' the second occurrence of the semicolon instead of the first?

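The quantifier * is greedy, so .*; matches as much as possible, up to the last semicolon. You could match only non-semicolon characters instead: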
gsub("[^;]*;(.*)", "\\1", x)
# [1] "ABC;GEF2"
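Two other ways to stop at the first semicolon, shown as a minimal sketch rather than the only correct approach: anchor a negated character class at the start of the string, or make the quantifier lazy with `.*?`. The variable names `first_way` and `second_way` are made up for this illustration.

```r
x <- "1;ABC;GEF2"
y <- "X;EER;3DR"

# [^;]* can never cross a semicolon, so the match ends at the first one.
# sub() is enough here because only one replacement per string is needed.
first_way <- sub("^[^;]*;", "", c(x, y))

# .*? is lazy: it takes as few characters as possible before the semicolon.
second_way <- sub("^.*?;", "", c(x, y), perl = TRUE)

print(first_way)   # "ABC;GEF2" "EER;3DR"
print(second_way)  # "ABC;GEF2" "EER;3DR"
```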

Related

Only remove open parentheses in R [duplicate]

I need to remove a single closing parenthesis from a string to fix an edge case in a simpler regex problem.
I need to remove text within parentheses, but the solution I am currently using doesn't handle an extra, unmatched closing parenthesis well. Should I use a different approach, or can I add an extra step to handle this case?
Below is an example where every call should return "brother". I have marked the line it fails on:
cleaner = function(x){
  x = tolower(x)
  ## if terms are in brackets - assume this is an alternative and remove
  x = stringr::str_remove_all(x, "\\(.*\\)")
  ## if terms are separated by semi-colons or commas, take the first, assume others are alternatives and remove
  x = gsub("^(.*?)(,|;).*", "\\1", x)
  ## remove whitespace
  x = stringi::stri_replace_all_charclass(x, "\\p{WHITE_SPACE}", "")
  x
}
cleaner("brother(bro)")
cleaner("brother;bro")
cleaner("bro ther")
cleaner("(bro)brother ;bro")
cleaner("(bro)brother ;bro)") ## this fails
cleaner("(bro)brother ;(bro") ## this doesn't
stringr::str_remove_all("(bro)brother ;bro)", "\\(.*\\)")
Thanks,
Sam
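One possible approach, as a sketch only: the greedy `\\(.*\\)` runs from the first "(" to the last ")", which swallows the whole string in the failing case. Matching `\\([^()]*\\)` instead stops at the nearest closing parenthesis, and an extra step can then strip any stray parenthesis that is left over. The function name `cleaner2` is hypothetical, and this version uses only base R instead of stringr and stringi.

```r
cleaner2 <- function(x) {
  x <- tolower(x)
  # Remove balanced "(...)" groups; [^()]* cannot cross another parenthesis.
  x <- gsub("\\([^()]*\\)", "", x)
  # Extra step: drop any unmatched parenthesis that remains.
  x <- gsub("[()]", "", x)
  # Keep only the text before the first comma or semicolon.
  x <- sub("[,;].*", "", x)
  # Remove whitespace.
  gsub("[[:space:]]", "", x)
}

inputs <- c("brother(bro)", "brother;bro", "bro ther",
            "(bro)brother ;bro", "(bro)brother ;bro)", "(bro)brother ;(bro")
results <- cleaner2(inputs)
print(results)
```

With these inputs every element of `results` should be "brother". Note one assumption: text after an unmatched "(" is kept rather than removed, which only works here because the semicolon step discards it.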

Remove characters which repeat more than twice in a string [duplicate]

I have this text:
F <- "hhhappy birthhhhhhdayyy"
and I want to remove the repeated characters. I tried the code from
https://stackoverflow.com/a/11165145/10718214
and it works, but I only need to collapse a character when it repeats more than twice; if it appears exactly twice, it should be kept.
So the output that I expect is
"happy birthday"
Any help?
Try using gsub, with the pattern (.)\\1{2,}:
F <- ("hhhappy birthhhhhhdayyy")
gsub("(.)\\1{2,}", "\\1", F)
[1] "happy birthday"
Explanation of regex:
(.) match and capture any single character
\\1{2,} then match that same character two or more additional times (three or more in a row in total)
We replace each such run with just the single captured character. In the replacement string, \\1 refers to the first capture group.
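To see why the lower bound in the quantifier matters, here is a small comparison on a made-up string (the input and variable names are illustrative only). With {2,} a doubled letter survives, while {1,} would collapse it too.

```r
s <- "weeee agree"

# Runs of three or more collapse to one character; the "ee" in "agree" is kept.
keep_doubles <- gsub("(.)\\1{2,}", "\\1", s)

# Runs of two or more collapse, so the "ee" in "agree" is reduced as well.
collapse_all <- gsub("(.)\\1{1,}", "\\1", s)

print(keep_doubles)  # "we agree"
print(collapse_all)  # "we agre"
```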

Only keep part before the 2nd pattern in R [duplicate]

How could I remove everything before the second occurrence of a pattern in a data frame using R?
I used:
for (i in 1:length(df1)){
  df1[, i] <- gsub(".*_", "", df1[, i])
}
But I guess there is a better way to apply that to the whole data frame?
Here is an example of values in the data frame:
name_000004_A_B_C
name_00003_C_D
and I would like to get:
A_B_C
C_D
Thank you for your help.
x <- c("name_000004_A_B_C", "name_00003_C_D")
gsub("(name_[0-9]*_)(.*)", "\\2", x)
##[1] "A_B_C" "C_D"
More generalised:
gsub("([a-z0-9]*_[a-z0-9]*_)(.*)", "\\2", x)
#[1] "A_B_C" "C_D"
The substitution uses two capture groups: the first is the pattern (name_[0-9]*_) and the second is whatever comes after it. The replacement \\2 keeps only the second group. Hope this helps!
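To apply the same idea to every column without an explicit loop, one option is lapply() over the data frame. This is a sketch under assumptions: the data frame contents and column names below are invented for illustration, every column is assumed to be character, and the pattern `^([^_]*_){2}` (two underscore-terminated fields at the start) is swapped in for the patterns above.

```r
# Hypothetical example data; the real df1 is not shown in the question.
df1 <- data.frame(
  a = c("name_000004_A_B_C", "name_00003_C_D"),
  b = c("id_17_X_Y", "id_2_Z"),
  stringsAsFactors = FALSE
)

# df1[] <- ... keeps the data frame structure while replacing each column.
df1[] <- lapply(df1, function(col) sub("^([^_]*_){2}", "", col))

print(df1)
```

Because `[^_]*` cannot cross an underscore, the match always ends at the second underscore, regardless of how many follow.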

What does the regular expression [[:space:][:digit:]]+ stand for in R? [duplicate]

I am trying to find out what the regular expression [[:space:][:digit:]]+ stands for.
I learned from Wikipedia that [:space:] means whitespace characters and [:digit:] means the digits 0 to 9.
So I think [[:space:][:digit:]]+ matches any whitespace character followed by a digit, like ' 1' or ' 9'.
But when I try this in R:
> txt <- c("arm","foot","lefroo", "laura ")
> i <- grep("[[:space:][:digit:]]+", txt)
> txt[i]
[1] "laura "
There is no digit in "laura ", but it still matched.
This really confused me. Can anyone explain this?
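A short sketch of what is going on: the outer square brackets form a single bracket expression, so [[:space:][:digit:]] matches one character that is either whitespace or a digit, and + repeats that. The trailing space in "laura " is enough for a match. To require a whitespace character followed by a digit, the two classes need separate brackets. The extra test strings "room 1" and "42" are added here for illustration.

```r
txt <- c("arm", "foot", "lefroo", "laura ", "room 1", "42")

# One or more characters, each of which is whitespace OR a digit.
either <- grepl("[[:space:][:digit:]]+", txt)

# A whitespace character immediately followed by a digit.
space_then_digit <- grepl("[[:space:]][[:digit:]]", txt)

print(txt[either])            # "laura " "room 1" "42"
print(txt[space_then_digit])  # "room 1"
```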

Column contains unit ($ sign) that need to be replaced [duplicate]

I have a few columns whose values contain a $, carried over from the Excel sheet.
[1] "$5,656.50" "$3,179.20" "$1,391.40" "$2,376.30" "$1,476.80" "$712.30" "$5,327.80"
[8] "$3,642.70" "$1,506.00" "$7,923.70" "$4,782.30" "$1,392.40" "$229.30" "$1,106.90"
[15] "$1,553.30" "$3,492.30" "$4,029.40" "$1,646.70" "$6,013.90" "$19,928.00" "$4,260.60"
There are >10,000 rows in this column, and R reads it as character because of the "$".
I tried
gsub( "$", " ", thedata$col.with.dollar.signs)
to replace the dollar sign with a space, but it didn't work.
Any other ideas are much appreciated.
This one maybe:
substring(thedata$col.with.dollar.signs, 2)
For example:
vec <- c("$5,656.50", "$3,179.20", "$1,391.40")
substring(vec,2)
#[1] "5,656.50" "3,179.20" "1,391.40"
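The original gsub() call did not work because an unescaped $ in a regular expression is the end-of-string anchor, not a literal dollar sign. A minimal sketch of two alternatives follows: match the character literally with fixed = TRUE, or strip both $ and the thousands separators and convert to numeric. The variable names are made up, and the conversion assumes every value follows the "$1,234.56" layout shown above.

```r
vec <- c("$5,656.50", "$3,179.20", "$19,928.00")

# fixed = TRUE treats "$" as a plain character instead of a regex anchor.
no_dollar <- gsub("$", "", vec, fixed = TRUE)

# Inside a bracket expression "$" is literal; remove "$" and "," then convert.
amounts <- as.numeric(gsub("[$,]", "", vec))

print(no_dollar)  # "5,656.50" "3,179.20" "19,928.00"
print(amounts)    # 5656.5 3179.2 19928.0
```

Unlike substring(vec, 2), this does not depend on the dollar sign being the first character.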
