How to remove beginning-digits only in R [duplicate] - r

This question already has answers here:
Remove numbers at the beginning and end of a string
(3 answers)
Remove string from a vector in R
(4 answers)
Closed 5 years ago.
I have some strings with digits and alpha characters in them. Some of the digits are important, but the ones at the beginning of the string (and only these) are unimportant. This is due to a peculiarity in how email addresses are stored. So the best example is:
x<-'12345johndoe23#gmail.com'
Should be transformed to johndoe23#gmail.com
unfortunately there are no spaces. I have tried gsub('[[:digit:]]+', '', x) but this removes all numbers, not just the beginning-ones
Edit: I have found some solutions in other languages: Python: Remove numbers at the beginning of a string

As per my comment:
See regex in use here
^[[:digit:]]+
^ Asserts position at the start of the string

You can do this:
x<-'12345johndoe23#gmail.com'
gsub('^[[:digit:]]+', '', x) #added ^ as begin of string

Another regex is :
sub('^\\d+','',x)

Related

How do I gsub ' with empty? [duplicate]

This question already has answers here:
How to remove single quote from a string in R?
(3 answers)
How do I deal with special characters like \^$.?*|+()[{ in my regex?
(2 answers)
Closed 11 months ago.
For example for a string like this
NANYANG-GIRLS'-HIGH-SCHOOL
how do I use gsub to replace ' to empty and make it
NANYANG-GIRLS-HIGH-SCHOOL
when I do it in R, it shows error
You can use either of the following two approaches:
sec_name <- gsub('\'', '', sec_name, fixed=TRUE)
sec_name <- gsub("'", "", sec_name, fixed=TRUE)
This first approach is a correct version of what you were doing. Here, we use single quotes for the strings, but we escape the single quote to make it a literal single quote.

Is there a way to keep only defined charaters in a string from a whitelist? [duplicate]

This question already has answers here:
in R, use gsub to remove all punctuation except period
(4 answers)
Closed 2 years ago.
I'm looking for a way to use a whitelist that contains digits and the Plus sign "+" to replace all other chars from a string.
string <- "opiqr8929348t89hr289r01++r42+3525"
I tried first to use:
gsub("[[:punct:][:alpha:]]", "", string)
but this excludes also the "+":
# [1] "89293488928901423525"
How can I exclude the "+" from [:alpha:] ?
So my intension is to use a whitelist instead:
whitelist <- c("0123456879+")
Is there a way to use gsub() in the other way around? Because when I use my whitelist it will identify the chars that should remain.
What about this:
string <- "opiqr8929348t89hr289r01++r42+3525"
gsub("[^0-9+]", "", string)
# [1] "89293488928901++42+3525"
This replaces everything that's not a 0-9 or plus with "".

How to use regex to match upto third forward slash in R using gsub? [duplicate]

This question already has answers here:
How to Select everything up to and including the 3rd slash (RegExp)?
(2 answers)
Extract a regular expression match
(12 answers)
Closed 2 years ago.
So this question is relating to specifically how R handles regex - I would like to find some regex in conjunction with gsub to extract out the text all but before the 3rd forward slash.
Here are some string examples:
/google.com/images/video
/msn.com/bing/chat
/bbc.com/video
I would like to obtain the following strings only:
/google.com/images
/msn.com/bing
/bbc.com/video
So it is not keeping the information after the 3rd forward slash.
I cannot seem to get any regex working along with using gsub to solve this!
The closest I have got is:
gsub(pattern = "/[A-Za-z0-9_.-]/[A-Za-z0-9_.-]*$", replacement = "", x = the_data_above )
I think R has some issues regarding forward slashes and escaping them.
From the start of the string match two instances of slash and following non-slash characters followed by anything and replace with the two instances.
paths <- c("/google.com/images/video", "/msn.com/bing/chat", "/bbc.com/video")
sub("^((/[^/]*){2}).*", "\\1", paths)
## [1] "/google.com/images" "/msn.com/bing" "/bbc.com/video"
You can take advantage of lazy (vs greedy) matching by adding the ? after the quantifier (+ in this case) within your capture group:
gsub("(/.+?/.+?)/.*", "\\1", text)
[1] "/google.com/images" "/msn.com/bing" "/bbc.com/video"
Data:
text <- c("/google.com/images/video",
"/msn.com/bing/chat",
"/bbc.com/video")
Try this out:
^\/[A-Za-z0-9_.-]+\/[A-Za-z0-9_.-]+
As seen here: https://regex101.com/r/9ZYppe/1
Your problem arises from the fact that [A-Za-z0-9_.-] matches only one such character. You need to use the + operator to specify that there are multiple of them. Also, the $ at the end is pretty unnecessary because using ^ to assert the start of the sentence solves a great many problems.

How to remove '+ off' from the end of string? [duplicate]

This question already has answers here:
How do I deal with special characters like \^$.?*|+()[{ in my regex?
(2 answers)
Closed 4 years ago.
Similar to R - delete last two characters in string if they match criteria except I'm trying to get rid of the special character '+' as well. I also attached a picture of my output.
When I attempt to use the escape command of '+', I get an error message saying
Error: '\+' is an unrecognized escape in character string starting ""\\s\+"
As you noticed, + is a metacharacter in regex so it needs to be escaped. \+ escapes that character, but \, itself, is a special character in R character strings so it, too, needs to be escaped. This is an R requirement, not a regex requirement.
This means that, instead of '\+', you need to write '\\+'.

Using Gsub in R to remove a string containing brackets [duplicate]

This question already has answers here:
How do I deal with special characters like \^$.?*|+()[{ in my regex?
(2 answers)
Closed 6 years ago.
I'm trying to use gsub to remove certain parts of a string. However, I can't get it to work, and I think it's because the string to be removed contains brackets. Is there any way around this? Thanks for any help.
The command I want to use:
gsub('(4:4aCO)_','', '(5:3)_(4:4)_(5:3)_(4:4)_(4:4aCO)_(6:2)_(4:4a)')
Returns:
#"(5:3)_(4:4)_(5:3)_(4:4)_(4:4aCO)_(6:2)_(4:4a)"
Expected output:
#"(5:3)_(4:4)_(5:3)_(4:4)_(6:2)_(4:4a)"
A quick test to see if brackets were the problem:
gsub('te','', 'test')
#[1] "st"
gsub('(te)','', '(te)st')
#[1] "()st"
We can by placing the brackets inside the square brackets as () is a metacharacter
gsub('[(]4:4aCO[)]','', '(5:3)(4:4)(5:3)(4:4)(4:4aCO)(6:2)_(4:4a)')
Or with fixed = TRUE to evaluate the literal meaning of that character
gsub('(4:4aCO)','', '(5:3)(4:4)(5:3)(4:4)(4:4aCO)(6:2)_(4:4a)', fixed = TRUE)

Resources