How does this gsub() work? [duplicate] - r

This is the code, and I don't understand the output.
The original string is theStr: "C:\\Users\\codep\\Desktop\\DM_HW2\\2017-07-9.csv"
gsub("^(.*)\\\\.*$",'\\1',theStr)
It becomes: "C:\\Users\\codep\\Desktop\\DM_HW2"
What is the "\\\\." in the pattern, and what is the '\\1' in the replacement?

Your pattern may be explained as follows:
^(.*)\\\\ - match and capture everything up to, but excluding, the LAST path separator (in an R string literal, \\\\ is the regex \\, which matches one literal backslash; .* is greedy, so the capture runs to the last one)
.*$ - then match/consume the remainder of the file path
Then you replace the whole match with the captured group, which is referenced by \\1 in the replacement (the second argument passed to gsub). This effectively removes everything from the final path separator to the end of the file path.
Here is a regex demo which you can use to see for yourself how the pattern is matching, and what the capture group is:
Demo
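To see the two layers of escaping at work, here is a minimal sketch using the string from the question. The variable name result is introduced here only for illustration. writeLines() prints the characters the string really contains, which helps separate R's string escaping from the regex escaping.

```r
theStr <- "C:\\Users\\codep\\Desktop\\DM_HW2\\2017-07-9.csv"

# The string literal doubles each backslash; the actual string holds single ones.
writeLines(theStr)
# C:\Users\codep\Desktop\DM_HW2\2017-07-9.csv

# "\\\\" in R source is the two-character regex \\ , i.e. one literal backslash.
# (.*) is greedy, so it captures up to the LAST backslash; \\1 puts it back.
result <- gsub("^(.*)\\\\.*$", "\\1", theStr)
writeLines(result)
# C:\Users\codep\Desktop\DM_HW2
```

Because the pattern is anchored with ^ and $, it can match at most once per string, so sub() would give the same result as gsub() here.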

Related

How to use regex to match up to the third forward slash in R using gsub? [duplicate]

This question relates specifically to how R handles regex. I would like to use a regex with gsub to extract the text before the third forward slash.
Here are some string examples:
/google.com/images/video
/msn.com/bing/chat
/bbc.com/video
I would like to obtain the following strings only:
/google.com/images
/msn.com/bing
/bbc.com/video
That is, nothing from the third forward slash onward should be kept.
I cannot get any regex to work with gsub to solve this!
The closest I have got is:
gsub(pattern = "/[A-Za-z0-9_.-]/[A-Za-z0-9_.-]*$", replacement = "", x = the_data_above )
I think R has some issues regarding forward slashes and escaping them.
From the start of the string, match two instances of a slash followed by non-slash characters, then match anything that follows, and replace the whole match with the two captured instances.
paths <- c("/google.com/images/video", "/msn.com/bing/chat", "/bbc.com/video")
sub("^((/[^/]*){2}).*", "\\1", paths)
## [1] "/google.com/images" "/msn.com/bing" "/bbc.com/video"
You can take advantage of lazy (vs greedy) matching by adding the ? after the quantifier (+ in this case) within your capture group:
gsub("(/.+?/.+?)/.*", "\\1", text)
[1] "/google.com/images" "/msn.com/bing" "/bbc.com/video"
Data:
text <- c("/google.com/images/video",
"/msn.com/bing/chat",
"/bbc.com/video")
Try this out:
^\/[A-Za-z0-9_.-]+\/[A-Za-z0-9_.-]+
As seen here: https://regex101.com/r/9ZYppe/1
Your problem arises from the fact that [A-Za-z0-9_.-] matches only one such character. You need the + quantifier to match one or more of them. Also, the $ at the end is unnecessary here: anchoring with ^ at the start of the string is enough, because the match simply stops after the second path segment. Note that in an R string the forward slash needs no escaping, so \/ is written as plain /.
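Since this last pattern matches the part to keep rather than the part to remove, one way to apply it in R is to extract the match instead of substituting. The following is a minimal sketch using base R's regexpr() and regmatches(); the names pattern and kept are illustrative only.

```r
paths <- c("/google.com/images/video", "/msn.com/bing/chat", "/bbc.com/video")

# Same pattern as above, written for an R string: "/" needs no escaping.
pattern <- "^/[A-Za-z0-9_.-]+/[A-Za-z0-9_.-]+"

# regexpr() locates the first match in each string; regmatches() extracts it.
kept <- regmatches(paths, regexpr(pattern, paths))
print(kept)
```

Note that regmatches() silently drops strings with no match, so the result can be shorter than the input if some paths do not fit the pattern.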

How to get the last part of strings with different lengths ending in ".nc" [duplicate]

I have several download links (i.e., strings), and each string has different length.
For example let's say these fake links are my strings:
My_Link1 <- "http://esgf-data2.diasjp.net/pr/gn/v20190711/pr_day_MRI-AGCM3-2-H_highresSST_gn_20100101-20141231.nc"
My_Link2 <- "http://esgf-data2.diasjp.net/gn/v20190711/pr_-present_r1i1p1f1_gn_19500101-19591231.nc"
My goals:
A) I want only the last part of each string, ending in .nc, to get these results:
pr_day_MRI-AGCM3-2-H_highresSST_gn_20100101-20141231.nc
pr_-present_r1i1p1f1_gn_19500101-19591231.nc
B) I want only the last part of each string, without the .nc, to get these results:
pr_day_MRI-AGCM3-2-H_highresSST_gn_20100101-20141231
pr_-present_r1i1p1f1_gn_19500101-19591231
I tried to find a way on the net, but I failed. It seems this can be done in Python as documented here:
How to get everything after last slash in a URL?
Does anyone know an equivalent method in R?
Thanks so much for your time.
A shortcut to get the last part of the string is to use basename:
basename(My_Link1)
#[1] "pr_day_MRI-AGCM3-2-H_highresSST_gn_20100101-20141231.nc"
For the second question, to remove the trailing ".nc" we can use sub, anchoring the pattern with $ so that only an extension at the end is removed:
sub("\\.nc$", "", basename(My_Link1))
#[1] "pr_day_MRI-AGCM3-2-H_highresSST_gn_20100101-20141231"
With some regex, here is another way to get the first result (the file name):
sub(".*/", "", My_Link1)
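As a further option, the extension can be dropped without writing a regex by using file_path_sans_ext() from the tools package that ships with R. This is a minimal sketch applied to both links from the question; the names links, file_names and stems are introduced here for illustration.

```r
My_Link1 <- "http://esgf-data2.diasjp.net/pr/gn/v20190711/pr_day_MRI-AGCM3-2-H_highresSST_gn_20100101-20141231.nc"
My_Link2 <- "http://esgf-data2.diasjp.net/gn/v20190711/pr_-present_r1i1p1f1_gn_19500101-19591231.nc"
links <- c(My_Link1, My_Link2)  # illustrative helper vector

# Goal A: everything after the last slash (basename is vectorised).
file_names <- basename(links)

# Goal B: the same names with the final extension removed.
stems <- tools::file_path_sans_ext(file_names)

print(file_names)
print(stems)
```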

How to remove '+ off' from the end of string? [duplicate]

Similar to "R - delete last two characters in string if they match criteria", except I'm trying to get rid of the special character '+' as well. I also attached a picture of my output.
When I attempt to escape the '+', I get an error message saying:
Error: '\+' is an unrecognized escape in character string starting ""\\s\+"
As you noticed, + is a metacharacter in regex so it needs to be escaped. \+ escapes that character, but \, itself, is a special character in R character strings so it, too, needs to be escaped. This is an R requirement, not a regex requirement.
This means that, instead of '\+', you need to write '\\+'.
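The original data is not shown in the question, so the following is only a sketch on a made-up vector x. It shows the double-backslash form applied to a trailing " + off", and, as an alternative, fixed = TRUE, which treats the pattern as a literal string so that nothing needs escaping.

```r
# Hypothetical example data; the question's real data is not available.
x <- c("Lights + off", "Heater + off", "Fan on")

# Regex version: \\s* is optional whitespace, \\+ is a literal plus,
# and $ restricts the match to the end of the string.
cleaned_regex <- sub("\\s*\\+ off$", "", x)

# Literal version: no regex metacharacters, so "+" is used as-is.
# Unlike the regex above, this is not anchored to the end of the string.
cleaned_fixed <- sub(" + off", "", x, fixed = TRUE)

print(cleaned_regex)
print(cleaned_fixed)
```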

How to remove beginning-digits only in R [duplicate]

I have some strings with digits and alpha characters in them. Some of the digits are important, but the ones at the beginning of the string (and only these) are unimportant. This is due to a peculiarity in how email addresses are stored. So the best example is:
x<-'12345johndoe23#gmail.com'
This should be transformed to johndoe23#gmail.com
Unfortunately, there are no spaces. I have tried gsub('[[:digit:]]+', '', x), but this removes all numbers, not just the leading ones.
Edit: I have found some solutions in other languages: Python: Remove numbers at the beginning of a string
As per my comment:
See regex in use here
^[[:digit:]]+
^ Asserts position at the start of the string
You can do this:
x<-'12345johndoe23#gmail.com'
gsub('^[[:digit:]]+', '', x) # added ^ to anchor the match at the beginning of the string
Another regex is:
sub('^\\d+','',x)
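To compare the anchored and unanchored patterns side by side, here is a small sketch on a vector. Only the first element comes from the question; the other two are made up for illustration. Because ^ can match only once per string, sub() and gsub() behave the same with the anchored pattern.

```r
# First element is from the question; the others are hypothetical.
x <- c("12345johndoe23#gmail.com", "janedoe7#gmail.com", "99x1#gmail.com")

# Anchored: only a run of digits at the very start is removed.
leading_removed <- sub("^[[:digit:]]+", "", x)

# Unanchored with gsub: every run of digits is removed (the original problem).
all_removed <- gsub("[[:digit:]]+", "", x)

print(leading_removed)
print(all_removed)
```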

Regular expression for 6 characters [duplicate]

I need a couple of regular expressions in ASP.NET.
The first should accept exactly 6 characters, where the second and third characters must be _ (underscore).
For example: a__cde
The second regex should also accept exactly 6 characters: the second position must be an underscore (_), the third position may be either an underscore (_) or a hash (#), and the fourth position must be a hash (#).
Note: In both regexes, the user may enter only digits, letters, or a star (*) at any position other than those mentioned above.
Can anyone help me out with this? I have tried the website below:
http://www.regexr.com/
But I was not able to generate a proper regex.
Try something like:
1. ^[a-zA-Z0-9\*]__[a-zA-Z0-9\*]{3}$
2. ^[a-zA-Z0-9\*]_[_#]#[a-zA-Z0-9\*]{2}$
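The question is about ASP.NET, but the two patterns use only anchors, character classes and counted repetition, so they can be sanity-checked in R as a rough sketch. The test strings other than a__cde are made up, and the star is written unescaped inside the character class, where it has no special meaning; this is an assumption about equivalence, not a test of the .NET engine itself.

```r
# The two suggested patterns, with * left unescaped inside the class.
p1 <- "^[a-zA-Z0-9*]__[a-zA-Z0-9*]{3}$"
p2 <- "^[a-zA-Z0-9*]_[_#]#[a-zA-Z0-9*]{2}$"

# First pattern: underscores required at positions 2 and 3, length exactly 6.
r1 <- grepl(p1, c("a__cde", "*__1*2", "a_bcde", "a__cd"))

# Second pattern: "_" at 2, "_" or "#" at 3, "#" at 4, length exactly 6.
r2 <- grepl(p2, c("a__#de", "a_##de", "a_b#de", "a__#d"))

print(r1)
print(r2)
```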
