How to use regex to match upto third forward slash in R using gsub? [duplicate] - r

This question already has answers here:
How to Select everything up to and including the 3rd slash (RegExp)?
(2 answers)
Extract a regular expression match
(12 answers)
Closed 2 years ago.
So this question is relating to specifically how R handles regex - I would like to find some regex in conjunction with gsub to extract out the text all but before the 3rd forward slash.
Here are some string examples:
/google.com/images/video
/msn.com/bing/chat
/bbc.com/video
I would like to obtain the following strings only:
/google.com/images
/msn.com/bing
/bbc.com/video
So it is not keeping the information after the 3rd forward slash.
I cannot seem to get any regex working along with using gsub to solve this!
The closest I have got is:
gsub(pattern = "/[A-Za-z0-9_.-]/[A-Za-z0-9_.-]*$", replacement = "", x = the_data_above )
I think R has some issues regarding forward slashes and escaping them.

From the start of the string match two instances of slash and following non-slash characters followed by anything and replace with the two instances.
paths <- c("/google.com/images/video", "/msn.com/bing/chat", "/bbc.com/video")
sub("^((/[^/]*){2}).*", "\\1", paths)
## [1] "/google.com/images" "/msn.com/bing" "/bbc.com/video"

You can take advantage of lazy (vs greedy) matching by adding the ? after the quantifier (+ in this case) within your capture group:
gsub("(/.+?/.+?)/.*", "\\1", text)
[1] "/google.com/images" "/msn.com/bing" "/bbc.com/video"
Data:
text <- c("/google.com/images/video",
"/msn.com/bing/chat",
"/bbc.com/video")

Try this out:
^\/[A-Za-z0-9_.-]+\/[A-Za-z0-9_.-]+
As seen here: https://regex101.com/r/9ZYppe/1
Your problem arises from the fact that [A-Za-z0-9_.-] matches only one such character. You need to use the + operator to specify that there are multiple of them. Also, the $ at the end is pretty unnecessary because using ^ to assert the start of the sentence solves a great many problems.

Related

Remove multiple instances with a regex expression, but not the text in between instances [duplicate]

This question already has answers here:
Regular expression to stop at first match
(9 answers)
Closed 1 year ago.
In long passages using bookdown, I have inserted numerous images. Having combined the passages into a single character string (in a data frame) I want to remove the markdown text associated with inserting images, but not any text in between those inserted images. Here is a toy example.
text.string <- "writing ![Stairway scene](/media/ClothesFairLady.jpg) writing to keep ![Second scene](/media/attire.jpg) more writing"
str_remove_all(string = text.string, pattern = "!\\[.+\\)")
[1] "writing more writing"
The regex expression doesn't stop at the first closed parenthesis, it continues until the last one and deletes the "writing to keep" in between.
I tried to apply String manipulation in R: remove specific pattern in multiple places without removing text in between instances of the pattern, which uses gsubfn and gsub but was unable to get the solutions to work.
Please point me in the right direction to solve this problem of a regex removal of designated strings, but not the characters in between the strings. I would prefer a stringr solution, but whatever works. Thank you
You have to use the following regex
"!\\[[^\\)]+\\)"
alternatively you can also use this:
"!\\[.*?\\)"
both solution offer a lazy match rather than a greedy one, which is the key to your question
I think you could use the following solution too:
gsub("!\\[[^][]*\\]\\([^()]*\\)", "", text.string)
[1] "writing writing to keep more writing"

Replace latex with r strings using gsub [duplicate]

This question already has an answer here:
"'\w' is an unrecognized escape" in grep
(1 answer)
Closed 1 year ago.
I would like to find and replace tabular instances by tabularx. I tried with gsub but it seems to enter me into a world of escaping pain. Following other questions and answers I find fixed=TRUE which is the best I so far have. The code snippet below almost works, \B is unrecognized. If I escape it twice I get \BEGIN as output!
texText <- '\begin{tabular}{rl}\begin{tabular}{rll}'
texText <- gsub("\begin{tabular}{rl}", "\BEGIN{tabular}{rll}", texText, fixed=TRUE)
I'm using BEGIN as my test to see what is happening. This is before I get to tackling the question of what goes on in the brackets {rl} {ll} {rrl} etc. Ideally I'm looking for a regex that would output:
\begin{tabularx}{rX}\begin{tabularx}{rlX}
That is the final column is replaced by X.
Try using proper escaping:
texText <- "\begin{tabular}{rl}\begin{tabular}{rll}"
output <- gsub("\begin\\{tabular\\}", "\begin{tabularx}", texText)
output
[1] "\begin{tabularx}{rl}\begin{tabularx}{rll}"
A literal backslash requires two backslashes, and also metacharacters such as { and } require two backslashes.

how to get the last part of strings with different lengths ended by ".nc" [duplicate]

This question already has answers here:
Get filename without extension in R
(9 answers)
Find file name from full file path
(4 answers)
Closed 3 years ago.
I have several download links (i.e., strings), and each string has different length.
For example let's say these fake links are my strings:
My_Link1 <- "http://esgf-data2.diasjp.net/pr/gn/v20190711/pr_day_MRI-AGCM3-2-H_highresSST_gn_20100101-20141231.nc"
My_Link2 <- "http://esgf-data2.diasjp.net/gn/v20190711/pr_-present_r1i1p1f1_gn_19500101-19591231.nc"
My goals:
A) I want to have only the last part of each string ended by .nc , and get these results:
pr_day_MRI-AGCM3-2-H_highresSST_gn_20100101-20141231.nc
pr_-present_r1i1p1f1_gn_19500101-19591231.nc
B) I want to have only the last part of each string before .nc , and get these results:
pr_day_MRI-AGCM3-2-H_highresSST_gn_20100101-20141231
pr_-present_r1i1p1f1_gn_19500101-19591231
I tried to find a way on the net, but I failed. It seems this can be done in Python as documented here:
How to get everything after last slash in a URL?
Does anyone know the same method in R?
Thanks so much for your time.
A shortcut to get last part of the string would be to use basename
basename(My_Link1)
#[1] "pr_day_MRI-AGCM3-2-H_highresSST_gn_20100101-20141231.nc"
and for the second question if you want to remove the last ".nc" we could use sub like
sub("\\.nc", "", basename(My_Link1))
#[1] "pr_day_MRI-AGCM3-2-H_highresSST_gn_20100101-20141231"
With some regex here is another way to get first part :
sub(".*/", "", My_Link1)

R regex using stringr::str_detect and grepl don't seem to be matching "\\+" when it is surrounded by "\\b" [duplicate]

This question already has answers here:
Why does is this end of line (\\b) not recognised as word boundary in stringr/ICU and Perl
(2 answers)
Closed 3 years ago.
I'm pretty new to regex and am trying to detect a word with the "+" symbol when surrounded by "\\b" in long strings of words but both stringr and grepl are giving me the wrong result.
This is the code that I have wrote:
library(stringr)
str_detect("coversyl +", "\\bcoversyl(plus| plus|\\+| \\+)\\b")
The output is FALSE which is wrong.
What would be the right way to do it?
My guess is that your expression is just fine, maybe missing an space,
\\bcoversyl\\b\\s(\\bplus\\b|\\+)
Please see the demo for additional explanation.
If we might want more than one space, we would simply change \\s to \\s+ and it might work:
\\bcoversyl\\b\\s+(\\bplus\\b|\\+)

How this gsub() works? [duplicate]

This question already has answers here:
Reference - What does this regex mean?
(1 answer)
Learning Regular Expressions [closed]
(1 answer)
Closed 4 years ago.
So, it's the code and I don't understand the output.
Original theStr:"C:\\Users\\codep\\Desktop\\DM_HW2\\2017-07-9.csv"
gsub("^(.*)\\\\.*$",'\\1',theStr)
it become:"C:\\Users\\codep\\Desktop\\DM_HW2"
what is the "\\\\." in the pattern and the '\\1' in the replacement?
Your pattern may be explained as follows:
^(.*)\\\\ - match and capture everything up but excluding the LAST path separator
.*$ - then match/consume the remainder of the file path
Then, you replace the original input with the captured quantity, which is \\1, the second parameter passed to gsub. This effectively removes everything from the final path separator to the end of the file path.
Here is a regex demo which you can use to see for yourself how the pattern is matching, and what the capture group is:
Demo

Resources