Extract a sentence from mail with Regex [duplicate] - r

This question already has answers here:
Extracting a string between other two strings in R
(4 answers)
Closed 3 years ago.
I need to extract a sentence with a regex, without the trailing <br> tag, but my attempt is giving me issues:
(?<=Status:) (.*)[^<br>]
Status: i3 Naviera indicates that the container is already released<br>
This sentence comes from an email:
"<html>\r\n<head>\r\n<meta http-equiv=\"Content-Type\"
content=\"text/html; charset=utf-8\">\r\n</head>\r\n<body>\r\nStatus:
i3 Naviera indicates that the container is already
released<br>\r\nObservations: data requested.<br>\r\n<br>\r\n<img
src=\"http://test/logo/Logo2.png\">\r\n</body>\r\n</html>\r\n"
I just need to extract:
i3 Naviera indicates that the container is already released

This regex would work for your content:
(?<=Status: )(.*?)(?=<br>)
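It matches the text that follows "Status: " (including the space), stops at the first <br>, and includes neither of them in the match. In base R, lookarounds such as (?<=...) and (?=...) require perl = TRUE.
Please note that parsing HTML with a regex is brittle: it only keeps working as long as the structure of the HTML content does not change.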
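As a rough sketch of how the pattern above could be applied in base R: the mail string below is a shortened copy of the one in the question, and the variable names (mail, m, status) are only illustrative.

```r
# Shortened copy of the mail body from the question (illustrative input)
mail <- paste0(
  "<body>\r\nStatus: i3 Naviera indicates that the container is already released<br>",
  "\r\nObservations: data requested.<br>\r\n<br>\r\n</body>"
)

# Lookbehind/lookahead need the PCRE engine, hence perl = TRUE
m <- regexpr("(?<=Status: )(.*?)(?=<br>)", mail, perl = TRUE)

# regmatches() returns only the matched text, without "Status: " or "<br>"
status <- regmatches(mail, m)
print(status)
```

If the mail has no "Status: " line, regmatches() returns an empty character vector, so the length of the result is worth checking before use.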

Related

how to get the last part of strings with different lengths ended by ".nc" [duplicate]

This question already has answers here:
Get filename without extension in R
(9 answers)
Find file name from full file path
(4 answers)
Closed 3 years ago.
I have several download links (i.e., strings), and each string has a different length.
For example, let's say these fake links are my strings:
My_Link1 <- "http://esgf-data2.diasjp.net/pr/gn/v20190711/pr_day_MRI-AGCM3-2-H_highresSST_gn_20100101-20141231.nc"
My_Link2 <- "http://esgf-data2.diasjp.net/gn/v20190711/pr_-present_r1i1p1f1_gn_19500101-19591231.nc"
My goals:
A) I want only the last part of each string, ending in .nc, to get these results:
pr_day_MRI-AGCM3-2-H_highresSST_gn_20100101-20141231.nc
pr_-present_r1i1p1f1_gn_19500101-19591231.nc
B) I want only the last part of each string without the .nc extension, to get these results:
pr_day_MRI-AGCM3-2-H_highresSST_gn_20100101-20141231
pr_-present_r1i1p1f1_gn_19500101-19591231
I tried to find a way on the net, but I failed. It seems this can be done in Python as documented here:
How to get everything after last slash in a URL?
Does anyone know the same method in R?
Thanks so much for your time.
A shortcut to get the last part of the string is to use basename:
basename(My_Link1)
#[1] "pr_day_MRI-AGCM3-2-H_highresSST_gn_20100101-20141231.nc"
For the second goal, to remove the trailing ".nc", we could use sub, anchoring the pattern to the end of the string with $ so that only the extension is removed:
sub("\\.nc$", "", basename(My_Link1))
#[1] "pr_day_MRI-AGCM3-2-H_highresSST_gn_20100101-20141231"
With a regex, here is another way to get the last part of the string (goal A), by removing everything up to and including the final slash:
sub(".*/", "", My_Link1)
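Both goals can also be handled for several links at once, since basename is vectorized. The sketch below is one possible approach rather than the only one: it uses file_path_sans_ext from the tools package (shipped with R) instead of sub, and the variable names links, file_names and stems are illustrative.

```r
# The two example links from the question
links <- c(
  "http://esgf-data2.diasjp.net/pr/gn/v20190711/pr_day_MRI-AGCM3-2-H_highresSST_gn_20100101-20141231.nc",
  "http://esgf-data2.diasjp.net/gn/v20190711/pr_-present_r1i1p1f1_gn_19500101-19591231.nc"
)

# Goal A: everything after the last slash
file_names <- basename(links)

# Goal B: the same names without the final extension
stems <- tools::file_path_sans_ext(file_names)

print(file_names)
print(stems)
```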

R Convert HTML ASCII characters to character [duplicate]

This question already has answers here:
convert HTML Character Entity Encoding in R
(5 answers)
Convert HTML Entity to proper character R
(1 answer)
Closed 4 years ago.
Is there a standard way in R to convert ASCII HTML character codes to the characters they stand for? For example, &#39; is an apostrophe, like &#39; or ' (I typed an apostrophe for the second one and the HTML code for the first). I'd like to change the following text
text = "Met with Mark&#39;s boss today to discuss performance"
to be
"Met with Mark's boss today to discuss performance"
I tried using iconv like below but the HTML code is all valid encoding, so nothing changes.
iconv(text, from="ASCII", to="UTF-8//TRANSLIT")
I could get a lookup table and do it that way but thought I'd check if there's an existing method to accomplish this.
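For decimal numeric codes specifically, a lookup table may not be needed. The following is a minimal base R sketch, assuming the input only contains references of the form &#NN; (the helper name decode_numeric_entities is hypothetical, and named entities such as &amp; are not handled):

```r
# Hypothetical helper: decode decimal numeric character references like &#39;
decode_numeric_entities <- function(x) {
  m <- gregexpr("&#[0-9]+;", x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(ents) {
    # Keep only the digits, then convert each code point to its character
    codes <- as.integer(gsub("[^0-9]", "", ents))
    vapply(codes, intToUtf8, character(1))
  })
  x
}

text <- "Met with Mark&#39;s boss today to discuss performance"
decoded <- decode_numeric_entities(text)
print(decoded)
```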

How this gsub() works? [duplicate]

This question already has answers here:
Reference - What does this regex mean?
(1 answer)
Learning Regular Expressions [closed]
(1 answer)
Closed 4 years ago.
Here is the code; I don't understand its output.
Original theStr: "C:\\Users\\codep\\Desktop\\DM_HW2\\2017-07-9.csv"
gsub("^(.*)\\\\.*$",'\\1',theStr)
It becomes: "C:\\Users\\codep\\Desktop\\DM_HW2"
What is the "\\\\." in the pattern, and what is the '\\1' in the replacement?
Your pattern may be explained as follows:
^(.*)\\\\ - match and capture everything up to, but excluding, the LAST path separator (in an R string literal, \\\\ becomes the regex \\, which matches one literal backslash)
.*$ - then match/consume the remainder of the file path
Then you replace the whole match with the captured group, which is referenced as \\1 in the replacement (the second argument passed to gsub). This effectively removes everything from the final path separator to the end of the file path.
Here is a regex demo which you can use to see for yourself how the pattern is matching, and what the capture group is:
Demo
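To see the capture group at work, here is a small sketch that runs the pattern from the question and, for contrast, a variant that captures the part after the last backslash instead. The variable names dir_part and file_part are illustrative.

```r
theStr <- "C:\\Users\\codep\\Desktop\\DM_HW2\\2017-07-9.csv"

# Group 1 = everything before the last backslash; the rest is matched and dropped
dir_part <- gsub("^(.*)\\\\.*$", "\\1", theStr)

# Variant: move the parentheses to keep the part after the last backslash
file_part <- gsub("^.*\\\\(.*)$", "\\1", theStr)

# cat() shows the actual characters (single backslashes), unlike print()
cat(dir_part, "\n")
cat(file_part, "\n")
```

Because .* is greedy, it runs as far as it can before backtracking to a backslash, which is why the split happens at the last separator rather than the first.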

How to remove beginning-digits only in R [duplicate]

This question already has answers here:
Remove numbers at the beginning and end of a string
(3 answers)
Remove string from a vector in R
(4 answers)
Closed 5 years ago.
I have some strings with digits and alpha characters in them. Some of the digits are important, but the ones at the beginning of the string (and only these) are unimportant. This is due to a peculiarity in how email addresses are stored. So the best example is:
x<-'12345johndoe23#gmail.com'
Should be transformed to johndoe23#gmail.com
Unfortunately, there are no spaces. I have tried gsub('[[:digit:]]+', '', x), but this removes all digits, not just the ones at the beginning.
Edit: I have found some solutions in other languages: Python: Remove numbers at the beginning of a string
As per my comment:
^[[:digit:]]+
^ asserts position at the start of the string, so only the leading run of digits is matched.
You can do this:
x<-'12345johndoe23#gmail.com'
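gsub('^[[:digit:]]+', '', x) # ^ anchors the match to the start of the string
Another option, using \\d and sub (a single replacement is enough, since there is only one start of string):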
sub('^\\d+','',x)
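Since sub is vectorized, the same call works on a whole vector of addresses. In the sketch below, only the first value comes from the question; the other two are made-up inputs added to show that strings without leading digits are left alone and that digits later in the string survive.

```r
# First element is from the question; the other two are hypothetical test inputs
x <- c("12345johndoe23#gmail.com", "janedoe7#gmail.com", "007bond#gmail.com")

# Strip only the run of digits anchored at the start of each string
cleaned <- sub("^[[:digit:]]+", "", x)
print(cleaned)
```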

Regular expression for 6 characters [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 5 years ago.
I need a couple of regular expressions in ASP.NET.
The first should accept exactly 6 characters, where the second and third characters must be _ (underscore).
Like: a__cde
I also want another regex that accepts exactly 6 characters, where the second position must be an underscore (_), the third position may be either an underscore (_) or a hash (#), and the fourth position must be a hash (#).
Note: in both regexes, the user can only enter digits, letters, or a star (*) at any position other than those mentioned above.
Can anyone help me out with this? I have tried the website below:
http://www.regexr.com/
But I was not able to produce a proper regex.
Try something like:
1. ^[a-zA-Z0-9\*]__[a-zA-Z0-9\*]{3}$
2. ^[a-zA-Z0-9\*]_[_#]#[a-zA-Z0-9\*]{2}$
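The two patterns can be checked quickly against a few inputs. The sketch below does this in R rather than ASP.NET, purely as a way to exercise the expressions; the test strings other than a__cde are made up, and the escape before * is dropped because a star needs no escaping inside a character class.

```r
# Same patterns as above, written as R strings
pattern1 <- "^[a-zA-Z0-9*]__[a-zA-Z0-9*]{3}$"
pattern2 <- "^[a-zA-Z0-9*]_[_#]#[a-zA-Z0-9*]{2}$"

# a__cde is from the question; the remaining inputs are hypothetical
inputs1 <- c("a__cde", "a_bcde", "a__cd", "*__1*2")
inputs2 <- c("a__#de", "a_##de", "a_b#de", "*_##1*")

result1 <- grepl(pattern1, inputs1)
result2 <- grepl(pattern2, inputs2)

print(setNames(result1, inputs1))
print(setNames(result2, inputs2))
```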
