Regular expressions in R to get the date

What would be a better way to extract only the date? It comes from a tag in a webpage.
I hope someone can help me.
The pattern appears in many pages as "publishedAtDate":"2020-02-07"
I would like to get the following outcome:
2020-02-07
I am using this code:
art_publishdate<-regexpr("publishedAtDate\":\"[0-9]{4}-[0-9]{2}-[0-9]{2}\"", thepage)
but the result includes many backslashes:
[1] "publishedAtDate\":\"2020-02-07\""
Thank you

You could strip out everything except the digits and then convert the result with as.Date.
as.Date(gsub("\\D", "", '"publishedAtDate":"2020-02-07"'), format="%Y%m%d")
# [1] "2020-02-07"

Two ways to capture the output.
Using gsub, we remove everything up to the colon (:) as well as the double quotes.
string <- '"publishedAtDate":"2020-02-07"'
gsub('.*:|"', '', string)
#[1] "2020-02-07"
Or, using sub, we can extract the date pattern.
sub('.*?(\\d+-\\d+-\\d+).*', '\\1', string)
#[1] "2020-02-07"

Another solution uses str_extract from the stringr package:
library(stringr)
str_extract(string, "[0-9]{4}-[0-9]{2}-[0-9]{2}")
[1] "2020-02-07"
Alternatively, a looser pattern that matches the first run of digits and hyphens also works here:
str_extract(string, "[0-9-]+")
[1] "2020-02-07"
Another alternative is a positive look-behind (which encodes the instruction "match only if you see this on the left") combined with a negated character class [^"], which matches any character except the quote mark:
str_extract(string, '(?<=:")[^"]*')
[1] "2020-02-07"
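Coming back to the code in the question: regexpr on its own returns match positions, so it needs regmatches to pull out the text, and the backslashes in the printed result are only how R displays double quotes inside a string. Below is a minimal sketch of one way to keep the original approach. It assumes thepage holds the page source (the value here is a made-up stand-in) and uses perl = TRUE so that a look-behind is available.

```r
# Hypothetical stand-in for the page source; the real thepage comes from the webpage
thepage <- '{"title":"x","publishedAtDate":"2020-02-07","other":1}'

# Match the date only when it is preceded by "publishedAtDate":"
m <- regexpr('(?<="publishedAtDate":")[0-9]{4}-[0-9]{2}-[0-9]{2}',
             thepage, perl = TRUE)

# regexpr gives the position; regmatches returns the matched text
art_publishdate <- regmatches(thepage, m)
art_publishdate
# [1] "2020-02-07"
```

Because the look-behind is not part of the match, no quotes end up in the result, so there is nothing for print to escape.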

Related

Extract a number from a string which precedes a phrase in R

I am working in R and would like to extract the token 38y (a two-digit number followed by a letter) from the following string:
"/Users/files/folder/file_number_23a_version_38y_Control.txt"
I know that _Control always comes after the 38y and that 38y is preceded by an underscore. How can I use strsplit or other R commands to extract the 38y?
You could use
x <- "/Users/files/folder/file_number_23a_version_38y_Control.txt"
regmatches(x, regexpr("[^_]+(?=_Control)", x, perl = TRUE))
# [1] "38y"
or equivalently
stringr::str_extract(x, "[^_]+(?=_Control)")
# [1] "38y"
Using gsub.
gsub('.*_(.*)_Control.*', '\\1', x)
# [1] "38y"
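Here the greedy .*_ consumes everything up to the last underscore before the token, and the capture group keeps the text between that underscore and _Control.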
A possible solution:
library(stringr)
text <- "/Users/files/folder/file_number_23a_version_38y_Control.txt"
str_extract(text, "(?<=_)\\d+\\D(?=_Control)")
#> [1] "38y"
You can find an explanation of the regex part at:
https://regex101.com/r/PQSZHX/1
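Since the question asks about strsplit, here is a rough sketch of that route as well. It assumes, as stated in the question, that the wanted token sits directly before the piece that starts with Control; the names parts and idx are illustrative only.

```r
x <- "/Users/files/folder/file_number_23a_version_38y_Control.txt"

# Split on literal underscores
parts <- strsplit(x, "_", fixed = TRUE)[[1]]

# Locate the piece that starts with "Control" and take the one before it
idx <- which(startsWith(parts, "Control"))
token <- parts[idx - 1]
token
# [1] "38y"
```

This avoids regular expressions entirely, at the cost of relying on the underscore layout of the file name.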

Parsing String - Extract Numeric Characters At End

Parsing string fields in R data frames is a bit of a mystery to me I'm afraid...would be grateful for help.
I have a string field which always ends in an indeterminate number of numeric characters. I'd like to write a bit of code to just extract the numeric part at the end of each.
An example of the data format is:
df_test <- data.frame(my_string = c("XXX-0387", "XXXX-1-999999", "XXX 12345432", "XXX-2345", "XXX1234"))
What I'd like is to put the numeric part at the end into a new field but to keep any leading zeros - so presumably the new field would have to be chr rather than int. So my output would look like:
c("0387", "999999", "12345432", "2345", "1234")
Is there an easy way to do this please?
Thank you.
One way is to use sub to capture the run of digits at the end of the string.
sub('.*?(\\d+)$', '\\1', df_test$my_string)
#[1] "0387" "999999" "12345432" "2345" "1234"
Using stringr :
stringr::str_extract(df_test$my_string, '\\d+$')
You can use regexpr with \\d+$ to find the numbers at the end and extract them with regmatches.
regmatches(df_test$my_string, regexpr("\\d+$", df_test$my_string))
#[1] "0387" "999999" "12345432" "2345" "1234"
We can use stri_extract_last from stringi
library(stringi)
stri_extract_last(df_test$my_string, regex = "\\d+")
#[1] "0387" "999999" "12345432" "2345" "1234"
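The question asks for the digits in a new character column. One thing to watch with regmatches is that it drops elements that have no match, so the result can be shorter than the data frame. A small sketch that guards against this follows; the column name num_part and the extra "no digits" row are made up for illustration.

```r
df_test <- data.frame(
  my_string = c("XXX-0387", "XXXX-1-999999", "XXX 12345432",
                "XXX-2345", "XXX1234", "no digits"),  # last row is hypothetical
  stringsAsFactors = FALSE
)

# Position of the trailing digits; -1 where there are none
m <- regexpr("[0-9]+$", df_test$my_string)

# Start with NA and fill in only the rows that matched
df_test$num_part <- NA_character_
df_test$num_part[m > 0] <- regmatches(df_test$my_string, m)

df_test$num_part
# expected values: "0387", "999999", "12345432", "2345", "1234", NA
```

The new column stays of type character, so leading zeros such as in "0387" are preserved.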

How to extract from string with regex, a word and or condition

I'd like to extract a word from a string, but don't know how to proceed :
Say I have these character strings :
a_toto_matthew
a_tutu_matthew
In both cases, I'd like to extract matthew
I tried
gsub("^a_[toto|tutu]_(.*)$", "\\1", "a_toto_matthew")
But it doesn't work.
I could have done :
gsub("^a_.*_(.*)$", "\\1", "a_toto_matthew")
But I find it less elegant. I'd like to know the syntax for matching "toto" or "tutu" in the regex.
Thanks in advance for any guidance,
Mathieu
Another option is a capturing group that matches u or o, followed by a backreference to it, so that only toto or tutu is accepted. For the word, use \w+, or [^\W_]+ to match word characters other than an underscore.
^a_t([uo])t\1_([^\W_]+)$
In the replacement use group 2
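As a sketch of how that pattern might be applied in R: the backslashes have to be doubled inside the string, and perl = TRUE is used here on the assumption that the \W shorthand inside a bracket expression is best left to the PCRE engine. The vector x and the name res are illustrative.

```r
x <- c("a_toto_matthew", "a_tutu_matthew")

# Group 1 captures u or o, \\1 requires the same vowel again,
# group 2 captures the trailing word
res <- sub("^a_t([uo])t\\1_([^\\W_]+)$", "\\2", x, perl = TRUE)
res
# [1] "matthew" "matthew"
```

A string such as "a_totu_matthew" would not match and would be returned unchanged, since the backreference forces both vowels to be the same.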
Try
x <- c("a_toto_matthew", "a_tutu_matthew")
gsub('a_(toto|tutu)_(.*)', '\\2', x)
#[1] "matthew" "matthew"
Using stringr with positive lookbehind
x <- c("a_toto_matthew", "a_tutu_matthew")
stringr::str_extract(x, "(?<=(toto|tutu)_)\\w+")
#[1] "matthew" "matthew"
Or using non-capturing group in str_match
stringr::str_match(x, "(?:toto|tutu)_(\\w+)")[,2]

How to extract everything until first occurrence of pattern

I'm trying to use the stringr package in R to extract everything from a string up until the first occurrence of an underscore.
What I've tried
str_extract("L0_123_abc", ".+?(?<=_)")
> "L0_"
Close but no cigar. How do I get this one? Also, ideally I'd like something that's easy to extend so that I can get the information between the 1st and 2nd underscore and the information after the 2nd underscore.
To get L0, you may use
> library(stringr)
> str_extract("L0_123_abc", "[^_]+")
[1] "L0"
The [^_]+ matches 1 or more chars other than _.
Also, you may split the string with _:
> x <- str_split("L0_123_abc", fixed("_"))
> x
[[1]]
[1] "L0" "123" "abc"
This way, you will have all the substrings you need.
The same can be achieved with
> str_extract_all("L0_123_abc", "[^_]+")
[[1]]
[1] "L0" "123" "abc"
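The lookaround should be a lookahead (?=_) rather than a lookbehind (?<=_), so that the underscore is checked for but not included in the match: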
str_extract("L0_123_abc", ".+?(?=_)")
#[1] "L0"
Using gsub...
gsub("(.+?)(\\_.*)", "\\1", "L0_123_abc")
You can use sub from base R with the pattern _.*, which matches everything from the first _ onward, and replace it with an empty string.
sub("_.*", "", "L0_123_abc")
#[1] "L0"
Or using [^_], which matches any character except _.
sub("([^_]*).*", "\\1", "L0_123_abc")
#[1] "L0"
or using substr with regexpr.
substr("L0_123_abc", 1, regexpr("_", "L0_123_abc")-1)
#substr("L0_123_abc", 1, regexpr("_", "L0_123_abc", fixed=TRUE)-1) #More performant alternative
#[1] "L0"
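For the second part of the question (getting the piece between the 1st and 2nd underscore, and the piece after the 2nd), a base R sketch that also works on a whole vector could look like this. The second input string and the names parts, first, second and third are made up for illustration.

```r
x <- c("L0_123_abc", "L1_456_def")  # second element is a hypothetical example

# One character vector of pieces per input string
parts <- strsplit(x, "_", fixed = TRUE)

# Pick the n-th piece from every element
first  <- vapply(parts, `[`, character(1), 1)
second <- vapply(parts, `[`, character(1), 2)
third  <- vapply(parts, `[`, character(1), 3)

first
# [1] "L0" "L1"
second
# [1] "123" "456"
third
# [1] "abc" "def"
```

Splitting once and indexing afterwards tends to be easier to extend than writing a separate regex for each position.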

Retrieving a specific part of a string in R

I have the following vector of strings:
[1] "/players/playerpage.htm?ilkidn=BRYANPHI01"
[2] "/players/playerpage.htm?ilkidhh=WILLIROB027"
[3] "/players/playerpage.htm?ilkid=THOMPWIL01"
I am looking for a way to retrieve the part of each string that comes after the equal sign, meaning I would like to get a vector like this:
[1] "BRYANPHI01"
[2] "WILLIROB027"
[3] "THOMPWIL01"
I tried using substr, but for it to work I have to know exactly where the equal sign is in the string and where the part I want to retrieve ends.
We can use sub to match the zero or more characters that are not a = ([^=]*) followed by a = and replace it with ''.
sub("[^=]*=", "", str1)
#[1] "BRYANPHI01" "WILLIROB027" "THOMPWIL01"
data
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
"/players/playerpage.htm?ilkidhh=WILLIROB027",
"/players/playerpage.htm?ilkid=THOMPWIL01")
Using stringr,
library(stringr)
word(str1, 2, sep = '=')
#[1] "BRYANPHI01" "WILLIROB027" "THOMPWIL01"
Using strsplit,
strsplit(str1, "=")[[1]][2]
# [1] "BRYANPHI01"
Following Sotos's comment, to get the results for every element as a (named) vector:
sapply(str1, function(x){
strsplit(x, "=")[[1]][2]
})
Another solution based on regex, but extracting instead of substituting, which may be more efficient.
I use the stringi package, whose ICU-based regex engine supports look-behind out of the box (base R functions also support it, but only with perl = TRUE).
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
"/players/playerpage.htm?ilkidhh=WILLIROB027",
"/players/playerpage.htm?ilkid=THOMPWIL01")
library(stringi)
stri_extract_all_regex(str1, pattern="(?<==).+$", simplify=TRUE)
(?<==) is a look-behind: regex will match only if preceded by an equal sign, but the equal sign will not be part of the match.
.+$ matches everything until the end. You could replace the dot with a more precise symbol if you are confident about the format of what you match. For example, \w matches any word character (letters, digits, and underscore), so you could use "(?<==)\\w+$" (the \ must be escaped in an R string, so you end up with \\w).
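For comparison, the same look-behind pattern can be used without extra packages. This is a sketch using base R's regexpr and regmatches with perl = TRUE; the name ids is illustrative.

```r
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
          "/players/playerpage.htm?ilkidhh=WILLIROB027",
          "/players/playerpage.htm?ilkid=THOMPWIL01")

# perl = TRUE switches to the PCRE engine, which understands (?<=...)
ids <- regmatches(str1, regexpr("(?<==)\\w+$", str1, perl = TRUE))
ids
# [1] "BRYANPHI01" "WILLIROB027" "THOMPWIL01"
```

Note that regmatches drops elements without a match, so the result is only guaranteed to line up with str1 when every string contains an equal sign.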
