How to get a date out of a string? [duplicate] - r

I have a file named "test_result_20210930.xlsx" and would like to extract "20210930" into a new variable, date. How should I do that? I think I can use pattern = "[0-9]+", but what if the file name contains more numbers and I only want the part that stands for the date (8 digits together)?
Any suggestions?

Use gsub with \\D+ to match every run of non-digits and replace it with an empty string (""):
gsub("\\D+", "", str1)
[1] "20210930"
If the string also contains other digits and we want to return only the 8-digit part:
sub(".*_(\\d{8})_.*", "\\1", "test_result_20210930_01.xlsx")
[1] "20210930"
Or use str_extract
library(stringr)
str_extract("test_result_20210930_01.xlsx", "(?<=_)\\d{8}(?=_)")
[1] "20210930"
If we need to convert automatically to a date-time object (parse_date returns a POSIXct value):
library(parsedate)
parse_date(str1)
[1] "2021-09-30 UTC"
data
str1 <- "test_result_20210930.xlsx"

You can also use str_extract from the stringr package to obtain the desired result.
library(stringr)
str_extract("test_result_20210930.xlsx", "[0-9]{8}")
# [1] "20210930"
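The answers above can be combined into a base R sketch that pulls out exactly eight consecutive digits and converts them to a Date. The file names below are hypothetical, and the sketch assumes the date is always written as YYYYMMDD and is the only standalone 8-digit run in the name.

```r
# Hypothetical file names, for illustration only
files <- c("test_result_20210930.xlsx",
           "run42_result_20211001_01.xlsx",
           "no_date_here.xlsx")

# Exactly 8 digits, not preceded or followed by another digit
pattern <- "(?<![0-9])[0-9]{8}(?![0-9])"

# regexpr() returns match positions (-1 where nothing matched)
m <- regexpr(pattern, files, perl = TRUE)

# regmatches() returns only the matched pieces, so fill them
# into a vector that keeps NA for names without a date
date_chr <- rep(NA_character_, length(files))
date_chr[m > 0] <- regmatches(files, m)

# Assumes the digits are in year-month-day order
dates <- as.Date(date_chr, format = "%Y%m%d")
```

Unlike gsub("\\D+", "", ...), this leaves the "42" and "01" in the second name out of the result, and a name without a date yields NA rather than an empty string.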

Related

Regex: Match first two digits of a four digit number

I have:
'30Jun2021'
I want to skip/remove the first two digits of the four digit number (or any other way of doing this):
'30Jun21'
I have tried:
^.{0,5}
https://regex101.com/r/hAJcdE/1
I have the first 5 characters but I have not figured out how to skip/remove the '20'
Dates and times are better handled with dedicated date/time functions than with regex.
You can convert the variable to a Date and use format to get the output in any format you like.
x <- '30Jun2021'
format(as.Date(x, '%d%b%Y'), '%d%b%y')
#[1] "30Jun21"
You can also use lubridate::dmy(x) to convert x to date.
You don't even need regex for this. Just use substring operations:
x <- '30Jun2021'
paste0(substr(x, 1, 5), substr(x, 8, 9))
[1] "30Jun21"
Use sub
sub('\\d{2}(\\d{2})$', "\\1", x)
[1] "30Jun21"
or with str_remove
library(stringr)
str_remove(x, "\\d{2}(?=\\d{2}$)")
[1] "30Jun21"
data
x <- '30Jun2021'
You could also match the format of the string with 2 capture groups, where you would match the part that you want to omit and capture what you want to keep.
\b(\d+[A-Z][a-z]+)\d\d(\d\d)\b
sub("\\b(\\d+[A-Z][a-z]+)\\d\\d(\\d\\d)\\b", "\\1\\2", "30Jun2021")
Output
[1] "30Jun21"
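As a small check of the sub approach above on more than one value, here is a sketch with a hypothetical vector. It illustrates that the pattern only fires when the string ends in four digits, so values that already carry a two-digit year pass through unchanged.

```r
# Hypothetical inputs: two 4-digit years and one already shortened value
x <- c("30Jun2021", "01Jan1999", "30Jun21")

# Drop the first two of the final four digits; sub() is vectorised
# and returns elements without a match unchanged
shortened <- sub("\\d{2}(\\d{2})$", "\\1", x)
shortened
```

This is purely a string operation: it does not check that the value is a real date, which is the main reason the as.Date/format route may be preferable when the input is untrusted.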

Regular expressions in R to get the date

What would be a better way to extract only the date? It is a tag in a web page.
I hope someone can help me.
The pattern is this value, which appears in many pages: "publishedAtDate":"2020-02-07"
I would like the following outcome:
2020-02-07
I am using this code:
art_publishdate<-regexpr("publishedAtDate\":\"[0-9]{4}-[0-9]{2}-[0-9]{2}\"", thepage)
but the result includes many backslashes:
[1] "publishedAtDate\":\"2020-02-07\""
Thank you
You could keep only the digits and convert them with as.Date (the replacement is an empty string, since the pattern has no capture group to refer back to):
as.Date(gsub("\\D", "", '"publishedAtDate":"2020-02-07"'), format="%Y%m%d")
# [1] "2020-02-07"
Here are two ways to capture the output.
Using gsub, we remove everything up to the colon (:) as well as every double quote.
string <- '"publishedAtDate":"2020-02-07"'
gsub('.*:|"', '', string)
#[1] "2020-02-07"
Or using sub we can extract date pattern.
sub('.*?(\\d+-\\d+-\\d+).*', '\\1', string)
#[1] "2020-02-07"
Another solution using str_extract from the stringr package:
str_extract(string, "[0-9]{4}-[0-9]{2}-[0-9]{2}")
[1] "2020-02-07"
Alternatively, the date can be extracted thus:
str_extract(string, "[0-9-]+")
[1] "2020-02-07"
Another alternative is using positive look-behind (which encodes the instruction "Match if you see on the left...") as well as a negated character class [^"], which excludes the quote mark but no other character:
str_extract(string, '(?<=:")[^"]*')
[1] "2020-02-07"
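The code in the question calls regexpr, which on its own returns match positions rather than text. A base R sketch of the lookbehind idea from the last answer, paired with regmatches, might look like the following. The page content is a made-up example, and the updatedAtDate key is hypothetical, included only to show that other dates on the page are ignored.

```r
# Hypothetical page fragment, for illustration only
thepage <- '{"title":"x","publishedAtDate":"2020-02-07","updatedAtDate":"2020-03-01"}'

# Match a YYYY-MM-DD date only when it directly follows the key
pattern <- '(?<="publishedAtDate":")[0-9]{4}-[0-9]{2}-[0-9]{2}'

# regexpr() finds the position, regmatches() extracts the text
m <- regexpr(pattern, thepage, perl = TRUE)
art_publishdate <- as.Date(regmatches(thepage, m))
art_publishdate
```

The backslashes seen in the question are only how R prints double quotes inside a string; they are not part of the value. Because the lookbehind keeps the key out of the match, no quotes end up in the result at all.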

Delete everything after second comma from string [duplicate]

I would like to remove everything after the second comma in a string (including the second comma). Here is an example:
x <- 'Day, Bobby, Jean, Gav'
gsub("(.*),.*", "\\1", x)
and it gives:
[1] "Day, Bobby, Jean"
while I want:
[1] "Day, Bobby"
regardless of the number of names that may exist in x
Use
> x <- 'Day, Bobby, Jean, Gav'
> sub("^([^,]*,[^,]*),.*", "\\1", x)
[1] "Day, Bobby"
The ^([^,]*,[^,]*),.* pattern matches
^ - start of string
([^,]*,[^,]*) - Group 1: 0+ non-commas, a comma, and 0+ non-commas
,.* - a comma and the rest of the string.
The \1 in the replacement pattern will keep Group 1 value in the result.
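We can also use strsplit and then paste the first two pieces back together. Splitting on a comma plus any following spaces keeps toString from doubling the spaces:
toString(head(strsplit(x, ",\\s*")[[1]], 2))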
#[1] "Day, Bobby"
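The same regex idea extends to "keep everything before the nth delimiter". The helper below is a hypothetical sketch, not a function from any package; it assumes the delimiter is a single literal character that has no special meaning inside a character class.

```r
# Hypothetical helper: keep the text before the n-th occurrence of sep.
# Assumes sep is one literal character that is safe inside [...]
keep_first_n <- function(x, n, sep = ",") {
  # (n - 1) repetitions of "non-separators then a separator",
  # followed by one more run of non-separators, form group 1;
  # the n-th separator and the rest of the string are dropped
  pattern <- sprintf("^((?:[^%s]*%s){%d}[^%s]*)%s.*$",
                     sep, sep, as.integer(n) - 1L, sep, sep)
  sub(pattern, "\\1", x, perl = TRUE)
}

x <- "Day, Bobby, Jean, Gav"
keep_first_n(x, 2)
keep_first_n(x, 3)
```

Strings with fewer than n delimiters do not match the pattern, so sub returns them unchanged.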

How to extract everything after a specific string?

I'd like to extract everything after "-" in vector of strings in R.
For example in :
test = c("Pierre-Pomme","Jean-Poire","Michel-Fraise")
I'd like to get
c("Pomme","Poire","Fraise")
Thanks !
With str_extract. \\b is a zero-length token that matches a word boundary, i.e. the position between a word character and a non-word character such as -:
library(stringr)
str_extract(test, '\\b\\w+$')
# [1] "Pomme" "Poire" "Fraise"
We can also use a back reference with sub. \\1 refers to the string matched by the first capture group (.+), which here is one or more characters following a - up to the end of the string:
sub('.+-(.+)', '\\1', test)
# [1] "Pomme" "Poire" "Fraise"
This also works with str_replace if that is already loaded:
library(stringr)
str_replace(test, '.+-(.+)', '\\1')
# [1] "Pomme" "Poire" "Fraise"
A third option is to use strsplit and extract the second word from each element of the list (similar to word from @akrun's answer):
sapply(strsplit(test, '-'), `[`, 2)
# [1] "Pomme" "Poire" "Fraise"
stringr also has a str_split variant of this:
str_split(test, '-', simplify = TRUE)[,2]
# [1] "Pomme" "Poire" "Fraise"
We can use sub to match characters (.*) until the - and in the replacement specify ""
sub(".*-", "", test)
Or another option is word
library(stringr)
word(test, 2, sep="-")
I think the other answers might be what you're looking for, but if you don't want to lose the original context you can try something like this:
library(tidyverse)
tibble(test) %>%
  separate(test, c("first", "last"), remove = FALSE)
This will return a dataframe containing the original strings plus components, which might be more useful down the road:
# A tibble: 3 x 3
  test          first  last
  <chr>         <chr>  <chr>
1 Pierre-Pomme  Pierre Pomme
2 Jean-Poire    Jean   Poire
3 Michel-Fraise Michel Fraise
For some reason the responses here didn't work for my particular string. I found this response more helpful (it uses a regex lookbehind with stringr): stringr str_extract capture group capturing everything.
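One reason an answer may not work for a particular string is that it contains more than one hyphen, or none. The sketch below uses a hypothetical vector to show how two base R patterns differ in those cases.

```r
# Hypothetical inputs: one hyphen, two hyphens, no hyphen
test2 <- c("Pierre-Pomme", "Jean-Pierre-Poire", "NoHyphen")

# Greedy .* runs up to the LAST hyphen
after_last <- sub(".*-", "", test2)

# [^-]* cannot cross a hyphen, so this stops at the FIRST one
after_first <- sub("^[^-]*-", "", test2)

# sub() leaves strings without a hyphen unchanged; use NA instead
# if "no hyphen" should count as "nothing to extract"
after_last_or_na <- ifelse(grepl("-", test2, fixed = TRUE),
                           after_last, NA_character_)
```

Here after_last is "Pomme", "Poire", "NoHyphen" and after_first is "Pomme", "Pierre-Poire", "NoHyphen", so the choice of pattern matters as soon as names can contain hyphens themselves.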

formatting the date in R

I have a date value as follows
"'2015-10-24'"
class Character
I am trying to format this value such that it looks like this '10/24/2015'
I know how to use the noquote function to strip the quotes and gsub to replace the - with /, but I am not sure how to reorder the year, month and day so that it looks like '10/24/2015'.
Any help is much appreciated.
We can convert to Date class after removing the ' with gsub, and then use format to get the expected output
format(as.Date(gsub("'", '', v1)), "'%m/%d/%Y'")
#[1] "'10/24/2015'" "'10/25/2015'"
Or without using the gsub to remove ', we can specify the ' also in the format within as.Date
format(as.Date(v1, "'%Y-%m-%d'"), "'%m/%d/%Y'")
#[1] "'10/24/2015'" "'10/25/2015'"
This can be made more compact if we are using library(lubridate)
library(lubridate)
format(ymd(v1), "'%m/%d/%Y'")
#[1] "'10/24/2015'" "'10/25/2015'"
If we don't need the ' in the output, we don't have to specify that in the format,
format(ymd(v1), "%m/%d/%Y")
#[1] "10/24/2015" "10/25/2015"
Or we can do this using only gsub by capturing the characters as a group. In the below code, we capture the first 4 characters (.{4}) as a group by wrapping with parentheses followed by matching the -, then capturing the next two characters, followed by -, and capturing the last two characters. In the replacement, we can shuffle the capture groups as per the requirement. In this case, the second capture group should come first (\\2) followed by /, then the third (\\3) and so on...
gsub('(.{4})-(.{2})-(.{2})', '\\2/\\3/\\1', v1)
#[1] "'10/24/2015'" "'10/25/2015'"
To avoid the quotes,
gsub('.(.{4})-(.{2})-(.{2}).', '\\2/\\3/\\1', v1)
#[1] "10/24/2015" "10/25/2015"
In addition, there are other ways such as splitting the string
vapply(strsplit(v1, "['-]"), function(x) paste(x[c(3,4,2)], collapse='/'), character(1))
#[1] "10/24/2015" "10/25/2015"
or extracting the numeric part with str_extract_all and pasting as before.
library(stringr)
vapply(str_extract_all(v1, '\\d+'), function(x)
paste(x[c(2,3,1)], collapse='/'), character(1))
#[1] "10/24/2015" "10/25/2015"
data
v1 <- c("'2015-10-24'", "'2015-10-25'")
You can also use the function strftime to get the result
d <- "'2015-10-24'"
strftime(as.Date(gsub("'", "", d)), "%m/%d/%Y")
# [1] "10/24/2015"
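One practical difference between the as.Date route and the pure gsub route is how they treat malformed input. A short sketch, using a made-up second value with an impossible month and day:

```r
# Second element is deliberately invalid (month 13, day 45)
v2 <- c("'2015-10-24'", "'2015-13-45'")

# Parsing as a Date validates the value; invalid dates become NA
via_date <- format(as.Date(v2, "'%Y-%m-%d'"), "%m/%d/%Y")

# The regex only rearranges characters and validates nothing
via_regex <- gsub(".(.{4})-(.{2})-(.{2}).", "\\2/\\3/\\1", v2)

via_date
via_regex
```

The first result contains NA for the bad value, while the second happily returns "13/45/2015". Which behaviour is preferable depends on whether the input can be trusted.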
