How to split filename strings and convert to a datetime in R - r

In R I'd like to split file names in the format "a_b_c_d.jpg"
For example:
20190104_080314_2048_1700.jpg
The Date: 2019.01.04 and time 08:03:14 is important to me. The other numbers (2048= pixel, 1700= filter) are not.
So I need the a and b value.
If I use strsplit I get: [1] "a" "b" "c" "d.jpg", but I want [1] a [2] b only.
In the end I want to take the [1] date and [2] time and combine them into one value: 2019-01-04T08:03:14
Does anyone have an idea how to do this?
Thanks for helping me with the programming for my astronomical research on solar activity :)

You can use a regular expression to get the pieces of the string you need.
library(stringr)
x <- '20190104_080314_2048_1700.jpg'
str_replace(x, '(^.{4})(.{2})(.{2})_(.{2})(.{2})(.{2}).*', '\\1-\\2-\\3T\\4:\\5:\\6')
#[1] "2019-01-04T08:03:14"
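The expression is anchored to the start of the string, then captures the first four characters, then the next two characters, and so on. Each pair of parentheses is a capture group: the first is group 1 (written \\1 in the replacement string), the second is group 2, etc.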
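The same capture-group idea can be sketched in base R with sub(), applied to several file names at once. This is only an illustration: the second file name is made up, and the pattern assumes every name starts with YYYYMMDD_HHMMSS.

```r
# Hypothetical file names following the a_b_c_d.jpg layout from the question
files <- c("20190104_080314_2048_1700.jpg", "20190105_121500_2048_1700.jpg")

# Six capture groups: year, month, day, hour, minute, second
pattern <- "^(\\d{4})(\\d{2})(\\d{2})_(\\d{2})(\\d{2})(\\d{2}).*$"

# sub() returns its input unchanged when nothing matches, so check first
stopifnot(all(grepl(pattern, files)))

# Rebuild the string from the groups in ISO 8601 style
iso <- sub(pattern, "\\1-\\2-\\3T\\4:\\5:\\6", files)
iso
```

Checking with grepl() first is a design choice: without it, a malformed file name would pass through silently instead of raising an error.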

There are two steps here. First is to split the string, as you suggest, and second to convert those outputs to a datetime object.
Step 1:
strsplit returns a list with one element per input string. To access the individual pieces, unlist() the result and then select the elements you're after.
t <- "20190104_080314_2048_1700.jpg"
t.split <- unlist(strsplit(t, "_"))[c(1,2)]
# [1] "20190104" "080314"
Step 2:
Now you can convert these two strings to a datetime object of your choice. Using lubridate makes it pretty easy:
library(lubridate)
ymd_hms(paste(t.split[1], t.split[2]))
# [1] "2019-01-04 08:03:14 UTC"
or you can use the base R function strptime:
strptime(paste(t.split[1], t.split[2]), format="%Y%m%d %H%M%S")
# [1] "2019-01-04 08:03:14 PST"
Note the difference in the default timezones, and be sure to specify the right one (both functions take a tz= argument).
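As a rough sketch of the timezone point, the conversion can be done with an explicit tz= and then formatted into the string asked for in the question. The choice of UTC here is an assumption; use whatever zone the instrument clock actually runs on.

```r
# Keep only the date and time pieces of the file name
parts <- strsplit("20190104_080314_2048_1700.jpg", "_", fixed = TRUE)[[1]][1:2]

# Assumption: the timestamps are in UTC; replace tz with the real zone
stamp <- as.POSIXct(paste(parts, collapse = " "),
                    format = "%Y%m%d %H%M%S", tz = "UTC")

# ISO 8601 style string, e.g. for labels or export
iso <- format(stamp, "%Y-%m-%dT%H:%M:%S")
iso
```

Setting tz explicitly makes the result independent of the machine the code runs on, which avoids the UTC versus PST difference seen above.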

Related

R turn 6 digit number into HMS i.e. "130041" into "13:00:41"

As the question states, I want to turn "130041" into "13:00:41" i.e. HMS data
lubridate::ymd("20220413") works no problems but lubridate::hms("130041") does not.
I assume there should be a reasonably simple solution?!
Thank you.
If you need the output as a lubridate Period object rather than a character vector, for example because you want to do arithmetic on it, you can use the approach suggested by Tim Biegeleisen: add colon separators to the character vector and then pass the result to lubridate::hms:
x <- "130041"
gsub("(\\d{2})(?!$)", "\\1:", x, perl = TRUE) |>
lubridate::hms()
# [1] "13H 0M 41S"
The output is similar but it is a Period object. I used a slightly different regex as well (add a colon when there are two digits not followed by the end of string) but it is fundamentally the same approach.
You could use sub here:
x <- "130041"
output <- sub("(\\d{2})(\\d{2})(\\d{2})", "\\1:\\2:\\3", x)
output
[1] "13:00:41"
Another regex, which also works when the hour has only one digit:
gsub("(?=(..){1,2}$)", ":", c("130041", "30041"), perl=TRUE)
#[1] "13:00:41" "3:00:41"
The result can then be used in, e.g., lubridate::hms or hms::as_hms.
base::as.difftime also accepts a format, so no regex is needed: as.difftime("130041", "%H%M%S")
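A small sketch of the as.difftime route, assuming six-digit HHMMSS input (the second value is invented for illustration). Asking for units = "secs" gives a plain number of seconds instead of letting R pick a unit.

```r
# Assumption: every element is a zero-padded HHMMSS string
x <- c("130041", "093000")

# units = "secs" overrides the default "auto" unit selection
secs <- as.numeric(as.difftime(x, format = "%H%M%S", units = "secs"))
secs
```

Note that this relies on the hour being two digits; a five-digit value such as "30041" would need to be padded first.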

Extract dates in a complex string

I have a problem extracting dates from file names. In my example I have the file.name object:
file.name<- c("AZAMBUJAI002A20190518T133231_20190518T133919_T22JCM_2021_05_19_01_18_22.tif","RINCAODOSSOARES051B20210107T133231_20190518T133919_T22JSM_2021_05_19_01_18_22",
"VILAPALMA33K20181018T133231_20190518T133919_T23JCM_2020_05_19_01_18_22.tif")
I need to extract the specific dates 20190518, 20210107 and 20181018 from the file names into a new object. I can't use substr for this because the area names have different lengths (AZAMBUJAI002A, RINCAODOSSOARES051B and VILAPALMA33K), and I can't simply remove the letters either, because of the numeric area IDs (002, 051 and 33). The dates at the end, before ".tif" and separated by "_", are not useful information.
My desirable output is:
mydates
[1] 2019-05-18
[2] 2021-01-07
[3] 2018-10-18
Is there any solution to the problem described? Thanks!!
Solution using base R functions. Works as long as the format is always "yyyymmdd" and the relevant string appears before the first underscore:
file.name<- c("AZAMBUJAI002A20190518T133231_20190518T133919_T22JCM_2021_05_19_01_18_22.tif",
"RINCAODOSSOARES051B20210107T133231_20190518T133919_T22JSM_2021_05_19_01_18_22",
"VILAPALMA33K20181018T133231_20190518T133919_T23JCM_2020_05_19_01_18_22.tif")
Using gsub twice: first (in the inner call) to get rid of everything after the first underscore, and then to extract the sequence of eight digits ([0-9]{8}):
dates <- gsub(".*([0-9]{8}).*", "\\1", gsub("^([^_]*)_.*", "\\1", file.name))
Finally, use as.Date to convert the strings to an R Date object (which can be cast back to a string using format):
dates_as_actual_date <- as.Date(dates, format = "%Y%m%d")
dates_as_actual_date is an R Date object and looks like this:
[1] "2019-05-18" "2021-01-07" "2018-10-18"
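An alternative base R sketch uses regexpr() and regmatches() to pull out the first run of eight digits directly. It rests on the same assumption as above: the first eight consecutive digits in each name are the date of interest.

```r
file.name <- c("AZAMBUJAI002A20190518T133231_20190518T133919_T22JCM_2021_05_19_01_18_22.tif",
               "RINCAODOSSOARES051B20210107T133231_20190518T133919_T22JSM_2021_05_19_01_18_22",
               "VILAPALMA33K20181018T133231_20190518T133919_T23JCM_2020_05_19_01_18_22.tif")

# regexpr() locates the first match per element; regmatches() extracts it.
# Caveat: elements without any match are dropped from the result.
first_run <- regmatches(file.name, regexpr("[0-9]{8}", file.name))

mydates <- as.Date(first_run, format = "%Y%m%d")
mydates
```

Because non-matching elements are dropped, it is worth comparing length(first_run) with length(file.name) before relying on the positions lining up.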
Here is a way to extract the dates using a regex, assuming every year starts with 20:
library(stringr)
library(lubridate)
date_string <- str_extract(file.name,
"20\\d{2}(0[1-9]|1[0-2])(0[1-9]|[12]\\d|3[01])")
date_string
#> [1] "20190518" "20210107" "20181018"
ymd(date_string)
#> [1] "2019-05-18" "2021-01-07" "2018-10-18"
Created on 2021-05-19 by the reprex package (v2.0.0)
library(lubridate)
ymd(gsub("(^.*_)(20[0-9]{2}_)([0-9]{2}_)([0-9]{2}_)(.*$)",
"\\2\\3\\4",
file.name))
ymd is a lubridate function that parses year-month-day dates, almost irrespective of the separator used.
gsub replaces every match of a pattern in a string. The regex inside:
(^.*_) is the first capture group. It takes anything from the beginning up to an underscore.
(20[0-9]{2}_) is the second capture group. It takes a string that starts with 20 and is followed by any two digits and an underscore.
([0-9]{2}_) is used for both the third and the fourth capture group. Each takes two digits followed by an underscore.
(.*$) is the last (5th) capture group. It takes anything up to the end of the string.
"\\2\\3\\4" returns the second, third and fourth capture groups.
EDIT:
The explanation of the code still holds, but to retrieve the dates that come right after the area names, the code needed is this:
ymd(gsub("(^.*[A-Z])(20[0-9]{2})([0-9]{2})([0-9]{2})(.*$)",
"\\2\\3\\4",
file.name))
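To see what each capture group actually holds, the groups can be inspected with regexec() and regmatches() from base R. This is a sketch on a single file name; the pattern is a trimmed variant of the one in the EDIT, keeping only the three groups that matter.

```r
f <- "AZAMBUJAI002A20190518T133231_20190518T133919_T22JCM_2021_05_19_01_18_22.tif"

# Three groups (year, month, day) right after an uppercase letter
pattern <- "^.*[A-Z](20[0-9]{2})([0-9]{2})([0-9]{2}).*$"

# m[1] is the whole match, m[2:4] are the capture groups
m <- regmatches(f, regexec(pattern, f))[[1]]
groups <- m[2:4]
groups

# Join the groups and convert to a Date
d <- as.Date(paste(groups, collapse = "-"))
d
```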

Replace dashes with colon in part of string

I have a dataframe with date and time values as characters, as extracted from a database. Currently the date/time looks like this: 2017-03-17 11-56-20
I want it to look like this: 2017-03-17 11:56:20
It doesn't seem to be as simple as replacing all the dashes using gsub as in R: How to replace . in a string?
I'm thinking it has something to do with the positioning, like telling R to only look after the space. Ideas?
Since the string represents a date-time, you can parse it with strptime:
x <- "2017-03-17 11-56-20"
as.character(strptime(x, "%Y-%m-%d %H-%M-%S", tz = ""))
# [1] "2017-03-17 11:56:20"
Try matching the following pattern:
(\\d\\d)-(\\d\\d)-(\\d\\d)$
and then replace that with:
\\1:\\2:\\3
This will match your timestamp exclusively, because of the terminal anchor $ at the end of the pattern. Then, we rebuild the timestamp the way you want using colons and the three capture groups.
gsub("(\\d\\d)-(\\d\\d)-(\\d\\d)$", "\\1:\\2:\\3", x)
[1] "2017-03-17 11:56:20"
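Since the question mentions a data frame, here is a minimal sketch of applying the same replacement to a whole column. The data frame and its column name are hypothetical stand-ins for the database extract.

```r
# Hypothetical data frame standing in for the database extract
df <- data.frame(stamp = c("2017-03-17 11-56-20", "2017-03-18 08-05-09"),
                 stringsAsFactors = FALSE)

# sub() is vectorized; the $ anchor restricts the match to the time part
df$stamp <- sub("(\\d\\d)-(\\d\\d)-(\\d\\d)$", "\\1:\\2:\\3", df$stamp)
df$stamp
```

sub() is enough here because the anchored pattern can match at most once per string.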
You can use library(anytime) to take care of the formatting for you too (which also coerces to POSIX)
library(anytime)
anytime(x)
# [1] "2017-03-17 11:56:20 AEDT"
as.character(anytime(x))
# [1] "2017-03-17 11:56:20"

Converting to a standard as.Date date-time then back again, without leading zeros

Is it possible to convert from R's default date format to a user-defined format ("m/d/yyyy" here) and avoid getting leading zeros in the resulting date?
In the example below I want to have date_2 look just like date_1. Is there a way to do this with format or another function (ideally in one line of code), or will I need to resort to gsub to find and remove the leading zeros in front of the month ("09") and day ("05") in date_2?
I looked in documentation on DateTimeClasses, strptime, POSIXct, and format, but didn't come across an answer.
date_1<-"9/5/2008"
date_num<-as.Date(date_1,"%m/%d/%Y")
> date_num
[1] "2008-09-05"
date_2<-format(date_num,"%m/%d/%Y")
date_2
[1] "09/05/2008"
We can use gsub
gsub('(?<=\\/)0|^0', '', date_2, perl=TRUE)
#[1] "9/5/2008"
Or another version is
gsub('0(?=[1-9]\\/)', '', date_2, perl=TRUE)
#[1] "9/5/2008"
We could also convert to POSIXlt class and then extract the components, paste it.
v1 <- as.POSIXlt(date_num)
paste(v1$mon+1L, v1$mday, v1$year+1900, sep='/')
#[1] "9/5/2008"
Here is a way to go straight from date_num to your desired result, using the chron package.
paste(chron::month.day.year(date_num), collapse = "/")
# [1] "9/5/2008"
This also works nicely for multiple dates. The code is slightly different, as we need to use do.call() here.
do.call(paste, c(chron::month.day.year(Sys.Date()-0:3), sep = "/"))
# [1] "10/9/2015" "10/8/2015" "10/7/2015" "10/6/2015"
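A base R sketch without regex or extra packages: format each component separately and let as.integer() drop the zero padding. The second date is only there to show that two-digit months are left alone.

```r
d <- as.Date(c("2008-09-05", "2015-10-09"))

# as.integer() removes the leading zeros that format() produces
out <- sprintf("%d/%d/%s",
               as.integer(format(d, "%m")),
               as.integer(format(d, "%d")),
               format(d, "%Y"))
out
```

sprintf() is vectorized, so this handles a whole vector of dates in one call.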

How can I append dates to a vector in R?

I created a vector using the vector() function:
actual_dates_vector <- vector()
I then extract the Julian date (eg: 2008201) from a text string:
julian_date<-substr(files[r],10,16)
I then convert the Julian date into YYYY-MM-DD format:
actual_date<-strptime(julian_date, "%Y %j")
This gives me a value like "2009-07-28". I then need to append this to the vector initially created. For which I do this:
actual_dates_vector<-c(actual_dates_vector,actual_date)
But this gives me:
$sec
[1] 0
$min
[1] 0
$hour
[1] 0
$mday
[1] 28
$mon
[1] 6
$year
[1] 109
$wday
[1] 2
$yday
[1] 208
$isdst
[1] 1
I don't understand what's going on. This code actually runs in a loop over multiple dates, so I want the date to be extracted from each date string, converted to YYYY-MM-DD format and appended to the vector. Is there a way to do this?
Thanks.
If you prefer a "loop and append" approach, you can do it as follows:
# random data to emulate your files
files <- c("2008281","2009128","2010040")
n_files <- length(files)
# loop & append
actual_dates_vector <- character()
for (r in seq_len(n_files)) {
  # parse the year + day-of-year string into a POSIXct object
  dts <- as.POSIXct(files[r], format = "%Y%j")
  # convert dts (a POSIXct object) to character in the desired format
  dts <- format(dts, format = "%Y-%m-%d")
  actual_dates_vector <- c(actual_dates_vector, dts)
}
Date-time objects are actually something else under the hood. As you have seen, POSIXlt objects are lists of date components, while POSIXct objects are basically doubles, so neither is what you see when you print them (the printed format also depends on the locale settings, so you can get different results on different machines).
For this reason, and since you stated that you want a specific representation of the dates (namely YYYY-MM-DD), I suggest following the approach above and storing the result in a character vector with the desired format.
strptime returns a POSIXlt object which is actually a list like you're seeing. If you use as.POSIXct instead of strptime you'll get the result you want.
Also, all the functions you're calling are vectorized so you don't need to do this append strategy, instead you should be able to:
strptime(substr(files, 10 ,16), '%Y %j')
Or something along those lines.
As pointed out in the comments, as.POSIXct calls strptime under the hood.
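Putting the vectorized idea together, a minimal sketch might look like this. The file names are hypothetical, chosen so that characters 10 to 16 hold the year plus day of year, and as.Date is swapped in for strptime here because only the calendar date is needed.

```r
# Hypothetical file names with the Julian date in characters 10-16
files <- c("MOD13Q1.A2008201.hdf", "MOD13Q1.A2009209.hdf")

# substr() and as.Date() are both vectorized, so no loop is needed
julian <- substr(files, 10, 16)
actual_dates <- as.Date(julian, format = "%Y%j")

# Appending works as expected when both sides are Date objects
actual_dates <- c(actual_dates, as.Date("2010-02-09"))
actual_dates
```

Starting from an empty vector() and appending a POSIXlt value, as in the question, falls back to the default c() method, which is why the list components were printed instead of a date.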

Resources