Extract two combined dates and times from a string


Could you please let me know how I can extract date and time from ("2015-08-11 03:14:00 UTC--2015-08-11 04:14:00 UTC"). Note that this string contains a time interval with two dates and two times. I would like to break it down into 4 individual strings such as Date 1, Time 1, Date 2, Time 2 and then store them in 4 separate vectors.
Thanks.

Try the following.
x <- "2015-08-11 03:14:00 UTC--2015-08-11 04:14:00 UTC"
y <- strsplit(x, "--")[[1]]
dates <- as.Date(y)
times <- strftime(y, format = "%H:%M:%S")
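If the goal is the four separate vectors the question describes, a minimal sketch along the same lines might look like this. It assumes every element contains exactly one "--" separator and that the stamps are in UTC. The names start, end, date1, time1, date2 and time2 are illustrative only, and the second input string is made up to show that the approach is vectorized.

```r
# Two interval strings; the second is a made-up example
x <- c("2015-08-11 03:14:00 UTC--2015-08-11 04:14:00 UTC",
       "2015-08-12 23:30:00 UTC--2015-08-13 00:30:00 UTC")

# One row per interval, one column per endpoint
parts <- do.call(rbind, strsplit(x, "--", fixed = TRUE))

# strptime ignores the trailing " UTC", so the zone is given explicitly
start <- as.POSIXct(parts[, 1], format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
end   <- as.POSIXct(parts[, 2], format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Four character vectors: Date 1, Time 1, Date 2, Time 2
date1 <- format(start, "%Y-%m-%d")
time1 <- format(start, "%H:%M:%S")
date2 <- format(end, "%Y-%m-%d")
time2 <- format(end, "%H:%M:%S")
```

Keeping start and end as POSIXct also makes it easy to compute the interval length later with difftime(end, start).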

You never mentioned whether you need genuine date and time objects from your input string. If you simply need to extract each portion of the timestamp as text, then gsub is one option.
x <- "2015-08-11 03:14:00 UTC--2015-08-11 04:14:00 UTC"
y <- unlist(strsplit(x, "--"))
# gsub is vectorized, so sapply is not needed (it would also add names to the result)
dates <- gsub("(\\d{4}-\\d{2}-\\d{2}).*", "\\1", y)
times <- gsub(".*(\\d{2}:\\d{2}:\\d{2}.*)", "\\1", y)
dates
[1] "2015-08-11" "2015-08-11"
times
[1] "03:14:00 UTC" "04:14:00 UTC"

Related

parse multiple date formats by index in R

I'm trying to parse multiple date formats based on their position in a vector of dates. At some point the data switched the format it used from y/m/d to y/d/m. This is annoying for dates like 2010/07/03, which are valid under either order, so specifying a single order in lubridate does not help.
This is an example of the dates:
datevec <- c("2011/07/01", "2011/07/02", "2011/07/03", "2011/02/07" )
The dates are set up so that before a certain row they are in one format and from that row on they are in another, so I'm trying to pass an index to the function.
When I tried to parse them using this plus lubridate, it only returned 3 dates.
lapply(datevec, function(x, i) ifelse( x[i] <4, parse_date_time(x, "%Y-%m-%d"), parse_date_time(x,"%Y-%d-%m" )) )

1) If we change the ifelse in the question to a plain if, then the basic idea in the question works with appropriate modifications. Note that it gives a list L, so assuming we really want a vector we add the last line of code.
library(lubridate)
f <- function(x, i) {
  if (i < 4) parse_date_time(x, "ymd") else parse_date_time(x, "ydm")
}
L <- Map(f, datevec, seq_along(datevec), USE.NAMES = FALSE)
do.call("c", L)
## [1] "2011-07-01 UTC" "2011-07-02 UTC" "2011-07-03 UTC" "2011-07-02 UTC"
2) Use the ifelse on the format part rather than on the date part and use as.Date instead of parse_date_time:
ix <- seq_along(datevec)
as.Date(datevec, ifelse(ix < 4, "%Y/%m/%d", "%Y/%d/%m"))
## [1] "2011-07-01" "2011-07-02" "2011-07-03" "2011-07-02"
3) Convert the first 3 using ymd and the rest using ydm and then concatenate.
c(ymd(head(datevec, 3)), ydm(tail(datevec, -3)))
## [1] "2011-07-01" "2011-07-02" "2011-07-03" "2011-07-02"
4) or with only base R:
c(as.Date(head(datevec, 3)), as.Date(tail(datevec, -3), "%Y/%d/%m"))
## [1] "2011-07-01" "2011-07-02" "2011-07-03" "2011-07-02"
5) Another approach is to convert the later dates using string manipulation so that all the dates are in the same format and then use as.Date or ymd:
ix <- seq_along(datevec)
swap <- sub("(..)/(..)$", "\\2/\\1", datevec)
as.Date(ifelse(ix < 4, datevec, swap))
## [1] "2011-07-01" "2011-07-02" "2011-07-03" "2011-07-02"
6) The code above returns Date class, which is more appropriate for dates without times, but if for some reason you really need POSIXct, use as.POSIXct on the above or else use parse_date_time like this:
c(parse_date_time(head(datevec, 3), "ymd"), parse_date_time(tail(datevec, -3), "ydm"))
## [1] "2011-07-01 UTC" "2011-07-02 UTC" "2011-07-03 UTC" "2011-07-02 UTC"
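The index-based idea in (2) can be wrapped so that the switch point is a parameter rather than a hard-coded 4. This is only a sketch in base R: parse_switched and first_ydm are hypothetical names, and it assumes the vector contains exactly one switch from y/m/d to y/d/m.

```r
# Hypothetical helper: elements before `first_ydm` are y/m/d, the rest y/d/m
parse_switched <- function(x, first_ydm) {
  fmt <- ifelse(seq_along(x) < first_ydm, "%Y/%m/%d", "%Y/%d/%m")
  out <- as.Date(x, format = fmt)  # the format vector is applied element by element
  if (anyNA(out)) warning("some dates did not match their assigned format")
  out
}

datevec <- c("2011/07/01", "2011/07/02", "2011/07/03", "2011/02/07")
res <- parse_switched(datevec, first_ydm = 4)
res
```

The warning is there because a wrong switch index often shows up as NA, for example when a day value above 12 is read as a month.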

How to change only the year value in a POSIXct Value

I would like to change only the year on a POSIX date-time value, for example 2013-12-30 XX:XX:XX to 2012-12-30 XX:XX:XX. I would like this to be general, as there are hundreds of instances with different hours. Is this possible while keeping the column as a POSIX value?

1) Base R. Convert to POSIXlt, subtract one from the year component and convert back to POSIXct. No packages are used.
yearMinus <- function(x, n = 1) {
  lt <- as.POSIXlt(x)
  lt$year <- lt$year - n
  as.POSIXct(lt)
}
# test
datetimes <- as.POSIXct( c("2013-12-30 03:02:01", "2013-12-30 03:02:01") )
yearMinus(datetimes)
## [1] "2012-12-30 03:02:01 EST" "2012-12-30 03:02:01 EST"
2) gsubfn. Convert to character, match 4 digits, convert the match to numeric and subtract 1 (done in the second argument, which expresses the transformation in formula notation), and then convert back to POSIXct. The replacement is done in one gsubfn call.
library(gsubfn)
as.POSIXct(gsubfn("\\d{4}", ~ as.numeric(year) - 1, as.character(datetimes)))
## [1] "2012-12-30 03:02:01 EST" "2012-12-30 03:02:01 EST"
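A variant of the base R approach sets the year to a fixed value instead of subtracting from it. This is a minimal sketch, not part of the original answer: set_year is a hypothetical name, and tz = "UTC" is used only to keep the example reproducible.

```r
# Hypothetical helper: replace the year, keep month, day and time of day
set_year <- function(x, year) {
  lt <- as.POSIXlt(x)
  # POSIXlt stores years as an offset from 1900
  lt$year <- rep(as.integer(year) - 1900L, length(x))
  as.POSIXct(lt)
}

datetimes <- as.POSIXct(c("2013-12-30 03:02:01", "2013-12-30 17:45:10"), tz = "UTC")
fixed <- set_year(datetimes, 2012)
format(fixed, "%Y-%m-%d %H:%M:%S")
```

A 29 February input has no counterpart in a non-leap target year, so that case would need its own check.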

If you want to subtract a year from each timestamp:
df$time - lubridate::years(1)
If you want to change only a specific date without changing the time, we can use sub:
df$time <- as.POSIXct(sub("2013-12-30", "2012-12-30", df$time))

How to fast convert different time formats in large data frames?

I want to calculate length in different time dimensions but I have problems dealing with the two slightly different time formats in my data frame column.
The original data frame column has about a million rows, with the two formats (shown in the example code) mixed together.
Example code:
time <- c("2018-07-29T15:02:05Z", "2018-07-29T14:46:57Z",
"2018-10-04T12:13:41.333Z", "2018-10-04T12:13:45.479Z")
length <- c(15.8, 132.1, 12.5, 33.2)
df <- data.frame(time, length)
df$time <- format(as.POSIXlt(strptime(df$time,"%Y-%m-%dT%H:%M:%SZ", tz="")))
df
The formats "2018-10-04T12:13:41.333Z" and "2018-10-04T12:13:45.479Z" result in NA.
Is there a solution that would also be applicable to a big data frame where the two formats are mixed up?

We may use %OS instead of %S to account for decimals in seconds.
help("strptime")
Specific to R is %OSn, which for output gives the seconds truncated to
0 <= n <= 6 decimal places (and if %OS is not followed by a digit, it
uses the setting of getOption("digits.secs"), or if that is unset, n =
0).
as.POSIXct(time, format="%Y-%m-%dT%H:%M:%OSZ")
# [1] "2018-07-29 15:02:05 CEST" "2018-07-29 14:46:57 CEST"
# [3] "2018-10-04 12:13:41 CEST" "2018-10-04 12:13:45 CEST"
In the benchmark below, this base R call needs roughly a tenth of the time anytime does; lubridate is faster still but produces NAs on the mixed data. Try it yourself.
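Since the trailing Z marks these stamps as UTC, it may be safer to state the zone explicitly rather than rely on the session's local time zone (CEST in the output above). A minimal sketch, assuming all elements end in Z; the names parsed, whole and gap are illustrative.

```r
time <- c("2018-07-29T15:02:05Z", "2018-07-29T14:46:57Z",
          "2018-10-04T12:13:41.333Z", "2018-10-04T12:13:45.479Z")

# %OS accepts seconds with or without a fractional part;
# tz = "UTC" because the trailing Z denotes UTC
parsed <- as.POSIXct(time, format = "%Y-%m-%dT%H:%M:%OSZ", tz = "UTC")

anyNA(parsed)                                  # FALSE if every element matched
whole <- format(parsed, "%Y-%m-%d %H:%M:%S")   # whole seconds only
gap <- difftime(parsed[4], parsed[3], units = "secs")  # fractions are kept internally
```

The fractional seconds are not printed by default but still take part in arithmetic, which is what matters when computing lengths of time.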
Update 1
time2 <- c("2018-09-01T12:42:37.000+02:00", "2018-10-01T11:42:37.000+03:00")
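This one is trickier. ?strptime says we should use %z for offsets from UTC, but it would not parse these strings with as.POSIXct, possibly because the offset contains a colon. Instead we can read the clock time as UTC and subtract the offset, since +02:00 means the clock time is two hours ahead of UTC:
off <- substring(time2, 24)  # "+02:00" "+03:00"
hm <- do.call(rbind, lapply(strsplit(substring(off, 2), ":"), as.numeric))
secs <- ifelse(startsWith(off, "-"), -1, 1) * (hm[, 1]*60 + hm[, 2])*60
as.POSIXct(substr(time2, 1, 23), format="%Y-%m-%dT%H:%M:%OS", tz="UTC") - secs
# [1] "2018-09-01 10:42:37 UTC" "2018-10-01 08:42:37 UTC"
This cuts the offset from each string, converts it to seconds and subtracts it from the "POSIXct" object.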
If the offsets are whole hours, as in time2, we could also say:
# the offset hours are subtracted to get UTC
as.POSIXct(substr(time2, 1, 23), format="%Y-%m-%dT%H:%M:%OS", tz="UTC") -
  as.numeric(substr(time2, 24, 26))*3600
# [1] "2018-09-01 10:42:37 UTC" "2018-10-01 08:42:37 UTC"
The code is slightly longer now, but it runs practically as fast as the one at the top of the answer.
Update 2
You could wrap the three variants into one function that splits the input by nchar(x) and converts each group with its own format, such as this one:
fixDateTime <- function(x) {
  n <- nchar(x)
  s <- split(x, n)
  # separate ifs, so that every format present in x gets converted
  if ("20" %in% names(s))
    s$`20` <- as.POSIXct(s$`20`, format="%Y-%m-%dT%H:%M:%SZ", tz="UTC")
  if ("24" %in% names(s))
    s$`24` <- as.POSIXct(s$`24`, format="%Y-%m-%dT%H:%M:%OSZ", tz="UTC")
  if ("29" %in% names(s)) {
    # signed offset in seconds, e.g. "+03:00" -> 10800
    os <- as.numeric(substr(s$`29`, 24, 26))*3600 +
      as.numeric(paste0(substr(s$`29`, 24, 24), substr(s$`29`, 28, 29)))*60
    s$`29` <- as.POSIXct(substr(s$`29`, 1, 23), format="%Y-%m-%dT%H:%M:%OS", tz="UTC") - os
  }
  unsplit(s, n)
}
res <- fixDateTime(time3)
res
# [1] "2018-07-29 15:02:05 UTC" "2018-10-04 12:13:41 UTC" "2018-10-01 08:42:37 UTC"
str(res)
# POSIXct[1:3], format: "2018-07-29 15:02:05" "2018-10-04 12:13:41" "2018-10-01 08:42:37"
Compared to the packages only fixDateTime can handle all three defined date-time types. According to the concluding benchmark the function is still very fast.
Note: The function logically fails if different date formats have the same nchar, and it should be customized in that case (e.g. by another split condition)! Not tested: daylight saving time behavior when adding seconds to POSIXct.
Benchmark
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# fixDateTime 35.46387 35.94761 40.07578 36.05923 39.54706 68.46211 10 c
# as.POSIXct 20.32820 20.45985 21.00461 20.62237 21.16019 23.56434 10 b # to compare
# lubridate 11.59311 11.68956 12.88880 12.01077 13.76151 16.54479 10 a # produces NAs!
# anytime 198.57292 201.06483 203.95131 202.91368 203.62130 212.83272 10 d # produces NAs!
Data
time <- c("2018-07-29T15:02:05Z", "2018-07-29T14:46:57Z", "2018-10-04T12:13:41.333Z",
"2018-10-04T12:13:45.479Z")
time2 <- c("2018-07-29T15:02:05Z", "2018-07-29T15:02:05Z", "2018-07-29T15:02:05Z")
time3 <- c("2018-07-29T15:02:05Z", "2018-10-04T12:13:41.333Z",
"2018-10-01T11:42:37.000+03:00")
Benchmark code
n <- 1e3
t1 <- sample(time2, n, replace=TRUE)
t2 <- sample(time3, n, replace=TRUE)
library(lubridate)
library(anytime)
microbenchmark::microbenchmark(fixDateTime=fixDateTime(t2),
as.POSIXct=as.POSIXct(t1, format="%Y-%m-%dT%H:%M:%OSZ"),
lubridate=parse_date_time(t2, "ymd_HMS"),
anytime=anytime(t2),
times=10L)

You can use library anytime
library(anytime)
time<- c("2018-07-29T15:02:05Z",
"2018-07-29T14:46:57Z",
"2018-10-04T12:13:41.333Z",
"2018-10-04T12:13:45.479Z")
anytime(time)
#[1] "2018-07-29 15:02:05 CEST" "2018-07-29 14:46:57 CEST" "2018-10-04 12:13:41 CEST" "2018-10-04 12:13:45 CEST"

or you can also use:
time<- c("2018-07-29T15:02:05Z",
"2018-07-29T14:46:57Z",
"2018-10-04T12:13:41.333Z",
"2018-10-04T12:13:45.479Z")
length<-c(15.8,132.1,12.5,33.2)
df<-data.frame(time,length)
library(lubridate)
# df$time2<-as_datetime(df$time)
df$time2 <-parse_date_time(df$time, "ymd_HMS")
df

Most efficient way to have R recognize " 41520092010" as "04/15/2009 20:10"

I need to find the duration of a large number of events by using the start and end time variables in a dataset, but both variables encode the time in the annoying format "mmddyyyyhhmm", with the cherry on top being that the first nine months are encoded as single digits (January is " 1" rather than "01"). At least the time uses a twenty-four-hour clock (assuming the people filling out each event did it right).
I know there has to be a fairly simple way to do this, but I can't think of one and suspect one of you fine folks have it memorized and can write it out in a couple of seconds.

One possibility consists in using the stringr library in combination with the lubridate library:
DatesAndTimes <- c("41520092010","121520092010")
library(stringr)
library(lubridate)
mdy_hm(str_pad(DatesAndTimes, 12, side="left", pad="0"))
#[1] "2009-04-15 20:10:00 UTC" "2009-12-15 20:10:00 UTC"

If you have a vector x with character values for conversion ...
x <- c("41520092010", "11520092010", "121520092010")
... you can check this vector for 11 characters (or whatever). If an element has 11 characters, we paste a zero on the front, then convert the whole vector to POSIXt.
as.POSIXct(
ifelse(nchar(x) == 11, paste0("0", x), x),
format = "%m%d%Y%H%M",
tz = "UTC"
)
# [1] "2009-04-15 20:10:00 UTC" "2009-01-15 20:10:00 UTC"
# [3] "2009-12-15 20:10:00 UTC"
If you don't like ifelse(), you can use replace().
replace(x, nchar(x) == 11, paste0("0", x[nchar(x) == 11]))
or formatC()
formatC(as.numeric(x), digits = 12, width = 12, flag = "0")
The most efficient of these is likely formatC().
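Since the question is ultimately about durations, the padding step can be combined with difftime. The following is a sketch under stated assumptions: to_time is a hypothetical helper, the start and end values are made up, sprintf is used for the zero padding instead of the options above, and tz = "UTC" is chosen only to avoid daylight saving effects.

```r
# Hypothetical helper: parse "mmddyyyyhhmm" where the leading zero is missing or blank
to_time <- function(x) {
  padded <- sprintf("%012.0f", as.numeric(x))  # as.numeric ignores the leading blank
  as.POSIXct(padded, format = "%m%d%Y%H%M", tz = "UTC")
}

# Made-up start and end stamps
start <- c(" 41520092010", "121520092010")
end   <- c(" 41520092145", "121620090010")

duration <- difftime(to_time(end), to_time(start), units = "mins")
duration
```

Fixing units in difftime keeps the result comparable across rows; otherwise R picks a unit based on the size of the differences.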

Extract time from timestamp?

Essentially, I want only the hour, minute, and seconds from a column of timestamps I have in R. I want to see how often data points occur at different times of day, so the date itself is irrelevant.
However, this is how the timestamps are structured in the dataset:
2008-08-07T17:07:36Z
And I'm unsure how to only get that time from this timestamp.
Thank you for any help you can provide and please just let me know if I can provide more information!

We can use strptime to convert to a datetime class and then format to extract the hour:min:sec.
dtime <- strptime(str1, "%Y-%m-%dT%H:%M:%SZ")
format(dtime, "%H:%M:%S")
#[1] "17:07:36"
If the OP wants to have the hour, min, sec as separate columns
read.table(text=format(dtime, "%H:%M:%S"), sep=":", header=FALSE)
# V1 V2 V3
#1 17 7 36
Another option is using lubridate
library(lubridate)
format(ymd_hms(str1), "%H:%M:%S")
#[1] "17:07:36"
data
str1 <- "2008-08-07T17:07:36Z"

Just
x <- '2008-08-07T17:07:36Z'
substr(x, 12, 19)
#[1] "17:07:36"
...will do it if the timestamp is consistent, which I imagine it would be given it is an ISO_8601 ( https://en.wikipedia.org/wiki/ISO_8601 ) string.

I think you are expecting this...
Sys.time()
[1] "2016-04-19 11:09:30 IST"
format(Sys.time(),format = '%T')
[1] "11:09:30"
If you want to supply your own timestamp, then use the code below:
format(as.POSIXlt("2016-04-19 11:02:22 IST"),format = '%T')
[1] "11:02:22"

A regular expression will probably be quite efficient for this:
x <- '2008-08-07T17:07:36Z'
x
## [1] "2008-08-07T17:07:36Z"
sub('.*T(.*)Z', '\\1', x)
## [1] "17:07:36"
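For the stated goal of seeing how often data points occur across the day, it may help to keep the time of day numeric rather than as text. A minimal sketch, assuming the timestamps are UTC as the trailing Z suggests; the sample stamps are made up and the variable names are illustrative.

```r
# Made-up timestamps in the format from the question
stamps <- c("2008-08-07T17:07:36Z", "2008-08-08T17:45:00Z", "2008-08-09T03:15:10Z")

parsed <- as.POSIXct(stamps, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")

# Counts per hour of day, including hours with no observations
hour_of_day <- as.integer(format(parsed, "%H"))
counts <- table(factor(hour_of_day, levels = 0:23))

# Seconds since midnight, for finer-grained summaries or histograms
lt <- as.POSIXlt(parsed)
secs_of_day <- lt$hour * 3600 + lt$min * 60 + lt$sec
```

The counts table can be passed straight to barplot(), and secs_of_day to hist(), to view the distribution over the day.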
