R parse timestamp of form %j%Y with no leading zeroes

I am working with csv timestamp data given in the form '%j%Y %H:%M' with no leading zeroes. Here are some timestamp examples:
112005 22:00
1292005 6:00
R is reading the first line as the 112th day of year 005. How can I make R correctly parse this information?
Code I'm using which doesn't work:
train$TIMESTAMP <- strptime(train$TIMESTAMP, format='%j%Y %H:%M', tz='GMT')
train$hour <- as.numeric(format(train$TIMESTAMP, '%H'))

I don't think there's any simple way to decipher where the day stops and the year starts. Maybe you could split it at something that looks like a relevant year (20XX):
gsub("^(\\d{1,3})(20\\d{2})","\\1 \\2",train$TIMESTAMP)
#[1] "11 2005 22:00" "129 2005 6:00"
and do:
strptime(gsub("^(\\d{1,3})(20\\d{2})","\\1 \\2",train$TIMESTAMP), "%j %Y %H:%M")
#[1] "2005-01-11 22:00:00 EST" "2005-05-09 06:00:00 EST"

Related

Read improper time formats in R [duplicate]

I would like to convert a string to time. I have a time field where the string has only four digits and a letter (A or P). There is no colon between the digits showing it is a time. I would like to convert the string, which is in 12-hour format, to a 24-hour time so I can drop the A and P.
Here is an example:
time = c("1110A", "1120P", "0420P", "0245P")
I'm looking for a time class that looks like this:
Answer= c('11:10', '23:20', '16:20', '14:45')
Any help would be greatly appreciated.
You can use the function strptime to create dates from strings after making one small change to your strings.
time <- c("1110A", "1120P", "0420P", "02:45P")
time <- gsub(":", "", time)
time <- strptime(x = paste0(time, "m"), format = "%I%M%p")
paste0 is needed so that strptime can parse with the format we've given it: appending "m" turns the trailing "A"/"P" into "AM"/"PM". %I is the hour on a 12-hour clock (01-12), %M is the minute, and %p matches AM/PM.
Once it's parsed as a date, you can use format for pretty printing, or use the normal operators on it like +, -, diff, etc....
strptime gives you a lot of flexibility when parsing dates, but sometimes you have to try a few things when dates are not in a standard format.
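For example, to get exactly the 24-hour strings asked for in the question, format can be applied to the parsed times (a small sketch using the time object parsed above):
format(time, "%H:%M")
# [1] "11:10" "23:20" "16:20" "14:45"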
We could also use the lubridate functions to parse the strings after pasting on a date:
library(lubridate)
library(glue)
ymd_hm(glue("2018-01-01 {time}M"))
#[1] "2018-01-01 11:10:00 UTC" "2018-01-01 23:20:00 UTC"
#[3] "2018-01-01 16:20:00 UTC" "2018-01-01 14:45:00 UTC"
In your question, you say that you want to be able to subtract these times. I think it makes the most sense to convert them to POSIXct objects. If you want a specific day/month/year, you need to append it to your string as below; otherwise you can leave the date out and it will default to today:
date2 = as.POSIXct(paste0("01-01-2018 ", time, "m"), format = "%m-%d-%Y %I%M%p")
date2
#[1] "2018-01-01 11:10:00 EST" "2018-01-01 23:20:00 EST" "2018-01-01 16:20:00 EST" "2018-01-01 14:45:00 EST"

NA for 1 particular date when converting dates from "character" format to "POSIXct" with as.POSIXct

I'm converting a string vector to date format with as.POSIXct().
Here is the strange thing:
as.POSIXct("2017-03-26 03:00:00.000",format="%Y-%m-%d %H")
#Gives
"2017-03-26 03:00:00 CEST"
#While
as.POSIXct("2017-03-26 02:00:00.000",format="%Y-%m-%d %H")
#Outputs
NA
This is really confusing and frustrating. It seems like the function really doesn't like the specific time:
02:00:00.000
We can use %T for the time. The input contains minutes, seconds and milliseconds, so %H on its own only matches the hour part:
as.POSIXct("2017-03-26 02:00:00.000",format="%Y-%m-%d %T")
[1] "2017-03-26 02:00:00 EDT"
Or to take care of the milliseconds as well
as.POSIXct("2017-03-26 02:00:00.000",format="%Y-%m-%d %H:%M:%OS")
#[1] "2017-03-26 02:00:00 EDT"
Or using lubridate
library(lubridate)
ymd_hms("2017-03-26 02:00:00.000")
This was a daylight saving time issue: the time "2017-03-26 02:00:00.000" does not exist in Sweden, as we lost an hour on this date when changing to "summer time".

R - Formatting dates in dataframe - mix of decimal and character values

I have a date column in a dataframe. I have read this df into R using openxlsx. The column is 'seen' as a character vector when I use typeof(df$date).
The column contains date information in several formats and I am looking to get this into the one format.
#Example
date <- c("43469.494444444441", "12/31/2019 1:41 PM", "12/01/2019 16:00:00")
#What I want -updated
fixed <- c("2019-04-01", "2019-12-31", "2019-12-01")
I have tried many workarounds including openxlsx::convertToDate, lubridate::parse_date_time and lubridate::date_decimal.
openxlsx::convertToDate so far works best, but it will only take one format and coerces NAs for the others.
update
I realized I actually had one of the above output dates wrong.
Value 43469.494444444441 should convert to 2019-04-01.
Here is one way to do this in two steps: convert the Excel serial dates separately from all the other date formats. If you have more date formats, they can be added in parse_date_time.
temp <- lubridate::parse_date_time(date, c('mdY IMp', 'mdY HMS'))
temp[is.na(temp)] <- as.Date(as.numeric(date[is.na(temp)]), origin = "1899-12-30")
temp
#[1] "2019-01-04 11:51:59 UTC" "2019-12-31 13:41:00 UTC" "2019-12-01 16:00:00 UTC"
as.Date(temp)
#[1] "2019-01-04" "2019-12-31" "2019-12-01"
You could use a helper function to normalize the dates which might be slightly faster than lubridate.
There are weird origins in MS Excel that depend on the platform. So if the data are imported from different platforms, you may want to work with dummy variables.
normDate <- Vectorize(function(x) {
  if (!is.na(suppressWarnings(as.numeric(x))))   # Excel serial number (Windows origin)
    as.Date(as.numeric(x), origin = "1899-12-30")
  else if (grepl("A|P", x))                      # 12-hour clock with AM/PM
    as.Date(x, format = "%m/%d/%Y %I:%M %p")
  else                                           # 24-hour clock (%R = %H:%M)
    as.Date(x, format = "%m/%d/%Y %R")
})
For additional date formats just add another else if. Format specifications can be found with ?strptime.
Then just use as.Date() with usual origin.
res <- as.Date(normDate(date), origin="1970-01-01")
# 43469.494444444441 12/31/2019 1:41 PM 12/01/2019 16:00:00
# "2019-01-04" "2019-12-31" "2019-12-01"
class(res)
# [1] "Date"
Edit: To achieve a specific output format, use format, e.g.
format(res, "%Y-%d-%m")
# 43469.494444444441 12/31/2019 1:41 PM 12/01/2019 16:00:00
# "2019-04-01" "2019-31-12" "2019-01-12"
format(res, "%Y/%d/%m")
# 43469.494444444441 12/31/2019 1:41 PM 12/01/2019 16:00:00
# "2019/04/01" "2019/31/12" "2019/01/12"
To look up the format codes, type ?strptime.

Two Timestamp Formats in R

I have a timestamp column that I am converting into a POSIXct. The problem is that there are two different formats in the same column, so if I use the more common conversion the other gets converted into NA.
MC$Date
12/1/15 22:00
12/1/15 23:00
12/2/15
12/2/15 1:00
12/2/15 2:00
I use the following code to convert to a POSIXct:
MC$Date <- as.POSIXct(MC$Date, tz='MST', format = '%m/%d/%Y %H:%M')
The results:
MC$Date
15-12-01 22:00:00
15-12-01 23:00:00
NA
15-12-02 01:00:00
15-12-02 02:00:00
I have tried using a logic vector to identify the issue then correct it but can't find an easy solution.
The lubridate package was designed to deal with situations like this.
dt <- c(
"12/1/15 22:00",
"12/1/15 23:00",
"12/2/15",
"12/2/15 1:00",
"12/2/15 2:00"
)
dt
[1] "12/1/15 22:00" "12/1/15 23:00" "12/2/15" "12/2/15 1:00" "12/2/15 2:00"
lubridate::mdy_hm(dt, truncated = 2)
[1] "2015-12-01 22:00:00 UTC" "2015-12-01 23:00:00 UTC" "2015-12-02 00:00:00 UTC"
[4] "2015-12-02 01:00:00 UTC" "2015-12-02 02:00:00 UTC"
The truncated parameter indicates how many formats can be missing.
You may add the tz parameter to specify which time zone to parse the date with if UTC is not suitable.
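For instance, to keep the MST zone used in the question (a small sketch, assuming the dt vector defined above):
lubridate::mdy_hm(dt, truncated = 2, tz = "MST")
# [1] "2015-12-01 22:00:00 MST" "2015-12-01 23:00:00 MST" "2015-12-02 00:00:00 MST"
# [4] "2015-12-02 01:00:00 MST" "2015-12-02 02:00:00 MST"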
I think the logic vector approach could work. Maybe in tandem with a temporary vector for holding the parsed dates without clobbering the unparsed ones. Something like this:
dates <- as.POSIXct(MC$Date, tz = 'MST', format = '%m/%d/%Y %H:%M')
dates[is.na(dates)] <- as.POSIXct(MC$Date[is.na(dates)], tz = 'MST', format = '%m/%d/%Y')
MC$Date <- dates
Since all of your datetimes are separated with a space between date and time, you could use strsplit to extract only the date part.
extractDate <- function(x){ strsplit(x, split = " " )[[1]][1] }
MC$Date <- sapply( MC$Date, extractDate )
Then go ahead and convert any way you like, without worrying about the time part getting in the way.
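For example, a minimal follow-up conversion once only the date part is left (assuming two-digit years as in the sample data):
MC$Date <- as.Date(MC$Date, format = "%m/%d/%y")  # "12/1/15" -> "2015-12-01"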

Error in converting date time to 24 hour format

I have the following dataframe and am trying to calculate the difference in minutes between the dates in two vectors and store it in a new one.
Reportnumber OpenedDate
00001 22/1/2016 5:52:12 PM
00002 20/1/2016 4:15:06 PM
00003 18/1/2016 1:09:46 PM
00004 15/1/2016 10:47:40 AM
00005 15/1/2016 10:32:37 AM
00006 14/1/2016 2:13:48 PM
00007 14/1/2016 11:12:29 AM
00008 14/1/2016 10:17:30 AM
00009 12/1/2016 2:25:03 PM
Before using difftime to get the difference, I'm trying to convert the time to a 24-hour format and strip AM/PM. I'm doing the following:
dataset$convertedDate <- as.POSIXct('dataset$OpenedDate', format="%d/%b/%Y %H:%M:%s")
I don't get an error in the console but the dataset$convertedDate vector isn't updated.
Is this the right way to approach the problem?
Update:
Get ready for a facepalm.
Look closely at the call you are making:
dataset$convertedDate <- as.POSIXct('dataset$OpenedDate', format="%d/%b/%Y %H:%M:%s")
You are passing in 'dataset$OpenedDate' instead of dataset$OpenedDate. In other words, you are actually passing in a text string to as.POSIXct()! I verified that passing in a string to as.POSIXct() indeed returns NA, which is what you are seeing.
You were also missing a format directive for AM/PM (%p). Try the following, which assumes that the timezone is UTC (which you can change to fit your needs):
as.POSIXct(df$OpenedDate, format="%d/%m/%Y %I:%M:%S %p", tz="UTC")
Output:
[1] "2016-01-22 17:52:12 UTC" "2016-01-20 16:15:06 UTC"
Data:
df <- data.frame(Reportnumber = c('00001', '00002'),
                 OpenedDate   = c('22/1/2016 5:52:12 PM', '20/1/2016 4:15:06 PM'),
                 ClosedDate   = c('25/1/2016 1:35:05 PM', '20/1/2016 4:30:06 PM'))
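From there, the difference in minutes that the question asked about is just a difftime between the two parsed columns (a sketch based on the df above; exact values depend on the timezone chosen):
opened <- as.POSIXct(df$OpenedDate, format = "%d/%m/%Y %I:%M:%S %p", tz = "UTC")
closed <- as.POSIXct(df$ClosedDate, format = "%d/%m/%Y %I:%M:%S %p", tz = "UTC")
difftime(closed, opened, units = "mins")
# roughly 4063 and 15 minutes for the two sample rows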
