Find dates that fail to parse in R Lubridate - r

As an R novice I'm pulling my hair out trying to debug cryptic R errors. I have a CSV containing 150k lines that I load into a data frame named 'dates'. I then use lubridate to convert the character column to datetimes in hopes of finding the min/max date.
dates <- csv[c('datetime')]
dates$datetime <- ymd_hms(dates$datetime)
Running this code, I receive the following warning:
Warning message:
3 failed to parse.
I accept this as the CSV could have some janky dates in there and next run:
min(dates$datetime)
max(dates$datetime)
Both of these return NA, which I assume is from the few broken dates still stored in the data frame. I've searched around for a quick fix, and have even tried to build a foreach loop to identify the problem dates, but no luck. What would be a simple way to identify the 3 broken dates?
example date format: 2015-06-17 17:10:16 +0000

Credit to LawyeR and Stibu from above comments:
I first sorted the raw csv column and did a head() & tail() to find which 3 dates were causing trouble.
Alternatively, which(is.na(dates$datetime)) was a simple one-liner that also found the answer.
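The one-liner above returns positions only. A minimal sketch of using those positions to recover the offending raw strings, assuming the unparsed text is kept in its own vector. The names raw, parsed and bad are hypothetical, and the sample values are invented for illustration:

```r
library(lubridate)

# Hypothetical sample data: two valid timestamps and one malformed entry.
raw <- c("2015-06-17 17:10:16 +0000",
         "not a date",
         "2015-06-18 09:00:00 +0000")

# quiet = TRUE suppresses the "failed to parse" warning; failures become NA.
parsed <- ymd_hms(raw, quiet = TRUE)

# Positions that were present in the input but did not parse.
bad <- which(is.na(parsed) & !is.na(raw))
raw[bad]

# min/max work again once the NA values are skipped explicitly.
range(parsed, na.rm = TRUE)
```

Keeping the raw vector separate matters: once the column has been overwritten with the parsed values, the failing strings are gone and only their positions remain.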

Lubridate will also give that warning when attempting to parse date-times that do not exist because of daylight saving time.
For example:
library(lubridate)
mydate <- strptime('2020-03-08 02:30:00', format = "%Y-%m-%d %H:%M:%S")
ymd_hms(mydate, tz = "America/Denver")
[1] NA
Warning message:
1 failed to parse.
My data comes from an unintelligent sensor which does not know about DST, so impossible (but correctly formatted) dates appear in my timeseries.
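To check whether a failure is caused by a daylight saving gap rather than by a malformed string, one option is to parse the same text in a zone without transitions and compare. This is only a sketch: treating the sensor clock as UTC is an assumption made for the comparison, not something stated above.

```r
library(lubridate)

stamp <- "2020-03-08 02:30:00"

# UTC has no daylight saving transitions, so every wall-clock time exists.
in_utc <- ymd_hms(stamp, tz = "UTC")

# In America/Denver the clocks jumped from 02:00 to 03:00 on this date,
# so this wall-clock time falls into the gap.
in_denver <- suppressWarnings(ymd_hms(stamp, tz = "America/Denver"))

# A string that parses in UTC but not in the local zone is well formed
# and only fails because of the zone's transition rules.
is.na(in_utc)
is.na(in_denver)
```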

If the indices of where lubridate fails are useful to know, you can use a for loop with stopifnot() and print each index as it is parsed.
Make some dates, then put an unparseable value at a random location.
library(lubridate)
set.seed(1)
my_dates<-as.character(sample(seq(as.Date('1900/01/01'),
as.Date('2000/01/01'), by="day"), 1000))
my_dates[sample(1:length(my_dates), 1)]<-"purpleElephant"
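Now use a for loop that prints each index and stops at the first failure with stopifnot(). The last index printed is the one that failed to parse.
for (i in seq_along(my_dates)) {
  print(i)  # printed before parsing, so the final value is the failing index
  stopifnot(!is.na(ymd(my_dates[i])))
}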

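To provide a more generic answer: first filter out the values that are already NA, then parse, then keep only the rows where parsing produced NA. Those rows are the failures. Here dates$datetime is still the raw character column, and the parsed result goes into a new column so the original strings stay visible. Something like:
dates2 <- dates[!is.na(dates$datetime), , drop = FALSE]
dates2$parsed <- ymd_hms(dates2$datetime)
Warning message:
3 failed to parse.
dates2[is.na(dates2$parsed), , drop = FALSE]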

Here is a simple function that solves the generic problem:
parse_ymd = function(x){
  d = lubridate::ymd(x, quiet=TRUE)
  errors = x[!is.na(x) & is.na(d)]
  if(length(errors)>0){
    cli::cli_warn("Failed to parse some dates: {.val {errors}}")
  }
  d
}
x = c("2014/20/21", "2014/01/01", NA, "2014/01/02", "foobar")
my_date = lubridate::ymd(x)
#> Warning: 2 failed to parse.
my_date = parse_ymd(x)
#> Warning: Failed to parse some dates: "2014/20/21" and "foobar"
Created on 2022-09-29 with reprex v2.0.2
Of course, replace ymd() with whatever you want.
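One way to make the "replace ymd() with whatever you want" step explicit is to pass the parser in as an argument. The sketch below is a hypothetical variant, not the answer's own code: parse_with_report and its parser argument are invented names, it uses base warning() instead of cli, and it is demonstrated with base as.Date() so it runs without extra packages.

```r
# Hypothetical generalisation: `parser` is any function that returns NA
# for values it cannot parse; extra arguments are forwarded through `...`.
parse_with_report <- function(x, parser, ...) {
  parsed <- parser(x, ...)
  # Failures are inputs that were not NA but came back as NA.
  failed <- !is.na(x) & is.na(parsed)
  if (any(failed)) {
    warning("Failed to parse: ",
            paste(unique(x[failed]), collapse = ", "),
            call. = FALSE)
  }
  parsed
}

x <- c("2014/20/21", "2014/01/01", NA, "2014/01/02", "foobar")

# With an explicit format, as.Date() returns NA instead of raising an error.
d <- parse_with_report(x, as.Date, format = "%Y/%m/%d")
d
```

With lubridate, the same wrapper could be called as parse_with_report(x, lubridate::ymd, quiet = TRUE) so that only the wrapper's own warning is shown.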

Use the truncated argument. The most common type of irregularity in date-time data is truncation due to rounding or unavailability of the time stamp.
Therefore, try truncated = 1, then potentially go up to truncated = 3:
dates <- csv[c('datetime')]
dates$datetime <- ymd_hms(dates$datetime, truncated = 1)
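A small sketch of what truncated changes, using invented sample strings. Note that truncated only helps with timestamps whose trailing fields are missing; it does not identify or repair genuinely malformed values.

```r
library(lubridate)

# Hypothetical sample: a full timestamp, then one missing seconds,
# one missing minutes and seconds, and one with the date only.
x <- c("2015-06-17 17:10:16",
       "2015-06-17 17:10",
       "2015-06-17 17",
       "2015-06-17")

# Without truncated, incomplete timestamps are reported as failures.
strict <- ymd_hms(x, quiet = TRUE)

# truncated = 3 allows up to three trailing fields to be absent.
lenient <- ymd_hms(x, truncated = 3, quiet = TRUE)

sum(is.na(strict))
sum(is.na(lenient))
```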

Related

Lubridate or ANYTIME to convert from 24hr to 12hr time

As the title suggests, I am trying to use either lubridate or ANYTIME (or similar) to convert a time from 24 hour into 12 hour.. To make life easier I don't need the whole time converted.
What I mean is I have a column of dates in this format:
2021-02-15 16:30:33
I can use inbound$Hour <- hour(inbound$Timestamp) to grab just the hour from the Timestamp which is great.. except that it is still in 24hr time. (this creates an integer column for the hour number)
I have tried several mutates such as inbound <- inbound %>% mutate(Hour = ifelse(Hour > 12, sum(Hour - 12), Hour))
This technically works.. but I get some really wonky values (I get a -294 in several rows for example)..
is there an easier way to get the 12hr time converted?
Per recommendation below I tried to use a base FORMAT as follows:
inbound$Time <- format(inbound$Timestamp, "%H:%M:%S")
inbound$Time <- format(inbound$Time, "%I:%M:%S")
and on the second format I am getting an error
Error in format.default(inbound$Time, "%I:%M:%S") :
invalid 'trim' argument
I did notice the first format converts to a class CHARACTER column.. not sure if that is causing issues with the 2nd format or not..
I then also tried:
inbound$time <- format(strptime(inbound$Timestamp, "%H:%M:%S"), "%I:%M %p")
Which runs without error.. but it creates a full column of NA's
Final edit::::: I made the mistake of mis-reading/applying the solution and that caused errors.. when using the inbound$Time <- format(inbound$Time, "%I:%M:%S") or as.numeric(format(inbound$Timestamp, "%I")) from the comments... both worked and solved the issue I was having.
To be clear... From 2021-02-15 16:30:33 you want just 04:30:33 as a result?
No need for lubridate or anytime, assuming that is a POSIXct:
a <- as.POSIXct("2021-02-15 16:30:33")
a
# [1] "2021-02-15 16:30:33 UTC"
b <- format(a, "%H:%M:%S")
b
#[1] "16:30:33"
c <- format(a, "%I:%M:%S")
c
#[1] "04:30:33"
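The odd values from the mutate() attempt in the question come from sum(): it collapses the whole Hour column into a single number, so every row over 12 receives the same column-wide total instead of its own hour minus 12. A minimal sketch of two per-row alternatives, using the timestamp from the question and an invented hour24 vector:

```r
# Sample timestamp from the question; the UTC time zone is an assumption.
a <- as.POSIXct("2021-02-15 16:30:33", tz = "UTC")

# Option 1: let format() do the conversion on the date-time itself.
format(a, "%I:%M:%S")        # 12-hour clock as text
as.integer(format(a, "%I"))  # 12-hour value as an integer

# Option 2: element-wise arithmetic on 24-hour values (0 to 23), no sum().
hour24 <- c(0L, 9L, 12L, 16L, 23L)
hour12 <- ((hour24 - 1L) %% 12L) + 1L
hour12
```

format() only understands "%I" and similar codes when it is given a date-time object. Calling it on the character result of an earlier format() call does not reinterpret the text, which is why the second format() in the question failed.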

Problems with parse_date_time converting a character vector

I have an imported CSV in R which contains a column of dates and times - this is imported into R as character. The format is "30/03/2020 08:59". I want to convert these strings into a format that allows me to work on them. For simplicity I have made a dataframe which has a single column of these dates (854) in this format.
I'm trying to use the parse_date_time function from lubridate.
It works fine when I reference a single value, e.g.
b=parse_date_time(consults_dates[3,1],orders="dmy HM")
gives b=2020-03-30 09:08:00
However, when I try to perform this on the entire data frame (consults_dates), I get an error, e.g.
c= parse_date_time(consults_dates,orders="dmy HM") gives error:
Warning message:
All formats failed to parse. No formats found.
Apologies - if this is blatantly a simple question, day 1 of R after years of Matlab.
You need to pass the column to parse_date_time function and not the entire dataframe.
library(lubridate)
consults_dates$column_name <- parse_date_time(consults_dates$column_name, "dmy HM")
However, if you have only one format in the column, you can use dmy_hm:
consults_dates$column_name <- dmy_hm(consults_dates$column_name)
In base R, we can use:
consults_dates$column_name <- as.POSIXct(consults_dates$column_name,
                                         format = "%d/%m/%Y %H:%M", tz = "UTC")
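To see the difference between the data frame and its column, here is a small base R sketch. The data frame contents and the column name when are invented for illustration:

```r
# Hypothetical one-column data frame mirroring the question's format.
consults_dates <- data.frame(
  when = c("30/03/2020 08:59", "30/03/2020 09:08"),
  stringsAsFactors = FALSE
)

# The data frame is a list of columns; only the column is a character vector.
is.character(consults_dates)
is.character(consults_dates$when)

# Convert the column, not the data frame.
consults_dates$when <- as.POSIXct(consults_dates$when,
                                  format = "%d/%m/%Y %H:%M", tz = "UTC")
str(consults_dates)
```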

R datetime format issues

I am currently trying to determine the time and date on the observations in my dataset.
The date/timestamp is as follows:
1458024601.18659
1458024660.818
The observations are recorded every minute.
I am trying to convert the above date/time stamps into something more understandable/interpretable.
Could you please help me with this issue.
Many thanks.
Looks like seconds, but seconds starting from when? Typically, 1970-01-01:
> x = 1458024601.18659
> as.POSIXct(x, origin="1970-01-01")
[1] "2016-03-15 06:50:01 GMT"
So if you are expecting that timestamp to be that time, we've got the origin right.
If you are expecting a date in 1946, then origin="1900-01-01" is probably what you want.
Since, according to your most recent post, the data is stored as a factor class, some further manipulations are required.
To convert the factor column into the required numeric class, this modification of @Spacedman's answer should work:
as.POSIXct(as.numeric(as.character(all_prices$timestamp)), origin="1970-01-01")
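The as.character() step matters because as.numeric() on a factor returns the internal level codes rather than the values shown. A minimal sketch with the two timestamps from the question; the vector names and the explicit UTC time zone are assumptions:

```r
# Hypothetical factor column holding the two timestamps from the question.
ts <- factor(c("1458024601.18659", "1458024660.818"))

# Direct conversion gives the level codes, not the timestamps.
as.numeric(ts)

# Going through character first recovers the real values.
secs <- as.numeric(as.character(ts))

when <- as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")
format(when, "%Y-%m-%d %H:%M:%S")
```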
Your solution is perfect, except that I have another issue :(
I tried to run this code on the data.frame that I have. Unfortunately, I keep getting the following error after running the code.
dates <- as.POSIXct(all_prices$timestamp, origin="2016-03-15")
Error in as.POSIXlt.character(as.character(x), ...) :
character string is not in a standard unambiguous format
Data "all_prices" is a data.frame.
class(all_prices)
[1] "data.frame"
data "all_prices$timestamp" is a factor.
class(all_prices$timestamp)
[1] "factor"

mdy {lubridate} unable to identify "January"

I am working with a list of birth dates in the format "January131973". To get the dates from this string, I am using mdy function from lubridate library. Strangely the code was returning NA only for dates in January, but working fine for other months, as below
> mdy("January131973")
[1] NA
Warning message:
All formats failed to parse. No formats found.
> mdy("April241973")
[1] "1973-04-24"
The data spans across all months and dates, and years ranging from 1971 to 1990. But the error occurs only for dates in January. I have worked around the input string to get "13January1973" and proceeded with dmy function instead, which has resolved the issue at hand. (ymd also works perfectly fine.)
However, if it can be verified that I am not overlooking any underlying conflicts etc, it will be helpful the next times, and can also help identify unseen issues elsewhere.
Here is a test code I have tried out to check different combinations
library(tidyr)
library(lubridate)
x <- data.frame(mmm=month.name, dd=c(15:26), yyyy=c(1973:1984))
x_mdy <- unite(x, test, mmm,dd,yyyy, sep = "",remove = FALSE)
lapply(x_mdy$test, mdy)
x_dmy <- unite(x, test, dd,mmm,yyyy, sep = "", remove = FALSE)
lapply(x_dmy$test, dmy)
x_ymd <- unite(x, test, yyyy,mmm,dd, sep = "", remove = FALSE)
lapply(x_ymd$test, ymd)
After running the above code, I have faced the issue only while using mdy with "January". Also note that abbreviated form of the month name also gives the same error (mmm=month.abb in the above df creation.)
Any clarification of this behavior appreciated.
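The thread does not explain why mdy() rejects these strings. As a workaround that avoids format guessing altogether, here is a minimal base R sketch that splits the string with a regular expression. parse_month_name_mdy is a hypothetical helper, not part of lubridate, and it assumes English month names followed by a one- or two-digit day and a four-digit year.

```r
# Hypothetical helper: parse strings such as "January131973" without
# relying on format guessing or on the current locale.
parse_month_name_mdy <- function(x) {
  pattern <- "^([A-Za-z]+)([0-9]{1,2})([0-9]{4})$"
  parts <- regmatches(x, regexec(pattern, x))
  iso <- vapply(parts, function(p) {
    # p holds the full match plus three groups when the pattern matched.
    if (length(p) != 4) return(NA_character_)
    month <- match(tolower(p[2]), tolower(month.name))
    if (is.na(month)) month <- match(tolower(p[2]), tolower(month.abb))
    if (is.na(month)) return(NA_character_)
    sprintf("%s-%02d-%02d", p[4], month, as.integer(p[3]))
  }, character(1))
  # Impossible dates (for example day 31 in February) become NA here.
  as.Date(iso, format = "%Y-%m-%d")
}

parse_month_name_mdy(c("January131973", "April241973", "Jan131973"))
```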

Smartbind date format and error in R

I'm getting an error using smartbind to append two datasets. First, I'm pretty sure the error I'm getting:
> Error in as.vector(x, mode) : invalid 'mode' argument
is coming from the date variable in both datasets. The date variable in its raw format is as follows: month/day/year. I transformed the variable after importing the data using as.Date and format
> rs.month$xdeeddt <- as.Date(rs.month$xdeeddt, "%m/%d/%Y")
> rs.month$deed.year <- as.numeric(format(rs.month$xdeeddt, format = "%Y"))
> rs.month$deed.day <- as.numeric(format(rs.month$xdeeddt, format = "%d"))
> rs.month$deed.month <- as.numeric(format(rs.month$xdeeddt, format = "%m"))
The resulting date variable is as such:
> [1] "2014-03-01" "2014-03-13" "2014-01-09" "2013-10-09"
The transformation for the date was applied to both datasets (the format of the raw data was identical for both datasets). When I try to use smartbind, from the gtools package, to append the two datasets it returns with the error above. I removed the date, month, day, and year variables from both datasets and was able to append the datasets successfully with smartbind.
Any suggestions on how I can append the datasets with the date variables.....?
I came here after googling for the same error message during a smartbind of two data frames. The discussion above, while not so conclusive about a solution, definitely helped me move through this error.
Both my data frames contain POSIXct date objects. Those are just a numeric vector of UNIXy seconds-since-epoch, along with a couple of attributes that provide the structure needed to interpret the vector as a date object. The solution is simply to strip the attributes from that variable, perform the smartbind, and then restore the attributes:
these.atts <- attributes(df1$date)
attributes(df1$date) <- NULL
attributes(df2$date) <- NULL
df1 <- smartbind(df1,df2)
attributes(df1$date) <- these.atts
I hope this helps someone, sometime.
-Andy
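The strip-and-restore idea can be tried without gtools. The sketch below uses plain c() in place of smartbind and two invented timestamps. It assumes both vectors share the same class and time zone, since the restored attributes come from the first one only.

```r
# Two hypothetical date-time values in the same time zone.
d1 <- as.POSIXct("2014-03-01 12:00:00", tz = "UTC")
d2 <- as.POSIXct("2014-03-13 08:30:00", tz = "UTC")

# Save the attributes (class and tzone), then strip them from both.
saved <- attributes(d1)
n1 <- d1
attributes(n1) <- NULL
n2 <- d2
attributes(n2) <- NULL

# Without attributes these are plain numbers: seconds since the epoch.
combined <- c(n1, n2)
is.numeric(combined)

# Restoring the attributes turns the numbers back into date-times.
attributes(combined) <- saved
combined
```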
