How to validate a date in R

I have a date in the format dd-mm-yyyy HH:mm:ss
What is the best and easiest way to validate this date?
I tried
d <- format.Date(date, format="%d-%m-%Y %H:%M:%S")
But how can I catch the error when an illegal date is passed?

Simple way:
d <- try(as.Date(date, format="%d-%m-%Y %H:%M:%S"))
if("try-error" %in% class(d) || is.na(d)) {
print("That wasn't correct!")
}
Explanation: format.Date uses as.Date internally to convert date into an object of the Date class. However, it does not pass a format option along, so as.Date uses the default formats: it tries %Y-%m-%d first and then %Y/%m/%d.
The format option from format.Date is used only for the output, not for the parsing. Quoting from the as.Date man page:
The ‘as.Date’ methods accept character strings, factors, logical
‘NA’ and objects of classes ‘"POSIXlt"’ and ‘"POSIXct"’. (The
last is converted to days by ignoring the time after midnight in
the representation of the time in specified timezone, default
UTC.) Also objects of class ‘"date"’ (from package ‘date’) and
‘"dates"’ (from package ‘chron’). Character strings are processed
as far as necessary for the format specified: any trailing
characters are ignored.
However, when you call as.Date directly with a format specification, only input that fits your format is accepted; input that does not match yields NA.
See also: ?as.Date
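As a minimal sketch of the point above: with an explicit format, as.Date returns NA for input it cannot parse rather than raising an error, so the is.na check is usually what does the validating. The input strings and the names fmt, inputs and valid are made up for illustration.

```r
# Hypothetical inputs: one valid date-time, one impossible date, one non-date
fmt <- "%d-%m-%Y %H:%M:%S"
inputs <- c("31-12-1999 23:59:59", "31-02-2000 10:00:00", "not a date")

# With an explicit format, unparseable strings become NA instead of errors
parsed <- as.Date(inputs, format = fmt)
valid <- !is.na(parsed)
print(valid)
```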

You may want to look at the gsubfn package. It has functions (gsubfn specifically) that work like other regular expression functions to match pieces of a string, but it then calls a user-supplied function and passes the matching pieces to that function. So you would write your own function that looks at the year, month, and day and makes sure that they are in the correct ranges (the range for day can depend on the passed month and year).
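The range-checking idea can be sketched without gsubfn. The version below swaps in base R's regexec and regmatches to pull out the pieces, then checks them by hand; the function name valid_dmy and the dd-mm-yyyy pattern are assumptions for illustration, not part of any package.

```r
# Hypothetical helper: validate a single "dd-mm-yyyy" string by range checks
valid_dmy <- function(x) {
  # Extract day, month and year as separate pieces
  m <- regmatches(x, regexec("^([0-9]{2})-([0-9]{2})-([0-9]{4})$", x))[[1]]
  if (length(m) == 0) return(FALSE)  # pattern did not match at all
  day <- as.integer(m[2])
  month <- as.integer(m[3])
  year <- as.integer(m[4])
  if (month < 1 || month > 12) return(FALSE)
  # The allowed day range depends on the month and on leap years
  leap <- (year %% 4 == 0 && year %% 100 != 0) || year %% 400 == 0
  days_in_month <- c(31, if (leap) 29 else 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
  day >= 1 && day <= days_in_month[month]
}

valid_dmy("29-02-2000")  # leap year
valid_dmy("29-02-1900")  # not a leap year
```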

This might be helpful if flexibility is desired in a date-time entry.
I have a function where I want to allow either a date-only entry or a date-time entry, then set a flag for use inside the function only. I'm calling this flag date_type. The flag will be used later in the larger function to select units for getting the difference between two dates with difftime. (In most cases, the function will be perfectly fine with date only, but in some cases a user might need a shorter time frame. I don't want to inconvenience users with the shorter time frame if they don't need it.)
I am posting this for two reasons: 1) to help anyone trying to allow flexibility in date arguments and 2) to welcome sanity checks in case there's a problem with the method, since this is going into a function in an R package.
dat_time_check_fn <- function(dat_time) {
if (!anyNA(as.Date(dat_time, format= "%Y-%m-%d %H:%M:%S"))) date_type <- 1
else if (!anyNA(as.Date(dat_time, format= "%Y-%m-%d"))) date_type <- 2
else stop("Error: dates must either be in format '1999-12-31' or '1999-12-31 23:59:59' ")
date_type
}
Date-time case:
date5 <- "1999-12-31 23:59:59"
date_type <- dat_time_check_fn(date5)
date_type
[1] 1
Date only case:
date6 <- "1999-12-31"
date_type <- dat_time_check_fn(date6)
date_type
[1] 2
Note that if the order above in the function is reversed, the longer date-time can be inadvertently coerced to the shorter version and both types result in date_type = 1.
My larger function has more than one date, but I need them to be compatible. Below, I'm checking the two dates checked above, where one was type 1 and one was type 2. Combining types gives the result with date only (type 2):
date_type <- dat_time_check_fn(c(date5, date6))
date_type
[1] 2
Here's a non-compliant version:
date7 <- "1/31/2011"
date_type <- dat_time_check_fn(date7)
Error in dat_time_check_fn(date7) :
Error: dates must either be in format '1999-12-31' or '1999-12-31 23:59:59'
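The larger function is not shown, so the following is only a sketch of how such a flag might select units for difftime. The helper name diff_by_type, the choice of "secs" versus "days", and the UTC time zone are all assumptions.

```r
# Hypothetical helper: pick difftime units from the flag
# (1 = date-time input, 2 = date-only input)
diff_by_type <- function(start, end, date_type) {
  units <- if (date_type == 1) "secs" else "days"
  difftime(as.POSIXct(end, tz = "UTC"),
           as.POSIXct(start, tz = "UTC"),
           units = units)
}

d_secs <- diff_by_type("1999-12-31 23:59:59", "2000-01-01 00:00:01", date_type = 1)
d_days <- diff_by_type("1999-12-31", "2000-01-02", date_type = 2)
print(d_secs)
print(d_days)
```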

Many solutions here are prone to SQL injection. They return TRUE for date = "2020-08-11; DROP * FROM my_table". Here is a vectorized base R function that works with NA:
is_date = function(x, format = NULL) {
formatted = try(as.Date(x, format), silent = TRUE)
is_date = as.character(formatted) == x & !is.na(formatted) # valid and identical to input
is_date[is.na(x)] = NA # Insert NA for NA in x
return(is_date)
}
Let's try:
> is_date(c("2020-08-11", "2020-13-32", "2020-08-11; DROP * FROM table", NA), format = "%Y-%m-%d")
## TRUE FALSE FALSE NA
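The same round-trip idea (parse, format back, compare with the input) can be sketched for the date-time format from the question. The name is_datetime, the default format and the UTC time zone are assumptions for illustration.

```r
# Hypothetical variant of the round-trip check for "dd-mm-yyyy HH:MM:SS" strings
is_datetime <- function(x, format = "%d-%m-%Y %H:%M:%S") {
  parsed <- as.POSIXct(x, format = format, tz = "UTC")
  # Valid only if parsing succeeded and formatting back reproduces the input
  ok <- !is.na(parsed) & format(parsed, format) == x
  ok[is.na(x)] <- NA  # keep NA for NA input
  ok
}

checked <- is_datetime(c("31-12-1999 23:59:59",
                         "31-12-1999 23:59:59; DROP TABLE t",
                         "31-02-2000 00:00:00",
                         NA))
print(checked)
```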

I believe that what you are looking for is the tryCatch function.
The following is an excerpt from a script I wrote which accepts any .csv file with two series that have a common x axis. The first column in 'data' is the common x axis variable, and columns 2 & 3 are the y axis variables. I needed the tryCatch statement to make sure the script would create a plot regardless of whether the x axis data is a time series or some other type of variable.
### READ DATA FROM A CSV FILE
data = read.csv("STLDvsNEM2.csv", header = TRUE)
#CONVERT FIRST COLUMN OF DATA (IN MY CASE, THE COLUMN INTENDED TO BE THE X AXIS)
#TO AN ACCEPTABLE DATE FORMAT
#IF FIRST COLUMN OF DATA IS NOT IN AN ACCEPTABLE DATE FORMAT
#USE THE VALUE WITHOUT ANY TRANSFORMATION
x <- tryCatch({
as.Date(data[,1])},
warning = function(w) data[,1],
error = function(e) data[,1])
y1 <- data[,2]
y2 <- data[,3]
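A self-contained sketch of the same fallback pattern, without the CSV file: the handlers return the untouched input when as.Date cannot convert it. The function name to_date_or_original is made up for illustration.

```r
# Hypothetical wrapper: try to convert to Date, otherwise return the input unchanged
to_date_or_original <- function(x) {
  tryCatch(
    as.Date(x),
    warning = function(w) x,
    error = function(e) x
  )
}

converted <- to_date_or_original("2020-01-05")  # parses with a default format
untouched <- to_date_or_original("abc")         # as.Date errors, input comes back
print(class(converted))
print(untouched)
```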

Related

R Convert char to time

I have a column of military time values, df1$appt_time, in the format "13:30". All of them are 5 characters, "00:00". I have tried POSIXct, but it added today's date to the values. I have also tried lubridate and couldn't get that to work. Most recently I have been trying to use chron and am so far unsuccessful at that too.
The goal is that once this is done I am going to group the times into factor levels, I cannot perform any conditional operations on them currently, unless I am wrong about that as well ;)
> df1$Time <- chron(times = df1$appt_time)
Error in convert.times(times., fmt) : format h:m:s may be incorrect
In addition: Warning message:
In unpaste(times, sep = fmt$sep, fnames = fmt$periods, nfields = 3) :
106057 entries set to NA due to wrong number of fields
also df1$Time <- chron(times(df1$appt_time)) same error as above
as well as different tries at being explicit with the format:
> df1$appt_time <- chron(df1$appt_time, format = "h:m")
Error in widths[, fmt$periods, drop = FALSE] : subscript out of bounds
I would be very grateful if someone could point out my error or suggest a better way to accomplish this task.
You can use as.POSIXct :
df1$date_time <- as.POSIXct(df1$appt_time, format = '%H:%M', tz = 'UTC')
Since you don't have dates, this will assign today's date, with the time taken from appt_time.
For example -
as.POSIXct('13:30', format = '%H:%M', tz = 'UTC')
#[1] "2021-02-01 13:30:00 UTC"
One way to overcome this problem if you need to perform arithmetic on the times prior to grouping them is to treat the minutes as a fraction of the hour:
# If you need to do some extra arithmetic prior to coercing to factor:
as.numeric(substr(test1, 1, 2)) + (as.numeric(substr(test1, 4, 5))/60)
# Otherwise:
as.factor(test1)
where test1 stands in for df1$appt_time:
test1 <- c('13:30','13:45', '14:00', '14:15', '14:30', '14:45', '15:00')
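Building on the fractional-hour idea, grouping into factor levels can be sketched with cut. The break points (12 and 15) and the labels are assumptions chosen only to illustrate the call; the real ranges would come from the data.

```r
# Hypothetical appointment times in "HH:MM" form
appt_time <- c("08:15", "11:45", "13:30", "16:00")

# Hours as a decimal number, e.g. "13:30" becomes 13.5
hours <- as.numeric(substr(appt_time, 1, 2)) + as.numeric(substr(appt_time, 4, 5)) / 60

# Left-closed intervals: [0, 12), [12, 15), [15, 24)
period <- cut(hours,
              breaks = c(0, 12, 15, 24),
              labels = c("Morning", "Mid-Day", "Afternoon"),
              right = FALSE)
print(period)
```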
Unable to find a way to work with the time as I had intended, I came up with this DIIIIIRRRRRRRRRRRTY solution.
#converted appt_time to POSIXct format, which added today's date
df9$appt_time <- as.POSIXct(df9$appt_time, format = '%H:%M')
#Since I am only interested in creating a value based on whether the time falls within a specific range, I decided I could output this new value, 'unclassed', to a column and then manually eyeball the values I needed that corresponded to my ranges
df9$convert <- unclass(df9$appt_time)
#Using the manually obtained, unclassed values I was able to create the factor levels I wanted
group_appt_time <- function(convert){
ifelse (convert >= 1612624500 & convert <= 1612637100, 'Morning',
ifelse (convert >= 1612638000 & convert <= 1612647900, 'Mid-Day',
ifelse (convert >= 1612648800 & convert <= 1612658700, 'Afternoon',
'Invalid Time')))
}
df9$appt_time_grouped <- as.factor(group_appt_time(df9$convert))
This is a research project, not something I need to recreate on an ongoing basis, so it works.

window() function excludes the date sent as end argument, any workaround?

I want to use the window function to subset a time series. However, the function excludes the date I pass as the end argument.
window(ts1, end = "2018-09-24")
I couldn't find any argument to change this behavior. Any thoughts?
The problem arose because of comparing two different types of data, Date and POSIXct.
I solved the issue by finding the indexes of the rows that are after that date and then excluded them from the dataset:
evaluation_date <- "2018-09-24"
indexes_removed <- which(as.numeric(as.Date(index(ts1))) > as.numeric(as.Date(evaluation_date)))
ts1 <- ts1[-indexes_removed]
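The original series is not shown, so here is a base R sketch with made-up timestamps of why an end date can appear to be excluded: a POSIXct stamp later on the end day is after that day's midnight, whereas comparing on the Date level keeps it.

```r
# Hypothetical index of a series, stored as POSIXct in UTC
stamps <- as.POSIXct(c("2018-09-23 18:00:00",
                       "2018-09-24 09:30:00",
                       "2018-09-25 08:00:00"), tz = "UTC")
end_date <- as.Date("2018-09-24")

# Comparing against midnight of the end date drops the 09:30 observation
keep_posix <- stamps <= as.POSIXct("2018-09-24", tz = "UTC")

# Comparing Date against Date keeps everything on the end date
keep_date <- as.Date(stamps, tz = "UTC") <= end_date

print(keep_posix)
print(keep_date)
```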

Slow String to Date Conversion Function

I wrote the following function to convert a vector of Strings to a vector of Dates (the code inside the for loop was inspired by this post: R help converting factor to date). When I pass in a vector of size 1000, this takes about 30 seconds. Not terribly slow, but I ultimately need to pass in about 100,000 so this could be a problem. Any ideas why this is slow and/or how to speed it up?
toDate <- function (dates)
{
theDates <- vector()
for(i in 1:length(dates))
{
temp <- factor(dates[i])
temp <- as.Date(temp, format = "%m/%d/%Y")
theDates[i] <- temp
}
class(theDates) <- "Date"
return(theDates)
}
Just do:
as.Date(dates, format = "%m/%d/%Y")
You don't need to loop over the dates vector, as as.Date() can handle a vector of characters just fine in a single shot. Your function incurs length(dates) calls to as.Date() plus calls to other functions and some assignments, all of which add overhead that is totally unnecessary.
You don't want to convert each individual date to a factor. You don't want to convert them at all (as.Date() will just convert them back to characters). If you did want to convert them, factor() is also vectorised, so you could (but you don't need this at all, anywhere in your function) remove the factor() line and insert dates <- as.factor(dates) outside the for() loop. But again, you don't need to do this at all!
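A small sketch of the vectorised call on made-up input, also showing that a factor gives the same result as the character vector, so the per-element factor() step adds nothing:

```r
# Hypothetical input in month/day/year form
dates <- c("01/31/2011", "02/15/2012", "12/01/2013")

# One vectorised call converts the whole vector
converted <- as.Date(dates, format = "%m/%d/%Y")

# A factor is converted via its character labels, giving the same dates
from_factor <- as.Date(factor(dates), format = "%m/%d/%Y")

print(converted)
```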

R how to format date for an element of a list

This is what I want to do:
First, I randomly generate a sequence of dates.
Then, I assign the earliest date to the variable.
site_start<-list()
for(i in 1:10){
for(j in 1:10){
date<-seq.Date(from=as.Date("1900-01-01"),to=as.Date("2000-01-01"),by="week")
site_start[[i]][j]<-sample(date,1)
}
}
Now, let us assume the date variable is correctly generated. The reason I say this is because in my real case, I acquired the date variable from dozens of other steps that is irrelevant here.
My question is: why does the site_start[[i]][j] I generated keep coming out as POSIXct, with R requiring me to provide 'origin'? Even when I format it with an origin of 1970-01-01, it is still a numeric date, such as 15600. I simply don't know how to format this number anymore.
Any help is appreciated!!
Why don't you use this vectorized approach:
date.pool <- seq(from=as.Date("1900-01-01"), to=as.Date("2000-01-01"), by="1 week")
site_start <- replicate(10, sample(date.pool, 10, replace = TRUE), simplify = FALSE)
This produces a list with 10 items, each of which is a 10 length vector with random dates pulled from date.pool. Here are the first two items (site_start[1:2]):
[[1]]
[1] "1969-09-15" "1955-10-10" "1959-04-13" "1992-02-10" "1905-07-31" "1901-09-23"
[7] "1926-10-18" "1959-06-01" "1924-06-02" "1906-05-14"
[[2]]
[1] "1979-01-01" "1998-02-23" "1929-09-02" "1968-07-01" "1924-03-17" "1914-11-02"
[7] "1928-02-13" "1937-10-25" "1915-02-08" "1974-05-06"
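For the second step in the question (taking the earliest date), one possible sketch is to apply min to each list element; min keeps the Date class. The seed and the number of sites (3) are arbitrary choices for illustration.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable
date.pool <- seq(from = as.Date("1900-01-01"), to = as.Date("2000-01-01"), by = "1 week")
site_start <- replicate(3, sample(date.pool, 10, replace = TRUE), simplify = FALSE)

# Earliest date per site; each result is still of class Date
earliest <- lapply(site_start, min)

# Combine into a single Date vector if a vector is more convenient
earliest_vec <- do.call(c, earliest)
print(class(earliest_vec))
```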
In the past, when I have had to grab the oldest or most-recent entry I will use arrange. E.g.,
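# load the packages used below (lubridate for mdy_hm, plyr for arrange and ddply)
library(lubridate)
library(plyr)
# read dataset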
enforce <- read.csv(paste(input.dir, "provider_enforcement.csv", sep="/"))
# use lubridate package to parse date format
enforce$SNAPSHOT_DATE <- mdy_hm(enforce$SNAPSHOT_DATE)
# this function sorts a data.frame and returns a data.frame with one row containing the most recent SNAPSHOT
MostRecent <- function(data) {
return(arrange(data, SNAPSHOT_DATE, decreasing=TRUE)[1, ])
}
# use plyr to apply MostRecent to my dataset for each provider
enforce <- ddply(enforce, .(PROVIDER_IDNO), MostRecent)

Error in simple user-defined function related to data conversion in R

The purpose of this very simple function is just to transform a date column to a date variable and a numeric time (hourly) column to a factor variable, which will be used with plyr later in the code.
I can get this code to run successfully in the command line, but when I attempt to run it in the function I get an error.
# setting up some fake data
set.seed(31)
foo <- function(myHour, myDate){
rlnorm(1, meanlog=0,sdlog=1)*(myHour) + (150*myDate)
}
Hour <- 1:24
Day <-1:1080
dates <-seq(as.Date("2010-01-01"), by = "day", length.out= 1080)
myData <- expand.grid( Day, Hour)
names(myData) <- c("Date","Hour")
myData$Adspend <- apply(myData, 1, function(x) foo(x[2], x[1]))
myData$Date <-dates
myData$Demand <-(rnorm(1,mean = 0, sd=1)+.75*myData$Adspend)
#############################################################
myData
# Function Creation
AddCal <-function(DF,Date,Time) {
DF$Date<-as.Date(DF$Date, format="%m/%d/%Y")#Change Date variable into a date type
DF$Time<-factor(DF$Time,levels=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24))
}
#Test Function
Bob<-AddCal(myData,Date,Hour)
#Error I receive
Error in `$<-.data.frame`(`*tmp*`, "Time", value = integer(0)) :
replacement has 0 rows, data has 25920
I spent about 2 hours searching for answers and trying different things. Because I can run the individual lines of code at the command line and get the desired result, I am assuming this is an advanced coding problem beyond my novice capabilities.
In your function, replace all instances of DF$Time with DF[[Time]], and do the same for DF$Date.
Also see the two comments below from @Dwin & @mrip:
Make sure to return a value
Make sure to pass string arguments where strings are expected
What's going on:
When you use DF$Time, R is looking for a column named Time in DF. It is not treating Time as the string variable that you expect.
DF[[Time]] on the other hand does treat Time as a variable.
The reason the error only refers to Time and not Date is that Date is both the name of your argument and the name of a column in DF. (If your function call had used something like AddCal(.. Date=Demand) or some other column name, you would not get the results you expect.)
Side Note:
c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24)
is equivalent to
seq(24) and to 1:24
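Putting those points together, a corrected function might look like the sketch below. The argument names date_col and time_col and the two-row data frame are made up for illustration; the column names are passed as strings, [[ ]] is used for lookup, and the modified data frame is returned.

```r
# Sketch of a corrected AddCal: column names are passed as strings
AddCal <- function(DF, date_col, time_col) {
  # [[ ]] looks up the column whose name is stored in the variable
  DF[[date_col]] <- as.Date(DF[[date_col]], format = "%m/%d/%Y")
  DF[[time_col]] <- factor(DF[[time_col]], levels = 1:24)
  DF  # return the modified data frame
}

# Hypothetical small data set with character dates and integer hours
myData <- data.frame(Date = c("01/01/2010", "01/02/2010"),
                     Hour = c(1L, 24L),
                     stringsAsFactors = FALSE)
Bob <- AddCal(myData, "Date", "Hour")
str(Bob)
```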

Resources