Date formatting a data frame string column - julia

Is there a way to do something like this in Julia? (This example is in R.)
df$dataCol <- as.Date(df$dataCol, format="%Y%m%d")
where dataCol is in the format "20151009".
Is there a way to change the column type to a date in Julia?
I didn't find a way to do this with the Date.jl package.

There is a Date constructor with an argument for the format,
but the syntax is slightly different.
using Dates
Date( "20141123", DateFormat("yyyymmdd") )

Is this the best way to do the first part of my question?
using Dates
# Convert the column to strings first (the variable name must match the one used in the map below)
dateOccurred = map((x) -> string(x), df[:DateOccurred])
df[:DateOccurred] = map((x) -> if match(r"^((19|20)\d\d)(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01])", x)!=nothing Date(x, DateFormat("yyyymmdd")) end, dateOccurred)

Related

trying to compare POSIXct objects in if statements

I have something like this within a function:
x <- as.POSIXct((substr((dataframe[z, ])$variable, 1, 8)), tz = "GMT",
format = "%H:%M:%S")
print(x)
if ( (x >= as.POSIXct("06:00:00", tz = "GMT", format = "%H:%M:%S")) &
(x < as.POSIXct("12:00:00", tz = "GMT", format = "%H:%M:%S")) ){
position <- "first"
}
but I get this output:
character(0)
Error in if ((as.numeric(departure) - as.numeric(arrival)) < 0) { : argument is of length zero
how can I fix this so my comparison works and it prints the correct thing?
some examples of the dataframe$variable column:
16:33:00
15:34:00
14:51:00
07:26:00
05:48:00
11:10:00
17:48:00
06:17:00
08:22:00
11:31:00
Welcome to Stack Overflow!
First, the reason you've gotten some down votes is most likely because you haven't given much in your question to go on. For one thing, you haven't shown us what
(dataframe[z, ])$variable
is, which makes it hard for us to formulate a complete answer. You seem to be trying to extract a single value from a dataframe, is that right? If so, I've never seen it done that way; try replacing the above with:
dataframe$variable[z]
My guess is what you're trying to achieve is a comparison of an entire column of the dataframe called "variable", since that's generally more useful...
Having said that, I often run into issues with time data, and from what I've heard, my experience is not uncommon. When I'm dealing with just times, as it appears you are here, I prefer the chron::times format over POSIXct. POSIXct is a date-time format, so a date is always included, and it also tries to correct for time zone and daylight saving changes, which tends to get in my way more than it helps. If your data is in the format you specified in your first as.POSIXct call, you won't even need to specify the format when calling the times function instead.
x <- chron::times( dataframe$variable )
print(x)
position <- ifelse ( x >= chron::times( "06:00:00" ) &
x < chron::times( "12:00:00" ),
"first", "not first"
)
This will output a vector "position", with a result for all values taken from dataframe$variable. Does that achieve what you're hoping for?
From here, if you did want to extract the comparison result for the particular row "z" in dataframe, you can still do that with
position[z]
EDIT to add:
It might be worth checking for missing values in "variable". This should return TRUE:
sum( is.na( dataframe$variable ) ) == 0
Also check for any that aren't correctly formatted. Again, this should return TRUE:
sum( is.na( chron::times( dataframe$variable ) ) ) == 0
EDIT to add:
As per the comments, it looks like some values in your "variable" column aren't converting properly. You should be able to find them with
subset( dataframe, is.na( chron::times( variable ) ) )
That should let you see what's wrong. It may be a single cell, or it may be a number of them. You'll need to tidy up that data, which you can do in a few ways. You could go through and fix the values manually, or you could add a function to your script that repairs them before the conversion. The latter might be a good idea if all of those values share a common issue, or if you expect the same issue to recur as new data comes in (if indeed you need to allow for that).
The other option is simply to exclude those rows from your analysis. If you go this route, make sure it's appropriate to the analysis you're running. If it is appropriate in your case, you can add a step to clean up the dataframe before running the steps in your question:
dataframe <- subset( dataframe, !is.na( chron::times( variable ) ) )
NOTE: there's a good chance this will come up with a warning. If you run the same line twice, and the warning goes away the second time (after the offending rows have been removed), you may need to look further into it.
That should drop the offending values, leaving only values that are properly converting to the times format, which should help with the steps you're trying to run. Check how your dataframe dimensions change before and after that step; that'll tell you how many rows you're dropping.
You could do the same thing with POSIXct if that's what you're comfortable with; I'm just personally more comfortable with times for what you're doing.
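For anyone who would rather stay with POSIXct, here is a minimal base R sketch of the same vectorised comparison. It is only an illustration: the sample times come from the question, the malformed entry is made up, and the names `times_chr`, `lower`, and `upper` are assumptions. Because only a time is supplied, as.POSIXct fills in the current date, so all values end up on the same day.

```r
# Sample values from the question, plus one made-up malformed entry
times_chr <- c("16:33:00", "07:26:00", "05:48:00", "11:10:00", "bad value")

# Parse the whole column at once; entries that do not match the format become NA
x <- as.POSIXct(times_chr, tz = "GMT", format = "%H:%M:%S")

# Flag the entries that failed to convert
failed <- is.na(x)

# Build the two boundaries the same way so they share the same date
lower <- as.POSIXct("06:00:00", tz = "GMT", format = "%H:%M:%S")
upper <- as.POSIXct("12:00:00", tz = "GMT", format = "%H:%M:%S")

# Vectorised comparison; NA inputs stay NA in the result
position <- ifelse(x >= lower & x < upper, "first", "not first")
print(position)
```

As with the chron version, position[z] gives the result for a single row.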

How to assign value to a date in an xts object in R

I assumed the following code
date = as.Date('2015-05-30')
timeseries = xts()
timeseries[date] = 1
should assign the value of 1 to a date '2015-05-30'. However, it gives me an error
Error in xts(rep(NA, length(index(x))), index(x)) :
order.by requires an appropriate time-based object
What is the proper way to assign the value to an empty xts object?
Thanks,
Vladimir
I think you misunderstand the purpose of the [<-.xts function. You're asking to replace the value at date "2015-05-30" with 1, but your xts object has no data, so there's nothing to replace. What are you actually trying to accomplish?
If you want to insert, you should call rbind(xts(1, as.Date('2015-05-30')), timeseries).
And you should heed Mike Wise's wise advice: it is very inefficient to grow objects like this.
Try something like this:
d1 <- rep(1,21)
d2 <- seq(as.Date("2001-01-01",tz="GMT"),as.Date("2021-01-01",tz="GMT"),length.out=21)
xtsdat <- as.xts(d1,d2)
If you need to build it up row by row, then build the individual vectors that way and form the xts at the end.

Selecting dates in a data.table for new column in r

I have a data table with 65 variables. I want to create a new column, Semester, that assigns Semester 1 to all IDs dated before 2015-03-31 (all others are Semester 2).
students<-data.table(studid=c(1:6) ,FAC = c("IT","SCIENCE","LAW","IT","COMMERCE","COMMERCE"),dates = c("2010-12-01","2010-03-01", "2010-03-01","2010-05-20", "2010-03-01","2010-03-31"))
I have set the date class:
students$dates<-as.Date(students$dates)
I have then specified the new column:
students[,Semester:=2,]
Then I have tried:
students$Semester[students$dates < 2015-05-31]<-1
But this does not work. Any advice?
First of all, I would recommend using proper data.table syntax. Operators such as $ and <- are base R syntax, which doesn't take advantage of data.table's capabilities. Please read the vignettes in this link
In other words, converting to date, for example, is done like this (no need for <- or $):
students[, dates := as.IDate(dates)]
This updates your data by reference.
Second of all, when you write just 2015-05-31, you are basically writing an arithmetic expression: 2015-05-31 = 1979. Paste it into the console and see what you get. In other words, you need to quote "2015-05-31" so R knows it's a string (which is then converted to the Date class when it is compared with <).
Finally, here's the solution using data.table syntax
students[dates < "2015-05-31", Semester := 1]
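The arithmetic point above can be checked in plain base R, without data.table. This is just a small sketch; the vector `d` holds made-up sample dates and the other names are illustrative.

```r
# Unquoted, the "date" is just integer subtraction: 2015 - 5 - 31
unquoted <- 2015-05-31
print(unquoted)

# Two sample dates (illustrative values, not from the question)
d <- as.Date(c("2010-12-01", "2016-01-15"))

# Comparing against the number 1979 compares the dates' internal day counts
# (days since 1970-01-01), so neither of these dates is "before" it
before_number <- d < unquoted

# Comparing against a quoted string converts the string to a Date first
before_string <- d < "2015-05-31"
print(before_string)
```

This is why the unquoted version in the question runs without an error but never selects any rows.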

How to validate date in R

I have a date in the format dd-mm-yyyy HH:mm:ss
What is the best and easiest way to validate this date?
I tried
d <- format.Date(date, format="%d-%m-%Y %H:%M:%S")
But how can I catch the error when an illegal date is passed?
Simple way:
d <- try(as.Date(date, format="%d-%m-%Y %H:%M:%S"))
if("try-error" %in% class(d) || is.na(d)) {
print("That wasn't correct!")
}
Explanation: format.Date uses as.Date internally to convert date into an object of the Date class. However, it does not use a format option, so as.Date uses the default formats, which are %Y-%m-%d and then %Y/%m/%d.
The format option from format.Date is used only for the output, not for the parsing. Quoting from the as.Date man page:
The ‘as.Date’ methods accept character strings, factors, logical
‘NA’ and objects of classes ‘"POSIXlt"’ and ‘"POSIXct"’. (The
last is converted to days by ignoring the time after midnight in
the representation of the time in specified timezone, default
UTC.) Also objects of class ‘"date"’ (from package ‘date’) and
‘"dates"’ (from package ‘chron’). Character strings are processed
as far as necessary for the format specified: any trailing
characters are ignored.
However, when you directly call as.Date with a format specification, nothing else will be allowed than what fits your format.
See also: ?as.Date
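When an explicit format is supplied, a string that does not fit it typically comes back as NA rather than raising an error, so the NA check does most of the work. Below is a minimal vectorised sketch of that idea using strptime, which also checks the time part. The helper name `is_valid_datetime` and the sample inputs are made up for illustration.

```r
# Hypothetical helper: TRUE where x parses under the given format, FALSE otherwise
is_valid_datetime <- function(x, format = "%d-%m-%Y %H:%M:%S") {
  # strptime() yields NA (not an error) for strings that do not fit the format
  # or that describe an impossible date such as 31 February
  !is.na(strptime(x, format = format, tz = "UTC"))
}

# Made-up inputs: one valid value, one impossible date, one non-date string
candidates <- c("09-10-2015 14:30:00", "31-02-2015 14:30:00", "not a date")
valid <- is_valid_datetime(candidates)
print(valid)
```

Keep in mind the caveat quoted from the man page: trailing characters after a successful match are ignored, so this check alone does not reject input with extra text at the end.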
You may want to look at the gsubfn package. It has functions (gsubfn specifically) that work like other regular expression functions to match pieces of a string, but it then calls a user-supplied function and passes the matching pieces to it. So you would write your own function that looks at the year, month, and day and makes sure that they are in the correct ranges (the range for day can depend on the passed month and year).
This might be helpful if flexibility is desired in a date-time entry.
I have a function where I want to allow either a date-only entry or a date-time entry, then set a flag - for use inside the function only. I'm calling this flag data_type. The flag will be used later in the larger function to select units for getting a difference in two dates with difftime. (In most cases, the function will be perfectly fine with date only, but in some cases a user might need a shorter time frame. I don't want to inconvenience users with the shorter time frame if they don't need it.)
I am posting this for two reasons: 1) to help anyone trying to allow flexibility in date arguments and 2) to welcome sanity checks in case there's a problem with the method, since this is going into a function in an R package.
dat_time_check_fn <- function(dat_time) {
if (!anyNA(as.Date(dat_time, format= "%Y-%m-%d %H:%M:%S"))) date_type <- 1
else if (!anyNA(as.Date(dat_time, format= "%Y-%m-%d"))) date_type <- 2
else stop("Error: dates must either be in format '1999-12-31' or '1999-12-31 23:59:59' ")
date_type
}
Date-time case
date5 <- "1999-12-31 23:59:59"
date_type <- dat_time_check_fn(date5)
date_type
[1] 1
Date only case:
date6 <- "1999-12-31"
date_type <- dat_time_check_fn(date6)
date_type
[1] 2
Note that if the order above in the function is reversed, the longer date-time can be inadvertently coerced to the shorter version and both types result in date_type = 1.
My larger function has more than one date, but I need them to be compatible. Below, I'm checking the two dates checked above, where one was type 1 and one was type 2. Combining types gives the result with date only (type 2):
date_type <- dat_time_check_fn(c(date5, date6))
date_type
[1] 2
Here's a non-compliant version:
date7 <- "1/31/2011"
date_type <- dat_time_check_fn(date7)
Error in dat_time_check_fn(date7) :
Error: dates must either be in format '1999-12-31' or '1999-12-31 23:59:59'
Many solutions here are prone to SQL injection. They return TRUE for date = "2020-08-11; DROP * FROM my_table". Here is a vectorized base R function that works with NA:
is_date = function(x, format = NULL) {
formatted = try(as.Date(x, format), silent = TRUE)
is_date = as.character(formatted) == x & !is.na(formatted) # valid and identical to input
is_date[is.na(x)] = NA # Insert NA for NA in x
return(is_date)
}
Let's try:
> is_date(c("2020-08-11", "2020-13-32", "2020-08-11; DROP * FROM table", NA), format = "%Y-%m-%d")
## TRUE FALSE FALSE NA
I believe that what you are looking for is the tryCatch function.
The following is an excerpt from a script I wrote which accepts any .csv file with two series that have a common x axis. The first column in 'data' is the common x axis variable, and columns 2 & 3 are the y axis variables. I needed the tryCatch statement to make sure the script would create a plot regardless of whether the x axis data is a time series or some other type of variable.
### READ DATA FROM A CSV FILE
data = read.csv("STLDvsNEM2.csv", header = TRUE)
#CONVERT FIRST ROW OF DATA (IN MY CASE, THE COLUMN INTENDED TO BE THE X AXIS)
#TO AN ACCEPTABLE DATE FORMAT
#IF FIRST ROW OF DATA IS NOT IN AN ACCEPTABLE DATE FORMAT
#USE THE VALUE WITHOUT ANY TRANSFORMATION
x <- tryCatch({
as.Date(data[,1])},
warning = function(w) {},
error = function(e) {
x <- data[,1]
})
y1 <- data[,2]
y2 <- data[,3]

Converting time interval in R

My knowledge and experience of R is limited, so please bear with me.
I have measurements of duration in the following form:
d+h:m:s.s
e.g. 3+23:12:11.931139, where d=days, h=hours, m=minutes, and s.s=decimal seconds. I would like to create a histogram of these values.
Is there a simple way to convert such string input into a numerical form, such as seconds? All the information I have found seems to be geared towards date-time objects.
Ideally I would like to be able to pipe a list of data to R on the command line and so create the histogram on the fly.
Cheers
Loris
Another solution based on SO:
op <- options(digits.secs=10)
z <- strptime("3+23:12:11.931139", "%d+%H:%M:%OS")
vec_z <- z + rnorm(100000)
hist(vec_z, breaks=20)
Short explanation: First, I set the option so that fractional seconds are shown. Then I parse your string into a date-time object; if you now type z into the console, you get "2012-05-03 23:12:11.93113". Then I create some more date-times and plot a histogram. I think the important step for you is the parsing, and strptime should help you with that.
I would do it like this:
str = "3+23:12:11.931139"
result = sum(as.numeric(unlist(strsplit(str, "[:\\+]", perl = TRUE))) * c(24*60*60, 60*60, 60, 1))
> result
[1] 342731.9
Then, you can wrap it into a function and apply over the list or vector.
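A minimal sketch of that wrapping step, assuming every input has exactly the d+h:m:s.s layout from the question. The function name `duration_to_seconds` and the second sample value are made up for illustration.

```r
# Hypothetical helper: convert "d+h:m:s.s" strings to seconds, one value per input
duration_to_seconds <- function(x) {
  # Split each string on "+" or ":" into days, hours, minutes, seconds
  parts <- strsplit(x, "[+:]")
  # Weight each piece by its length in seconds and add them up
  vapply(parts,
         function(p) sum(as.numeric(p) * c(24 * 60 * 60, 60 * 60, 60, 1)),
         numeric(1))
}

# First value is from the question; the second is a made-up example
durations <- c("3+23:12:11.931139", "0+00:01:30.5")
secs <- duration_to_seconds(durations)
print(secs)
```

The resulting numeric vector can be passed straight to hist(). For the command-line use mentioned in the question, one option would be to read the input lines with readLines(file("stdin")) in a script run via Rscript.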

Resources