Formatting 24-hour time variable to capture observations in different ranges - r

I currently have a data frame with a column for Start.Time (imported from a *.csv file), and the format is in 24 hour format (e.g., 20:00:00 equals 8pm). My goal is to capture observations with a start time in various intervals (e.g., between 9:00:00 and 10:00:00), which also meet other criteria. However, it seems that R sorts this 'character' variable in a way that does not align with how our day goes (e.g., 14:00:00 is considered a lower value than 9:00:00).
For example, below is a line of code that works as intended, where I am capturing observations on two different trail segments, which had a start time between 8:00:00 and 9:00:00.
RLLtoMist8.9<-sum((dataset1$Trail.Segment==52|dataset1$Trail.Segment==55) &
(dataset1$Start.Time>="8:00" & dataset1$Start.Time < "9:00"),
na.rm=TRUE)
RLLtoMist8.9
But, this code below does not work as intended, as R is 'valuing' 9:00:00 as greater than 10:00:00.
RLLtoMist9.10 <-
sum((dataset1$Trail.Segment==52|dataset1$Trail.Segment==55) &
(dataset1$Start.Time>="9:00:00 AM" & dataset1$Start.Time < "10:00:00 AM"),
na.rm=TRUE)

It's certainly true that character values are sorted so that "14:00" is less than "9:00". However, R has a datetime class that sorts times correctly once the character representation has been parsed.
a <- as.POSIXct("14:00", format="%H:%M")
b <- as.POSIXct("8:00", format="%H:%M")
# test
> a < b
[1] FALSE
You would be able to convert an entire column with:
dataset1$Start.Time <- as.POSIXct(dataset1$Start.Time, format="%H:%M")
The dates of a and b were taken from the system date at the time of conversion, so if you printed them you would see full dates and times in the default format. There are packages, such as chron, that let you work with times alone, but POSIXt objects necessarily carry both a date and a time; see ?DateTimeClasses. The lubridate package also has an 'interval' class, and there is a difftime function in base R.
There are also the seq.POSIXt and cut.POSIXt functions, either of which can be used to create multiple time or date boundaries for categorical transformations of datetimes.
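For example, a minimal sketch building on the conversion above (cut.POSIXt buckets by hour; the explicit comparison assumes all values were parsed on the same day, so they share the system date):
# bucket start times by hour with cut.POSIXt, then tabulate
dataset1$Start.Hour <- cut(dataset1$Start.Time, breaks = "hour")
table(dataset1$Start.Hour)
# or compare against parsed boundaries directly
RLLtoMist9.10 <- sum((dataset1$Trail.Segment == 52 | dataset1$Trail.Segment == 55) &
                     dataset1$Start.Time >= as.POSIXct("09:00:00", format = "%H:%M:%S") &
                     dataset1$Start.Time <  as.POSIXct("10:00:00", format = "%H:%M:%S"),
                     na.rm = TRUE)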

Using the data.table library:
library(data.table)

# convert to a data.table
dataset1 <- data.table(dataset1)
# convert Start.Time to a datetime rather than character
dataset1[, Start.Time := as.POSIXct(Start.Time, format = "%H:%M:%S")]
# now do your filtering
dataset1[between(Start.Time,
                 as.POSIXct("09:00:00", format = "%H:%M:%S"),
                 as.POSIXct("10:00:00", format = "%H:%M:%S")) &
           (Trail.Segment == 52 | Trail.Segment == 55)]
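If you only need the count, as in the original sum(), data.table's .N returns the number of matching rows (a small sketch with the same assumed column names):
dataset1[between(Start.Time,
                 as.POSIXct("09:00:00", format = "%H:%M:%S"),
                 as.POSIXct("10:00:00", format = "%H:%M:%S")) &
           Trail.Segment %in% c(52, 55), .N]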

Related

Mixed Date formats in R data frame

How do you work with a column of mixed date formats, for example 8/2/2020 and 2/7/2020, where all of the dates fall in February?
I have tried zoo::as.Date(mixeddatescolumn, "%d/%m/%Y"); the first one comes out right but the second is wrong.
I have also tried the solutions in Fixing mixed date formats in data frame?, but that question seems different from what I am handling.
Even for a human it is tricky to know whether a date like '8/2/2020' means 8 February or 2 August. However, we can leverage the fact that all these dates are in February: remove the "2" that represents the month, rearrange the remaining parts into one standard format, and then convert the result to an actual Date object.
x <- c('8/2/2020','2/7/2020')
lubridate::mdy(paste0('2/', sub('2/', '', x, fixed = TRUE)))
#[1] "2020-02-08" "2020-02-07"
Or the same in base R:
as.Date(paste0('2/', sub('2/', '', x, fixed = TRUE)), "%m/%d/%Y")
Since we know that every date is in February, search for /2/ or /02/; if it is found, the middle number is the month, otherwise the first number is the month. In either case set the format appropriately and use as.Date. No packages are used.
dates <- c("8/2/2020", "2/7/2020", "2/28/2000", "28/2/2000") # test data
as.Date(dates, ifelse(grepl("/0?2/", dates), "%d/%m/%Y", "%m/%d/%Y"))
## [1] "2020-02-08" "2020-02-07" "2000-02-28" "2000-02-28"

R time interval data type

Is there a time interval data (variable) type in R? I have a CSV file with datetime and timeinterval columns. The data type of the datetime column can be POSIXlt, but I don't know how to set a time interval data type for the other column. Is it possible, or what is the best way to handle time intervals in R?
The time interval values in my CSV file look like this [<number of days> %H:%M:%S]:
'0 20:32:59'
In Python pandas, there is a timedelta64[ns] data type for time intervals.
Thank you!
Split the strings into the number of days and the time, using stringi, then use lubridate to manipulate the components.
library(stringi)
library(lubridate)
In the following example:
([0-9]+) means capture one or more digits.
A space followed by + means one or more spaces (not captured).
([0-9]{2}:[0-9]{2}:[0-9]{2}) means capture 2 digits, a colon, 2 digits, another colon, and 2 more digits.
x <- "0 20:32:59"
matches <- stri_match_first_regex(x, "([0-9]+) +([0-9]{2}:[0-9]{2}:[0-9]{2})")
The number of days is the second column, and the hours/minutes/seconds are in the third column.
days creates a period of the number of days; hms creates a period of hours, minutes, and seconds.
n_days <- days(as.integer(matches[, 2]))
time <- hms(matches[, 3])
Now your total is just n_days + time, though presumably you want this relative to some origin, for example:
Sys.time() + n_days + time
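If you would rather store the result as a plain numeric or difftime column, a short sketch using lubridate's period_to_seconds() (not part of the original answer):
total_seconds <- period_to_seconds(n_days + time)  # numeric seconds
as.difftime(total_seconds, units = "secs")          # or keep it as a difftime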
Yes, see ?difftime.
If your csv is already split into columns, apply as.difftime to one and as.POSIXlt to the other.
For example:
as.difftime(0, units="days") + as.POSIXlt("20:32:59", format="%H:%M:%S")
[Edit]
If the entire result is to be an interval, this would do it:
as.difftime(0, units="days") + as.difftime("20:32:59", format="%H:%M:%S")
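If the days and the time are still combined in a single column like '0 20:32:59', here is a base-R sketch (the read.table split and column names are illustrative, not part of the original answer):
# split "<days> <H:M:S>" on whitespace into two columns
parts <- read.table(text = "0 20:32:59", col.names = c("days", "time"))
# combine the two pieces into a single difftime (result is in seconds)
as.difftime(parts$days, units = "days") +
  as.difftime(parts$time, format = "%H:%M:%S", units = "secs")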

R difftime subtracts 2 days

I have some timedelta strings which were exported from Python. I'm trying to import them for use in R, but I'm getting some weird results.
When the timedeltas are small, I get results that are off by 2 days, e.g.:
> as.difftime('26 days 04:53:36.000000000',format='%d days %H:%M:%S.000000000')
Time difference of 24.20389 days
When they are larger, it doesn't work at all:
> as.difftime('36 days 04:53:36.000000000',format='%d days %H:%M:%S.000000000')
Time difference of NA secs
I also read into R some timedelta objects I had processed with Python and had a similar issue with the 26 days 04:53:36.000000000 format. As Gregor said, %d in strptime is the day of the month as a zero-padded decimal number, so it won't work with numbers greater than 31, and there doesn't seem to be an option for cumulative days (probably because strptime is for date-time objects, not timedelta objects).
My solution was to convert the objects to strings and extract the numerical components, as Gregor suggested, using gsub.
# convert to strings
data$tdelta <- as.character(data$tdelta)
# extract numerical data
days <- as.numeric(gsub('^([0-9]+) days.*$','\\1',data$tdelta))  # anchored at the start so the full day count is captured
hours <- as.numeric(gsub('^.*ys ([0-9]+):.*$','\\1',data$tdelta))
minutes <- as.numeric(gsub('^.*:([0-9]+):.*$','\\1',data$tdelta))
seconds <- as.numeric(gsub('^.*:([0-9]+)..*$','\\1',data$tdelta))
# add up numerical components to whatever units you want
time_diff_seconds <- seconds + minutes*60 + hours*60*60 + days*24*60*60
# add column to data frame
data$tdelta <- time_diff_seconds
That should allow you to do computations with the time differences. Hope that helps.
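Optionally, instead of storing plain seconds, wrap the result in a difftime so the units travel with the values (a small addition, not part of the original answer):
data$tdelta <- as.difftime(time_diff_seconds, units = "secs")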

Converting time format to numeric with R

In most cases we convert numeric time to POSIXct format in R. However, if we want to compare two time points, the numeric format is preferable. For example, I have a datetime like "2001-03-13 10:31:00":
begin <- "2001-03-13 10:31:00"
Using R, I want to convert this into a numeric value (e.g., Julian time), such as the number of seconds elapsed between 1970-01-01 00:00:00 and 2001-03-13 10:31:00.
Do you have any suggestions?
The Julian calendar began in 45 BC (709 AUC) as a reform of the Roman calendar by Julius Caesar. It was chosen after consultation with the astronomer Sosigenes of Alexandria and was probably designed to approximate the tropical year (known at least since Hipparchus). see http://en.wikipedia.org/wiki/Julian_calendar
If you just want to remove ":" , " ", and "-" from a character vector then this will suffice:
end <- gsub("[: -]", "" , begin, perl=TRUE)
#> end
#[1] "20010313103100"
You should read the section about 1/4 of the way down in ?regex about character classes. Since the "-" is special in that context as a range operator, it needs to be placed first or last.
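A quick illustration of that placement rule (both calls below behave the same; moving the "-" to the middle of the class would turn it into a range operator):
gsub("[: -]", "", "2001-03-13 10:31:00", perl=TRUE)   # "-" placed last: treated literally
gsub("[-: ]", "", "2001-03-13 10:31:00", perl=TRUE)   # "-" placed first: also literal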
After your edit, the answer is clearly what @joran wrote, except that you would first need to convert to a DateTime class:
as.numeric(as.POSIXct(begin))
#[1] 984497460
The other point to make is that comparison operators do work for Date and DateTime classed variables, so the conversion may not be necessary at all. This compares 'begin' to a time one second later and correctly reports that begin is earlier:
as.POSIXct(begin) < as.POSIXct(begin) + 1
#[1] TRUE
Based on the revised question this should do what you want:
begin <- "2001-03-13 10:31:00"
as.numeric(as.POSIXct(begin))
The result is a unix timestamp, the number of seconds since epoch, assuming the timestamp is in the local time zone.
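If you want the number to be independent of the machine's local time zone, pass tz explicitly (a sketch; use the zone your timestamps were actually recorded in):
as.numeric(as.POSIXct(begin, tz = "UTC"))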
Maybe this could also work:
library(lubridate)
...
df <- '24:00:00'
as.numeric(hms(df))
hms() parses the string into a period object, and as.numeric() then converts it to seconds. See the full documentation.
I tried this because I had trouble with data in that format that ran over 24 hours.
The example from the ?as.POSIXct help page gives
as.POSIXct(strptime(begin, "%Y-%m-%d %H:%M:%S"))
so for you it would be
as.numeric(as.POSIXct(strptime(begin, "%Y-%m-%d %H:%M:%S")))

R: Aggregating by dates with POSIXct?

I have some zoo series that use POSIXct index.
In order to aggregate by days I've tried these two ways:
aggregate(myzoo, format(index(myzoo), "%Y-%m-%d"), sum)
aggregate(myzoo, as.Date(index(myzoo)), sum)
I don't know why they don't give the same output.
myzoo series had the weekends removed. The "as.Date way" seems to be OK but the "format way" aggregation gives me data on the weekends.
Why?
Which one is the right?
I've even tried as.POSIXct(format(...)).
As I mentioned in my comment, you need to be careful when changing the format of a timestamp that includes time with a time zone, because it can get shifted between days. Without any data, it's hard to say exactly what your problem is, but you might also try apply.daily from xts:
apply.daily(myzoo, sum)
Here's a working example:
> library(xts)
> x <- zoo(2:20, as.POSIXct("2003-02-01") + (2:20) * 7200)
> apply.daily(x, sum)
2003-02-01 22:00:00 2003-02-02 16:00:00 
                 65                 144 
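To see why the two aggregations in the question can disagree, here is a small sketch with an assumed time zone: a timestamp near midnight can land on different calendar days depending on the zone used for the conversion (as.Date() on a POSIXct has historically defaulted to UTC, while format() uses the tzone/local zone).
t <- as.POSIXct("2003-02-01 22:00:00", tz = "America/New_York")
format(t, "%Y-%m-%d")                # "2003-02-01" (local zone)
as.Date(t, tz = "UTC")               # "2003-02-02" (shifted into the next day)
as.Date(t, tz = "America/New_York")  # "2003-02-01" (matches format())
# passing tz explicitly to whichever conversion you use keeps the grouping consistent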
