R: strptime() and is.na() unexpected results

I have a data frame with roughly 8 million rows and 3 columns. I used strptime() in the following manner:
df$date.time <- strptime(df$date.time, "%m/%d/%y %I:%M:%S %p")
This works fine for all but 1104 of the rows, which I checked using
df[is.na(df$date.time), ]
When I look at these "problem" data, the date.time entries seem to be formatted in the way I would expect. For example, here is an observation that comes up as a problem, but doesn't appear to be an NA:
id date.time outcome
observation543490 2012-03-11 02:14:01 C
What could possibly be going on here that is.na(df$date.time) returns a TRUE value for this row that has apparently been converted correctly?
Here's a reproducible example (if you're in CST):
is.na(strptime("03/11/12 2:14:01 AM", "%m/%d/%y %I:%M:%S %p", "CST6CDT"))
#[1] TRUE

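The problem is most likely that every time that returns NA does not exist in the timezone you're using, because of daylight saving time. In US Central time, for example, clocks jumped from 2:00 AM straight to 3:00 AM on 2012-03-11, so 02:14:01 on that date never occurred. strptime() returns a POSIXlt object, which can still store and print the parsed fields; that is why the row looks correctly converted even though is.na() reports TRUE.
Check with the data source to determine the timezone the data were recorded in, then set the tz argument to that value in your call to strptime().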
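As a rough illustration of the point above, the sketch below parses two hypothetical strings, one of which falls in the skipped hour. The sample values and the zone names are assumptions for the example; UTC is used only because it has no daylight saving transitions, and the zone the data were actually recorded in should be substituted.

```r
# Hypothetical sample: the second value falls in the hour skipped by
# US daylight saving time on 2012-03-11.
raw <- c("03/10/12 2:14:01 AM", "03/11/12 2:14:01 AM")
fmt <- "%m/%d/%y %I:%M:%S %p"

# In a zone that observes DST, the skipped time may come back as NA
# (the exact result can depend on platform and R version).
central <- as.POSIXct(raw, format = fmt, tz = "America/Chicago")
is.na(central)

# UTC has no DST transitions, so every wall-clock time exists there.
utc <- as.POSIXct(raw, format = fmt, tz = "UTC")
is.na(utc)
```

Comparing the two is.na() results on the full column is one way to confirm that the problem rows are all in the skipped hour.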

Related

Formatting 24-hour time variable to capture observations in different ranges

I currently have a data frame with a column for Start.Time (imported from a *.csv file), stored in 24-hour format (e.g., 20:00:00 equals 8pm). My goal is to capture observations with a start time in various intervals (e.g., between 9:00:00 and 10:00:00) that also meet other criteria. However, it seems that R sorts this 'character' variable in a way that does not align with how our day goes (e.g., 14:00:00 is considered a lower value than 9:00:00).
For example, below is a line of code that works as intended, where I am capturing observations on two different trail segments, which had a start time between 8:00:00 and 9:00:00.
RLLtoMist8.9<-sum((dataset1$Trail.Segment==52|dataset1$Trail.Segment==55) &
(dataset1$Start.Time>="8:00" & dataset1$Start.Time < "9:00"),
na.rm=TRUE)
RLLtoMist8.9
But, this code below does not work as intended, as R is 'valuing' 9:00:00 as greater than 10:00:00.
RLLtoMist9.10 <-
sum((dataset1$Trail.Segment==52|dataset1$Trail.Segment==55) &
(dataset1$Start.Time>="9:00:00 AM" & dataset1$Start.Time < "10:00:00 AM"),
na.rm=TRUE)
It's certainly true that character types are sorted so that "14:00" is less than "9:00": strings are compared character by character, and "1" sorts before "9". However, R has a datetime class that sorts times correctly once a character representation has been parsed.
a <- as.POSIXct("14:00", format="%H:%M")
b <- as.POSIXct("8:00", format="%H:%M")
# test
> a < b
[1] FALSE
You would be able to convert an entire column with:
dataset1$Start.Time <- as.POSIXct(dataset1$Start.Time, format="%H:%M")
The dates of a and b are the system date at the time of conversion, so if you print them you will see both a date and a time in the default format. There are packages, such as chron, that let you work with times alone, but POSIXt objects necessarily carry both a date and a time. See ?DateTimeClasses. The lubridate package also has an 'interval' class, and base R has a difftime function.
There are also the seq.POSIXt and cut.POSIXt functions, either of which could be used to create multiple time or date boundaries for categorical transformations of datetimes.
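A minimal sketch of both ideas, using a few made-up start times in place of dataset1$Start.Time (the vector and variable names here are hypothetical). It counts the values in one interval by direct comparison, then bins all values by hour with cut().

```r
# Hypothetical start times, standing in for dataset1$Start.Time
start_chr <- c("8:15:00", "9:30:00", "14:00:00", "9:05:00")

# Parse once; a fixed tz keeps all values on the same (current) date
start <- as.POSIXct(start_chr, format = "%H:%M:%S", tz = "UTC")

# Count start times in [9:00, 10:00)
lo <- as.POSIXct("9:00:00", format = "%H:%M:%S", tz = "UTC")
hi <- as.POSIXct("10:00:00", format = "%H:%M:%S", tz = "UTC")
n_9_10 <- sum(start >= lo & start < hi, na.rm = TRUE)

# Or bin every observation into hourly intervals in one step
hour_bin <- cut(start, breaks = "hour")
table(hour_bin)
```

The comparison approach mirrors the original sum(...) code, while cut() may be more convenient when many intervals are needed at once.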
Using the data.table package:
library(data.table)
# convert to a data.table
dataset1 <- as.data.table(dataset1)
# parse the character column into date-times
dataset1[, Start.Time := as.POSIXct(Start.Time, format="%H:%M:%S")]
# bounds of the interval of interest
lo <- as.POSIXct("09:00:00", format="%H:%M:%S")
hi <- as.POSIXct("10:00:00", format="%H:%M:%S")
# now do your filtering (between() includes both endpoints by default)
dataset1[between(Start.Time, lo, hi) & (Trail.Segment==52 | Trail.Segment==55)]

R am/pm datetime issue

I haven't found an answer that works for me yet. I have data with observations every 5 minutes. I have start and end time columns named (creatively) STARTTIME and ENDTIME. I read the data into R with read.csv. When I do, STARTTIME and ENDTIME are treated as factors. The "cells" in the data frame are populated with date-time values such as "7/2/2016 11:25:00 PM". So, I tried the following:
df$STARTTIME <- as.POSIXct(as.character(df$STARTTIME), format = "%m%d%Y %I:%M:%S %p", tz ="EST")
When I run that code the whole column is replaced with NA. Any Help is appreciated.
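Judging only from the code shown, one likely cause is that the format string has no slashes between %m, %d and %Y, so it cannot match values such as "7/2/2016 11:25:00 PM". The sketch below uses a small made-up factor in place of df$STARTTIME and adds the separators; everything else is kept as in the question.

```r
# Hypothetical stand-in for df$STARTTIME as read by read.csv
starttime <- factor(c("7/2/2016 11:25:00 PM", "7/2/2016 11:30:00 PM"))

# The format must mirror the text exactly, including the "/" separators
parsed <- as.POSIXct(as.character(starttime),
                     format = "%m/%d/%Y %I:%M:%S %p",
                     tz = "EST")

# Any remaining NA values point at rows with a different layout
anyNA(parsed)
```

Note that %p depends on the locale's AM/PM strings, so this assumes an English-style locale.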

Adding/subtracting date-time columns in data-frames

I have a data frame with two text columns which are essentially time-stamps columns namely BEFORE and AFTER and the format is 12/29/2016 4:29:00 PM. I want to compare if the gap between the BEFORE and AFTER time for each row is more than 5 minutes or not. Which package in R will allow subtraction between time stamp?
No extra packages are needed; base R can handle the date-time comparison.
First coerce your character values to a valid date-time structure. Here I am using POSIXct:
d1= as.POSIXct("12/29/2016 4:29:00 PM",format="%m/%d/%Y %H:%M:%S")
d2= as.POSIXct("12/30/2016 5:29:00 PM",format="%m/%d/%Y %H:%M:%S")
Then, to get the difference in minutes, use difftime, with the later time first so the result is positive:
difftime(d2,d1,units="mins")
Note that all of these functions are vectorized, so the same code works on whole vectors.
Handling AM/PM
The format above reads the hour with %H (24-hour clock) and ignores the AM/PM marker. To take AM/PM into account, add %p to the format and use it in conjunction with %I, not %H:
d1= as.POSIXct("12/28/2016 11:53:00 AM",
format="%m/%d/%Y %I:%M:%S %p")
d2= as.POSIXct("12/28/2016 12:03:00 PM",
format="%m/%d/%Y %I:%M:%S %p")
difftime(d2,d1,units="mins")
## Time difference of 10 mins
For more help on date/time formats, see ?strptime.
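Applied to whole columns, the approach above might look like the following sketch. The data frame contents, the fixed tz = "UTC", and the new column names gap_mins and over_5 are assumptions made for illustration.

```r
# Hypothetical data in the format described in the question
df <- data.frame(
  BEFORE = c("12/29/2016 4:29:00 PM", "12/29/2016 11:58:00 AM"),
  AFTER  = c("12/29/2016 4:32:00 PM", "12/29/2016 12:10:00 PM"),
  stringsAsFactors = FALSE
)

fmt <- "%m/%d/%Y %I:%M:%S %p"
before <- as.POSIXct(df$BEFORE, format = fmt, tz = "UTC")
after  <- as.POSIXct(df$AFTER,  format = fmt, tz = "UTC")

# Gap in minutes for every row, then a flag for gaps above 5 minutes
df$gap_mins <- as.numeric(difftime(after, before, units = "mins"))
df$over_5   <- df$gap_mins > 5
df
```

Setting units explicitly matters here, because difftime otherwise picks a unit based on the size of the differences.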

Problems using POSIXct with ">" and "<" in R

I'm getting inconsistent results when trying to subset data based on a date being before or after some POSIXct date and time. When I make a string of dates like this:
myDates <- c(as.POSIXct("2014-12-27 08:10:00 UTC"),
as.POSIXct("2014-12-27 08:15:00 UTC"),
as.POSIXct("2014-12-27 09:30:00 UTC"))
and then try to subset to find all the entries in myDates that were before 8:15 a.m. on Dec. 27, 2014 like this:
myDates[myDates < as.POSIXct("2014-12-27 08:15:00")]
that works fine and I get
"2014-12-27 08:10:00 PST"
(although I don't understand why it says "PST" for the time zone; that's where I am, but I set it to UTC).
However, my original date and time data were in Excel, where they were in numeric format. I imported them as a data.frame called Samples and converted the date and time column into POSIXct format by doing:
as.POSIXct(Samples$DateTime, origin = "1970-01-01", tz = "UTC")
Now, I'm having hair-pulling, head-onto-desk-bashing frustrations with subsetting those dates. Take one date in particular, x <- Samples$DateTime[34], which, according to the output R gives me, is "2014-12-27 08:10:00 UTC". If I check whether x < 2014-12-27 08:15, that should be true, and here's what I see:
x < as.POSIXct("2014-12-27 08:15:00 UTC")
TRUE
But x should NOT be less than 2014-12-27 8:09:00 UTC, right? This is what I see:
X < as.POSIXct("2014-12-27 08:09:00 UTC")
TRUE
Why, for the love of Pete, does R tell me that 8:10 is before 8:09?!? This doesn't seem to be a problem for data that I just type in like above, only for data I've imported from Excel.
You probably need to get everything in the same timezone first. Try
as.numeric(as.POSIXct("2014-12-27 08:10:00 UTC", tz="UTC"))
#[1] 1419667800
# equivalent to "2014-12-27 08:10:00 UTC"
vs.
as.numeric(as.POSIXct("2014-12-27 08:10:00 UTC"))
#[1] 1419631800
# equivalent to 8:10 in local timezone - in my case Aust. EST.
# "2014-12-27 08:10:00 AEST"
You can see that they are actually numerically different.
To fix this, specify tz= explicitly when you create the dates: the "UTC" in your text strings is not detected on input, so those strings are parsed in your local timezone (which is also why the output shows "PST").
Also, be really careful with variable names. Likely you just slipped when typing it in here, but in the description of the problem and the first logical comparison you used x, and in the second one you used X.
R is case-sensitive, so the second comparison would not use the date stored in x. If something else was stored under X, you may actually have been given the right answer to the question you asked.
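A small sketch of the comparison with the timezone given explicitly on both sides. The value of x is typed in here as a stand-in for Samples$DateTime[34], which is an assumption about the asker's data.

```r
# Stand-in for Samples$DateTime[34], created with an explicit timezone
x <- as.POSIXct("2014-12-27 08:10:00", tz = "UTC")

# Reference times built the same way, so both sides mean UTC
before_815 <- x < as.POSIXct("2014-12-27 08:15:00", tz = "UTC")
before_809 <- x < as.POSIXct("2014-12-27 08:09:00", tz = "UTC")

before_815            # 8:10 is before 8:15
before_809            # 8:10 is not before 8:09

# The underlying number of seconds since 1970-01-01 UTC
as.numeric(x)
```

When a comparison looks wrong, printing as.numeric() of both sides is a quick way to see whether they were parsed in different timezones.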

removing date from %d/%m/%Y %H:%M in R

The R code that I am working on is supposed to use data collected at five-minute intervals.
The data is saved in csv format. However, due to inconsistency in the data collected, the time column sometimes contains a full timestamp instead of just a time (dd/mm/yyyy HH:MM instead of HH:MM).
This causes an error in my system, as it reads the data as having multiple different values for the same time value. Therefore, I would like to drop the date from the timestamp so that the code only reads the time value.
My failed attempt was:
as.Date(data[[1]],"%H:%M")
which gave me all NA values for the time column.
I have searched for similar questions in SO, but I did not manage to find a clear answer to my question. Can anyone suggest me some possible functions to use?
I appreciate your help.
You could just strip the date portion of the text and then use as.POSIXct to convert them all to a %H:%M timestamp, e.g.:
x <- c("10:25","01/01/2014 10:30")
x <- gsub("^.+(\\d{2}:\\d{2})$","\\1",x)
as.POSIXct(x,format="%H:%M",tz="UTC")
#[1] "2014-06-02 10:25:00 UTC" "2014-06-02 10:30:00 UTC"
# (the date shown is simply the day the code was run, since the format has no date part)
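If only the time text is wanted, a variation on the same idea is to keep the result as character. This is a sketch under the assumption that every value ends in a two-digit HH:MM; the variable names are made up for the example.

```r
# Mixed input: some values are plain times, some carry a date
x <- c("10:25", "01/01/2014 10:30")

# Keep only the trailing HH:MM part of each value
time_only <- sub("^.*(\\d{2}:\\d{2})$", "\\1", x)

# Optional round trip through POSIXct to validate, then back to text
checked <- format(as.POSIXct(time_only, format = "%H:%M", tz = "UTC"),
                  "%H:%M")
checked
```

Values that do not parse as a time come back as NA from the round trip, which makes malformed rows easy to spot.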
