Converting character to date with time zone using looping functions - r

Given:
A data frame with three character columns.
Date Time Zone
1950-04-18 01:30 CST
1950-04-18 01:45 CST
1951-02-20 16:00 CST
1951-06-08 09:00 CST
1951-11-15 15:00 CST
1951-11-15 20:00 CST
Required:
1. Combine Date, Time, and Zone
2. Convert from Character to Date
What I have tried:
1. datetime <- paste(Date, Time)
2. strptime(datetime[1], "%Y-%m-%d %H:%M", tz=Zone[1])
This successfully parses the first element; however, I would like to convert the entire column using one of the looping functions, lapply or sapply.
How can I use loop functions to parse the entire vector?
NOTE: Forgot to mention earlier, the data contains various abbreviated time zones other than CST

Time zones can be tricky to handle because several naming formats are in use. Run OlsonNames() to list the time zone names your system recognizes.
The CST time zone in your example is not always supported, so you might get the following warning message when you try to use it:
In as.POSIXct.POSIXlt(x) : unknown timezone 'CST'
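Before parsing, it can help to check which abbreviations your system actually knows. The snippet below is a minimal sketch: the zones vector is a made-up sample standing in for the Zone column, and the result depends on the time zone database installed on your machine.

```r
# Hypothetical sample of abbreviations taken from a Zone column
zones <- c("CST", "GMT", "EST", "MST")

# TRUE where this system's time zone database knows the name
known <- zones %in% OlsonNames()

# Abbreviations that would need replacing before parsing
unknown_zones <- zones[!known]
print(unknown_zones)
```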
I've constructed an example dataset (see below) to show how you can update your datetime with timezone information. The following for loop:
for (i in 1:nrow(d))
  d$datetime[i] <- strftime(paste(d$Date, d$Time)[i],
                            format = "%Y-%m-%d %H:%M",
                            tz = as.character(d$Zone[i]),
                            usetz = TRUE)
will give:
> d
Date Time Zone datetime
1 1950-04-18 01:30 GMT 1950-04-18 01:30 GMT
2 1950-04-18 01:45 CET 1950-04-18 01:45 CET
3 1951-02-20 16:00 EET 1951-02-20 16:00 EET
4 1951-06-08 09:00 EST 1951-06-08 09:00 EST
5 1951-11-15 15:00 WET 1951-11-15 15:00 WET
6 1951-11-15 20:00 MST 1951-11-15 20:00 MST
As said, your dataset might contain time zone abbreviations that your system does not recognize. You could replace these with recognized names, for example with the help of a published list of time zone abbreviations.
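One way to do that replacement is a named lookup vector. The sketch below is only an illustration under stated assumptions: the tz_lookup mapping, the to_utc helper and the sample rows are mine, and I assume each abbreviation means a fixed standard-time offset (CST = UTC-6, and so on). Note that strftime returns character strings, so if you need real date-time values you could convert everything to a single POSIXct column in UTC like this:

```r
# Sample rows with US abbreviations (illustrative data)
d <- data.frame(Date = c("1950-04-18", "1951-06-08", "1951-11-15"),
                Time = c("01:30", "09:00", "20:00"),
                Zone = c("CST", "EST", "MST"),
                stringsAsFactors = FALSE)

# ASSUMPTION: each abbreviation is a fixed offset from UTC.
# The Etc/GMT+N names use the POSIX sign convention, so
# "Etc/GMT+6" means six hours BEHIND UTC.
tz_lookup <- c(CST = "Etc/GMT+6", EST = "Etc/GMT+5", MST = "Etc/GMT+7")
d$tzname <- unname(tz_lookup[d$Zone])

# Parse one string in its own zone; return seconds since the epoch
to_utc <- function(datetime, tz) {
  as.numeric(as.POSIXct(datetime, format = "%Y-%m-%d %H:%M", tz = tz))
}

secs <- mapply(to_utc, paste(d$Date, d$Time), d$tzname, USE.NAMES = FALSE)

# One POSIXct column, displayed in UTC
d$utc <- as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")
print(d)
```

A POSIXct vector can only carry one time zone attribute, which is why the sketch stores everything in UTC instead of keeping a separate zone per row.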
Used data:
d <- read.table(text="Date Time Zone
1950-04-18 01:30 GMT
1950-04-18 01:45 CET
1951-02-20 16:00 EET
1951-06-08 09:00 EST
1951-11-15 15:00 WET
1951-11-15 20:00 MST", header=TRUE, stringsAsFactors = FALSE)

I think this should work:
df1 <- data.frame(x = paste(df$Date, df$Time), Zone = df$Zone)
d <- mapply(FUN = strptime, x = df1$x, tz = as.character(df1$Zone), format = "%Y-%m-%d %H:%M", SIMPLIFY = FALSE, USE.NAMES = FALSE)

Related

Remove seconds from some observations to work in HM format using R

I have a column called "time" with some observations in "hours: minutes: seconds" and others only with "hours: minutes". I would like to remove the seconds and be left with only hours and minutes.
So far I have loaded the lubridate package and tried:
format(data$time ,format = "%H:%M")
but no change occurs.
And with:
data$time <- hm(data$time)
all the observations with h:m:s become NAs
What should I do?
You can use parse_date_time from lubridate to bring time into POSIXct format and then use format to keep the information that you need.
data <- data.frame(time = c('10:04:00', '14:00', '15:00', '12:34:56'))
data$time1 <- format(lubridate::parse_date_time(data$time, c('HMS', 'HM')), '%H:%M')
data
# time time1
#1 10:04:00 10:04
#2 14:00 14:00
#3 15:00 15:00
#4 12:34:56 12:34
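If you would rather not parse the values as times at all, a plain string approach also works. This is an alternative sketch in base R, not the lubridate method above, and it assumes every value is either H:M or H:M:S with two-digit minutes and seconds:

```r
data <- data.frame(time = c('10:04:00', '14:00', '15:00', '12:34:56'),
                   stringsAsFactors = FALSE)

# Drop a trailing ":SS" only when the value has three fields;
# values that are already H:M do not match and pass through unchanged
data$time1 <- sub("^([0-9]{1,2}:[0-9]{2}):[0-9]{2}$", "\\1", data$time)
print(data)
```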

How to calculate business hours between two dates when business hours vary depending on the day in R?

I'm trying to calculate business hours between two dates. Business hours vary depending on the day.
Weekdays have 15 business hours (8:00-23:00); Saturdays and Sundays have 12 business hours (9:00-21:00).
For example: start date 07/24/2020 22:20 (Friday) and end date 07/25/2020 21:20 (Saturday). Since I'm only interested in the business hours, the result should be 12.67 hours.
Here an example of the dataframe and desired output:
start_date end_date business_hours
07/24/2020 22:20 07/25/2020 21:20 12.67
07/14/2020 21:00 07/16/2020 09:30 18.50
07/18/2020 08:26 07/19/2020 10:00 13.00
07/10/2020 08:00 07/13/2020 11:00 42.00
Here is something you can try with lubridate. I adapted another function I had that I thought might be helpful.
First create a sequence of dates between the two dates of interest. Then create intervals based on business hours, checking each date if on the weekend or not.
Then, "clamp" the start and end times to the allowed business hours time intervals using pmin and pmax.
You can use time_length to get the time measurement of the intervals; summing them up will give you total time elapsed.
library(lubridate)
library(dplyr)
calc_bus_hours <- function(start, end) {
  # One entry per calendar day between start and end
  my_dates <- seq.Date(as.Date(start), as.Date(end), by = "day")
  # Business-hours interval for each day: 09:00-21:00 on weekends, 08:00-23:00 otherwise
  my_intervals <- if_else(weekdays(my_dates) %in% c("Saturday", "Sunday"),
                          interval(ymd_hm(paste(my_dates, "09:00"), tz = "UTC"), ymd_hm(paste(my_dates, "21:00"), tz = "UTC")),
                          interval(ymd_hm(paste(my_dates, "08:00"), tz = "UTC"), ymd_hm(paste(my_dates, "23:00"), tz = "UTC")))
  # Clamp the first interval's start and the last interval's end to the actual start and end times
  int_start(my_intervals[1]) <- pmax(pmin(start, int_end(my_intervals[1])), int_start(my_intervals[1]))
  int_end(my_intervals[length(my_intervals)]) <- pmax(pmin(end, int_end(my_intervals[length(my_intervals)])), int_start(my_intervals[length(my_intervals)]))
  sum(time_length(my_intervals, "hour"))
}
calc_bus_hours(as.POSIXct("07/24/2020 22:20", format = "%m/%d/%Y %H:%M", tz = "UTC"), as.POSIXct("07/25/2020 21:20", format = "%m/%d/%Y %H:%M", tz = "UTC"))
[1] 12.66667
Edit: For Spanish language, use c("sábado", "domingo") instead of c("Saturday", "Sunday")
For the data frame example, you can use mapply to call the function using the two selected columns as arguments. Try:
df$business_hours <- mapply(calc_bus_hours, df$start_date, df$end_date)
start end business_hours
1 2020-07-24 22:20:00 2020-07-25 21:20:00 12.66667
2 2020-07-14 21:00:00 2020-07-16 09:30:00 18.50000
3 2020-07-18 08:26:00 2020-07-19 10:00:00 13.00000
4 2020-07-10 08:00:00 2020-07-13 11:00:00 42.00000
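The same clamping idea can also be written without lubridate. The function below is a rough base R sketch, not the code above: the name bus_hours is made up, it assumes POSIXct inputs in a single time zone, and it detects weekends by day number rather than by day name, so it does not depend on the locale.

```r
bus_hours <- function(start, end, tz = "UTC") {
  # One entry per calendar day between start and end
  days <- seq(as.Date(start, tz = tz), as.Date(end, tz = tz), by = "day")

  # wday: 0 = Sunday, 6 = Saturday (independent of the locale)
  weekend <- as.POSIXlt(days)$wday %in% c(0, 6)

  # Opening and closing times for each day, in seconds since the epoch
  midnight <- as.numeric(as.POSIXct(format(days), tz = tz))
  open  <- midnight + ifelse(weekend, 9, 8) * 3600
  close <- midnight + ifelse(weekend, 21, 23) * 3600

  # Clamp each day's window to [start, end]; negative spans count as zero
  lo <- pmax(open, as.numeric(start))
  hi <- pmin(close, as.numeric(end))
  sum(pmax(hi - lo, 0)) / 3600
}

s <- as.POSIXct("07/24/2020 22:20", format = "%m/%d/%Y %H:%M", tz = "UTC")
e <- as.POSIXct("07/25/2020 21:20", format = "%m/%d/%Y %H:%M", tz = "UTC")
print(bus_hours(s, e))
```

For the first row of the example this adds 40 minutes on Friday (22:20 to 23:00) and 12 hours on Saturday (09:00 to 21:00).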

Time Series within R (ColumnSorting)

I have a csv of real-time data inputs with timestamps and I am looking to group these data in a time series of 30 mins for analysis.
A sample of the real-time data is
Date:
2019-06-01 08:03:04
2019-06-01 08:20:04
2019-06-01 08:33:04
2019-06-01 08:54:04
...
I am looking to group them in a table with a step increment of 30 mins (i.e. 08:30, 09:00, etc.) to find the number of occurrences during each period. I created a new csv file to access through R, so that I will not corrupt the formatting of the original dataset.
Date:
2019-06-01 08:00
2019-06-01 08:30
2019-06-01 09:00
2019-06-01 09:30
I have firstly constructed a list of 30 mins intervals by:
sheet_csv$Date <- as.POSIXct(paste(sheet_csv$Date), format = "%Y-%m-%d %H:%M", tz = "GMT") #to change to POSIXct
sheet_csv$Date <- timeDate::timeSequence(from = "2019-06-01 08:00", to = "2019-12-03 09:30", by = 1800,
format = "%Y-%m-%d %H:%M", zone = "GMT")
I encountered an error "Error in x[[idx]][[1]] : this S4 class is not subsettable" for this interval.
I am relatively new to R. Please do help out where you can. Greatly Appreciated.
You probably don't need the timeDate package for something like this. One package that is very helpful for manipulating dates and times is lubridate, which you may want to consider going forward.
I used your example and added another date/time for illustration.
To create your 30 minute intervals, you could use cut and seq.POSIXt to create a sequence of date/times with 30 minute breaks. I used your minimum date/time to start with (rounding down to nearest hour) but you can also specify another date/time here.
Calling table on the result of cut then gives you the frequencies.
sheet_csv <- data.frame(
Date = c("2019-06-01 08:03:04",
"2019-06-01 08:20:04",
"2019-06-01 08:33:04",
"2019-06-01 08:54:04",
"2019-06-01 10:21:04")
)
sheet_csv$Date <- as.POSIXct(sheet_csv$Date, format = "%Y-%m-%d %H:%M:%S", tz = "GMT")
# trunc() rounds the earliest timestamp down to the hour, so it always falls inside the first bin
as.data.frame(table(cut(sheet_csv$Date,
                        breaks = seq.POSIXt(from = trunc(min(sheet_csv$Date), "hours"),
                                            to = max(sheet_csv$Date) + .5 * 60 * 60,
                                            by = "30 min"))))
Output
Var1 Freq
1 2019-06-01 08:00:00 2
2 2019-06-01 08:30:00 2
3 2019-06-01 09:00:00 0
4 2019-06-01 09:30:00 0
5 2019-06-01 10:00:00 1
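If you only need the bins that actually contain observations, you can also floor each timestamp to its 30-minute slot with plain arithmetic. This is a small sketch of that alternative, with the names stamps, bin_start and counts chosen just for illustration; unlike cut, it does not list empty intervals.

```r
stamps <- as.POSIXct(c("2019-06-01 08:03:04",
                       "2019-06-01 08:20:04",
                       "2019-06-01 08:33:04",
                       "2019-06-01 08:54:04",
                       "2019-06-01 10:21:04"), tz = "GMT")

# 1800 seconds = 30 minutes; floor each timestamp to the start of its slot
bin_start <- as.POSIXct(floor(as.numeric(stamps) / 1800) * 1800,
                        origin = "1970-01-01", tz = "GMT")

# Count observations per slot (only non-empty slots appear)
counts <- table(format(bin_start, "%Y-%m-%d %H:%M"))
print(counts)
```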

Filtering out time data from R data frame

So I have a dataset in R:
IncidentID Time Vehicle
19002 4:48 Car
19003 12:30 Motorcycle
19004 14:00 Car
19005 9:30 Bicycle
And I'm trying to filter out some data, since it's quite a large dataset. The above is just a few example rows.
I want to filter the data according to the time. Say I want to obtain the rows where the Time is between 12pm and 6pm (18:00 in 24-hour format); then I would have:
IncidentID Time Vehicle
19003 12:30 Motorcycle
19004 14:00 Car
I did:
incident <- read.csv("incident.csv")
afternoon_incident <- incident[which(incident$Time >= 12 && incident$Time <= 18),]
But I'm getting the error saying:
1: In Ops.factor(web$Time, 6:0) : ‘>=’ not meaningful for factors
2: In Ops.factor(web$Time, 12:0) : ‘<=’ not meaningful for factors
You can use lubridate to convert Time field into time object and then extract hour for filtering:
library(lubridate)
incident$Time <- hm(as.character(incident$Time))
incident[which(hour(incident$Time) >= 12 & hour(incident$Time) <= 18), ]
You first need to convert Time into an actual date-time object using as.POSIXct and then compare.
As you want to subset based on the hour, we can extract only the hour part using format and keep the rows that fall between hours 12 and 18. Using base R, we can do
df$hour <- as.numeric(format(as.POSIXct(df$Time, format = "%H:%M"), "%H"))
subset(df, hour >= 12 & hour <= 18)
# IncidentID Time Vehicle hour
#2 19003 12:30 Motorcycle 12
#3 19004 14:00 Car 14
You can remove the hour column later if not needed.
For a general solution, we can create a date-time column and then compare
df$datetime <- as.POSIXct(df$Time, format = "%H:%M")
subset(df, datetime >= as.POSIXct("12:30:00", format = "%T") &
datetime <= as.POSIXct("18:30:00", format = "%T"))
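Another option that avoids date-time classes entirely is to turn each H:MM value into minutes since midnight and compare numbers. The following is a sketch under the assumption that Time always has the form H:MM or HH:MM; the sample data frame just mirrors the rows from the question, and as.character guards against Time having been read in as a factor.

```r
incident <- data.frame(IncidentID = c(19002, 19003, 19004, 19005),
                       Time = c("4:48", "12:30", "14:00", "9:30"),
                       Vehicle = c("Car", "Motorcycle", "Car", "Bicycle"),
                       stringsAsFactors = FALSE)

# Split "H:MM" into hours and minutes, then convert to minutes since midnight
parts <- strsplit(as.character(incident$Time), ":", fixed = TRUE)
minutes <- vapply(parts,
                  function(p) as.numeric(p[1]) * 60 + as.numeric(p[2]),
                  numeric(1))

# Keep rows from 12:00 up to and including 18:00; note the vectorised &, not &&
afternoon_incident <- incident[minutes >= 12 * 60 & minutes <= 18 * 60, ]
print(afternoon_incident)
```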

Add dates with missing values (-999) to top and bottom of the year

I have several time series of hourly data that I am working with. Is there a way to add the date and missing values only to the beginning and end of the years in which the time series starts and ends? So for the data posted I would like to fill the data back to the beginning of 1990 and forward to the end of 2008. The only way I can see to do it is with a very large number of loops. I have looked at dplyr, zoo, and seq for this task but cannot see how to fill only the years the data is taken in, and in a concise manner. I would like to make a loop that will work on all of my different time series, rather than changing the script for each time series. I am new to R so any assistance would be helpful!
My data:
date O3
9/15/1990 0:00 24
9/15/1990 1:00 28
9/15/1990 2:00 26
9/15/1990 3:00 25
9/15/1990 4:00 -999
9/15/1990 5:00 18
9/15/1990 6:00 17
The end of the data looks like this:
1/31/2008 19:00 -999
1/31/2008 20:00 -999
1/31/2008 21:00 -999
1/31/2008 22:00 -999
1/31/2008 23:00 -999
This is my current script:
library(openair)
library(plyr)
filedir <- "C:/Users/dfmcg/Documents/Thesisfiles/removedleapyears"
myfiles <- c(list.files(path = filedir))
paste(filedir, myfiles, sep = '/')
npsfiles <- c(paste(filedir, myfiles,sep = '/'))
for (i in npsfiles[1:28]) {
timeozone <- import(i, date ="date", date.format = "%m/%d/%Y %H", header = TRUE, na.strings = "-999")
ts <- seq.POSIXt(as.POSIXct("1990-01-01 0:00",'%Y-%m-%d %H'), as.POSIXct("2015-12-31 23:00",'%Y-%m-%d %H'), by="hour")
ts <- seq.POSIXt(as.POSIXlt("1990-01-01 0:00:00"), as.POSIXlt("2015-12-31 0:00:00"), by="hour")
ts <- format.POSIXct(ts,'%Y-%m-%D %H')
df <- data.frame(date=ts)
data_with_missing_times <- join(df,timeozone)
}
Use zoo. Replace -999 with NA, then convert your data to a zoo object. Use na.spline, i.e. yourdata$O3.zoo <- na.spline(yourdata$O3.zoo, method = "fmm"). Then clip your data to the years you want afterwards.
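If the goal is only to pad the series out to full calendar years with missing values, a base R sketch along these lines might be enough. It is an illustration, not the zoo approach above: obs is a tiny made-up sample with ISO-formatted dates, and I assume the timestamps are in a zone without daylight saving (UTC here) so that every hour exists exactly once.

```r
# Tiny sample standing in for one imported file (illustrative values)
obs <- data.frame(date = as.POSIXct(c("1990-09-15 00:00",
                                      "1990-09-15 01:00",
                                      "1990-09-15 04:00"), tz = "UTC"),
                  O3 = c(24, 28, -999))

# Treat the -999 sentinel as missing
obs$O3[obs$O3 == -999] <- NA

# Full hourly sequence from 1 January of the first year
# to 31 December 23:00 of the last year in the data
first_year <- format(min(obs$date), "%Y")
last_year  <- format(max(obs$date), "%Y")
full <- data.frame(date = seq(as.POSIXct(paste0(first_year, "-01-01 00:00"), tz = "UTC"),
                              as.POSIXct(paste0(last_year, "-12-31 23:00"), tz = "UTC"),
                              by = "hour"))

# Left join: every hour is kept, hours without an observation get NA
padded <- merge(full, obs, by = "date", all.x = TRUE)
print(head(padded))
```

Because the start and end years are read from the data itself, the same lines could sit inside a loop over files without being edited per series.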
