Time Series within R (ColumnSorting)

I have a CSV of real-time data with timestamps, and I want to group the observations into 30-minute intervals for analysis.
A sample of the real-time data is
Date:
2019-06-01 08:03:04
2019-06-01 08:20:04
2019-06-01 08:33:04
2019-06-01 08:54:04
...
I am looking to group them in a table with a step increment of 30 mins (i.e. 08:30, 09:00, etc.) to count the number of occurrences during each period. I created a new csv file to access through R, so that I will not corrupt the formatting of the original dataset.
Date:
2019-06-01 08:00
2019-06-01 08:30
2019-06-01 09:00
2019-06-01 09:30
I first constructed a list of 30-minute intervals:
sheet_csv$Date <- as.POSIXct(paste(sheet_csv$Date), format = "%Y-%m-%d %H:%M", tz = "GMT") # convert to POSIXct
sheet_csv$Date <- timeDate::timeSequence(from = "2019-06-01 08:00", to = "2019-12-03 09:30", by = 1800,
                                         format = "%Y-%m-%d %H:%M", zone = "GMT")
I encountered an error "Error in x[[idx]][[1]] : this S4 class is not subsettable" for this interval.
I am relatively new to R. Please do help out where you can. Greatly Appreciated.

You probably don't need the timeDate package for something like this. One package that is very helpful for manipulating dates and times is lubridate; you may want to consider it going forward.
I used your example and added another date/time for illustration.
To create your 30 minute intervals, you could use cut and seq.POSIXt to create a sequence of date/times with 30 minute breaks. I used your minimum date/time to start with (rounding down to nearest hour) but you can also specify another date/time here.
Use of table will give you frequencies after cut.
sheet_csv <- data.frame(
  Date = c("2019-06-01 08:03:04",
           "2019-06-01 08:20:04",
           "2019-06-01 08:33:04",
           "2019-06-01 08:54:04",
           "2019-06-01 10:21:04")
)

sheet_csv$Date <- as.POSIXct(sheet_csv$Date, format = "%Y-%m-%d %H:%M:%S", tz = "GMT")

as.data.frame(table(cut(sheet_csv$Date,
                        breaks = seq.POSIXt(from = round(min(sheet_csv$Date), "hours"),
                                            to = max(sheet_csv$Date) + .5 * 60 * 60,
                                            by = "30 min"))))
Output
Var1 Freq
1 2019-06-01 08:00:00 2
2 2019-06-01 08:30:00 2
3 2019-06-01 09:00:00 0
4 2019-06-01 09:30:00 0
5 2019-06-01 10:00:00 1
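If you do go the lubridate route mentioned above, the same counts can be obtained by flooring each timestamp to its 30-minute bin (a minimal sketch; note that, unlike the cut/seq approach, empty bins are simply absent rather than reported with a frequency of 0):

```r
library(lubridate)

sheet_csv <- data.frame(
  Date = as.POSIXct(c("2019-06-01 08:03:04", "2019-06-01 08:20:04",
                      "2019-06-01 08:33:04", "2019-06-01 08:54:04",
                      "2019-06-01 10:21:04"), tz = "GMT")
)

# Round each timestamp down to the start of its 30-minute bin, then count per bin
sheet_csv$bin <- floor_date(sheet_csv$Date, "30 minutes")
as.data.frame(table(sheet_csv$bin))
```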

Related

R: Date Operation Results in Empty Dates

I am using the R programming language. I am trying to take the difference between two date columns. Both dates are in the following format: 2010-01-01 12:01
When I bring my file into R, the dates are in "Factor" format. Here is my attempt to recreate the file in R:
#how my file looks like when I import it into R
date_1 = c("2010-01-01 13:01 ", "2010-01-01 14:01" )
date_2 = c("2010-01-01 15:01 ", "2010-01-01 16:01" )
file = data.frame(date_1, date_2)
file$date_1 = as.factor(file$date_1)
file$date_2 = as.factor(file$date_2)
Now, I am trying to create a new column which takes the difference between these dates (in minutes)
I first tried to convert both date variables into the appropriate "Date" formats:
#convert to date formats:
file$date_a = as.POSIXlt(file$date_1,format="%Y-%m-%dT%H:%M")
file$date_b = as.POSIXlt(file$date_2,format="%Y-%m-%dT%H:%M")
Then, I tried to take the difference :
file$diff = difftime(file$date_a, file$date_b, units="mins")
But this results in "NA's":
> file
            date_1           date_2 date_a date_b    diff
1 2010-01-01 13:01 2010-01-01 15:01   <NA>   <NA> NA mins
2 2010-01-01 14:01 2010-01-01 16:01   <NA>   <NA> NA mins
Can someone please show me what I am doing wrong?
Thanks
Reference: How to get difference (in minutes) between two date strings?
There is no T in the string, so the format should use the actual separator (a space):
difftime(as.POSIXct(file$date_1, format = '%Y-%m-%d %H:%M'),
         as.POSIXct(file$date_2, format = '%Y-%m-%d %H:%M'), units = 'mins')
#Time differences in mins
#[1] -120 -120
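Putting the fix together with the question's data (a minimal sketch; as.character converts the factors explicitly before parsing, and strptime ignores the stray trailing spaces after the matched format):

```r
date_1 <- c("2010-01-01 13:01 ", "2010-01-01 14:01")
date_2 <- c("2010-01-01 15:01 ", "2010-01-01 16:01")
file   <- data.frame(date_1 = as.factor(date_1), date_2 = as.factor(date_2))

# Convert factor -> character explicitly, then parse with a format that
# matches the string: a space, not a "T", between date and time
file$date_a <- as.POSIXct(as.character(file$date_1), format = "%Y-%m-%d %H:%M")
file$date_b <- as.POSIXct(as.character(file$date_2), format = "%Y-%m-%d %H:%M")

file$diff <- difftime(file$date_a, file$date_b, units = "mins")
file$diff
# Time differences in mins
# [1] -120 -120
```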

Remove seconds from some observations to work in HM format using R

I have a column called "time" with some observations in "hours:minutes:seconds" and others only in "hours:minutes". I would like to remove the seconds and be left with only hours and minutes.
So far I have loaded the lubridate package and tried:
format(data$time ,format = "%H:%M")
but no change occurs.
And with:
data$time <- hm(data$time)
all the observations with h:m:s become NAs
What should I do?
You can use parse_date_time from lubridate to bring time into POSIXct format and then use format to keep the information that you need.
data <- data.frame(time = c('10:04:00', '14:00', '15:00', '12:34:56'))
data$time1 <- format(lubridate::parse_date_time(data$time, c('HMS', 'HM')), '%H:%M')
data
# time time1
#1 10:04:00 10:04
#2 14:00 14:00
#3 15:00 15:00
#4 12:34:56 12:34
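If you'd rather not parse the times at all, a base-R sketch with sub can strip a trailing ":SS" (assuming every value is either H:M or H:M:S):

```r
data <- data.frame(time = c('10:04:00', '14:00', '15:00', '12:34:56'))

# Drop a trailing ":SS" when present; strings already in H:M are left untouched
data$time1 <- sub("^(\\d{1,2}:\\d{2}):\\d{2}$", "\\1", data$time)
data$time1
# [1] "10:04" "14:00" "15:00" "12:34"
```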

How to calculate business hours between two dates when business hours vary depending on the day in R?

I'm trying to calculate business hours between two dates. Business hours vary depending on the day.
Weekdays have 15 business hours (8:00-23:00); Saturdays and Sundays have 12 business hours (9:00-21:00).
For example: start date 07/24/2020 22:20 (Friday) and end date 07/25/2020 21:20 (Saturday). Since I'm only interested in the business hours, the result should be 12.67 hours.
Here an example of the dataframe and desired output:
start_date end_date business_hours
07/24/2020 22:20 07/25/2020 21:20 12.67
07/14/2020 21:00 07/16/2020 09:30 18.50
07/18/2020 08:26 07/19/2020 10:00 13.00
07/10/2020 08:00 07/13/2020 11:00 42.00
Here is something you can try with lubridate; I adapted another function I had that I thought might be helpful.
First create a sequence of dates between the two dates of interest. Then create intervals based on business hours, checking each date if on the weekend or not.
Then, "clamp" the start and end times to the allowed business hours time intervals using pmin and pmax.
You can use time_length to get the time measurement of the intervals; summing them up will give you total time elapsed.
library(lubridate)
library(dplyr)

calc_bus_hours <- function(start, end) {
  my_dates <- seq.Date(as.Date(start), as.Date(end), by = "day")
  my_intervals <- if_else(weekdays(my_dates) %in% c("Saturday", "Sunday"),
                          interval(ymd_hm(paste(my_dates, "09:00"), tz = "UTC"), ymd_hm(paste(my_dates, "21:00"), tz = "UTC")),
                          interval(ymd_hm(paste(my_dates, "08:00"), tz = "UTC"), ymd_hm(paste(my_dates, "23:00"), tz = "UTC")))
  int_start(my_intervals[1]) <- pmax(pmin(start, int_end(my_intervals[1])), int_start(my_intervals[1]))
  int_end(my_intervals[length(my_intervals)]) <- pmax(pmin(end, int_end(my_intervals[length(my_intervals)])), int_start(my_intervals[length(my_intervals)]))
  sum(time_length(my_intervals, "hour"))
}
calc_bus_hours(as.POSIXct("07/24/2020 22:20", format = "%m/%d/%Y %H:%M", tz = "UTC"), as.POSIXct("07/25/2020 21:20", format = "%m/%d/%Y %H:%M", tz = "UTC"))
[1] 12.66667
Edit: For Spanish language, use c("sábado", "domingo") instead of c("Saturday", "Sunday")
For the data frame example, you can use mapply to call the function using the two selected columns as arguments. Try:
df$business_hours <- mapply(calc_bus_hours, df$start_date, df$end_date)
start end business_hours
1 2020-07-24 22:20:00 2020-07-25 21:20:00 12.66667
2 2020-07-14 21:00:00 2020-07-16 09:30:00 18.50000
3 2020-07-18 08:26:00 2020-07-19 10:00:00 13.00000
4 2020-07-10 08:00:00 2020-07-13 11:00:00 42.00000
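Used data (a reconstruction of the question's data frame; the column names start_date and end_date follow the question, and the timestamps are parsed as UTC to match the function above):

```r
df <- data.frame(
  start_date = as.POSIXct(c("07/24/2020 22:20", "07/14/2020 21:00",
                            "07/18/2020 08:26", "07/10/2020 08:00"),
                          format = "%m/%d/%Y %H:%M", tz = "UTC"),
  end_date   = as.POSIXct(c("07/25/2020 21:20", "07/16/2020 09:30",
                            "07/19/2020 10:00", "07/13/2020 11:00"),
                          format = "%m/%d/%Y %H:%M", tz = "UTC")
)
```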

Filling not observed observations

I want to make a time series with the frequency a date and time is observed. The raw data looked something like this:
dd-mm-yyyy hh:mm
28-2-2018 0:12
28-2-2018 11:16
28-2-2018 12:12
28-2-2018 13:22
28-2-2018 14:23
28-2-2018 14:14
28-2-2018 16:24
The date and time format is not one R recognizes by default, so I had to adjust it:
extracted_times <- as.POSIXct(bedrijf.CSV$viewed_at, format = "%d-%m-%Y %H:%M")
I ordered the data with frequency in a table using the following code:
timeserieswithoutzeros <- table(extracted_times)
The data looks something like this now:
2018-02-28 00:11:00 2018-02-28 01:52:00 2018-02-28 03:38:00
1 2 5
2018-02-28 04:10:00 2018-02-28 04:40:00 2018-02-28 04:45:00
2 1 1
As you may see there are a lot of unobserved dates and times.
I want to add these unobserved dates and times with the frequency of 0.
I tried the complete function, but it throws an error saying it can't be used, because I use as.POSIXct().
Any ideas?
As already mentioned in the comments by @eric-lecoutre, you can combine your observations with a sequence beginning at the earliest and ending at the latest date using seq, then subtract 1 from the frequency table.
timeseriesWithzeros <- table(c(extracted_times, seq(min(extracted_times), max(extracted_times), "1 min")))-1
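To see why the subtraction works, here is a minimal sketch with made-up times: every minute in the range appears at least once in the combined vector, so subtracting 1 leaves the true frequencies, zeros included.

```r
extracted_times <- as.POSIXct(c("2018-02-28 00:12", "2018-02-28 00:12",
                                "2018-02-28 00:15"), tz = "GMT")

# Every minute from min to max appears once; real observations add to that
all_mins <- seq(min(extracted_times), max(extracted_times), by = "1 min")
tab <- table(c(extracted_times, all_mins)) - 1
tab
# counts are 2, 0, 0, 1 for the minutes 00:12 through 00:15
```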
Maybe the following is what you want.
First, coerce the data to class "POSIXt" and create the sequence of all date/time between min and max by steps of 1 minute.
bedrijf.CSV$viewed_at <- as.POSIXct(bedrijf.CSV$viewed_at, format = "%d-%m-%Y %H:%M")
new <- seq(min(bedrijf.CSV$viewed_at),
           max(bedrijf.CSV$viewed_at),
           by = "1 mins")
tmp <- data.frame(viewed_at = new)
Now see if these values are in the original data.
tmp$viewed <- tmp$viewed_at %in% bedrijf.CSV$viewed_at
tbl <- xtabs(viewed ~ viewed_at, tmp)
sum(tbl != 0)
#[1] 7
Final clean up.
rm(new, tmp)

Convert hour to date-time

I have a data frame with hour stamps and the corresponding temperatures measured. The measurements are taken at random intervals over time, continuously. I would like to convert the hours to the respective date-times with the temperature measured. My data frame looks like this (the measurements started on 20/05/2016):
Time, Temp
09.25,28
10.35,28.2
18.25,29
23.50,30
01.10,31
12.00,36
02.00,25
I would like to create a data.frame with respective date-time and Temp like below:
Time, Temp
2016-05-20 09:25,28
2016-05-20 10:35,28.2
2016-05-20 18:25,29
2016-05-20 23:50,30
2016-05-21 01:10,31
2016-05-21 12:00,36
2016-05-22 02:00,25
I am thankful for any comments and tips on packages or functions in R that I can look at to do this. Thanks for your time.
A possible solution in base R:
df$Time <- as.POSIXct(strptime(paste('2016-05-20', sprintf('%05.2f',df$Time)), format = '%Y-%m-%d %H.%M', tz = 'GMT'))
df$Time <- df$Time + cumsum(c(0,diff(df$Time)) < 0) * 86400 # 86400 = 60 * 60 * 24
which gives:
> df
Time Temp
1 2016-05-20 09:25:00 28.0
2 2016-05-20 10:35:00 28.2
3 2016-05-20 18:25:00 29.0
4 2016-05-20 23:50:00 30.0
5 2016-05-21 01:10:00 31.0
6 2016-05-21 12:00:00 36.0
7 2016-05-22 02:00:00 25.0
An alternative with data.table, using shift instead of diff to detect the day rollovers:
setDT(df)[, Time := as.POSIXct(strptime(paste('2016-05-20', sprintf('%05.2f', Time)), format = '%Y-%m-%d %H.%M', tz = 'GMT')) +
            cumsum(Time < shift(Time, fill = Time[1])) * 86400]
Or with dplyr:
library(dplyr)
df %>%
mutate(Time = as.POSIXct(strptime(paste('2016-05-20',
sprintf('%05.2f',Time)),
format = '%Y-%m-%d %H.%M', tz = 'GMT')) +
cumsum(c(0,diff(Time)) < 0)*86400)
which will both give the same result.
Used data:
df <- read.table(text='Time, Temp
09.25,28
10.35,28.2
18.25,29
23.50,30
01.10,31
12.00,36
02.00,25', header=TRUE, sep=',')
You can use a custom date format combined with some code that detects when a new day begins (assuming the first measurement takes place earlier in the day than the last measurement of the previous day).
# starting day
start_date <- "2016-05-20"
values <- read.csv('values.txt', colClasses = c("character", NA))

# each observation's predecessor; the first entry compares against "0"
last <- c("0", head(values$Time, -1))
# a clock time smaller than its predecessor marks the start of a new day
day <- cumsum(values$Time < last)

Time <- strptime(paste(start_date, values$Time), "%Y-%m-%d %H.%M")
values$Time <- Time + day * 86400
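Both answers rest on the same trick: a clock time smaller than its predecessor signals a day rollover. A minimal sketch of just that step, using the question's times as plain decimals:

```r
# Decimal H.MM times as in the question; a drop marks a day boundary
times <- c(9.25, 10.35, 18.25, 23.50, 1.10, 12.00, 2.00)

# diff < 0 flags the first reading of each new day; cumsum turns the
# flags into a running day offset: 0,0,0,0,1,1,2
day <- cumsum(c(0, diff(times)) < 0)
day
```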
