Trying to create a time series per hour in R.
I have a data frame collecting the number of vehicles per hour; it looks like this:
> head(df)
# A tibble: 6 x 8
interval cars vans trucks total `mean speed` `% occupation` hour
<dttm> <int> <int> <int> <int> <dbl> <dbl> <int>
1 2017-10-09 00:00:00 7 0 0 7 7.37 1. 0
2 2017-10-09 01:00:00 24 0 0 24 16.1 3. 1
3 2017-10-09 02:00:00 27 0 0 27 18.1 2. 2
4 2017-10-09 03:00:00 47 3 0 50 31.5 3. 3
5 2017-10-09 04:00:00 122 1 5 128 48.0 16. 4
6 2017-10-09 05:00:00 353 6 2 361 66.3 20. 5
> tail(df,1)
# A tibble: 1 x 8
interval cars vans trucks total `mean speed` `% occupation` hour
<dttm> <int> <int> <int> <int> <dbl> <dbl> <int>
1 2018-03-15 20:00:00 48 0 2 50 31.5 5. 20
Looking at the answer at "starting a daily time series in R", which clearly explains how to create a ts by day,
I've converted this df to a time series as:
ts2Start <- df$interval[1]
ts2End <- df$interval[nrow(df)]
indexPerHour <- seq(ts2Start, ts2End, by = 'hour')
Since we have 365 days in a year and 24h per day, I created the ts as:
> df.ts <- ts(df$total, start = c(2017, as.numeric(format(indexPerHour[1], '%j'))),
+ frequency=24*365)
where
as.numeric(format(indexPerHour[1], '%j'))
returns 282.
In order to validate what I'm doing, I checked whether the date obtained from the index is the same as the first row in my data frame:
head(date_decimal(index(df.ts)),1)
but while my first date/time should be "2017-10-09 00:00:00",
I'm getting "2017-01-12 16:59:59 UTC".
It looks as if the first index in the df.ts series started at ~ 282/24.
I do not understand what I'm doing wrong. How does the start parameter work in ts()?
I also checked the post How to Create a R TimeSeries for Hourly data, where it is suggested to use the xts package.
The issue is that I'm just learning from a book where tslm() is used, and xts objects do not seem to be supported.
Can I use ts() to create an hourly time series?
You should use the xts library instead. For example:
time_index <- seq(from = as.POSIXct("2016-01-01 00:00:00"),
                  to = as.POSIXct("2018-10-01 00:00:00"), by = "hour")
traff <- xts(df, order.by = time_index)
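As for why the original ts() call misbehaves: with frequency = 24 * 365, the second element of start is interpreted as the index of the observation within the yearly cycle, i.e. the hour of the year, not the day of the year. So start = c(2017, 282) means "the 282nd hour of 2017", which lands around January 12, exactly the ~282/24 offset observed above. A hedged sketch of a start value consistent with day 282, hour 0:
# start is (year, observation-within-cycle); convert day of year to an hourly index
day_of_year <- as.numeric(format(indexPerHour[1], '%j'))   # 282
df.ts <- ts(df$total,
            start = c(2017, (day_of_year - 1) * 24 + 1),
            frequency = 24 * 365)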
Related
I need to accumulate time across consecutive dates until the difference between two consecutive dates is greater than 13 seconds.
For example, in the data frame created with the code shown below, the column test holds the time difference between each date and the next. What I need are the blocks of time between rows where test > 13 seconds.
# Create a vector of dates with a random time difference in seconds between records
dates <- seq(as.POSIXct("2020-01-01 00:00:02"), as.POSIXct("2020-01-02 00:00:02"), by = "2 sec")
dates <- dates + sample(15, length(dates), replace = T)
# Create a data.frame
data <- data.frame(id = 1:length(dates), dates = dates)
# Create a test field with the time difference between each date and the next
data$test <- c(diff(data$dates, lag = 1), 0)
# Delete the zero and negative time
data <- data[data$test > 0, ]
head(data)
What I want is something like this:
To get to your desired result we need to define 'blocks' of observations. Each block is split where test is greater than 13.
We start by identifying the split points, and then, using the rle function, we assign an ID to each block.
Then we can filter out the split points and summarize the remaining blocks: first with the sum of seconds, then with the min of the event dates.
split_point <- data$test <= 13
# Find continuous blocks
block_str <- rle(split_point)
# Create block IDs
data$block <- rep(seq_along(block_str$lengths), block_str$lengths)
data <- data[split_point, ] # Remove split points
# Summarize
final_df <- aggregate(test ~ block, data = data, FUN = sum)
dtevent <- aggregate(dates ~ block, data= data, FUN=min)
# Join the two summaries
final_df$DatetimeEvent <- dtevent$dates
head(final_df)
#> block test DatetimeEvent
#> 1 1 101 2020-01-01 00:00:09
#> 2 3 105 2020-01-01 00:01:11
#> 3 5 277 2020-01-01 00:02:26
#> 4 7 46 2020-01-01 00:04:58
#> 5 9 27 2020-01-01 00:05:30
#> 6 11 194 2020-01-01 00:05:44
Created on 2020-04-02 by the reprex package (v0.3.0)
Using dplyr for convenience's sake:
library(dplyr)
final_df <- data %>%
mutate(split_point = test <= 13,
block = with(rle(split_point), rep(seq_along(lengths), lengths))) %>%
group_by(block) %>%
filter(split_point) %>%
summarise(DateTimeEvent = min(dates), TotalTime = sum(test))
final_df
#> # A tibble: 1,110 x 3
#> block DateTimeEvent TotalTime
#> <int> <dttm> <drtn>
#> 1 1 2020-01-01 00:00:06 260 secs
#> 2 3 2020-01-01 00:02:28 170 secs
#> 3 5 2020-01-01 00:04:11 528 secs
#> 4 7 2020-01-01 00:09:07 89 secs
#> 5 9 2020-01-01 00:10:07 37 secs
#> 6 11 2020-01-01 00:10:39 135 secs
#> 7 13 2020-01-01 00:11:56 50 secs
#> 8 15 2020-01-01 00:12:32 124 secs
#> 9 17 2020-01-01 00:13:52 98 secs
#> 10 19 2020-01-01 00:14:47 83 secs
#> # … with 1,100 more rows
Created on 2020-04-02 by the reprex package (v0.3.0)
(results are different because reprex recreates the data each time)
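For completeness, a hedged data.table sketch of the same blocking idea, where rleid() replaces the manual rle() bookkeeping (an alternative, not what either answer above used):
library(data.table)
dt <- as.data.table(data)
# rleid() gives each run of consecutive TRUE/FALSE values its own id
dt[, block := rleid(test <= 13)]
# keep only the within-block rows and summarize each block
final_dt <- dt[test <= 13,
               .(DatetimeEvent = min(dates), TotalTime = sum(test)),
               by = block]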
I have a database containing the values of different indices at different frequencies (weekly, monthly, daily). I hope to calculate monthly returns by extracting the beginning-of-month value from the time series.
I have tried using a loop to partition the time series month by month and then min() to get the earliest date in each month. However, I am wondering whether there is a more efficient way to speed up the calculation.
library(data.table)
df<-fread("statistic_date index_value funds_number
2013-1-1 1000.000 0
2013-1-4 996.096 21
2013-1-11 1011.141 21
2013-1-18 1057.344 21
2013-1-25 1073.376 21
2013-2-1 1150.479 22
2013-2-8 1150.288 19
2013-2-22 1112.993 18
2013-3-1 1148.826 20
2013-3-8 1093.515 18
2013-3-15 1092.352 17
2013-3-22 1138.346 18
2013-3-29 1107.440 17
2013-4-3 1101.897 17
2013-4-12 1093.344 17")
I expect to filter to get the rows of the earliest date of each month, such as:
2013-1-1 1000.000 0
2013-2-1 1150.479 22
2013-3-1 1148.826 20
2013-4-3 1101.897 17
Your help will be much appreciated!
Using the tidyverse and lubridate packages,
library(lubridate)
library(tidyverse)
df %>% mutate(statistic_date = ymd(statistic_date), # convert statistic_date to date format
month = month(statistic_date), #create month and year columns
year= year(statistic_date)) %>%
group_by(month,year) %>% # group by month and year
arrange(statistic_date) %>% # make sure the df is sorted by date
filter(row_number()==1) # select first row within each group
# A tibble: 4 x 5
# Groups: month, year [4]
# statistic_date index_value funds_number month year
# <date> <dbl> <int> <dbl> <dbl>
#1 2013-01-01 1000 0 1 2013
#2 2013-02-01 1150. 22 2 2013
#3 2013-03-01 1149. 20 3 2013
#4 2013-04-03 1102. 17 4 2013
First make statistic_date a Date:
df$statistic_date <- as.Date(df$statistic_date)
Then you can use nth_day to find the first day of every month in statistic_date.
library("datetimeutils")
dates <- nth_day(df$statistic_date, period = "month", n = "first")
## [1] "2013-01-01" "2013-02-01" "2013-03-01" "2013-04-03"
df[statistic_date %in% dates]
## statistic_date index_value funds_number
## 1: 2013-01-01 1000.000 0
## 2: 2013-02-01 1150.479 22
## 3: 2013-03-01 1148.826 20
## 4: 2013-04-03 1101.897 17
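Since the question already loads data.table, a hedged data.table sketch of the same filtering (assuming a data.table version where the year() and month() helpers are available):
library(data.table)
df[, statistic_date := as.IDate(statistic_date)]
# first row of each (year, month) group, after sorting by date
first_rows <- df[order(statistic_date),
                 .SD[1],
                 by = .(yr = year(statistic_date), mon = month(statistic_date))]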
I have an R dataframe df_demand with a date column (depdate) and a dependent variable column bookings. The duration is 365 days starting from 2017-11-02 and ending at 2018-11-01, sorted in ascending order.
We have booking data for only 279 days in the year.
dplyr::arrange(df_demand, depdate)
depdate bookings
1 2017-11-02 43
2 2017-11-03 27
3 2017-11-05 27
4 2017-11-06 22
5 2017-11-07 39
6 2017-11-08 48
.
.
279 2018-11-01 60
I want to introduce another column day_of_year in the following way:
depdate day_of_year bookings
1 2017-11-02 1 43
2 2017-11-03 2 27
3 2017-11-04 3 NA
4 2017-11-05 4 27
.
.
.
365 2018-11-01 365 60
I am trying to find the best possible way to do this.
In Python, I could use something like:
df_demand['day_of_year'] = df_demand['depdate'].sub(df_demand['depdate'].iat[0]).dt.days + 1
I wanted to know about an R equivalent of the same.
When I run
typeof(df_demand_2$depdate)
the output is
"double"
Am I missing something?
You can create a row for every date using the complete function from the tidyr package.
First, I'm creating a data frame with some sample data:
df <- data.frame(
depdate = as.Date(c('2017-11-02', '2017-11-03', '2017-11-05')),
bookings = c(43, 27, 27)
)
Next, I'm performing two operations. First, using tidyr::complete, I'm specifying all the dates I want in my analysis. I can do that using seq.Date, creating a sequence from the first to the last day.
Once that is done, the day_of_year column is simply equal to the row number.
df_complete <- tidyr::complete(df,
depdate = seq.Date(from = min(df$depdate), to = max(df$depdate), by = 1)
)
df_complete$day_of_year <- 1:nrow(df_complete)
> df_complete
#> # A tibble: 4 x 3
#> depdate bookings day_of_year
#> <date> <dbl> <int>
#> 1 2017-11-02 43 1
#> 2 2017-11-03 27 2
#> 3 2017-11-04 NA 3
#> 4 2017-11-05 27 4
An equivalent solution with the pipe operator from dplyr:
library(dplyr)
library(tidyr)

df %>%
  complete(depdate = seq.Date(from = min(df$depdate), to = max(df$depdate), by = 1)) %>%
  mutate(day_of_year = row_number())
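As for a direct R equivalent of the Python line (computed on the existing rows only, without filling in missing dates), a sketch assuming depdate is a Date column:
# day offset from the first row, plus 1 (mirrors .sub(...).dt.days + 1)
df_demand$day_of_year <- as.numeric(
  difftime(df_demand$depdate, df_demand$depdate[1], units = "days")
) + 1
This also explains the typeof() observation: Date objects are stored internally as doubles (days since 1970-01-01), so "double" is expected.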
I have a dataframe (tibble) with multiple rows, each row contains an IDNR, a start date, an end date and an exposure status. The IDNR is a character variable, the start and end date are date variables and the exposure status is a numerical variable. This is what the top 3 rows look like:
# A tibble: 48,266 x 4
IDNR start end exposure
<chr> <date> <date> <dbl>
1 1 2018-02-15 2018-07-01 0
2 2 2017-10-30 2018-07-01 0
3 3 2016-02-11 2016-12-03 1
# ... with 48,256 more rows
In order to do a time-varying Cox regression, I want to split the rows into 90-day parts while maintaining the start and end dates. Here is an example of what I would like to achieve: the new end date is start + 90 days, and a new row is created whose start date is the same as the end date of the previous row. If the time between start and end is now less than 90 days, that is fine (as for IDNR 1 and 3); however, for IDNR 2 the time still exceeds 90 days, so a third row needs to be added.
# A tibble: 48,266 x 4
# Groups: IDNR [33,240]
IDNR start end exposure
<chr> <date> <date> <dbl>
1 1 2018-02-15 2018-05-16 0
2 1 2018-05-16 2018-07-01 0
3 2 2017-10-30 2018-01-28 0
4 2 2018-01-28 2018-04-28 0
5 2 2018-04-28 2018-07-01 0
6 3 2016-02-11 2016-08-09 1
7 3 2016-08-09 2016-12-03 1
I'm relatively new to coding in R, but I've found dplyr to be very useful so far. So, if someone knows a solution using dplyr I would really appreciate that.
Thanks in advance!
Here you go:
Using df as your data frame:
df = data.frame(IDNR = 1:3,
start = c("2018-02-15","2017-10-30","2016-02-11"),
end = c("2018-07-01","2018-07-01","2016-12-03"),
exposure = c(0,0,1))
Do:
library(lubridate)
newDF = apply(df, 1, function(x){
  # break points every 90 days from start to end
  breaks = seq(from = ymd(x["start"]), to = ymd(x["end"]), by = 90)
  newStart = breaks
  newEnd = c(breaks[-1], ymd(x["end"]))
  data.frame(IDNR = rep(x["IDNR"], length(newStart)),
             start = newStart,
             end = newEnd,
             exposure = rep(x["exposure"], length(newStart)))
})
newDF = do.call(rbind, newDF)
newDF = newDF[newDF$start != newDF$end,]
Result:
> newDF
IDNR start end exposure
1 1 2018-02-15 2018-05-16 0
2 1 2018-05-16 2018-07-01 0
3 2 2017-10-30 2018-01-28 0
4 2 2018-01-28 2018-04-28 0
5 2 2018-04-28 2018-07-01 0
6 3 2016-02-11 2016-05-11 1
7 3 2016-05-11 2016-08-09 1
8 3 2016-08-09 2016-11-07 1
9 3 2016-11-07 2016-12-03 1
What this does is create a sequence of dates from start to end in steps of 90 days, then build a small data frame for each row with those intervals along with the IDNR and exposure. apply returns a list of data frames that you can bind together using do.call(rbind, newDF). The last line removes rows that have the same start and end date.
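Since the question asked for a dplyr solution, here is a hedged dplyr/tidyr sketch of the same idea (assumes dplyr >= 1.0 and tidyr >= 1.0; the pmin() cap replaces the explicit second seq() call):
library(dplyr)
library(tidyr)
library(lubridate)

newDF <- df %>%
  mutate(start = ymd(start), end = ymd(end)) %>%
  rowwise() %>%
  mutate(start90 = list(seq(start, end, by = 90))) %>%  # 90-day break points per row
  ungroup() %>%
  unnest(start90) %>%
  mutate(end90 = pmin(start90 + 90, end)) %>%           # each piece ends 90 days later, capped at end
  filter(start90 != end90) %>%                          # drop zero-length pieces
  select(IDNR, start = start90, end = end90, exposure)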
Apologies if there is already an answer to a similar query, but I can't seem to find it! I'm a newbie to R but determined not to revert to VBA for this...
My question is about preparing data ready for forecasting with ses. I have a set of ticket data (~25,000 entries) with timestamps that I've imported from Excel:
Number Created Category Priority `Incident state` `Reassignment count` Urgency Impact
<dbl> <dttm> <chr> <chr> <chr> <dbl> <chr> <chr>
1 1 2014-07-01 19:16:00 Software/System 5 - Minor Closed 0 3 - Low 3 - Low
2 2 2014-07-02 15:27:00 Software/System 5 - Minor Closed 0 3 - Low 3 - Low
3 3 2014-07-02 15:27:00 Software/System 5 - Minor Closed 0 3 - Low 3 - Low
4 4 2014-07-02 15:27:00 Software/System 5 - Minor Closed 0 3 - Low 3 - Low
5 5 2014-07-02 15:28:00 Software/System 5 - Minor Closed 0 3 - Low 3 - Low
6 6 2014-07-02 15:29:00 Software/System 5 - Minor Closed 0 3 - Low 3 - Low
The data is not regularly spaced, since no tickets are raised outside of working hours, so I can't specify a seq(). I need to group the Created column into hourly blocks ahead of converting to a time series that I can forecast from. I tried rounding the Created column to hours:
modelling_messy$Created <- as.POSIXct(modelling_messy$Created,format="%Y/%m/%d %H:%M:%S", tz = "GMT")
modelling_messy$Created <- as.POSIXct(round(modelling_messy$Created, units = "hours"))
This made my data look the way I wanted and allowed me to aggregate() all entries with the same hourly timestamp, but it goes all squinty when I use ts():
# A tibble: 2 x 8
Number Created Category Priority `Incident state` `Reassignment count` Urgency Impact
<dbl> <dttm> <chr> <dbl> <chr> <dbl> <chr> <chr>
1 1 2014-07-01 19:00:00 Software/System 5 Closed 0 3 - Low 3 - Low
2 2 2014-07-02 15:00:00 Software/System 5 Closed 0 3 - Low 3 - Low
> myts <- ts(modelling_clean[,1:2], start = c(2014-07-01, 1), freq = 1)
> head(myts)
Time Series:
Start = 2006
End = 2011
Frequency = 1
Group.1 Number
2006 1404241200 1
2007 1404313200 5
2008 1404316800 1
2009 1404907200 8
2010 1404910800 28
2011 1404914400 1
I know that I've messed up ts() somehow but I can't find how to fix it! I want the time data to remain as "%Y-%m-%d %H:00:00" or other useful date/hour combination (I'm only covering 2014 - 2017 by the way).
Any and all help is greatly appreciated.
Ta muchly.
EDIT
Thanks for the advice - I think this will solve the problem of converting to the time series, but I'm unsure how to take the data for df$Created from my current tibble (too much data to code in manually!). I attempted the following, but it threw an error:
> df = data.frame(Created = modelling_messy$Created),stringsAsFactors = F)
Error: unexpected ',' in "df = data.frame(Created = modelling_messy$Created),"
> df$id = seq_along(nrow(df))
Error in df$id = seq_along(nrow(df)) :
object of type 'closure' is not subsettable
Thanks in advance!
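Regarding the errors in the EDIT: the first line has a misplaced closing parenthesis, so the data.frame() call is closed before stringsAsFactors; after that line fails, df still refers to the base function df(), hence the "object of type 'closure' is not subsettable" error. Note also that seq_along(nrow(df)) returns just 1, because nrow(df) is a length-one vector. A corrected sketch (modelling_messy is the asker's tibble):
df <- data.frame(Created = modelling_messy$Created, stringsAsFactors = FALSE)
df$id <- seq_len(nrow(df))  # one id per row, not seq_along(nrow(df))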
You could create an hourly time series with the xts package as follows:
library(xts)
# sample data
df = data.frame(Created = c("2014-07-01 19:16:00","2014-07-02 15:27:00","2014-07-02 15:27:00","2014-07-02 15:27:00",
"2014-07-02 15:28:00","2014-07-02 15:29:00"),stringsAsFactors = F)
df$id = seq_len(nrow(df))  # note: seq_along(nrow(df)) would return just 1
# Round dates to hours
df$Created <- as.POSIXct(df$Created,format="%Y-%m-%d %H", tz = "GMT")
# Let's aggregate and create hourly data
df = aggregate(id ~ Created, df,length)
time_series = data.frame(Created= seq( min(df$Created), max(df$Created),by='1 hour'))
time_series = merge(time_series,df,by="Created",all.x=TRUE)
time_series$id[is.na(time_series$id)]=0
# create timeseries object
myxts = xts(time_series$id, order.by = time_series$Created)
Output:
[,1]
2014-07-01 19:00:00 1
2014-07-01 20:00:00 0
2014-07-01 21:00:00 0
2014-07-01 22:00:00 0
2014-07-01 23:00:00 0
2014-07-02 00:00:00 0
2014-07-02 01:00:00 0
2014-07-02 02:00:00 0
2014-07-02 03:00:00 0
2014-07-02 04:00:00 0
2014-07-02 05:00:00 0
2014-07-02 06:00:00 0
2014-07-02 07:00:00 0
2014-07-02 08:00:00 0
2014-07-02 09:00:00 0
2014-07-02 10:00:00 0
2014-07-02 11:00:00 0
2014-07-02 12:00:00 0
2014-07-02 13:00:00 0
2014-07-02 14:00:00 0
2014-07-02 15:00:00 5
It's working!
Disclaimer: This is my first time playing with time series in R, so there may be other (i.e. better) ways to achieve this.
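If a plain ts object is needed for ses() from the forecast package (and tslm() likewise expects a ts), a hedged sketch building on the hourly counts above; frequency = 24 (a daily cycle) is an assumption, not something the question specifies, and ses itself ignores seasonality even though other forecast functions use it:
library(forecast)
# treat the hourly counts as a ts with a daily cycle
myts <- ts(time_series$id, frequency = 24)
fit <- ses(myts, h = 24)  # forecast the next 24 hours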