creating inteval object in r using lubridate package [duplicate] - r

This question already has an answer here:
indicateing to which interval a date belongs
(1 answer)
Closed 4 years ago.
hi i have data from uber :
about pick ups in NYC .
im trying to add a column to the raw data, that indicates for each row, for
which time interval (which is represented by a single timepoint at the beginning of thetime interval) it belongs.
i want to Create a vector containing all relevant timepoints (i.e. every 15 minutes
Use int_diff function from lubridate package on this vector to create an
interval object.
Run a loop on all the time points in the raw data and for each data
point; indicate to which interval (which is represented by a single
timepoint at the beginning of the time interval) it belongs.
i tried looking for explanations how to use the int_diff function but i dont understand how my vector should look and how the syntax of int_diff works
tanks for the help :)

Is this what you have in mind?
start <- mdy_hm('4/11/2014 0:00') # start of the period
end <- mdy_hm('5/12/2015 0:00') # end
time_seq <- seq(from = start, to = end, by = '15 mins') # sequence by 15 minutes
times <- mdy_hm(c('4/11/2014 0:12', '4/11/2014 1:24')) # times to find intervals for
dat <- data.frame(times)
dat$intervals <- cut(times, breaks = time_seq) # assign each time to an interval
intervals_cols <- model.matrix(~ - + intervals, dat) # turn this into a set of columns, one for each interval, with a 1 indicating that this observation falls into the column

Related

Define different timeseries for different columns

I have a dataframe where some of the columns are starting later than the other. Please find a reproducible example.
set.seed(354)
df <- data.frame(Product_Id = rep(1:100, each = 50),
Date = seq(from = as.Date("2014/1/1"),
to = as.Date("2018/2/1"),
by = "month"),
Sales = rnorm(100, mean = 50, sd= 20))
df <- df[-c(251:256, 301:312, 2551:2562, 2651:2662, 2751:2762), ]
library(zoo)
z <- read.zoo(df, index = "Date", split = "Product_Id", FUN = as.yearmon)
tt <- as.ts(z)
Now for this dataframe for the columns 6,7,52,54 and 56 I want to define them as timeseries starting from a different date as compared to the rest of the dataframe. Supposedly the data begins from Jan 2000, column 6 will begin from July 2000, column 7 from Jan 2001 and so on. How should I proceed to do this?
Later, I want to perform a forecast on this dataset. Any inputs on this? Should I consider each column as a seperate dataframe and do the forecasting. Or can I convert each column to a different timeseries object that starts from the first non NA value?
Now for this dataframe for the columns 6,7,52,54 and 56 I want to define them as timeseries starting from a different date as compared to the rest of the dataframe. Supposedly the data begins from Jan 2000, column 6 will begin from July 2000, column 7 from Jan 2001 and so on. How should I proceed to do this?
There, AFAIK, no way to do this in R in a time series matrix. And if each column started at a different date, then (since each column has the same number of entries), each column would also need to end at a different date. Is this really what you need? A collection of time series that all happen to be of the same length (so they can fit into a matrix), but that start and end with offsets? I struggle to understand where something like this would be useful, outside a kind of forecasting competition.
If you really need this, then I would recommend you put your time series into a list structure. Then each one can start and end at any date, and they can be the same or different lengths. Take inspiration from Mcomp::M3.
Later, I want to perform a forecast on this dataset. Any inputs on this? Should I consider each column as a seperate dataframe and do the forecasting. Or can I convert each column to a different timeseries object that starts from the first non NA value?
Since your tt is already a time series object, the simplest way would be simply to iterate over its columns:
fcst <- matrix(nrow=10,ncol=ncol(tt))
for ( ii in 1:ncol(tt) ) fcst <- forecast(ets(tt[,ii]),10)$mean
Note that most modeling functions in forecast will throw a warning and do something reasonable on encountering NA values. Here, e.g.:
1: In ets(tt[, ii]) :
Missing values encountered. Using longest contiguous portion of time series
Of course, you could do something yourself inside the loop, e.g., search for the last NA and start the time series for modeling right after that (but make sure you fail gracefully if the last entry is NA).

gnuplot, calculating and plotting monthly averages

I have a datafile with several months of minute data with lines like "2016-02-02 13:21(\t)value(\n)".
I need to plot the data (no problem with that) and calculate + plot an average for each month.
Is it possible in gnuplot?
I am able to get an overall average using
fit a "datafile" using 1:3 via a
I am also able to specify some time range for the fit using
fit [now_secs-3600*24*31:now_secs] b "datafile" using 1:3 via b
... and then plot them with
plot a t "Total average",b t "Last 31 days"
But no idea how to calculate and plot an average for each month (= one stepped line showing each month average)
Here is a way to do it purely in gnuplot. This method can be adapted (with a not small amount of effort) to work with files that cross a year boundary or span more than one year. It works just fine if the data starts with January or not. It computes the ordinary average for each month (the arithmetic mean) treating each data point as one value for the month. With somewhat significant modification, it can be used to work with weighted averages as well.
This makes a significant use of the stats function to compute values. It is a little long, partly because I commented it heavily. It uses 5.0 features (NaN for undefined values and in-memory datablocks instead of temporary files), but comments note how to change these for earlier versions.
Note: This script must be run before setting time mode. The stats function will not work in time mode. Time conversions are handled by the script functions.
data_time_format = "%Y-%m-%d %H:%M" #date format in file
date_cols = 2 # Number of columns consumed by date format
# get numeric month value of time - 1=January, 12=December
get_month(x) = 0+strftime("%m",strptime(data_time_format,x))
# get numeric year value of time
get_year(x) = 0+strftime("%Y",strptime(data_time_format,x))
# get internal time representation of day 1 of month x in year y
get_month_first(x,y) = strptime("%Y-%m-%d",sprintf("%d-%d-01",y,x))
# get internal time representation of date
get_date(x) = strptime(data_time_format,x)
# get date string in file format corresponding to day y in month x of year z
get_date_string(x,y,z) = strftime(data_time_format,strptime("%Y-%m-%d",sprintf("%04d-%02d-%02d",z,x,y)))
# determine if date represented by z is in month x of year y
check_valid(x,y,z) = (get_date(z)>=get_month_first(x,y))&(get_date(z)<get_month_first(x+1,y))
# Determine year and month range represented by file
year = 0
stats datafile u (year=get_year(strcol(1)),get_month(strcol(1))) nooutput
month_min = STATS_min
month_max = STATS_max
# list of average values for each month
aves = ""
# fill missing months at beginning of year with 0
do for[i=1:(month_min-1)] {
aves = sprintf("%s %d",aves,0)
}
# compute average of each month and store it at the end of aves
do for[i=month_min:month_max] {
# In versions prior to 5.0, replace NaN with 1/0
stats datafile u (check_valid(i,year,strcol(1))?column(date_cols+1):NaN) nooutput
aves = sprintf("%s %f",aves,STATS_mean)
}
# day on which to plot average
baseday = 15
# In version prior to 5.0, replace $k with a temporary file name
set print $k
# Change this to start at 1 if we want to fill in prior months
do for [i=month_min:month_max] {
print sprintf("%s %s",get_date_string(i,baseday,year),word(aves,i))
}
set print
This script will create either a in-memory datablock or a temporary file for earlier versions (with the noted changes) that contains a similar file to the original, but containing one entry per month with the value of the monthly average.
At the beginning we need to define our date format and the number of columns that the date format consumes. From then on it is assumed that the data file is structured as datetime value. Several functions are defined which make extensive use of the strptime function (to compute a date string to an internal integer) and the strftime function (to compute an internal representation to a string). Some of these functions compute both ways in order to extract the necessary values. Note the addition of 0 in the get_month and get_year function to convert a string value to an integer.
We do several steps with the data in order to build our resulting datablock/file.
Use the stats function to compute the first and last month and the year. We are assuming only one year is present. This step needs to be modified heavily if we need to work with more than one year. In particular months in a second year would need to be numbered 13 - 24 and in a third year 25 - 36 and so on. We would need to modify this line to capture multiple years as well. Probably two passes would be needed.
Build up a string which contains space separated values for the average value for each month. This is done by applying the stats function once for each month. The check_valid function checks if a value is in the month of interest, and a value that isn't is assigned NaN which causes the stats function to ignore it.
Loop over the months of interest and build a datablock/temporary file with one entry for each month with the average value for that month. In this case, the average value is assigned to the start of the 15th day of the month. This can be easily changed to any other desired time. The get_date_string function is used for assigning the value to a time.
Now to demonstrate this, suppose that we have the following data
2016-02-03 15:22 95
2016-02-20 18:03 23
2016-03-10 16:03 200
2016-03-15 03:02 100
2016-03-18 02:02 200
We wish to plot this data along with the average value for each month. We can run the above script, and we will get a datablock $k (make the commented change near the bottom to use a temporary file instead) containing the following
2016-02-15 00:00 59.000000
2016-03-15 00:00 166.666667
This is exactly the average values for each month. Now we can plot with
set xdata time
set timefmt data_time_format
set key outside top right
plot $k u 1:3 w points pt 7 t "Monthly Average",\
datafile u 1:3 with lines t "Original Data"
Here, just for illustration, I used points with the averages. Feel free to use any style that you want. If you choose to use steps, you will very likely want to adjust the day that is assigned† in the datablock/temporary file (probably the first or last day in the month depending on how you want to do it).
It is usually easier with a task like this to do some outside preprocessing, but this demonstrates that it is possible in pure gnuplot.
† Regarding changing the day that is assigned, using any specific day in the month is easy, as long as it is a day that occurs in every month (dates from the 1st to the 28th) - just change baseday. For other values modifications to the get_date_string function need to be made.
For example, to use the last day, the function can be defined as
get_date_string(x,y,z) = strftime(data_time_format,strptime("%Y-%m-%d",sprintf("%04d-%02d-01",z,x+1))-24*60*60)
This version actually computes the first day of the next month, and then subtracts one whole day from that. The second argument is ignored in this version, but preserved to allow it to be used without having to make any additional changes to the script.
With a recent version of gnuplot, you have the stats command and you can do something something like this:
stats "datafile" using 1:3 name m0
month_sec=3600*24*30.5
do for [month=1:12] {
stats [now_secs-(i+1)*month_sec:(i+0)*now_secs-month_sec] "datafile" using 1:3 name sprintf("m%d")
}
you get m0_mean value for the total mean and you get all m1_mean m2_mean variables for the previuos months etc... defined in gnuplot
Finally to plot the you should do something like:
plot 'datafile', for [month=0:12] value(sprintf("m%d_mean"))
see help stats help for help value help sprintf for more information on the above commands

Convert Julian days into radians (or similar)

I have a data frame with n rows each of which corresponds to a single event in space and time. The data frame has columns containing spatial coordinates and the date in Julian days as well as several other columns of additional data.
There are various things I would like to do with my data but as an example I want to rasterise some of the columns and output some maps. For most of my columns I can do this easily with something like this:
df.raster <- rasterize(df.sp, base.raster, field = "column", fun=median)
plot(df.raster)
However, for Julian days this doesn't make sense because its cyclical. 365/366 is adjacent to 1 but R doesn't know this so using the median function isn't going to provide me with a meaningful number. I'm looking for a way to convert my column of Julian days into a new column which reflects this and enables me to create a raster of meaningful values for Julian day.
My Julian days column runs from 1-366 reflecting the day on which an event took place within a particular year. My data covers multiple years but my Julian days column starts from 1 again at the start of every year.
I've tried a few things including converting to radians but nothing has worked so far. Any help would be much appreciated!
To get what I wanted I first had to scale my "Julian days" column to degrees, then I could convert degrees to radians using the as_radians function in the aspace package and then I could use circular statistics on the radians:
# Scale Julian days to degrees
df$degrees <- (df$jday/366)*360
# Convert degrees to radians
df$radians <- as_radians(df$degrees)
# Convert df to a spatial object
df.sp <- df
coordinates(df.sp) <- ~ x + y
proj4string(df.sp) <- proj4string(coordinates)
# Rasterise radians
radians.raster <- rasterize(df.sp, base.raster, field = "radians", fun = mean.circular)
# Plot rasterised radians
plot(radians.raster)
Currently the figures will be slightly inaccurate as (when converting to degrees) leap years should be divided by 366 and non-leap years by 365 but I'll fix this with a simple loop which looks up the year (also included in my df) for each row and uses 366/365 appropriately.

How to calculate summary statistics within specified date/time range within time series, using an input of multiple start and end dates?

I have a (dummy) data frame with time series data:
datetime <- as.POSIXct(seq(ISOdate(2012,12,22), ISOdate(2012,12,23), by="hour"), tz='EST')
data <- rnorm(25, 10, 5)
df <- data.frame(datetime, data)
I also have a separate data frame with start and end times as the two columns:
start <- as.POSIXct(c('2012/12/22 19:53', '2012/12/22 23:05'), tz='gmt')
end <- as.POSIXct(c('2012/12/22 21:06', '2012/12/22 23:58'), tz='gmt')
index <- data.frame(start, end)
What I'd like to do is "feed" the main data frame 'df' the 'index' data frame, and, for each start and end date/time combination, find the average value of "data" within that date/time range. This would be equivalent to doing a subset of 'df' manually for each start/end time, but in a combined fashion. (My real data set has years of data, and a hundred date/time ranges I want to feed it FYI).
End goal is to have three columns, start time, end time, and the average numeric value of 'data' within those times.
In general you don't want to grow a data frame one row at a time by calling rbind because it is very inefficient (see the second circle of the R inferno for details). In your case, you can use sapply to replicate this logic:
index$mean <- sapply(1:nrow(index), function(i) mean(df[df$datetime >= index$start[i] &
df$datetime <= index$end[i],2]))
index
# start end mean
# 1 2012-12-22 19:53:00 2012-12-22 21:06:00 9.563336
# 2 2012-12-22 23:05:00 2012-12-22 23:58:00 NaN
I figured out how to do it with a for loop. If anyone has a more efficient solution, that would be great. The for loop solution:
d <- data.frame()
for i in (1:nrow(index)) {
d <- rbind(d, mean(subset(df, datetime >= index[i,1] &
datetime <= index[i,2])[,2]))}

How do I add periods to time series in R after aggregation

I have a two variable dataframe (df) in R of daily sales for a ten year period from 2004-07-09 through 2014-12-31. Not every single date is represented in the ten year period, but pretty much most days Monday through Friday.
My objective is to aggregate sales by quarter, convert to a time series object, and run a seasonal decomposition and other time series forecasting.
I am having trouble with the conversion, as ulitmately I receive a error:
time series has no or less than 2 periods
Here's the structure of my code.
# create a time series object
library(xts)
x <- xts(df$amount, df$date)
# create a time series object aggregated by quarter
q.x <- apply.quarterly(x, sum)
When I try to run
fit <- stl(q.x, s.window = "periodic")
I get the error message
series is not periodic or has less than two periods
When I try to run
q.x.components <- decompose(q.x)
# or
decompose(x)
I get the error message
time series has no or less than 2 periods
So, how do I take my original dataframe, with a date variable and an amount variable (sales), aggregate that quarterly as a time series object, and then run a time series analysis?
I think I was able to answer my own question. I did this. Can anyone confirm if this structure makes sense?
library(lubridate)
# add a new variable indicating the calendar year.quarter (i.e. 2004.3) of each observation
df$year.quarter <- quarter(df$date, with_year = TRUE)
library(plyr)
# summarize gift amount by year.quarter
new.data <- ddply(df, .(year.quarter), summarize,
sum = round(sum(amount), 2))
# convert the new data to a quarterly time series object beginning
# in July 2004 (2004, Q3) and ending in December 2014 (2014, Q4)
nd.ts <- ts(new.data$sum, start = c(2004,3), end = c(2014,4), frequency = 4)

Resources