Calculate rolling average of simulated data series with data.table - r

I am simulating a price time series where the time horizon is built from months of 20 working days each, and 12 such months make up one year. I would now like to calculate the rolling average of this price, always based on the first day of each month.
I do have a working solution, but would like to know if there's a more elegant or faster one.
dt.oil.price
Period Month Day.Month Oil.Price Oil.Supply Risk.Free.Interest
1: 1 1 1 39.4560000 NA 0.08642857
2: 2 1 2 3.7889460 NA 0.08642857
3: 3 1 3 51.0748751 NA 0.08642857
4: 4 1 4 60.6282853 NA 0.08642857
5: 5 1 5 35.7267224 NA 0.08642857
6: 6 1 6 26.1868977 NA 0.08642857
7: 7 1 7 32.6488136 NA 0.08642857
8: 8 1 8 42.6397549 NA 0.08642857
9: 9 1 9 18.8969991 NA 0.08642857
...
20: 20 1 20 8.8036135 NA 0.08642857
21: 21 2 1 2.5559526 NA 0.08642857
22: 22 2 2 24.3996401 NA 0.08642857
...
40: 40 2 20 41.2988566 NA 0.08642857
41: 41 3 1 20.8012327 NA 0.08642857
42: 42 3 2 70.5297726 NA 0.08642857
Just to give you an idea of the structure of the data. To create the above data structure with 60 periods:
library(data.table)

set.seed(1)
dt.oil.price <- as.data.table(cbind(Period = 1:60,
                                    Month = as.integer(rep(1:(60/20), each = 20))[1:60],
                                    Oil.Price = rnorm(3*20, mean = 50, sd = 10)))
dt.oil.price[, "Day.Month" := rank(Period), by = "Month"]
With the following code I can then select all first days of a month and calculate the mean of the oil price for these days:
dt.oil.price[ Day.Month == 1, mean(Oil.Price)]
In the next step I use another helper column "Num.Months" to rank the months in reverse chronological order:
dt.oil.price[Day.Month == 1 & Period <= 8921,"Num.Months" := rank(-Period)]
and with this I can then select only the last two months for the average calculation by subsetting:
dt.oil.price[Day.Month == 1 & Period <= 8921,"Num.Months" := rank(-Period)][Num.Months <= 2, Oil.Price]
A code snippet that calculates the mean without an explicit helper column, here for up to the last 12 months (the example data only spans three):
dt.oil.price[Day.Month == 1 & Period <= 60,
             {Num.Months = rank(-Period);
              list("Period" = Period, "Month" = Month,
                   "Oil.Price" = Oil.Price, "Num.Months" = Num.Months)}
             ][Num.Months <= 12, mean(Oil.Price)]
I hope my steps are clear and that it is also clear what I would like to achieve. The moving average can also be calculated dynamically: define a reference period and then calculate the moving average over the 12 months preceding it. This can be achieved by subsetting the data.table to periods smaller than the defined period and then calculating "Num.Months" on that subset.
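For comparison, here is one possibly more compact route. This is only a sketch, not a verified answer: it assumes a data.table version that provides frollmean() (1.12.0 or later) and that a trailing window of k months is wanted. It first reduces the table to the first day of each month and then takes a rolling mean over those rows.
library(data.table)

k <- 3                                               # window length in months; illustrative choice
dt.first <- dt.oil.price[Day.Month == 1]             # one row per month (first working day)
dt.first[, Roll.Avg := frollmean(Oil.Price, n = k)]  # trailing k-month average; NA until k months exist
dt.first[, .(Period, Month, Oil.Price, Roll.Avg)]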

Related

How to calculate moving average for different starting date?

I would like to calculate moving averages for each participant in the dataset.
A participant may have more than one visit date, and I would like to calculate the average value over the past 3 days and over the past 2 days before each visit (not including the day of the visit).
For example, take id=1, date=6/6/2017.
The average value in the past 2 days should be the average of the values on 6/5/2017 and 6/4/2017.
Sample datasets are generated as below.
I am working on a much larger dataset, with more participants, more visits, and more days of value. I want to find an efficient way to calculate these averages.
timeseries <- data.frame(id = c(1,1,1,1,1,1, 2,2,2,2,2,2, 3,3,3,3,3,3),
                         date = c("6/1/2017","6/2/2017","6/3/2017","6/4/2017","6/5/2017","6/6/2017",
                                  "6/1/2017","6/2/2017","6/3/2017","6/4/2017","6/5/2017","6/6/2017",
                                  "6/1/2017","6/2/2017","6/3/2017","6/4/2017","6/5/2017","6/6/2017"),
                         value = c(2,3,4,NA,6,7,
                                   NA,9,5,NA,3,2,
                                   5,7,3,8,3,5))
> timeseries
id date value
1 1 6/1/2017 2
2 1 6/2/2017 3
3 1 6/3/2017 4
4 1 6/4/2017 NA
5 1 6/5/2017 6
6 1 6/6/2017 7
7 2 6/1/2017 NA
8 2 6/2/2017 9
9 2 6/3/2017 5
10 2 6/4/2017 NA
...
visit <- data.frame(id = c(1,1,2,3,3,3),
                    date = c("6/6/2017","6/5/2017",
                             "6/6/2017",
                             "6/6/2017","6/5/2017","6/4/2017"))
> visit
id date
1 1 6/6/2017
2 1 6/5/2017
3 2 6/6/2017
4 3 6/6/2017
5 3 6/5/2017
6 3 6/4/2017
The result table should be something like this, where mean3 is the average value in the past 3 days and mean2 is the average value in the past 2 days:
> result
id date mean3 mean2
1 1 6/6/2017
2 1 6/5/2017
3 2 6/6/2017
4 3 6/6/2017
5 3 6/5/2017
6 3 6/4/2017
For each id in visit, I subset the corresponding data from timeseries and then calculate the mean of value within n_days:
library(lubridate)
n_days = 2
sapply(1:NROW(visit), function(i)
  with(subset(x = timeseries,
              subset = timeseries$id == visit$id[i]),
       mean(x = value[difftime(time1 = mdy(visit$date[i]),
                               time2 = mdy(date),
                               units = "days") <= n_days &
                      difftime(time1 = mdy(visit$date[i]),
                               time2 = mdy(date),
                               units = "days") > 0],
            na.rm = TRUE)))
#[1] 6.0 4.0 3.0 5.5 5.5 5.0
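A data.table alternative (just a sketch, not the poster's code) is to expand each visit into its look-back days, then join and aggregate; the names n_days, lookback, visit_date and mean2 are illustrative:
library(data.table)
library(lubridate)

ts <- as.data.table(timeseries)[, date := mdy(as.character(date))]
vs <- as.data.table(visit)[, date := mdy(as.character(date))]

n_days <- 2
# one row per (visit, look-back day), then an ordinary join plus aggregation
lookback <- vs[, .(day = date - seq_len(n_days)), by = .(id, visit_date = date)]
ts[lookback, on = .(id, date = day)][
  , .(mean2 = mean(value, na.rm = TRUE)), by = .(id, visit_date)]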

R: Create a column of averages based upon groups of four rows

>head(df)
person week target actual drop_out organization agency
1: QJ1 1 30 19 TRUE BB LLC
2: GJ2 1 30 18 FALSE BB LLC
3: LJ3 1 30 22 TRUE CC BBR
4: MJ4 1 30 24 FALSE CC BBR
5: PJ5 1 35 55 FALSE AA FUN
6: EJ6 1 35 50 FALSE AA FUN
There are around 30 weeks in the dataset, with a repeating person ID each week.
I want to look at each person's values FOUR weeks at a time (so weeks 1-4, 5-8, 9-12, and so on). For each of these chunks, I want to add up all the "actual" values and divide that by the sum of the "target" values. That value could then go in a column called "monthly percent."
As per Shape's recommendation I've created a month column like so
fullReshapedDT$month <- with(fullReshapedDT, ceiling(week/4))
Trying to figure out how to iterate over the month column and calculate averages now. Trying something like this, but it obviously doesn't work:
fullReshapedDT[,.(monthly_attendance = actual/target,by=.(person_id, month)]
Have you tried creating a group variable? It will allow you to group operations by the four-week period:
setDT(df1)[, grps := ceiling(week/4)                     # create 4-week groups
           ][, sum(actual)/sum(target), .(person, grps)  # grouped operation
           ][, grps := NULL][]                           # remove the helper column
# person V1
# 1: QJ1 1.1076923
# 2: GJ2 1.1128205
# 3: LJ3 0.9948718
# 4: MJ4 0.6333333
# 5: PJ5 1.2410256
# 6: EJ6 1.0263158
# 7: QJ1 1.2108108
# 8: GJ2 0.6378378
# 9: LJ3 0.9891892
# 10: MJ4 0.8564103
# 11: PJ5 1.1729730
# 12: EJ6 0.8666667
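If the goal is a monthly percent column on every row rather than one row per person and month, the same grouping can feed a := assignment. This is only a sketch built on the answer above; the month and monthly_percent names are illustrative:
setDT(df1)[, month := ceiling(week/4)]   # 4-week groups
df1[, monthly_percent := sum(actual)/sum(target), by = .(person, month)]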

R - How to sum a column based on date range? [duplicate]

This question already has an answer here: R // Sum by based on date range.
Suppose I have df1 like this:
Date Var1
01/01/2015 1
01/02/2015 4
....
07/24/2015 1
07/25/2015 6
07/26/2015 23
07/27/2015 15
Q1: Sum of Var1 over the 3 days preceding 7/27/2015 (not including 7/27).
Q2: Sum of Var1 over the 3 days preceding 7/25/2015 (which is not the last row); essentially I want to pick any day as the reference day and then calculate the rolling sum up to it.
As suggested in one of the comments in the link referenced by @SeñorO, with a little bit of work you can use zoo::rollsum:
library(zoo)
set.seed(42)
df <- data.frame(d = seq.POSIXt(as.POSIXct('2015-01-01'), as.POSIXct('2015-02-14'), by = 'days'),
                 x = sample(20, size = 45, replace = TRUE))
k <- 3
df$sum3 <- c(0, cumsum(df$x[1:(k-1)]),
             head(zoo::rollsum(df$x, k = k), n = -1))
df
## d x sum3
## 1 2015-01-01 16 0
## 2 2015-01-02 12 16
## 3 2015-01-03 15 28
## 4 2015-01-04 15 43
## 5 2015-01-05 17 42
## 6 2015-01-06 10 47
## 7 2015-01-07 11 42
The 0, cumsum(...) part pre-populates the first two rows that rollsum cannot fill, since rollsum(x, k) returns a vector of length length(x) - k + 1. The head(..., n = -1) discards the last element, because the nth entry should sum the previous 3 days and not include its own row.
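For completeness, the same trailing 3-day sum can be written with data.table's shift(), which avoids the manual cumsum/head bookkeeping. This is a sketch assuming the df and k from the answer above; the column name sum3_dt is illustrative:
library(data.table)
# sum of the values 1, 2 and 3 rows back; fill = 0 reproduces the leading 0, x1, x1+x2
setDT(df)[, sum3_dt := Reduce(`+`, shift(x, n = 1:3, fill = 0))]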

Prepare Time Series for Machine Learning - Long to Wide Format

I have a data frame of time series data in a 'long' format where there is 1 row/observation per day. I would like to transform this data into a 'wide' format. Each row/observation should have the time series value for the current date and the previous 2 days.
To provide a concrete example, I will use the Air Quality data available in R. This is what my input data frame looks like.
> input <- airquality[1:4,c("Month", "Day", "Ozone")]
> input
Month Day Ozone
1 5 1 41
2 5 2 36
3 5 3 12
4 5 4 18
I would like to transform this input so that it looks like the following.
output <- data.frame(Month = 5, Day = 1:4, Ozone=c(41,36,12,18), Ozone.Prev.1=c(NA,41,36,12), Ozone.Prev.2=c(NA,NA,41,36))
> output
Month Day Ozone Ozone.Prev.1 Ozone.Prev.2
1 5 1 41 NA NA
2 5 2 36 41 NA
3 5 3 12 36 41
4 5 4 18 12 36
Any suggestions on a nice, clean way to do this? Many thanks in advance.
You can use the lag function from zoo, but the following small function gets the trick done without using additional packages:
shift_vector = function(vec, n) c(rep(NA, n), head(vec, -n))
output = transform(input, prev_1 = shift_vector(Ozone, 1),
                          prev_2 = shift_vector(Ozone, 2))
output
Month Day Ozone prev_1 prev_2
1 5 1 41 NA NA
2 5 2 36 41 NA
3 5 3 12 36 41
4 5 4 18 12 36
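A data.table equivalent, for readers already using it elsewhere in this document: shift() can produce several lags at once and assign them in one step (a sketch; the new column names are illustrative):
library(data.table)
out <- as.data.table(input)
out[, c("Ozone.Prev.1", "Ozone.Prev.2") := shift(Ozone, n = 1:2, type = "lag")]
out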

cross sectional sub-sets in data.table

I have a data.table which contains multiple columns, which is well represented by the following:
DT <- data.table(date = as.IDate(rep(c("2012-10-17", "2012-10-18", "2012-10-19"), each = 10)),
                 session = c(1, 2, 3), price = c(10, 11, 12, 13, 14),
                 volume = runif(30, min = 10, max = 1000))
I would like to extract a multiple column table which shows the volume traded at each price in a particular type of session -- with each column representing a date.
At present, I extract this data one date at a time using the following:
DT[session==1,][date=="2012-10-17", sum(volume), by=price]
and then bind the columns.
Is there a way of obtaining the end product (a table with each column referring to a particular date) without sticking all the single queries together, as I'm currently doing?
Thanks
Does the following do what you want? It is a combination of reshape2 and data.table.
library(reshape2)
.DT <- DT[, sum(volume), by = list(price, date, session)][, DATE := as.character(date)]
# reshape2 for casting to wide -- it doesn't seem to like IDate columns, hence
# the character DATE column
dcast(.DT, session + price ~ DATE, value.var = 'V1')
session price 2012-10-17 2012-10-18 2012-10-19
1 1 10 308.9528 592.7259 NA
2 1 11 649.7541 NA 816.3317
3 1 12 NA 502.2700 766.3128
4 1 13 424.8113 163.7651 NA
5 1 14 682.5043 NA 147.1439
6 2 10 NA 755.2650 998.7646
7 2 11 251.3691 695.0153 NA
8 2 12 791.6882 NA 275.4777
9 2 13 NA 111.7700 240.3329
10 2 14 230.6461 817.9438 NA
11 3 10 902.9220 NA 870.3641
12 3 11 NA 719.8441 963.1768
13 3 12 361.8612 563.9518 NA
14 3 13 393.6963 NA 718.7878
15 3 14 NA 871.4986 582.6158
If you just wanted session 1:
dcast(.DT[session == 1L], session + price ~ DATE)
session price 2012-10-17 2012-10-18 2012-10-19
1 1 10 308.9528 592.7259 NA
2 1 11 649.7541 NA 816.3317
3 1 12 NA 502.2700 766.3128
4 1 13 424.8113 163.7651 NA
5 1 14 682.5043 NA 147.1439
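Since data.table has shipped its own dcast method for some time, the reshape2 step can likely be dropped. This is a sketch built on the same DT, not part of the original answer:
library(data.table)
dcast(DT[, sum(volume), by = .(price, date, session)],
      session + price ~ date, value.var = "V1")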
