Randomly sample groups within groups in R

Given a data frame with a column 'Date' (yyyy-mm-dd):
Date
1 2015-01-01
2 2015-01-01
3 2015-01-01
4 2015-01-01
5 2015-01-01
6 2015-01-24
7 2015-01-24
8 2015-01-30
9 2015-01-30
...
996 2015-12-17
997 2015-12-17
998 2015-12-31
999 2015-12-31
Now I want to shuffle the data frame by Date within each month. Rows with the same Date should still be grouped together after the sampling.
The result I'm looking for could be like this:
Date
1 2015-01-24
2 2015-01-24
3 2015-01-01
4 2015-01-01
5 2015-01-01
6 2015-01-01
7 2015-01-01
8 2015-01-30
9 2015-01-30
...
996 2015-12-31
997 2015-12-31
998 2015-12-17
999 2015-12-17

Using dplyr and padr, this is a solution:
library(dplyr)
library(padr)

# make some data
x <- data.frame(Date = seq(as.Date("2016-01-01"), length.out = 730, by = "day")) %>%
  sample_frac(0.8) %>%
  arrange(Date)

x %>%
  thicken("month") %>%
  group_by(Date_month) %>%
  sample_n(10)
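If the goal is instead to reorder whole Date groups within each month (rather than draw 10 rows per month), a sketch along these lines may help; ord is a helper column introduced here, not part of the original answer:
library(dplyr)
library(padr)

x %>%
  thicken("month") %>%
  group_by(Date_month) %>%
  # give every distinct Date a random rank within its month, so rows that
  # share a Date stay together when sorted
  mutate(ord = match(Date, sample(unique(Date)))) %>%
  ungroup() %>%
  arrange(Date_month, ord) %>%
  select(Date)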

Related

Mutate a column in a tsibble dataframe, applying a Box-Cox transformation

I am a big fan of Hyndman's packages, but I stumbled over the Box-Cox transformation.
I have a data frame:
class(chicago_sales)
[1] "tbl_ts" "tbl_df" "tbl" "data.frame"
I am trying to mutate an extra column in which the Median_price variable is transformed.
foo <- chicago_sales %>%
  mutate(bc = BoxCox(x = chicago_sales$Median_price,
                     lambda = BoxCox.lambda(chicago_sales$Median_price)))
This gives me some result (probably wrong, too), and I cannot apply autoplot.
I also tried to apply the code from Hyndman's book, but failed.
What am I doing wrong? Thanks!
UPDATED:
The issue: inside a tsibble, when using dplyr, you do not call chicago_sales$Median_price, just Median_price. When working with tsibbles I would advise using fable and fabletools, but if you are using forecast, it should work like this:
library(tsibble)
library(dplyr)
library(forecast)
pedestrian %>%
  mutate(bc = BoxCox(Count, BoxCox.lambda(Count)))
# A tsibble: 66,037 x 6 [1h] <Australia/Melbourne>
# Key: Sensor [4]
Sensor Date_Time Date Time Count bc
<chr> <dttm> <date> <int> <int> <dbl>
1 Birrarung Marr 2015-01-01 00:00:00 2015-01-01 0 1630 11.3
2 Birrarung Marr 2015-01-01 01:00:00 2015-01-01 1 826 9.87
3 Birrarung Marr 2015-01-01 02:00:00 2015-01-01 2 567 9.10
4 Birrarung Marr 2015-01-01 03:00:00 2015-01-01 3 264 7.65
5 Birrarung Marr 2015-01-01 04:00:00 2015-01-01 4 139 6.52
6 Birrarung Marr 2015-01-01 05:00:00 2015-01-01 5 77 5.54
7 Birrarung Marr 2015-01-01 06:00:00 2015-01-01 6 44 4.67
8 Birrarung Marr 2015-01-01 07:00:00 2015-01-01 7 56 5.04
9 Birrarung Marr 2015-01-01 08:00:00 2015-01-01 8 113 6.17
10 Birrarung Marr 2015-01-01 09:00:00 2015-01-01 9 166 6.82
# ... with 66,027 more rows
I used a built-in dataset from the tsibble package, as you did not provide a dput of chicago_sales.
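Translated back to the question's data, the call would look like this (a sketch; chicago_sales was not provided, so the Median_price column name is taken on faith from the question):
library(tsibble)
library(dplyr)
library(forecast)

foo <- chicago_sales %>%
  # refer to the column bare, not via chicago_sales$...
  mutate(bc = BoxCox(Median_price, BoxCox.lambda(Median_price)))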

Aggregate Data based on Two Different Assessment Methods in R

I'm looking to aggregate some pedometer data, gathered in steps per minute, so that I get a summed number of steps up until each EMA assessment. The EMA assessments happened four times per day. Examples of the two data sets:
Pedometer Data
ID Steps Time
1 15 2/4/2020 8:32
1 23 2/4/2020 8:33
1 76 2/4/2020 8:34
1 32 2/4/2020 8:35
1 45 2/4/2020 8:36
...
2 16 2/4/2020 8:32
2 17 2/4/2020 8:33
2 0 2/4/2020 8:34
2 5 2/4/2020 8:35
2 8 2/4/2020 8:36
EMA Data
ID Time X Y
1 2/4/2020 8:36 3 4
1 2/4/2020 12:01 3 5
1 2/4/2020 3:30 4 5
1 2/4/2020 6:45 7 8
...
2 2/4/2020 8:35 4 6
2 2/4/2020 12:05 5 7
2 2/4/2020 3:39 1 3
2 2/4/2020 6:55 8 3
I'm looking to add the pedometer data to the EMA data as a new variable, where the number of steps taken is summed until the next EMA assessment. Ideally it would look something like this:
Combined Data
ID Time X Y Steps
1 2/4/2020 8:36 3 4 191
1 2/4/2020 12:01 3 5 [Sum of steps taken from 8:37 until 12:01 on 2/4/2020]
1 2/4/2020 3:30 4 5 [Sum of steps taken from 12:02 until 3:30 on 2/4/2020]
1 2/4/2020 6:45 7 8 [Sum of steps taken from 3:31 until 6:45 on 2/4/2020]
...
2 2/4/2020 8:35 4 6 38
2 2/4/2020 12:05 5 7 [Sum of steps taken from 8:36 until 12:05 on 2/4/2020]
2 2/4/2020 3:39 1 3 [Sum of steps taken from 12:06 until 3:39 on 2/4/2020]
2 2/4/2020 6:55 8 3 [Sum of steps taken from 3:40 until 6:55 on 2/4/2020]
I then need the process to continue over the entire 21 day EMA period, so the same process for the 4 EMA assessment time points on 2/5/2020, 2/6/2020, etc.
This has pushed me to the limit of my R skills, so any pointers would be extremely helpful! I'm most familiar with the tidyverse but am comfortable using base R as well. Thanks in advance for all advice.
Here's a solution using rolling joins from data.table. The basic idea is to roll each time from the pedometer data up to the next time in the EMA data (while still matching on ID). Once the next EMA time is found, all that's left is to isolate the X and Y values and sum up Steps.
Data creation and prep:
library(data.table)

pedometer <- data.table(ID = sort(rep(1:2, 500)),
                        Time = rep(seq.POSIXt(as.POSIXct("2020-02-04 09:35:00 EST"),
                                              as.POSIXct("2020-02-08 17:00:00 EST"),
                                              length.out = 500), 2),
                        Steps = rpois(1000, 25))

EMA <- data.table(ID = sort(rep(1:2, 4*5)),
                  Time = rep(seq.POSIXt(as.POSIXct("2020-02-04 05:00:00 EST"),
                                        as.POSIXct("2020-02-08 23:59:59 EST"),
                                        by = '6 hours'), 2),
                  X = sample(1:8, 2*4*5, rep = TRUE),
                  Y = sample(1:8, 2*4*5, rep = TRUE))

setkey(pedometer, Time)
setkey(EMA, Time)
EMA[, next_ema_time := Time]
And now the actual join and summation:
joined <- EMA[pedometer,
              on = .(ID, Time),
              roll = -Inf,
              j = .(ID, Time, Steps, next_ema_time, X, Y)]

result <- joined[, .(X = min(X),
                     Y = min(Y),
                     Steps = sum(Steps)),
                 by = .(ID, next_ema_time)]
result
#> ID next_ema_time X Y Steps
#> 1: 1 2020-02-04 11:00:00 1 2 167
#> 2: 2 2020-02-04 11:00:00 8 5 169
#> 3: 1 2020-02-04 17:00:00 3 6 740
#> 4: 2 2020-02-04 17:00:00 4 6 747
#> 5: 1 2020-02-04 23:00:00 2 2 679
#> 6: 2 2020-02-04 23:00:00 3 2 732
#> 7: 1 2020-02-05 05:00:00 7 5 720
#> 8: 2 2020-02-05 05:00:00 6 8 692
#> 9: 1 2020-02-05 11:00:00 2 4 731
#> 10: 2 2020-02-05 11:00:00 4 5 773
#> 11: 1 2020-02-05 17:00:00 1 5 757
#> 12: 2 2020-02-05 17:00:00 3 5 743
#> 13: 1 2020-02-05 23:00:00 3 8 693
#> 14: 2 2020-02-05 23:00:00 1 8 740
#> 15: 1 2020-02-06 05:00:00 8 8 710
#> 16: 2 2020-02-06 05:00:00 3 2 760
#> 17: 1 2020-02-06 11:00:00 8 4 716
#> 18: 2 2020-02-06 11:00:00 1 2 688
#> 19: 1 2020-02-06 17:00:00 5 2 738
#> 20: 2 2020-02-06 17:00:00 4 6 724
#> 21: 1 2020-02-06 23:00:00 7 8 737
#> 22: 2 2020-02-06 23:00:00 6 3 672
#> 23: 1 2020-02-07 05:00:00 2 6 726
#> 24: 2 2020-02-07 05:00:00 7 7 759
#> 25: 1 2020-02-07 11:00:00 1 4 737
#> 26: 2 2020-02-07 11:00:00 5 2 737
#> 27: 1 2020-02-07 17:00:00 3 5 766
#> 28: 2 2020-02-07 17:00:00 4 4 745
#> 29: 1 2020-02-07 23:00:00 3 3 714
#> 30: 2 2020-02-07 23:00:00 2 1 741
#> 31: 1 2020-02-08 05:00:00 4 6 751
#> 32: 2 2020-02-08 05:00:00 8 2 723
#> 33: 1 2020-02-08 11:00:00 3 3 716
#> 34: 2 2020-02-08 11:00:00 3 6 735
#> 35: 1 2020-02-08 17:00:00 1 5 696
#> 36: 2 2020-02-08 17:00:00 7 7 741
#> ID next_ema_time X Y Steps
Created on 2020-02-04 by the reprex package (v0.3.0)
I would left_join ema_df onto pedometer_df by ID and Time. This way you get all rows of pedometer_df, with missing values for x and y (which I assume are identifiers) whenever the row is not an EMA assessment time.
I fill those values using the next available ones (i.e. the x and y of the next EMA assessment), and finally group by ID, x and y and summarise to keep the datetime of assessment (the max) and the sum of steps.
library(dplyr)
library(tidyr)
pedometer_df %>%
  left_join(ema_df, by = c("ID", "Time")) %>%
  fill(x, y, .direction = "up") %>%
  group_by(ID, x, y) %>%
  summarise(Time = max(Time),
            Steps = sum(Steps))
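A minimal check on hypothetical data shaped like the question's (note this approach relies on every EMA Time also appearing as a row in the pedometer data):
pedometer_df <- data.frame(ID = 1,
                           Time = as.POSIXct("2020-02-04 08:32") + 60 * (0:4),
                           Steps = c(15, 23, 76, 32, 45))
ema_df <- data.frame(ID = 1,
                     Time = as.POSIXct("2020-02-04 08:36"),
                     x = 3, y = 4)
# running the pipeline above yields Steps = 15+23+76+32+45 = 191 for the
# 08:36 assessment, matching the first row of the desired Combined Data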

R changing variable value of one factor level to represent value mean of factor levels by day

I have the following dataframe:
> df
Time_Start Time_End Cut Plot Inlet_NH4N Outlet_NH4N Pump_reading Anemometer_reading
1 2016-05-05 11:19:00 2016-05-06 09:30:00 1 1 0.2336795 0.30786350 79846.9 6296343
2 2016-05-05 11:25:00 2016-05-06 09:35:00 1 3 1.0905045 0.50816024 78776.5 333116
3 2016-05-05 11:33:00 2016-05-06 09:39:00 1 6 1.3538576 0.34866469 79585.1 8970447
4 2016-05-05 11:37:00 2016-05-06 09:51:00 1 7 0.6862018 0.34124629 80043.1 8436546
5 2016-05-05 11:43:00 2016-05-06 09:43:00 1 9 0.2633531 0.73813056 79227.7 9007387
6 2016-05-05 11:48:00 2016-05-06 09:47:00 1 12 0.5934718 1.10905045 79121.5 8070785
7 2016-05-06 09:33:00 2013-05-07 10:13:00 1 1 0.5213904 2.46791444 88800.2 7807792
8 2016-05-06 09:38:00 2013-05-07 10:23:00 1 3 0.1684492 0.22905526 89123.0 14127
9 2016-05-06 09:42:00 2013-05-07 10:28:00 1 6 0.4393939 0.09001782 89157.6 9844162
10 2016-05-06 09:53:00 2013-05-07 10:34:00 1 7 0.1470588 1.03832442 88852.6 9143733
11 2016-05-06 09:45:00 2013-05-07 10:40:00 1 9 0.1114082 0.32531194 89635.6 10122720
12 2016-05-06 09:50:00 2013-05-07 10:43:00 1 12 0.6853832 2.51426025 89582.6 8924198
Here is the str:
> str(df)
'data.frame': 12 obs. of 8 variables:
$ Time_Start : POSIXct, format: "2016-05-05 11:19:00" "2016-05-05 11:25:00" "2016-05-05 11:33:00" ...
$ Time_End : POSIXct, format: "2016-05-06 09:30:00" "2016-05-06 09:35:00" "2016-05-06 09:39:00" ...
$ Cut : Factor w/ 1 level "1": 1 1 1 1 1 1 1 1 1 1 ...
$ Plot : Factor w/ 8 levels "1","3","6","7",..: 1 2 3 4 5 6 1 2 3 4 ...
$ Inlet_NH4N : num 0.234 1.091 1.354 0.686 0.263 ...
$ Outlet_NH4N : num 0.308 0.508 0.349 0.341 0.738 ...
$ Pump_reading : num 79847 78777 79585 80043 79228 ...
$ Anemometer_reading: int 6296343 333116 8970447 8436546 9007387 8070785 7807792 14127 9844162 9143733 ...
This is a small segment of a larger dataset.
I have a problem with these data in that the Anemometer_reading for plot "3" is always much lower than for the other plots. This is due to a mechanical problem. I want to remove this artifact, and I think the best way to do this is to take an average of the Anemometer_reading for all the plots except plot "3". I want to calculate this average on a daily basis.
I can calculate the daily Anemometer_reading average, excluding plot "3" like this:
library(dplyr)
df_avg <- df %>%
  filter(Plot != "3") %>%
  group_by(as.Date(Time_End)) %>%
  summarise(Anemometer_mean = mean(Anemometer_reading))
> df_avg
Source: local data frame [2 x 2]
as.Date(Time_End) Anemometer_mean
<date> <dbl>
1 2013-05-07 9168521
2 2016-05-06 8156302
I'm not sure how to go about using the resulting dataframe to replace the Anemometer_reading values from plot "3".
Can anyone point me in the right direction please?
Thanks
I would follow @Roland's comment. However, if you care about how you would use dplyr to do what you asked:
result <- df %>%
  group_by(as.Date(Time_End)) %>%
  mutate(Anemometer_mean = mean(Anemometer_reading[Plot != "3"])) %>%
  mutate(Anemometer_reading = replace(Anemometer_reading, Plot == "3",
                                      first(Anemometer_mean))) %>%
  ungroup() %>%
  select(-`as.Date(Time_End)`, -Anemometer_mean)
print(result)
## A tibble: 12 x 8
## Time_Start Time_End Cut Plot Inlet_NH4N Outlet_NH4N Pump_reading Anemometer_reading
## <fctr> <fctr> <int> <int> <dbl> <dbl> <dbl> <dbl>
##1 2016-05-05 11:19:00 2016-05-06 09:30:00 1 1 0.2336795 0.30786350 79846.9 6296343
##2 2016-05-05 11:25:00 2016-05-06 09:35:00 1 3 1.0905045 0.50816024 78776.5 8156302
##3 2016-05-05 11:33:00 2016-05-06 09:39:00 1 6 1.3538576 0.34866469 79585.1 8970447
##4 2016-05-05 11:37:00 2016-05-06 09:51:00 1 7 0.6862018 0.34124629 80043.1 8436546
##5 2016-05-05 11:43:00 2016-05-06 09:43:00 1 9 0.2633531 0.73813056 79227.7 9007387
##6 2016-05-05 11:48:00 2016-05-06 09:47:00 1 12 0.5934718 1.10905045 79121.5 8070785
##7 2016-05-06 09:33:00 2013-05-07 10:13:00 1 1 0.5213904 2.46791444 88800.2 7807792
##8 2016-05-06 09:38:00 2013-05-07 10:23:00 1 3 0.1684492 0.22905526 89123.0 9168521
##9 2016-05-06 09:42:00 2013-05-07 10:28:00 1 6 0.4393939 0.09001782 89157.6 9844162
##10 2016-05-06 09:53:00 2013-05-07 10:34:00 1 7 0.1470588 1.03832442 88852.6 9143733
##11 2016-05-06 09:45:00 2013-05-07 10:40:00 1 9 0.1114082 0.32531194 89635.6 10122720
##12 2016-05-06 09:50:00 2013-05-07 10:43:00 1 12 0.6853832 2.51426025 89582.6 8924198
Instead of filter and summarise, use mutate to create a new column Anemometer_mean that computes the mean over all rows with Plot != "3". Then replace the Anemometer_reading for the rows with Plot == "3" with this mean.
In fact, you can do all this with just one mutate:
result <- df %>%
  group_by(as.Date(Time_End)) %>%
  mutate(Anemometer_reading = replace(Anemometer_reading, Plot == "3",
                                      mean(Anemometer_reading[Plot != "3"]))) %>%
  ungroup() %>%
  select(-`as.Date(Time_End)`)
Hope this helps.
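For completeness, the df_avg summary from the question could also be joined back in, since the asker got that far already (a sketch; df_avg's grouping column is literally named `as.Date(Time_End)`, so it is renamed first):
library(dplyr)

df_avg <- rename(df_avg, Day = `as.Date(Time_End)`)

df %>%
  mutate(Day = as.Date(Time_End)) %>%
  left_join(df_avg, by = "Day") %>%
  # overwrite plot "3" with that day's mean of the other plots
  mutate(Anemometer_reading = ifelse(Plot == "3",
                                     Anemometer_mean, Anemometer_reading)) %>%
  select(-Day, -Anemometer_mean)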

R: Fill in all elements of sequence of datetime with patchy periodic datetime information

I guess I don't really know what to title this question, but I think it describes quite a common data manipulation requirement.
I have data that has a periodic exchange between two parties of a quantity of a good. The exchanges are made hourly. Here is an example data frame:
df <- cbind.data.frame(Seller = as.character(c("A","A","A","A","A","A")),
                       Buyer = c("B","B","B","C","C","C"),
                       DateTimeFrom = c("1/07/2013 0:00","1/07/2013 9:00","1/07/2013 0:00",
                                        "1/07/2013 6:00","1/07/2013 8:00","2/07/2013 9:00"),
                       DateTimeTo = c("1/07/2013 8:00","1/07/2013 15:00","2/07/2013 8:00",
                                      "1/07/2013 9:00","1/07/2013 12:00","2/07/2013 16:00"),
                       Qty = c(50,10,20,25,5,5))
df$DateTimeFrom <- as.POSIXct(df$DateTimeFrom, format = '%d/%m/%Y %H:%M', tz = 'GMT')
df$DateTimeTo <- as.POSIXct(df$DateTimeTo, format = '%d/%m/%Y %H:%M', tz = 'GMT')
> df
Seller Buyer DateTimeFrom DateTimeTo Qty
1 A B 2013-07-01 00:00:00 2013-07-01 08:00:00 50
2 A B 2013-07-01 09:00:00 2013-07-01 15:00:00 10
3 A B 2013-07-01 00:00:00 2013-07-02 08:00:00 20
4 A C 2013-07-01 06:00:00 2013-07-01 09:00:00 25
5 A C 2013-07-01 08:00:00 2013-07-01 12:00:00 5
6 A C 2013-07-02 09:00:00 2013-07-02 16:00:00 5
So, for example, the first row of this data frame says that the seller "A" sells 50 units of the good to the buyer "B" every hour from midnight on 1/7/13 until 8am on 1/7/13. You can also notice that some of these exchanges between the same two parties can overlap, just with a different negotiated quantity.
What I need to do (and need your help with) is to generate a sequence covering all hours over this two-day period that sums the total quantity exchanged in each hour between a seller and a buyer, over all negotiations.
Here would be the resulting dataframe.
DateTimeSeq <- data.frame(seq(ISOdate(2013,7,1,0),by = "hour", length.out = 48))
colnames(DateTimeSeq) <- c("DateTime")
#What the Answer should be
DateTimeSeq$QtyAB <- c(70,70,70,70,70,70,70,70,70,30,30,30,30,30,30,30,20,20,20,20,20,20,20,20,20,20,20,20,20,20,20,20,20,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
DateTimeSeq$QtyAC <- c(0,0,0,0,0,0,25,25,30,30,5,5,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5,5,5,5,5,5,5,5,0,0,0,0,0,0,0)
> DateTimeSeq
DateTime QtyAB QtyAC
1 2013-07-01 00:00:00 70 0
2 2013-07-01 01:00:00 70 0
3 2013-07-01 02:00:00 70 0
4 2013-07-01 03:00:00 70 0
5 2013-07-01 04:00:00 70 0
6 2013-07-01 05:00:00 70 0
7 2013-07-01 06:00:00 70 25
8 2013-07-01 07:00:00 70 25
9 2013-07-01 08:00:00 70 30
10 2013-07-01 09:00:00 30 30
11 2013-07-01 10:00:00 30 5
12 2013-07-01 11:00:00 30 5
13 2013-07-01 12:00:00 30 5
14 2013-07-01 13:00:00 30 0
15 2013-07-01 14:00:00 30 0
.... etc
Anybody able to lend a hand?
Thanks,
A
Here is my solution, which uses the dplyr and reshape packages.
library(dplyr)
library(reshape)
First, we should expand the data frame so that everything is in an hourly format. This can be done using dplyr's do.
df %>%
  rowwise() %>%
  do(data.frame(Seller = .$Seller,
                Buyer = .$Buyer,
                Qty = .$Qty,
                DateTimeCurr = seq(from = .$DateTimeFrom, to = .$DateTimeTo, by = "hour")))
Output:
Source: local data frame [66 x 4]
Groups: <by row>
Seller Buyer Qty DateTimeCurr
1 A B 50 2013-07-01 00:00:00
2 A B 50 2013-07-01 01:00:00
3 A B 50 2013-07-01 02:00:00
...
From there it is trivial to get the correct ids and summarise the totals using the group_by function.
df1 <- df %>%
  rowwise() %>%
  do(data.frame(Seller = .$Seller,
                Buyer = .$Buyer,
                Qty = .$Qty,
                DateTimeCurr = seq(from = .$DateTimeFrom, to = .$DateTimeTo, by = "hour"))) %>%
  group_by(Seller, Buyer, DateTimeCurr) %>%
  summarise(TotalQty = sum(Qty)) %>%
  mutate(id = paste0("Qty", Seller, Buyer))
Output:
Source: local data frame [48 x 5]
Groups: Seller, Buyer
Seller Buyer DateTimeCurr TotalQty id
1 A B 2013-07-01 00:00:00 70 QtyAB
2 A B 2013-07-01 01:00:00 70 QtyAB
3 A B 2013-07-01 02:00:00 70 QtyAB
From this dataframe, all we have to do is cast it into the format you have above.
> cast(df1, DateTimeCurr ~ id, value = "TotalQty")
DateTimeCurr QtyAB QtyAC
1 2013-07-01 00:00:00 70 NA
2 2013-07-01 01:00:00 70 NA
3 2013-07-01 02:00:00 70 NA
4 2013-07-01 03:00:00 70 NA
5 2013-07-01 04:00:00 70 NA
6 2013-07-01 05:00:00 70 NA
So the whole piece of code is:
df1 <- df %>%
  rowwise() %>%
  do(data.frame(Seller = .$Seller,
                Buyer = .$Buyer,
                Qty = .$Qty,
                DateTimeCurr = seq(from = .$DateTimeFrom, to = .$DateTimeTo, by = "hour"))) %>%
  group_by(Seller, Buyer, DateTimeCurr) %>%
  summarise(TotalQty = sum(Qty)) %>%
  mutate(id = paste0("Qty", Seller, Buyer))

cast(df1, DateTimeCurr ~ id, value = "TotalQty")
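One gap relative to the desired output: cast leaves NA where a pair traded nothing in a given hour, and hours with no trades at all are missing entirely. A follow-up step (an addition, not part of the original answer) closes both:
out <- cast(df1, DateTimeCurr ~ id, value = "TotalQty")
# join onto the full 48-hour sequence, then zero out the NAs
full_seq <- data.frame(DateTimeCurr = seq(ISOdate(2013, 7, 1, 0), by = "hour", length.out = 48))
out <- merge(full_seq, out, by = "DateTimeCurr", all.x = TRUE)
out[is.na(out)] <- 0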

Quarterly Year over Year Growth Rate

I have been trying to calculate the growth rate comparing quarter 1 of one year to quarter 1 of the following year.
In Excel the formula would look like this: ((B6-B2)/B2)*100.
What is the best way to accomplish this in R? I know how to get the differences from period to period, but cannot accomplish it across a difference of 4 time periods.
Here is the code:
date <- c("2000-01-01","2000-04-01", "2000-07-01",
"2000-10-01","2001-01-01","2001-04-01",
"2001-07-01","2001-10-01","2002-01-01",
"2002-04-01","2002-07-01","2002-10-01")
value <- c(1592,1825,1769,1909,2022,2287,2169,2366,2001,2087,2099,2258)
df <- data.frame(date,value)
Which will produce this data frame:
date value
1 2000-01-01 1592
2 2000-04-01 1825
3 2000-07-01 1769
4 2000-10-01 1909
5 2001-01-01 2022
6 2001-04-01 2287
7 2001-07-01 2169
8 2001-10-01 2366
9 2002-01-01 2001
10 2002-04-01 2087
11 2002-07-01 2099
12 2002-10-01 2258
Here's an option using the dplyr package:
# Convert date column to date format
df$date = as.POSIXct(df$date)
library(dplyr)
library(lubridate)
In the code below, we first group by month, which allows us to operate on each quarter separately. The arrange function just makes sure that the data within each quarter is ordered by date. Then we add the yearOverYear column using mutate which calculates the ratio of the current year to the previous year for each quarter.
df = df %>%
  group_by(month = month(date)) %>%
  arrange(date) %>%
  mutate(yearOverYear = value/lag(value, 1))
date value month yearOverYear
1 2000-01-01 1592 1 NA
2 2001-01-01 2022 1 1.2701005
3 2002-01-01 2001 1 0.9896142
4 2000-04-01 1825 4 NA
5 2001-04-01 2287 4 1.2531507
6 2002-04-01 2087 4 0.9125492
7 2000-07-01 1769 7 NA
8 2001-07-01 2169 7 1.2261164
9 2002-07-01 2099 7 0.9677271
10 2000-10-01 1909 10 NA
11 2001-10-01 2366 10 1.2393924
12 2002-10-01 2258 10 0.9543533
If you prefer to have the data frame back in overall date order after adding the year-over-year values:
df = df %>%
  group_by(month = month(date)) %>%
  arrange(date) %>%
  mutate(yearOverYear = value/lag(value, 1)) %>%
  ungroup() %>%
  arrange(date)
Or using data.table
library(data.table) # v1.9.5+
setDT(df)[, .(date, yoy = (value - shift(value))/shift(value)*100),
          by = month(date)][order(date)]
Here's a very simple solution:
YearOverYear <- function(x, periodsPerYear) {
  if (NROW(x) <= periodsPerYear) {
    stop("too few rows")
  } else {
    indexes <- 1:(NROW(x) - periodsPerYear)
    return(c(rep(NA, periodsPerYear),
             (x[indexes + periodsPerYear] - x[indexes]) / x[indexes]))
  }
}
> cbind(df, YoY = YearOverYear(df$value, 4))
date value YoY
1 2000-01-01 1592 NA
2 2000-04-01 1825 NA
3 2000-07-01 1769 NA
4 2000-10-01 1909 NA
5 2001-01-01 2022 0.27010050
6 2001-04-01 2287 0.25315068
7 2001-07-01 2169 0.22611645
8 2001-10-01 2366 0.23939235
9 2002-01-01 2001 -0.01038576
10 2002-04-01 2087 -0.08745081
11 2002-07-01 2099 -0.03227294
12 2002-10-01 2258 -0.04564666
df$yoy <- c(rep(NA, 4),
            (df$value[5:nrow(df)] - df$value[1:(nrow(df)-4)]) /
              df$value[1:(nrow(df)-4)] * 100)
df
## date value yoy
## 1 2000-01-01 1592 NA
## 2 2000-04-01 1825 NA
## 3 2000-07-01 1769 NA
## 4 2000-10-01 1909 NA
## 5 2001-01-01 2022 27.010050
## 6 2001-04-01 2287 25.315068
## 7 2001-07-01 2169 22.611645
## 8 2001-10-01 2366 23.939235
## 9 2002-01-01 2001 -1.038576
## 10 2002-04-01 2087 -8.745081
## 11 2002-07-01 2099 -3.227294
## 12 2002-10-01 2258 -4.564666
Another base R solution. It requires that the date is in Date format, so that the common months can be used as a grouping variable over which the growth-rate function is applied.
# set date to a Date object
df$date <- as.Date(df$date)
# order by date
df <- df[order(df$date), ]
# function to calculate differences
f <- function(x) c(NA, 100*diff(x)/x[-length(x)])
df$yoy <- ave(df$value, format(df$date, "%m"), FUN=f)
# date value yoy
# 1 2000-01-01 1592 NA
# 2 2000-04-01 1825 NA
# 3 2000-07-01 1769 NA
# 4 2000-10-01 1909 NA
# 5 2001-01-01 2022 27.010050
# 6 2001-04-01 2287 25.315068
# 7 2001-07-01 2169 22.611645
# 8 2001-10-01 2366 23.939235
# 9 2002-01-01 2001 -1.038576
# 10 2002-04-01 2087 -8.745081
# 11 2002-07-01 2099 -3.227294
# 12 2002-10-01 2258 -4.564666
or
c(rep(NA, 4), 100 * diff(df$value, lag = 4) / head(df$value, -4))
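For comparison, the same 4-period comparison can also be written with dplyr (a sketch, assuming the rows are already in date order):
library(dplyr)

# lag(value, 4) is the value four quarters earlier; the first four rows get NA
df %>% mutate(yoy = 100 * (value - lag(value, 4)) / lag(value, 4))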
