Converting a dataframe to a time series in R

I have a dataframe :
> dsa[1:20]
Ordered.Item date Qty
1: 2011001FAM002025001 2019-06-01 19440.00
2: 2011001FAM002025001 2019-05-01 24455.53
3: 2011001FAM002025001 2019-04-01 16575.06
4: 2011001FAM002025001 2019-03-01 880.00
5: 2011001FAM002025001 2019-02-01 5000.00
6: 2011001FAM002035001 2019-04-01 175.00
7: 2011001FAM004025001 2019-06-01 2000.00
8: 2011001FAM004025001 2019-05-01 2500.00
9: 2011001FAM004025001 2019-04-01 3000.00
10: 2011001FAM012025001 2019-06-01 1200.00
11: 2011001FAM012025001 2019-04-01 1074.02
12: 2011001FAM022025001 2019-06-01 350.00
13: 2011001FAM022025001 2019-05-01 110.96
14: 2011001FAM022025001 2019-04-01 221.13
15: 2011001FAM022035001 2019-06-01 500.00
16: 2011001FAM022035001 2019-05-01 18.91
17: 2011001FAM027025001 2019-06-01 210.00
18: 2011001FAM028025001 2019-04-01 327.21
19: 2011001FBK005035001 2019-05-01 500.00
20: 2011001FBL001025001 2019-06-01 15350.00
>str(dsa)
Classes ‘data.table’ and 'data.frame': 830 obs. of 3 variables:
$ Ordered.Item: Factor w/ 435 levels "2011001FAM002025001",..: 1 1 1 1 1 2 3 3 3 4 ...
$ date : Date, format: "2019-06-01" "2019-05-01" "2019-04-01" ...
$ Qty : num 19440 24456 16575 880 5000 ...
- attr(*, ".internal.selfref")=<externalptr>
This data contains SKUs and their quantity sold per month.
Because I plan to use ARIMA forecasting, I am trying to convert the dataframe to a time series, but I get a weird output:
> timesr<-ts(data=dsa,start=c(12,2018),frequency = 12)
> head(timesr)
Ordered.Item date Qty
[1,] 1 18048 19440.00
[2,] 1 18017 24455.53
[3,] 1 17987 16575.06
[4,] 1 17956 880.00
[5,] 1 17928 5000.00
[6,] 2 17987 175.00

The weird output appears because ts() coerces the whole data.frame to a numeric matrix, so the factor and Date columns are reduced to their underlying integer codes. You might try something like this for your SKU ARIMA modeling.
# Create dataframe
dsa = read.table(text = '
ID Ordered.Item date Qty
1 2011001FAM002025001 2019-06-01 19440.00
2 2011001FAM002025001 2019-05-01 24455.53
3 2011001FAM002025001 2019-04-01 16575.06
4 2011001FAM002025001 2019-03-01 880.00
5 2011001FAM002025001 2019-02-01 5000.00
6 2011001FAM002035001 2019-04-01 175.00
7 2011001FAM004025001 2019-06-01 2000.00
8 2011001FAM004025001 2019-05-01 2500.00
9 2011001FAM004025001 2019-04-01 3000.00
10 2011001FAM012025001 2019-06-01 1200.00
11 2011001FAM012025001 2019-04-01 1074.02
12 2011001FAM022025001 2019-06-01 350.00
13 2011001FAM022025001 2019-05-01 110.96
14 2011001FAM022025001 2019-04-01 221.13
15 2011001FAM022035001 2019-06-01 500.00
16 2011001FAM022035001 2019-05-01 18.91
17 2011001FAM027025001 2019-06-01 210.00
18 2011001FAM028025001 2019-04-01 327.21
19 2011001FBK005035001 2019-05-01 500.00
20 2011001FBL001025001 2019-06-01 15350.00
', header = T)
dsa$ID <- NULL
# Reshape
dsa2 <- reshape(data=dsa,idvar="date", v.names = "Qty", timevar = "Ordered.Item", direction="wide")
dsa2 <- dsa2[order(as.Date(dsa2$date, "%Y-%m-%d")),] # Sort by date
# Predict for sku 2011001FAM002025001
library(forecast)
fit <- auto.arima(ts(dsa2$Qty.2011001FAM002025001))
fcast <- forecast(fit, h=60) # forecast 60 periods ahead
plot(fcast)
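If you want the series to carry real calendar information, you can also build a ts for a single SKU explicitly. A minimal sketch, assuming the reshaped dsa2 from above, the forecast package loaded as above, and that this SKU's observations run monthly from February 2019:
qty <- dsa2$Qty.2011001FAM002025001 # one SKU's monthly quantities, oldest first after the sort
sku_ts <- ts(qty, start = c(2019, 2), frequency = 12) # monthly series starting Feb 2019
fit2 <- auto.arima(sku_ts)
plot(forecast(fit2, h = 12)) # forecast 12 months ahead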

Related

Finding clusters depending on temporal distance with data.table

I am trying to identify clusters in a dataframe that are within 4 subsequent days of the first event. Additionally we have a grouping variable.
Here is an example:
startDate <- as.POSIXct("2022-10-01")
dt1 <- data.table(
id = 1:20,
timestamp = startDate+ lubridate::days(rep(1:10,2))+ lubridate::hours(1:20),
group_id = rep(c("A","B"), each= 10)
)
id timestamp group_id t_diff
1: 1 2022-10-02 01:00:00 A 0.000000 days
2: 2 2022-10-03 02:00:00 A 1.041667 days
3: 3 2022-10-04 03:00:00 A 2.083333 days
4: 4 2022-10-05 04:00:00 A 3.125000 days
5: 5 2022-10-06 05:00:00 A 4.166667 days
6: 6 2022-10-07 06:00:00 A 5.208333 days
7: 7 2022-10-08 07:00:00 A 6.250000 days
8: 8 2022-10-09 08:00:00 A 7.291667 days
9: 9 2022-10-10 09:00:00 A 8.333333 days
10: 10 2022-10-11 10:00:00 A 9.375000 days
11: 11 2022-10-02 11:00:00 B 0.000000 days
12: 12 2022-10-03 12:00:00 B 1.041667 days
13: 13 2022-10-04 13:00:00 B 2.083333 days
14: 14 2022-10-05 14:00:00 B 3.125000 days
15: 15 2022-10-06 15:00:00 B 4.166667 days
16: 16 2022-10-07 16:00:00 B 5.208333 days
17: 17 2022-10-08 17:00:00 B 6.250000 days
18: 18 2022-10-09 18:00:00 B 7.291667 days
19: 19 2022-10-10 19:00:00 B 8.333333 days
20: 20 2022-10-11 20:00:00 B 9.375000 days
The result should look like this:
id timestamp group_id t_diff cluster_id
1: 1 2022-10-02 01:00:00 A 0.000000 days 1
2: 2 2022-10-03 02:00:00 A 1.041667 days 1
3: 3 2022-10-04 03:00:00 A 2.083333 days 1
4: 4 2022-10-05 04:00:00 A 3.125000 days 1
5: 5 2022-10-06 05:00:00 A 4.166667 days 2
6: 6 2022-10-07 06:00:00 A 5.208333 days 2
7: 7 2022-10-08 07:00:00 A 6.250000 days 2
8: 8 2022-10-09 08:00:00 A 7.291667 days 2
9: 9 2022-10-10 09:00:00 A 8.333333 days 3
10: 10 2022-10-11 10:00:00 A 9.375000 days 3
11: 11 2022-10-02 11:00:00 B 0.000000 days 4
12: 12 2022-10-03 12:00:00 B 1.041667 days 4
13: 13 2022-10-04 13:00:00 B 2.083333 days 4
14: 14 2022-10-05 14:00:00 B 3.125000 days 4
15: 15 2022-10-06 15:00:00 B 4.166667 days 5
16: 16 2022-10-07 16:00:00 B 5.208333 days 5
17: 17 2022-10-08 17:00:00 B 6.250000 days 5
18: 18 2022-10-09 18:00:00 B 7.291667 days 5
19: 19 2022-10-10 19:00:00 B 8.333333 days 6
20: 20 2022-10-11 20:00:00 B 9.375000 days 6
I have tried an approach with lapply, but the code is ugly and very slow. I am looking for a data.table approach, but I don't know how to dynamically refer to the "first" observation.
By first observation I mean the first observation of the 4 day interval.
You can use integer division.
Note that as.numeric(), run on a difftime object, takes a units argument that converts the difference to the desired time unit.
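For instance, a quick check of that units argument on a single difference (values taken from the table above):
d <- as.POSIXct("2022-10-05 04:00:00") - as.POSIXct("2022-10-02 01:00:00")
as.numeric(d, units = "days") # 3.125
as.numeric(d, units = "days") %/% 4 # 0 -> still inside the first 4-day window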
startDate <- as.POSIXct("2022-10-01")
dt1 <- data.table::data.table(
id = 1:20,
timestamp = startDate + lubridate::days(rep(1:10,2)) + lubridate::hours(1:20),
group_id = rep(c("A","B"), each= 10)
)
#
dt1[, GRP := as.numeric(timestamp - min(timestamp),
units = "days") %/% 4,
by = group_id][]
#> id timestamp group_id GRP
#> 1: 1 2022-10-02 01:00:00 A 0
#> 2: 2 2022-10-03 02:00:00 A 0
#> 3: 3 2022-10-04 03:00:00 A 0
#> 4: 4 2022-10-05 04:00:00 A 0
#> 5: 5 2022-10-06 05:00:00 A 1
#> 6: 6 2022-10-07 06:00:00 A 1
#> 7: 7 2022-10-08 07:00:00 A 1
#> 8: 8 2022-10-09 08:00:00 A 1
#> 9: 9 2022-10-10 09:00:00 A 2
#> 10: 10 2022-10-11 10:00:00 A 2
#> 11: 11 2022-10-02 11:00:00 B 0
#> 12: 12 2022-10-03 12:00:00 B 0
#> 13: 13 2022-10-04 13:00:00 B 0
#> 14: 14 2022-10-05 14:00:00 B 0
#> 15: 15 2022-10-06 15:00:00 B 1
#> 16: 16 2022-10-07 16:00:00 B 1
#> 17: 17 2022-10-08 17:00:00 B 1
#> 18: 18 2022-10-09 18:00:00 B 1
#> 19: 19 2022-10-10 19:00:00 B 2
#> 20: 20 2022-10-11 20:00:00 B 2
# When you want a single ID index
# alternatively, just use the combination of group_id and GRP in subsequent `by`s
dt1[, cluster_id := .GRP, by = .(group_id, GRP)][]
#> id timestamp group_id GRP cluster_id
#> 1: 1 2022-10-02 01:00:00 A 0 1
#> 2: 2 2022-10-03 02:00:00 A 0 1
#> 3: 3 2022-10-04 03:00:00 A 0 1
#> 4: 4 2022-10-05 04:00:00 A 0 1
#> 5: 5 2022-10-06 05:00:00 A 1 2
#> 6: 6 2022-10-07 06:00:00 A 1 2
#> 7: 7 2022-10-08 07:00:00 A 1 2
#> 8: 8 2022-10-09 08:00:00 A 1 2
#> 9: 9 2022-10-10 09:00:00 A 2 3
#> 10: 10 2022-10-11 10:00:00 A 2 3
#> 11: 11 2022-10-02 11:00:00 B 0 4
#> 12: 12 2022-10-03 12:00:00 B 0 4
#> 13: 13 2022-10-04 13:00:00 B 0 4
#> 14: 14 2022-10-05 14:00:00 B 0 4
#> 15: 15 2022-10-06 15:00:00 B 1 5
#> 16: 16 2022-10-07 16:00:00 B 1 5
#> 17: 17 2022-10-08 17:00:00 B 1 5
#> 18: 18 2022-10-09 18:00:00 B 1 5
#> 19: 19 2022-10-10 19:00:00 B 2 6
#> 20: 20 2022-10-11 20:00:00 B 2 6

How to calculate the change using a baseline hour in R?

I have this data frame
library(lubridate)
df <- data.frame(seq(ymd_h("2017-01-01-00"), ymd_h("2020-01-31-24"), by = "hours"))
df$close <- rnorm(nrow(df), 3000, 150)
colnames(df) <- c("date", "close")
df$date <- as.POSIXct(df$date, format = "%Y-%m-%d %H:%M:%S")
df$hour <- hour(df$date)
df$day <- day(df$date)
df$month <- month(df$date)
df$year <- year(df$date)
I want to get the change in the close price relative to hour 16. For example, the mean change in price across all the data at hour 18 relative to hour 16, and so on for all the hours. I want to set one hour as a baseline and get the change in price relative to it.
This is what I did. First I used lag, but I am not sure how to set hour 16 as the baseline; this does not even get me close to the result I want. In a second approach I used lead, but I have the same problem:
df_2 <- df %>% group_by(year, month, day, hour) %>%
mutate(change = (close-lead(close)))
In summary, I want to calculate, for each day, the change in price from hour 16, and then get the mean change in price from hour 16 to each of the other hours.
If you need the around-the-clock diff:
library(data.table)
setDT(df)
df[, date_number := as.numeric(as.Date(ymd_h( sprintf("%d-%d-%dT%d",year,month,day,hour) ) - hours(16))) ]
df[, delta := close - close[ hour == 16 ], .(date_number) ]
head( df, n=48 )
tail( df, n=48 )
df[, .(meanPerHour = mean(delta)), .(hour) ]
To do it correctly you need to create a Date object, which you can see in the code, then subtract 16 hours (or add 8) to make 16:00 your new 0:00, cast this back to a Date, and group by that Date's day number (which you get from as.numeric).
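A minimal sketch of that shift on two timestamps (assuming lubridate is attached, as in the question):
x <- ymd_h("2017-01-02-15") # 15:00 still belongs to the "day" that started at 16:00 on Jan 1
as.Date(x - hours(16)) # "2017-01-01"
y <- ymd_h("2017-01-02-16") # 16:00 opens a new "day"
as.Date(y - hours(16)) # "2017-01-02"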
The first 48 rows:
> head( df, n=48 )
date close hour day month year date_number delta
1: 2017-01-01 00:00:00 2924.671 0 1 1 2017 17166 NA
2: 2017-01-01 01:00:00 3019.730 1 1 1 2017 17166 NA
3: 2017-01-01 02:00:00 2988.162 2 1 1 2017 17166 NA
4: 2017-01-01 03:00:00 3133.018 3 1 1 2017 17166 NA
5: 2017-01-01 04:00:00 3017.546 4 1 1 2017 17166 NA
6: 2017-01-01 05:00:00 3047.795 5 1 1 2017 17166 NA
7: 2017-01-01 06:00:00 2912.731 6 1 1 2017 17166 NA
8: 2017-01-01 07:00:00 3107.180 7 1 1 2017 17166 NA
9: 2017-01-01 08:00:00 2876.211 8 1 1 2017 17166 NA
10: 2017-01-01 09:00:00 2946.021 9 1 1 2017 17166 NA
11: 2017-01-01 10:00:00 3013.483 10 1 1 2017 17166 NA
12: 2017-01-01 11:00:00 3014.441 11 1 1 2017 17166 NA
13: 2017-01-01 12:00:00 2969.755 12 1 1 2017 17166 NA
14: 2017-01-01 13:00:00 3110.976 13 1 1 2017 17166 NA
15: 2017-01-01 14:00:00 3018.507 14 1 1 2017 17166 NA
16: 2017-01-01 15:00:00 2995.602 15 1 1 2017 17166 NA
17: 2017-01-01 16:00:00 2941.672 16 1 1 2017 17167 0.000000
18: 2017-01-01 17:00:00 3076.628 17 1 1 2017 17167 134.956576
19: 2017-01-01 18:00:00 2862.928 18 1 1 2017 17167 -78.743991
20: 2017-01-01 19:00:00 3346.545 19 1 1 2017 17167 404.872660
21: 2017-01-01 20:00:00 2934.287 20 1 1 2017 17167 -7.385360
22: 2017-01-01 21:00:00 3114.609 21 1 1 2017 17167 172.937229
23: 2017-01-01 22:00:00 3039.294 22 1 1 2017 17167 97.622331
24: 2017-01-01 23:00:00 3116.011 23 1 1 2017 17167 174.338827
25: 2017-01-02 00:00:00 2877.843 0 2 1 2017 17167 -63.828732
26: 2017-01-02 01:00:00 2934.232 1 2 1 2017 17167 -7.439448
27: 2017-01-02 02:00:00 2891.967 2 2 1 2017 17167 -49.705095
28: 2017-01-02 03:00:00 3034.642 3 2 1 2017 17167 92.969817
29: 2017-01-02 04:00:00 2826.341 4 2 1 2017 17167 -115.331282
30: 2017-01-02 05:00:00 3037.061 5 2 1 2017 17167 95.389536
31: 2017-01-02 06:00:00 2986.333 6 2 1 2017 17167 44.661103
32: 2017-01-02 07:00:00 3263.606 7 2 1 2017 17167 321.934480
33: 2017-01-02 08:00:00 2979.311 8 2 1 2017 17167 37.638695
34: 2017-01-02 09:00:00 2983.321 9 2 1 2017 17167 41.649113
35: 2017-01-02 10:00:00 2896.498 10 2 1 2017 17167 -45.174011
36: 2017-01-02 11:00:00 2966.731 11 2 1 2017 17167 25.059003
37: 2017-01-02 12:00:00 3027.436 12 2 1 2017 17167 85.764290
38: 2017-01-02 13:00:00 3062.598 13 2 1 2017 17167 120.926630
39: 2017-01-02 14:00:00 3159.810 14 2 1 2017 17167 218.138486
40: 2017-01-02 15:00:00 3145.530 15 2 1 2017 17167 203.858440
41: 2017-01-02 16:00:00 2984.756 16 2 1 2017 17168 0.000000
42: 2017-01-02 17:00:00 3210.481 17 2 1 2017 17168 225.724909
43: 2017-01-02 18:00:00 2733.484 18 2 1 2017 17168 -251.271959
44: 2017-01-02 19:00:00 3093.430 19 2 1 2017 17168 108.674494
45: 2017-01-02 20:00:00 2921.657 20 2 1 2017 17168 -63.098117
46: 2017-01-02 21:00:00 3198.335 21 2 1 2017 17168 213.579029
47: 2017-01-02 22:00:00 2945.484 22 2 1 2017 17168 -39.271663
48: 2017-01-02 23:00:00 3197.860 23 2 1 2017 17168 213.104247
The last 48 records:
> tail( df, n=48 )
date close hour day month year date_number delta
1: 18290 3170.775 1 30 1 2020 18290 201.47027428
2: 18290 3293.403 2 30 1 2020 18290 324.09870453
3: 18290 2940.591 3 30 1 2020 18290 -28.71382979
4: 18290 2922.411 4 30 1 2020 18290 -46.89312915
5: 18290 3237.419 5 30 1 2020 18290 268.11402422
6: 18290 2989.678 6 30 1 2020 18290 20.37332637
7: 18290 2932.777 7 30 1 2020 18290 -36.52746038
8: 18291 3188.269 8 30 1 2020 18290 218.96474627
9: 18291 3003.327 9 30 1 2020 18290 34.02206527
10: 18291 2969.222 10 30 1 2020 18290 -0.08292166
11: 18291 2848.911 11 30 1 2020 18290 -120.39313851
12: 18291 2892.804 12 30 1 2020 18290 -76.50054871
13: 18291 3064.894 13 30 1 2020 18290 95.58913403
14: 18291 3172.009 14 30 1 2020 18290 202.70445747
15: 18291 3373.631 15 30 1 2020 18290 404.32650780
16: 18291 3019.765 16 30 1 2020 18291 0.00000000
17: 18291 2748.688 17 30 1 2020 18291 -271.07660267
18: 18291 2718.065 18 30 1 2020 18291 -301.70056024
19: 18291 2817.891 19 30 1 2020 18291 -201.87390563
20: 18291 3086.820 20 30 1 2020 18291 67.05492016
21: 18291 2972.657 21 30 1 2020 18291 -47.10804222
22: 18291 3009.258 22 30 1 2020 18291 -10.50687269
23: 18291 2949.268 23 30 1 2020 18291 -70.49745611
24: 18291 3032.938 0 31 1 2020 18291 13.17296251
25: 18291 3267.187 1 31 1 2020 18291 247.42241735
26: 18291 2984.129 2 31 1 2020 18291 -35.63610546
27: 18291 3053.728 3 31 1 2020 18291 33.96259834
28: 18291 3290.451 4 31 1 2020 18291 270.68616991
29: 18291 2875.921 5 31 1 2020 18291 -143.84421823
30: 18291 3159.612 6 31 1 2020 18291 139.84677795
31: 18291 2798.017 7 31 1 2020 18291 -221.74778788
32: 18292 2833.522 8 31 1 2020 18291 -186.24270860
33: 18292 3184.870 9 31 1 2020 18291 165.10465470
34: 18292 3037.279 10 31 1 2020 18291 17.51427029
35: 18292 3260.309 11 31 1 2020 18291 240.54407728
36: 18292 3178.804 12 31 1 2020 18291 159.03915248
37: 18292 2905.164 13 31 1 2020 18291 -114.60150340
38: 18292 2928.120 14 31 1 2020 18291 -91.64555778
39: 18292 2975.566 15 31 1 2020 18291 -44.19924163
40: 18292 3060.792 16 31 1 2020 18292 0.00000000
41: 18292 2916.899 17 31 1 2020 18292 -143.89373840
42: 18292 3297.537 18 31 1 2020 18292 236.74429212
43: 18292 3208.996 19 31 1 2020 18292 148.20392802
44: 18292 2791.129 20 31 1 2020 18292 -269.66375428
45: 18292 2842.001 21 31 1 2020 18292 -218.79120834
46: 18292 2992.381 22 31 1 2020 18292 -68.41127630
47: 18292 3189.018 23 31 1 2020 18292 128.22565814
48: 18292 2962.099 0 1 2 2020 18292 -98.69355677
The average per hour:
> df[, .(meanPerHour = mean(delta)), .(hour) ]
hour meanPerHour
1: 0 3.5877077
2: 1 1.3695897
3: 2 0.1010658
4: 3 1.4441742
5: 4 -3.0837907
6: 5 -3.1353593
7: 6 11.3738058
8: 7 4.7171345
9: 8 5.0449846
10: 9 1.3226027
11: 10 -2.3716443
12: 11 1.4710920
13: 12 -4.8875706
14: 13 4.7203754
15: 14 2.3528875
16: 15 2.3075150
17: 16 0.0000000
18: 17 -2.1353366
19: 18 4.5127309
20: 19 5.2032461
21: 20 3.8043017
22: 21 3.7928297
23: 22 -3.9258290
24: 23 3.0638861
And in the end, a neat function:
average.by.hour.by.reference <- function( df, hrs=16 ) {
df <- as.data.table(df)
df[, date_number := as.numeric(as.Date(ymd_h( sprintf("%d-%d-%dT%d",year,month,day,hour) ) - hours(hrs))) ]
df[, delta := close - close[ hour == hrs ], .(date_number) ]
return( df[, .(meanPerHour = mean(delta,na.rm=TRUE)), .(hour) ] )
}
average.by.hour.by.reference( df, 16 ) # produces the above results
Ironically, you can get the same result (or close enough for most real applications) by not bothering with the date-wise grouping and just doing a global group by hour, subtracting whichever hour you want as the reference.
(But then we wouldn't get to show all this fancy code!)
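A rough sketch of that simpler route (same df as above, after setDT):
hourly <- df[, .(meanClose = mean(close)), by = hour] # global mean close per hour
hourly[, deltaFrom16 := meanClose - meanClose[hour == 16]][order(hour)] # change relative to hour 16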
Using data.table you can try the following to get the difference between the baseline and the price after hour 16.
df <- data.frame(seq(ymd_h("2017-01-01-00"), ymd_h("2020-01-31-24"), by = "hours"))
set.seed(56789)
df$close <- rnorm(nrow(df), 3000, 150)
colnames(df) <- c("date", "close")
df$date <- as.POSIXct(df$date, format = "%Y-%m-%d %H:%M:%S")
df$hour <- hour(df$date)
df$day <- day(df$date)
df$month <- month(df$date)
df$year <- year(df$date)
library(data.table)
setDT(df)
df[,dummy:= ifelse(hour>=16,1,0), .(day,month,year)] #building temporary dummy variable
df[dummy==1, Difference:= close-close[1],.(day,month,year)] # computing only when dummy=1
df[1:24,] # first 24 rows
date close hour day month year dummy Difference
1: 2017-01-01 00:00:00 3159.493 0 1 1 2017 0 NA
2: 2017-01-01 01:00:00 3029.092 1 1 1 2017 0 NA
3: 2017-01-01 02:00:00 2944.042 2 1 1 2017 0 NA
4: 2017-01-01 03:00:00 3234.751 3 1 1 2017 0 NA
5: 2017-01-01 04:00:00 2900.514 4 1 1 2017 0 NA
6: 2017-01-01 05:00:00 2733.769 5 1 1 2017 0 NA
7: 2017-01-01 06:00:00 3101.770 6 1 1 2017 0 NA
8: 2017-01-01 07:00:00 2981.632 7 1 1 2017 0 NA
9: 2017-01-01 08:00:00 2913.672 8 1 1 2017 0 NA
10: 2017-01-01 09:00:00 2876.495 9 1 1 2017 0 NA
11: 2017-01-01 10:00:00 3025.853 10 1 1 2017 0 NA
12: 2017-01-01 11:00:00 3135.209 11 1 1 2017 0 NA
13: 2017-01-01 12:00:00 3038.329 12 1 1 2017 0 NA
14: 2017-01-01 13:00:00 3227.153 13 1 1 2017 0 NA
15: 2017-01-01 14:00:00 3069.497 14 1 1 2017 0 NA
16: 2017-01-01 15:00:00 2988.749 15 1 1 2017 0 NA
17: 2017-01-01 16:00:00 2920.402 16 1 1 2017 1 0.00000
18: 2017-01-01 17:00:00 2756.129 17 1 1 2017 1 -164.27264
19: 2017-01-01 18:00:00 2945.021 18 1 1 2017 1 24.61939
20: 2017-01-01 19:00:00 3078.004 19 1 1 2017 1 157.60205
21: 2017-01-01 20:00:00 3239.770 20 1 1 2017 1 319.36791
22: 2017-01-01 21:00:00 3045.156 21 1 1 2017 1 124.75450
23: 2017-01-01 22:00:00 2793.858 22 1 1 2017 1 -126.54371
24: 2017-01-01 23:00:00 3054.496 23 1 1 2017 1 134.09401
date close hour day month year dummy Difference
Then to compute the average you will need:
df[dummy==1, .(Average= mean(Difference)), .(day, month, year)]
day month year Average
1: 1 1 2017 58.70269
2: 2 1 2017 80.47927
3: 3 1 2017 -103.96512
4: 4 1 2017 -26.52648
5: 5 1 2017 112.79842
---
1122: 27 1 2020 -37.89037
1123: 28 1 2020 107.96715
1124: 29 1 2020 222.18109
1125: 30 1 2020 236.18325
1126: 31 1 2020 107.96395
To take the mean hourly you have different possibilities:
df[dummy==1, .(Average= mean(Difference)), .(hour)] # This takes the average across all the time periods; it can be thought of as the hourly mean
hour Average
1: 16 0.0000000
2: 17 -13.6811620
3: 18 0.9756538
4: 19 1.0668213
5: 20 -2.9194445
6: 21 -4.1216115
7: 22 -8.7311824
8: 23 5.6657656
df[dummy==1, .(Average= mean(Difference)), .(hour,day)] # This takes the hourly average per day of the month
hour day Average
1: 16 1 0.000000
2: 17 1 -7.226656
3: 18 1 13.162067
4: 19 1 -59.917710
5: 20 1 1.941420
---
244: 19 31 -31.069330
245: 20 31 -80.659022
246: 21 31 -14.458324
247: 22 31 -56.760001
248: 23 31 -98.356176
df[dummy==1, .(Average= mean(Difference)), .(hour,month)] # This takes the hourly average per month
hour month Average
1: 16 1 0.000000000
2: 17 1 -4.618350490
3: 18 1 40.095826732
4: 19 1 51.049164347
5: 20 1 47.760496506
6: 21 1 28.985260025
7: 22 1 21.453695738
8: 23 1 43.921050387
9: 16 2 0.000000000
10: 17 2 24.000082289
11: 18 2 2.371547684
12: 19 2 3.065889216
13: 20 2 30.568486748
14: 21 2 -7.283307589
15: 22 2 4.123056028
16: 23 2 16.827384126
17: 16 3 0.000000000
18: 17 3 -16.011701993
19: 18 3 6.322605325
20: 19 3 -29.855560457
21: 20 3 -13.706427976
22: 21 3 -4.131364097
23: 22 3 -25.854584963
24: 23 3 -18.667824140
25: 16 4 0.000000000
26: 17 4 -20.303835780
27: 18 4 5.908122132
28: 19 4 -8.934949281
29: 20 4 -21.563964556
30: 21 4 -26.050153530
31: 22 4 -16.182759246
32: 23 4 -0.367104020
33: 16 5 0.000000000
34: 17 5 -83.744224359
35: 18 5 -44.324985588
36: 19 5 -13.327785591
37: 20 5 -14.258074789
38: 21 5 -36.776426101
39: 22 5 -40.702102505
40: 23 5 -26.994831954
41: 16 6 0.000000000
42: 17 6 10.047707916
43: 18 6 3.580200953
44: 19 6 8.229738674
45: 20 6 2.976396675
46: 21 6 14.575098983
47: 22 6 12.378672353
48: 23 6 4.663891884
49: 16 7 0.000000000
50: 17 7 19.338362910
51: 18 7 31.278370567
52: 19 7 12.295521900
53: 20 7 -36.728712097
54: 21 7 25.194723060
55: 22 7 -24.817961383
56: 23 7 -6.270365221
57: 16 8 0.000000000
58: 17 8 13.125994953
59: 18 8 15.364473667
60: 19 8 29.268466966
61: 20 8 44.668839826
62: 21 8 14.083177674
63: 22 8 17.876126102
64: 23 8 50.563302963
65: 16 9 0.000000000
66: 17 9 -55.277687661
67: 18 9 -5.648068231
68: 19 9 12.181088927
69: 20 9 -42.631881383
70: 21 9 -39.224046003
71: 22 9 -24.291235470
72: 23 9 3.112446527
73: 16 10 0.000000000
74: 17 10 9.087632052
75: 18 10 -12.014161643
76: 19 10 -10.884415174
77: 20 10 18.022160926
78: 21 10 31.348117569
79: 22 10 29.875655193
80: 23 10 28.086021752
81: 16 11 0.000000000
82: 17 11 -25.057459470
83: 18 11 0.745030363
84: 19 11 -23.835528943
85: 20 11 -22.762853780
86: 21 11 -0.005295847
87: 22 11 -37.868714610
88: 23 11 -13.091041985
89: 16 12 0.000000000
90: 17 12 -35.291817797
91: 18 12 -44.854066421
92: 19 12 -33.453450088
93: 20 12 -43.362749669
94: 21 12 -62.620521565
95: 22 12 -30.582971909
96: 23 12 -26.379698528
hour month Average
> df[dummy==1, .(Average= mean(Difference)), .(hour,year)] # This takes the hourly average per year
hour year Average
1: 16 2017 0.00000000
2: 17 2017 0.01183124
3: 18 2017 -4.00877399
4: 19 2017 7.94893418
5: 20 2017 5.78072996
6: 21 2017 -4.38927559
7: 22 2017 -4.32599586
8: 23 2017 10.48530717
9: 16 2018 0.00000000
10: 17 2018 -32.52958909
11: 18 2018 -10.05792694
12: 19 2018 -11.98513416
13: 20 2018 -19.05685234
14: 21 2018 -7.55054075
15: 22 2018 -19.68501405
16: 23 2018 -6.70448412
17: 16 2019 0.00000000
18: 17 2019 -8.12025319
19: 18 2019 13.66533695
20: 19 2019 5.00197941
21: 20 2019 -2.37632221
22: 21 2019 -2.06337033
23: 22 2019 -4.47205960
24: 23 2019 11.88583864
25: 16 2020 0.00000000
26: 17 2020 -18.45530363
27: 18 2020 40.16399935
28: 19 2020 27.37843018
29: 20 2020 78.25315556
30: 21 2020 15.16866359
31: 22 2020 18.22609517
32: 23 2020 21.33292148
hour year Average
df[dummy==1, .(Average= mean(Difference)), .(hour,day,month)] # This takes the hourly average per day and month, and so on
hour day month Average
1: 16 1 1 0.000000
2: 17 1 1 -121.842677
3: 18 1 1 -58.055247
4: 19 1 1 -116.444000
5: 20 1 1 5.414297
---
2916: 19 31 12 -162.743923
2917: 20 31 12 -60.029392
2918: 21 31 12 -289.992006
2919: 22 31 12 -26.354495
2920: 23 31 12 -171.848433

Aggregate Data based on Two Different Assessment Methods in R

I'm looking to aggregate some pedometer data, gathered in steps per minute, so I get a summed number of steps up until an EMA assessment. The EMA assessments happened four times per day. An example of each of the two data sets:
Pedometer Data
ID Steps Time
1 15 2/4/2020 8:32
1 23 2/4/2020 8:33
1 76 2/4/2020 8:34
1 32 2/4/2020 8:35
1 45 2/4/2020 8:36
...
2 16 2/4/2020 8:32
2 17 2/4/2020 8:33
2 0 2/4/2020 8:34
2 5 2/4/2020 8:35
2 8 2/4/2020 8:36
EMA Data
ID Time X Y
1 2/4/2020 8:36 3 4
1 2/4/2020 12:01 3 5
1 2/4/2020 3:30 4 5
1 2/4/2020 6:45 7 8
...
2 2/4/2020 8:35 4 6
2 2/4/2020 12:05 5 7
2 2/4/2020 3:39 1 3
2 2/4/2020 6:55 8 3
I'm looking to add the pedometer data to the EMA data as a new variable, where the number of steps taken is summed until the next EMA assessment. Ideally it would look something like:
Combined Data
ID Time X Y Steps
1 2/4/2020 8:36 3 4 191
1 2/4/2020 12:01 3 5 [Sum of steps taken from 8:37 until 12:01 on 2/4/2020]
1 2/4/2020 3:30 4 5 [Sum of steps taken from 12:02 until 3:30 on 2/4/2020]
1 2/4/2020 6:45 7 8 [Sum of steps taken from 3:31 until 6:45 on 2/4/2020]
...
2 2/4/2020 8:35 4 6 38
2 2/4/2020 12:05 5 7 [Sum of steps taken from 8:36 until 12:05 on 2/4/2020]
2 2/4/2020 3:39 1 3 [Sum of steps taken from 12:06 until 3:39 on 2/4/2020]
2 2/4/2020 6:55 8 3 [Sum of steps taken from 3:40 until 6:55 on 2/4/2020]
I then need the process to continue over the entire 21 day EMA period, so the same process for the 4 EMA assessment time points on 2/5/2020, 2/6/2020, etc.
This has pushed me to the limit of my R skills, so any pointers would be extremely helpful! I'm most familiar with the tidyverse but am comfortable using base R as well. Thanks in advance for all advice.
Here's a solution using rolling joins from data.table. The basic idea is to roll each time from the pedometer data up to the next time in the EMA data (while still matching on ID). Once the next EMA time is found, all that's left is to isolate the X and Y values and sum up Steps.
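As a tiny, self-contained illustration of what roll = -Inf does in an x[i] join (toy values, not the question's data):
library(data.table)
ema <- data.table(Time = c(1, 4, 9), ema_id = c("E1", "E2", "E3"), key = "Time")
ped <- data.table(Time = c(2, 3, 5), Steps = c(10, 20, 30), key = "Time")
# each pedometer Time rolls forward to the next EMA Time at or after it:
# Time 2 -> E2, Time 3 -> E2, Time 5 -> E3
ema[ped, roll = -Inf]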
Data creation and prep:
library(data.table)
pedometer <- data.table(ID = sort(rep(1:2, 500)),
Time = rep(seq.POSIXt(as.POSIXct("2020-02-04 09:35:00 EST"),
as.POSIXct("2020-02-08 17:00:00 EST"), length.out = 500), 2),
Steps = rpois(1000, 25))
EMA <- data.table(ID = sort(rep(1:2, 4*5)),
Time = rep(seq.POSIXt(as.POSIXct("2020-02-04 05:00:00 EST"),
as.POSIXct("2020-02-08 23:59:59 EST"), by = '6 hours'), 2),
X = sample(1:8, 2*4*5, rep = T),
Y = sample(1:8, 2*4*5, rep = T))
setkey(pedometer, Time)
setkey(EMA, Time)
EMA[,next_ema_time := Time]
And now the actual join and summation:
joined <- EMA[pedometer,
on = .(ID, Time),
roll = -Inf,
j = .(ID, Time, Steps, next_ema_time, X, Y)]
result <- joined[,.('X' = min(X),
'Y' = min(Y),
'Steps' = sum(Steps)),
.(ID, next_ema_time)]
result
#> ID next_ema_time X Y Steps
#> 1: 1 2020-02-04 11:00:00 1 2 167
#> 2: 2 2020-02-04 11:00:00 8 5 169
#> 3: 1 2020-02-04 17:00:00 3 6 740
#> 4: 2 2020-02-04 17:00:00 4 6 747
#> 5: 1 2020-02-04 23:00:00 2 2 679
#> 6: 2 2020-02-04 23:00:00 3 2 732
#> 7: 1 2020-02-05 05:00:00 7 5 720
#> 8: 2 2020-02-05 05:00:00 6 8 692
#> 9: 1 2020-02-05 11:00:00 2 4 731
#> 10: 2 2020-02-05 11:00:00 4 5 773
#> 11: 1 2020-02-05 17:00:00 1 5 757
#> 12: 2 2020-02-05 17:00:00 3 5 743
#> 13: 1 2020-02-05 23:00:00 3 8 693
#> 14: 2 2020-02-05 23:00:00 1 8 740
#> 15: 1 2020-02-06 05:00:00 8 8 710
#> 16: 2 2020-02-06 05:00:00 3 2 760
#> 17: 1 2020-02-06 11:00:00 8 4 716
#> 18: 2 2020-02-06 11:00:00 1 2 688
#> 19: 1 2020-02-06 17:00:00 5 2 738
#> 20: 2 2020-02-06 17:00:00 4 6 724
#> 21: 1 2020-02-06 23:00:00 7 8 737
#> 22: 2 2020-02-06 23:00:00 6 3 672
#> 23: 1 2020-02-07 05:00:00 2 6 726
#> 24: 2 2020-02-07 05:00:00 7 7 759
#> 25: 1 2020-02-07 11:00:00 1 4 737
#> 26: 2 2020-02-07 11:00:00 5 2 737
#> 27: 1 2020-02-07 17:00:00 3 5 766
#> 28: 2 2020-02-07 17:00:00 4 4 745
#> 29: 1 2020-02-07 23:00:00 3 3 714
#> 30: 2 2020-02-07 23:00:00 2 1 741
#> 31: 1 2020-02-08 05:00:00 4 6 751
#> 32: 2 2020-02-08 05:00:00 8 2 723
#> 33: 1 2020-02-08 11:00:00 3 3 716
#> 34: 2 2020-02-08 11:00:00 3 6 735
#> 35: 1 2020-02-08 17:00:00 1 5 696
#> 36: 2 2020-02-08 17:00:00 7 7 741
#> ID next_ema_time X Y Steps
Created on 2020-02-04 by the reprex package (v0.3.0)
I would left_join ema_df on pedometer_df by ID and Time. This way you get all the lines of pedometer_df, with missing values for X and Y (which I assume are identifiers) when it is not an EMA assessment time.
I fill those values using the next available ones (i.e. the next EMA assessment's X and Y),
and finally group_by ID, X and Y and summarise to keep the datetime of the assessment (the max) and the sum of Steps.
library(dplyr)
library(tidyr)
pedometer_df %>%
left_join(ema_df, by = c("ID", "Time")) %>%
fill(X, Y, .direction = "up") %>%
group_by(ID, X, Y) %>%
summarise(
Time = max(Time),
Steps = sum(Steps)
)

Find previous date in dataframe with same column category in R

I have the following data frame:
Date.POSIXct Date WeekDay DayCategory Hour Holidays value
1 2018-05-01 00:00:00 2018-05-01 MA MA-MI-JU 0 0 30
2 2018-05-01 01:00:00 2018-05-01 MA MA-MI-JU 1 0 80
3 2018-05-01 02:00:00 2018-05-01 MA MA-MI-JU 2 0 42
4 2018-05-01 03:00:00 2018-05-01 MA MA-MI-JU 3 0 90
5 2018-05-01 04:00:00 2018-05-01 MA MA-MI-JU 4 0 95
6 2018-05-01 05:00:00 2018-05-01 MA MA-MI-JU 5 0 5
DayCategory groups days of the week in the following way: Mondays go to the LU DayCategory. Tuesdays, Wednesdays and Thursdays go to the MA-MI-JU DayCategory.
Friday goes to VI, Saturdays to SA and Sundays to DO Categories respectively.
I would like to find the value for the same hour on the previous day (Date) with the same DayCategory, while Holidays remains within the same group (e.g. if one instance has Holidays 0 but the previous day with the same DayCategory has 1, we should look for the previous one, etc.).
As an intermediate step, and to understand the process, I would like to add a column PreviousDaySameDayCategory with the Date of the previous day that has the same DayCategory as the corresponding row. Sometimes it will be just the same date minus seven days ("LU","VI","SA","DO") but other times it will be just one day before.
Reproducible data:
library(lubridate)
Date.POSIXct <- seq(as.POSIXct("2018-05-01"), as.POSIXct("2018-05-31"), "hour")
mydf <- as.data.frame(Date.POSIXct)
mydf$Date <- as.Date(substr(as.character(mydf$Date.POSIXct),1,10))
mydf$WeekDay <- substr(toupper((weekdays(mydf$Date))),1,2)
mydf$DayCategory <-as.factor(ifelse(mydf$WeekDay == "MA" | mydf$WeekDay == "MI" | mydf$WeekDay == "JU",
"MA-MI-JU", mydf$WeekDay))
mydf$Hour <- hour(mydf$Date.POSIXct)
mydf$Holidays <- c(rep(0, 24*7),rep(1, 24*7), rep(0, 24*16+1))
set.seed(123)
mydf$myvalue <- sample.int(101,size=nrow(mydf),replace=TRUE)
I have manually started the first days and created a vector of how the solution should look:
a <- rep(NA, 24)
b <- mydf$myvalue[1:24]
c <- mydf$myvalue[25:48]
d <- rep(NA, 24)
e <- rep(NA,24)
f <- rep(NA,24)
g <- rep(NA,24)
h <- rep(NA,24)
i <- mydf$myvalue[169:192]
solution <- c(a,b,c,d,e,f,g,h,i)
solution
I would appreciate any hint in the thinking process to solve this kind of problems that I face with relative frequency.
Here is a data.table solution which uses a "grouped shift()" and multiple joins to copy value from the same hour of the PreviousDaySameDayCategory.
Create reproducible data
OP's code to create reproducible data was not fully reproducible because it used the weekdays() function, which returns the weekday names in the current locale (which seems to be Spanish for the OP). To be independent of the current locale, I switched to format(Date, "%u"), which returns the numbers 1 to 7 for Monday to Sunday. Furthermore, fct_collapse() from the forcats package is used to collapse the days 2, 3, and 4 (Tuesday to Thursday) into one factor level.
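A quick, locale-independent check of those two building blocks (only base R and forcats assumed):
format(as.Date("2018-05-01"), "%u") # "2" -> Tuesday, regardless of locale
forcats::fct_collapse(factor(c("1", "2", "3", "4", "5")), "234" = c("2", "3", "4"))
# [1] 1   234 234 234 5
# Levels: 1 234 5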
library(data.table)
# note that package lubridate is not required
myDT <- data.table(Date.POSIXct = seq(as.POSIXct("2018-05-01"),
as.POSIXct("2018-05-31"), "hour"))
myDT[, Date := as.Date(Date.POSIXct)]
myDT[, Weekday := format(Date, "%u")]
myDT[, DayCategory := forcats::fct_collapse(Weekday, "234" = c("2", "3", "4"))]
myDT[, hour := hour(Date.POSIXct)]
myDT[, Holidays := c(rep(0, 24 * 7), rep(1, 24 * 7), rep(0, 24 * 16 + 1))]
set.seed(123)
myDT[, myvalue := sample.int(101, size = nrow(myDT), replace = TRUE)]
Intermediate step: PreviousDaySameDayCategory
The sample data set consists of hourly data, but in order to determine the PreviousDaySameDayCategory we need to work day-wise and thus only have to deal with the unique values of Date, DayCategory, and Holidays. The data is grouped by DayCategory and the Holidays indicator. For each group separately, the previous day is picked by lagging Date. As the result of shift() operations depends on the order of rows, the dataset has been ordered before shifting.
tmp <- unique(myDT[order(Date), .(Date, DayCategory, Holidays)])[
, .(Date, PreviousDaySameDayCategory = shift(Date)), by = .(DayCategory, Holidays)][
order(Date)]
tmp
DayCategory Holidays Date PreviousDaySameDayCategory
1: 234 0 2018-05-01 <NA>
2: 234 0 2018-05-02 2018-05-01
3: 234 0 2018-05-03 2018-05-02
4: 5 0 2018-05-04 <NA>
5: 6 0 2018-05-05 <NA>
6: 7 0 2018-05-06 <NA>
7: 1 0 2018-05-07 <NA>
8: 234 1 2018-05-08 <NA>
9: 234 1 2018-05-09 2018-05-08
10: 234 1 2018-05-10 2018-05-09
11: 5 1 2018-05-11 <NA>
12: 6 1 2018-05-12 <NA>
13: 7 1 2018-05-13 <NA>
14: 1 1 2018-05-14 <NA>
15: 234 0 2018-05-15 2018-05-03
16: 234 0 2018-05-16 2018-05-15
17: 234 0 2018-05-17 2018-05-16
18: 5 0 2018-05-18 2018-05-04
19: 6 0 2018-05-19 2018-05-05
20: 7 0 2018-05-20 2018-05-06
21: 1 0 2018-05-21 2018-05-07
22: 234 0 2018-05-22 2018-05-17
23: 234 0 2018-05-23 2018-05-22
24: 234 0 2018-05-24 2018-05-23
25: 5 0 2018-05-25 2018-05-18
26: 6 0 2018-05-26 2018-05-19
27: 7 0 2018-05-27 2018-05-20
28: 1 0 2018-05-28 2018-05-21
29: 234 0 2018-05-29 2018-05-24
30: 234 0 2018-05-30 2018-05-29
31: 234 0 2018-05-31 2018-05-30
DayCategory Holidays Date PreviousDaySameDayCategory
For days 3 and 4 (Wednesdays and Thursdays) the preceding Tuesday and Wednesday, respectively, of the same week are picked. For day 2 (Tuesday) the preceding Thursday of the preceding week is picked if both weeks have the same holiday indicator set. If the preceding week has a different holiday indicator, the most recent Thursday of the same holiday period is picked. This is why, e.g., 2018-05-03 is picked in row 15.
Copying value from matching PreviousDaySameDayCategory
This is done in two steps. First, the hourly values are picked from the matching PreviousDaySameDayCategory by joining with the matching days table tmp:
tmp2 <- myDT[tmp, on = .(Date = PreviousDaySameDayCategory), .(Date = i.Date, hour, myvalue), nomatch = 0L]
tmp2
Date hour myvalue
1: 2018-05-02 0 30
2: 2018-05-02 1 80
3: 2018-05-02 2 42
4: 2018-05-02 3 90
5: 2018-05-02 4 95
---
500: 2018-05-31 19 39
501: 2018-05-31 20 1
502: 2018-05-31 21 1
503: 2018-05-31 22 101
504: 2018-05-31 23 11
Second, a new column previousValue in myDT is created by updating in a join which contains the corresponding value from PreviousDaySameDayCategory:
myDT[tmp2, on = .(Date, hour), previousValue := i.myvalue]
Here, the first two days of the result are shown:
myDT[Date %between% c(as.Date("2018-05-01"), as.Date("2018-05-02"))]
Date.POSIXct Date Weekday DayCategory hour Holidays myvalue previousValue
1: 2018-05-01 00:00:00 2018-05-01 2 234 0 0 30 NA
2: 2018-05-01 01:00:00 2018-05-01 2 234 1 0 80 NA
3: 2018-05-01 02:00:00 2018-05-01 2 234 2 0 42 NA
4: 2018-05-01 03:00:00 2018-05-01 2 234 3 0 90 NA
5: 2018-05-01 04:00:00 2018-05-01 2 234 4 0 95 NA
6: 2018-05-01 05:00:00 2018-05-01 2 234 5 0 5 NA
7: 2018-05-01 06:00:00 2018-05-01 2 234 6 0 54 NA
8: 2018-05-01 07:00:00 2018-05-01 2 234 7 0 91 NA
9: 2018-05-01 08:00:00 2018-05-01 2 234 8 0 56 NA
10: 2018-05-01 09:00:00 2018-05-01 2 234 9 0 47 NA
11: 2018-05-01 10:00:00 2018-05-01 2 234 10 0 97 NA
12: 2018-05-01 11:00:00 2018-05-01 2 234 11 0 46 NA
13: 2018-05-01 12:00:00 2018-05-01 2 234 12 0 69 NA
14: 2018-05-01 13:00:00 2018-05-01 2 234 13 0 58 NA
15: 2018-05-01 14:00:00 2018-05-01 2 234 14 0 11 NA
16: 2018-05-01 15:00:00 2018-05-01 2 234 15 0 91 NA
17: 2018-05-01 16:00:00 2018-05-01 2 234 16 0 25 NA
18: 2018-05-01 17:00:00 2018-05-01 2 234 17 0 5 NA
19: 2018-05-01 18:00:00 2018-05-01 2 234 18 0 34 NA
20: 2018-05-01 19:00:00 2018-05-01 2 234 19 0 97 NA
21: 2018-05-01 20:00:00 2018-05-01 2 234 20 0 90 NA
22: 2018-05-01 21:00:00 2018-05-01 2 234 21 0 70 NA
23: 2018-05-01 22:00:00 2018-05-01 2 234 22 0 65 NA
24: 2018-05-01 23:00:00 2018-05-01 2 234 23 0 101 NA
25: 2018-05-02 00:00:00 2018-05-02 3 234 0 0 67 30
26: 2018-05-02 01:00:00 2018-05-02 3 234 1 0 72 80
27: 2018-05-02 02:00:00 2018-05-02 3 234 2 0 55 42
28: 2018-05-02 03:00:00 2018-05-02 3 234 3 0 61 90
29: 2018-05-02 04:00:00 2018-05-02 3 234 4 0 30 95
30: 2018-05-02 05:00:00 2018-05-02 3 234 5 0 15 5
31: 2018-05-02 06:00:00 2018-05-02 3 234 6 0 98 54
32: 2018-05-02 07:00:00 2018-05-02 3 234 7 0 92 91
33: 2018-05-02 08:00:00 2018-05-02 3 234 8 0 70 56
34: 2018-05-02 09:00:00 2018-05-02 3 234 9 0 81 47
35: 2018-05-02 10:00:00 2018-05-02 3 234 10 0 3 97
36: 2018-05-02 11:00:00 2018-05-02 3 234 11 0 49 46
37: 2018-05-02 12:00:00 2018-05-02 3 234 12 0 77 69
38: 2018-05-02 13:00:00 2018-05-02 3 234 13 0 22 58
39: 2018-05-02 14:00:00 2018-05-02 3 234 14 0 33 11
40: 2018-05-02 15:00:00 2018-05-02 3 234 15 0 24 91
41: 2018-05-02 16:00:00 2018-05-02 3 234 16 0 15 25
42: 2018-05-02 17:00:00 2018-05-02 3 234 17 0 42 5
43: 2018-05-02 18:00:00 2018-05-02 3 234 18 0 42 34
44: 2018-05-02 19:00:00 2018-05-02 3 234 19 0 38 97
45: 2018-05-02 20:00:00 2018-05-02 3 234 20 0 16 90
46: 2018-05-02 21:00:00 2018-05-02 3 234 21 0 15 70
47: 2018-05-02 22:00:00 2018-05-02 3 234 22 0 24 65
48: 2018-05-02 23:00:00 2018-05-02 3 234 23 0 48 101
Date.POSIXct Date Weekday DayCategory hour Holidays myvalue previousValue
Verification
The result is in line with OP's expectations
identical(myDT[, previousValue[seq_along(solution)]], solution)
[1] TRUE
OP has posted the same question in the Data Science section as well. I am including the same solution I have there here, in case it might help others.
It is similar to Uwe's solution but uses the dplyr library instead.
library(dplyr)
rankedDf <- mydf %>%
group_by(DayCategory, Hour, Holidays) %>%
arrange(Date) %>%
mutate(rowRank = order(Date), previousRowRank = order(Date) - 1) %>%
left_join(., ., by = c("previousRowRank" = "rowRank", "DayCategory", "Hour", "Holidays")) %>%
select(
Date.POSIXct = Date.POSIXct.x,
Date = Date.x,
WeekDay = WeekDay.x,
DayCategory,
Hour,
Holidays,
myvalue = myvalue.x,
PreviousDaySameDayCategory = Date.y,
PreviousValueSameDayCategory = myvalue.y
)
print.data.frame(rankedDf)
P.S. love the way Uwe changes the original sample code.

R changing variable value of one factor level to represent value mean of factor levels by day

I have the following dataframe:
> df
Time_Start Time_End Cut Plot Inlet_NH4N Outlet_NH4N Pump_reading Anemometer_reading
1 2016-05-05 11:19:00 2016-05-06 09:30:00 1 1 0.2336795 0.30786350 79846.9 6296343
2 2016-05-05 11:25:00 2016-05-06 09:35:00 1 3 1.0905045 0.50816024 78776.5 333116
3 2016-05-05 11:33:00 2016-05-06 09:39:00 1 6 1.3538576 0.34866469 79585.1 8970447
4 2016-05-05 11:37:00 2016-05-06 09:51:00 1 7 0.6862018 0.34124629 80043.1 8436546
5 2016-05-05 11:43:00 2016-05-06 09:43:00 1 9 0.2633531 0.73813056 79227.7 9007387
6 2016-05-05 11:48:00 2016-05-06 09:47:00 1 12 0.5934718 1.10905045 79121.5 8070785
7 2016-05-06 09:33:00 2013-05-07 10:13:00 1 1 0.5213904 2.46791444 88800.2 7807792
8 2016-05-06 09:38:00 2013-05-07 10:23:00 1 3 0.1684492 0.22905526 89123.0 14127
9 2016-05-06 09:42:00 2013-05-07 10:28:00 1 6 0.4393939 0.09001782 89157.6 9844162
10 2016-05-06 09:53:00 2013-05-07 10:34:00 1 7 0.1470588 1.03832442 88852.6 9143733
11 2016-05-06 09:45:00 2013-05-07 10:40:00 1 9 0.1114082 0.32531194 89635.6 10122720
12 2016-05-06 09:50:00 2013-05-07 10:43:00 1 12 0.6853832 2.51426025 89582.6 8924198
Here is the str:
> str(df)
'data.frame': 12 obs. of 8 variables:
$ Time_Start : POSIXct, format: "2016-05-05 11:19:00" "2016-05-05 11:25:00" "2016-05-05 11:33:00" ...
$ Time_End : POSIXct, format: "2016-05-06 09:30:00" "2016-05-06 09:35:00" "2016-05-06 09:39:00" ...
$ Cut : Factor w/ 1 level "1": 1 1 1 1 1 1 1 1 1 1 ...
$ Plot : Factor w/ 8 levels "1","3","6","7",..: 1 2 3 4 5 6 1 2 3 4 ...
$ Inlet_NH4N : num 0.234 1.091 1.354 0.686 0.263 ...
$ Outlet_NH4N : num 0.308 0.508 0.349 0.341 0.738 ...
$ Pump_reading : num 79847 78777 79585 80043 79228 ...
$ Anemometer_reading: int 6296343 333116 8970447 8436546 9007387 8070785 7807792 14127 9844162 9143733 ...
This is a small segment of a larger dataset.
I have a problem with these data in that the Anemometer_reading for plot "3" is always much lower than for the other plots. This is due to a mechanical problem. I want to remove this artifact and think that the best way to do this is to take an average of the Anemometer_reading for all the plots outwith plot "3". I want to calculate this average on a daily basis.
I can calculate the daily Anemometer_reading average, excluding plot "3" like this:
library(dplyr)
> df_avg <- df %>% filter(Plot != "3") %>% group_by(as.Date(Time_End)) %>% summarise(Anemometer_mean = mean(Anemometer_reading))
> df_avg
Source: local data frame [2 x 2]
as.Date(Time_End) Anemometer_mean
<date> <dbl>
1 2013-05-07 9168521
2 2016-05-06 8156302
I'm not sure how to go about using the resulting dataframe to replace the Anemometer_reading values from plot "3".
Can anyone point me in the right direction please?
Thanks
I would follow #roland's comment. However, if you care about how you would use dplyr to do what you asked:
result <- df %>% group_by(as.Date(Time_End)) %>%
mutate(Anemometer_mean = mean(Anemometer_reading[Plot != "3"])) %>%
mutate(Anemometer_reading = replace(Anemometer_reading, Plot == "3", first(Anemometer_mean))) %>%
ungroup() %>% select(-`as.Date(Time_End)`, -Anemometer_mean)
print(result)
## A tibble: 12 x 8
## Time_Start Time_End Cut Plot Inlet_NH4N Outlet_NH4N Pump_reading Anemometer_reading
## <fctr> <fctr> <int> <int> <dbl> <dbl> <dbl> <dbl>
##1 2016-05-05 11:19:00 2016-05-06 09:30:00 1 1 0.2336795 0.30786350 79846.9 6296343
##2 2016-05-05 11:25:00 2016-05-06 09:35:00 1 3 1.0905045 0.50816024 78776.5 8156302
##3 2016-05-05 11:33:00 2016-05-06 09:39:00 1 6 1.3538576 0.34866469 79585.1 8970447
##4 2016-05-05 11:37:00 2016-05-06 09:51:00 1 7 0.6862018 0.34124629 80043.1 8436546
##5 2016-05-05 11:43:00 2016-05-06 09:43:00 1 9 0.2633531 0.73813056 79227.7 9007387
##6 2016-05-05 11:48:00 2016-05-06 09:47:00 1 12 0.5934718 1.10905045 79121.5 8070785
##7 2016-05-06 09:33:00 2013-05-07 10:13:00 1 1 0.5213904 2.46791444 88800.2 7807792
##8 2016-05-06 09:38:00 2013-05-07 10:23:00 1 3 0.1684492 0.22905526 89123.0 9168521
##9 2016-05-06 09:42:00 2013-05-07 10:28:00 1 6 0.4393939 0.09001782 89157.6 9844162
##10 2016-05-06 09:53:00 2013-05-07 10:34:00 1 7 0.1470588 1.03832442 88852.6 9143733
##11 2016-05-06 09:45:00 2013-05-07 10:40:00 1 9 0.1114082 0.32531194 89635.6 10122720
##12 2016-05-06 09:50:00 2013-05-07 10:43:00 1 12 0.6853832 2.51426025 89582.6 8924198
Instead of filter and summarise, use mutate to create a new column Anemometer_mean that computes the mean over all rows with Plot != 3. Then replace the Anemometer_reading for the rows with Plot == 3 with this mean.
In fact, you can do all this with just one mutate:
result <- df %>% group_by(as.Date(Time_End)) %>%
mutate(Anemometer_reading = replace(Anemometer_reading, Plot == "3", mean(Anemometer_reading[Plot != "3"]))) %>%
ungroup() %>% select(-`as.Date(Time_End)`)
Hope this helps.
