Grouping and splitting a data frame in R

The following promotion sales table lists each product, the group(s) in which its promotion was run, and the promotion period.
Product.code cgrp promo.from promo.to
1 1100001369 12 2014-01-01 2014-03-01
2 1100001369 16 37 2014-01-01 2014-03-01
3 1100001448 12 2014-03-01 2014-03-01
4 1100001446 12 2014-03-01 2014-03-01
5 1100001629 11 30 2014-03-01 2014-03-01
6 1100001369 16 37 2014-03-01 2014-06-01
7 1100001368 12 2014-06-01 2014-07-01
8 1100001369 12 2014-06-01 2014-07-01
9 1100001368 11 30 2014-06-01 2014-07-01
10 1100001738 11 30 2014-06-01 2014-07-01
11 1100001629 11 30 2014-06-01 2014-06-01
12 1100001738 11 30 2014-07-01 2014-07-01
13 1100001619 11 30 2014-08-01 2014-08-01
14 1100001619 11 30 2014-08-01 2014-08-01
15 1100001629 11 30 2014-08-01 2014-08-01
16 1100001738 12 2014-09-01 2014-09-01
17 1100001738 16 37 2014-08-01 2014-08-01
18 1100001448 12 2014-09-01 2014-09-01
19 1100001446 12 2014-10-01 2014-10-01
20 1100001369 12 2014-11-01 2014-11-01
21 1100001547 16 37 2014-11-01 2014-11-01
22 1100001368 11 30 2014-11-01 2014-11-01
I am trying to group by product.code and cgrp so that I can see all promotions for a product in a particular group and do further analysis.
I tried looping through the whole data.frame, but that was inefficient and buggy.
What is an efficient way to get this done?
[edit]
The goal is to get multiple data.frames like the following:
x=
Product.code cgrp promo.from promo.to
3 1100001448 12 2014-03-01 2014-03-01
18 1100001448 12 2014-09-01 2014-09-01
y=
Product.code cgrp promo.from promo.to
1 1100001369 12 2014-01-01 2014-03-01
8 1100001369 12 2014-06-01 2014-07-01
20 1100001369 12 2014-11-01 2014-11-01

You could split the 'cgrp' column and reshape the dataset to 'long' format with cSplit. Then split the dataset ('df1') by 'Product.code' and 'cgrp' to create a list ('lst').
library(splitstackshape)
# split the space-separated 'cgrp' values into one row per value
df1 <- as.data.frame(cSplit(df, 'cgrp', ' ', 'long'))
# one data.frame per Product.code/cgrp combination that actually occurs
lst <- split(df1, list(df1$Product.code, df1$cgrp), drop=TRUE)
names(lst) <- paste0('dfN', seq_along(lst))
It is usually better to keep the datasets in a list. But if you want them as separate objects in the global environment, one option is list2env (not recommended).
list2env(lst, envir=.GlobalEnv)
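If you prefer to avoid the splitstackshape dependency, a minimal base R sketch of the same idea (assuming the same df with a space-separated cgrp column) would be:
# repeat each row once per space-separated cgrp value, then split
cgrp_split <- strsplit(as.character(df$cgrp), " ")
df1 <- df[rep(seq_len(nrow(df)), lengths(cgrp_split)), ]
df1$cgrp <- unlist(cgrp_split)
lst <- split(df1, list(df1$Product.code, df1$cgrp), drop = TRUE)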

Related

How to calculate the change using a baseline hour in R?

I have this data frame
library(lubridate)
df <- data.frame(seq(ymd_h("2017-01-01-00"), ymd_h("2020-01-31-24"), by = "hours"))
df$close <- rnorm(nrow(df), 3000, 150)
colnames(df) <- c("date", "close")
df$date <- as.POSIXct(df$date, format = "%Y-%m-%d %H:%M:%S")
df$hour <- hour(df$date)
df$day <- day(df$date)
df$month <- month(df$date)
df$year <- year(df$date)
I want to get the change of the close price relative to hour 16. For example, taking hour 16 as the baseline, the mean change of the price at hour 18 across all the data was ..., and so on for all the hours. I want to set one hour as a baseline and get the change in price relative to it.
This is what I did. First I used lag, but I am not sure how to set hour 16 as the baseline, and it does not even give me something close to the result I want. The second approach uses lead, but I have the same problem:
df_2 <- df %>% group_by(year, month, day, hour) %>%
mutate(change = (close-lead(close)))
In summary, I want to calculate, for each day, the change in price from hour 16, and then get the mean change in price from hour 16 to each of the other hours.
If you need the around-the-clock difference:
library(data.table)
setDT(df)
# shift each timestamp back 16 hours so 16:00 starts a new "day", then group by that day's number
df[, date_number := as.numeric(as.Date(ymd_h(sprintf("%d-%d-%dT%d", year, month, day, hour)) - hours(16)))]
df[, delta := close - close[hour == 16], .(date_number)]
head( df, n=48 )
tail( df, n=48 )
df[, .(meanPerHour = mean(delta)), .(hour) ]
To do it correctly you need to build the date-time from the year/month/day/hour columns (as you can see in the code), subtract 16 hours (or add 8) so that 16:00 becomes your new 00:00, cast this back to a Date, and group by that Date's day number (which you get from as.numeric).
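For a single timestamp, the shift works like this (a small illustration; 17:00 on 2017-01-01 ends up in the same group as 16:00 that day):
ymd_h("2017-01-01T17") - hours(16)                        # "2017-01-01 01:00:00 UTC"
as.numeric(as.Date(ymd_h("2017-01-01T17") - hours(16)))   # 17167, matching the output below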
The first 48 rows:
> head( df, n=48 )
date close hour day month year date_number delta
1: 2017-01-01 00:00:00 2924.671 0 1 1 2017 17166 NA
2: 2017-01-01 01:00:00 3019.730 1 1 1 2017 17166 NA
3: 2017-01-01 02:00:00 2988.162 2 1 1 2017 17166 NA
4: 2017-01-01 03:00:00 3133.018 3 1 1 2017 17166 NA
5: 2017-01-01 04:00:00 3017.546 4 1 1 2017 17166 NA
6: 2017-01-01 05:00:00 3047.795 5 1 1 2017 17166 NA
7: 2017-01-01 06:00:00 2912.731 6 1 1 2017 17166 NA
8: 2017-01-01 07:00:00 3107.180 7 1 1 2017 17166 NA
9: 2017-01-01 08:00:00 2876.211 8 1 1 2017 17166 NA
10: 2017-01-01 09:00:00 2946.021 9 1 1 2017 17166 NA
11: 2017-01-01 10:00:00 3013.483 10 1 1 2017 17166 NA
12: 2017-01-01 11:00:00 3014.441 11 1 1 2017 17166 NA
13: 2017-01-01 12:00:00 2969.755 12 1 1 2017 17166 NA
14: 2017-01-01 13:00:00 3110.976 13 1 1 2017 17166 NA
15: 2017-01-01 14:00:00 3018.507 14 1 1 2017 17166 NA
16: 2017-01-01 15:00:00 2995.602 15 1 1 2017 17166 NA
17: 2017-01-01 16:00:00 2941.672 16 1 1 2017 17167 0.000000
18: 2017-01-01 17:00:00 3076.628 17 1 1 2017 17167 134.956576
19: 2017-01-01 18:00:00 2862.928 18 1 1 2017 17167 -78.743991
20: 2017-01-01 19:00:00 3346.545 19 1 1 2017 17167 404.872660
21: 2017-01-01 20:00:00 2934.287 20 1 1 2017 17167 -7.385360
22: 2017-01-01 21:00:00 3114.609 21 1 1 2017 17167 172.937229
23: 2017-01-01 22:00:00 3039.294 22 1 1 2017 17167 97.622331
24: 2017-01-01 23:00:00 3116.011 23 1 1 2017 17167 174.338827
25: 2017-01-02 00:00:00 2877.843 0 2 1 2017 17167 -63.828732
26: 2017-01-02 01:00:00 2934.232 1 2 1 2017 17167 -7.439448
27: 2017-01-02 02:00:00 2891.967 2 2 1 2017 17167 -49.705095
28: 2017-01-02 03:00:00 3034.642 3 2 1 2017 17167 92.969817
29: 2017-01-02 04:00:00 2826.341 4 2 1 2017 17167 -115.331282
30: 2017-01-02 05:00:00 3037.061 5 2 1 2017 17167 95.389536
31: 2017-01-02 06:00:00 2986.333 6 2 1 2017 17167 44.661103
32: 2017-01-02 07:00:00 3263.606 7 2 1 2017 17167 321.934480
33: 2017-01-02 08:00:00 2979.311 8 2 1 2017 17167 37.638695
34: 2017-01-02 09:00:00 2983.321 9 2 1 2017 17167 41.649113
35: 2017-01-02 10:00:00 2896.498 10 2 1 2017 17167 -45.174011
36: 2017-01-02 11:00:00 2966.731 11 2 1 2017 17167 25.059003
37: 2017-01-02 12:00:00 3027.436 12 2 1 2017 17167 85.764290
38: 2017-01-02 13:00:00 3062.598 13 2 1 2017 17167 120.926630
39: 2017-01-02 14:00:00 3159.810 14 2 1 2017 17167 218.138486
40: 2017-01-02 15:00:00 3145.530 15 2 1 2017 17167 203.858440
41: 2017-01-02 16:00:00 2984.756 16 2 1 2017 17168 0.000000
42: 2017-01-02 17:00:00 3210.481 17 2 1 2017 17168 225.724909
43: 2017-01-02 18:00:00 2733.484 18 2 1 2017 17168 -251.271959
44: 2017-01-02 19:00:00 3093.430 19 2 1 2017 17168 108.674494
45: 2017-01-02 20:00:00 2921.657 20 2 1 2017 17168 -63.098117
46: 2017-01-02 21:00:00 3198.335 21 2 1 2017 17168 213.579029
47: 2017-01-02 22:00:00 2945.484 22 2 1 2017 17168 -39.271663
48: 2017-01-02 23:00:00 3197.860 23 2 1 2017 17168 213.104247
The last 48 records:
> tail( df, n=48 )
date close hour day month year date_number delta
1: 18290 3170.775 1 30 1 2020 18290 201.47027428
2: 18290 3293.403 2 30 1 2020 18290 324.09870453
3: 18290 2940.591 3 30 1 2020 18290 -28.71382979
4: 18290 2922.411 4 30 1 2020 18290 -46.89312915
5: 18290 3237.419 5 30 1 2020 18290 268.11402422
6: 18290 2989.678 6 30 1 2020 18290 20.37332637
7: 18290 2932.777 7 30 1 2020 18290 -36.52746038
8: 18291 3188.269 8 30 1 2020 18290 218.96474627
9: 18291 3003.327 9 30 1 2020 18290 34.02206527
10: 18291 2969.222 10 30 1 2020 18290 -0.08292166
11: 18291 2848.911 11 30 1 2020 18290 -120.39313851
12: 18291 2892.804 12 30 1 2020 18290 -76.50054871
13: 18291 3064.894 13 30 1 2020 18290 95.58913403
14: 18291 3172.009 14 30 1 2020 18290 202.70445747
15: 18291 3373.631 15 30 1 2020 18290 404.32650780
16: 18291 3019.765 16 30 1 2020 18291 0.00000000
17: 18291 2748.688 17 30 1 2020 18291 -271.07660267
18: 18291 2718.065 18 30 1 2020 18291 -301.70056024
19: 18291 2817.891 19 30 1 2020 18291 -201.87390563
20: 18291 3086.820 20 30 1 2020 18291 67.05492016
21: 18291 2972.657 21 30 1 2020 18291 -47.10804222
22: 18291 3009.258 22 30 1 2020 18291 -10.50687269
23: 18291 2949.268 23 30 1 2020 18291 -70.49745611
24: 18291 3032.938 0 31 1 2020 18291 13.17296251
25: 18291 3267.187 1 31 1 2020 18291 247.42241735
26: 18291 2984.129 2 31 1 2020 18291 -35.63610546
27: 18291 3053.728 3 31 1 2020 18291 33.96259834
28: 18291 3290.451 4 31 1 2020 18291 270.68616991
29: 18291 2875.921 5 31 1 2020 18291 -143.84421823
30: 18291 3159.612 6 31 1 2020 18291 139.84677795
31: 18291 2798.017 7 31 1 2020 18291 -221.74778788
32: 18292 2833.522 8 31 1 2020 18291 -186.24270860
33: 18292 3184.870 9 31 1 2020 18291 165.10465470
34: 18292 3037.279 10 31 1 2020 18291 17.51427029
35: 18292 3260.309 11 31 1 2020 18291 240.54407728
36: 18292 3178.804 12 31 1 2020 18291 159.03915248
37: 18292 2905.164 13 31 1 2020 18291 -114.60150340
38: 18292 2928.120 14 31 1 2020 18291 -91.64555778
39: 18292 2975.566 15 31 1 2020 18291 -44.19924163
40: 18292 3060.792 16 31 1 2020 18292 0.00000000
41: 18292 2916.899 17 31 1 2020 18292 -143.89373840
42: 18292 3297.537 18 31 1 2020 18292 236.74429212
43: 18292 3208.996 19 31 1 2020 18292 148.20392802
44: 18292 2791.129 20 31 1 2020 18292 -269.66375428
45: 18292 2842.001 21 31 1 2020 18292 -218.79120834
46: 18292 2992.381 22 31 1 2020 18292 -68.41127630
47: 18292 3189.018 23 31 1 2020 18292 128.22565814
48: 18292 2962.099 0 1 2 2020 18292 -98.69355677
The average per hour:
> df[, .(meanPerHour = mean(delta)), .(hour) ]
hour meanPerHour
1: 0 3.5877077
2: 1 1.3695897
3: 2 0.1010658
4: 3 1.4441742
5: 4 -3.0837907
6: 5 -3.1353593
7: 6 11.3738058
8: 7 4.7171345
9: 8 5.0449846
10: 9 1.3226027
11: 10 -2.3716443
12: 11 1.4710920
13: 12 -4.8875706
14: 13 4.7203754
15: 14 2.3528875
16: 15 2.3075150
17: 16 0.0000000
18: 17 -2.1353366
19: 18 4.5127309
20: 19 5.2032461
21: 20 3.8043017
22: 21 3.7928297
23: 22 -3.9258290
24: 23 3.0638861
And in the end, a neat function:
average.by.hour.by.reference <- function(df, hrs = 16) {
  df <- as.data.table(df)
  # shift timestamps back by 'hrs' hours so the reference hour starts each grouping day
  df[, date_number := as.numeric(as.Date(ymd_h(sprintf("%d-%d-%dT%d", year, month, day, hour)) - hours(hrs)))]
  df[, delta := close - close[hour == hrs], .(date_number)]
  return(df[, .(meanPerHour = mean(delta, na.rm = TRUE)), .(hour)])
}
average.by.hour.by.reference( df, 16 ) # produces the above results
Ironically, you can get the same result, or close enough for most real applications, by not bothering with the date-wise grouping at all and just doing a global group by hour, subtracting the mean of whichever hour you want as the reference.
(But then we wouldn't get to show all this fancy code!)
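A minimal sketch of that simpler approach (assuming the same df as above, with hour 16 as the reference) might be:
# mean close per hour across the whole sample, then difference to the hour-16 mean
hourly <- df[, .(meanClose = mean(close)), .(hour)]
hourly[, meanPerHour := meanClose - meanClose[hour == 16]]
hourly[order(hour)]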
Using data.table you can try the following to get the difference between the baseline and the price from hour 16 onwards.
df <- data.frame(seq(ymd_h("2017-01-01-00"), ymd_h("2020-01-31-24"), by = "hours"))
set.seed(56789)
df$close <- rnorm(nrow(df), 3000, 150)
colnames(df) <- c("date", "close")
df$date <- as.POSIXct(df$date, format = "%Y-%m-%d %H:%M:%S")
df$hour <- hour(df$date)
df$day <- day(df$date)
df$month <- month(df$date)
df$year <- year(df$date)
library(data.table)
setDT(df)
df[,dummy:= ifelse(hour>=16,1,0), .(day,month,year)] #building temporary dummy variable
df[dummy==1, Difference:= close-close[1],.(day,month,year)] # computing only when dummy=1
df[1:24,] # first 24 rows
date close hour day month year dummy Difference
1: 2017-01-01 00:00:00 3159.493 0 1 1 2017 0 NA
2: 2017-01-01 01:00:00 3029.092 1 1 1 2017 0 NA
3: 2017-01-01 02:00:00 2944.042 2 1 1 2017 0 NA
4: 2017-01-01 03:00:00 3234.751 3 1 1 2017 0 NA
5: 2017-01-01 04:00:00 2900.514 4 1 1 2017 0 NA
6: 2017-01-01 05:00:00 2733.769 5 1 1 2017 0 NA
7: 2017-01-01 06:00:00 3101.770 6 1 1 2017 0 NA
8: 2017-01-01 07:00:00 2981.632 7 1 1 2017 0 NA
9: 2017-01-01 08:00:00 2913.672 8 1 1 2017 0 NA
10: 2017-01-01 09:00:00 2876.495 9 1 1 2017 0 NA
11: 2017-01-01 10:00:00 3025.853 10 1 1 2017 0 NA
12: 2017-01-01 11:00:00 3135.209 11 1 1 2017 0 NA
13: 2017-01-01 12:00:00 3038.329 12 1 1 2017 0 NA
14: 2017-01-01 13:00:00 3227.153 13 1 1 2017 0 NA
15: 2017-01-01 14:00:00 3069.497 14 1 1 2017 0 NA
16: 2017-01-01 15:00:00 2988.749 15 1 1 2017 0 NA
17: 2017-01-01 16:00:00 2920.402 16 1 1 2017 1 0.00000
18: 2017-01-01 17:00:00 2756.129 17 1 1 2017 1 -164.27264
19: 2017-01-01 18:00:00 2945.021 18 1 1 2017 1 24.61939
20: 2017-01-01 19:00:00 3078.004 19 1 1 2017 1 157.60205
21: 2017-01-01 20:00:00 3239.770 20 1 1 2017 1 319.36791
22: 2017-01-01 21:00:00 3045.156 21 1 1 2017 1 124.75450
23: 2017-01-01 22:00:00 2793.858 22 1 1 2017 1 -126.54371
24: 2017-01-01 23:00:00 3054.496 23 1 1 2017 1 134.09401
date close hour day month year dummy Difference
Then to compute the daily average you need:
df[dummy==1, .(Average= mean(Difference)), .(day, month, year)]
day month year Average
1: 1 1 2017 58.70269
2: 2 1 2017 80.47927
3: 3 1 2017 -103.96512
4: 4 1 2017 -26.52648
5: 5 1 2017 112.79842
---
1122: 27 1 2020 -37.89037
1123: 28 1 2020 107.96715
1124: 29 1 2020 222.18109
1125: 30 1 2020 236.18325
1126: 31 1 2020 107.96395
To take the hourly mean you have several possibilities:
df[dummy==1, .(Average= mean(Difference)), .(hour)] # average across all time periods; this can be thought of as the hourly mean
hour Average
1: 16 0.0000000
2: 17 -13.6811620
3: 18 0.9756538
4: 19 1.0668213
5: 20 -2.9194445
6: 21 -4.1216115
7: 22 -8.7311824
8: 23 5.6657656
df[dummy==1, .(Average= mean(Difference)), .(hour,day)] # hourly average per day of the month
hour day Average
1: 16 1 0.000000
2: 17 1 -7.226656
3: 18 1 13.162067
4: 19 1 -59.917710
5: 20 1 1.941420
---
244: 19 31 -31.069330
245: 20 31 -80.659022
246: 21 31 -14.458324
247: 22 31 -56.760001
248: 23 31 -98.356176
df[dummy==1, .(Average= mean(Difference)), .(hour,month)] # hourly average per month
hour month Average
1: 16 1 0.000000000
2: 17 1 -4.618350490
3: 18 1 40.095826732
4: 19 1 51.049164347
5: 20 1 47.760496506
6: 21 1 28.985260025
7: 22 1 21.453695738
8: 23 1 43.921050387
9: 16 2 0.000000000
10: 17 2 24.000082289
11: 18 2 2.371547684
12: 19 2 3.065889216
13: 20 2 30.568486748
14: 21 2 -7.283307589
15: 22 2 4.123056028
16: 23 2 16.827384126
17: 16 3 0.000000000
18: 17 3 -16.011701993
19: 18 3 6.322605325
20: 19 3 -29.855560457
21: 20 3 -13.706427976
22: 21 3 -4.131364097
23: 22 3 -25.854584963
24: 23 3 -18.667824140
25: 16 4 0.000000000
26: 17 4 -20.303835780
27: 18 4 5.908122132
28: 19 4 -8.934949281
29: 20 4 -21.563964556
30: 21 4 -26.050153530
31: 22 4 -16.182759246
32: 23 4 -0.367104020
33: 16 5 0.000000000
34: 17 5 -83.744224359
35: 18 5 -44.324985588
36: 19 5 -13.327785591
37: 20 5 -14.258074789
38: 21 5 -36.776426101
39: 22 5 -40.702102505
40: 23 5 -26.994831954
41: 16 6 0.000000000
42: 17 6 10.047707916
43: 18 6 3.580200953
44: 19 6 8.229738674
45: 20 6 2.976396675
46: 21 6 14.575098983
47: 22 6 12.378672353
48: 23 6 4.663891884
49: 16 7 0.000000000
50: 17 7 19.338362910
51: 18 7 31.278370567
52: 19 7 12.295521900
53: 20 7 -36.728712097
54: 21 7 25.194723060
55: 22 7 -24.817961383
56: 23 7 -6.270365221
57: 16 8 0.000000000
58: 17 8 13.125994953
59: 18 8 15.364473667
60: 19 8 29.268466966
61: 20 8 44.668839826
62: 21 8 14.083177674
63: 22 8 17.876126102
64: 23 8 50.563302963
65: 16 9 0.000000000
66: 17 9 -55.277687661
67: 18 9 -5.648068231
68: 19 9 12.181088927
69: 20 9 -42.631881383
70: 21 9 -39.224046003
71: 22 9 -24.291235470
72: 23 9 3.112446527
73: 16 10 0.000000000
74: 17 10 9.087632052
75: 18 10 -12.014161643
76: 19 10 -10.884415174
77: 20 10 18.022160926
78: 21 10 31.348117569
79: 22 10 29.875655193
80: 23 10 28.086021752
81: 16 11 0.000000000
82: 17 11 -25.057459470
83: 18 11 0.745030363
84: 19 11 -23.835528943
85: 20 11 -22.762853780
86: 21 11 -0.005295847
87: 22 11 -37.868714610
88: 23 11 -13.091041985
89: 16 12 0.000000000
90: 17 12 -35.291817797
91: 18 12 -44.854066421
92: 19 12 -33.453450088
93: 20 12 -43.362749669
94: 21 12 -62.620521565
95: 22 12 -30.582971909
96: 23 12 -26.379698528
hour month Average
> df[dummy==1, .(Average= mean(Difference)), .(hour,year)] # hourly average per year
hour year Average
1: 16 2017 0.00000000
2: 17 2017 0.01183124
3: 18 2017 -4.00877399
4: 19 2017 7.94893418
5: 20 2017 5.78072996
6: 21 2017 -4.38927559
7: 22 2017 -4.32599586
8: 23 2017 10.48530717
9: 16 2018 0.00000000
10: 17 2018 -32.52958909
11: 18 2018 -10.05792694
12: 19 2018 -11.98513416
13: 20 2018 -19.05685234
14: 21 2018 -7.55054075
15: 22 2018 -19.68501405
16: 23 2018 -6.70448412
17: 16 2019 0.00000000
18: 17 2019 -8.12025319
19: 18 2019 13.66533695
20: 19 2019 5.00197941
21: 20 2019 -2.37632221
22: 21 2019 -2.06337033
23: 22 2019 -4.47205960
24: 23 2019 11.88583864
25: 16 2020 0.00000000
26: 17 2020 -18.45530363
27: 18 2020 40.16399935
28: 19 2020 27.37843018
29: 20 2020 78.25315556
30: 21 2020 15.16866359
31: 22 2020 18.22609517
32: 23 2020 21.33292148
hour year Average
df[dummy==1, .(Average= mean(Difference)), .(hour,day,month)] # hourly average per day and month, and so on
hour day month Average
1: 16 1 1 0.000000
2: 17 1 1 -121.842677
3: 18 1 1 -58.055247
4: 19 1 1 -116.444000
5: 20 1 1 5.414297
---
2916: 19 31 12 -162.743923
2917: 20 31 12 -60.029392
2918: 21 31 12 -289.992006
2919: 22 31 12 -26.354495
2920: 23 31 12 -171.848433

Aggregate Data based on Two Different Assessment Methods in R

I'm looking to aggregate some pedometer data, gathered in steps per minute, so that I get the summed number of steps up until each EMA assessment. The EMA assessments happened four times per day. Examples of the two data sets are:
Pedometer Data
ID Steps Time
1 15 2/4/2020 8:32
1 23 2/4/2020 8:33
1 76 2/4/2020 8:34
1 32 2/4/2020 8:35
1 45 2/4/2020 8:36
...
2 16 2/4/2020 8:32
2 17 2/4/2020 8:33
2 0 2/4/2020 8:34
2 5 2/4/2020 8:35
2 8 2/4/2020 8:36
EMA Data
ID Time X Y
1 2/4/2020 8:36 3 4
1 2/4/2020 12:01 3 5
1 2/4/2020 3:30 4 5
1 2/4/2020 6:45 7 8
...
2 2/4/2020 8:35 4 6
2 2/4/2020 12:05 5 7
2 2/4/2020 3:39 1 3
2 2/4/2020 6:55 8 3
I'm looking to add the pedometer data to the EMA data as a new variable, where the number of steps taken is summed until the next EMA assessment. Ideally it would look something like:
Combined Data
ID Time X Y Steps
1 2/4/2020 8:36 3 4 191
1 2/4/2020 12:01 3 5 [Sum of steps taken from 8:37 until 12:01 on 2/4/2020]
1 2/4/2020 3:30 4 5 [Sum of steps taken from 12:02 until 3:30 on 2/4/2020]
1 2/4/2020 6:45 7 8 [Sum of steps taken from 3:31 until 6:45 on 2/4/2020]
...
2 2/4/2020 8:35 4 6 38
2 2/4/2020 12:05 5 7 [Sum of steps taken from 8:36 until 12:05 on 2/4/2020]
2 2/4/2020 3:39 1 3 [Sum of steps taken from 12:06 until 3:39 on 2/4/2020]
2 2/4/2020 6:55 8 3 [Sum of steps taken from 3:40 until 6:55 on 2/4/2020]
I then need the process to continue over the entire 21 day EMA period, so the same process for the 4 EMA assessment time points on 2/5/2020, 2/6/2020, etc.
This has pushed me to the limit of my R skills, so any pointers would be extremely helpful! I'm most familiar with the tidyverse but am comfortable using base R as well. Thanks in advance for any advice.
Here's a solution using rolling joins from data.table. The basic idea is to roll each time from the pedometer data up to the next time in the EMA data (while still matching on ID). Once the next EMA time is found, all that's left is to isolate the X and Y values and sum up Steps.
Data creation and prep:
library(data.table)
pedometer <- data.table(ID = sort(rep(1:2, 500)),
                        Time = rep(seq.POSIXt(as.POSIXct("2020-02-04 09:35:00 EST"),
                                              as.POSIXct("2020-02-08 17:00:00 EST"), length.out = 500), 2),
                        Steps = rpois(1000, 25))
EMA <- data.table(ID = sort(rep(1:2, 4*5)),
                  Time = rep(seq.POSIXt(as.POSIXct("2020-02-04 05:00:00 EST"),
                                        as.POSIXct("2020-02-08 23:59:59 EST"), by = '6 hours'), 2),
                  X = sample(1:8, 2*4*5, rep = T),
                  Y = sample(1:8, 2*4*5, rep = T))
setkey(pedometer, Time)
setkey(EMA, Time)
EMA[, next_ema_time := Time]  # copy of the EMA time that survives the rolling join
And now the actual join and summation:
joined <- EMA[pedometer,
              on = .(ID, Time),
              roll = -Inf,
              j = .(ID, Time, Steps, next_ema_time, X, Y)]
result <- joined[, .('X' = min(X),
                     'Y' = min(Y),
                     'Steps' = sum(Steps)),
                 .(ID, next_ema_time)]
result
#> ID next_ema_time X Y Steps
#> 1: 1 2020-02-04 11:00:00 1 2 167
#> 2: 2 2020-02-04 11:00:00 8 5 169
#> 3: 1 2020-02-04 17:00:00 3 6 740
#> 4: 2 2020-02-04 17:00:00 4 6 747
#> 5: 1 2020-02-04 23:00:00 2 2 679
#> 6: 2 2020-02-04 23:00:00 3 2 732
#> 7: 1 2020-02-05 05:00:00 7 5 720
#> 8: 2 2020-02-05 05:00:00 6 8 692
#> 9: 1 2020-02-05 11:00:00 2 4 731
#> 10: 2 2020-02-05 11:00:00 4 5 773
#> 11: 1 2020-02-05 17:00:00 1 5 757
#> 12: 2 2020-02-05 17:00:00 3 5 743
#> 13: 1 2020-02-05 23:00:00 3 8 693
#> 14: 2 2020-02-05 23:00:00 1 8 740
#> 15: 1 2020-02-06 05:00:00 8 8 710
#> 16: 2 2020-02-06 05:00:00 3 2 760
#> 17: 1 2020-02-06 11:00:00 8 4 716
#> 18: 2 2020-02-06 11:00:00 1 2 688
#> 19: 1 2020-02-06 17:00:00 5 2 738
#> 20: 2 2020-02-06 17:00:00 4 6 724
#> 21: 1 2020-02-06 23:00:00 7 8 737
#> 22: 2 2020-02-06 23:00:00 6 3 672
#> 23: 1 2020-02-07 05:00:00 2 6 726
#> 24: 2 2020-02-07 05:00:00 7 7 759
#> 25: 1 2020-02-07 11:00:00 1 4 737
#> 26: 2 2020-02-07 11:00:00 5 2 737
#> 27: 1 2020-02-07 17:00:00 3 5 766
#> 28: 2 2020-02-07 17:00:00 4 4 745
#> 29: 1 2020-02-07 23:00:00 3 3 714
#> 30: 2 2020-02-07 23:00:00 2 1 741
#> 31: 1 2020-02-08 05:00:00 4 6 751
#> 32: 2 2020-02-08 05:00:00 8 2 723
#> 33: 1 2020-02-08 11:00:00 3 3 716
#> 34: 2 2020-02-08 11:00:00 3 6 735
#> 35: 1 2020-02-08 17:00:00 1 5 696
#> 36: 2 2020-02-08 17:00:00 7 7 741
#> ID next_ema_time X Y Steps
Created on 2020-02-04 by the reprex package (v0.3.0)
I would left_join ema_df onto pedometer_df by ID and Time. This way you get all rows of pedometer_df, with missing values for X and Y (which I assume are identifiers) whenever the row is not at an EMA assessment time. I then fill those values upwards with the next available values (i.e. the X and Y of the next EMA assessment), and finally group by ID, X and Y and summarise to keep the datetime of the assessment (the max) and the sum of steps.
library(dplyr)
library(tidyr)
pedometer_df %>%
  left_join(ema_df, by = c("ID", "Time")) %>%
  fill(X, Y, .direction = "up") %>%
  group_by(ID, X, Y) %>%
  summarise(
    Time = max(Time),
    Steps = sum(Steps)
  )

How can I label my data points according to defined ranges?

I have a dataframe df and interval points which are saved in a vector pts. Now I want to label my data according to these intervals. I tried using the cut() function, but I always get the error that 'x' is not numeric, even though I converted it to numeric.
My dataframe df
date amount
1 2012-07-01 2.3498695
2 2012-08-01 0.6984866
3 2012-09-01 0.9079118
4 2012-10-01 2.8858218
5 2012-11-01 1.2406948
6 2012-12-01 2.3140496
7 2013-01-01 1.5904573
8 2013-02-01 3.2531825
9 2013-03-01 4.2962963
10 2013-04-01 3.3287101
11 2013-05-01 3.7698413
12 2013-06-01 1.4376997
13 2013-07-01 5.0687285
14 2013-08-01 4.4520548
15 2013-09-01 5.5063913
16 2013-10-01 5.5676856
17 2013-11-01 6.2686567
18 2013-12-01 11.021069
My interval points, stored in the Min column of pts:
pts$Min
[1] 3 6 11
My new dataframe should look like this:
date amount IntervalRange
1 2012-07-01 2.3498695 1
2 2012-08-01 0.6984866 1
3 2012-09-01 0.9079118 1
4 2012-10-01 2.8858218 2
5 2012-11-01 1.2406948 2
6 2012-12-01 2.3140496 2
7 2013-01-01 1.5904573 3
8 2013-02-01 3.2531825 3
9 2013-03-01 4.2962963 3
10 2013-04-01 3.3287101 3
11 2013-05-01 3.7698413 3
12 2013-06-01 1.4376997 4
13 2013-07-01 5.0687285 4
14 2013-08-01 4.4520548 4
15 2013-09-01 5.5063913 4
16 2013-10-01 5.5676856 4
17 2013-11-01 6.2686567 4
18 2013-12-01 11.021069 4
So, I tried this:
df_cut <- data.frame(as.numeric(df$date),
                     "IntervalRange" = cut(df, breaks = pts$Min))
Which results in this error message:
Error in cut.default(df, breaks = pts$Min) : 'x' must be numeric
My questions now are:
Why do I get this error message? I already converted the data to numeric...
Can I achieve my desired output using the cut() and findInterval() functions, also with other datasets and other interval points?
You are missing the value (i.e. the column) in the cut function. Your command should be:
data.frame(as.numeric(df$date), "IntervalRange" = cut(df$amount, breaks=pts$Min))
Hope this helps!
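Note that cut(df$amount, breaks = pts$Min) bins by the amount values, while the desired output above groups rows by their position (rows 1-3, 4-6, 7-11, 12-18). If that is the intent, a sketch using findInterval() on the row index (assuming pts$Min holds the last row index of each interval) would be:
# rows 1..3 -> 1, 4..6 -> 2, 7..11 -> 3, 12.. -> 4
df$IntervalRange <- findInterval(seq_len(nrow(df)), pts$Min + 1) + 1
# equivalently with cut(), using labels = FALSE to get integer codes
df$IntervalRange <- cut(seq_len(nrow(df)), breaks = c(0, pts$Min, Inf), labels = FALSE)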

Matching Bloomberg Data in R

Working with the Rblpapi package, I receive a list of multiple data frames when requesting securities (one data frame per security requested).
My problem is the following. Let's say:
I request daily data for A and B from 01.10.2016 to 31.10.2016.
Some data for A is missing during that time while B has it,
and some data for B is missing when A has it.
So basically:
list$A
date PX_LAST
1 2016-10-03 216.704
2 2016-10-04 217.245
3 2016-10-05 216.887
4 2016-10-06 217.164
5 2016-10-10 217.504
6 2016-10-11 217.022
7 2016-10-12 217.326
8 2016-10-13 216.219
9 2016-10-14 217.275
10 2016-10-17 216.751
11 2016-10-18 218.812
12 2016-10-19 219.682
13 2016-10-20 220.189
14 2016-10-21 220.930
15 2016-10-25 221.179
16 2016-10-26 219.840
17 2016-10-27 219.158
18 2016-10-31 217.820
list$B
date PX_LAST
1 2016-10-03 1722.82
2 2016-10-04 1717.82
3 2016-10-05 1721.14
4 2016-10-06 1718.40
5 2016-10-07 1712.40
6 2016-10-11 1700.33
7 2016-10-12 1695.54
8 2016-10-13 1689.62
9 2016-10-14 1693.71
10 2016-10-17 1687.84
11 2016-10-18 1701.10
12 2016-10-19 1706.74
13 2016-10-21 1701.16
14 2016-10-24 1706.24
15 2016-10-25 1701.20
16 2016-10-26 1699.92
17 2016-10-27 1694.66
18 2016-10-28 1690.96
19 2016-10-31 1690.92
As you can see, they have a different number of observations and the dates are not equal either. For example, the 5th observation for A is on 2016-10-10 and for B on 2016-10-07.
So what I need is a way to combine both data frames. My idea was a full date range (every day) to which I add the PX_LAST values of A and B for the corresponding dates. After that I could delete empty rows.
Sorry for the bad formatting, this is my first post here.
Thanks in advance.
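A minimal sketch of the idea described above, assuming the result is the list shown (here called px_list to avoid masking base R's list()), with a date and a PX_LAST column in each element:
# every calendar day in the request window
all_dates <- data.frame(date = seq(as.Date("2016-10-01"), as.Date("2016-10-31"), by = "day"))
combined <- merge(all_dates, px_list$A, by = "date", all.x = TRUE)
combined <- merge(combined, px_list$B, by = "date", all.x = TRUE, suffixes = c(".A", ".B"))
# drop days where neither security has a value
combined <- combined[rowSums(!is.na(combined[, -1])) > 0, ]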

insert new rows to the time series data, with date added automatically

I have a time-series data frame looks like:
TS.1
2015-09-01 361656.7
2015-09-02 370086.4
2015-09-03 346571.2
2015-09-04 316616.9
2015-09-05 342271.8
2015-09-06 361548.2
2015-09-07 342609.2
2015-09-08 281868.8
2015-09-09 297011.1
2015-09-10 295160.5
2015-09-11 287926.9
2015-09-12 323365.8
Now, what I want to do is add some new data points (rows) to the existing data frame, say,
320123.5
323521.7
How can I add the corresponding date to each row? The dates should simply continue sequentially from the last row.
Is there any package that can do this automatically, so that the only thing I have to do is insert the new data points?
Here's some play data:
df <- data.frame(date = seq(as.Date("2015-01-01"), as.Date("2015-01-31"), "days"), x = seq(31))
new.x <- c(32, 33)
This adds the extra observations along with the proper sequence of dates:
new.df <- data.frame(date=seq(max(df$date) + 1, max(df$date) + length(new.x), "days"), x=new.x)
Then just rbind them to get your expanded data frame:
rbind(df, new.df)
date x
1 2015-01-01 1
2 2015-01-02 2
3 2015-01-03 3
4 2015-01-04 4
5 2015-01-05 5
6 2015-01-06 6
7 2015-01-07 7
8 2015-01-08 8
9 2015-01-09 9
10 2015-01-10 10
11 2015-01-11 11
12 2015-01-12 12
13 2015-01-13 13
14 2015-01-14 14
15 2015-01-15 15
16 2015-01-16 16
17 2015-01-17 17
18 2015-01-18 18
19 2015-01-19 19
20 2015-01-20 20
21 2015-01-21 21
22 2015-01-22 22
23 2015-01-23 23
24 2015-01-24 24
25 2015-01-25 25
26 2015-01-26 26
27 2015-01-27 27
28 2015-01-28 28
29 2015-01-29 29
30 2015-01-30 30
31 2015-01-31 31
32 2015-02-01 32
33 2015-02-02 33
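If you do this often, a small helper (a hypothetical wrapper around the steps above, not an existing package function) could be:
# append new observations, extending the date sequence by one day per value
append_obs <- function(df, new.x) {
  new.dates <- seq(max(df$date) + 1, by = "days", length.out = length(new.x))
  rbind(df, data.frame(date = new.dates, x = new.x))
}
append_obs(df, c(32, 33))  # same result as the rbind above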
