Idea for a runnin/rolling median in R - r

I am new here and I would like to get some help.
I have a dataset with a datetime column and a certain value assignt to it
|datetime |value|
2020-06-15 10:30:00| 3 |
2020-06:15 10:31:00| 1 |
and I need a way to calculate for each minute x, the median of (value in x-5; value in x+5)
any ideas?

Assuming DF shown in the Note at the end use either of these (only the second one if you have NA's in your data).
library(zoo)
transform(DF, median = rollmedian(value, 11, fill = NA))
transform(DF, median = rollapply(value, 11, median, fill = NA))
giving:
datetime value median
1 2020-06-15 10:30:00 1 NA
2 2020-06-15 10:31:00 2 NA
3 2020-06-15 10:32:00 3 NA
4 2020-06-15 10:33:00 4 NA
5 2020-06-15 10:34:00 5 NA
6 2020-06-15 10:35:00 6 6
7 2020-06-15 10:36:00 7 7
8 2020-06-15 10:37:00 8 8
9 2020-06-15 10:38:00 9 9
10 2020-06-15 10:39:00 10 10
11 2020-06-15 10:40:00 11 NA
12 2020-06-15 10:41:00 12 NA
13 2020-06-15 10:42:00 13 NA
14 2020-06-15 10:43:00 14 NA
15 2020-06-15 10:44:00 15 NA
Note
DF <- data.frame(
datetime = seq(as.POSIXct("2020-06-15 10:30:00"), length = 15, by = "min"),
value = 1:15)

Related

Multivariate imputing missing values in weather data

I need to get a weather dataset ready as input to keras. I have 1096 entries over 3 years of daily data of which first month is missing. I got one of the columns filled in for temperature from a nearby weather station. However, to check which imputation fits best, I deleted these 30 values and kept all columns as NA for first month. Then, I tried various imputing packages including 1. Mice - gave continuous values but too high average; 2. KNN (VIM) gave a constant value too high 3.MissForest - gave constant value too high; 4. imputeTS_interpolation - gave continuous value slightly low; 5. imputeTS_seasonal - gave constant value slight low.
Therefore, I selected imputeTS_interpolation. And used this to now impute the remaining columns after filling the temperature column with actual values. However, I cannot seem to get the seasonality in imputeTS working.
Any idea why? Please find below the data file and code used:
Code:
# #MICE did not match with historical data too high avg of 10.2
# impute <- mice(gf, method = "pmm")
# print(impute)
# xyplot(impute, Temp ~ Reco | .imp, pch = 20, cex =1.4)
# mf <- complete(impute, 3)
# mf <- cbind(mf, Date = df$Date.Time)
# write.csv(mf, "Mice_imputed.csv", row.names=TRUE)
# View(mf)
#
# View(gf_impo)
# ##Using Miss Forest ;) too high contstant 9.3
# gf_impo <- missForest(gf, maxiter = 100, ntree = 500)
# gf_impo$ximp
# gf_impo <- cbind(gf_impo, Date = df$Date.Time)
# write.csv(gf_impo$ximp, "Val_MissForest_imputed.csv", row.names=TRUE)
# class(gf_impo)
##KNN using VIM too high constant 13
imp_knn <- kNN(gf, k = 500)
aggr(imp_knn, delimiter = "imp")
View(imp_knn)
imp_knn <- cbind(imp_knn, Date = df$Date.Time)
write.csv(imp_knn, "Val_KNN_imputed.csv", row.names=TRUE)
View(imp_seas)
#imputeTS
#for seasonal imputation
imp_seas <- gf
imp_seas <- cbind(imp_seas, Date = df$Date.Time)
View(imp_seas)
View(imp_TS_intn)
imp_TS_intn <- na_interpolation(imp_seas, option = "spline") #avg of 2.83 close to real 4.1
# imp_TS_seas <- na_seasplit(imp_seas, algorithm = "interpolation", find_frequency = FALSE, maxgap = Inf)
#const 2.7
write.csv(imp_TS_intn, "ML_impTS_interpolate.csv", row.names=TRUE)
DATA:
A B C D E Date
1 NA NA NA 5.4000000 NA 2018-01-01
2 NA NA NA 5.7500000 NA 2018-01-02
3 NA NA NA 6.8000000 NA 2018-01-03
4 NA NA NA 6.3500000 NA 2018-01-04
5 NA NA NA 3.3500000 NA 2018-01-05
6 NA NA NA 3.0500000 NA 2018-01-06
7 NA NA NA 2.2000000 NA 2018-01-07
8 NA NA NA 0.6500000 NA 2018-01-08
9 NA NA NA 2.8500000 NA 2018-01-09
10 NA NA NA 2.2000000 NA 2018-01-10
11 NA NA NA 2.3500000 NA 2018-01-11
12 NA NA NA 5.1000000 NA 2018-01-12
13 NA NA NA 6.5500000 NA 2018-01-13
14 NA NA NA 5.0000000 NA 2018-01-14
15 NA NA NA 5.7500000 NA 2018-01-15
16 NA NA NA 2.0000000 NA 2018-01-16
17 NA NA NA 5.0500000 NA 2018-01-17
18 NA NA NA 3.8500000 NA 2018-01-18
19 NA NA NA 2.4500000 NA 2018-01-19
20 NA NA NA 5.1500000 NA 2018-01-20
21 NA NA NA 6.7500000 NA 2018-01-21
22 NA NA NA 9.2500000 NA 2018-01-22
23 NA NA NA 9.5000000 NA 2018-01-23
24 NA NA NA 6.4500000 NA 2018-01-24
25 NA NA NA 5.4000000 NA 2018-01-25
26 NA NA NA 5.3500000 NA 2018-01-26
27 NA NA NA 6.5500000 NA 2018-01-27
28 NA NA NA 10.1000000 NA 2018-01-28
29 NA NA NA 6.6000000 NA 2018-01-29
30 NA NA NA 3.8500000 NA 2018-01-30
31 NA NA NA 2.9000000 NA 2018-01-31
32 0.05374951 0.041144312 0.0023696211 5.9902083 0.068784302 2018-02-01
33 0.07565470 0.012326176 0.0057481689 10.5280417 0.176209125 2018-02-02
34 0.04476314 0.113718139 0.0089845444 12.8125000 0.176408788 2018-02-03
35 0.01695546 0.060965133 -0.0034163682 16.9593750 0.000000000 2018-02-04
36 0.09910202 0.090170142 -0.0111946461 10.4867292 0.088337951 2018-02-05
37 0.08514839 0.026061013 -0.0029183210 7.1662500 0.085590326 2018-02-06
38 0.06724108 0.104761909 -0.0416036605 6.9130417 0.134828348 2018-02-07
39 0.07638534 0.097570813 -0.0192784571 3.3840000 0.029682717 2018-02-08
40 0.02568162 0.008244304 -0.0288903610 12.0282292 0.055817103 2018-02-09
41 0.02752688 0.088544666 -0.0172136911 6.8694792 0.098169954 2018-02-10
42 0.06643098 0.063321337 -0.0347752292 7.4539792 0.034110652 2018-02-11
43 0.09743445 0.057502178 0.0162851223 13.9365208 0.264168082 2018-02-12
44 0.09189575 0.034429904 0.0020940613 13.8687292 0.162341764 2018-02-13
45 0.07857244 0.009406862 0.0075904680 11.7800000 0.101283946 2018-02-14
46 0.01987263 0.024783795 -0.0088742973 4.4463750 0.063949011 2018-02-15
47 0.02332892 0.010138857 0.0091396448 5.6452292 0.034708981 2018-02-16
48 0.02022396 0.014207518 0.0036018714 14.2862500 0.043205299 2018-02-17
49 0.07043020 0.075317793 0.0036760070 5.5940208 0.171898590 2018-02-18
50 0.02120779 0.010461857 -0.0277470177 13.6131250 0.061486533 2018-02-19
51 0.06405819 0.034185344 0.0173606568 7.0551042 0.052148976 2018-02-20
52 0.09428869 0.026957653 0.0016863903 6.7955000 0.085888435 2018-02-21
53 0.04248937 0.048782786 0.0004039921 17.5706250 0.000000000 2018-02-22
54 0.02076763 0.038094949 -0.0003671638 14.8379167 0.000000000 2018-02-23
55 0.01343260 0.118003726 -0.0214988345 6.4564583 0.053353606 2018-02-24
56 0.05231647 0.054454132 -0.0098012290 7.8568333 0.183326943 2018-02-25
57 0.02476706 0.087501472 0.0031839472 15.7493750 0.210616272 2018-02-26
58 0.07358998 0.023558218 0.0031618607 10.8001250 0.241602571 2018-02-27
59 0.02042573 0.009268439 0.0088051496 7.2967500 0.251608940 2018-02-28
60 0.02107772 0.083567750 -0.0037223644 6.2674375 0.062221630 2018-03-01
61 0.05830801 0.029456683 0.0114978078 13.0810417 0.193765948 2018-03-02
62 0.02923587 0.070533843 0.0068299668 14.4095833 0.244310193 2018-03-03
63 0.02570283 0.058270093 0.0137174366 3.8527917 0.120846709 2018-03-04
64 0.01434395 0.014637405 0.0051951050 9.6877083 0.112579011 2018-03-05
65 0.06426214 0.078872579 0.0068664343 4.6763750 0.000000000 2018-03-06
66 0.04782772 0.011762501 0.0086182870 12.7027083 0.129606106 2018-03-07
67 0.01809136 0.105398844 0.0231671305 10.8052083 0.017683908 2018-03-08
68 0.04427582 0.020397435 -0.0009758693 6.5983333 0.041148864 2018-03-09
69 0.05123687 0.115984361 -0.0372104856 6.5021250 0.180013174 2018-03-10
70 0.01913266 0.005981014 -0.0159701842 8.9844375 0.095262921 2018-03-11
71 0.04407234 0.009142247 -0.0031640496 7.7638333 0.000000000 2018-03-12
72 0.09108709 0.038174205 0.0005654564 5.3772083 0.044105747 2018-03-13
73 0.05488394 0.115153937 0.0192819858 8.9182917 0.039993864 2018-03-14
74 0.03726892 0.067983475 -0.0311367032 2.4423333 0.066108171 2018-03-15
75 0.05563102 0.003831231 -0.0011148743 10.7100000 0.217461791 2018-03-16
76 0.04922930 0.055446609 0.0075246331 5.0829375 0.149530704 2018-03-17
77 0.02972858 0.061966039 -0.0392014211 12.3645625 0.060670492 2018-03-18
78 0.02812688 0.018183092 0.0134514770 9.0172292 0.158435250 2018-03-19
79 0.03066101 0.007622504 -0.0249482114 6.2709792 0.118487919 2018-03-20
80 0.06801767 0.083261012 0.0133423296 13.3683333 0.196053774 2018-03-21
81 0.04178157 0.093600914 0.0116253865 10.0024167 0.020835522 2018-03-22
82 0.04725052 0.018187748 -0.0115718535 10.3528333 0.097352796 2018-03-23
83 0.02042339 0.081504844 -0.0380958738 17.2006250 0.010500742 2018-03-24
84 0.06674396 0.098739090 -0.0108474961 17.5437500 0.119415595 2018-03-25
85 0.07049507 0.016286614 -0.0007817195 16.8800000 0.060452087 2018-03-26
86 0.01244906 0.018100693 -0.0266155999 8.8651458 0.018144668 2018-03-27
87 0.05271711 0.015368632 -0.0477885811 7.2415417 0.092797451 2018-03-28
88 0.01610886 0.014919094 0.0023487944 7.7914792 0.062818728 2018-03-29
89 0.08847253 0.059397043 0.0130362880 10.9732708 0.087451484 2018-03-30
90 0.02938725 0.044473745 0.0091253257 6.0241458 0.025488946 2018-03-31
91 0.08599249 0.043160908 0.0082536160 8.8211875 0.012975783 2018-04-01
92 0.05747667 0.017709243 -0.0090965038 6.3249375 0.065731818 2018-04-02
93 0.05772051 0.085210524 -0.0013533831 13.4166667 0.067790160 2018-04-03
94 0.01699834 0.020657341 0.0039885065 3.2999792 0.076302652 2018-04-04
95 0.03565076 0.110372607 -0.0313309140 12.7822083 0.184844707 2018-04-05
96 0.02050401 0.078943608 -0.0062322339 4.3233125 0.067820413 2018-04-06
97 0.06186790 0.013147512 0.0203249289 6.3953750 0.034104318 2018-04-07
98 0.06304988 0.012997642 0.0061171825 9.7322708 0.021220516 2018-04-08
99 0.03799006 0.012420760 0.0054724563 8.8472083 0.068664033 2018-04-09
100 0.01610225 0.061182804 0.0031002885 7.5622708 0.085766429 2018-04-10
101 0.05937683 0.008333173 -0.0053972689 7.8848542 0.058386726 2018-04-11
102 0.02190115 0.037843227 0.0089823372 8.3339792 0.055761391 2018-04-12
103 0.01179665 0.016899394 -0.0016533437 5.5101667 0.099133313 2018-04-13
104 0.02464707 0.021231270 -0.0212016846 15.5106250 0.126661378 2018-04-14
105 0.01906818 0.065273389 0.0081694393 7.6616667 0.032939519 2018-04-15
106 0.05418785 0.074619385 -0.0355680586 11.3618750 0.057768261 2018-04-16
107 0.06508988 0.014345229 0.0080423912 14.7137500 0.032709791 2018-04-17
108 0.06101126 0.060624597 -0.0399526978 17.2754167 0.230982139 2018-04-18
109 0.02226268 0.010230837 0.0001617419 2.9382083 0.000000000 2018-04-19
110 0.03884772 0.014218453 0.0039652960 10.7261875 0.179962834 2018-04-20
111 0.09054488 0.025711098 -0.0115944362 4.4734583 0.011442318 2018-04-21
112 0.03072171 0.076530730 0.0032123501 9.4128750 0.033174489 2018-04-22
113 0.04361276 0.101151670 0.0249408843 14.5804167 0.024238883 2018-04-23
114 0.03877568 0.049142846 0.0080689866 8.3168750 0.084570611 2018-04-24
115 0.05564027 0.076917047 0.0033447160 15.7308333 0.199762524 2018-04-25
116 0.04752544 0.019655228 -0.0063218138 15.7302083 0.020449908 2018-04-26
117 0.01718916 0.026132806 -0.0261027525 10.0887500 0.128898351 2018-04-27
118 0.04144832 0.034526516 0.0117868820 6.0784375 0.014449565 2018-04-28
119 0.03255833 0.113650910 -0.0123724759 11.8654167 0.085410171 2018-04-29
120 0.03656535 0.043333607 0.0230071368 7.0974167 0.035725321 2018-04-30
121 0.04570760 0.093595938 -0.0329915968 5.4016458 0.013467946 2018-05-01
122 0.07271528 0.061923504 0.0130002656 9.1602292 0.018299062 2018-05-02
123 0.02646133 0.007506529 -0.0276898846 0.2338125 0.246100834 2018-05-03
124 0.02379895 0.067273612 0.0112587565 19.1260417 0.120707266 2018-05-04
125 0.05925152 0.075768053 0.0050178925 16.2114583 0.162884739 2018-05-05
126 0.01858152 0.040845398 0.0164467420 12.9156250 0.028823967 2018-05-06
127 0.06994835 0.059457560 -0.0181926787 7.7316042 0.035106399 2018-05-07
128 0.05926409 0.038623605 0.0167222227 13.5464583 0.055665220 2018-05-08
129 0.03104010 0.006805893 -0.0141792029 14.5006250 0.012099383 2018-05-09
130 0.06631012 0.059314975 -0.0228020931 13.3711875 0.073114370 2018-05-10
131 0.03794480 0.015615642 0.0034917459 16.6675208 0.191141576 2018-05-11
132 0.03532917 0.050988581 0.0079455282 14.7375208 0.214172062 2018-05-12
133 0.08512617 0.063322454 0.0224309652 11.6861250 0.166425889 2018-05-13
134 0.04498265 0.012386160 -0.0051629339 7.2488333 0.280120908 2018-05-14
135 0.06383512 0.126840241 -0.0172296864 17.3852083 0.020363429 2018-05-15
136 0.06932861 0.026819550 -0.0109061610 20.9152083 0.099516538 2018-05-16
137 0.04020292 0.021831228 -0.0007211804 6.7122292 0.069831669 2018-05-17
138 0.02037474 0.020931810 0.0088341962 15.8758333 0.130548701 2018-05-18
139 0.01704143 0.105810563 -0.0243003529 10.7339583 0.038013440 2018-05-19
140 0.01266417 0.013985439 0.0091359503 6.5119375 0.196746897 2018-05-20
141 0.03623625 0.057182212 -0.0136101306 18.6637500 0.009431062 2018-05-21
142 0.03938695 0.054879146 0.0091277482 15.5393750 0.115389187 2018-05-22
143 0.05995812 0.061925644 -0.0029137774 11.8191667 0.015729774 2018-05-23
144 0.06548692 0.095240991 0.0055356839 4.3011875 0.081309326 2018-05-24
145 0.01582489 0.015264434 -0.0020079231 9.3315833 0.105132636 2018-05-25
146 0.06834050 0.028756388 -0.0512068435 13.6035417 0.212930829 2018-05-26
147 0.08354736 0.023524928 0.0041989465 4.5111250 0.227197329 2018-05-27
148 0.05738595 0.011159952 -0.0225834032 12.9385417 0.090503870 2018-05-28
149 0.07817132 0.103507587 -0.0222426051 13.4047292 0.034928812 2018-05-29
150 0.04773356 0.035856991 -0.0191600449 9.6657708 0.019893986 2018-05-30
Disclaimer: I am looking for a co-author for help in validating my work with keras / tensor flow
Maybe you can try the following things:
Using Seasonal Decomposition na_seadec() instead of seasonal split
Manually setting seasonality
Setting find frequency = TRUE
You can set the seasonality manually with the following actions:
Suppose you have a vector x and you have monthly values - this would mean freq = 12
# Create time series with seasonality information
x <-c(2,3,4,5,6,6,6,6,6,4,4,4,4,5,6,4,3,3,5,6,4,3,5,3,5,3,5,3,4,4,4,4,4,4,2,4,2,2,4,5,6,7,8,9,0,0,5,2,4)
x_with_freq <- ts(x, frequency = 12)
# imputation for the series
na_seadec(x_with_freq)
If you have daily values your frequency should be 365.
Other than this, you could also try to run na_seadec(x, find_frequency = T) then imputeTS tries to automatically find the seasonality for you.
But in the end, I don't know your data, could very well be, that the seasonal patterns aren't too strong.

How can I assign a column a character Value based on whether another column meets a condition in R?

I have multiple events (rows) that took place on a certain date and time. I want to group them by another column and then in a separate column if they're the oldest date the column will be "No" and if not the column will be "Yes". Below is the first 20 rows of my data
Event Rework lm_date
1 409974 NA 2019-10-16 18:34:00
2 409974 NA 2019-11-24 17:02:00
3 409974 NA 2019-11-25 17:18:00
4 409974 NA 2019-12-10 20:46:00
5 410047 NA 2019-09-09 20:39:00
6 410047 NA 2019-09-10 18:46:00
7 410172 NA 2019-09-10 18:50:00
8 410172 NA 2019-09-10 20:02:00
9 410172 NA 2019-09-11 20:46:00
10 410172 NA 2019-09-13 17:40:00
11 410172 NA 2019-10-11 03:02:00
12 411169 NA 2019-10-03 16:06:00
13 411169 NA 2019-11-07 20:56:00
14 411169 NA 2019-11-08 20:02:00
15 411229 NA 2019-10-14 16:13:00
16 411229 NA 2019-11-06 16:43:00
17 411229 NA 2019-11-07 21:28:00
18 411229 NA 2019-11-11 21:45:00
19 411929 NA 2019-09-17 22:34:00
20 411929 NA 2019-09-19 20:46:00
For example, I want to group by the event so the first group would be 409974 and have the rework column be a "No" for the first row and a Yes for every other row in the event, i.e. the next 3 rows. I tried using
CATASK %>% group_by(Event) %>% mutate(if (lm_date == min(lm_date) {
Rework == "No"}
else {
Rework == "Yes"
}
))
but to no avail.. Any advice would be greatly appreciated!
I would suggest next tidyverseapproach using mutate() and checking the oldest date with min():
library(tidyverse)
#Format date
df %>% mutate(lm_date=as.POSIXct(lm_date)) %>%
group_by(Event) %>%
mutate(Rework=ifelse(lm_date==min(lm_date),'No','Yes'))
Output:
# A tibble: 20 x 3
# Groups: Event [6]
Event lm_date Rework
<dbl> <dttm> <chr>
1 409974 2019-10-16 18:33:59 No
2 409974 2019-11-24 17:02:00 Yes
3 409974 2019-11-25 17:18:00 Yes
4 409974 2019-12-10 20:45:59 Yes
5 410047 2019-09-09 20:39:00 No
6 410047 2019-09-10 18:46:00 Yes
7 410172 2019-09-10 18:49:59 No
8 410172 2019-09-10 20:02:00 Yes
9 410172 2019-09-11 20:45:59 Yes
10 410172 2019-09-13 17:39:59 Yes
11 410172 2019-10-11 03:01:59 Yes
12 411169 2019-10-03 16:05:59 No
13 411169 2019-11-07 20:55:59 Yes
14 411169 2019-11-08 20:02:00 Yes
15 411229 2019-10-14 16:12:59 No
16 411229 2019-11-06 16:43:00 Yes
17 411229 2019-11-07 21:27:59 Yes
18 411229 2019-11-11 21:45:00 Yes
19 411929 2019-09-17 22:34:00 No
20 411929 2019-09-19 20:45:59 Yes
Some data used next:
#Data
df <- structure(list(Event = c(409974, 409974, 409974, 409974, 410047,
410047, 410172, 410172, 410172, 410172, 410172, 411169, 411169,
411169, 411229, 411229, 411229, 411229, 411929, 411929), lm_date = structure(c(1571250840,
1574614920, 1574702280, 1576010760, 1568061540, 1568141160, 1568141400,
1568145720, 1568234760, 1568396400, 1570762920, 1570118760, 1573160160,
1573243320, 1571069580, 1573058580, 1573162080, 1573508700, 1568759640,
1568925960), class = c("POSIXct", "POSIXt"), tzone = "GMT")), class = "data.frame", row.names = c(NA,
-20L))

Some conditions in nested ifelse taken into account

I struggle with nested ifelse. I want to create a new variable using dplyr::mutate based on values of other variables. See the reproductible example below.
library(dplyr)
library(hms)
# make a test dataframe
datetime <- as.POSIXct(c("2015-01-26 10:10:00 UTC","2015-01-26 10:20:00 UTC","2015-01-26 10:30:00 UTC", "2015-01-26 10:40:00 UTC","2015-01-26 10:50:00 UTC","2015-01-26 11:00:00 UTC","2015-01-26 00:10:00 UTC","2015-01-26 11:20:00 UTC","2015-01-26 11:30:00 UTC","2017-03-10 10:00:00 UTC"))
time <- hms::as_hms(datetime)
pco2_corr <- c(90,135,181,226,272,317,363,NA,454,300)
State_Zero <- c(NA,NA,1,rep(NA,7))
State_Flush <- c(rep(NA,4),1,rep(NA,5))
z <- tibble(datetime, time, pco2_corr, State_Zero, State_Flush)
# now create a new variable
z <- z %>%
dplyr::mutate(pco2_corr_qf = ifelse(is.na(pco2_corr), 15,
ifelse((State_Zero >= 1 | State_Flush >= 1), 4,
ifelse(pco2_corr < 100 | pco2_corr > 450, 7,
ifelse((time >= "00:00:00" & time <= "01:30:00") |
(time >= "12:00:00" & time <= "13:00:00"), 16,
ifelse((datetime >= "2017-03-10 08:00:00" &
datetime < "2017-03-21 20:00:00"), 99,
1))))))
z
# A tibble: 10 x 6
datetime time pco2_corr State_Zero State_Flush pco2_corr_qf
<dttm> <time> <dbl> <dbl> <dbl> <dbl>
1 2015-01-26 10:10:00 10:10 90 NA NA NA
2 2015-01-26 10:20:00 10:20 135 NA NA NA
3 2015-01-26 10:30:00 10:30 181 1 NA 4
4 2015-01-26 10:40:00 10:40 226 NA NA NA
5 2015-01-26 10:50:00 10:50 272 NA 1 4
6 2015-01-26 11:00:00 11:00 317 NA NA NA
7 2015-01-26 00:10:00 00:10 363 NA NA NA
8 2015-01-26 11:20:00 11:20 NA NA NA 15
9 2015-01-26 11:30:00 11:30 454 NA NA NA
10 2017-03-10 10:00:00 10:00 300 NA NA NA
The first two ifelse work fine but the next three do not. The new variable pco2_corr_qf should not have any NA but values 7, 16, 99 and 1.
What am I doing wrong?
You are comparing time with a string that gives incorrect output, convert it to the relevant class. We can use case_when which is a better alternative to nested ifelse.
library(dplyr)
library(hms)
z %>%
mutate(pco2_corr_qf = case_when(
is.na(pco2_corr) ~ 15,
State_Zero >= 1 | State_Flush >= 1 ~ 4,
pco2_corr < 100 | pco2_corr > 450 ~ 7,
(time >= as_hms("00:00:00") & time <= as_hms("01:30:00")) |
(time >= as_hms("12:00:00") & time <= as_hms("13:00:00")) ~ 16,
datetime >= as.POSIXct("2017-03-10 08:00:00") &
datetime < as.POSIXct("2017-03-21 20:00:00") ~ 99,
TRUE ~ 1))
# datetime time pco2_corr State_Zero State_Flush pco2_corr_qf
# <dttm> <time> <dbl> <dbl> <dbl> <dbl>
# 1 2015-01-26 10:10:00 10:10 90 NA NA 7
# 2 2015-01-26 10:20:00 10:20 135 NA NA 1
# 3 2015-01-26 10:30:00 10:30 181 1 NA 4
# 4 2015-01-26 10:40:00 10:40 226 NA NA 1
# 5 2015-01-26 10:50:00 10:50 272 NA 1 4
# 6 2015-01-26 11:00:00 11:00 317 NA NA 1
# 7 2015-01-26 00:10:00 00:10 363 NA NA 16
# 8 2015-01-26 11:20:00 11:20 NA NA NA 15
# 9 2015-01-26 11:30:00 11:30 454 NA NA 7
#10 2017-03-10 10:00:00 10:00 300 NA NA 99

Aggregate Data based on Two Different Assessment Methods in R

I'm looking to aggregate some pedometer data, gathered in steps per minute, so I get a summed number of steps up until an EMA assessment. The EMA assessments happened four times per day. An example of the two data sets are:
Pedometer Data
ID Steps Time
1 15 2/4/2020 8:32
1 23 2/4/2020 8:33
1 76 2/4/2020 8:34
1 32 2/4/2020 8:35
1 45 2/4/2020 8:36
...
2 16 2/4/2020 8:32
2 17 2/4/2020 8:33
2 0 2/4/2020 8:34
2 5 2/4/2020 8:35
2 8 2/4/2020 8:36
EMA Data
ID Time X Y
1 2/4/2020 8:36 3 4
1 2/4/2020 12:01 3 5
1 2/4/2020 3:30 4 5
1 2/4/2020 6:45 7 8
...
2 2/4/2020 8:35 4 6
2 2/4/2020 12:05 5 7
2 2/4/2020 3:39 1 3
2 2/4/2020 6:55 8 3
I'm looking to add the pedometer data to the EMA data as a new variable, where the number of steps taken are summed until the next EMA assessment. Ideally it would like something like:
Combined Data
ID Time X Y Steps
1 2/4/2020 8:36 3 4 191
1 2/4/2020 12:01 3 5 [Sum of steps taken from 8:37 until 12:01 on 2/4/2020]
1 2/4/2020 3:30 4 5 [Sum of steps taken from 12:02 until 3:30 on 2/4/2020]
1 2/4/2020 6:45 7 8 [Sum of steps taken from 3:31 until 6:45 on 2/4/2020]
...
2 2/4/2020 8:35 4 6 38
2 2/4/2020 12:05 5 7 [Sum of steps taken from 8:36 until 12:05 on 2/4/2020]
2 2/4/2020 3:39 1 3 [Sum of steps taken from 12:06 until 3:39 on 2/4/2020]
2 2/4/2020 6:55 8 3 [Sum of steps taken from 3:40 until 6:55 on 2/4/2020]
I then need the process to continue over the entire 21 day EMA period, so the same process for the 4 EMA assessment time points on 2/5/2020, 2/6/2020, etc.
This has pushed me the limit of my R skills, so any pointers would be extremely helpful! I'm most familiar with the tidyverse but am comfortable using base R as well. Thanks in advance for all advice.
Here's a solution using rolling joins from data.table. The basic idea here is to roll each time from the pedometer data up to the next time in the EMA data (while matching on ID still). Once it's the next EMA time is found, all that's left is to isolate the X and Y values and sum up Steps.
Data creation and prep:
library(data.table)
pedometer <- data.table(ID = sort(rep(1:2, 500)),
Time = rep(seq.POSIXt(as.POSIXct("2020-02-04 09:35:00 EST"),
as.POSIXct("2020-02-08 17:00:00 EST"), length.out = 500), 2),
Steps = rpois(1000, 25))
EMA <- data.table(ID = sort(rep(1:2, 4*5)),
Time = rep(seq.POSIXt(as.POSIXct("2020-02-04 05:00:00 EST"),
as.POSIXct("2020-02-08 23:59:59 EST"), by = '6 hours'), 2),
X = sample(1:8, 2*4*5, rep = T),
Y = sample(1:8, 2*4*5, rep = T))
setkey(pedometer, Time)
setkey(EMA, Time)
EMA[,next_ema_time := Time]
And now the actual join and summation:
joined <- EMA[pedometer,
on = .(ID, Time),
roll = -Inf,
j = .(ID, Time, Steps, next_ema_time, X, Y)]
result <- joined[,.('X' = min(X),
'Y' = min(Y),
'Steps' = sum(Steps)),
.(ID, next_ema_time)]
result
#> ID next_ema_time X Y Steps
#> 1: 1 2020-02-04 11:00:00 1 2 167
#> 2: 2 2020-02-04 11:00:00 8 5 169
#> 3: 1 2020-02-04 17:00:00 3 6 740
#> 4: 2 2020-02-04 17:00:00 4 6 747
#> 5: 1 2020-02-04 23:00:00 2 2 679
#> 6: 2 2020-02-04 23:00:00 3 2 732
#> 7: 1 2020-02-05 05:00:00 7 5 720
#> 8: 2 2020-02-05 05:00:00 6 8 692
#> 9: 1 2020-02-05 11:00:00 2 4 731
#> 10: 2 2020-02-05 11:00:00 4 5 773
#> 11: 1 2020-02-05 17:00:00 1 5 757
#> 12: 2 2020-02-05 17:00:00 3 5 743
#> 13: 1 2020-02-05 23:00:00 3 8 693
#> 14: 2 2020-02-05 23:00:00 1 8 740
#> 15: 1 2020-02-06 05:00:00 8 8 710
#> 16: 2 2020-02-06 05:00:00 3 2 760
#> 17: 1 2020-02-06 11:00:00 8 4 716
#> 18: 2 2020-02-06 11:00:00 1 2 688
#> 19: 1 2020-02-06 17:00:00 5 2 738
#> 20: 2 2020-02-06 17:00:00 4 6 724
#> 21: 1 2020-02-06 23:00:00 7 8 737
#> 22: 2 2020-02-06 23:00:00 6 3 672
#> 23: 1 2020-02-07 05:00:00 2 6 726
#> 24: 2 2020-02-07 05:00:00 7 7 759
#> 25: 1 2020-02-07 11:00:00 1 4 737
#> 26: 2 2020-02-07 11:00:00 5 2 737
#> 27: 1 2020-02-07 17:00:00 3 5 766
#> 28: 2 2020-02-07 17:00:00 4 4 745
#> 29: 1 2020-02-07 23:00:00 3 3 714
#> 30: 2 2020-02-07 23:00:00 2 1 741
#> 31: 1 2020-02-08 05:00:00 4 6 751
#> 32: 2 2020-02-08 05:00:00 8 2 723
#> 33: 1 2020-02-08 11:00:00 3 3 716
#> 34: 2 2020-02-08 11:00:00 3 6 735
#> 35: 1 2020-02-08 17:00:00 1 5 696
#> 36: 2 2020-02-08 17:00:00 7 7 741
#> ID next_ema_time X Y Steps
Created on 2020-02-04 by the reprex package (v0.3.0)
I would left_join ema_df on pedometer_df by ID and Time. This way you get
all lines of pedometer_df with missing values for x and y (that I assume are identifiers) when it is not an EMA assessment time.
I fill the values using the next available (so the next ema assessment x and y)
and finally, group_by ID x and y and summarise to keep the datetime of assessment (max) and the sum of steps.
library(dplyr)
library(tidyr)
pedometer_df %>%
left_join(ema_df, by = c("ID", "Time")) %>%
fill(x, y, .direction = "up") %>%
group_by(ID, x, y) %>%
summarise(
Time = max(Time),
Steps = sum(Steps)
)

Quarterly Year over Year Growth Rate

I have been trying to calculate the growth rate comparing quarter 1 from one year to quarter 1 for the following year.
In excel the formula would look like this ((B6-B2)/B2)*100.
What is the best way to accomplish this in R? I know how to get the differences from period to period, but cannot accomplish it with 4 time periods' difference.
Here is the code:
date <- c("2000-01-01","2000-04-01", "2000-07-01",
"2000-10-01","2001-01-01","2001-04-01",
"2001-07-01","2001-10-01","2002-01-01",
"2002-04-01","2002-07-01","2002-10-01")
value <- c(1592,1825,1769,1909,2022,2287,2169,2366,2001,2087,2099,2258)
df <- data.frame(date,value)
Which will produce this data frame:
date value
1 2000-01-01 1592
2 2000-04-01 1825
3 2000-07-01 1769
4 2000-10-01 1909
5 2001-01-01 2022
6 2001-04-01 2287
7 2001-07-01 2169
8 2001-10-01 2366
9 2002-01-01 2001
10 2002-04-01 2087
11 2002-07-01 2099
12 2002-10-01 2258
Here's an option using the dplyr package:
# Convert date column to date format
df$date = as.POSIXct(df$date)
library(dplyr)
library(lubridate)
In the code below, we first group by month, which allows us to operate on each quarter separately. The arrange function just makes sure that the data within each quarter is ordered by date. Then we add the yearOverYear column using mutate which calculates the ratio of the current year to the previous year for each quarter.
df = df %>% group_by(month=month(date)) %>%
arrange(date) %>%
mutate(yearOverYear=value/lag(value,1))
date value month yearOverYear
1 2000-01-01 1592 1 NA
2 2001-01-01 2022 1 1.2701005
3 2002-01-01 2001 1 0.9896142
4 2000-04-01 1825 4 NA
5 2001-04-01 2287 4 1.2531507
6 2002-04-01 2087 4 0.9125492
7 2000-07-01 1769 7 NA
8 2001-07-01 2169 7 1.2261164
9 2002-07-01 2099 7 0.9677271
10 2000-10-01 1909 10 NA
11 2001-10-01 2366 10 1.2393924
12 2002-10-01 2258 10 0.9543533
If you prefer to have the data frame back in overall date order after adding the year-over-year values:
df = df %>% group_by(month=month(date)) %>%
arrange(date) %>%
mutate(yearOverYear=value/lag(value,1)) %>%
ungroup() %>% arrange(date)
Or using data.table
library(data.table) # v1.9.5+
setDT(df)[, .(date, yoy = (value-shift(value))/shift(value)*100),
by = month(date)
][order(date)]
Here's a very simple solution:
YearOverYear<-function (x,periodsPerYear){
if(NROW(x)<=periodsPerYear){
stop("too few rows")
}
else{
indexes<-1:(NROW(x)-periodsPerYear)
return(c(rep(NA,periodsPerYear),(x[indexes+periodsPerYear]-x[indexes])/x[indexes]))
}
}
> cbind(df,YoY=YearOverYear(df$value,4))
date value YoY
1 2000-01-01 1592 NA
2 2000-04-01 1825 NA
3 2000-07-01 1769 NA
4 2000-10-01 1909 NA
5 2001-01-01 2022 0.27010050
6 2001-04-01 2287 0.25315068
7 2001-07-01 2169 0.22611645
8 2001-10-01 2366 0.23939235
9 2002-01-01 2001 -0.01038576
10 2002-04-01 2087 -0.08745081
11 2002-07-01 2099 -0.03227294
12 2002-10-01 2258 -0.04564666
df$yoy <- c(rep(NA,4),(df$value[5:nrow(df)]-df$value[1:(nrow(df)-4)])/df$value[1:(nrow(df)-4)]*100);
df;
## date value yoy
## 1 2000-01-01 1592 NA
## 2 2000-04-01 1825 NA
## 3 2000-07-01 1769 NA
## 4 2000-10-01 1909 NA
## 5 2001-01-01 2022 27.010050
## 6 2001-04-01 2287 25.315068
## 7 2001-07-01 2169 22.611645
## 8 2001-10-01 2366 23.939235
## 9 2002-01-01 2001 -1.038576
## 10 2002-04-01 2087 -8.745081
## 11 2002-07-01 2099 -3.227294
## 12 2002-10-01 2258 -4.564666
Another base R solution. Requires that the date is in date format, so that the common months can be used as a grouping variable to which the function to calculate growth rate can be passed
# set date to a date objwct
df$date <- as.Date(df$date)
# order by date
df <- df[order(df$date), ]
# function to calculate differences
f <- function(x) c(NA, 100*diff(x)/x[-length(x)])
df$yoy <- ave(df$value, format(df$date, "%m"), FUN=f)
# date value yoy
# 1 2000-01-01 1592 NA
# 2 2000-04-01 1825 NA
# 3 2000-07-01 1769 NA
# 4 2000-10-01 1909 NA
# 5 2001-01-01 2022 27.010050
# 6 2001-04-01 2287 25.315068
# 7 2001-07-01 2169 22.611645
# 8 2001-10-01 2366 23.939235
# 9 2002-01-01 2001 -1.038576
# 10 2002-04-01 2087 -8.745081
# 11 2002-07-01 2099 -3.227294
# 12 2002-10-01 2258 -4.564666
or
c(rep(NA, 4,), 100* diff(df$value, lag=4) / head(df$value, -4))

Resources