I am trying to cross-validate a Prophet model in R.
The problem: this package does not work well with monthly data.
I managed to build the model and even added a custom monthly seasonality, as recommended by the authors of this tool (roughly as sketched below).
But I cannot cross-validate the monthly data. I tried to follow the recommendations in the GitHub issue, but I am missing something.
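Roughly how the monthly seasonality was set up (a simplified sketch; the fourier.order value is a placeholder rather than my exact setting, and df is the monthly training frame with ds/y columns):
library(prophet)
model1 <- prophet(weekly.seasonality = FALSE, daily.seasonality = FALSE)
model1 <- add_seasonality(model1, name = "monthly", period = 30.5, fourier.order = 5)
model1 <- fit.prophet(model1, df)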
Currently my code looks like this:
model1_cv <- cross_validation(model1, initial = 156, period = 365/12, as.difftime(horizon = 365/12, units = "days"))
Update:
Based on the answer to this question, I visualized the CV results. There are some problems here. I used both the full data and partial data.
Also, the metrics do not look that good.
I just tested a bit with the training data from the package, and from what I understood the package is not really well suited for monthly forecasts. This part: [...] as.difftime(365/12, units = "days") [...] seems to have been included just to spell out the length of a month as 30-something days, meaning you can use it instead of plain 365/12 for "period" and/or "horizon". One thing I noticed is that both arguments are described as integers in the documentation, but when you look into the function they are converted via as.difftime(), so they are actually doubles.
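For example, the difftime object is just a double underneath:
as.difftime(365/12, units = "days")
# Time difference of 30.41667 days
as.numeric(as.difftime(365/12, units = "days"))
# [1] 30.41667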
library(dplyr)
library(prophet)
library(data.table)
#training data
df <- data.table::fread("ds y
1992-01-01 146376
1992-02-01 147079
1992-03-01 159336
1992-04-01 163669
1992-05-01 170068
1992-06-01 168663
1992-07-01 169890
1992-08-01 170364
1992-09-01 164617
1992-10-01 173655
1992-11-01 171547
1992-12-01 208838
1993-01-01 153221
1993-02-01 150087
1993-03-01 170439
1993-04-01 176456
1993-05-01 182231
1993-06-01 181535
1993-07-01 183682
1993-08-01 183318
1993-09-01 177406
1993-10-01 182737
1993-11-01 187443
1993-12-01 224540
1994-01-01 161349
1994-02-01 162841
1994-03-01 192319
1994-04-01 189569
1994-05-01 194927
1994-06-01 197946
1994-07-01 193355
1994-08-01 202388
1994-09-01 193954
1994-10-01 197956
1994-11-01 202520
1994-12-01 241111
1995-01-01 175344
1995-02-01 172138
1995-03-01 201279
1995-04-01 196039
1995-05-01 210478
1995-06-01 211844
1995-07-01 203411
1995-08-01 214248
1995-09-01 202122
1995-10-01 204044
1995-11-01 212190
1995-12-01 247491
1996-01-01 185019
1996-02-01 192380
1996-03-01 212110
1996-04-01 211718
1996-05-01 226936
1996-06-01 217511
1996-07-01 218111")
df <- df %>%
dplyr::mutate(ds = as.Date(ds))
model <- prophet::prophet(df)
(tscv.myfit <- prophet::cross_validation(model, horizon = 365/12, units = "days", period = 365/12, initial = 365/12 * 12 * 3))
y ds yhat yhat_lower yhat_upper cutoff
1: 175344 1995-01-01 170988.8 170145.9 171828.0 1994-12-31 02:00:00
2: 172138 1995-02-01 178117.4 176975.2 179070.2 1995-01-30 12:00:00
3: 201279 1995-03-01 211462.8 210277.4 212670.8 1995-01-30 12:00:00
4: 196039 1995-04-01 200113.9 198079.5 201977.8 1995-03-01 22:00:00
5: 210478 1995-05-01 202100.5 200390.8 203797.9 1995-04-01 08:00:00
6: 211844 1995-06-01 208330.5 206229.9 210497.4 1995-05-01 18:00:00
7: 203411 1995-07-01 202563.8 200786.5 204313.0 1995-06-01 04:00:00
8: 214248 1995-08-01 214639.6 212748.3 216461.3 1995-07-01 14:00:00
9: 202122 1995-09-01 204954.0 203048.9 206768.4 1995-08-31 12:00:00
10: 204044 1995-10-01 205097.5 203209.7 206882.3 1995-09-30 22:00:00
11: 212190 1995-11-01 213586.7 211728.1 215617.6 1995-10-31 08:00:00
12: 247491 1995-12-01 251518.8 249708.2 253589.2 1995-11-30 18:00:00
13: 185019 1996-01-01 182403.7 180520.1 184494.7 1995-12-31 04:00:00
14: 192380 1996-02-01 184722.9 182772.7 186686.9 1996-01-30 14:00:00
15: 212110 1996-03-01 205020.1 202823.2 206996.9 1996-01-30 14:00:00
16: 211718 1996-04-01 214514.0 211891.9 217175.3 1996-03-31 14:00:00
17: 226936 1996-05-01 218845.2 216133.8 221420.4 1996-03-31 14:00:00
18: 217511 1996-06-01 218672.2 216007.8 221459.9 1996-05-31 14:00:00
19: 218111 1996-07-01 221156.1 218540.7 224184.1 1996-05-31 14:00:00
The cutoffs are not as regular as one would expect; I guess this is due to using the average number of days per month somehow, though I could not figure out the exact logic. You can replace 365/12 with as.difftime(365/12, units = "days") and get the same result.
But if you use (365+365+365+366)/48 instead, to account for February 29, you get a slightly different average number of days per month, and this leads to a different output:
(tscv.myfit_2 <- prophet::cross_validation(model, horizon = (365+365+365+366)/48, units = "days", period = (365+365+365+366)/48, initial = (365+365+365+366)/48 * 12 * 3))
y ds yhat yhat_lower yhat_upper cutoff
1: 172138 1995-02-01 178117.4 177075.3 179203.9 1995-01-29 13:30:00
2: 201279 1995-03-01 211462.8 210340.5 212607.3 1995-01-29 13:30:00
3: 196039 1995-04-01 200113.9 198022.6 202068.1 1995-03-31 13:30:00
4: 210478 1995-05-01 204100.2 202009.8 206098.7 1995-03-31 13:30:00
5: 211844 1995-06-01 208330.5 206114.5 210515.8 1995-05-31 13:30:00
6: 203411 1995-07-01 202606.0 200319.1 204663.4 1995-05-31 13:30:00
7: 214248 1995-08-01 214639.6 212684.4 216495.7 1995-07-31 22:30:00
8: 202122 1995-09-01 204954.0 203127.7 206951.0 1995-08-31 09:00:00
9: 204044 1995-10-01 205097.5 203285.3 207036.5 1995-09-30 19:30:00
10: 212190 1995-11-01 213586.7 211516.8 215516.2 1995-10-31 06:00:00
11: 247491 1995-12-01 251518.8 249658.3 253590.1 1995-11-30 16:30:00
12: 185019 1996-01-01 182403.7 180359.7 184399.2 1995-12-31 03:00:00
13: 192380 1996-02-01 184722.9 182652.4 186899.8 1996-01-30 13:30:00
14: 212110 1996-03-01 205020.1 203040.3 207171.9 1996-01-30 13:30:00
15: 211718 1996-04-01 214514.0 211942.6 217252.6 1996-03-31 13:30:00
16: 226936 1996-05-01 218845.2 216203.1 221506.5 1996-03-31 13:30:00
17: 217511 1996-06-01 218672.2 215823.9 221292.4 1996-05-31 13:30:00
18: 218111 1996-07-01 221156.1 218236.7 223862.0 1996-05-31 13:30:00
From this behaviour I would say the workaround is not ideal, especially depending on how exact you want the cross-validation to be in terms of rolling months. If you need the cutoff points to be exact, you could write your own function that always predicts one month ahead from each starting point, collect these results, and build the final comparison yourself (see the sketch below). I would trust this approach more than the workaround.
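A rough sketch of what such a manual monthly cross-validation could look like (it refits the model at every cutoff; the function name and the 36-month initial window are illustrative choices, not part of the package):
library(dplyr)
library(prophet)
monthly_cv <- function(df, n_initial) {
  # one cutoff per month, starting after the initial training window
  cutoffs <- seq(n_initial, nrow(df) - 1)
  res <- lapply(cutoffs, function(i) {
    m <- prophet(df[1:i, ])                                 # refit on data up to the cutoff
    future <- make_future_dataframe(m, periods = 1, freq = "month")
    fc <- tail(predict(m, future), 1)                       # the one-month-ahead forecast row
    data.frame(ds = df$ds[i + 1], y = df$y[i + 1],
               yhat = fc$yhat, yhat_lower = fc$yhat_lower,
               yhat_upper = fc$yhat_upper, cutoff = df$ds[i])
  })
  bind_rows(res)
}
# e.g. three years of monthly data as the initial window
# tscv_manual <- monthly_cv(df, n_initial = 36)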
The data (df3) looks like this. A "1" for the day was appended at the end just to fulfill the date format requirement.
ds y
1 2015-01-01 -390217.2
2 2015-02-01 230944.1
3 2015-03-01 367259.7
4 2015-04-01 567962.8
5 2015-05-01 753175.6
6 2015-06-01 -907767.5
7 2015-07-01 -52225619.9
8 2015-08-01 631666.1
9 2015-09-01 -792896.8
10 2015-10-01 430847.6
11 2015-11-01 5159146.7
12 2015-12-01 -2087233.7
Code I have tried:
try <- prophet(df3, seasonality.mode = 'multiplicative')
future <- make_future_dataframe(try, periods = 1)
forecast <- predict(try, future)
tail(forecast)
Result I am getting:
ds yhat
50 2019-02-01 -9536258.7
51 2019-03-01 -456995.5
52 2019-04-01 -1734330.0
53 2019-05-01 -3428825.1
54 2019-06-01 -2612847.0
55 2019-06-02 -2918161.2
The question is: how do I predict the July 2019 value instead of the 2 June 2019 value?
You can pass a monthly frequency to make_future_dataframe (this is the Python API; an R sketch follows below):
future = prophet.make_future_dataframe(periods=12 , freq='M')
For more information: https://towardsdatascience.com/forecasting-in-python-with-facebook-prophet-29810eb57e66
To get month-start dates, use freq='MS':
future = prophet.make_future_dataframe(periods=12 , freq='MS')
forecast = prophet.predict(future)
fig = prophet.plot(forecast)
fig.show()
MS stands for Month Start.
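Since the question uses the R package: make_future_dataframe() in R takes freq = "month" rather than a pandas frequency string, so something along these lines should also produce month-start dates (a sketch reusing the model object from the question):
future <- make_future_dataframe(try, periods = 12, freq = "month")
forecast <- predict(try, future)
tail(forecast[c("ds", "yhat")])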
I have a data frame, df, that has a date and two variables in it. I would like to either extract all of the Oct-Dec data or delete the other months' data from the data frame.
I have put the data into a data frame, but at the moment I have the whole year and just want to extract the wanted months. In the future I will also be extracting just the winter data. I have attached a chunk of my data frame. I tried using format() with just %m but couldn't get it to work.
14138 2017-09-15 4.655946e-01 0.0603515884
14139 2017-09-16 7.881137e-01 0.0479933304
14140 2017-09-17 5.018990e-01 0.0256871025
14141 2017-09-18 -1.583625e-01 -0.0040893990
14142 2017-09-19 -6.733220e-01 -0.0313100989
14143 2017-09-20 -1.225730e+00 -0.0587706331
14144 2017-09-21 -1.419133e+00 -0.0958125544
14145 2017-09-22 -1.338630e+00 -0.0902803173
14146 2017-09-23 -1.272554e+00 -0.0659170673
14147 2017-09-24 -1.132318e+00 -0.0387240370
14148 2017-09-25 -1.255414e+00 -0.0392615823
14149 2017-09-26 -1.497188e+00 -0.0438491356
14150 2017-09-27 -1.427622e+00 -0.0633879185
14151 2017-09-28 -1.051756e+00 -0.0992427127
14152 2017-09-29 -4.876309e-01 -0.1448044528
14153 2017-09-30 -6.829681e-02 -0.1749463647
14154 2017-10-01 -1.413768e-01 -0.2009916094
14155 2017-10-02 6.359742e-02 -0.1975848313
14156 2017-10-03 9.103277e-01 -0.1828581805
14157 2017-10-04 1.695776e+00 -0.1589352546
14158 2017-10-05 1.913918e+00 -0.1538234614
14159 2017-10-06 1.479714e+00 -0.1937094170
14160 2017-10-07 8.783669e-01 -0.1703790211
14161 2017-10-08 5.706581e-01 -0.1294144428
14162 2017-10-09 4.979405e-01 -0.0666569815
14163 2017-10-10 3.233477e-01 0.0072006102
14164 2017-10-11 3.057630e-01 0.0863445067
14165 2017-10-12 5.877673e-01 0.1097707831
14166 2017-10-13 1.208526e+00 0.1301967193
14167 2017-10-14 1.671705e+00 0.1728109268
14168 2017-10-15 1.810979e+00 0.2264911145
14169 2017-10-16 1.426651e+00 0.2702958315
14170 2017-10-17 1.241140e+00 0.3242637704
14171 2017-10-18 8.997498e-01 0.3879727861
14172 2017-10-19 5.594161e-01 0.4172990825
14173 2017-10-20 3.980254e-01 0.3915170864
14174 2017-10-21 2.138538e-01 0.3249736995
14175 2017-10-22 3.926440e-01 0.2224834840
14176 2017-10-23 2.268644e-01 0.0529143372
14177 2017-10-24 5.664923e-01 -0.0081443464
14178 2017-10-25 6.167520e-01 0.0312073984
14179 2017-10-26 7.751882e-02 0.0043897693
14180 2017-10-27 -5.634851e-02 -0.0726825266
14181 2017-10-28 -2.122061e-01 -0.1711305549
14182 2017-10-29 -8.500991e-01 -0.2068581639
14183 2017-10-30 -1.039685e+00 -0.2909120824
14184 2017-10-31 -3.057745e-01 -0.3933633317
14185 2017-11-01 -1.288774e-01 -0.3726346136
14186 2017-11-02 -5.608007e-03 -0.2425754386
14187 2017-11-03 4.853990e-01 -0.0503543980
14188 2017-11-04 5.822672e-01 0.0896130098
14189 2017-11-05 8.491505e-01 0.1299151006
14190 2017-11-06 1.052999e+00 0.0749888307
14191 2017-11-07 1.170470e+00 0.0287317882
14192 2017-11-08 7.919862e-01 0.0788187381
14193 2017-11-09 4.574565e-01 0.1539981316
14194 2017-11-10 4.552032e-01 0.2034393145
14195 2017-11-11 -3.621350e-01 0.2077476707
14196 2017-11-12 -8.053965e-01 0.1759558604
14197 2017-11-13 -8.307459e-01 0.1802858410
14198 2017-11-14 -9.421325e-01 0.2175529008
14199 2017-11-15 -9.880204e-01 0.2392924580
14200 2017-11-16 -7.448127e-01 0.2519253751
14201 2017-11-17 -8.081435e-01 0.2614254732
14202 2017-11-18 -1.216806e+00 0.2629971336
14203 2017-11-19 -1.122674e+00 0.3469995055
14204 2017-11-20 -1.242597e+00 0.4553094014
14205 2017-11-21 -1.294885e+00 0.5049438231
14206 2017-11-22 -9.325514e-01 0.4684133163
14207 2017-11-23 -4.632281e-01 0.4071673624
14208 2017-11-24 -9.689322e-02 0.3710270269
14209 2017-11-25 4.704467e-01 0.4126721465
14210 2017-11-26 8.682453e-01 0.3745057653
14211 2017-11-27 5.105564e-01 0.2373454931
14212 2017-11-28 4.747265e-01 0.1650783370
14213 2017-11-29 5.905379e-01 0.2632154120
14214 2017-11-30 4.083787e-01 0.3888834762
14215 2017-12-01 3.451736e-01 0.5008047592
14216 2017-12-02 5.161312e-01 0.5388177242
14217 2017-12-03 7.109279e-01 0.5515360710
14218 2017-12-04 4.458635e-01 0.5127537202
14219 2017-12-05 -3.986610e-01 0.3896493238
14220 2017-12-06 -5.968253e-01 0.1095843268
14221 2017-12-07 -1.604398e-01 -0.2455506506
14222 2017-12-08 -4.384744e-01 -0.5801038215
14223 2017-12-09 -7.255016e-01 -0.8384627087
14224 2017-12-10 -9.691828e-01 -0.9223171538
14225 2017-12-11 -1.140588e+00 -0.8177806761
14226 2017-12-12 -1.956622e-01 -0.5250998474
14227 2017-12-13 -1.083792e-01 -0.3430768534
14228 2017-12-14 -8.016345e-02 -0.3163476104
14229 2017-12-15 8.899266e-01 -0.2813253830
14230 2017-12-16 1.322833e+00 -0.2545953062
14231 2017-12-17 1.547972e+00 -0.2275373110
14232 2017-12-18 2.164907e+00 -0.3217205817
14233 2017-12-19 2.276258e+00 -0.5773412429
14234 2017-12-20 1.862291e+00 -0.7728091393
14235 2017-12-21 1.125083e+00 -0.9099696881
14236 2017-12-22 7.737118e-01 -1.2441963604
14237 2017-12-23 7.863508e-01 -1.4802661587
14238 2017-12-24 4.313111e-01 -1.4111320559
14239 2017-12-25 -8.814799e-02 -1.0024805520
14240 2017-12-26 -3.615127e-01 -0.4943077147
14241 2017-12-27 -5.011363e-01 -0.0308588186
14242 2017-12-28 -8.474088e-01 0.3717555895
14243 2017-12-29 -7.283247e-01 0.8230450219
14244 2017-12-30 -4.566981e-01 1.2495961116
14245 2017-12-31 -4.577034e-01 1.4805369230
14246 2018-01-01 1.946166e-01 1.5310004017
14247 2018-01-02 5.203149e-01 1.5384595802
14248 2018-01-03 5.024570e-02 1.4036679018
14249 2018-01-04 -7.065297e-01 1.0749574137
14250 2018-01-05 -8.741815e-01 0.7608524752
14251 2018-01-06 1.589530e-01 0.7891084646
14252 2018-01-07 8.632378e-01 1.1230358751
As requested, the class is "Date".
You can use lubridate and base R:
library(lubridate)
dats[month(ymd(dats$V2)) >= 10,]
# EDIT: if the class of the date variable is already Date, it should just be
dats[month(dats$V2) >= 10,]
Or fully base without any date work:
dats[substr(dats$V2,6,7) %in% c("10","11","12"),]
With data:
V1 V2 V3 V4
1 14138 2017-09-15 0.4655946 0.06035159
2 14139 2017-09-16 0.7881137 0.04799333
...
From your question, it is unclear what format the date variable is in. Maybe add the output of class(your_date_variable) to the question. As a general rule, though, you'll want to use filter from the dplyr package. Something like this:
library(dplyr)
new_data <- data %>% filter(format(date_variable, "%m") >= 10)
This might change slightly depending on the class of your date variable.
Assuming the 'date_variable' is Date class, extract the month and do a comparison in filter (action verb from dplyr)
library(dplyr)
library(lubridate)
data %>%
filter(month(date_variable) >= 10)
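For the winter (Dec-Feb) subset mentioned in the question, the same idea works with %in%, since those months wrap around the year end (a sketch assuming the same data and date_variable names):
library(dplyr)
library(lubridate)
winter_data <- data %>%
  filter(month(date_variable) %in% c(12, 1, 2))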
I would like to lag one variable by, say, 10 time steps and plot it against the other variable, which remains the same. I would like to do this for various lags to see whether there is a time period over which the first variable influences the other. The data I have is daily, and after lagging I am separating out the Dec-Feb data only. The problem I am having is that the plot and correlation between the lagged variable and the other variable come out the same as the non-lagged plot and correlation every time. I am not sure how to achieve this.
A sample of my data frame "data" can be seen below.
Date x y
14158 2017-10-05 1.913918e+00 -0.1538234614
14159 2017-10-06 1.479714e+00 -0.1937094170
14160 2017-10-07 8.783669e-01 -0.1703790211
14161 2017-10-08 5.706581e-01 -0.1294144428
14162 2017-10-09 4.979405e-01 -0.0666569815
14163 2017-10-10 3.233477e-01 0.0072006102
14164 2017-10-11 3.057630e-01 0.0863445067
14165 2017-10-12 5.877673e-01 0.1097707831
14166 2017-10-13 1.208526e+00 0.1301967193
14167 2017-10-14 1.671705e+00 0.1728109268
14168 2017-10-15 1.810979e+00 0.2264911145
14169 2017-10-16 1.426651e+00 0.2702958315
14170 2017-10-17 1.241140e+00 0.3242637704
14171 2017-10-18 8.997498e-01 0.3879727861
14172 2017-10-19 5.594161e-01 0.4172990825
14173 2017-10-20 3.980254e-01 0.3915170864
14174 2017-10-21 2.138538e-01 0.3249736995
14175 2017-10-22 3.926440e-01 0.2224834840
14176 2017-10-23 2.268644e-01 0.0529143372
14177 2017-10-24 5.664923e-01 -0.0081443464
14178 2017-10-25 6.167520e-01 0.0312073984
14179 2017-10-26 7.751882e-02 0.0043897693
14180 2017-10-27 -5.634851e-02 -0.0726825266
14181 2017-10-28 -2.122061e-01 -0.1711305549
14182 2017-10-29 -8.500991e-01 -0.2068581639
14183 2017-10-30 -1.039685e+00 -0.2909120824
14184 2017-10-31 -3.057745e-01 -0.3933633317
14185 2017-11-01 -1.288774e-01 -0.3726346136
14186 2017-11-02 -5.608007e-03 -0.2425754386
14187 2017-11-03 4.853990e-01 -0.0503543980
14188 2017-11-04 5.822672e-01 0.0896130098
14189 2017-11-05 8.491505e-01 0.1299151006
14190 2017-11-06 1.052999e+00 0.0749888307
14191 2017-11-07 1.170470e+00 0.0287317882
14192 2017-11-08 7.919862e-01 0.0788187381
14193 2017-11-09 4.574565e-01 0.1539981316
14194 2017-11-10 4.552032e-01 0.2034393145
14195 2017-11-11 -3.621350e-01 0.2077476707
14196 2017-11-12 -8.053965e-01 0.1759558604
14197 2017-11-13 -8.307459e-01 0.1802858410
14198 2017-11-14 -9.421325e-01 0.2175529008
14199 2017-11-15 -9.880204e-01 0.2392924580
14200 2017-11-16 -7.448127e-01 0.2519253751
14201 2017-11-17 -8.081435e-01 0.2614254732
14202 2017-11-18 -1.216806e+00 0.2629971336
14203 2017-11-19 -1.122674e+00 0.3469995055
14204 2017-11-20 -1.242597e+00 0.4553094014
14205 2017-11-21 -1.294885e+00 0.5049438231
14206 2017-11-22 -9.325514e-01 0.4684133163
14207 2017-11-23 -4.632281e-01 0.4071673624
14208 2017-11-24 -9.689322e-02 0.3710270269
14209 2017-11-25 4.704467e-01 0.4126721465
14210 2017-11-26 8.682453e-01 0.3745057653
14211 2017-11-27 5.105564e-01 0.2373454931
14212 2017-11-28 4.747265e-01 0.1650783370
14213 2017-11-29 5.905379e-01 0.2632154120
14214 2017-11-30 4.083787e-01 0.3888834762
14215 2017-12-01 3.451736e-01 0.5008047592
14216 2017-12-02 5.161312e-01 0.5388177242
14217 2017-12-03 7.109279e-01 0.5515360710
14218 2017-12-04 4.458635e-01 0.5127537202
14219 2017-12-05 -3.986610e-01 0.3896493238
14220 2017-12-06 -5.968253e-01 0.1095843268
14221 2017-12-07 -1.604398e-01 -0.2455506506
14222 2017-12-08 -4.384744e-01 -0.5801038215
14223 2017-12-09 -7.255016e-01 -0.8384627087
14224 2017-12-10 -9.691828e-01 -0.9223171538
14225 2017-12-11 -1.140588e+00 -0.8177806761
14226 2017-12-12 -1.956622e-01 -0.5250998474
14227 2017-12-13 -1.083792e-01 -0.3430768534
14228 2017-12-14 -8.016345e-02 -0.3163476104
14229 2017-12-15 8.899266e-01 -0.2813253830
14230 2017-12-16 1.322833e+00 -0.2545953062
14231 2017-12-17 1.547972e+00 -0.2275373110
14232 2017-12-18 2.164907e+00 -0.3217205817
14233 2017-12-19 2.276258e+00 -0.5773412429
14234 2017-12-20 1.862291e+00 -0.7728091393
14235 2017-12-21 1.125083e+00 -0.9099696881
14236 2017-12-22 7.737118e-01 -1.2441963604
14237 2017-12-23 7.863508e-01 -1.4802661587
14238 2017-12-24 4.313111e-01 -1.4111320559
14239 2017-12-25 -8.814799e-02 -1.0024805520
14240 2017-12-26 -3.615127e-01 -0.4943077147
14241 2017-12-27 -5.011363e-01 -0.0308588186
14242 2017-12-28 -8.474088e-01 0.3717555895
14243 2017-12-29 -7.283247e-01 0.8230450219
14244 2017-12-30 -4.566981e-01 1.2495961116
14245 2017-12-31 -4.577034e-01 1.4805369230
14246 2018-01-01 1.946166e-01 1.5310004017
14247 2018-01-02 5.203149e-01 1.5384595802
14248 2018-01-03 5.024570e-02 1.4036679018
14249 2018-01-04 -7.065297e-01 1.0749574137
14250 2018-01-05 -8.741815e-01 0.7608524752
14251 2018-01-06 1.589530e-01 0.7891084646
14252 2018-01-07 8.632378e-01 1.1230358751
I am using
lagged <- lag(ts(x), k=10)
This is so the tsp isn't ignored. However, when I do
cor(data$x, data$y)
and
cor(lagged, data$y)
the result is the same, where I would have thought it would be different. How do I get this lag to work before I go ahead and separate by date?
Many thanks!
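For what it's worth, a minimal sketch of why the correlation comes out identical and how an explicit shift differs: stats::lag() on a ts object only shifts the time attribute (tsp); the underlying values are untouched, so cor() sees two identical vectors. Shifting the values themselves, here with dplyr::lag() on the x/y columns from the question, gives a genuinely lagged series:
library(dplyr)
lagged_ts <- stats::lag(ts(data$x), k = 10)   # same values, only the tsp attribute moves
identical(as.numeric(lagged_ts), data$x)      # TRUE, which is why cor() does not change
lagged <- dplyr::lag(data$x, n = 10)          # values actually shifted; first 10 are NA
cor(lagged, data$y, use = "complete.obs")     # correlation at a 10-step lag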
I want to subtract timearriving from timeA and timeleaving from timeL, but I get this error:
"Error in unclass(e1) - e2 : non-numeric argument to binary operator"
When you see that error message, it means that you're trying to perform a binary operation with something that isn't a number. I understand the error, but I wanted to ask whether there is a way I can perform these calculations.
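For illustration, a minimal example of what seems to trigger it (assuming timearriving is stored as plain character, which the "NULL" entries suggest): a date-time minus a character string leaves "-" nothing numeric to work with.
as.POSIXct("2016-09-22 10:00:00") - "07:55:06"
# Error: non-numeric argument to binary operator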
I provided a sample of my dataset:
number id location timearriving timeleaving timeA timeL person late
1 214980 900264 1001.18 NULL NULL 2016-09-15 10:00:00 2016-09-15 12:00:00 Teacher
2 215708 900264 1001.18 07:55:06 09:59:58 2016-09-22 10:00:00 2016-09-22 12:00:00 Teacher
3 216388 900264 1001.18 08:00:22 09:54:06 2016-09-29 10:00:00 2016-09-29 12:00:00 Teacher
4 217106 900264 1001.18 08:40:15 09:53:07 2016-10-05 10:00:00 2016-10-05 12:00:00 Teacher
5 217250 900264 1001.18 08:03:47 09:52:59 2016-10-06 10:00:00 2016-10-06 12:00:00 Teacher
6 217808 900264 1001.18 NULL NULL 2016-10-12 10:00:00 2016-10-12 12:00:00 Teacher
7 217952 900264 1001.18 08:01:44 09:51:45 2016-10-13 10:00:00 2016-10-13 12:00:00 Teacher
8 218640 900264 1001.18 08:04:04 09:57:24 2016-10-19 10:00:00 2016-10-19 12:00:00 Teacher
9 218788 900264 1001.18 07:59:52 09:50:17 2016-10-20 10:00:00 2016-10-20 12:00:00 Teacher
10 219397 900264 1001.18 08:01:06 09:51:05 2016-10-26 10:00:00 2016-10-26 12:00:00 Teacher
11 219541 900264 1001.18 08:05:29 09:56:04 2016-10-27 10:00:00 2016-10-27 12:00:00 Teacher
12 220273 900264 1001.18 08:09:20 09:57:46 2016-11-02 09:00:00 2016-11-02 11:00:00 Teacher
13 220419 900264 1001.18 08:09:05 09:59:53 2016-11-03 09:00:00 2016-11-03 11:00:00 Teacher
Here I added a new column with the name "late".
I want to compute timeA - timearriving.
I did this using this code:
dataset["late"] <- NA
dataset$late <- dataset$timeA - dataset$timearriving
then the error was:
Error in unclass(e1) - e2 : non-numeric argument to binary operator
Now I tried to convert them like you said:
timeA <- ymd_hms(timeA )
timearriving <- hms(timearriving )
Warning message:
In .parse_hms(..., order = "HMS", quiet = quiet) :
Some strings failed to parse
Since you don't provide a reproducible example I will illustrate using one value for each variable e.g.:
library(lubridate)
timeleaving <- hms("09:59:33")
timeA <- ymd_hms("2017-02-16 10:00:00")
You could use:
timeleaving <- ymd_hms(paste(floor_date(timeA, "days"), timeleaving))
dif <- timeA - timeleaving
Time difference of 27 secs
Edited since the data was added to the original question:
data$timeleaving <- hms(data$timeleaving)
data$timearriving <- hms(data$timearriving)
data$timeA <- ymd_hms(data$timeA )
data$timeL <- ymd_hms(data$timeL )
data$timeleaving <- ymd_hms(paste(floor_date(data$timeL, "days"), data$timeleaving))
data$timearriving <- ymd_hms(paste(floor_date(data$timeA, "days"), data$timearriving))
data$late <- data$timeA - data$timearriving
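If you want the result in a fixed unit rather than whatever unit "-" picks automatically, difftime() lets you set it explicitly (a usage sketch following on from the conversions above):
data$late <- difftime(data$timeA, data$timearriving, units = "mins")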
I have a data frame with missing values for "SNAP_ID". I'd like to fill in the missing values with floating point values based on a sequence from the previous non-missing value (lag()?). I would really like to achieve this using just dplyr if possible.
Assumptions:
There will never be missing data in the first or last row
I'm generating the missing dates based on missing days between the min and max of the data set
There can be multiple gaps in the data set
Current data:
end SNAP_ID
1 2015-06-26 12:59:00 365
2 2015-06-26 13:59:00 366
3 2015-06-27 00:01:00 NA
4 2015-06-27 23:00:00 NA
5 2015-06-28 00:01:00 NA
6 2015-06-28 23:00:00 NA
7 2015-06-29 09:00:00 367
8 2015-06-29 09:59:00 368
What I want to achieve:
end SNAP_ID
1 2015-06-26 12:59:00 365.0
2 2015-06-26 13:59:00 366.0
3 2015-06-27 00:01:00 366.1
4 2015-06-27 23:00:00 366.2
5 2015-06-28 00:01:00 366.3
6 2015-06-28 23:00:00 366.4
7 2015-06-29 09:00:00 367.0
8 2015-06-29 09:59:00 368.0
As a data frame:
df <- structure(list(end = structure(c(1435323540, 1435327140, 1435363260,
1435446000, 1435449660, 1435532400, 1435568400, 1435571940), tzone = "UTC", class = c("POSIXct",
"POSIXt")), SNAP_ID = c(365, 366, NA, NA, NA, NA, 367, 368)), .Names = c("end",
"SNAP_ID"), row.names = c(NA, -8L), class = "data.frame")
This was my attempt at achieving this goal, but it only works for the first missing value:
df %>%
arrange(end) %>%
mutate(SNAP_ID=ifelse(is.na(SNAP_ID),lag(SNAP_ID)+0.1,SNAP_ID))
end SNAP_ID
1 2015-06-26 12:59:00 365.0
2 2015-06-26 13:59:00 366.0
3 2015-06-27 00:01:00 366.1
4 2015-06-27 23:00:00 NA
5 2015-06-28 00:01:00 NA
6 2015-06-28 23:00:00 NA
7 2015-06-29 09:00:00 367.0
8 2015-06-29 09:59:00 368.0
The outstanding answer from @mathematical.coffee below:
df %>%
arrange(end) %>%
group_by(tmp=cumsum(!is.na(SNAP_ID))) %>%
mutate(SNAP_ID=SNAP_ID[1] + 0.1*(0:(length(SNAP_ID)-1))) %>%
ungroup() %>%
select(-tmp)
EDIT: new version works for any number of NA runs.
This one doesn't need zoo, either.
First, notice that tmp=cumsum(!is.na(SNAP_ID)) groups the SNAP_IDs such that groups with the same tmp consist of one non-NA value followed by a run of NA values.
Then group by this variable and just add .1 to the first SNAP_ID to fill out the NAs:
df %>%
arrange(end) %>%
group_by(tmp=cumsum(!is.na(SNAP_ID))) %>%
mutate(SNAP_ID=SNAP_ID[1] + 0.1*(0:(length(SNAP_ID)-1)))
end SNAP_ID tmp
1 2015-06-26 12:59:00 365.0 1
2 2015-06-26 13:59:00 366.0 2
3 2015-06-27 00:01:00 366.1 2
4 2015-06-27 23:00:00 366.2 2
5 2015-06-28 00:01:00 366.3 2
6 2015-06-28 23:00:00 366.4 2
7 2015-06-29 09:00:00 367.0 3
8 2015-06-29 09:59:00 368.0 4
Then you can drop the tmp column afterwards (add %>% select(-tmp) to the end).
EDIT: this is the old version which doesn't work for subsequent runs of NAs.
If your aim is to fill each NA with the previous value + 0.1, you can use zoo's na.locf (which fills each NA with the previous value), along with cumsum(is.na(SNAP_ID))*0.1 to add the extra 0.1.
library(zoo)
df %>%
arrange(end) %>%
mutate(SNAP_ID=ifelse(is.na(SNAP_ID),
na.locf(SNAP_ID) + cumsum(is.na(SNAP_ID))*0.1,
SNAP_ID))