I would like to lag one variable by, say, 10 time steps and plot it against the other variable, which stays unchanged. I want to repeat this for several lags to see whether there is a time period over which the first variable influences the other. The data are daily, and after lagging I keep only the Dec-Feb rows. The problem is that the plot and correlation for the lagged variable come out identical to the non-lagged plot and correlation every time, and I am not sure how to fix this.
A sample of my data frame "data" can be seen below.
Date x y
14158 2017-10-05 1.913918e+00 -0.1538234614
14159 2017-10-06 1.479714e+00 -0.1937094170
14160 2017-10-07 8.783669e-01 -0.1703790211
14161 2017-10-08 5.706581e-01 -0.1294144428
14162 2017-10-09 4.979405e-01 -0.0666569815
14163 2017-10-10 3.233477e-01 0.0072006102
14164 2017-10-11 3.057630e-01 0.0863445067
14165 2017-10-12 5.877673e-01 0.1097707831
14166 2017-10-13 1.208526e+00 0.1301967193
14167 2017-10-14 1.671705e+00 0.1728109268
14168 2017-10-15 1.810979e+00 0.2264911145
14169 2017-10-16 1.426651e+00 0.2702958315
14170 2017-10-17 1.241140e+00 0.3242637704
14171 2017-10-18 8.997498e-01 0.3879727861
14172 2017-10-19 5.594161e-01 0.4172990825
14173 2017-10-20 3.980254e-01 0.3915170864
14174 2017-10-21 2.138538e-01 0.3249736995
14175 2017-10-22 3.926440e-01 0.2224834840
14176 2017-10-23 2.268644e-01 0.0529143372
14177 2017-10-24 5.664923e-01 -0.0081443464
14178 2017-10-25 6.167520e-01 0.0312073984
14179 2017-10-26 7.751882e-02 0.0043897693
14180 2017-10-27 -5.634851e-02 -0.0726825266
14181 2017-10-28 -2.122061e-01 -0.1711305549
14182 2017-10-29 -8.500991e-01 -0.2068581639
14183 2017-10-30 -1.039685e+00 -0.2909120824
14184 2017-10-31 -3.057745e-01 -0.3933633317
14185 2017-11-01 -1.288774e-01 -0.3726346136
14186 2017-11-02 -5.608007e-03 -0.2425754386
14187 2017-11-03 4.853990e-01 -0.0503543980
14188 2017-11-04 5.822672e-01 0.0896130098
14189 2017-11-05 8.491505e-01 0.1299151006
14190 2017-11-06 1.052999e+00 0.0749888307
14191 2017-11-07 1.170470e+00 0.0287317882
14192 2017-11-08 7.919862e-01 0.0788187381
14193 2017-11-09 4.574565e-01 0.1539981316
14194 2017-11-10 4.552032e-01 0.2034393145
14195 2017-11-11 -3.621350e-01 0.2077476707
14196 2017-11-12 -8.053965e-01 0.1759558604
14197 2017-11-13 -8.307459e-01 0.1802858410
14198 2017-11-14 -9.421325e-01 0.2175529008
14199 2017-11-15 -9.880204e-01 0.2392924580
14200 2017-11-16 -7.448127e-01 0.2519253751
14201 2017-11-17 -8.081435e-01 0.2614254732
14202 2017-11-18 -1.216806e+00 0.2629971336
14203 2017-11-19 -1.122674e+00 0.3469995055
14204 2017-11-20 -1.242597e+00 0.4553094014
14205 2017-11-21 -1.294885e+00 0.5049438231
14206 2017-11-22 -9.325514e-01 0.4684133163
14207 2017-11-23 -4.632281e-01 0.4071673624
14208 2017-11-24 -9.689322e-02 0.3710270269
14209 2017-11-25 4.704467e-01 0.4126721465
14210 2017-11-26 8.682453e-01 0.3745057653
14211 2017-11-27 5.105564e-01 0.2373454931
14212 2017-11-28 4.747265e-01 0.1650783370
14213 2017-11-29 5.905379e-01 0.2632154120
14214 2017-11-30 4.083787e-01 0.3888834762
14215 2017-12-01 3.451736e-01 0.5008047592
14216 2017-12-02 5.161312e-01 0.5388177242
14217 2017-12-03 7.109279e-01 0.5515360710
14218 2017-12-04 4.458635e-01 0.5127537202
14219 2017-12-05 -3.986610e-01 0.3896493238
14220 2017-12-06 -5.968253e-01 0.1095843268
14221 2017-12-07 -1.604398e-01 -0.2455506506
14222 2017-12-08 -4.384744e-01 -0.5801038215
14223 2017-12-09 -7.255016e-01 -0.8384627087
14224 2017-12-10 -9.691828e-01 -0.9223171538
14225 2017-12-11 -1.140588e+00 -0.8177806761
14226 2017-12-12 -1.956622e-01 -0.5250998474
14227 2017-12-13 -1.083792e-01 -0.3430768534
14228 2017-12-14 -8.016345e-02 -0.3163476104
14229 2017-12-15 8.899266e-01 -0.2813253830
14230 2017-12-16 1.322833e+00 -0.2545953062
14231 2017-12-17 1.547972e+00 -0.2275373110
14232 2017-12-18 2.164907e+00 -0.3217205817
14233 2017-12-19 2.276258e+00 -0.5773412429
14234 2017-12-20 1.862291e+00 -0.7728091393
14235 2017-12-21 1.125083e+00 -0.9099696881
14236 2017-12-22 7.737118e-01 -1.2441963604
14237 2017-12-23 7.863508e-01 -1.4802661587
14238 2017-12-24 4.313111e-01 -1.4111320559
14239 2017-12-25 -8.814799e-02 -1.0024805520
14240 2017-12-26 -3.615127e-01 -0.4943077147
14241 2017-12-27 -5.011363e-01 -0.0308588186
14242 2017-12-28 -8.474088e-01 0.3717555895
14243 2017-12-29 -7.283247e-01 0.8230450219
14244 2017-12-30 -4.566981e-01 1.2495961116
14245 2017-12-31 -4.577034e-01 1.4805369230
14246 2018-01-01 1.946166e-01 1.5310004017
14247 2018-01-02 5.203149e-01 1.5384595802
14248 2018-01-03 5.024570e-02 1.4036679018
14249 2018-01-04 -7.065297e-01 1.0749574137
14250 2018-01-05 -8.741815e-01 0.7608524752
14251 2018-01-06 1.589530e-01 0.7891084646
14252 2018-01-07 8.632378e-01 1.1230358751
I am using
lagged <- lag(ts(x), k=10)
This is so the tsp isn't ignored. However, when I do
cor(data$x, data$y)
and
cor(lagged, data$y)
the result is the same, where I would have expected it to be different. How do I get this lag to work before I go ahead and separate by date?
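One likely explanation, offered as a sketch rather than a confirmed diagnosis: stats::lag() on a ts object only shifts the time index (tsp) and leaves the values where they are, while cor() pairs values by position, so both calls see the same numbers. The example below uses synthetic data in place of the real frame, and shift_by() is a hypothetical helper, not a function from any package. It shifts the values themselves and only then subsets to Dec-Feb.

```r
# Hypothetical helper: row t of the result holds v[t - k]; the first k rows are NA
shift_by <- function(v, k) c(rep(NA, k), v[seq_len(length(v) - k)])

# Synthetic stand-in for the real data frame (assumption: same column names)
set.seed(1)
data <- data.frame(
  Date = seq(as.Date("2017-10-05"), by = "day", length.out = 95),
  x = rnorm(95),
  y = rnorm(95)
)

# stats::lag() changes only the time attributes, so the values line up exactly as before
ts_lagged <- stats::lag(ts(data$x), k = 10)
same_values <- identical(as.numeric(ts_lagged), data$x)

# Shift the values, then correlate while dropping the NA rows created by the shift
data$x_lag10 <- shift_by(data$x, 10)
r_lag10 <- cor(data$x_lag10, data$y, use = "complete.obs")

# Subset to Dec-Feb only after lagging, so the lag is taken over consecutive days
winter <- data[format(data$Date, "%m") %in% c("12", "01", "02"), ]
r_winter <- cor(winter$x_lag10, winter$y, use = "complete.obs")
```

Looping shift_by() over a vector of lags gives one correlation per lag; stats::ccf() is another option if the seasonal subsetting is not needed.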
Many thanks!
I am trying to cross-validate a Prophet model in R.
The problem - this package does not work well with monthly data.
I managed to build the model and even used a custom monthly seasonality, as recommended by the authors of this tool.
But I cannot cross-validate monthly data. I tried to follow the recommendations in the GitHub issue, but I am missing something.
Currently my code looks like this
model1_cv <- cross_validation(model1, initial = 156, period = 365/12, as.difftime(horizon = 365/12, units = "days"))
Updated:
Based on the answer to this question, I visualized the CV results. There are some problems here. I used full data and partial data.
Also, the metrics do not look that good.
I tested a bit with training data from the package, and from what I understood the package is not really well suited to monthly forecasts. This part, [...] as.difftime(365/12, units = "days") [...], seems to have been given just to express the length of a month as roughly 30 days. That means you can use it instead of plain 365/12 for "period" and/or "horizon". One thing I noticed is that both arguments are documented as integers, but inside the function they are converted with as.difftime(), so they are actually doubles.
library(dplyr)
library(prophet)
library(data.table)
#training data
df <- data.table::fread("ds y
1992-01-01 146376
1992-02-01 147079
1992-03-01 159336
1992-04-01 163669
1992-05-01 170068
1992-06-01 168663
1992-07-01 169890
1992-08-01 170364
1992-09-01 164617
1992-10-01 173655
1992-11-01 171547
1992-12-01 208838
1993-01-01 153221
1993-02-01 150087
1993-03-01 170439
1993-04-01 176456
1993-05-01 182231
1993-06-01 181535
1993-07-01 183682
1993-08-01 183318
1993-09-01 177406
1993-10-01 182737
1993-11-01 187443
1993-12-01 224540
1994-01-01 161349
1994-02-01 162841
1994-03-01 192319
1994-04-01 189569
1994-05-01 194927
1994-06-01 197946
1994-07-01 193355
1994-08-01 202388
1994-09-01 193954
1994-10-01 197956
1994-11-01 202520
1994-12-01 241111
1995-01-01 175344
1995-02-01 172138
1995-03-01 201279
1995-04-01 196039
1995-05-01 210478
1995-06-01 211844
1995-07-01 203411
1995-08-01 214248
1995-09-01 202122
1995-10-01 204044
1995-11-01 212190
1995-12-01 247491
1996-01-01 185019
1996-02-01 192380
1996-03-01 212110
1996-04-01 211718
1996-05-01 226936
1996-06-01 217511
1996-07-01 218111")
df <- df %>%
dplyr::mutate(ds = as.Date(ds))
model <- prophet::prophet(df)
(tscv.myfit <- prophet::cross_validation(model, horizon = 365/12, units = "days", period = 365/12, initial = 365/12 * 12 * 3))
y ds yhat yhat_lower yhat_upper cutoff
1: 175344 1995-01-01 170988.8 170145.9 171828.0 1994-12-31 02:00:00
2: 172138 1995-02-01 178117.4 176975.2 179070.2 1995-01-30 12:00:00
3: 201279 1995-03-01 211462.8 210277.4 212670.8 1995-01-30 12:00:00
4: 196039 1995-04-01 200113.9 198079.5 201977.8 1995-03-01 22:00:00
5: 210478 1995-05-01 202100.5 200390.8 203797.9 1995-04-01 08:00:00
6: 211844 1995-06-01 208330.5 206229.9 210497.4 1995-05-01 18:00:00
7: 203411 1995-07-01 202563.8 200786.5 204313.0 1995-06-01 04:00:00
8: 214248 1995-08-01 214639.6 212748.3 216461.3 1995-07-01 14:00:00
9: 202122 1995-09-01 204954.0 203048.9 206768.4 1995-08-31 12:00:00
10: 204044 1995-10-01 205097.5 203209.7 206882.3 1995-09-30 22:00:00
11: 212190 1995-11-01 213586.7 211728.1 215617.6 1995-10-31 08:00:00
12: 247491 1995-12-01 251518.8 249708.2 253589.2 1995-11-30 18:00:00
13: 185019 1996-01-01 182403.7 180520.1 184494.7 1995-12-31 04:00:00
14: 192380 1996-02-01 184722.9 182772.7 186686.9 1996-01-30 14:00:00
15: 212110 1996-03-01 205020.1 202823.2 206996.9 1996-01-30 14:00:00
16: 211718 1996-04-01 214514.0 211891.9 217175.3 1996-03-31 14:00:00
17: 226936 1996-05-01 218845.2 216133.8 221420.4 1996-03-31 14:00:00
18: 217511 1996-06-01 218672.2 216007.8 221459.9 1996-05-31 14:00:00
19: 218111 1996-07-01 221156.1 218540.7 224184.1 1996-05-31 14:00:00
The cutoffs are not as regular as one would expect. I guess this is due to using the average number of days per month somehow, though I could not figure out the logic. You can replace 365/12 with as.difftime(365/12, units = "days") and you will get the same result.
But if you use (365+365+365+366) / 48 instead, to account for 29 February, you get a slightly different average number of days per month, and this leads to a different output:
(tscv.myfit_2 <- prophet::cross_validation(model, horizon = (365+365+365+366)/48, units = "days", period = (365+365+365+366)/48, initial = (365+365+365+366)/48 * 12 * 3))
y ds yhat yhat_lower yhat_upper cutoff
1: 172138 1995-02-01 178117.4 177075.3 179203.9 1995-01-29 13:30:00
2: 201279 1995-03-01 211462.8 210340.5 212607.3 1995-01-29 13:30:00
3: 196039 1995-04-01 200113.9 198022.6 202068.1 1995-03-31 13:30:00
4: 210478 1995-05-01 204100.2 202009.8 206098.7 1995-03-31 13:30:00
5: 211844 1995-06-01 208330.5 206114.5 210515.8 1995-05-31 13:30:00
6: 203411 1995-07-01 202606.0 200319.1 204663.4 1995-05-31 13:30:00
7: 214248 1995-08-01 214639.6 212684.4 216495.7 1995-07-31 22:30:00
8: 202122 1995-09-01 204954.0 203127.7 206951.0 1995-08-31 09:00:00
9: 204044 1995-10-01 205097.5 203285.3 207036.5 1995-09-30 19:30:00
10: 212190 1995-11-01 213586.7 211516.8 215516.2 1995-10-31 06:00:00
11: 247491 1995-12-01 251518.8 249658.3 253590.1 1995-11-30 16:30:00
12: 185019 1996-01-01 182403.7 180359.7 184399.2 1995-12-31 03:00:00
13: 192380 1996-02-01 184722.9 182652.4 186899.8 1996-01-30 13:30:00
14: 212110 1996-03-01 205020.1 203040.3 207171.9 1996-01-30 13:30:00
15: 211718 1996-04-01 214514.0 211942.6 217252.6 1996-03-31 13:30:00
16: 226936 1996-05-01 218845.2 216203.1 221506.5 1996-03-31 13:30:00
17: 217511 1996-06-01 218672.2 215823.9 221292.4 1996-05-31 13:30:00
18: 218111 1996-07-01 221156.1 218236.7 223862.0 1996-05-31 13:30:00
From this behaviour I would say the workaround is not ideal, especially depending on how exact you want the cross-validation to be in terms of rolling months. If you need the cutoff points to be exact, you could write your own function that always predicts one month ahead from the starting point, collect these results, and build the final comparison. I would trust this approach more than the workaround.
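As a rough illustration of the "write your own function" idea, the sketch below builds exact month-aligned cutoffs with base R only. It covers just the date bookkeeping: fitting a model on each training window and predicting the held-out month is left out, and the 36-month initial window is an assumption chosen to mirror the examples above.

```r
# Month-start dates matching the training data above (1992-01 to 1996-07)
ds <- seq(as.Date("1992-01-01"), as.Date("1996-07-01"), by = "month")

# Assumption: the first fit uses 36 months of history
initial_months <- 36

# Each cutoff is the last date in a training window; the following month is held out
cutoffs <- ds[initial_months:(length(ds) - 1)]

folds <- lapply(seq_along(cutoffs), function(i) {
  cutoff <- cutoffs[i]
  list(
    cutoff = cutoff,
    train  = ds[ds <= cutoff],     # dates available for fitting
    test   = ds[ds > cutoff][1]    # the single month to predict
  )
})

# Quick overview of the resulting folds
fold_table <- data.frame(
  cutoff  = cutoffs,
  test    = do.call(c, lapply(folds, function(f) f$test)),
  n_train = vapply(folds, function(f) length(f$train), integer(1))
)
```

Because the cutoffs come from seq(..., by = "month"), they always fall on the first of a month and do not drift the way the 365/12-day steps do.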
I have a data frame, df, that has date and two variables in it. I would like to either extract all of Oct-Dec data or delete the other months data from the data frame.
I have put the data into a data frame, but at the moment it holds the whole year and I just want to extract the months I need. In future I will also be extracting just winter data. I have attached a chunk of my data frame. I tried using format() with just %m but couldn't get it to work.
14138 2017-09-15 4.655946e-01 0.0603515884
14139 2017-09-16 7.881137e-01 0.0479933304
14140 2017-09-17 5.018990e-01 0.0256871025
14141 2017-09-18 -1.583625e-01 -0.0040893990
14142 2017-09-19 -6.733220e-01 -0.0313100989
14143 2017-09-20 -1.225730e+00 -0.0587706331
14144 2017-09-21 -1.419133e+00 -0.0958125544
14145 2017-09-22 -1.338630e+00 -0.0902803173
14146 2017-09-23 -1.272554e+00 -0.0659170673
14147 2017-09-24 -1.132318e+00 -0.0387240370
14148 2017-09-25 -1.255414e+00 -0.0392615823
14149 2017-09-26 -1.497188e+00 -0.0438491356
14150 2017-09-27 -1.427622e+00 -0.0633879185
14151 2017-09-28 -1.051756e+00 -0.0992427127
14152 2017-09-29 -4.876309e-01 -0.1448044528
14153 2017-09-30 -6.829681e-02 -0.1749463647
14154 2017-10-01 -1.413768e-01 -0.2009916094
14155 2017-10-02 6.359742e-02 -0.1975848313
14156 2017-10-03 9.103277e-01 -0.1828581805
14157 2017-10-04 1.695776e+00 -0.1589352546
14158 2017-10-05 1.913918e+00 -0.1538234614
14159 2017-10-06 1.479714e+00 -0.1937094170
14160 2017-10-07 8.783669e-01 -0.1703790211
14161 2017-10-08 5.706581e-01 -0.1294144428
14162 2017-10-09 4.979405e-01 -0.0666569815
14163 2017-10-10 3.233477e-01 0.0072006102
14164 2017-10-11 3.057630e-01 0.0863445067
14165 2017-10-12 5.877673e-01 0.1097707831
14166 2017-10-13 1.208526e+00 0.1301967193
14167 2017-10-14 1.671705e+00 0.1728109268
14168 2017-10-15 1.810979e+00 0.2264911145
14169 2017-10-16 1.426651e+00 0.2702958315
14170 2017-10-17 1.241140e+00 0.3242637704
14171 2017-10-18 8.997498e-01 0.3879727861
14172 2017-10-19 5.594161e-01 0.4172990825
14173 2017-10-20 3.980254e-01 0.3915170864
14174 2017-10-21 2.138538e-01 0.3249736995
14175 2017-10-22 3.926440e-01 0.2224834840
14176 2017-10-23 2.268644e-01 0.0529143372
14177 2017-10-24 5.664923e-01 -0.0081443464
14178 2017-10-25 6.167520e-01 0.0312073984
14179 2017-10-26 7.751882e-02 0.0043897693
14180 2017-10-27 -5.634851e-02 -0.0726825266
14181 2017-10-28 -2.122061e-01 -0.1711305549
14182 2017-10-29 -8.500991e-01 -0.2068581639
14183 2017-10-30 -1.039685e+00 -0.2909120824
14184 2017-10-31 -3.057745e-01 -0.3933633317
14185 2017-11-01 -1.288774e-01 -0.3726346136
14186 2017-11-02 -5.608007e-03 -0.2425754386
14187 2017-11-03 4.853990e-01 -0.0503543980
14188 2017-11-04 5.822672e-01 0.0896130098
14189 2017-11-05 8.491505e-01 0.1299151006
14190 2017-11-06 1.052999e+00 0.0749888307
14191 2017-11-07 1.170470e+00 0.0287317882
14192 2017-11-08 7.919862e-01 0.0788187381
14193 2017-11-09 4.574565e-01 0.1539981316
14194 2017-11-10 4.552032e-01 0.2034393145
14195 2017-11-11 -3.621350e-01 0.2077476707
14196 2017-11-12 -8.053965e-01 0.1759558604
14197 2017-11-13 -8.307459e-01 0.1802858410
14198 2017-11-14 -9.421325e-01 0.2175529008
14199 2017-11-15 -9.880204e-01 0.2392924580
14200 2017-11-16 -7.448127e-01 0.2519253751
14201 2017-11-17 -8.081435e-01 0.2614254732
14202 2017-11-18 -1.216806e+00 0.2629971336
14203 2017-11-19 -1.122674e+00 0.3469995055
14204 2017-11-20 -1.242597e+00 0.4553094014
14205 2017-11-21 -1.294885e+00 0.5049438231
14206 2017-11-22 -9.325514e-01 0.4684133163
14207 2017-11-23 -4.632281e-01 0.4071673624
14208 2017-11-24 -9.689322e-02 0.3710270269
14209 2017-11-25 4.704467e-01 0.4126721465
14210 2017-11-26 8.682453e-01 0.3745057653
14211 2017-11-27 5.105564e-01 0.2373454931
14212 2017-11-28 4.747265e-01 0.1650783370
14213 2017-11-29 5.905379e-01 0.2632154120
14214 2017-11-30 4.083787e-01 0.3888834762
14215 2017-12-01 3.451736e-01 0.5008047592
14216 2017-12-02 5.161312e-01 0.5388177242
14217 2017-12-03 7.109279e-01 0.5515360710
14218 2017-12-04 4.458635e-01 0.5127537202
14219 2017-12-05 -3.986610e-01 0.3896493238
14220 2017-12-06 -5.968253e-01 0.1095843268
14221 2017-12-07 -1.604398e-01 -0.2455506506
14222 2017-12-08 -4.384744e-01 -0.5801038215
14223 2017-12-09 -7.255016e-01 -0.8384627087
14224 2017-12-10 -9.691828e-01 -0.9223171538
14225 2017-12-11 -1.140588e+00 -0.8177806761
14226 2017-12-12 -1.956622e-01 -0.5250998474
14227 2017-12-13 -1.083792e-01 -0.3430768534
14228 2017-12-14 -8.016345e-02 -0.3163476104
14229 2017-12-15 8.899266e-01 -0.2813253830
14230 2017-12-16 1.322833e+00 -0.2545953062
14231 2017-12-17 1.547972e+00 -0.2275373110
14232 2017-12-18 2.164907e+00 -0.3217205817
14233 2017-12-19 2.276258e+00 -0.5773412429
14234 2017-12-20 1.862291e+00 -0.7728091393
14235 2017-12-21 1.125083e+00 -0.9099696881
14236 2017-12-22 7.737118e-01 -1.2441963604
14237 2017-12-23 7.863508e-01 -1.4802661587
14238 2017-12-24 4.313111e-01 -1.4111320559
14239 2017-12-25 -8.814799e-02 -1.0024805520
14240 2017-12-26 -3.615127e-01 -0.4943077147
14241 2017-12-27 -5.011363e-01 -0.0308588186
14242 2017-12-28 -8.474088e-01 0.3717555895
14243 2017-12-29 -7.283247e-01 0.8230450219
14244 2017-12-30 -4.566981e-01 1.2495961116
14245 2017-12-31 -4.577034e-01 1.4805369230
14246 2018-01-01 1.946166e-01 1.5310004017
14247 2018-01-02 5.203149e-01 1.5384595802
14248 2018-01-03 5.024570e-02 1.4036679018
14249 2018-01-04 -7.065297e-01 1.0749574137
14250 2018-01-05 -8.741815e-01 0.7608524752
14251 2018-01-06 1.589530e-01 0.7891084646
14252 2018-01-07 8.632378e-01 1.1230358751
As requested, the class is "Date".
You can use lubridate and base R:
library(lubridate)
dats[month(ymd(dats$V2)) >= 10,]
# EDIT if the class of the date variable is date, it should be only
dats[month(dats$V2) >= 10,]
Or fully base without any date work:
dats[substr(dats$V2,6,7) %in% c("10","11","12"),]
With data:
V1 V2 V3 V4
1 14138 2017-09-15 0.4655946 0.06035159
2 14139 2017-09-16 0.7881137 0.04799333
...
From your question, it is unclear what format the date variable is in. Maybe add the output of class(your_date_variable) to the question. As a general rule, though, you'll want to use filter from the dplyr package. Something like this:
new_data <- data %>% filter(as.integer(format(date_variable, "%m")) >= 10)
This might change slightly depending on the class of your date variable.
Assuming the 'date_variable' is Date class, extract the month and do a comparison in filter (action verb from dplyr)
library(dplyr)
library(lubridate)
data %>%
filter(month(date_variable) >= 10)
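For the winter case mentioned in the question, a comparison such as >= 10 no longer works, because Dec-Feb wraps around the end of the year. A minimal base R sketch, assuming a Date-class column named Date and using a synthetic frame in place of the real data:

```r
# Synthetic stand-in covering the same date range as the sample above
df <- data.frame(Date = seq(as.Date("2017-09-15"), as.Date("2018-01-07"), by = "day"))
df$x <- seq_len(nrow(df))

# Month as an integer, so comparisons are numeric rather than string-based
mon <- as.integer(format(df$Date, "%m"))

# Oct-Dec: a simple threshold is enough
oct_dec <- df[mon >= 10, ]

# Dec-Feb: the months are not contiguous as numbers, so test set membership instead
winter <- df[mon %in% c(12, 1, 2), ]
```

The same %in% test drops straight into dplyr::filter() or lubridate::month() if those packages are already in use.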
I have two very similar csv files. Stock prices for 2 different stocks downloaded from the same source in the same format. However, read.csv in R is reading them differently.
> tab1=read.csv(path1)
> tab2=read.csv(path2)
> head(tab1)
Date Open High Low Close Volume Adj.Close
1 2014-12-01 158.35 162.92 157.12 157.12 2719100 156.1488
2 2014-11-03 153.14 160.86 152.98 160.09 2243400 159.1004
3 2014-10-01 141.16 154.44 130.60 153.77 3825900 152.0036
4 2014-09-02 143.30 147.87 140.66 141.68 2592900 140.0525
5 2014-08-01 140.15 145.39 138.43 144.00 2027100 142.3459
6 2014-07-01 143.41 146.43 140.60 140.89 2131100 138.4461
> head(tab2)
Date Open High Low Close Volume Adj.Close
1 12/1/2014 73.39 75.20 71.75 72.29 1561400 71.92211
2 11/3/2014 69.28 74.92 67.88 73.74 1421600 72.97650
3 10/1/2014 66.18 74.95 63.42 69.21 1775400 68.49341
4 9/2/2014 68.34 68.57 65.49 66.32 1249200 65.63333
5 8/1/2014 67.45 68.99 65.88 68.26 1655400 67.20743
6 7/1/2014 64.07 69.50 63.09 67.46 1733600 66.41976
If I try to use colClasses in read.csv then the dates for the second table are read incorrectly.
> tab1=read.csv(path1,colClasses=c("Date",rep("numeric",6)))
> tab2=read.csv(path2,colClasses=c("Date",rep("numeric",6)))
> head(tab1)
Date Open High Low Close Volume Adj.Close
1 2014-12-01 158.35 162.92 157.12 157.12 2719100 156.1488
2 2014-11-03 153.14 160.86 152.98 160.09 2243400 159.1004
3 2014-10-01 141.16 154.44 130.60 153.77 3825900 152.0036
4 2014-09-02 143.30 147.87 140.66 141.68 2592900 140.0525
5 2014-08-01 140.15 145.39 138.43 144.00 2027100 142.3459
6 2014-07-01 143.41 146.43 140.60 140.89 2131100 138.4461
> head(tab2)
Date Open High Low Close Volume Adj.Close
1 0012-01-20 73.39 75.20 71.75 72.29 1561400 71.92211
2 0011-03-20 69.28 74.92 67.88 73.74 1421600 72.97650
3 0010-01-20 66.18 74.95 63.42 69.21 1775400 68.49341
4 0009-02-20 68.34 68.57 65.49 66.32 1249200 65.63333
5 0008-01-20 67.45 68.99 65.88 68.26 1655400 67.20743
6 0007-01-20 64.07 69.50 63.09 67.46 1733600 66.41976
Not sure how I can make this issue reproducible without attaching the .csv files. I'm attaching snapshots of the two files. Any help will be appreciated.
Thanks
This can be solved by reading in the dates as a character vector and then calling strptime() inside transform():
transform(read.csv(path2,colClasses=c('character',rep('numeric',6))),Date=as.Date(strptime(Date,'%m/%d/%Y')));
## Date Open High Low Close Volume Adj.Close
## 1 2014-12-01 73.39 75.20 71.75 72.29 1561400 71.92211
## 2 2014-11-03 69.28 74.92 67.88 73.74 1421600 72.97650
## 3 2014-10-01 66.18 74.95 63.42 69.21 1775400 68.49341
## 4 2014-09-02 68.34 68.57 65.49 66.32 1249200 65.63333
## 5 2014-08-01 67.45 68.99 65.88 68.26 1655400 67.20743
## 6 2014-07-01 64.07 69.50 63.09 67.46 1733600 66.41976
Edit: You can try to "detect" the date format dynamically using your own assumptions, but this will only be as reliable as your assumptions:
readStockData <- function(path) {
tab <- read.csv(path,colClasses=c('character',rep('numeric',6)));
tab$Date <- as.Date(tab$Date,if (grepl('^\\d+/\\d+/\\d+$',tab$Date[1])) '%m/%d/%Y' else '%Y-%m-%d');
tab;
};
readStockData(path1);
## Date Open High Low Close Volume Adj.Close
## 1 2014-12-01 158.35 162.92 157.12 157.12 2719100 156.1488
## 2 2014-11-03 153.14 160.86 152.98 160.09 2243400 159.1004
## 3 2014-10-01 141.16 154.44 130.60 153.77 3825900 152.0036
## 4 2014-09-02 143.30 147.87 140.66 141.68 2592900 140.0525
## 5 2014-08-01 140.15 145.39 138.43 144.00 2027100 142.3459
## 6 2014-07-01 143.41 146.43 140.60 140.89 2131100 138.4461
readStockData(path2);
## Date Open High Low Close Volume Adj.Close
## 1 2014-12-01 73.39 75.20 71.75 72.29 1561400 71.92211
## 2 2014-11-03 69.28 74.92 67.88 73.74 1421600 72.97650
## 3 2014-10-01 66.18 74.95 63.42 69.21 1775400 68.49341
## 4 2014-09-02 68.34 68.57 65.49 66.32 1249200 65.63333
## 5 2014-08-01 67.45 68.99 65.88 68.26 1655400 67.20743
## 6 2014-07-01 64.07 69.50 63.09 67.46 1733600 66.41976
In the above I've made the assumption that there is at least one record in the file and that all records use the same Date format, thus the first Date value (tab$Date[1]) can be used for the detection.
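An alternative sketch, assuming R 3.5.0 or later: as.Date() accepts a tryFormats argument and picks the first listed format that parses the first non-NA value, which relies on the same "one format per file" assumption. The inline CSV text and the read_dates() name are made up for illustration.

```r
# Two tiny inline files standing in for path1 and path2 (illustrative data only)
csv_iso <- "Date,Close\n2014-12-01,157.12\n2014-11-03,160.09"
csv_us  <- "Date,Close\n12/1/2014,72.29\n11/3/2014,73.74"

# Hypothetical reader: keep Date as character, then let as.Date() pick the format
read_dates <- function(txt) {
  tab <- read.csv(text = txt, colClasses = c("character", "numeric"))
  # Only these two formats are tried, in order; the default "%Y/%m/%d" is excluded
  tab$Date <- as.Date(tab$Date, tryFormats = c("%Y-%m-%d", "%m/%d/%Y"))
  tab
}

tab_iso <- read_dates(csv_iso)
tab_us  <- read_dates(csv_us)
```

Listing the formats explicitly matters here: the default fallback "%Y/%m/%d" is what turns 12/1/2014 into 0012-01-20 in the question.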
We have the following function to compute monthly returns from a daily series of prices:
PricesRet = diff(Prices)/lag(Prices,k=-1)
tail(PricesRet)
# Monthly simple returns
MonRet = aggregate(PricesRet+1, as.yearmon, prod)-1
tail(MonRet)
The problem is that it returns wrong values. Take, for example, the simple return for the month of Feb 2013: the function returns -0.003517301, while it should have been -0.01304773.
Why does that happen?
Here are the last prices observations:
> tail(Prices,30)
Prices
2013-01-22 165.5086
2013-01-23 165.2842
2013-01-24 168.4845
2013-01-25 170.6041
2013-01-28 169.7373
2013-01-29 169.8724
2013-01-30 170.6554
2013-01-31 170.7210
2013-02-01 173.8043
2013-02-04 172.2145
2013-02-05 172.8400
2013-02-06 172.8333
2013-02-07 171.3586
2013-02-08 170.5602
2013-02-11 171.2172
2013-02-12 171.4126
2013-02-13 171.8687
2013-02-14 170.7955
2013-02-15 171.2848
2013-02-19 170.9482
2013-02-20 171.6355
2013-02-21 170.0300
2013-02-22 169.9319
2013-02-25 170.9035
2013-02-26 168.6822
2013-02-27 168.5180
2013-02-28 168.4935
2013-03-01 169.6546
2013-03-04 169.3076
2013-03-05 169.0579
Here are price returns:
> tail(PricesRet,50)
PricesRet
2012-12-18 0.0055865274
2012-12-19 -0.0015461900
2012-12-20 -0.0076140194
2012-12-23 0.0032656346
2012-12-26 0.0147750923
2012-12-27 0.0013482760
2012-12-30 -0.0004768131
2013-01-01 0.0128908541
2013-01-02 -0.0047646818
2013-01-03 0.0103372029
2013-01-06 -0.0024547278
2013-01-07 -0.0076920352
2013-01-08 0.0064368720
2013-01-09 0.0119663301
2013-01-10 0.0153828814
2013-01-13 0.0050590540
2013-01-14 -0.0053324785
2013-01-15 -0.0027043105
2013-01-16 0.0118840383
2013-01-17 -0.0005876459
2013-01-21 -0.0145541598
2013-01-22 -0.0013555548
2013-01-23 0.0193624621
2013-01-24 0.0125802978
2013-01-27 -0.0050807744
2013-01-28 0.0007959058
2013-01-29 0.0046096266
2013-01-30 0.0003844082
2013-01-31 0.0180603867
2013-02-03 -0.0091473127
2013-02-04 0.0036322298
2013-02-05 -0.0000390941
2013-02-06 -0.0085320734
2013-02-07 -0.0046591956
2013-02-10 0.0038517581
2013-02-11 0.0011412046
2013-02-12 0.0026607502
2013-02-13 -0.0062440496
2013-02-14 0.0028645616
2013-02-18 -0.0019651341
2013-02-19 0.0040206637
2013-02-20 -0.0093543648
2013-02-21 -0.0005764665
2013-02-24 0.0057176118
2013-02-25 -0.0129979321
2013-02-26 -0.0009730782
2013-02-27 -0.0001453191
2013-02-28 0.0068911863
2013-03-03 -0.0020455332
2013-03-04 -0.0014747845
The results of the function is instead:
> tail(data.frame(MonRet))
MonRet
ott 2012 -0.000848156
nov 2012 0.009833881
dic 2012 0.033406884
gen 2013 0.087822700
feb 2013 -0.023875638
mar 2013 -0.003517301
Your returns are wrong. The return for 2013-01-23 should be:
> 165.2842/165.5086-1
[1] -0.001355821
but you have 0.0193624621. I suspect this is because Prices is an xts object, not a zoo object. lag.xts departs from the convention in lag.ts and lag.zoo, where k=1 implies a "lag" of (t+1), in favor of the more common convention where k=1 implies a "lag" of (t-1).
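One way to sidestep the sign convention entirely is to compute the returns without lag(). The sketch below uses a plain numeric vector holding three of the prices listed in the question; applying it to the actual Prices object would need coredata() or similar, which is not shown here.

```r
# Prices for 2013-01-22, 2013-01-23 and 2013-01-24, copied from the question
p <- c(165.5086, 165.2842, 168.4845)

# Simple return at t is (p[t] - p[t-1]) / p[t-1]; no lag() call is involved
ret <- diff(p) / p[-length(p)]

# First element is the return for 2013-01-23, matching the hand calculation above
ret_jan23 <- ret[1]

# Compounding daily returns telescopes to last price / first price - 1
compounded <- prod(1 + ret) - 1
direct     <- p[length(p)] / p[1] - 1
```

If the monthly figures built from compounded daily returns disagree with last price over first price, the daily returns are the place to look.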
I am attempting to perform a study on the clustering of high/low points based on time. I managed to achieve the above by using to.daily on intraday data and merging the two using:
intraday.merge <- merge(intraday,daily)
intraday.merge <- na.locf(intraday.merge)
intraday.merge <- intraday.merge["T08:30:00/T16:30:00"] # remove record at 00:00:00
Next, I tried to obtain the records where the high == daily.high/low == daily.low using:
intradayhi <- test[test$High == test$Daily.High]
intradaylo <- test[test$Low == test$Daily.Low]
Resulting data resembles the following:
Open High Low Close Volume Daily.Open Daily.High Daily.Low Daily.Close Daily.Volume
2012-06-19 08:45:00 258.9 259.1 258.5 258.7 1424 258.9 259.1 257.7 258.7 31523
2012-06-20 13:30:00 260.8 260.9 260.6 260.6 1616 260.4 260.9 259.2 260.8 35358
2012-06-21 08:40:00 260.7 260.8 260.4 260.5 493 260.7 260.8 257.4 258.3 31360
2012-06-22 12:10:00 255.9 256.2 255.9 256.1 626 254.5 256.2 253.9 255.3 50515
2012-06-22 12:15:00 256.1 256.2 255.9 255.9 779 254.5 256.2 253.9 255.3 50515
2012-06-25 11:55:00 254.5 254.7 254.4 254.6 1589 253.8 254.7 251.5 253.9 65621
2012-06-26 08:45:00 253.4 254.2 253.2 253.7 5849 253.8 254.2 252.4 253.1 70635
2012-06-27 11:25:00 255.6 256.0 255.5 255.9 973 251.8 256.0 251.8 255.2 53335
2012-06-28 09:00:00 257.0 257.3 256.9 257.1 601 255.3 257.3 255.0 255.1 23978
2012-06-29 13:45:00 253.0 253.4 253.0 253.4 451 247.3 253.4 246.9 253.4 52539
The subset contains duplicated results for some days. How do I keep only the first record of each day? I would then be able to plot the count of records for each period of the day.
Also, are there alternate methods to get the results I want? Thanks in advance.
Edit:
Sample output should look like this, count could either be 1st result for day or aggregated (more than 1 occurrence in that day):
Time Count
08:40:00 60
08:45:00 54
08:50:00 60
...
14:00:00 20
14:05:00 12
14:10:00 30
You can get the first observation of each day via:
y <- apply.daily(x, first)
Then you can simply aggregate the count based on hours and minutes:
z <- aggregate(rep(1, NROW(y)), by=list(Time=format(index(y),"%H:%M")), sum)
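The same idea can be sketched in base R without xts, which may help when checking the counts by hand. The timestamps are the first seven from the sample output in the question; treating them as UTC is an assumption made only to keep the example self-contained.

```r
# Timestamps of the daily-high records shown in the question (time zone assumed UTC)
ts <- as.POSIXct(c(
  "2012-06-19 08:45:00", "2012-06-20 13:30:00", "2012-06-21 08:40:00",
  "2012-06-22 12:10:00", "2012-06-22 12:15:00", "2012-06-25 11:55:00",
  "2012-06-26 08:45:00"
), tz = "UTC")

# Keep only the first record of each calendar day
day <- as.Date(ts, tz = "UTC")
first_per_day <- ts[!duplicated(day)]

# Count how often each time of day appears among those first records
counts <- table(Time = format(first_per_day, "%H:%M"))
```

Dropping the !duplicated() step gives the aggregated variant, where every occurrence within a day is counted.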