Separate a row containing two dates into before and after midnight in R

I have a data frame containing sleep data, with several sleep increments, with a column for the start and a column for the end of the sleep.
For some rows, the start time falls on one day and the end time on the next day.
What I would like to do is to separate such rows into two rows, where the first row contains the starting time till 23:59:59, and the second row 00:00:00 till the end time.
For example:
# A tibble: 6 x 3
sleepdatestarttime sleepdateendtime sleepstage
<dttm> <dttm> <chr>
1 2018-03-02 23:31:00 2018-03-02 23:54:00 rem
2 2018-03-02 23:54:00 2018-03-02 23:55:00 light
3 2018-03-02 23:55:00 2018-03-03 00:02:00 wake
4 2018-03-03 00:02:00 2018-03-03 00:03:30 light
5 2018-03-03 00:03:30 2018-03-03 00:23:30 deep
6 2018-03-03 00:23:30 2018-03-03 02:58:00 light
and the desired output is:
# A tibble: 7 x 3
sleepdatestarttime sleepdateendtime sleepstage
<dttm> <dttm> <chr>
1 2018-03-02 23:31:00 2018-03-02 23:54:00 rem
2 2018-03-02 23:54:00 2018-03-02 23:55:00 light
3 2018-03-02 23:55:00 2018-03-02 23:59:59 wake
4 2018-03-03 00:00:00 2018-03-03 00:01:59 wake
5 2018-03-03 00:02:00 2018-03-03 00:03:30 light
6 2018-03-03 00:03:30 2018-03-03 00:23:30 deep
7 2018-03-03 00:23:30 2018-03-03 02:58:00 light
A dplyr solution would be very helpful.

Here is a possible solution using only base R rather than dplyr. I converted all times to UTC to avoid issues with time zone conversions. (See the related answer "change time zone in R without it returning to original time zone".)
Note that this solution re-sorts the entire data frame by sleepdatestarttime, so if there are multiple people on the same day, the order() call on the last line needs modification.
df<-read.table(header=TRUE, text="sleepdatestarttime sleepdateendtime sleepstage
'2018-03-02 23:31:00' '2018-03-02 23:54:00' rem
'2018-03-02 23:54:00' '2018-03-02 23:55:00' light
'2018-03-02 23:55:00' '2018-03-03 00:02:00' wake
'2018-03-03 00:02:00' '2018-03-03 00:03:30' light
'2018-03-03 00:03:30' '2018-03-03 00:23:30' deep
'2018-03-03 00:23:30' '2018-03-03 02:58:00' light")
df$sleepdatestarttime<-as.POSIXct(as.character(df$sleepdatestarttime), tz="UTC")
df$sleepdateendtime<-as.POSIXct(as.character(df$sleepdateendtime), tz="UTC")
#find rows across days
rows<-which(as.Date(df$sleepdatestarttime) !=as.Date(df$sleepdateendtime))
#create the new rows
nstart<-data.frame(sleepdatestarttime= df$sleepdatestarttime[rows],
sleepdateendtime= as.POSIXct(paste(as.Date(df$sleepdatestarttime[rows]), "23:59:59"), tz="UTC"),
sleepstage=df$sleepstage[rows])
nend<-data.frame(sleepdatestarttime= as.POSIXct(paste(as.Date(df$sleepdateendtime[rows]), "00:00:00"), tz="UTC"),
sleepdateendtime= df$sleepdateendtime[rows],
sleepstage=df$sleepstage[rows])
#substitute in the new start rows
df[rows,]<-nstart
#tack on the new ending rows
df<-rbind(df, nend)
#resort the dataframe
df<-df[order(df$sleepdatestarttime ),]
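Since the question asked for dplyr, the same idea can be sketched with dplyr and lubridate. This is only an illustration under stated assumptions: the small three-row tibble is made up for the example, the names flagged, first_part, second_part and split_df are hypothetical, and each interval is assumed to cross at most one midnight.

```r
library(dplyr)
library(lubridate)

# Small illustrative data set (not the asker's full data)
df <- tibble(
  sleepdatestarttime = ymd_hms(c("2018-03-02 23:31:00", "2018-03-02 23:55:00",
                                 "2018-03-03 00:02:00"), tz = "UTC"),
  sleepdateendtime   = ymd_hms(c("2018-03-02 23:54:00", "2018-03-03 00:02:00",
                                 "2018-03-03 00:03:30"), tz = "UTC"),
  sleepstage         = c("rem", "wake", "light")
)

# Flag rows whose start and end fall on different calendar days
flagged <- df %>%
  mutate(crosses = as_date(sleepdatestarttime) != as_date(sleepdateendtime))

# First half: original start, end clipped to 23:59:59 of the start day
first_part <- flagged %>%
  filter(crosses) %>%
  mutate(sleepdateendtime = floor_date(sleepdateendtime, "day") - seconds(1))

# Second half: start moved to 00:00:00 of the end day, original end
second_part <- flagged %>%
  filter(crosses) %>%
  mutate(sleepdatestarttime = floor_date(sleepdateendtime, "day"))

# Recombine with the untouched rows and restore chronological order
split_df <- bind_rows(filter(flagged, !crosses), first_part, second_part) %>%
  select(-crosses) %>%
  arrange(sleepdatestarttime)

split_df
```

As with the base R version, the final arrange() sorts by start time only, so data for several people would need an extra grouping or sort key. An interval that ends exactly at 00:00:00 would produce a zero-length second row under this sketch.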

This is a common issue in genomics. The IRanges package on Bioconductor has the findOverlaps() function for this purpose. foverlaps() is the data.table equivalent, which is used here. AFAIK, there is no dplyr equivalent available.
First we need to create a vector of day start and end times. The call to foverlaps() returns all possible types of overlaps. Finally, the start and end times are adjusted to match with the expected result.
library(data.table)
library(lubridate)
day_seq <- setDT(df)[, .(day_start = seq(
floor_date(min(sleepdatestarttime), "day"),
ceiling_date(max(sleepdateendtime), "day"), "day"))][
, day_end := day_start + days(1)]
setkey(day_seq, day_start, day_end)
foverlaps(
df, day_seq, by.x = c("sleepdatestarttime", "sleepdateendtime"), nomatch = 0L)[
, `:=`(sleepdatestarttime = pmax(sleepdatestarttime, day_start),
sleepdateendtime = pmin(sleepdateendtime, day_end - seconds(1)))][
, c("day_start", "day_end") := NULL][]
i sleepdatestarttime sleepdateendtime sleepstage
1: 1 2018-03-02 23:31:00 2018-03-02 23:54:00 rem
2: 2 2018-03-02 23:54:00 2018-03-02 23:55:00 light
3: 3 2018-03-02 23:55:00 2018-03-02 23:59:59 wake
4: 3 2018-03-03 00:00:00 2018-03-03 00:02:00 wake
5: 4 2018-03-03 00:02:00 2018-03-03 00:03:30 light
6: 5 2018-03-03 00:03:30 2018-03-03 00:23:30 deep
7: 6 2018-03-03 00:23:30 2018-03-03 02:58:00 light
Data
df <- readr::read_table("i sleepdatestarttime sleepdateendtime sleepstage
1 2018-03-02 23:31:00 2018-03-02 23:54:00 rem
2 2018-03-02 23:54:00 2018-03-02 23:55:00 light
3 2018-03-02 23:55:00 2018-03-03 00:02:00 wake
4 2018-03-03 00:02:00 2018-03-03 00:03:30 light
5 2018-03-03 00:03:30 2018-03-03 00:23:30 deep
6 2018-03-03 00:23:30 2018-03-03 02:58:00 light")

Related

How to check if time is similar every 8 rows after floor_date function (and correct it if needed)

I have a dataset where new data are recorded at a fixed interval (3-4 minutes). Every 8 records (rows) correspond to the same set of data (CC_01->04 and DC_01->04), which I want to stamp with the previous half-hour.
For this I use the floor_date() function of lubridate, which works perfectly:
lubridate::floor_date(data$Date_IV, "30 minutes")
However, sometimes the eighth record starts after the beginning of the next half-hour, so floor_date() stamps it with this new half-hour. I would like it to be stamped with the previous one instead (as part of the subset).
Therefore I'm looking for a way to check when this eighth value differs from the previous 7, and correct it if needed.
An example:
Label Date_IV Obs. Exp_Flux Floor_date
1 CC_01 2021-07-08 12:38:00 1 -0.290000 2021-07-08 12:30:00
2 DC_01 2021-07-08 12:42:00 2 3.830000 2021-07-08 12:30:00
3 CC_02 2021-07-08 12:45:00 3 -0.527937 2021-07-08 12:30:00
4 DC_02 2021-07-08 12:49:00 4 2.260000 2021-07-08 12:30:00
5 CC_03 2021-07-08 12:52:00 5 -0.743471 2021-07-08 12:30:00
6 DC_03 2021-07-08 12:55:00 6 2.230000 2021-07-08 12:30:00
7 CC_04 2021-07-08 12:59:00 7 -1.510000 2021-07-08 12:30:00
8 DC_04 2021-07-08 13:02:00 8 1.820000 2021-07-08 13:00:00
9 CC_01 2021-07-08 13:05:00 9 -0.190000 2021-07-08 13:00:00
10 DC_01 2021-07-08 13:08:00 10 3.750000 2021-07-08 13:00:00
11 CC_02 2021-07-08 13:11:00 11 -0.423572 2021-07-08 13:00:00
12 DC_02 2021-07-08 13:14:00 12 2.230000 2021-07-08 13:00:00
13 CC_03 2021-07-08 13:18:00 13 -0.635882 2021-07-08 13:00:00
14 DC_03 2021-07-08 13:22:00 14 2.670000 2021-07-08 13:00:00
15 CC_04 2021-07-08 13:25:00 15 -1.440000 2021-07-08 13:00:00
16 DC_04 2021-07-08 13:29:00 16 1.860000 2021-07-08 13:00:00
In my example, the first 8 lines should be stamped with 12:30:00. The function works for the first 7, but the eighth is stamped with 13:00 because the record was taken at 13:02.
This situation doesn't occur in the second set of measurements (lines 9->16), as the last measurement started before the next half-hour, so all eight are stamped with 13:00, which is correct. Nothing to correct here.
These measurements are repeated many times, so I cannot modify them by hand.
I hope it makes sense.
Thanks in advance for your help,
Adrien
You can create a group for every 8 rows, or start a new group every time CC_01 occurs, whichever is more appropriate for your data, and then take the floor_date value of the first timestamp in each group.
library(dplyr)
library(lubridate)
data %>%
group_by(grp = ceiling(Obs/8)) %>%
#Or increment the group value at every occurrence of CC_01
#group_by(grp = cumsum(Label == 'CC_01')) %>%
mutate(Floor_date = floor_date(first(Date_IV), '30 minutes')) %>%
ungroup
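To check the grouping approach on the rows from the question, here is a self-contained sketch. It rebuilds only the Label, Date_IV and Obs columns from the example table (Exp_Flux is omitted), uses the cumsum(Label == "CC_01") variant, and assumes that every set really starts with CC_01. The name result is just for illustration.

```r
library(dplyr)
library(lubridate)

# The 16 example rows from the question (timestamps parsed as UTC)
data <- tibble(
  Label = rep(c("CC_01", "DC_01", "CC_02", "DC_02",
                "CC_03", "DC_03", "CC_04", "DC_04"), times = 2),
  Date_IV = ymd_hms(c(
    "2021-07-08 12:38:00", "2021-07-08 12:42:00", "2021-07-08 12:45:00",
    "2021-07-08 12:49:00", "2021-07-08 12:52:00", "2021-07-08 12:55:00",
    "2021-07-08 12:59:00", "2021-07-08 13:02:00", "2021-07-08 13:05:00",
    "2021-07-08 13:08:00", "2021-07-08 13:11:00", "2021-07-08 13:14:00",
    "2021-07-08 13:18:00", "2021-07-08 13:22:00", "2021-07-08 13:25:00",
    "2021-07-08 13:29:00")),
  Obs = 1:16
)

result <- data %>%
  # a new group starts at every CC_01 row
  group_by(grp = cumsum(Label == "CC_01")) %>%
  # stamp the whole group with the half-hour of its first record
  mutate(Floor_date = floor_date(first(Date_IV), "30 minutes")) %>%
  ungroup()

# Row 8 (recorded at 13:02) now carries the 12:30 stamp of its group
result %>% slice(7:9)
```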

Evaluating Prophet model in R, using cross-validation

I am trying to cross-validate a Prophet model in R.
The problem is that this package does not work well with monthly data.
I managed to build the model and even used a custom monthly seasonality, as recommended by the authors of this tool.
But I cannot cross-validate monthly data. I tried to follow the recommendations in the GitHub issue, but I am missing something.
Currently my code looks like this
model1_cv <- cross_validation(model1, initial = 156, period = 365/12, as.difftime(horizon = 365/12, units = "days"))
Updated:
Based on the answer to this question, I visualized the CV results. There are some problems here. I used full data and partial data.
Also, the metrics do not look that good.
I tested a bit with the training data from the package. From what I understand, the package is not really well suited to monthly forecasts. The part [...] as.difftime(365/12, units = "days") [...] seems to be there only to provide the length of a month (30-something days), meaning you can use it instead of plain 365/12 for "period" and/or "horizon". One thing I noticed is that both arguments are documented as integers, but inside the function they are converted with as.difftime(), so they are actually doubles.
library(dplyr)
library(prophet)
library(data.table)
#training data
df <- data.table::fread("ds y
1992-01-01 146376
1992-02-01 147079
1992-03-01 159336
1992-04-01 163669
1992-05-01 170068
1992-06-01 168663
1992-07-01 169890
1992-08-01 170364
1992-09-01 164617
1992-10-01 173655
1992-11-01 171547
1992-12-01 208838
1993-01-01 153221
1993-02-01 150087
1993-03-01 170439
1993-04-01 176456
1993-05-01 182231
1993-06-01 181535
1993-07-01 183682
1993-08-01 183318
1993-09-01 177406
1993-10-01 182737
1993-11-01 187443
1993-12-01 224540
1994-01-01 161349
1994-02-01 162841
1994-03-01 192319
1994-04-01 189569
1994-05-01 194927
1994-06-01 197946
1994-07-01 193355
1994-08-01 202388
1994-09-01 193954
1994-10-01 197956
1994-11-01 202520
1994-12-01 241111
1995-01-01 175344
1995-02-01 172138
1995-03-01 201279
1995-04-01 196039
1995-05-01 210478
1995-06-01 211844
1995-07-01 203411
1995-08-01 214248
1995-09-01 202122
1995-10-01 204044
1995-11-01 212190
1995-12-01 247491
1996-01-01 185019
1996-02-01 192380
1996-03-01 212110
1996-04-01 211718
1996-05-01 226936
1996-06-01 217511
1996-07-01 218111")
df <- df %>%
dplyr::mutate(ds = as.Date(ds))
model <- prophet::prophet(df)
(tscv.myfit <- prophet::cross_validation(model, horizon = 365/12, units = "days", period = 365/12, initial = 365/12 * 12 * 3))
y ds yhat yhat_lower yhat_upper cutoff
1: 175344 1995-01-01 170988.8 170145.9 171828.0 1994-12-31 02:00:00
2: 172138 1995-02-01 178117.4 176975.2 179070.2 1995-01-30 12:00:00
3: 201279 1995-03-01 211462.8 210277.4 212670.8 1995-01-30 12:00:00
4: 196039 1995-04-01 200113.9 198079.5 201977.8 1995-03-01 22:00:00
5: 210478 1995-05-01 202100.5 200390.8 203797.9 1995-04-01 08:00:00
6: 211844 1995-06-01 208330.5 206229.9 210497.4 1995-05-01 18:00:00
7: 203411 1995-07-01 202563.8 200786.5 204313.0 1995-06-01 04:00:00
8: 214248 1995-08-01 214639.6 212748.3 216461.3 1995-07-01 14:00:00
9: 202122 1995-09-01 204954.0 203048.9 206768.4 1995-08-31 12:00:00
10: 204044 1995-10-01 205097.5 203209.7 206882.3 1995-09-30 22:00:00
11: 212190 1995-11-01 213586.7 211728.1 215617.6 1995-10-31 08:00:00
12: 247491 1995-12-01 251518.8 249708.2 253589.2 1995-11-30 18:00:00
13: 185019 1996-01-01 182403.7 180520.1 184494.7 1995-12-31 04:00:00
14: 192380 1996-02-01 184722.9 182772.7 186686.9 1996-01-30 14:00:00
15: 212110 1996-03-01 205020.1 202823.2 206996.9 1996-01-30 14:00:00
16: 211718 1996-04-01 214514.0 211891.9 217175.3 1996-03-31 14:00:00
17: 226936 1996-05-01 218845.2 216133.8 221420.4 1996-03-31 14:00:00
18: 217511 1996-06-01 218672.2 216007.8 221459.9 1996-05-31 14:00:00
19: 218111 1996-07-01 221156.1 218540.7 224184.1 1996-05-31 14:00:00
The cutoffs are not as regular as one would expect. I guess this is due to using the average number of days per month somehow, though I could not figure out the logic. You can replace 365/12 with as.difftime(365/12, units = "days") and you will get the same result.
But if you use (365+365+365+366) / 48 instead, to account for 29 February, you get a slightly different average number of days per month, and this leads to a different output:
(tscv.myfit_2 <- prophet::cross_validation(model, horizon = (365+365+365+366)/48, units = "days", period = (365+365+365+366)/48, initial = (365+365+365+366)/48 * 12 * 3))
y ds yhat yhat_lower yhat_upper cutoff
1: 172138 1995-02-01 178117.4 177075.3 179203.9 1995-01-29 13:30:00
2: 201279 1995-03-01 211462.8 210340.5 212607.3 1995-01-29 13:30:00
3: 196039 1995-04-01 200113.9 198022.6 202068.1 1995-03-31 13:30:00
4: 210478 1995-05-01 204100.2 202009.8 206098.7 1995-03-31 13:30:00
5: 211844 1995-06-01 208330.5 206114.5 210515.8 1995-05-31 13:30:00
6: 203411 1995-07-01 202606.0 200319.1 204663.4 1995-05-31 13:30:00
7: 214248 1995-08-01 214639.6 212684.4 216495.7 1995-07-31 22:30:00
8: 202122 1995-09-01 204954.0 203127.7 206951.0 1995-08-31 09:00:00
9: 204044 1995-10-01 205097.5 203285.3 207036.5 1995-09-30 19:30:00
10: 212190 1995-11-01 213586.7 211516.8 215516.2 1995-10-31 06:00:00
11: 247491 1995-12-01 251518.8 249658.3 253590.1 1995-11-30 16:30:00
12: 185019 1996-01-01 182403.7 180359.7 184399.2 1995-12-31 03:00:00
13: 192380 1996-02-01 184722.9 182652.4 186899.8 1996-01-30 13:30:00
14: 212110 1996-03-01 205020.1 203040.3 207171.9 1996-01-30 13:30:00
15: 211718 1996-04-01 214514.0 211942.6 217252.6 1996-03-31 13:30:00
16: 226936 1996-05-01 218845.2 216203.1 221506.5 1996-03-31 13:30:00
17: 217511 1996-06-01 218672.2 215823.9 221292.4 1996-05-31 13:30:00
18: 218111 1996-07-01 221156.1 218236.7 223862.0 1996-05-31 13:30:00
From this behaviour I would say the workaround is not ideal, especially depending on how exact you want the cross-validation to be in terms of rolling months. If you need the cutoff points to be exact, you could write your own function that always predicts one month ahead from the starting point, collect these results and build the final comparison. I would trust this approach more than the workaround.
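As a starting point for such a hand-rolled approach, the sketch below builds cutoffs that fall exactly on month starts with base R and shows how far stepping by 365/12 days drifts from them. It does not call prophet at all. The start and end dates are assumptions chosen to match the sample data, and the variable names are illustrative. I believe recent versions of prophet's cross_validation() also accept a cutoffs argument that could take such a vector, but check the documentation of your installed version before relying on that.

```r
# Assumed range: first cutoff after three years of history, last observation
first_cutoff <- as.Date("1994-12-01")
last_obs     <- as.Date("1996-07-01")

# seq() with by = "month" keeps the same day of month for every step
monthly_cutoffs <- seq(first_cutoff, last_obs - 1, by = "month")

# Stepping by an average month length instead (fractional days are kept)
avg_cutoffs <- first_cutoff + (seq_along(monthly_cutoffs) - 1) * (365 / 12)

# Difference in days between the two schemes at each step
drift_days <- as.numeric(avg_cutoffs) - as.numeric(monthly_cutoffs)

data.frame(monthly_cutoffs, drift_days = round(drift_days, 2))
```

Each exact cutoff could then be used to subset the history, refit the model and predict one month ahead, collecting the predictions for the final comparison.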

How do I extract data from a data frame based on the months?

I have a data frame, df, that has a date column and two variables in it. I would like to either extract all of the Oct-Dec data or delete the other months' data from the data frame.
I have put the data into a data frame, but at the moment it holds the whole year and I just want to extract the wanted data. In future I will also be extracting just winter data. I have attached a chunk of my data frame. I tried using format() with just %m but couldn't get it to work.
14138 2017-09-15 4.655946e-01 0.0603515884
14139 2017-09-16 7.881137e-01 0.0479933304
14140 2017-09-17 5.018990e-01 0.0256871025
14141 2017-09-18 -1.583625e-01 -0.0040893990
14142 2017-09-19 -6.733220e-01 -0.0313100989
14143 2017-09-20 -1.225730e+00 -0.0587706331
14144 2017-09-21 -1.419133e+00 -0.0958125544
14145 2017-09-22 -1.338630e+00 -0.0902803173
14146 2017-09-23 -1.272554e+00 -0.0659170673
14147 2017-09-24 -1.132318e+00 -0.0387240370
14148 2017-09-25 -1.255414e+00 -0.0392615823
14149 2017-09-26 -1.497188e+00 -0.0438491356
14150 2017-09-27 -1.427622e+00 -0.0633879185
14151 2017-09-28 -1.051756e+00 -0.0992427127
14152 2017-09-29 -4.876309e-01 -0.1448044528
14153 2017-09-30 -6.829681e-02 -0.1749463647
14154 2017-10-01 -1.413768e-01 -0.2009916094
14155 2017-10-02 6.359742e-02 -0.1975848313
14156 2017-10-03 9.103277e-01 -0.1828581805
14157 2017-10-04 1.695776e+00 -0.1589352546
14158 2017-10-05 1.913918e+00 -0.1538234614
14159 2017-10-06 1.479714e+00 -0.1937094170
14160 2017-10-07 8.783669e-01 -0.1703790211
14161 2017-10-08 5.706581e-01 -0.1294144428
14162 2017-10-09 4.979405e-01 -0.0666569815
14163 2017-10-10 3.233477e-01 0.0072006102
14164 2017-10-11 3.057630e-01 0.0863445067
14165 2017-10-12 5.877673e-01 0.1097707831
14166 2017-10-13 1.208526e+00 0.1301967193
14167 2017-10-14 1.671705e+00 0.1728109268
14168 2017-10-15 1.810979e+00 0.2264911145
14169 2017-10-16 1.426651e+00 0.2702958315
14170 2017-10-17 1.241140e+00 0.3242637704
14171 2017-10-18 8.997498e-01 0.3879727861
14172 2017-10-19 5.594161e-01 0.4172990825
14173 2017-10-20 3.980254e-01 0.3915170864
14174 2017-10-21 2.138538e-01 0.3249736995
14175 2017-10-22 3.926440e-01 0.2224834840
14176 2017-10-23 2.268644e-01 0.0529143372
14177 2017-10-24 5.664923e-01 -0.0081443464
14178 2017-10-25 6.167520e-01 0.0312073984
14179 2017-10-26 7.751882e-02 0.0043897693
14180 2017-10-27 -5.634851e-02 -0.0726825266
14181 2017-10-28 -2.122061e-01 -0.1711305549
14182 2017-10-29 -8.500991e-01 -0.2068581639
14183 2017-10-30 -1.039685e+00 -0.2909120824
14184 2017-10-31 -3.057745e-01 -0.3933633317
14185 2017-11-01 -1.288774e-01 -0.3726346136
14186 2017-11-02 -5.608007e-03 -0.2425754386
14187 2017-11-03 4.853990e-01 -0.0503543980
14188 2017-11-04 5.822672e-01 0.0896130098
14189 2017-11-05 8.491505e-01 0.1299151006
14190 2017-11-06 1.052999e+00 0.0749888307
14191 2017-11-07 1.170470e+00 0.0287317882
14192 2017-11-08 7.919862e-01 0.0788187381
14193 2017-11-09 4.574565e-01 0.1539981316
14194 2017-11-10 4.552032e-01 0.2034393145
14195 2017-11-11 -3.621350e-01 0.2077476707
14196 2017-11-12 -8.053965e-01 0.1759558604
14197 2017-11-13 -8.307459e-01 0.1802858410
14198 2017-11-14 -9.421325e-01 0.2175529008
14199 2017-11-15 -9.880204e-01 0.2392924580
14200 2017-11-16 -7.448127e-01 0.2519253751
14201 2017-11-17 -8.081435e-01 0.2614254732
14202 2017-11-18 -1.216806e+00 0.2629971336
14203 2017-11-19 -1.122674e+00 0.3469995055
14204 2017-11-20 -1.242597e+00 0.4553094014
14205 2017-11-21 -1.294885e+00 0.5049438231
14206 2017-11-22 -9.325514e-01 0.4684133163
14207 2017-11-23 -4.632281e-01 0.4071673624
14208 2017-11-24 -9.689322e-02 0.3710270269
14209 2017-11-25 4.704467e-01 0.4126721465
14210 2017-11-26 8.682453e-01 0.3745057653
14211 2017-11-27 5.105564e-01 0.2373454931
14212 2017-11-28 4.747265e-01 0.1650783370
14213 2017-11-29 5.905379e-01 0.2632154120
14214 2017-11-30 4.083787e-01 0.3888834762
14215 2017-12-01 3.451736e-01 0.5008047592
14216 2017-12-02 5.161312e-01 0.5388177242
14217 2017-12-03 7.109279e-01 0.5515360710
14218 2017-12-04 4.458635e-01 0.5127537202
14219 2017-12-05 -3.986610e-01 0.3896493238
14220 2017-12-06 -5.968253e-01 0.1095843268
14221 2017-12-07 -1.604398e-01 -0.2455506506
14222 2017-12-08 -4.384744e-01 -0.5801038215
14223 2017-12-09 -7.255016e-01 -0.8384627087
14224 2017-12-10 -9.691828e-01 -0.9223171538
14225 2017-12-11 -1.140588e+00 -0.8177806761
14226 2017-12-12 -1.956622e-01 -0.5250998474
14227 2017-12-13 -1.083792e-01 -0.3430768534
14228 2017-12-14 -8.016345e-02 -0.3163476104
14229 2017-12-15 8.899266e-01 -0.2813253830
14230 2017-12-16 1.322833e+00 -0.2545953062
14231 2017-12-17 1.547972e+00 -0.2275373110
14232 2017-12-18 2.164907e+00 -0.3217205817
14233 2017-12-19 2.276258e+00 -0.5773412429
14234 2017-12-20 1.862291e+00 -0.7728091393
14235 2017-12-21 1.125083e+00 -0.9099696881
14236 2017-12-22 7.737118e-01 -1.2441963604
14237 2017-12-23 7.863508e-01 -1.4802661587
14238 2017-12-24 4.313111e-01 -1.4111320559
14239 2017-12-25 -8.814799e-02 -1.0024805520
14240 2017-12-26 -3.615127e-01 -0.4943077147
14241 2017-12-27 -5.011363e-01 -0.0308588186
14242 2017-12-28 -8.474088e-01 0.3717555895
14243 2017-12-29 -7.283247e-01 0.8230450219
14244 2017-12-30 -4.566981e-01 1.2495961116
14245 2017-12-31 -4.577034e-01 1.4805369230
14246 2018-01-01 1.946166e-01 1.5310004017
14247 2018-01-02 5.203149e-01 1.5384595802
14248 2018-01-03 5.024570e-02 1.4036679018
14249 2018-01-04 -7.065297e-01 1.0749574137
14250 2018-01-05 -8.741815e-01 0.7608524752
14251 2018-01-06 1.589530e-01 0.7891084646
14252 2018-01-07 8.632378e-01 1.1230358751
As requested, the class is "Date".
You can use lubridate and base R:
library(lubridate)
dats[month(ymd(dats$V2)) >= 10,]
# EDIT if the class of the date variable is date, it should be only
dats[month(dats$V2) >= 10,]
Or fully base without any date work:
dats[substr(dats$V2,6,7) %in% c("10","11","12"),]
With data:
V1 V2 V3 V4
1 14138 2017-09-15 0.4655946 0.06035159
2 14139 2017-09-16 0.7881137 0.04799333
...
From your question, it is unclear what format the date variable is in. Maybe add the output of class(your_date_variable) to the question. As a general rule, though, you'll want to use filter from the dplyr package. Something like this:
new_data <- data %>% filter(as.integer(format(date_variable, "%m")) >= 10)
This might change slightly depending on the class of your date variable.
Assuming the 'date_variable' is Date class, extract the month and do a comparison in filter (action verb from dplyr)
library(dplyr)
library(lubridate)
data %>%
filter(month(date_variable) >= 10)
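The question also mentions extracting winter data later, where a >= comparison no longer works because the months wrap around the year end. A minimal base R sketch using %in% is shown below. The data frame dats, its columns date and value, and the choice of Dec-Feb as "winter" are assumptions made for the example.

```r
# Illustrative daily data covering the same range as the sample in the question
dates <- seq(as.Date("2017-09-15"), as.Date("2018-01-07"), by = "day")
dats  <- data.frame(date = dates, value = seq_along(dates))

# Month number as an integer (1-12), taken from the Date column
mon <- as.integer(format(dats$date, "%m"))

# Oct-Dec: a simple set membership test
oct_dec <- dats[mon %in% 10:12, ]

# Winter (assumed here to mean Dec-Feb): months that wrap around the year end
winter <- dats[mon %in% c(12, 1, 2), ]

range(oct_dec$date)
range(winter$date)
```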

How to plot lagged data against other data in R

I would like to lag one variable by, say, 10 time steps and plot it against the other variable which remains the same. I would like to do this for various lags to see if there is a time period that the first variable influences the other. The data I have is daily and after lagging I am separating into Dec-Feb data only. The problem I am having is the plot and correlation between the lagged variable and the other data is coming out the same as the non-lagged plot and correlation every time. I am not sure how to achieve this.
A sample of my data frame "data" can be seen below.
Date x y
14158 2017-10-05 1.913918e+00 -0.1538234614
14159 2017-10-06 1.479714e+00 -0.1937094170
14160 2017-10-07 8.783669e-01 -0.1703790211
14161 2017-10-08 5.706581e-01 -0.1294144428
14162 2017-10-09 4.979405e-01 -0.0666569815
14163 2017-10-10 3.233477e-01 0.0072006102
14164 2017-10-11 3.057630e-01 0.0863445067
14165 2017-10-12 5.877673e-01 0.1097707831
14166 2017-10-13 1.208526e+00 0.1301967193
14167 2017-10-14 1.671705e+00 0.1728109268
14168 2017-10-15 1.810979e+00 0.2264911145
14169 2017-10-16 1.426651e+00 0.2702958315
14170 2017-10-17 1.241140e+00 0.3242637704
14171 2017-10-18 8.997498e-01 0.3879727861
14172 2017-10-19 5.594161e-01 0.4172990825
14173 2017-10-20 3.980254e-01 0.3915170864
14174 2017-10-21 2.138538e-01 0.3249736995
14175 2017-10-22 3.926440e-01 0.2224834840
14176 2017-10-23 2.268644e-01 0.0529143372
14177 2017-10-24 5.664923e-01 -0.0081443464
14178 2017-10-25 6.167520e-01 0.0312073984
14179 2017-10-26 7.751882e-02 0.0043897693
14180 2017-10-27 -5.634851e-02 -0.0726825266
14181 2017-10-28 -2.122061e-01 -0.1711305549
14182 2017-10-29 -8.500991e-01 -0.2068581639
14183 2017-10-30 -1.039685e+00 -0.2909120824
14184 2017-10-31 -3.057745e-01 -0.3933633317
14185 2017-11-01 -1.288774e-01 -0.3726346136
14186 2017-11-02 -5.608007e-03 -0.2425754386
14187 2017-11-03 4.853990e-01 -0.0503543980
14188 2017-11-04 5.822672e-01 0.0896130098
14189 2017-11-05 8.491505e-01 0.1299151006
14190 2017-11-06 1.052999e+00 0.0749888307
14191 2017-11-07 1.170470e+00 0.0287317882
14192 2017-11-08 7.919862e-01 0.0788187381
14193 2017-11-09 4.574565e-01 0.1539981316
14194 2017-11-10 4.552032e-01 0.2034393145
14195 2017-11-11 -3.621350e-01 0.2077476707
14196 2017-11-12 -8.053965e-01 0.1759558604
14197 2017-11-13 -8.307459e-01 0.1802858410
14198 2017-11-14 -9.421325e-01 0.2175529008
14199 2017-11-15 -9.880204e-01 0.2392924580
14200 2017-11-16 -7.448127e-01 0.2519253751
14201 2017-11-17 -8.081435e-01 0.2614254732
14202 2017-11-18 -1.216806e+00 0.2629971336
14203 2017-11-19 -1.122674e+00 0.3469995055
14204 2017-11-20 -1.242597e+00 0.4553094014
14205 2017-11-21 -1.294885e+00 0.5049438231
14206 2017-11-22 -9.325514e-01 0.4684133163
14207 2017-11-23 -4.632281e-01 0.4071673624
14208 2017-11-24 -9.689322e-02 0.3710270269
14209 2017-11-25 4.704467e-01 0.4126721465
14210 2017-11-26 8.682453e-01 0.3745057653
14211 2017-11-27 5.105564e-01 0.2373454931
14212 2017-11-28 4.747265e-01 0.1650783370
14213 2017-11-29 5.905379e-01 0.2632154120
14214 2017-11-30 4.083787e-01 0.3888834762
14215 2017-12-01 3.451736e-01 0.5008047592
14216 2017-12-02 5.161312e-01 0.5388177242
14217 2017-12-03 7.109279e-01 0.5515360710
14218 2017-12-04 4.458635e-01 0.5127537202
14219 2017-12-05 -3.986610e-01 0.3896493238
14220 2017-12-06 -5.968253e-01 0.1095843268
14221 2017-12-07 -1.604398e-01 -0.2455506506
14222 2017-12-08 -4.384744e-01 -0.5801038215
14223 2017-12-09 -7.255016e-01 -0.8384627087
14224 2017-12-10 -9.691828e-01 -0.9223171538
14225 2017-12-11 -1.140588e+00 -0.8177806761
14226 2017-12-12 -1.956622e-01 -0.5250998474
14227 2017-12-13 -1.083792e-01 -0.3430768534
14228 2017-12-14 -8.016345e-02 -0.3163476104
14229 2017-12-15 8.899266e-01 -0.2813253830
14230 2017-12-16 1.322833e+00 -0.2545953062
14231 2017-12-17 1.547972e+00 -0.2275373110
14232 2017-12-18 2.164907e+00 -0.3217205817
14233 2017-12-19 2.276258e+00 -0.5773412429
14234 2017-12-20 1.862291e+00 -0.7728091393
14235 2017-12-21 1.125083e+00 -0.9099696881
14236 2017-12-22 7.737118e-01 -1.2441963604
14237 2017-12-23 7.863508e-01 -1.4802661587
14238 2017-12-24 4.313111e-01 -1.4111320559
14239 2017-12-25 -8.814799e-02 -1.0024805520
14240 2017-12-26 -3.615127e-01 -0.4943077147
14241 2017-12-27 -5.011363e-01 -0.0308588186
14242 2017-12-28 -8.474088e-01 0.3717555895
14243 2017-12-29 -7.283247e-01 0.8230450219
14244 2017-12-30 -4.566981e-01 1.2495961116
14245 2017-12-31 -4.577034e-01 1.4805369230
14246 2018-01-01 1.946166e-01 1.5310004017
14247 2018-01-02 5.203149e-01 1.5384595802
14248 2018-01-03 5.024570e-02 1.4036679018
14249 2018-01-04 -7.065297e-01 1.0749574137
14250 2018-01-05 -8.741815e-01 0.7608524752
14251 2018-01-06 1.589530e-01 0.7891084646
14252 2018-01-07 8.632378e-01 1.1230358751
I am using
lagged <- lag(ts(x), k=10)
This is so the tsp isn't ignored. However, when I do
cor(data$x, data$y)
and
cor(lagged, data$y)
the result is the same, where I would have thought it would be different. How do I get this lag to work before I go ahead and separate by date?
Many thanks!
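One likely explanation is that stats::lag() on a ts object only shifts the time index (the tsp attribute) and leaves the values in the same positions, so cor(), which pairs values by position, sees the same vector. The sketch below illustrates this on made-up numbers and then shifts the values by hand. The vectors x and y, the lag k = 2 and the name x_lag are all illustrative, and the direction of the shift should be checked against what "lagged" is meant to mean for your data.

```r
# Made-up example vectors
x <- c(1, 3, 2, 5, 4, 6, 8, 7, 9, 10)
y <- c(2, 1, 4, 3, 6, 5, 7, 9, 8, 10)
k <- 2

# stats::lag() moves the time base, not the values
lagged_ts <- stats::lag(ts(x), k = k)
identical(as.numeric(lagged_ts), x)  # the underlying values are unchanged
tsp(ts(x))                           # original start/end/frequency
tsp(lagged_ts)                       # shifted start/end, same values

# Shift the values themselves: position i now holds x from k steps earlier
n     <- length(x)
x_lag <- c(rep(NA, k), x[seq_len(n - k)])

cor(x, y)                               # unlagged correlation
cor(x_lag, y, use = "complete.obs")     # correlation with x lagged by k steps
```

For a data frame, the shifted column can be added before filtering to Dec-Feb, so the lag is taken over consecutive days rather than across the gaps left by the filter. dplyr::lag(x, n = k) gives the same shifted vector as x_lag.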

How to insert missing dates/times using R based on criteria?

I have a data frame like the one below. Three staff members have hourly readings over several days, but the readings are incomplete (each staff member should have 24 readings a day).
The staff members have different numbers of readings on each day. I am only interested in the staff member with the most readings on a given day.
There are many days. I want to insert the missing (hourly) rows only for the staff member with the most readings on each day. That is, on 2018-03-02 insert rows only for Jack, on 2018-03-03 only for David, and on 2018-03-04 only for Kate.
I tried the code from a related question (even though it fills in rows for everyone without differentiation), but I am not getting there.
How can it be done in R?
date_time <- c("2/3/2018 0:00","2/3/2018 1:00","2/3/2018 2:00","2/3/2018 3:00","2/3/2018 5:00","2/3/2018 6:00","2/3/2018 7:00","2/3/2018 8:00","2/3/2018 9:00","2/3/2018 10:00","2/3/2018 11:00","2/3/2018 12:00","2/3/2018 13:00","2/3/2018 14:00","2/3/2018 16:00","2/3/2018 17:00","2/3/2018 18:00","2/3/2018 19:00","2/3/2018 21:00","2/3/2018 22:00","2/3/2018 23:00","3/3/2018 0:00","3/3/2018 0:00","3/3/2018 1:00","3/3/2018 2:00","3/3/2018 4:00","3/3/2018 5:00","3/3/2018 7:00","3/3/2018 8:00","3/3/2018 9:00","3/3/2018 11:00","3/3/2018 12:00","3/3/2018 14:00","3/3/2018 15:00","3/3/2018 17:00","3/3/2018 18:00","3/3/2018 20:00","3/3/2018 22:00","3/3/2018 23:00","4/3/2018 0:00","4/3/2018 0:00","4/3/2018 1:00","4/3/2018 2:00","4/3/2018 3:00","4/3/2018 5:00","4/3/2018 6:00","4/3/2018 7:00","4/3/2018 8:00","4/3/2018 10:00","4/3/2018 11:00","4/3/2018 12:00","4/3/2018 14:00","4/3/2018 15:00","4/3/2018 16:00","4/3/2018 17:00","4/3/2018 19:00","4/3/2018 20:00","4/3/2018 22:00","4/3/2018 23:00")
staff <- c("Jack","Jack","Kate","Jack","Jack","Jack","Jack","Jack","Jack","Jack","Jack","Jack","Kate","Jack","Jack","Jack","David","David","Jack","Kate","David","David","David","David","David","David","David","David","David","David","David","David","David","David","David","David","David","Jack","Kate","David","David","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Jack")
reading <- c(7.5,8.3,7,6.9,7.1,8.1,8.4,8.8,6,7.1,8.9,7.3,7.4,6.9,11.3,18.8,4.6,6.7,7.7,7.8,7,7,6.6,6.8,6.7,6.1,7.1,6.3,7.2,6,5.8,6.6,6.5,6.4,7.2,8.4,6.5,6.5,5.5,6.7,7,7.5,6.5,7.5,7.2,6.3,7.3,8,7,8.2,6.5,6.8,7.5,7,6.1,5.7,6.7,4.3,6.3)
df <- data.frame(date_time, staff, reading)
One option is to do this separately: create a data.table of the dates of interest and the corresponding 'staff', and generate the full hourly sequence of date-times for each. Then rbind this with the original dataset and, using a condition, summarise the data.
library(data.table)
stf <- c("Jack", "David", "Kate")
date <- as.Date(c("2018-03-02", "2018-03-03", "2018-03-04"))
df1 <- data.table(date, staff= stf)[, .(date_time = seq(as.POSIXct(paste(date, "00:00:00"),
tz = "GMT"),
length.out = 24, by = "1 hour")), staff]
setDT(df)[, date_time := as.POSIXct(date_time, "%d/%m/%Y %H:%M", tz = "GMT")]
res <- rbindlist(list(df, df1), fill = TRUE)[,
.(reading = if(any(is.na(reading))) sum(reading, na.rm = TRUE) else reading),
.(staff, date_time)]
table(res$staff, as.Date(res$date_time))
# 2018-03-02 2018-03-03 2018-03-04
# David 3 24 2
# Jack 24 1 1
# Kate 3 1 24
head(res)
# staff date_time reading
#1: Jack 2018-03-02 00:00:00 7.5
#2: Jack 2018-03-02 01:00:00 8.3
#3: Kate 2018-03-02 02:00:00 7.0
#4: Jack 2018-03-02 03:00:00 6.9
#5: Jack 2018-03-02 05:00:00 7.1
#6: Jack 2018-03-02 06:00:00 8.1
tail(res)
# staff date_time reading
#1: Kate 2018-03-04 04:00:00 0
#2: Kate 2018-03-04 09:00:00 0
#3: Kate 2018-03-04 13:00:00 0
#4: Kate 2018-03-04 18:00:00 0
#5: Kate 2018-03-04 21:00:00 0
#6: Kate 2018-03-04 23:00:00 0
Try this code:
Identify each daily hour and all staff members
# parse first: min()/max() on the raw strings compare text, so "8:00" would sort after "23:00"
dt<-as.POSIXct(date_time,format="%d/%m/%Y %H:%M")
date_h<-seq(min(dt),max(dt),by=60*60)
staff_u<-unique(staff)
comb<-expand.grid(staff_u,date_h)
colnames(comb)<-c("staff","date_time")
Uniform date format in df
df$date_time<-as.POSIXct(df$date_time,format="%d/%m/%Y %H:%M")
Merge information
out<-merge(comb,df,all.x=T)
Your output:
head(out)
staff date_time reading
1 Jack 2018-03-02 00:00:00 7.5
2 Jack 2018-03-02 01:00:00 8.3
3 Jack 2018-03-02 02:00:00 NA
4 Jack 2018-03-02 03:00:00 6.9
5 Jack 2018-03-02 04:00:00 NA
6 Jack 2018-03-02 05:00:00 7.1
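The merge above fills in hours for every staff member. To fill only the staff member with the most readings on each day, and to find that person from the data rather than typing the names in, something like the following dplyr/tidyr sketch could work. The six-row tibble is made up for the example, the names top_staff, full_grid, top_filled, others and res are hypothetical, and ties are broken by simply taking the first staff member.

```r
library(dplyr)
library(tidyr)

# Small illustrative data set (already parsed to POSIXct, UTC)
df <- tibble(
  date_time = as.POSIXct(c("2018-03-02 00:00", "2018-03-02 01:00",
                           "2018-03-02 02:00", "2018-03-03 00:00",
                           "2018-03-03 05:00", "2018-03-03 06:00"), tz = "UTC"),
  staff   = c("Jack", "Jack", "Kate", "David", "David", "Jack"),
  reading = c(7.5, 8.3, 7.0, 6.6, 6.1, 6.5)
)

# Staff member with the most readings per day (first one wins a tie)
top_staff <- df %>%
  mutate(day = as.Date(date_time)) %>%
  count(day, staff) %>%
  group_by(day) %>%
  slice_max(n, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(day, staff)

# All 24 hours of each day, but only for that day's top staff member
full_grid <- top_staff %>%
  crossing(hour = 0:23) %>%
  mutate(date_time = as.POSIXct(paste(day, "00:00:00"), tz = "UTC") + hour * 3600)

# Attach existing readings; hours without a reading become NA
top_filled <- full_grid %>%
  left_join(df, by = c("staff", "date_time")) %>%
  select(date_time, staff, reading)

# Rows of everyone else stay exactly as they were
others <- df %>%
  mutate(day = as.Date(date_time)) %>%
  anti_join(top_staff, by = c("day", "staff")) %>%
  select(-day)

res <- bind_rows(top_filled, others) %>%
  arrange(date_time, staff)
```

Unlike the data.table answer, the inserted rows carry NA rather than 0, which keeps missing readings distinguishable from real zero readings.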

Resources