Separate rows containing two separate dates into before and after midnight - R
I have a data frame of sleep data with several sleep increments, with one column for the start and one column for the end of each increment.
For some rows, the start time falls on one day and the end time on the following day.
What I would like to do is split such rows into two rows, where the first row runs from the start time to 23:59:59 and the second row from 00:00:00 to the end time.
For example:
# A tibble: 6 x 3
sleepdatestarttime sleepdateendtime sleepstage
<dttm> <dttm> <chr>
1 2018-03-02 23:31:00 2018-03-02 23:54:00 rem
2 2018-03-02 23:54:00 2018-03-02 23:55:00 light
3 2018-03-02 23:55:00 2018-03-03 00:02:00 wake
4 2018-03-03 00:02:00 2018-03-03 00:03:30 light
5 2018-03-03 00:03:30 2018-03-03 00:23:30 deep
6 2018-03-03 00:23:30 2018-03-03 02:58:00 light
and the desired output is:
# A tibble: 7 x 3
sleepdatestarttime sleepdateendtime sleepstage
<dttm> <dttm> <chr>
1 2018-03-02 23:31:00 2018-03-02 23:54:00 rem
2 2018-03-02 23:54:00 2018-03-02 23:55:00 light
3 2018-03-02 23:55:00 2018-03-02 23:59:59 wake
4 2018-03-03 00:00:00 2018-03-03 00:01:59 wake
5 2018-03-03 00:02:00 2018-03-03 00:03:30 light
6 2018-03-03 00:03:30 2018-03-03 00:23:30 deep
7 2018-03-03 00:23:30 2018-03-03 02:58:00 light
A dplyr solution would be very helpful.
Here is a possible solution, using just base R rather than dplyr. I converted all times to UTC to avoid issues with time zone conversions. (See a related answer: change time zone in R without it returning to original time zone.)
Note that this solution re-sorts the entire data frame by sleepdatestarttime, so if there are multiple people on the same day, the order() call on the last line needs modification.
df<-read.table(header=TRUE, text="sleepdatestarttime sleepdateendtime sleepstage
'2018-03-02 23:31:00' '2018-03-02 23:54:00' rem
'2018-03-02 23:54:00' '2018-03-02 23:55:00' light
'2018-03-02 23:55:00' '2018-03-03 00:02:00' wake
'2018-03-03 00:02:00' '2018-03-03 00:03:30' light
'2018-03-03 00:03:30' '2018-03-03 00:23:30' deep
'2018-03-03 00:23:30' '2018-03-03 02:58:00' light")
df$sleepdatestarttime<-as.POSIXct(as.character(df$sleepdatestarttime), tz="UTC")
df$sleepdateendtime<-as.POSIXct(as.character(df$sleepdateendtime), tz="UTC")
#find rows across days
rows<-which(as.Date(df$sleepdatestarttime) !=as.Date(df$sleepdateendtime))
#create the new rows
nstart<-data.frame(sleepdatestarttime= df$sleepdatestarttime[rows],
sleepdateendtime= as.POSIXct(paste(as.Date(df$sleepdatestarttime[rows]), "23:59:59"), tz="UTC"),
sleepstage=df$sleepstage[rows])
nend<-data.frame(sleepdatestarttime= as.POSIXct(paste(as.Date(df$sleepdateendtime[rows]), "00:00:00"), tz="UTC"),
sleepdateendtime= df$sleepdateendtime[rows],
sleepstage=df$sleepstage[rows])
#substitute in the new start rows
df[rows,]<-nstart
#tack on the new ending rows
df<-rbind(df, nend)
#resort the dataframe
df<-df[order(df$sleepdatestarttime ),]
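Since the question asked for dplyr, here is one possible tidyverse-style sketch of the same split (my own addition, not part of the answer above; it assumes the two columns are already POSIXct, as in the base R code):
library(dplyr)
library(lubridate)
# Rows that cross midnight, duplicated into a "before" and an "after" part.
# Assumes a row spans at most one midnight.
crossers <- df %>%
  filter(as.Date(sleepdatestarttime) != as.Date(sleepdateendtime))
before <- crossers %>%
  mutate(sleepdateendtime = ceiling_date(sleepdatestarttime, "day") - seconds(1))
after <- crossers %>%
  mutate(sleepdatestarttime = floor_date(sleepdateendtime, "day"))
df %>%
  filter(as.Date(sleepdatestarttime) == as.Date(sleepdateendtime)) %>%
  bind_rows(before, after) %>%
  arrange(sleepdatestarttime)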
This is a common issue in genomics. The IRanges package on Bioconductor has the findOverlaps() function for this purpose; foverlaps() is its data.table equivalent, which is used here. AFAIK, there is no dplyr equivalent available.
First we need to create a vector of day start and end times. The call to foverlaps() returns all possible types of overlaps. Finally, the start and end times are adjusted to match the expected result.
library(data.table)
library(lubridate)
day_seq <- setDT(df)[, .(day_start = seq(
floor_date(min(sleepdatestarttime), "day"),
ceiling_date(max(sleepdateendtime), "day"), "day"))][
, day_end := day_start + days(1)]
setkey(day_seq, day_start, day_end)
foverlaps(
df, day_seq, by.x = c("sleepdatestarttime", "sleepdateendtime"), nomatch = 0L)[
, `:=`(sleepdatestarttime = pmax(sleepdatestarttime, day_start),
sleepdateendtime = pmin(sleepdateendtime, day_end - seconds(1)))][
, c("day_start", "day_end") := NULL][]
i sleepdatestarttime sleepdateendtime sleepstage
1: 1 2018-03-02 23:31:00 2018-03-02 23:54:00 rem
2: 2 2018-03-02 23:54:00 2018-03-02 23:55:00 light
3: 3 2018-03-02 23:55:00 2018-03-02 23:59:59 wake
4: 3 2018-03-03 00:00:00 2018-03-03 00:02:00 wake
5: 4 2018-03-03 00:02:00 2018-03-03 00:03:30 light
6: 5 2018-03-03 00:03:30 2018-03-03 00:23:30 deep
7: 6 2018-03-03 00:23:30 2018-03-03 02:58:00 light
Data
df <- readr::read_table("i sleepdatestarttime sleepdateendtime sleepstage
1 2018-03-02 23:31:00 2018-03-02 23:54:00 rem
2 2018-03-02 23:54:00 2018-03-02 23:55:00 light
3 2018-03-02 23:55:00 2018-03-03 00:02:00 wake
4 2018-03-03 00:02:00 2018-03-03 00:03:30 light
5 2018-03-03 00:03:30 2018-03-03 00:23:30 deep
6 2018-03-03 00:23:30 2018-03-03 02:58:00 light")
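Depending on the readr version, read_table() may split the date and time fields into separate columns; if that happens, one alternative way to build the same df (my own sketch, not part of the original answer) is:
library(data.table)
df <- fread("i,sleepdatestarttime,sleepdateendtime,sleepstage
1,2018-03-02 23:31:00,2018-03-02 23:54:00,rem
2,2018-03-02 23:54:00,2018-03-02 23:55:00,light
3,2018-03-02 23:55:00,2018-03-03 00:02:00,wake
4,2018-03-03 00:02:00,2018-03-03 00:03:30,light
5,2018-03-03 00:03:30,2018-03-03 00:23:30,deep
6,2018-03-03 00:23:30,2018-03-03 02:58:00,light")
# make sure the two datetime columns are POSIXct before calling foverlaps()
df[, `:=`(sleepdatestarttime = as.POSIXct(sleepdatestarttime, tz = "UTC"),
          sleepdateendtime   = as.POSIXct(sleepdateendtime, tz = "UTC"))]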
Related
How to check if time is similar every 8 rows after floor_date function (and correct it if needed)
I have a dataset where new data are recorded at a fixed interval (3-4 minutes). Each 8 records (rows) correspond to the same set of data (CC_01->04 and DC_01->04) that I want to stamp to the previous half-hour. For this I use the floor_date function of lubridate, which works perfectly:
lubridate::floor_date(data$Date_IV, "30 minutes")
However, sometimes the eighth record starts after the beginning of the next half-hour, and so floor_date stamps it with this new half-hour. But I would like it to be stamped with the previous one (as part of the subset). Therefore I'm looking for a way to check when this eighth value differs from the previous 7, and to correct it if needed. An example:
   Label             Date_IV Obs.  Exp_Flux          Floor_date
1  CC_01 2021-07-08 12:38:00    1 -0.290000 2021-07-08 12:30:00
2  DC_01 2021-07-08 12:42:00    2  3.830000 2021-07-08 12:30:00
3  CC_02 2021-07-08 12:45:00    3 -0.527937 2021-07-08 12:30:00
4  DC_02 2021-07-08 12:49:00    4  2.260000 2021-07-08 12:30:00
5  CC_03 2021-07-08 12:52:00    5 -0.743471 2021-07-08 12:30:00
6  DC_03 2021-07-08 12:55:00    6  2.230000 2021-07-08 12:30:00
7  CC_04 2021-07-08 12:59:00    7 -1.510000 2021-07-08 12:30:00
8  DC_04 2021-07-08 13:02:00    8  1.820000 2021-07-08 13:00:00
9  CC_01 2021-07-08 13:05:00    9 -0.190000 2021-07-08 13:00:00
10 DC_01 2021-07-08 13:08:00   10  3.750000 2021-07-08 13:00:00
11 CC_02 2021-07-08 13:11:00   11 -0.423572 2021-07-08 13:00:00
12 DC_02 2021-07-08 13:14:00   12  2.230000 2021-07-08 13:00:00
13 CC_03 2021-07-08 13:18:00   13 -0.635882 2021-07-08 13:00:00
14 DC_03 2021-07-08 13:22:00   14  2.670000 2021-07-08 13:00:00
15 CC_04 2021-07-08 13:25:00   15 -1.440000 2021-07-08 13:00:00
16 DC_04 2021-07-08 13:29:00   16  1.860000 2021-07-08 13:00:00
In my example, the first 8 lines should be stamped to 12:30:00. The function works for the first 7, but the eighth is stamped to 13:00 because the record was made at 13:02. This situation doesn't appear in the second measurement set (lines 9->16): the last measurement started before the next half-hour, so all eight are stamped with 13:00, which is correct, and there is nothing to correct there. These measurements are repeated many times, so I cannot fix them by hand. I hope it makes sense. Thanks in advance for your help, Adrien
You can create a group of every 8 rows, or create a new group every time CC_01 occurs, whichever is most appropriate according to your data, and take the floor_date value of the first value in the group.
library(dplyr)
library(lubridate)
data %>%
  group_by(grp = ceiling(Obs/8)) %>%
  # Or increment the group value at every occurrence of CC_01
  # group_by(grp = cumsum(Label == 'CC_01')) %>%
  mutate(Floor_date = floor_date(first(Date_IV), '30 minutes')) %>%
  ungroup
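If the data has no running Obs counter, a hypothetical variant of the same idea (my own sketch, not part of the answer) groups every 8 consecutive rows by position instead:
library(dplyr)
library(lubridate)
# Group every 8 consecutive rows by row position and stamp each group with
# the half-hour of its first record.
data %>%
  group_by(grp = ceiling(row_number() / 8)) %>%
  mutate(Floor_date = floor_date(first(Date_IV), "30 minutes")) %>%
  ungroup()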
Evaluating Prophet model in R, using cross-validation
I am trying to cross-validate a Prophet model in R. The problem: this package does not work well with monthly data. I managed to build the model and even used a custom monthly seasonality, as recommended by the authors of this tool, but I cannot cross-validate monthly data. I tried to follow the recommendations in the GitHub issue, but I am missing something. Currently my code looks like this:
model1_cv <- cross_validation(model1, initial = 156, period = 365/12, as.difftime(horizon = 365/12, units = "days"))
Updated: Based on the answer to this question, I visualized the CV results. There are some problems here. I used full data and partial data. Also, the metrics do not look that good.
I just tested a bit with training data from the package, and from what I understood the package is not really well suited for monthly forecasts. This part:
[...] as.difftime(365/12, units = "days") [...]
seems to have been provided just to spell out the size of a month as 30-something days, meaning you can use that form instead of plain 365/12 for "period" and/or "horizon". One thing I noticed is that both arguments are described as integers, but when you look into the function they are passed through as.difftime(), so they are actually doubles.
library(dplyr)
library(prophet)
library(data.table)
# training data
df <- data.table::fread("ds y
1992-01-01 146376
1992-02-01 147079
1992-03-01 159336
1992-04-01 163669
1992-05-01 170068
1992-06-01 168663
1992-07-01 169890
1992-08-01 170364
1992-09-01 164617
1992-10-01 173655
1992-11-01 171547
1992-12-01 208838
1993-01-01 153221
1993-02-01 150087
1993-03-01 170439
1993-04-01 176456
1993-05-01 182231
1993-06-01 181535
1993-07-01 183682
1993-08-01 183318
1993-09-01 177406
1993-10-01 182737
1993-11-01 187443
1993-12-01 224540
1994-01-01 161349
1994-02-01 162841
1994-03-01 192319
1994-04-01 189569
1994-05-01 194927
1994-06-01 197946
1994-07-01 193355
1994-08-01 202388
1994-09-01 193954
1994-10-01 197956
1994-11-01 202520
1994-12-01 241111
1995-01-01 175344
1995-02-01 172138
1995-03-01 201279
1995-04-01 196039
1995-05-01 210478
1995-06-01 211844
1995-07-01 203411
1995-08-01 214248
1995-09-01 202122
1995-10-01 204044
1995-11-01 212190
1995-12-01 247491
1996-01-01 185019
1996-02-01 192380
1996-03-01 212110
1996-04-01 211718
1996-05-01 226936
1996-06-01 217511
1996-07-01 218111")
df <- df %>% dplyr::mutate(ds = as.Date(ds))
model <- prophet::prophet(df)
(tscv.myfit <- prophet::cross_validation(model, horizon = 365/12, units = "days",
                                         period = 365/12, initial = 365/12 * 12 * 3))
         y         ds     yhat yhat_lower yhat_upper              cutoff
 1: 175344 1995-01-01 170988.8   170145.9   171828.0 1994-12-31 02:00:00
 2: 172138 1995-02-01 178117.4   176975.2   179070.2 1995-01-30 12:00:00
 3: 201279 1995-03-01 211462.8   210277.4   212670.8 1995-01-30 12:00:00
 4: 196039 1995-04-01 200113.9   198079.5   201977.8 1995-03-01 22:00:00
 5: 210478 1995-05-01 202100.5   200390.8   203797.9 1995-04-01 08:00:00
 6: 211844 1995-06-01 208330.5   206229.9   210497.4 1995-05-01 18:00:00
 7: 203411 1995-07-01 202563.8   200786.5   204313.0 1995-06-01 04:00:00
 8: 214248 1995-08-01 214639.6   212748.3   216461.3 1995-07-01 14:00:00
 9: 202122 1995-09-01 204954.0   203048.9   206768.4 1995-08-31 12:00:00
10: 204044 1995-10-01 205097.5   203209.7   206882.3 1995-09-30 22:00:00
11: 212190 1995-11-01 213586.7   211728.1   215617.6 1995-10-31 08:00:00
12: 247491 1995-12-01 251518.8   249708.2   253589.2 1995-11-30 18:00:00
13: 185019 1996-01-01 182403.7   180520.1   184494.7 1995-12-31 04:00:00
14: 192380 1996-02-01 184722.9   182772.7   186686.9 1996-01-30 14:00:00
15: 212110 1996-03-01 205020.1   202823.2   206996.9 1996-01-30 14:00:00
16: 211718 1996-04-01 214514.0   211891.9   217175.3 1996-03-31 14:00:00
17: 226936 1996-05-01 218845.2   216133.8   221420.4 1996-03-31 14:00:00
18: 217511 1996-06-01 218672.2   216007.8   221459.9 1996-05-31 14:00:00
19: 218111 1996-07-01 221156.1   218540.7   224184.1 1996-05-31 14:00:00
The cutoff is not as regular as one would expect - I guess this is due to using average days per month somehow - though I could not figure out the logic. You can replace 365/12 with as.difftime(365/12, units = "days") and will get the same result.
But if you use (365+365+365+366)/48 instead (to account for February 29), you get a slightly different average month length, and this leads to a different output:
(tscv.myfit_2 <- prophet::cross_validation(model, horizon = (365+365+365+366)/48, units = "days",
                                           period = (365+365+365+366)/48,
                                           initial = (365+365+365+366)/48 * 12 * 3))
         y         ds     yhat yhat_lower yhat_upper              cutoff
 1: 172138 1995-02-01 178117.4   177075.3   179203.9 1995-01-29 13:30:00
 2: 201279 1995-03-01 211462.8   210340.5   212607.3 1995-01-29 13:30:00
 3: 196039 1995-04-01 200113.9   198022.6   202068.1 1995-03-31 13:30:00
 4: 210478 1995-05-01 204100.2   202009.8   206098.7 1995-03-31 13:30:00
 5: 211844 1995-06-01 208330.5   206114.5   210515.8 1995-05-31 13:30:00
 6: 203411 1995-07-01 202606.0   200319.1   204663.4 1995-05-31 13:30:00
 7: 214248 1995-08-01 214639.6   212684.4   216495.7 1995-07-31 22:30:00
 8: 202122 1995-09-01 204954.0   203127.7   206951.0 1995-08-31 09:00:00
 9: 204044 1995-10-01 205097.5   203285.3   207036.5 1995-09-30 19:30:00
10: 212190 1995-11-01 213586.7   211516.8   215516.2 1995-10-31 06:00:00
11: 247491 1995-12-01 251518.8   249658.3   253590.1 1995-11-30 16:30:00
12: 185019 1996-01-01 182403.7   180359.7   184399.2 1995-12-31 03:00:00
13: 192380 1996-02-01 184722.9   182652.4   186899.8 1996-01-30 13:30:00
14: 212110 1996-03-01 205020.1   203040.3   207171.9 1996-01-30 13:30:00
15: 211718 1996-04-01 214514.0   211942.6   217252.6 1996-03-31 13:30:00
16: 226936 1996-05-01 218845.2   216203.1   221506.5 1996-03-31 13:30:00
17: 217511 1996-06-01 218672.2   215823.9   221292.4 1996-05-31 13:30:00
18: 218111 1996-07-01 221156.1   218236.7   223862.0 1996-05-31 13:30:00
From this behaviour I would say the workaround is not ideal, especially depending on how exact you want the cross-validation to be in terms of rolling months. If you need the cutoff points to be exact, you could write your own function that always predicts one month from the starting point, collect these results, and build the final comparison yourself. I would trust that approach more than the workaround.
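If you do go that custom route, a rough sketch of the idea (my own addition, not code from the answer; it assumes the same monthly df with ds/y columns built above) could look like this:
library(dplyr)
library(prophet)
# Expanding-window cross-validation with exact monthly cutoffs: refit on all
# data up to each cutoff and predict exactly the next monthly timestamp.
df <- as.data.frame(df)
cutoff_idx <- seq(36, nrow(df) - 1)            # train on at least 3 years
manual_cv <- bind_rows(lapply(cutoff_idx, function(i) {
  m      <- prophet(df[1:i, ])                 # fit up to the cutoff
  future <- df[i + 1, "ds", drop = FALSE]      # the next monthly ds value
  fc     <- predict(m, future)
  data.frame(cutoff = df$ds[i], ds = future$ds,
             y = df$y[i + 1], yhat = fc$yhat)
}))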
How do I extract data from a data frame based on the months?
I have a data frame, df, that has date and two variables in it. I would like to either extract all of the Oct-Dec data or delete the other months' data from the data frame. I have put the data into a data frame, but at the moment I have the whole year and I just want to extract the wanted data. In future I will also be extracting just winter data. I have attached a chunk of my data frame; I tried using format() with just %m but couldn't get it to work.
14138 2017-09-15 4.655946e-01 0.0603515884
14139 2017-09-16 7.881137e-01 0.0479933304
14140 2017-09-17 5.018990e-01 0.0256871025
14141 2017-09-18 -1.583625e-01 -0.0040893990
14142 2017-09-19 -6.733220e-01 -0.0313100989
14143 2017-09-20 -1.225730e+00 -0.0587706331
14144 2017-09-21 -1.419133e+00 -0.0958125544
14145 2017-09-22 -1.338630e+00 -0.0902803173
14146 2017-09-23 -1.272554e+00 -0.0659170673
14147 2017-09-24 -1.132318e+00 -0.0387240370
14148 2017-09-25 -1.255414e+00 -0.0392615823
14149 2017-09-26 -1.497188e+00 -0.0438491356
14150 2017-09-27 -1.427622e+00 -0.0633879185
14151 2017-09-28 -1.051756e+00 -0.0992427127
14152 2017-09-29 -4.876309e-01 -0.1448044528
14153 2017-09-30 -6.829681e-02 -0.1749463647
14154 2017-10-01 -1.413768e-01 -0.2009916094
14155 2017-10-02 6.359742e-02 -0.1975848313
14156 2017-10-03 9.103277e-01 -0.1828581805
14157 2017-10-04 1.695776e+00 -0.1589352546
14158 2017-10-05 1.913918e+00 -0.1538234614
14159 2017-10-06 1.479714e+00 -0.1937094170
14160 2017-10-07 8.783669e-01 -0.1703790211
14161 2017-10-08 5.706581e-01 -0.1294144428
14162 2017-10-09 4.979405e-01 -0.0666569815
14163 2017-10-10 3.233477e-01 0.0072006102
14164 2017-10-11 3.057630e-01 0.0863445067
14165 2017-10-12 5.877673e-01 0.1097707831
14166 2017-10-13 1.208526e+00 0.1301967193
14167 2017-10-14 1.671705e+00 0.1728109268
14168 2017-10-15 1.810979e+00 0.2264911145
14169 2017-10-16 1.426651e+00 0.2702958315
14170 2017-10-17 1.241140e+00 0.3242637704
14171 2017-10-18 8.997498e-01 0.3879727861
14172 2017-10-19 5.594161e-01 0.4172990825
14173 2017-10-20 3.980254e-01 0.3915170864
14174 2017-10-21 2.138538e-01 0.3249736995
14175 2017-10-22 3.926440e-01 0.2224834840
14176 2017-10-23 2.268644e-01 0.0529143372
14177 2017-10-24 5.664923e-01 -0.0081443464
14178 2017-10-25 6.167520e-01 0.0312073984
14179 2017-10-26 7.751882e-02 0.0043897693
14180 2017-10-27 -5.634851e-02 -0.0726825266
14181 2017-10-28 -2.122061e-01 -0.1711305549
14182 2017-10-29 -8.500991e-01 -0.2068581639
14183 2017-10-30 -1.039685e+00 -0.2909120824
14184 2017-10-31 -3.057745e-01 -0.3933633317
14185 2017-11-01 -1.288774e-01 -0.3726346136
14186 2017-11-02 -5.608007e-03 -0.2425754386
14187 2017-11-03 4.853990e-01 -0.0503543980
14188 2017-11-04 5.822672e-01 0.0896130098
14189 2017-11-05 8.491505e-01 0.1299151006
14190 2017-11-06 1.052999e+00 0.0749888307
14191 2017-11-07 1.170470e+00 0.0287317882
14192 2017-11-08 7.919862e-01 0.0788187381
14193 2017-11-09 4.574565e-01 0.1539981316
14194 2017-11-10 4.552032e-01 0.2034393145
14195 2017-11-11 -3.621350e-01 0.2077476707
14196 2017-11-12 -8.053965e-01 0.1759558604
14197 2017-11-13 -8.307459e-01 0.1802858410
14198 2017-11-14 -9.421325e-01 0.2175529008
14199 2017-11-15 -9.880204e-01 0.2392924580
14200 2017-11-16 -7.448127e-01 0.2519253751
14201 2017-11-17 -8.081435e-01 0.2614254732
14202 2017-11-18 -1.216806e+00 0.2629971336
14203 2017-11-19 -1.122674e+00 0.3469995055
14204 2017-11-20 -1.242597e+00 0.4553094014
14205 2017-11-21 -1.294885e+00 0.5049438231
14206 2017-11-22 -9.325514e-01 0.4684133163
14207 2017-11-23 -4.632281e-01 0.4071673624
14208 2017-11-24 -9.689322e-02 0.3710270269
14209 2017-11-25 4.704467e-01 0.4126721465
14210 2017-11-26 8.682453e-01 0.3745057653
14211 2017-11-27 5.105564e-01 0.2373454931
14212 2017-11-28 4.747265e-01 0.1650783370
14213 2017-11-29 5.905379e-01 0.2632154120
14214 2017-11-30 4.083787e-01 0.3888834762
14215 2017-12-01 3.451736e-01 0.5008047592
14216 2017-12-02 5.161312e-01 0.5388177242
14217 2017-12-03 7.109279e-01 0.5515360710
14218 2017-12-04 4.458635e-01 0.5127537202
14219 2017-12-05 -3.986610e-01 0.3896493238
14220 2017-12-06 -5.968253e-01 0.1095843268
14221 2017-12-07 -1.604398e-01 -0.2455506506
14222 2017-12-08 -4.384744e-01 -0.5801038215
14223 2017-12-09 -7.255016e-01 -0.8384627087
14224 2017-12-10 -9.691828e-01 -0.9223171538
14225 2017-12-11 -1.140588e+00 -0.8177806761
14226 2017-12-12 -1.956622e-01 -0.5250998474
14227 2017-12-13 -1.083792e-01 -0.3430768534
14228 2017-12-14 -8.016345e-02 -0.3163476104
14229 2017-12-15 8.899266e-01 -0.2813253830
14230 2017-12-16 1.322833e+00 -0.2545953062
14231 2017-12-17 1.547972e+00 -0.2275373110
14232 2017-12-18 2.164907e+00 -0.3217205817
14233 2017-12-19 2.276258e+00 -0.5773412429
14234 2017-12-20 1.862291e+00 -0.7728091393
14235 2017-12-21 1.125083e+00 -0.9099696881
14236 2017-12-22 7.737118e-01 -1.2441963604
14237 2017-12-23 7.863508e-01 -1.4802661587
14238 2017-12-24 4.313111e-01 -1.4111320559
14239 2017-12-25 -8.814799e-02 -1.0024805520
14240 2017-12-26 -3.615127e-01 -0.4943077147
14241 2017-12-27 -5.011363e-01 -0.0308588186
14242 2017-12-28 -8.474088e-01 0.3717555895
14243 2017-12-29 -7.283247e-01 0.8230450219
14244 2017-12-30 -4.566981e-01 1.2495961116
14245 2017-12-31 -4.577034e-01 1.4805369230
14246 2018-01-01 1.946166e-01 1.5310004017
14247 2018-01-02 5.203149e-01 1.5384595802
14248 2018-01-03 5.024570e-02 1.4036679018
14249 2018-01-04 -7.065297e-01 1.0749574137
14250 2018-01-05 -8.741815e-01 0.7608524752
14251 2018-01-06 1.589530e-01 0.7891084646
14252 2018-01-07 8.632378e-01 1.1230358751
As requested, the class is "Date".
You can use lubridate and base R:
library(lubridate)
dats[month(ymd(dats$V2)) >= 10,]
# EDIT: if the class of the date variable is Date, it should just be
dats[month(dats$V2) >= 10,]
Or fully base R, without any date handling:
dats[substr(dats$V2, 6, 7) %in% c("10", "11", "12"),]
With data:
     V1         V2        V3         V4
1 14138 2017-09-15 0.4655946 0.06035159
2 14139 2017-09-16 0.7881137 0.04799333
...
From your question, it is unclear what format the date variable is in. Maybe add the output of class(your_date_variable) to the question. As a general rule, though, you'll want to use filter from the dplyr package. Something like this:
new_data <- data %>% filter(format(date_variable, "%m") >= 10)
This might change slightly depending on the class of your date variable.
Assuming 'date_variable' is of class Date, extract the month and do a comparison in filter (an action verb from dplyr):
library(dplyr)
library(lubridate)
data %>% filter(month(date_variable) >= 10)
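For the follow-up "winter data" (Dec-Feb) case mentioned in the question, a ">= 10" test no longer works because the months wrap around the year end; a small sketch of one way to handle it (my own addition):
library(dplyr)
library(lubridate)
# December, January and February: match the month numbers explicitly.
winter <- data %>%
  filter(month(date_variable) %in% c(12, 1, 2))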
How to plot lagged data against other data in R
I would like to lag one variable by, say, 10 time steps and plot it against the other variable, which remains the same. I would like to do this for various lags to see if there is a time period over which the first variable influences the other. The data I have is daily, and after lagging I am separating out the Dec-Feb data only. The problem I am having is that the plot and correlation between the lagged variable and the other data come out the same as the non-lagged plot and correlation every time. I am not sure how to achieve this. A sample of my data frame "data" can be seen below.
            Date             x             y
14158 2017-10-05 1.913918e+00 -0.1538234614
14159 2017-10-06 1.479714e+00 -0.1937094170
14160 2017-10-07 8.783669e-01 -0.1703790211
14161 2017-10-08 5.706581e-01 -0.1294144428
14162 2017-10-09 4.979405e-01 -0.0666569815
14163 2017-10-10 3.233477e-01 0.0072006102
14164 2017-10-11 3.057630e-01 0.0863445067
14165 2017-10-12 5.877673e-01 0.1097707831
14166 2017-10-13 1.208526e+00 0.1301967193
14167 2017-10-14 1.671705e+00 0.1728109268
14168 2017-10-15 1.810979e+00 0.2264911145
14169 2017-10-16 1.426651e+00 0.2702958315
14170 2017-10-17 1.241140e+00 0.3242637704
14171 2017-10-18 8.997498e-01 0.3879727861
14172 2017-10-19 5.594161e-01 0.4172990825
14173 2017-10-20 3.980254e-01 0.3915170864
14174 2017-10-21 2.138538e-01 0.3249736995
14175 2017-10-22 3.926440e-01 0.2224834840
14176 2017-10-23 2.268644e-01 0.0529143372
14177 2017-10-24 5.664923e-01 -0.0081443464
14178 2017-10-25 6.167520e-01 0.0312073984
14179 2017-10-26 7.751882e-02 0.0043897693
14180 2017-10-27 -5.634851e-02 -0.0726825266
14181 2017-10-28 -2.122061e-01 -0.1711305549
14182 2017-10-29 -8.500991e-01 -0.2068581639
14183 2017-10-30 -1.039685e+00 -0.2909120824
14184 2017-10-31 -3.057745e-01 -0.3933633317
14185 2017-11-01 -1.288774e-01 -0.3726346136
14186 2017-11-02 -5.608007e-03 -0.2425754386
14187 2017-11-03 4.853990e-01 -0.0503543980
14188 2017-11-04 5.822672e-01 0.0896130098
14189 2017-11-05 8.491505e-01 0.1299151006
14190 2017-11-06 1.052999e+00 0.0749888307
14191 2017-11-07 1.170470e+00 0.0287317882
14192 2017-11-08 7.919862e-01 0.0788187381
14193 2017-11-09 4.574565e-01 0.1539981316
14194 2017-11-10 4.552032e-01 0.2034393145
14195 2017-11-11 -3.621350e-01 0.2077476707
14196 2017-11-12 -8.053965e-01 0.1759558604
14197 2017-11-13 -8.307459e-01 0.1802858410
14198 2017-11-14 -9.421325e-01 0.2175529008
14199 2017-11-15 -9.880204e-01 0.2392924580
14200 2017-11-16 -7.448127e-01 0.2519253751
14201 2017-11-17 -8.081435e-01 0.2614254732
14202 2017-11-18 -1.216806e+00 0.2629971336
14203 2017-11-19 -1.122674e+00 0.3469995055
14204 2017-11-20 -1.242597e+00 0.4553094014
14205 2017-11-21 -1.294885e+00 0.5049438231
14206 2017-11-22 -9.325514e-01 0.4684133163
14207 2017-11-23 -4.632281e-01 0.4071673624
14208 2017-11-24 -9.689322e-02 0.3710270269
14209 2017-11-25 4.704467e-01 0.4126721465
14210 2017-11-26 8.682453e-01 0.3745057653
14211 2017-11-27 5.105564e-01 0.2373454931
14212 2017-11-28 4.747265e-01 0.1650783370
14213 2017-11-29 5.905379e-01 0.2632154120
14214 2017-11-30 4.083787e-01 0.3888834762
14215 2017-12-01 3.451736e-01 0.5008047592
14216 2017-12-02 5.161312e-01 0.5388177242
14217 2017-12-03 7.109279e-01 0.5515360710
14218 2017-12-04 4.458635e-01 0.5127537202
14219 2017-12-05 -3.986610e-01 0.3896493238
14220 2017-12-06 -5.968253e-01 0.1095843268
14221 2017-12-07 -1.604398e-01 -0.2455506506
14222 2017-12-08 -4.384744e-01 -0.5801038215
14223 2017-12-09 -7.255016e-01 -0.8384627087
14224 2017-12-10 -9.691828e-01 -0.9223171538
14225 2017-12-11 -1.140588e+00 -0.8177806761
14226 2017-12-12 -1.956622e-01 -0.5250998474
14227 2017-12-13 -1.083792e-01 -0.3430768534
14228 2017-12-14 -8.016345e-02 -0.3163476104
14229 2017-12-15 8.899266e-01 -0.2813253830
14230 2017-12-16 1.322833e+00 -0.2545953062
14231 2017-12-17 1.547972e+00 -0.2275373110
14232 2017-12-18 2.164907e+00 -0.3217205817
14233 2017-12-19 2.276258e+00 -0.5773412429
14234 2017-12-20 1.862291e+00 -0.7728091393
14235 2017-12-21 1.125083e+00 -0.9099696881
14236 2017-12-22 7.737118e-01 -1.2441963604
14237 2017-12-23 7.863508e-01 -1.4802661587
14238 2017-12-24 4.313111e-01 -1.4111320559
14239 2017-12-25 -8.814799e-02 -1.0024805520
14240 2017-12-26 -3.615127e-01 -0.4943077147
14241 2017-12-27 -5.011363e-01 -0.0308588186
14242 2017-12-28 -8.474088e-01 0.3717555895
14243 2017-12-29 -7.283247e-01 0.8230450219
14244 2017-12-30 -4.566981e-01 1.2495961116
14245 2017-12-31 -4.577034e-01 1.4805369230
14246 2018-01-01 1.946166e-01 1.5310004017
14247 2018-01-02 5.203149e-01 1.5384595802
14248 2018-01-03 5.024570e-02 1.4036679018
14249 2018-01-04 -7.065297e-01 1.0749574137
14250 2018-01-05 -8.741815e-01 0.7608524752
14251 2018-01-06 1.589530e-01 0.7891084646
14252 2018-01-07 8.632378e-01 1.1230358751
I am using
lagged <- lag(ts(x), k=10)
This is so the tsp isn't ignored. However, when I do cor(data$x, data$y) and cor(lagged, data$y) the result is the same, whereas I would have thought it would be different. How do I get this lag to work before I go ahead and separate by date? Many thanks!
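A note on why the correlation does not change (my own sketch, not an answer from the original page): stats::lag() on a ts object only shifts the tsp (time) attribute, so the stored values, and hence cor(), are identical to the unlagged series. Shifting the vector itself and trimming the unmatched ends gives a genuinely lagged comparison; assuming the data frame above is called data:
# lag x by k days relative to y, then compare the aligned pieces
k <- 10
x_lagged <- head(data$x, -k)   # x observed k days earlier
y_now    <- tail(data$y, -k)   # y aligned with the later dates
cor(x_lagged, y_now)
plot(x_lagged, y_now, xlab = paste("x lagged by", k, "days"), ylab = "y")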
How to insert missing dates/times using R based on criteria?
Consider the data frame below. Three staff members have hourly readings on several days, but the records are incomplete (each staff member should have 24 readings per day), and the staff members have different numbers of readings on each day. I am only interested in the staff member with the most readings on each day. There are many days. I want to insert the missing (hourly) rows only for the staff member with the most readings on each day; that is, for 2018-03-02 insert only for Jack, for 2018-03-03 only for David, and for 2018-03-04 only for Kate. I tried these lines from this question (even though they fill in all staff without differentiation) but am not getting there. How can it be done in R?
date_time <- c("2/3/2018 0:00","2/3/2018 1:00","2/3/2018 2:00","2/3/2018 3:00","2/3/2018 5:00",
  "2/3/2018 6:00","2/3/2018 7:00","2/3/2018 8:00","2/3/2018 9:00","2/3/2018 10:00",
  "2/3/2018 11:00","2/3/2018 12:00","2/3/2018 13:00","2/3/2018 14:00","2/3/2018 16:00",
  "2/3/2018 17:00","2/3/2018 18:00","2/3/2018 19:00","2/3/2018 21:00","2/3/2018 22:00",
  "2/3/2018 23:00","3/3/2018 0:00","3/3/2018 0:00","3/3/2018 1:00","3/3/2018 2:00",
  "3/3/2018 4:00","3/3/2018 5:00","3/3/2018 7:00","3/3/2018 8:00","3/3/2018 9:00",
  "3/3/2018 11:00","3/3/2018 12:00","3/3/2018 14:00","3/3/2018 15:00","3/3/2018 17:00",
  "3/3/2018 18:00","3/3/2018 20:00","3/3/2018 22:00","3/3/2018 23:00","4/3/2018 0:00",
  "4/3/2018 0:00","4/3/2018 1:00","4/3/2018 2:00","4/3/2018 3:00","4/3/2018 5:00",
  "4/3/2018 6:00","4/3/2018 7:00","4/3/2018 8:00","4/3/2018 10:00","4/3/2018 11:00",
  "4/3/2018 12:00","4/3/2018 14:00","4/3/2018 15:00","4/3/2018 16:00","4/3/2018 17:00",
  "4/3/2018 19:00","4/3/2018 20:00","4/3/2018 22:00","4/3/2018 23:00")
staff <- c("Jack","Jack","Kate","Jack","Jack","Jack","Jack","Jack","Jack","Jack","Jack","Jack",
  "Kate","Jack","Jack","Jack","David","David","Jack","Kate","David","David","David","David",
  "David","David","David","David","David","David","David","David","David","David","David",
  "David","David","Jack","Kate","David","David","Kate","Kate","Kate","Kate","Kate","Kate",
  "Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Jack")
reading <- c(7.5,8.3,7,6.9,7.1,8.1,8.4,8.8,6,7.1,8.9,7.3,7.4,6.9,11.3,18.8,4.6,6.7,7.7,7.8,
  7,7,6.6,6.8,6.7,6.1,7.1,6.3,7.2,6,5.8,6.6,6.5,6.4,7.2,8.4,6.5,6.5,5.5,6.7,
  7,7.5,6.5,7.5,7.2,6.3,7.3,8,7,8.2,6.5,6.8,7.5,7,6.1,5.7,6.7,4.3,6.3)
df <- data.frame(date_time, staff, reading)
The option would be to do this separately. Create a data.table of the dates of interest and the corresponding 'staff', and get the full sequence of date-times; then rbind this with the original dataset and, using a condition, summarise the data.
library(data.table)
stf <- c("Jack", "David", "Kate")
date <- as.Date(c("2018-03-02", "2018-03-03", "2018-03-04"))
df1 <- data.table(date, staff = stf)[, .(date_time = seq(as.POSIXct(paste(date, "00:00:00"),
         tz = "GMT"), length.out = 24, by = "1 hour")), staff]
setDT(df)[, date_time := as.POSIXct(date_time, "%d/%m/%Y %H:%M", tz = "GMT")]
res <- rbindlist(list(df, df1), fill = TRUE)[, .(reading = if(any(is.na(reading)))
         sum(reading, na.rm = TRUE) else reading), .(staff, date_time)]
table(res$staff, as.Date(res$date_time))
#         2018-03-02 2018-03-03 2018-03-04
#   David          3         24          2
#   Jack          24          1          1
#   Kate           3          1         24
head(res)
#    staff           date_time reading
# 1:  Jack 2018-03-02 00:00:00     7.5
# 2:  Jack 2018-03-02 01:00:00     8.3
# 3:  Kate 2018-03-02 02:00:00     7.0
# 4:  Jack 2018-03-02 03:00:00     6.9
# 5:  Jack 2018-03-02 05:00:00     7.1
# 6:  Jack 2018-03-02 06:00:00     8.1
tail(res)
#    staff           date_time reading
# 1:  Kate 2018-03-04 04:00:00       0
# 2:  Kate 2018-03-04 09:00:00       0
# 3:  Kate 2018-03-04 13:00:00       0
# 4:  Kate 2018-03-04 18:00:00       0
# 5:  Kate 2018-03-04 21:00:00       0
# 6:  Kate 2018-03-04 23:00:00       0
Try this code. Identify each daily hour and all staff members:
date_h <- seq(as.POSIXlt(min(date_time), format = "%d/%m/%Y %H:%M"),
              as.POSIXlt(max(date_time), format = "%d/%m/%Y %H:%M"), by = 60*60)
staff_u <- unique(staff)
comb <- expand.grid(staff_u, date_h)
colnames(comb) <- c("staff", "date_time")
Make the date format in df uniform:
df$date_time <- as.POSIXlt(df$date_time, format = "%d/%m/%Y %H:%M")
Merge the information:
out <- merge(comb, df, all.x = T)
Your output:
head(out)
  staff           date_time reading
1  Jack 2018-03-02 00:00:00     7.5
2  Jack 2018-03-02 01:00:00     8.3
3  Jack 2018-03-02 02:00:00      NA
4  Jack 2018-03-02 03:00:00     6.9
5  Jack 2018-03-02 04:00:00      NA
6  Jack 2018-03-02 05:00:00     7.1
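A tidyverse-style sketch that also restricts the fill to the staff member with the most readings per day, as the question asks (my own addition, not from either answer; it assumes the df defined in the question):
library(dplyr)
library(tidyr)
library(lubridate)
# Parse the timestamps and note the calendar day of each reading
df2 <- df %>%
  mutate(date_time = dmy_hm(date_time), day = as.Date(date_time))
# The staff member with the most readings on each day
top_staff <- df2 %>%
  count(day, staff) %>%
  group_by(day) %>%
  slice_max(n, n = 1) %>%
  ungroup() %>%
  select(day, staff)
# A full 24-hour grid for those staff/day pairs, joined back to the readings
grid <- top_staff %>%
  crossing(hour = 0:23) %>%
  mutate(date_time = as_datetime(day) + hours(hour)) %>%
  select(staff, day, date_time)
filled <- grid %>%
  left_join(df2, by = c("staff", "day", "date_time"))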