I have a time series that spans almost 20 years at a 15-minute resolution.
I want to extract only hourly values (00:00:00, 01:00:00, and so on...) and plot the resulting time series.
The df looks like this:
3 columns: date, time, and discharge
How would you approach this?
A reproducible example would be good for this kind of question. Here is my code; I hope it helps:
# creating dummy data
df <- data.frame(
  time = seq(as.POSIXct("2018-01-01 00:00:00"),
             as.POSIXct("2018-01-01 23:59:59"),
             by = "15 min"),
  variable = runif(96, 0, 1)
)
Example output (first 5 rows):
time variable
1 2018-01-01 00:00:00 0.331546992
2 2018-01-01 00:15:00 0.407269290
3 2018-01-01 00:30:00 0.635367577
4 2018-01-01 00:45:00 0.808612045
5 2018-01-01 01:00:00 0.258801201
library(dplyr)
library(ggplot2)

df %>% filter(format(time, "%M:%S") == "00:00")
output:
1 2018-01-01 00:00:00 0.76198532
2 2018-01-01 01:00:00 0.01304103
3 2018-01-01 02:00:00 0.10729465
4 2018-01-01 03:00:00 0.74534184
5 2018-01-01 04:00:00 0.25942667
df %>%
  filter(format(time, "%M:%S") == "00:00") %>%
  ggplot(aes(x = time, y = variable)) +
  geom_line()
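Since your real df keeps date and time in separate columns, here is a sketch of the same idea on that layout; it assumes the columns are character vectors such as "2003-06-01" and "00:15:00" (adjust the format strings if yours differ):

library(dplyr)
library(ggplot2)

# Assumed layout: date ("2003-06-01"), time ("00:15:00"), discharge (numeric).
# Paste the two columns together, parse to POSIXct, then keep the full hours.
df_hourly <- df %>%
  mutate(datetime = as.POSIXct(paste(date, time),
                               format = "%Y-%m-%d %H:%M:%S")) %>%
  filter(format(datetime, "%M:%S") == "00:00")

ggplot(df_hourly, aes(x = datetime, y = discharge)) +
  geom_line()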
I am working on a project where I have to include only patients who had lab tests ordered at least 12 hours apart, and to keep the timestamp of each included lab test. The issue is that many patients get several labs done within the 12-hour window, but the client has asked not to include those tests. I have made it this far:
#Create dummy dataset
df = data.frame(
"Encounter" = c(rep("12345", times=16), rep("67890", times = 5)),
"Timestamp" = c("01/06/2022 04:00:00", "01/07/2022 08:00:00",
"01/08/2022 00:00:00", "01/08/2022 04:00:00",
"01/08/2022 08:00:00", "01/08/2022 20:00:00",
"01/09/2022 04:00:00", "01/09/2022 08:00:00",
"01/09/2022 20:00:00", "01/09/2022 23:26:00",
"01/10/2022 00:00:00", "01/10/2022 08:00:00",
"01/10/2022 20:00:00", "01/11/2022 00:00:00",
"01/11/2022 20:00:00", "01/12/2022 04:00:00",
"11/10/2021 11:00:00", "11/10/2021 12:00:00",
"11/10/2021 13:00:00", "11/10/2021 14:00:00",
"11/11/2021 00:00:00"))
#Convert Timestamp to POSIXct (the timestamps include seconds, and dplyr does not support POSIXlt columns)
df$Timestamp <- as.POSIXct(df$Timestamp, format = "%m/%d/%Y %H:%M:%S")
#Calculate hours between each timestamp and the previous one, by Encounter
library(dplyr)
df <- df %>%
  group_by(Encounter) %>%
  arrange(Encounter, Timestamp) %>%
  mutate(time_diff = difftime(Timestamp, lag(Timestamp), units = "hours"))
I can't seem to figure out what to do next. It seems like I need a rolling 12-hour sum that resets to 0 once a row reaches 12 hours, but I'm not sure how to go about it. Below is my ideal result:
df$Keep.Row <- c(1,1,1,0,0,1,0,1,1,0,0,1,1,0,1,0,1,0,0,0,1)
There is absolutely nothing elegant about this, but I believe it gives you what you're looking for. I use a temporary variable to store the "rolling" sum before it's reset once the hours between reach 12 or more.
library(tidyverse)

df <- df %>%
  group_by(Encounter) %>%
  arrange(Encounter, Timestamp) %>%
  mutate(time_diff = difftime(Timestamp, lag(Timestamp), units = "hours")) %>%
  replace_na(list(time_diff = 0)) %>%
  mutate(temp = ifelse(time_diff < 12 & lag(time_diff) >= 12,
                       time_diff, lag(time_diff) + time_diff),
         temp = ifelse(is.na(temp), 0, temp),
         hours_between = ifelse(time_diff >= 12, time_diff,
                                ifelse(time_diff < 12 & lag(time_diff) >= 12,
                                       time_diff, lag(temp) + time_diff)),
         keep = ifelse(hours_between >= 12 | is.na(hours_between), 1, 0)) %>%
  select(-temp)
Created on 2022-01-27 by the reprex package (v2.0.1)
Here is an alternative option using accumulate. You can use your differences, and once they reach the 12-hour threshold, reset by just using the diff value (starting over) instead of the cumulative sum. To include the first time for each Encounter, you can either make that first diff 12 hours (as below), or add a separate mutate that checks where Timestamp == first(Timestamp) and sets keep to 1 in those cases (see the sketch after the output).
library(tidyverse)

thresh <- 12

df %>%
  group_by(Encounter) %>%
  arrange(Encounter, Timestamp) %>%
  mutate(diff = difftime(Timestamp,
                         lag(Timestamp, default = first(Timestamp) - (thresh * 60 * 60)),
                         units = "hours"),
         keep = +(accumulate(diff, ~ if_else(.x >= thresh, .y, .x + .y)) >= thresh))
Output
Encounter Timestamp diff keep
<chr> <dttm> <drtn> <int>
1 12345 2022-01-06 04:00:00 12.0000000 hours 1
2 12345 2022-01-07 08:00:00 28.0000000 hours 1
3 12345 2022-01-08 00:00:00 16.0000000 hours 1
4 12345 2022-01-08 04:00:00 4.0000000 hours 0
5 12345 2022-01-08 08:00:00 4.0000000 hours 0
6 12345 2022-01-08 20:00:00 12.0000000 hours 1
7 12345 2022-01-09 04:00:00 8.0000000 hours 0
8 12345 2022-01-09 08:00:00 4.0000000 hours 1
9 12345 2022-01-09 20:00:00 12.0000000 hours 1
10 12345 2022-01-09 23:26:00 3.4333333 hours 0
11 12345 2022-01-10 00:00:00 0.5666667 hours 0
12 12345 2022-01-10 08:00:00 8.0000000 hours 1
13 12345 2022-01-10 20:00:00 12.0000000 hours 1
14 12345 2022-01-11 00:00:00 4.0000000 hours 0
15 12345 2022-01-11 20:00:00 20.0000000 hours 1
16 12345 2022-01-12 04:00:00 8.0000000 hours 0
17 67890 2021-11-10 11:00:00 12.0000000 hours 1
18 67890 2021-11-10 12:00:00 1.0000000 hours 0
19 67890 2021-11-10 13:00:00 1.0000000 hours 0
20 67890 2021-11-10 14:00:00 1.0000000 hours 0
21 67890 2021-11-11 00:00:00 10.0000000 hours 1
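And a sketch of the second option mentioned above, flagging the first row per Encounter in a separate step instead of faking a 12-hour first diff (same data and column names as the output above):

library(tidyverse)

thresh <- 12

df %>%
  group_by(Encounter) %>%
  arrange(Encounter, Timestamp) %>%
  mutate(diff = difftime(Timestamp, lag(Timestamp, default = first(Timestamp)),
                         units = "hours"),
         # the first diff per Encounter is 0 here, so it never passes the threshold...
         keep = +(accumulate(diff, ~ if_else(.x >= thresh, .y, .x + .y)) >= thresh),
         # ...and the first row per Encounter is then forced to keep = 1 explicitly
         keep = if_else(Timestamp == first(Timestamp), 1L, keep))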
Probably missing something, but wouldn't this work:
library(dplyr)

df %>%
  group_by(Encounter) %>%
  arrange(Encounter, Timestamp) %>%
  mutate(time_dif = difftime(Timestamp, lag(Timestamp), units = "hours")) %>%
  filter(time_dif >= 12)
I have a dataset with a date-time vector (format is m/d/y h:m) that looks like this:
june2018_2$datetime
[1] "6/1/2018 1:00" "6/1/2018 2:00" "6/1/2018 3:00" "6/1/2018 4:00"
And I have 61 other variables, all numeric (some with missing values already indicated by NA). My date-time vector is missing some hourly slots, and I want to complete it and fill the corresponding spots in the other 61 variables with NA. I tried what's already out there, but I can't find code or a function that works for my specific case. Any tips?
If your datetime is not already POSIXct, you can convert it inside mutate(). With complete() you can then fill in rows by the hour; the other columns in the data frame will be NA for the added rows.
library(tidyverse)

df %>%
  mutate(datetime = as.POSIXct(datetime, format = "%m/%d/%Y %H:%M")) %>%
  complete(datetime = seq(from = first(datetime), to = last(datetime), by = "hours"))
For example, if you have test data:
set.seed(123)
df <- data.frame(
datetime = c("6/1/2018 1:00", "6/1/2018 3:00", "6/1/2018 5:00", "6/1/2018 9:00"),
var1 = sample(10,4)
)
The output would be:
# A tibble: 9 x 2
datetime var1
<dttm> <int>
1 2018-06-01 01:00:00 3
2 2018-06-01 02:00:00 NA
3 2018-06-01 03:00:00 10
4 2018-06-01 04:00:00 NA
5 2018-06-01 05:00:00 2
6 2018-06-01 06:00:00 NA
7 2018-06-01 07:00:00 NA
8 2018-06-01 08:00:00 NA
9 2018-06-01 09:00:00 8
I have a data frame containing some daily data timestamped at midnight on each day and some hourly data timestamped at the beginning of each hour throughout the day. I want to expand the data so it's all hourly, and I'd like to do so within a tidyverse "pipe chain".
My thought was to create a data frame containing the full hourly time series and then dplyr::right_join() my data against this time series. I thought this would populate the proper values where there was a match for the daily data (at midnight) and populate NA wherever there was no match (any hour except midnight). This seems to work only when the time series in my data is daily only, rather than a mix of daily and hourly values, which was unexpected. Why does the right join not expand the daily time series when it coexists in a data frame along with another hourly time series?
I've generated a minimal example below. My representative data set that I want to expand is named allData and contains a mix of daily and hourly datasets from two different time series variables, Daily TS and Hourly TS.
dailyData <- data.frame(
DateTime = seq.POSIXt(lubridate::ymd_hms('2019-01-01', truncated=3),
lubridate::ymd_hms('2019-01-07', truncated=3),
by='day'),
Name = 'Daily TS'
)
allHours <- data.frame(
DateTime = seq.POSIXt(lubridate::ymd_hms('2019-01-01', truncated=3),
lubridate::ymd_hms('2019-01-07 23:00:00'),
by='hour')
)
hourlyData <- allHours %>%
dplyr::mutate( Name = 'Hourly TS' )
allData <- rbind( dailyData, hourlyData )
This gives
head( allData, n=15 )
DateTime Name
1 2019-01-01 00:00:00 Daily TS
2 2019-01-02 00:00:00 Daily TS
3 2019-01-03 00:00:00 Daily TS
4 2019-01-04 00:00:00 Daily TS
5 2019-01-05 00:00:00 Daily TS
6 2019-01-06 00:00:00 Daily TS
7 2019-01-07 00:00:00 Daily TS
8 2019-01-01 00:00:00 Hourly TS
9 2019-01-01 01:00:00 Hourly TS
10 2019-01-01 02:00:00 Hourly TS
11 2019-01-01 03:00:00 Hourly TS
12 2019-01-01 04:00:00 Hourly TS
13 2019-01-01 05:00:00 Hourly TS
14 2019-01-01 06:00:00 Hourly TS
15 2019-01-01 07:00:00 Hourly TS
Now, I thought that dplyr::right_join() of the full hourly sequence of POSIXct values against allData$DateTime would have expanded the daily time series, leaving NA values for any hours not explicitly present in the data. I could then use tidyr::fill() to fill these in over the day. However, the following code does not behave this way:
expanded_BAD <- allData %>%
dplyr::right_join( allHours, by='DateTime' ) %>%
tidyr::fill( dplyr::everything(), .direction='down' ) %>%
dplyr::arrange( Name, DateTime )
expanded_BAD shows that the daily data hasn't been expanded by the right_join(). That is, the hours in allHours missing from allData were not retained in the result, which I thought was the whole purpose of using a right join. Here's the head of the result:
head(expanded_BAD, n=15)
DateTime Name
1 2019-01-01 00:00:00 Daily TS
2 2019-01-02 00:00:00 Daily TS
3 2019-01-03 00:00:00 Daily TS
4 2019-01-04 00:00:00 Daily TS
5 2019-01-05 00:00:00 Daily TS
6 2019-01-06 00:00:00 Daily TS
7 2019-01-07 00:00:00 Daily TS
8 2019-01-01 00:00:00 Hourly TS
9 2019-01-01 01:00:00 Hourly TS
10 2019-01-01 02:00:00 Hourly TS
11 2019-01-01 03:00:00 Hourly TS
12 2019-01-01 04:00:00 Hourly TS
13 2019-01-01 05:00:00 Hourly TS
14 2019-01-01 06:00:00 Hourly TS
15 2019-01-01 07:00:00 Hourly TS
Interestingly, if we perform the exact same right join on only the daily data, we get the desired result:
dailyData_expanded_GOOD <- dailyData %>%
dplyr::right_join( allHours, by='DateTime' ) %>%
tidyr::fill( dplyr::everything(), .direction='down' )
Here's the head:
head(dailyData_expanded_GOOD, n=15)
DateTime Name
1 2019-01-01 00:00:00 Daily TS
2 2019-01-01 01:00:00 Daily TS
3 2019-01-01 02:00:00 Daily TS
4 2019-01-01 03:00:00 Daily TS
5 2019-01-01 04:00:00 Daily TS
6 2019-01-01 05:00:00 Daily TS
7 2019-01-01 06:00:00 Daily TS
8 2019-01-01 07:00:00 Daily TS
9 2019-01-01 08:00:00 Daily TS
10 2019-01-01 09:00:00 Daily TS
11 2019-01-01 10:00:00 Daily TS
12 2019-01-01 11:00:00 Daily TS
13 2019-01-01 12:00:00 Daily TS
14 2019-01-01 13:00:00 Daily TS
15 2019-01-01 14:00:00 Daily TS
Why does the right join do different things on the full data compared to only the daily data?
I think the problem is that you are trying to bind the dataframes together too soon. I believe this gives you what you want:
result <- bind_rows(dailyData_expanded_GOOD, hourlyData)
head(result)
#> DateTime Name
#> 1 2019-01-01 00:00:00 Daily TS
#> 2 2019-01-01 01:00:00 Daily TS
#> 3 2019-01-01 02:00:00 Daily TS
#> 4 2019-01-01 03:00:00 Daily TS
#> 5 2019-01-01 04:00:00 Daily TS
#> 6 2019-01-01 05:00:00 Daily TS
The reason right_join() doesn't work is that allHours matches perfectly the rows in allData for the hourly time series. From ?right_join:
return all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
You're hoping that rows in x with no match in y will have NA values, but the rows in y do match rows in x already. There are actually multiple matches, one for the daily and one for the hourly, but right_join() just returns both without expanding the daily time series rows.
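To see that behavior in isolation, here is a minimal sketch (my own toy data, not from the question): both rows of x that share the key are returned as-is, and NA appears only for the y key with no match at all:

library(dplyr)

x <- data.frame(key = c("a", "a"), src = c("daily", "hourly"))
y <- data.frame(key = c("a", "b"))

right_join(x, y, by = "key")
#>   key    src
#> 1   a  daily
#> 2   a hourly
#> 3   b   <NA>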
This is different from the situation in this question, where the datetimes to be expanded do not occur in the left hand data frame. Then the strategy of merging would expand your result as expected.
So that explains why a bare right_join() doesn't work, but doesn't solve
the problem because you have to manually split up the data, and that would
get old fast if there are varying numbers of time series. There are a couple solutions in comments, and then one additional that I will add below.
tidyr::expand()
expandedData <- allData %>%
tidyr::expand( DateTime, Name ) %>%
dplyr::arrange( Name, DateTime )
This works, but only with both time series present. If there is only
dailyData, then the result is not expanded.
The kitchen sink
expandedData1 <- allData %>%
dplyr::right_join(allHours, by = 'DateTime') %>%
tidyr::fill(everything()) %>%
tidyr::expand( DateTime, Name) %>%
dplyr::arrange( Name, DateTime )
As pointed out in the comments, this works for all cases - both types,
only daily data, only hourly data. This solution and the next generate
warnings unless you use stringsAsFactors = FALSE in the data.frame()
calls above.
The only issue with this solution is that fill() and right_join() are
only there to deal with edge cases. I don't know if that is a real problem
or not.
"Split" in the pipe
The simple solution splits the dataset, and this can be done inside the
pipe in a couple ways.
expandedData2 <- allData %>%
tidyr::nest(-Name) %>%
mutate(data = purrr::map(data, ~right_join(., allHours, by = 'DateTime'))) %>%
tidyr::unnest()
The other way would use base::split() and then purrr::map_dfr()
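A sketch of that variant (mine, not part of the original reprex), assuming the same allData and allHours as above and an attached tidyverse:

expandedData3 <- allData %>%
  split(.$Name) %>%
  purrr::map_dfr(~ .x %>%
                   dplyr::right_join(allHours, by = "DateTime") %>%
                   tidyr::fill(Name, .direction = "down")) %>%
  dplyr::arrange(Name, DateTime)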
Created on 2019-03-24 by the reprex package (v0.2.0).
I have a dataset with periods
library(data.table)
active <- data.table(
  id  = c(1, 1, 2, 3),
  beg = as.POSIXct(c("2018-01-01 01:10:00", "2018-01-01 01:50:00",
                     "2018-01-01 01:50:00", "2018-01-01 01:50:00")),
  end = as.POSIXct(c("2018-01-01 01:20:00", "2018-01-01 02:00:00",
                     "2018-01-01 02:00:00", "2018-01-01 02:00:00"))
)
> active
id beg end
1: 1 2018-01-01 01:10:00 2018-01-01 01:20:00
2: 1 2018-01-01 01:50:00 2018-01-01 02:00:00
3: 2 2018-01-01 01:50:00 2018-01-01 02:00:00
4: 3 2018-01-01 01:50:00 2018-01-01 02:00:00
during which an id was active. I would like to aggregate across ids and determine for every point in
time <- data.table(time = seq(from = min(active$beg), to = max(active$end), by = "mins"))
the number of IDs that are inactive and the average number of minutes until they get active. That is, ideally, the table looks like
>ans
time inactive av.time
1: 2018-01-01 01:10:00 2 30
2: 2018-01-01 01:11:00 2 29
...
50: 2018-01-01 02:00:00 0 0
I believe this can be done using data.table but I cannot figure out the syntax to get the time differences.
Using dplyr, we can join by a dummy variable to create the Cartesian product of time and active. The definitions of inactive and av.time might not be exactly what you're looking for, but it should get you started. If your data is very large, I agree that data.table will be a better way of handling this.
library(tidyverse)

time %>%
  mutate(dummy = TRUE) %>%
  # join by the dummy variable to get the Cartesian product
  inner_join(active %>% mutate(dummy = TRUE), by = "dummy") %>%
  select(-dummy) %>%
  # define what makes an id inactive and the time until it becomes active
  mutate(inactive = time < beg | time > end,
         TimeUntilActive = ifelse(beg > time, difftime(beg, time, units = "mins"), NA)) %>%
  # group by time and summarise
  group_by(time) %>%
  summarise(inactive = sum(inactive),
            av.time = mean(TimeUntilActive, na.rm = TRUE))
# A tibble: 51 x 3
time inactive av.time
<dttm> <int> <dbl>
1 2018-01-01 01:10:00 3 40
2 2018-01-01 01:11:00 3 39
3 2018-01-01 01:12:00 3 38
4 2018-01-01 01:13:00 3 37
5 2018-01-01 01:14:00 3 36
6 2018-01-01 01:15:00 3 35
7 2018-01-01 01:16:00 3 34
8 2018-01-01 01:17:00 3 33
9 2018-01-01 01:18:00 3 32
10 2018-01-01 01:19:00 3 31
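And since the question asked for data.table: a sketch of the same Cartesian-product idea in data.table syntax (not from the original answer; the definitions of inactive and av.time mirror the dplyr version above):

library(data.table)

# minute grid with a named column, as in the question
time <- data.table(time = seq(from = min(active$beg), to = max(active$end), by = "mins"))

# dummy-key join to form the Cartesian product of time points and periods
cj <- active[, .(id, beg, end, dummy = 1)][time[, .(time, dummy = 1)],
                                           on = "dummy", allow.cartesian = TRUE]

ans <- cj[, .(inactive = sum(time < beg | time > end),
              av.time  = mean(ifelse(beg > time,
                                     as.numeric(difftime(beg, time, units = "mins")),
                                     NA_real_),
                              na.rm = TRUE)),
          by = time]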
How do you set 0:00 as the end of day instead of 23:00 in hourly data? I have this struggle while using period.apply or to.period, as both return days ending at 23:00. Here is an example:
library(xts)
x1 <- xts(rnorm(120),
          order.by = seq(as.POSIXct("2018-02-01 00:00:00"),
                         as.POSIXct("2018-02-05 23:00:00"), by = "hour"))
The following functions show periods ending at 23:00:
to.period(x1, OHLC = FALSE, drop.date = FALSE, period = "days")
x1[endpoints(x1, 'days')]
So when I am aggregating the hourly data to daily, does someone have an idea how to set the end of day at 0:00?
As already pointed out by another answer here, to.period on days computes on the data with timestamps between 00:00:00 and 23:59:59.9999999 on the day in question. So 23:00:00 is seen as the last timestamp in your data, and 00:00:00 corresponds to a value in the next day's bin.
What you can do is shift all the timestamps back one hour, use to.period to get the daily data points from the hourly points, and then use align.time to get the timestamps aligned correctly.
(More generally, to.period is useful for generating OHLCV-type data, so if you're, say, generating hourly bars from ticks, it makes sense to look at all the ticks between 23:00:00 and 23:59:59.99999 in the bar creation; then 00:00:00 to 00:59:59.9999... would form the next hourly bar, and so on.)
Here is an example:
> tail(x1["2018-02-01"])
# [,1]
# 2018-02-01 18:00:00 -1.2760349
# 2018-02-01 19:00:00 -0.1496041
# 2018-02-01 20:00:00 -0.5989614
# 2018-02-01 21:00:00 -0.9691905
# 2018-02-01 22:00:00 -0.2519618
# 2018-02-01 23:00:00 -1.6081656
> head(x1["2018-02-02"])
# [,1]
# 2018-02-02 00:00:00 -0.3373271
# 2018-02-02 01:00:00 0.8312698
# 2018-02-02 02:00:00 0.9321747
# 2018-02-02 03:00:00 0.6719425
# 2018-02-02 04:00:00 -0.5597391
# 2018-02-02 05:00:00 -0.9810128
> head(x1["2018-02-03"])
# [,1]
# 2018-02-03 00:00:00 2.3746424
# 2018-02-03 01:00:00 0.8536594
# 2018-02-03 02:00:00 -0.2467268
# 2018-02-03 03:00:00 -0.1316978
# 2018-02-03 04:00:00 0.3079848
# 2018-02-03 05:00:00 0.2445634
x2 <- x1
.index(x2) <- .index(x1) - 3600
> tail(x2["2018-02-01"])
# [,1]
# 2018-02-01 18:00:00 -0.1496041
# 2018-02-01 19:00:00 -0.5989614
# 2018-02-01 20:00:00 -0.9691905
# 2018-02-01 21:00:00 -0.2519618
# 2018-02-01 22:00:00 -1.6081656
# 2018-02-01 23:00:00 -0.3373271
x.d2 <- to.period(x2, OHLC = FALSE, drop.date = FALSE, period = "days")
> x.d2
# [,1]
# 2018-01-31 23:00:00 0.12516594
# 2018-02-01 23:00:00 -0.33732710
# 2018-02-02 23:00:00 2.37464235
# 2018-02-03 23:00:00 0.51797747
# 2018-02-04 23:00:00 0.08955208
# 2018-02-05 22:00:00 0.33067734
x.d2 <- align.time(x.d2, n = 86400)
> x.d2
# [,1]
# 2018-02-01 0.12516594
# 2018-02-02 -0.33732710
# 2018-02-03 2.37464235
# 2018-02-04 0.51797747
# 2018-02-05 0.08955208
# 2018-02-06 0.33067734
Want to convince yourself? Try something like this:
x3 <- rbind(x1, xts(x = matrix(c(1,2), nrow = 2), order.by = as.POSIXct(c("2018-02-01 23:59:59.999", "2018-02-02 00:00:00"))))
x3["2018-02-01 23/2018-02-02 01"]
# [,1]
# 2018-02-01 23:00:00.000 -1.6081656
# 2018-02-01 23:59:59.999 1.0000000
# 2018-02-02 00:00:00.000 -0.3373271
# 2018-02-02 00:00:00.000 2.0000000
# 2018-02-02 01:00:00.000 0.8312698
x3.d <- to.period(x3, OHLC = FALSE, drop.date = FALSE, period = "days")
> x3.d <- align.time(x3.d, 86400)
> x3.d
[,1]
2018-02-02 1.00000000
2018-02-03 -0.09832625
2018-02-04 -0.65075506
2018-02-05 -0.09423664
2018-02-06 0.33067734
See that the value of 2 stamped at 2018-02-02 00:00:00 did not become the last observation of the day bin labelled 2018-02-02 (i.e. calendar day 2018-02-01, which runs from 2018-02-01 00:00:00 to 2018-02-01 23:59:59.9999); it fell into the next day's bin instead.
Of course, if you want the daily timestamp to mark the start of the day rather than the end (i.e. 2018-02-01 as the bar start for the first row of x3.d above), you can shift the index back by one day. You can do this relatively safely for most time zones, as long as your data doesn't involve weekend dates:
index(x3.d) = index(x3.d) - 86400
I say relatively safely because there are corner cases when a time zone shifts its clocks, e.g. daylight saving time. Simply subtracting 86400 can be a problem when going from Sunday to Saturday in time zones where daylight saving occurs:
# e.g. bad: daylight saving time starts on this weekend for US Eastern
z <- xts(x = 9, order.by = as.POSIXct("2018-03-12", tz = "America/New_York"))
> index(z) - 86400
[1] "2018-03-10 23:00:00 EST"
i.e. the timestamp is off by one hour, when you really want the midnight timestamp (00:00:00).
You could get around this problem using something much safer like this:
library(lubridate)
# right
> index(z) - days(1)
[1] "2018-03-11 EST"
I don't think this is possible because 00:00 is the start of the day. From the manual:
These endpoints are aligned in POSIXct time to the zero second of the day at the beginning, and the 59.9999th second of the 59th minute of the 23rd hour of the final day
I think the solution here is to use minutes instead of hours. Using your example:
x1 <- xts(rnorm(7200),
          order.by = seq(as.POSIXct("2018-02-01 00:00:00"),
                         as.POSIXct("2018-02-05 23:59:59"), by = "min"))
to.period(x1, OHLC = FALSE, drop.date = FALSE, period = "day")
x1[endpoints(x1, 'day')]