I have a data frame containing some daily data timestamped at midnight on each day and some hourly data timestamped at the beginning of each hour throughout the day. I want to expand the data so it's all hourly, and I'd like to do so within a tidyverse "pipe chain".
My thought was to create a data frame containing the full hourly time series and then dplyr::right_join() my data against this time series. I thought this would populate the proper values where there was a match for the daily data (at midnight) and populate NA wherever there was no match (any hour except midnight). This seems to work only when the time series in my data is daily only, rather than a mix of daily and hourly values, which was unexpected. Why does the right join not expand the daily time series when it coexists in a data frame along with another hourly time series?
I've generated a minimal example below. My representative data set that I want to expand is named allData and contains a mix of daily and hourly datasets from two different time series variables, Daily TS and Hourly TS.
dailyData <- data.frame(
DateTime = seq.POSIXt(lubridate::ymd_hms('2019-01-01', truncated=3),
lubridate::ymd_hms('2019-01-07', truncated=3),
by='day'),
Name = 'Daily TS'
)
allHours <- data.frame(
DateTime = seq.POSIXt(lubridate::ymd_hms('2019-01-01', truncated=3),
lubridate::ymd_hms('2019-01-07 23:00:00'),
by='hour')
)
hourlyData <- allHours %>%
dplyr::mutate( Name = 'Hourly TS' )
allData <- rbind( dailyData, hourlyData )
This gives
head( allData, n=15 )
DateTime Name
1 2019-01-01 00:00:00 Daily TS
2 2019-01-02 00:00:00 Daily TS
3 2019-01-03 00:00:00 Daily TS
4 2019-01-04 00:00:00 Daily TS
5 2019-01-05 00:00:00 Daily TS
6 2019-01-06 00:00:00 Daily TS
7 2019-01-07 00:00:00 Daily TS
8 2019-01-01 00:00:00 Hourly TS
9 2019-01-01 01:00:00 Hourly TS
10 2019-01-01 02:00:00 Hourly TS
11 2019-01-01 03:00:00 Hourly TS
12 2019-01-01 04:00:00 Hourly TS
13 2019-01-01 05:00:00 Hourly TS
14 2019-01-01 06:00:00 Hourly TS
15 2019-01-01 07:00:00 Hourly TS
Now, I thought that dplyr::right_join() of the full hourly sequence of POSIXct values against allData$DateTime would have expanded the daily time series, leaving NA values for any hours not explicitly present in the data. I could then use tidyr::fill() to fill these in over the day. However, the following code does not behave this way:
expanded_BAD <- allData %>%
dplyr::right_join( allHours, by='DateTime' ) %>%
tidyr::fill( dplyr::everything(), .direction='down' ) %>%
dplyr::arrange( Name, DateTime )
expanded_BAD shows that the daily data hasn't been expanded by the right_join(). That is, the hours in allHours missing from allData were not retained in the result, which I thought was the whole purpose of using a right join. Here's the head of the result:
head(expanded_BAD, n=15)
DateTime Name
1 2019-01-01 00:00:00 Daily TS
2 2019-01-02 00:00:00 Daily TS
3 2019-01-03 00:00:00 Daily TS
4 2019-01-04 00:00:00 Daily TS
5 2019-01-05 00:00:00 Daily TS
6 2019-01-06 00:00:00 Daily TS
7 2019-01-07 00:00:00 Daily TS
8 2019-01-01 00:00:00 Hourly TS
9 2019-01-01 01:00:00 Hourly TS
10 2019-01-01 02:00:00 Hourly TS
11 2019-01-01 03:00:00 Hourly TS
12 2019-01-01 04:00:00 Hourly TS
13 2019-01-01 05:00:00 Hourly TS
14 2019-01-01 06:00:00 Hourly TS
15 2019-01-01 07:00:00 Hourly TS
Interestingly, if we perform the exact same right join on only the daily data, we get the desired result:
dailyData_expanded_GOOD <- dailyData %>%
dplyr::right_join( allHours, by='DateTime' ) %>%
tidyr::fill( dplyr::everything(), .direction='down' )
Here's the head:
head(dailyData_expanded_GOOD, n=15)
DateTime Name
1 2019-01-01 00:00:00 Daily TS
2 2019-01-01 01:00:00 Daily TS
3 2019-01-01 02:00:00 Daily TS
4 2019-01-01 03:00:00 Daily TS
5 2019-01-01 04:00:00 Daily TS
6 2019-01-01 05:00:00 Daily TS
7 2019-01-01 06:00:00 Daily TS
8 2019-01-01 07:00:00 Daily TS
9 2019-01-01 08:00:00 Daily TS
10 2019-01-01 09:00:00 Daily TS
11 2019-01-01 10:00:00 Daily TS
12 2019-01-01 11:00:00 Daily TS
13 2019-01-01 12:00:00 Daily TS
14 2019-01-01 13:00:00 Daily TS
15 2019-01-01 14:00:00 Daily TS
Why does the right join do different things on the full data compared to only the daily data?
I think the problem is that you are trying to bind the dataframes together too soon. I believe this gives you what you want:
result <- bind_rows(dailyData_expanded_GOOD, hourlyData)
head(result)
#> DateTime Name
#> 1 2019-01-01 00:00:00 Daily TS
#> 2 2019-01-01 01:00:00 Daily TS
#> 3 2019-01-01 02:00:00 Daily TS
#> 4 2019-01-01 03:00:00 Daily TS
#> 5 2019-01-01 04:00:00 Daily TS
#> 6 2019-01-01 05:00:00 Daily TS
The reason right_join() doesn't work is that allHours perfectly matches the
rows in allData for the hourly time series. From ?right_join:
return all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
You're hoping that rows in x with no match in y will get NA values, but every row of y (allHours) already has a match in x, namely the hourly time series. At midnight there are even multiple matches, one daily and one hourly, and right_join() simply returns both combinations without expanding the daily time series rows.
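You can verify this with dplyr::anti_join(), which returns the rows of allHours that have no DateTime match in allData. Here it comes back empty, so the right join has nothing to pad with NA:

```r
# zero rows: every hour in allHours already matches the hourly series in allData,
# so right_join() never creates NA rows for the daily series
dplyr::anti_join(allHours, allData, by = "DateTime")
```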
This is different from the situation in this question, where the datetimes to be expanded do not occur in the left hand data frame. Then the strategy of merging would expand your result as expected.
So that explains why a bare right_join() doesn't work, but it doesn't solve
the problem, because you have to split the data up manually, and that would
get old fast if there are varying numbers of time series. There are a couple of solutions from the comments, and then one additional that I will add below.
tidyr::expand()
expandedData <- allData %>%
tidyr::expand( DateTime, Name ) %>%
dplyr::arrange( Name, DateTime )
This works, but only with both time series present. If there is only
dailyData, then the result is not expanded.
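To see the failure mode, run the same expand on dailyData alone; tidyr::expand() can only combine values that actually occur in the data, so no hourly rows appear:

```r
# only the 7 midnight timestamps come back; expand() cannot invent
# DateTime values that are absent from the input
dailyData %>%
  tidyr::expand( DateTime, Name )
```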
The kitchen sink
expandedData1 <- allData %>%
dplyr::right_join(allHours, by = 'DateTime') %>%
tidyr::fill(dplyr::everything()) %>%
tidyr::expand( DateTime, Name) %>%
dplyr::arrange( Name, DateTime )
As pointed out in the comments, this works for all cases - both types,
only daily data, only hourly data. This solution and the next generate
warnings unless you use stringsAsFactors = FALSE in the data.frame()
calls above.
The only issue with this solution is that fill() and right_join() are
only there to deal with edge cases. I don't know if that is a real problem
or not.
"Split" in the pipe
The simple solution splits the dataset, and this can be done inside the
pipe in a couple ways.
expandedData2 <- allData %>%
tidyr::nest(data = -Name) %>%
dplyr::mutate(data = purrr::map(data, ~dplyr::right_join(.x, allHours, by = 'DateTime'))) %>%
tidyr::unnest(data)
The other way would use base::split() and then purrr::map_dfr()
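That alternative might look like this (a sketch along the same lines, using the objects defined above):

```r
# split by series name, expand each piece against allHours, then re-combine
expandedData3 <- allData %>%
  split(.$Name) %>%
  purrr::map_dfr(
    ~ dplyr::right_join(.x, allHours, by = "DateTime") %>%
      tidyr::fill(Name, .direction = "downup")
  ) %>%
  dplyr::arrange(Name, DateTime)
```

Filling Name within each piece is safe because each piece contains a single series name.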
Created on 2019-03-24 by the reprex
package (v0.2.0).
Related
I have a time series, that spans almost 20 years with a resolution of 15 min.
I want to extract only hourly values (00:00:00, 01:00:00, and so on...) and plot the resulting time series.
The df looks like this:
3 columns: date, time, and discharge
How would you approach this?
A reproducible example would be good for this kind of question. Here is my code; I hope it helps:
#creating dummy data
df <- data.frame(time = seq(as.POSIXct("2018-01-01 00:00:00"), as.POSIXct("2018-01-01 23:59:59"), by = "15 min"), variable = runif(96, 0, 1))
example output: (only 5 rows)
time variable
1 2018-01-01 00:00:00 0.331546992
2 2018-01-01 00:15:00 0.407269290
3 2018-01-01 00:30:00 0.635367577
4 2018-01-01 00:45:00 0.808612045
5 2018-01-01 01:00:00 0.258801201
df %>% filter(format(time, "%M:%S") == "00:00")
output:
1 2018-01-01 00:00:00 0.76198532
2 2018-01-01 01:00:00 0.01304103
3 2018-01-01 02:00:00 0.10729465
4 2018-01-01 03:00:00 0.74534184
5 2018-01-01 04:00:00 0.25942667
# pipe straight into ggplot(); wrapping a ggplot object in plot() is unnecessary
df %>%
filter(format(time, "%M:%S") == "00:00") %>%
ggplot(aes(x = time, y = variable)) +
geom_line()
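If lubridate is available, an equivalent filter avoids the string formatting entirely (a sketch against the same df):

```r
library(dplyr)
library(ggplot2)
library(lubridate)

# keep only rows that fall exactly on the hour
df %>%
  filter(minute(time) == 0, second(time) == 0) %>%
  ggplot(aes(x = time, y = variable)) +
  geom_line()
```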
I am trying to separate the events from streamflow data. I have hourly data. I have run the code
dailyMQ <- data.frame(Date=seq(from=as.Date("01.01.2000", format="%d.%m.%Y"),
to=as.Date("01.01.2004", format="%d.%m.%Y"), by="days"),
discharge=rbeta(1462,2,20)*100)
for daily data. But when I try the same for hourly data I get errors.
Could anyone suggest how to write code for hourly data?
Thanks
The Date class can't be split into hours. You could use the POSIXct datetime format instead:
HourlyMQ <- data.frame(
  Date = seq(from = as.POSIXct("01.01.2019", format = "%d.%m.%Y"),
             to = as.POSIXct("11.12.2019", format = "%d.%m.%Y"),
             by = "hours"),
  discharge = rbeta(8257, 2, 20)
)
HourlyMQ
#> Date discharge
#> 1 2019-01-01 00:00:00 0.2452214482
#> 2 2019-01-01 01:00:00 0.0620291334
#> 3 2019-01-01 02:00:00 0.0608788870
#> 4 2019-01-01 03:00:00 0.0697449808
#> 5 2019-01-01 04:00:00 0.0302780135
I have the below dataframe (df) from ENTSO-E showing German power prices. I created the "Hour" column with the lubridate function hour(df$date); the output was the range (1, 2, ..., 23, 0).
# to replace 0 with 24
df["Hour"][df["Hour"]=="0"]<- "24"
I will need to work on an hourly basis. So I filtered each hour from 1 till 24, but I cannot filter the replaced hour: H24.
H1 <- df %>%
filter(Hour==1)
H24 <- df %>%
filter(Hour==24)
Error in match.fun(FUN) : object 'Hour' not found
The 24 values are still in the Hour column, and its class is numeric, but I cannot do any calculations with the Hour column.
class(df$Hour)
[1] "numeric"
mean(german_last_4$Hour)
[1] NA
I think the problem is with the replacement. Is there another way to produce a result that works for H24?
date                  price    Hour
2019-01-01 01:00:00   28.32       1
2019-01-01 02:00:00   10.07       2
2019-01-01 03:00:00   -4.08       3
2019-01-01 04:00:00   -9.91       4
2019-01-01 05:00:00   -7.41       5
2019-01-01 06:00:00  -12.55       6
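One likely cause (an assumption, since the full pipeline isn't shown): assigning the string "24" into a numeric column coerces data, which then breaks numeric filtering and arithmetic. A sketch that keeps the column numeric by replacing with the number 24 instead:

```r
library(dplyr)

# hypothetical reconstruction of the df described in the question
df <- data.frame(
  date  = as.POSIXct("2019-01-01 00:00:00") + 3600 * (1:6),
  price = c(28.32, 10.07, -4.08, -9.91, -7.41, -12.55),
  Hour  = c(1, 2, 3, 4, 5, 0)
)

# replace 0 with the number 24 (not the string "24") so Hour stays numeric
df <- df %>% mutate(Hour = ifelse(Hour == 0, 24, Hour))

H24 <- df %>% filter(Hour == 24)
mean(df$Hour)  # arithmetic on Hour still works
```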
I have a dataset with a rather complicated problem. It includes 600,000 observations. The main issue is related to the data collection process. As an example, I provide the following dataset, which has a similar structure to the real dataset I have in hand:
df <- data.frame(row_number = c(1,2,3,4,5,6,7,8,9),
date = c("2020-01-01", "2020-01-01","2020-01-01","2020-01-02","2020-01-02","2020-01-02","2020-01-03","2020-01-03","2020-01-03"),
time = c("01:00:00","09:00:00","17:00:00", "09:00:00","01:00:00","17:00:00", "01:00:00","NA","09:00:00"),
order = c(1,2,3,1,2,3,1,2,3),
value = c(10,20,30,40,10,20,30,NA,50))
I know in each day the data was recorded 3 times (order variable). That is in each day, the first time in which the data was recorded was 1:00:00, the second time in which the data was recorded was 09:00:00 and the last time in which the data was recorded was 17:00:00.
However, the person who collected the data made mistakes. For instance, in row number 4 the time is supposed to be 01:00:00, but the data collector recorded 09:00:00.
Also, in row number 8 I expect the time to be 09:00:00; however, since no value was recorded, the person skipped that row and instead recorded 09:00:00 at order number 3, where the time should have been 17:00:00.
Given the fact that we know the order of the data collection, I was wondering if you have any solution to deal with such an issue in the dataset.
Thanks in advance for your time.
Create a group of 3 rows and give time in the order we want :
library(dplyr)
df %>%
group_by(grp = ceiling(row_number/3)) %>%
mutate(time = c('01:00:00', '09:00:00', '17:00:00')) %>%
ungroup %>% select(-grp)
# row_number date time order value
# <dbl> <chr> <chr> <dbl> <dbl>
#1 1 2020-01-01 01:00:00 1 10
#2 2 2020-01-01 09:00:00 2 20
#3 3 2020-01-01 17:00:00 3 30
#4 4 2020-01-02 01:00:00 1 40
#5 5 2020-01-02 09:00:00 2 10
#6 6 2020-01-02 17:00:00 3 20
#7 7 2020-01-03 01:00:00 1 30
#8 8 2020-01-03 09:00:00 2 NA
#9 9 2020-01-03 17:00:00 3 50
time <- c("01:00:00","09:00:00","17:00:00")
rep(time, 200000)
The rep() function lets you repeat a vector as many times as you need. This creates the 3 time slots for the observations and repeats them across your 600,000 observations, eliminating the human error.
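Applied to the example df from the question (a sketch, assuming the rows are sorted by date and then by order):

```r
# overwrite the recorded times with the known schedule;
# length.out recycles the pattern to exactly the number of rows
time_slots <- c("01:00:00", "09:00:00", "17:00:00")
df$time <- rep(time_slots, length.out = nrow(df))
```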
So I have an xts time series over the year with time zone "UTC". The time interval between each row is 15 minutes.
x1 x2
2014-12-31 23:15:00 153.0 0.0
2014-12-31 23:30:00 167.1 5.4
2014-12-31 23:45:00 190.3 4.1
2015-01-01 00:00:00 167.1 9.7
As I want data over one hour to allow for comparison with other data sets, I tried to use period.apply:
dat <- period.apply(dat, endpoints(dat,on="hours",k=1), colSums)
The problem is that the first row in my new data set is 2014-12-31 23:45:00 and not 2015-01-01 00:00:00. I tried changing the endpoint vector but somehow it keeps saying that it is out of bounds. I also thought this was my answer: https://stats.stackexchange.com/questions/5305/how-to-re-sample-an-xts-time-series-in-r/19003#19003 but it was not. I don't want to change the names of my columns, I want to sum over a different interval.
Here a reproducible example:
library(xts)
seq <- seq(from = ISOdate(2014,12,31,23,15), length.out = 100, by = "15 min", tz = "UTC")
xts <- xts(rep(1,100), order.by = seq)
period.apply(xts, endpoints(xts,on="hours",k=1), colSums)
And the result looks like this:
2014-12-31 23:45:00 3
2015-01-01 00:45:00 4
2015-01-01 01:45:00 4
2015-01-01 02:45:00 4
and ends up like this:
2015-01-01 21:45:00 4
2015-01-01 22:45:00 4
2015-01-01 23:45:00 4
2015-01-02 00:00:00 1
Whereas I would like it to always sum over the same interval, meaning I would like only 4s.
(I am using RStudio 0.99.903 with R x64 3.3.2)
The problem is that you're using endpoints, but you want to align by the start of the interval, not the end. I thought you might be able to use this startpoints function, but that produced weird results.
The basic idea of the work-around below is to subtract a small amount from all index values, then use endpoints and period.apply to aggregate. Then call align.time on the result. I'm not sure if this is a general solution, but it seems to work for your example.
library(xts)
seq <- seq(from = ISOdate(2014,12,31,23,15), length.out = 100, by = "15 min", tz = "UTC")
xts <- xts(rep(1,100), order.by = seq)
# create a temporary object
tmp <- xts
# subtract a small amount of time from each index value
.index(tmp) <- .index(tmp)-0.001
# aggregate to hourly
agg <- period.apply(tmp, endpoints(tmp, "hours"), colSums)
# round index up to next hour
agg_aligned <- align.time(agg, 3600)
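A quick sanity check on this example: after the shift-and-align trick, every hourly bucket sums exactly four 15-minute observations, so all aggregated values are 4:

```r
# the 100 points of value 1 fall into 25 hourly buckets of 4 points each
all(zoo::coredata(agg_aligned) == 4)  # TRUE
```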