I'm using the heatwaveR package in R to make a plot with event_line() and visualize heatwaves over the years. The first step is to run ts2clm(), but this command turns my temp column into NA, so I can't plot anything. Does anyone see any errors?
This is my data:
>>> Data
t temp
[Date] [num]
0 2020-05-14 6.9
1 2020-05-06 6.8
2 2020-04-23 5.5
3 2020-04-16 3.6
4 2020-03-31 2.5
5 2020-02-25 2.3
6 2020-01-30 2.8
7 2019-10-02 13.4
8 2022-09-02 19
9 2022-08-15 18.7
...
687 1974-05-06 4.2
This is my code:
#Load libraries
library(readxl)
library(heatwaveR)
#Load data
Data <- read_xlsx("seili_raw_temp.xlsx")
#Set t as class Date
Data$t <- as.Date(Data$t, format = "%Y-%m-%d")
#Constructs seasonal and threshold climatologies
ts <- ts2clm(Data, climatologyPeriod = c("1974-05-06", "2020-05-14"))
#This is the point where almost all temp values turn into NA, so you can ignore below.
#Detect event
res <- detect_event(ts)
#Draw heatwave plot
event_line(res, min_duration = 3, metric = "int_cum",
           start_date = "1974-05-06", end_date = "2020-05-14")
The data you posted isn't long enough to get the function to work, so I just made some up:
library(heatwaveR)
library(lubridate)
set.seed(1234)
Data <- data.frame(
t = seq(ymd("2015-01-01"), ymd("2023-01-01"), by="7 day"))
Data$temp <- runif(nrow(Data), 0,45)
Then, when I execute the function, I get the result below. The problem is that your data (like the data I generated) have one observation every seven days. The ts2clm() function pads out the dataset so that every day has an entry, and if a temperature was not observed on a given day, it fills in a missing value.
ts <- ts2clm(Data, climatologyPeriod = c("2015-01-01", "2022-12-29"))
ts
#> # A tibble: 2,920 × 5
#> doy t temp seas thresh
#> <int> <date> <dbl> <dbl> <dbl>
#> 1 1 2015-01-01 5.12 22.5 38.6
#> 2 2 2015-01-02 NA 22.4 38.5
#> 3 3 2015-01-03 NA 22.2 38.2
#> 4 4 2015-01-04 NA 22.1 37.9
#> 5 5 2015-01-05 NA 21.9 37.3
#> 6 6 2015-01-06 NA 21.7 36.8
#> 7 7 2015-01-07 NA 21.5 36.5
#> 8 8 2015-01-08 28.0 21.3 36.1
#> 9 9 2015-01-09 NA 21.2 36.1
#> 10 10 2015-01-10 NA 21.0 35.8
#> # … with 2,910 more rows
Created on 2023-02-10 by the reprex package (v2.0.1)
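If you want to experiment with data this sparse anyway, one option is to interpolate the padded days before running detect_event(). This is only a sketch, assuming the zoo package is available; whether linear interpolation is defensible for heatwave detection on weekly observations is a scientific judgment call:
library(zoo)
# Linearly interpolate the NA temperatures that ts2clm() padded in,
# leaving any leading or trailing NAs in place
ts$temp <- zoo::na.approx(ts$temp, na.rm = FALSE)
res <- detect_event(ts)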
Related
I want to forecast a certain stock using ARIMA, in a similar way to how R. Hyndman does it in FPP3.
The first issue that I've run into is that stock data is obviously irregular, since the stock exchange is closed during weekends and some holidays. This creates some issues if I want to use functions from the tidyverts packages:
> stock
# A tsibble: 750 x 6 [1D]
Date Open High Low Close Volume
<date> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2019-05-21 36.3 36.4 36.3 36.4 232
2 2019-05-22 36.4 37.0 36.4 36.8 1007
3 2019-05-23 36.7 36.8 36.1 36.1 4298
4 2019-05-24 36.4 36.5 36.4 36.4 452
5 2019-05-27 36.5 36.5 36.3 36.4 2032
6 2019-05-28 36.5 36.8 36.4 36.5 3049
7 2019-05-29 36.2 36.5 36.1 36.5 2962
8 2019-05-30 36.8 37.1 36.8 37.1 432
9 2019-05-31 36.8 37.4 36.8 37.4 8424
10 2019-06-03 37.3 37.5 37.2 37.3 1550
# ... with 740 more rows
> stock %>%
+ feasts::ACF(difference(Close)) %>%
+ autoplot()
Error in `check_gaps()`:
! .data contains implicit gaps in time. You should check your data and convert implicit gaps into explicit missing values using `tsibble::fill_gaps()` if required.
The same error regarding gaps in time applies to other functions like fable::ARIMA() or feasts::gg_tsdisplay().
I have tried filling the gaps with values from previous rows:
stock %>%
group_by_key() %>%
fill_gaps() %>%
tidyr::fill(Close, .direction = "down")
# A tsibble: 1,096 x 6 [1D]
Date Open High Low Close Volume
<date> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2019-05-21 36.3 36.4 36.3 36.4 232
2 2019-05-22 36.4 37.0 36.4 36.8 1007
3 2019-05-23 36.7 36.8 36.1 36.1 4298
4 2019-05-24 36.4 36.5 36.4 36.4 452
5 2019-05-25 NA NA NA 36.4 NA
6 2019-05-26 NA NA NA 36.4 NA
7 2019-05-27 36.5 36.5 36.3 36.4 2032
8 2019-05-28 36.5 36.8 36.4 36.5 3049
9 2019-05-29 36.2 36.5 36.1 36.5 2962
10 2019-05-30 36.8 37.1 36.8 37.1 432
# ... with 1,086 more rows
and everything works as it should from there. My question is:
Is there a way to use the "tidyverts approach" without running into the issue regarding gaps in time?
If not, is filling the gaps with values from previous rows a correct way to overcome this or will it bias the model?
First, you're clearly using an old version of the feasts package, because the current version gives a warning rather than an error when computing the ACF from data with implicit gaps.
Second, the answer depends on what analysis you want to do. You have three choices:
use day as the time index and fill the gaps with NAs;
use day as the time index and fill the gaps with the previous closing stock prices;
use trading day as the time index, in which case there are no gaps.
Here are the results for each of them, using an example of Apple stock over the period 2014-2018.
library(fpp3)
#> ── Attaching packages ─────────────────────────────────────── fpp3 0.4.0.9000 ──
#> ✔ tibble 3.1.7 ✔ tsibble 1.1.1
#> ✔ dplyr 1.0.9 ✔ tsibbledata 0.4.0
#> ✔ tidyr 1.2.0 ✔ feasts 0.2.2
#> ✔ lubridate 1.8.0 ✔ fable 0.3.1
#> ✔ ggplot2 3.3.6 ✔ fabletools 0.3.2
#> ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
#> ✖ lubridate::date() masks base::date()
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ tsibble::intersect() masks base::intersect()
#> ✖ tsibble::interval() masks lubridate::interval()
#> ✖ dplyr::lag() masks stats::lag()
#> ✖ tsibble::setdiff() masks base::setdiff()
#> ✖ tsibble::union() masks base::union()
1. Fill non-trading days with missing values
stock <- gafa_stock %>%
filter(Symbol == "AAPL") %>%
tsibble(index = Date, regular = TRUE) %>%
fill_gaps()
stock
#> # A tsibble: 1,825 x 8 [1D]
#> Symbol Date Open High Low Close Adj_Close Volume
#> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 AAPL 2014-01-02 79.4 79.6 78.9 79.0 67.0 58671200
#> 2 AAPL 2014-01-03 79.0 79.1 77.2 77.3 65.5 98116900
#> 3 <NA> 2014-01-04 NA NA NA NA NA NA
#> 4 <NA> 2014-01-05 NA NA NA NA NA NA
#> 5 AAPL 2014-01-06 76.8 78.1 76.2 77.7 65.9 103152700
#> 6 AAPL 2014-01-07 77.8 78.0 76.8 77.1 65.4 79302300
#> 7 AAPL 2014-01-08 77.0 77.9 77.0 77.6 65.8 64632400
#> 8 AAPL 2014-01-09 78.1 78.1 76.5 76.6 65.0 69787200
#> 9 AAPL 2014-01-10 77.1 77.3 75.9 76.1 64.5 76244000
#> 10 <NA> 2014-01-11 NA NA NA NA NA NA
#> # … with 1,815 more rows
stock %>%
model(ARIMA(Close ~ pdq(d=1)))
#> A mable: 1 x 1
#> `ARIMA(Close ~ pdq(d = 1))`
#> <model>
#> 1 <ARIMA(0,1,0)>
In this case, calculations of the ACF will find the longest contiguous part, which is too small to be meaningful, so there isn't any point in showing the results of ACF() or gg_tsdisplay(). Also, the automated choice of differencing in the ARIMA model fails due to the missing values, so I have manually set it to one. The other parts of the ARIMA model work fine in the presence of missing values.
2. Fill non-trading days with the last observed values
stock <- stock %>%
tidyr::fill(Close, .direction = "down")
stock
#> # A tsibble: 1,825 x 8 [1D]
#> Symbol Date Open High Low Close Adj_Close Volume
#> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 AAPL 2014-01-02 79.4 79.6 78.9 79.0 67.0 58671200
#> 2 AAPL 2014-01-03 79.0 79.1 77.2 77.3 65.5 98116900
#> 3 <NA> 2014-01-04 NA NA NA 77.3 NA NA
#> 4 <NA> 2014-01-05 NA NA NA 77.3 NA NA
#> 5 AAPL 2014-01-06 76.8 78.1 76.2 77.7 65.9 103152700
#> 6 AAPL 2014-01-07 77.8 78.0 76.8 77.1 65.4 79302300
#> 7 AAPL 2014-01-08 77.0 77.9 77.0 77.6 65.8 64632400
#> 8 AAPL 2014-01-09 78.1 78.1 76.5 76.6 65.0 69787200
#> 9 AAPL 2014-01-10 77.1 77.3 75.9 76.1 64.5 76244000
#> 10 <NA> 2014-01-11 NA NA NA 76.1 NA NA
#> # … with 1,815 more rows
stock %>%
ACF(difference(Close)) %>%
autoplot()
stock %>%
model(ARIMA(Close))
#> # A mable: 1 x 1
#> `ARIMA(Close)`
#> <model>
#> 1 <ARIMA(0,1,0)>
stock %>%
gg_tsdisplay(Close)
3. Re-index by trading day
stock <- gafa_stock %>%
filter(Symbol == "AAPL") %>%
tsibble(index = Date, regular = TRUE) %>%
mutate(trading_day = row_number()) %>%
tsibble(index = trading_day)
stock
#> # A tsibble: 1,258 x 9 [1]
#> Symbol Date Open High Low Close Adj_Close Volume trading_day
#> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
#> 1 AAPL 2014-01-02 79.4 79.6 78.9 79.0 67.0 58671200 1
#> 2 AAPL 2014-01-03 79.0 79.1 77.2 77.3 65.5 98116900 2
#> 3 AAPL 2014-01-06 76.8 78.1 76.2 77.7 65.9 103152700 3
#> 4 AAPL 2014-01-07 77.8 78.0 76.8 77.1 65.4 79302300 4
#> 5 AAPL 2014-01-08 77.0 77.9 77.0 77.6 65.8 64632400 5
#> 6 AAPL 2014-01-09 78.1 78.1 76.5 76.6 65.0 69787200 6
#> 7 AAPL 2014-01-10 77.1 77.3 75.9 76.1 64.5 76244000 7
#> 8 AAPL 2014-01-13 75.7 77.5 75.7 76.5 64.9 94623200 8
#> 9 AAPL 2014-01-14 76.9 78.1 76.8 78.1 66.1 83140400 9
#> 10 AAPL 2014-01-15 79.1 80.0 78.8 79.6 67.5 97909700 10
#> # … with 1,248 more rows
stock %>%
ACF(difference(Close)) %>%
autoplot()
stock %>%
model(ARIMA(Close))
#> # A mable: 1 x 1
#> `ARIMA(Close)`
#> <model>
#> 1 <ARIMA(2,1,3)>
stock %>%
gg_tsdisplay(Close)
Created on 2022-05-22 by the reprex package (v2.0.1)
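One follow-up if you take the third approach: with trading_day as the index, forecast horizons are counted in trading days rather than calendar days. For example, this sketch produces forecasts for the next ten trading days:
stock %>%
  model(ARIMA(Close)) %>%
  forecast(h = 10)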
I am new to coding. I have a data set of daily stream flow averages over 20 years. Here is an example:
DATE FLOW
1 10/1/2001 88.2
2 10/2/2001 77.6
3 10/3/2001 68.4
4 10/4/2001 61.5
5 10/5/2001 55.3
6 10/6/2001 52.5
7 10/7/2001 49.7
8 10/8/2001 46.7
9 10/9/2001 43.3
10 10/10/2001 41.3
11 10/11/2001 39.3
12 10/12/2001 37.7
13 10/13/2001 35.8
14 10/14/2001 34.1
15 10/15/2001 39.8
I need to create a loop summing the previous six days as well as the current day (a rolling weekly average) and print it to an array for the designated water year. I have already created an aggregate function to separate the daily means into their designated water years.
# Separating dates into specific water years
wtr_yr <- function(dates, start_month = 9) {
  # Convert dates into POSIXlt
  dates.posix <- as.POSIXlt(dates)
  # Year offset: months at or after the start month belong to the next water year
  offset <- ifelse(dates.posix$mon >= start_month - 1, 1, 0)
  # Water year
  dates.posix$year + 1900 + offset
}
# Aggregating by water year to take the mean
mean.FLOW <- aggregate(data_set$FLOW, list(wtr_yr(data_set$DATE)), mean)
This can be done much more easily. But first, let's prepare a bit of example data.
library(tidyverse)
library(lubridate)
df = tibble(
DATE = seq(mdy("1/1/2010"), mdy("12/31/2022"), 1),
FLOW = rnorm(length(DATE), 40, 10)
)
output
# A tibble: 4,748 x 2
DATE FLOW
<date> <dbl>
1 2010-01-01 34.4
2 2010-01-02 37.7
3 2010-01-03 55.6
4 2010-01-04 40.7
5 2010-01-05 41.3
6 2010-01-06 57.2
7 2010-01-07 44.6
8 2010-01-08 27.3
9 2010-01-09 33.1
10 2010-01-10 35.5
# ... with 4,738 more rows
Now let's do the aggregation by year and week number
df %>%
group_by(year(DATE), week(DATE)) %>%
summarise(mean = mean(FLOW))
output
# A tibble: 689 x 3
# Groups: year(DATE) [13]
`year(DATE)` `week(DATE)` mean
<dbl> <dbl> <dbl>
1 2010 1 44.5
2 2010 2 39.6
3 2010 3 38.5
4 2010 4 35.3
5 2010 5 44.1
6 2010 6 39.4
7 2010 7 41.3
8 2010 8 43.9
9 2010 9 38.5
10 2010 10 42.4
# ... with 679 more rows
Note that for the week() function, the first week starts on January 1st. If you want to number the weeks according to the ISO 8601 standard, use the isoweek() function. Alternatively, you can use epiweek(), which follows the US CDC's epidemiological-week convention.
df %>%
group_by(year(DATE), isoweek(DATE)) %>%
summarise(mean = mean(FLOW))
output
# A tibble: 681 x 3
# Groups: year(DATE) [13]
`year(DATE)` `isoweek(DATE)` mean
<dbl> <dbl> <dbl>
1 2010 1 40.0
2 2010 2 45.5
3 2010 3 33.2
4 2010 4 38.9
5 2010 5 45.0
6 2010 6 40.7
7 2010 7 38.5
8 2010 8 42.5
9 2010 9 37.1
10 2010 10 42.4
# ... with 671 more rows
If you want to better understand how these functions differ, run the code below:
df %>%
mutate(
w1 = week(DATE),
w2 = isoweek(DATE),
w3 = epiweek(DATE)
)
output
# A tibble: 4,748 x 5
DATE FLOW w1 w2 w3
<date> <dbl> <dbl> <dbl> <dbl>
1 2010-01-01 34.4 1 53 52
2 2010-01-02 37.7 1 53 52
3 2010-01-03 55.6 1 53 1
4 2010-01-04 40.7 1 1 1
5 2010-01-05 41.3 1 1 1
6 2010-01-06 57.2 1 1 1
7 2010-01-07 44.6 1 1 1
8 2010-01-08 27.3 2 1 1
9 2010-01-09 33.1 2 1 1
10 2010-01-10 35.5 2 1 2
# ... with 4,738 more rows
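If you specifically need the rolling seven-day mean described in the question (each day averaged with the six days before it), a windowed function is simpler than a loop. A minimal sketch, assuming the zoo package:
library(zoo)
df %>%
  mutate(FLOW_roll7 = zoo::rollmeanr(FLOW, k = 7, fill = NA))
rollmeanr() right-aligns the window, so each value is the mean of the current day and the previous six; the first six rows are NA because a full window is not yet available.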
I have two dataframes, interest rates and monthly standard deviations of price returns, that I have managed to merge together. However, the interest rate data has gaps in its dates where the markets were not open, i.e. weekends and holidays. The monthly returns all start on the first of the month, so where this lines up with a market closure the data doesn't merge correctly. An example of the dataframes:
Date Rollingstd
01/11/2014 0.00925
01/10/2014 0.01341
Date InterestRate
03/11/2014 2
31/10/2014 1.5
As you can see, there is no 01/11/2014 in the interest rate data, so merging gives me
Date InterestRate Rollingstd
03/11/2014 2 0.01341
31/10/2014 1.5 0.01341
I guess a fix for this would be to expand the interest rate dataframe so that it includes all dates and fill the interest rate values upwards so it looks like this:
Date InterestRate
03/11/2014 2
02/11/2014 1.5
01/11/2014 1.5
31/10/2014 1.5
This would ensure there are no missing dates in the dataframe. Any ideas on how I could do this?
Do you want this?
df2 <- read.table(text = 'Date InterestRate
03/11/2014 2
31/10/2014 1.5', header = T)
df1 <- read.table(text = 'Date Rollingstd
01/11/2014 0.00925
01/10/2014 0.01341', header = T)
library(tidyverse)
df1 %>% full_join(df2, by = 'Date') %>%
mutate(Date = as.Date(Date, '%d/%m/%Y')) %>%
arrange(Date) %>%
complete(Date = seq.Date(min(Date), max(Date), 'days')) %>%
fill(InterestRate, .direction = 'up') %>%
as.data.frame()
#> Date Rollingstd InterestRate
#> 1 2014-10-01 0.01341 1.5
#> 2 2014-10-02 NA 1.5
#> 3 2014-10-03 NA 1.5
#> 4 2014-10-04 NA 1.5
#> 5 2014-10-05 NA 1.5
#> 6 2014-10-06 NA 1.5
#> 7 2014-10-07 NA 1.5
#> 8 2014-10-08 NA 1.5
#> 9 2014-10-09 NA 1.5
#> 10 2014-10-10 NA 1.5
#> 11 2014-10-11 NA 1.5
#> 12 2014-10-12 NA 1.5
#> 13 2014-10-13 NA 1.5
#> 14 2014-10-14 NA 1.5
#> 15 2014-10-15 NA 1.5
#> 16 2014-10-16 NA 1.5
#> 17 2014-10-17 NA 1.5
#> 18 2014-10-18 NA 1.5
#> 19 2014-10-19 NA 1.5
#> 20 2014-10-20 NA 1.5
#> 21 2014-10-21 NA 1.5
#> 22 2014-10-22 NA 1.5
#> 23 2014-10-23 NA 1.5
#> 24 2014-10-24 NA 1.5
#> 25 2014-10-25 NA 1.5
#> 26 2014-10-26 NA 1.5
#> 27 2014-10-27 NA 1.5
#> 28 2014-10-28 NA 1.5
#> 29 2014-10-29 NA 1.5
#> 30 2014-10-30 NA 1.5
#> 31 2014-10-31 NA 1.5
#> 32 2014-11-01 0.00925 2.0
#> 33 2014-11-02 NA 2.0
#> 34 2014-11-03 NA 2.0
Created on 2021-05-23 by the reprex package (v2.0.0)
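A note on the fill direction: with the rows arranged in ascending date order, .direction = 'up' carries the next observed rate backwards, which is why 2014-11-01 and 2014-11-02 get 2.0 above rather than the 1.5 shown in the question's example. If you want the previous rate carried forward instead, replace the fill() step with the line below ('downup' fills forward first, then backfills only the leading NAs before the first observed rate):
fill(InterestRate, .direction = 'downup')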
I have a data set like below:
Timestamp Value1 Value2
2020-10-29 05:00:00 10 20
2020-10-29 05:00:01 10 20
2020-10-29 05:00:02 11 22
2020-10-29 05:00:03 11 22
and so on, in one-second intervals, up to a few hours of data. I want to generate an average value every two minutes, but left-align the data. Essentially, the rolling average of two minutes at 2020-10-29 05:00:00 should be the average of the data points between 2020-10-29 05:00:00 and 2020-10-29 05:01:59.
I have used data %>% group_by(Timestamp = cut(Timestamp, breaks = "2 min")) %>% summarize(Meanval1 = mean(Value1), Meanval2 = mean(Value2)) but this right-aligns the data. How can I left-align it?
Thanks!
You can round down the Timestamp column to the nearest two minutes using lubridate::floor_date. If you then group_by this new column, you will get a left-aligned two-minute mean:
library(dplyr)
df %>%
mutate(time = lubridate::floor_date(TimeStamp, "2 minutes")) %>%
group_by(time) %>%
summarize(mean_val1 = mean(Value1), mean_val2 = mean(Value2))
#> # A tibble: 9 x 3
#> time mean_val1 mean_val2
#> <dttm> <dbl> <dbl>
#> 1 2020-10-29 05:00:00 10.2 19.9
#> 2 2020-10-29 05:02:00 9.84 20.0
#> 3 2020-10-29 05:04:00 10.1 19.9
#> 4 2020-10-29 05:06:00 9.72 20.3
#> 5 2020-10-29 05:08:00 9.98 19.9
#> 6 2020-10-29 05:10:00 9.98 20.0
#> 7 2020-10-29 05:12:00 10.1 20.0
#> 8 2020-10-29 05:14:00 10.0 20.1
#> 9 2020-10-29 05:16:00 10.0 20.2
Data used
set.seed(69)
t <- seq(as.POSIXct("2020-10-29 05:00:00"), by = "1 sec", length.out = 1000)
df <- data.frame(TimeStamp = t,
Value1 = sample(8:12, 1000, TRUE),
Value2 = sample(18:22, 1000, TRUE))
head(df)
#> TimeStamp Value1 Value2
#> 1 2020-10-29 05:00:00 8 20
#> 2 2020-10-29 05:00:01 10 21
#> 3 2020-10-29 05:00:02 9 19
#> 4 2020-10-29 05:00:03 12 19
#> 5 2020-10-29 05:00:04 12 19
#> 6 2020-10-29 05:00:05 9 18
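Note that floor_date() gives non-overlapping two-minute bins, which matches the question. If you instead wanted a true sliding window, i.e. a 120-second left-aligned mean recomputed at every second, a rolling mean does that. A sketch assuming the zoo package and gap-free one-second data:
library(zoo)
# align = "left" starts each window at the current timestamp and looks
# forward 120 seconds, as the question describes
df$roll_val1 <- zoo::rollmean(df$Value1, k = 120, align = "left", fill = NA)
df$roll_val2 <- zoo::rollmean(df$Value2, k = 120, align = "left", fill = NA)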
I am trying to find the mean of A and Z for each row and save it as a separate column, but it seems like the code only averages the first row and fills the rest of the rows with that value. Any suggestions on how to fix this?
library(tidyverse)
library(lubridate)
set.seed(123)
DF <- data.frame(Date = seq(as.Date("2001-01-01"), to = as.Date("2003-12-31"), by = "day"),
A = runif(1095, 1,60),
Z = runif(1095, 5,100)) %>%
mutate(MeanofAandZ= mean(A:Z))
The problem is that A:Z builds a sequence from the first element of A to the first element of Z, so mean(A:Z) returns a single number that is recycled down the whole column. Are you looking for this:
DF %>% rowwise() %>% mutate(MeanofAandZ = mean(c_across(A:Z)))
# A tibble: 1,095 x 4
# Rowwise:
Date A Z MeanofAandZ
<date> <dbl> <dbl> <dbl>
1 2001-01-01 26.5 7.68 17.1
2 2001-01-02 54.9 33.1 44.0
3 2001-01-03 37.1 82.0 59.5
4 2001-01-04 6.91 18.0 12.4
5 2001-01-05 53.0 8.76 30.9
6 2001-01-06 26.1 7.63 16.9
7 2001-01-07 59.3 30.8 45.0
8 2001-01-08 39.9 14.6 27.3
9 2001-01-09 59.2 93.6 76.4
10 2001-01-10 30.7 89.1 59.9
You can also do it with base R using rowMeans().
Full base R:
DF$MeanofAandZ <- rowMeans(DF[c("A", "Z")])
head(DF)
#> Date A Z MeanofAandZ
#> 1 2001-01-01 17.967074 76.92436 47.44572
#> 2 2001-01-02 47.510003 99.28325 73.39663
#> 3 2001-01-03 25.129638 64.33253 44.73109
#> 4 2001-01-04 53.098027 32.42556 42.76179
#> 5 2001-01-05 56.487570 23.99162 40.23959
#> 6 2001-01-06 3.687833 81.08720 42.38751
or inside a mutate:
library(dplyr)
DF <- DF %>% mutate(MeanofAandZ = rowMeans(cbind(A,Z)))
head(DF)
#> Date A Z MeanofAandZ
#> 1 2001-01-01 17.967074 76.92436 47.44572
#> 2 2001-01-02 47.510003 99.28325 73.39663
#> 3 2001-01-03 25.129638 64.33253 44.73109
#> 4 2001-01-04 53.098027 32.42556 42.76179
#> 5 2001-01-05 56.487570 23.99162 40.23959
#> 6 2001-01-06 3.687833 81.08720 42.38751
We can also do
DF$MeanofAandZ <- Reduce(`+`, DF[c("A", "Z")])/2
Or using apply
DF$MeanofAandZ <- apply(DF[c("A", "Z")], 1, mean)
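Of these options, rowMeans() is vectorized and will generally be much faster than apply() or rowwise() on large data frames. For just two columns, plain arithmetic is equally clear:
DF$MeanofAandZ <- (DF$A + DF$Z) / 2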