I'm forecasting hierarchical data with fable that has 2 levels of aggregation (but will have more in the future), and am having trouble knowing which predictions correspond to which series. Here is a simplified version of what I have:
# A fable: 7 x 6 [12M]
# Key: type, name, .model [7]
type name .model date value .mean
<chr*> <chr*> <chr> <mth> <dist> <dbl>
1 x x1 mint 2021 Jan N(20, 0.82) 19.9
2 x x2 mint 2021 Jan N(20, 1.3) 19.9
3 x <aggregated> mint 2021 Jan N(40, 1) 39.8
4 y y1 mint 2021 Jan N(9.7, 1.9) 9.73
5 y y2 mint 2021 Jan N(9.9, 1.7) 9.92
6 y <aggregated> mint 2021 Jan N(20, 3.8) 19.6
7 <aggregated> <aggregated> mint 2021 Jan N(59, 5.9) 59.4
Is there a way to rename the aggregated vectors as I am pivoting and aggregating the table? So it would look something like this:
# A fable: 7 x 6 [12M]
# Key: type, name, .model [7]
type name .model date value .mean
<chr*> <chr*> <chr> <mth> <dist> <dbl>
1 x x1 mint 2021 Jan N(20, 0.82) 19.9
2 x x2 mint 2021 Jan N(20, 1.3) 19.9
3 x x mint 2021 Jan N(40, 1) 39.8
4 y y1 mint 2021 Jan N(9.7, 1.9) 9.73
5 y y2 mint 2021 Jan N(9.9, 1.7) 9.92
6 y y mint 2021 Jan N(20, 3.8) 19.6
7 xy xy mint 2021 Jan N(59, 5.9) 59.4
I can do it manually for one level of aggregation by just renaming all aggregate vectors to what I want, but for two (or more) I'm not sure how to do it. I have tried using the is_aggregated() function but when I have 20 series at the bottom level it becomes very weird to try and find what corresponds to what.
Thanks so much!
Here's a repex
df <- tibble(
date = seq(from = as.Date("2011/1/1"), to = as.Date("2020/1/1"), by = "year"),
x1 = 11:20,
x2 = x1 + rnorm(10),
y1 = 1:10,
y2 = y1 - rnorm(10)
)
df %>%
mutate(date = yearmonth(date)) %>%
as_tsibble(index = date) %>%
pivot_longer(!date) %>%
group_by(name) %>%
mutate(type = case_when(
name %in% c("x1", "x2") ~ "x",
name %in% c("y1", "y2") ~ "y")) %>%
aggregate_key((type / name), value = sum(value)) %>%
model(arima = ARIMA(value)) %>%
reconcile(mint = min_trace(arima, method = "mint_shrink")) %>%
forecast(h = 1) %>%
filter(.model == "mint") %>%
print(n = 7)
Let's say I have school enrollment data stored in this format, with start date and end date fields:
unique_name
enrollment_start
enrollment_end
Amy
1, Jan, 2017
30, Sep 2018
Franklin
1, Jan, 2017
19, Feb, 2017
Franklin
5, Jun, 2017
4, Feb, 2018
Franklin
21, Oct, 2018
9, Mar, 2019
Samir
1, Jun, 2017
4, Feb, 2017
Samir
5, Apr, 2017
12, Sep, 2018
...
...
...
And I want to produce aggregated counts of enrollment by month like this:
month
enrollment_count
Jan, 2017
25
Feb, 2017
31
Mar, 2017
19
Apr, 2017
34
May, 2017
29
Jun, 2017
32
...
...
Is there an easy way to accomplish this with dplyr?
The only way I can think to do this is by looping over a list of all months from range month_min to month_max to count the number of rows with start or stop dates that fall inside each month. Hoping for easier code.
I think this can be done pretty elegantly with the clock and ivs packages. You seem to want monthly counts, so you can use the year-month type from clock. And ivs is a package dedicated to working with intervals of data, which is exactly what you have here. Here we assume that if your enrollment start/end fell in a month, then you should be considered active in that month.
library(ivs)
library(clock)
library(dplyr, warn.conflicts = FALSE)
enrollments <- tribble(
~unique_name, ~enrollment_start, ~enrollment_end,
"Amy", "1, Jan, 2017", "30, Sep, 2018",
"Franklin", "1, Jan, 2017", "19, Feb, 2017",
"Franklin", "5, Jun, 2017", "4, Feb, 2018",
"Franklin", "21, Oct, 2018", "9, Mar, 2019",
"Samir", "1, Jan, 2017", "4, Feb, 2017",
"Samir", "5, Apr, 2017", "12, Sep, 2018"
)
# Parse these into "day" precision year-month-day objects, then restrict
# them to just "month" precision because that is all we need
enrollments <- enrollments %>%
mutate(
start = enrollment_start %>%
year_month_day_parse(format = "%d, %b, %Y") %>%
calendar_narrow("month"),
end = enrollment_end %>%
year_month_day_parse(format = "%d, %b, %Y") %>%
calendar_narrow("month") %>%
add_months(1),
.keep = "unused"
)
enrollments
#> # A tibble: 6 × 3
#> unique_name start end
#> <chr> <ymd<month>> <ymd<month>>
#> 1 Amy 2017-01 2018-10
#> 2 Franklin 2017-01 2017-03
#> 3 Franklin 2017-06 2018-03
#> 4 Franklin 2018-10 2019-04
#> 5 Samir 2017-01 2017-03
#> 6 Samir 2017-04 2018-10
# Create an interval vector, note that these are half-open intervals.
# The month on the RHS is not included, which is why we added 1 to `end` above.
enrollments <- enrollments %>%
mutate(active = iv(start, end), .keep = "unused")
enrollments
#> # A tibble: 6 × 2
#> unique_name active
#> <chr> <iv<ymd<month>>>
#> 1 Amy [2017-01, 2018-10)
#> 2 Franklin [2017-01, 2017-03)
#> 3 Franklin [2017-06, 2018-03)
#> 4 Franklin [2018-10, 2019-04)
#> 5 Samir [2017-01, 2017-03)
#> 6 Samir [2017-04, 2018-10)
# We'll generate a sequence of months that will be part of the final result
bounds <- range(enrollments$active)
lower <- iv_start(bounds[[1]])
upper <- iv_end(bounds[[2]]) - 1L
months <- tibble(month = seq(lower, upper, by = 1))
months
#> # A tibble: 27 × 1
#> month
#> <ymd<month>>
#> 1 2017-01
#> 2 2017-02
#> 3 2017-03
#> 4 2017-04
#> 5 2017-05
#> 6 2017-06
#> 7 2017-07
#> 8 2017-08
#> 9 2017-09
#> 10 2017-10
#> # … with 17 more rows
# To actually compute the counts, use `iv_count_between()`, which counts up all
# instances where `month[i]` is between any interval in `enrollments$active`
months %>%
mutate(count = iv_count_between(month, enrollments$active))
#> # A tibble: 27 × 2
#> month count
#> <ymd<month>> <int>
#> 1 2017-01 3
#> 2 2017-02 3
#> 3 2017-03 1
#> 4 2017-04 2
#> 5 2017-05 2
#> 6 2017-06 3
#> 7 2017-07 3
#> 8 2017-08 3
#> 9 2017-09 3
#> 10 2017-10 3
#> # … with 17 more rows
Created on 2022-04-05 by the reprex package (v2.0.1)
Create a list column containing the sequence of months between each set of dates, then unnest and count.
Notes:
I use lubridate::floor_date() to round enrollment_start to the first day of the month. Otherwise, seq() may skip months if enrollment_start is on the 29th of the month or later.
The fifth row of your example data has enrollment_start later than enrollment_end -- I assumed this was an error and removed.
library(tidyverse)
library(lubridate)
enrollments %>%
mutate(
across(c(enrollment_start, enrollment_end), dmy), # convert to date
month = map2(
floor_date(enrollment_start, unit = "month"), # round to 1st day
enrollment_end,
~ seq(.x, .y, by = "month")
)
) %>%
unnest_longer(month) %>%
count(month, name = "enrollment_count")
#> # A tibble: 27 x 2
#> month enrollment_count
#> <date> <int>
#> 1 2017-01-01 2
#> 2 2017-02-01 2
#> 3 2017-03-01 1
#> 4 2017-04-01 2
#> 5 2017-05-01 2
#> 6 2017-06-01 3
#> 7 2017-07-01 3
#> 8 2017-08-01 3
#> 9 2017-09-01 3
#> 10 2017-10-01 3
#> # ... with 17 more rows
Created on 2022-03-25 by the reprex package (v2.0.1)
Here's my take on this with dplyr and tidyr.
Pivot the data creating multiple rows per student and format your dates.
group on student and generate missing months using complete.
group on the generated periods and count.
data %>%
pivot_longer(cols=c('enrollment_start','enrollment_end')) %>%
mutate(value = as.Date(value, format = "%d, %B, %Y")) %>%
mutate(value = lubridate::floor_date(value, 'month')) %>%
# unique_name name value
# <chr> <chr> <date>
# 1 Amy enrollment_start 2017-01-01
# 2 Amy enrollment_end 2018-09-30
# 3 Franklin enrollment_start 2017-01-01
# 4 Franklin enrollment_end 2017-02-19
# ..etc.
group_by(unique_name) %>%
complete(value = seq.Date(min(value), max(value), by="month")) %>%
arrange(unique_name, value)
enrollment_count <- group_by(data, value) %>%
count()
Edit: I forgot to floor the dates in order to properly aggregate per period at the end. Added floor_date from lubridate to do this.
so far the result I get is as below, which is a 'tibble' format.
However, as this is a group of time series, I wish to transform this result x into a 'tsibble' format afterwords.
Anyone know what further step I need to do?
I tried using
x <- function_name(n.ts = 3, freq = 12, nComp = 2, n = 120,output_format = "list")
as_tsibble(x)
And the error said
Another trying & error
The content should display something similar to this
The 'x' is a list with multiple components. We can loop over the list with map, extract the 'x' component and convert it to tsibble
library(dplyr)
library(purrr)
x1 <- map(x, ~ as_tsibble(.x$x))
x1
#$N1
# A tsibble: 120 x 2 [1M]
# index value
# <mth> <dbl>
# 1 0001 Jan 186.
# 2 0001 Feb 184.
# 3 0001 Mar 189.
# 4 0001 Apr 188.
# 5 0001 May 193.
# 6 0001 Jun 189.
# 7 0001 Jul 192.
# 8 0001 Aug 194.
# 9 0001 Sep 194.
#10 0001 Oct 196.
# … with 110 more rows
#$N2
# A tsibble: 120 x 2 [1M]
# index value
# <mth> <dbl>
# 1 0001 Jan 18.4
# 2 0001 Feb 16.0
# 3 0001 Mar 16.9
# ...
Here, the index part is created by the column name ('Month') and as there was no information regarding the 'year', it started with '0001'
data
library(gratis)
x <- generate_ts(n.ts = 3, freq = 12, nComp = 2, n = 120)
I have a data frame of SPEI values. I want to calculate two statistics (explained below) at an interval of
20 years i.e 2021-2040, 2041-2060, 2061-2080, 2081-2100. The first column contains the Date (month-year), and
Each year i.e. 2021, 2022, 2023 etc. till 2100.
The statistics are:
Drought frequency: Number of times SPEI < 0 in the specified period (20 years and 1 year respectively)
Drought Duration: Equal to the number of months between its start (included) and end month (not included) of the specified period. I am assuming a drought event starts when SPEI < 0.
I was wondering if there's a way to do that in R? It seems like an easy problem, but I don't know how to do it. Please help me out. Excel is taking too long. Thanks.
> head(test, 20)
Date spei-3
1 2021-01-01 NA
2 2021-02-01 NA
3 2021-03-01 -0.52133737
4 2021-04-01 -0.60047887
5 2021-05-01 0.56838399
6 2021-06-01 0.02285012
7 2021-07-01 0.26288462
8 2021-08-01 -0.14314685
9 2021-09-01 -0.73132256
10 2021-10-01 -1.23389220
11 2021-11-01 -1.15874943
12 2021-12-01 0.27954143
13 2022-01-01 1.14606657
14 2022-02-01 0.66872986
15 2022-03-01 -1.13758050
16 2022-04-01 -0.27861017
17 2022-05-01 0.99992395
18 2022-06-01 0.61024314
19 2022-07-01 -0.47450485
20 2022-08-01 -1.06682997
Edit:
I very much like to add some code, but I don't know where to start.
test = "E:/drought.xlsx"
#Extract year and month and add it as a column
test$Year = format(test$Date,"%Y")
test$Month = format(test$Date,"%B")
I don't know how to go from here. I found that cumsum can help, but how do I select one year and then apply cumsum on it. I am not withholding code on purpose. I just don't know where or how to begin.
There are a couple questions the OP's post so I will go through them step by step. You'll need dplyr and lubridate for this workflow.
First, we create some fake data to use:
library(lubridate)
library(dplyr)
#create example data
dd<- data.frame(Date = seq.Date(as.Date("2021-01-01"), as.Date("2100-12-01"), by = "month"),
spei = rnorm(960,0,2))
That will look like this, similar to what you have above
> head(dd)
Date spei year year_20 drought
1 2021-01-01 -6.85689789 2021 2021_2040 1
2 2021-02-01 -0.09292459 2021 2021_2040 1
3 2021-03-01 0.13715922 2021 2021_2040 0
4 2021-04-01 2.26805601 2021 2021_2040 0
5 2021-05-01 -0.47325008 2021 2021_2040 1
6 2021-06-01 0.37034138 2021 2021_2040 0
Then we can use lubridate and cut to create our yearly and 20-year variables to group by later and create a column drought signifying if spei was negative.
#create a column to group on by year and by 20-year
dd <- dd %>%
mutate(year = year(Date),
year_20 = cut(year, breaks = c(2020,2040,2060,2080, 2100), include.lowest = T,
labels = c("2021_2040", "2041_2060", "2061_2080", "2081_2100"))) %>%
#column signifying if that month was a drought
mutate(drought = ifelse(spei<0,1,0))
Once we have that, we just use the group_by function to get frequency (or number of months with a drought) by year or 20-year period
#by year
dd %>%
group_by(year) %>%
summarise(year_freq = sum(drought)) %>%
ungroup()
# A tibble: 80 x 2
year year_freq
<dbl> <dbl>
1 2021 6
2 2022 4
3 2023 7
4 2024 6
5 2025 6
6 2026 7
#by 20-year group
dd %>%
group_by(year_20) %>%
summarise(year20_freq = sum(drought)) %>%
ungroup()
# A tibble: 4 x 2
year_20 year20_freq
<fct> <dbl>
1 2021_2040 125
2 2041_2060 121
3 2061_2080 121
4 2081_2100 132
Calculating drought duration is a bit more complicated. It involves
identifying the first month of each drought
calculating the length of each drought
combining information from 1 and 2 together
We can use lag to identify when a month changed from "no drought" to "drought". In this case we want an index of where the value in row i is different from that in row i-1
# find index of where values change.
change.ind <- dd$drought != lag(dd$drought)
#use index to find drought start
drought.start <- dd[change.ind & dd$drought == 1,]
This results in a subset of the initial dataset, but only with the rows with the first month of a drought. Then we can use rle to calculate the length of the drought. rle will calculate the length of every run of numbers, so we will have to subset to only those runs where the value==1 (drought)
#calculate drought lengths
drought.lengths <- rle(dd$drought)
# we only want droughts (values = 1)
drought.lengths <- drought.lengths$lengths[drought.lengths$values==1]
Now we can combine these two pieces of information together. The first row is an NA because there is no value at i-1 to compare the lag to. It can be dropped, unless you want to include that data.
drought.dur <- cbind(drought.start, drought_length = drought.lengths)
head(drought.dur)
Date spei year year_20 drought drought_length
NA <NA> NA NA <NA> NA 2
5 2021-05-01 -0.47325008 2021 2021_2040 1 1
9 2021-09-01 -2.04564549 2021 2021_2040 1 1
11 2021-11-01 -1.04293866 2021 2021_2040 1 2
14 2022-02-01 -0.83759671 2022 2021_2040 1 1
17 2022-05-01 -0.07784316 2022 2021_2040 1 1
I have a data frame like this:
year <-c(floor(runif(100,min=2015, max=2017)))
month <- c(floor(runif(100, min=1, max=13)))
inch <- c(floor(runif(100, min=0, max=10)))
mm <- c(floor(runif(100, min=0, max=100)))
df = data.frame(year, month, inch, mm);
year month inch mm
2016 11 0 10
2015 9 3 34
2016 6 3 33
2015 8 0 77
I only care about the columns year, month, and mm.
I need to re-arrange the data frame so that the first column is the name of the month and the rest of the columns is the value of mm.
Months 2015 2016
Jan # #
Feb
Mar
Apr
May
Jun
Jul
Aug
Sep
Oct
Nov
Dec
So two things needs to happen.
(1) The month needs to become a string of the first three letters of the month.
(2) I need to group by year, and then put the mm values in a column under that year.
So far I have this code, but I can't figure it out:
df %>%
select(-inch) %>%
group_by(month) %>%
summarize(mm = mm) %>%
ungroup()
To convert month to names, you can refer to month.abb; And then you can summarize by year and month, spread to wide format:
library(dplyr)
library(tidyr)
df %>%
group_by(year, month = month.abb[month]) %>%
summarise(mm = mean(mm)) %>% # use mean as an example, could also be sum or other
# intended aggregation methods
spread(year, mm) %>%
arrange(match(month, month.abb)) # rearrange month in chronological order
# A tibble: 12 x 3
# month `2015` `2016`
# <chr> <dbl> <dbl>
# 1 Jan 65.50000 28.14286
# 2 Feb 54.40000 30.00000
# 3 Mar 23.50000 95.00000
# 4 Apr 7.00000 43.60000
# 5 May 45.33333 44.50000
# 6 Jun 70.33333 63.16667
# 7 Jul 72.83333 52.00000
# 8 Aug 53.66667 66.50000
# 9 Sep 51.00000 64.40000
#10 Oct 74.00000 39.66667
#11 Nov 66.20000 58.71429
#12 Dec 38.25000 51.50000