imputeTS maxgap unexpected behavior with tsibble - tsibble

I have to fill missing data in a tsibble with imputeTS.
I want to limit the number of consecutive NAs to be filled with maxgap.
Any maxgap value different from Inf gives an error.
library(tsibble)
library(imputeTS)
harvest <- tsibble(
year = c(2010, 2011, 2011, 2014),
fruit = rep(c("kiwi", "cherry"), each = 2),
kilo = sample(1:10, size = 4),
key = fruit,
index = year
)
# Not Working
harvest |>
fill_gaps(.full = TRUE) |>
na_kalman(maxgap = 1)
# Working
harvest |>
fill_gaps(.full = TRUE) |>
na_kalman(maxgap = Inf)
The call with maxgap = 1 prints:
na_kalman: No imputation performed for column 3 of the input dataset.
Reason: 'x' must be a vector of an atomic type
How can I solve the error?
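One possible workaround, offered as a sketch rather than a confirmed fix: the message says 'x' must be an atomic vector, so instead of handing the whole tsibble to na_kalman(), pass the value column as a plain numeric vector inside mutate(), one key at a time. The `sales` data below is made up for illustration (the series need enough observed values per key for the model fit), and the sketch assumes dplyr is installed alongside tsibble and imputeTS.

```r
library(tsibble)
library(dplyr)
library(imputeTS)

# Hypothetical data: two keys, twelve years each, with short and long gaps
sales <- tsibble(
  year  = rep(2001:2012, times = 2),
  fruit = rep(c("kiwi", "cherry"), each = 12),
  kilo  = c(5, 7, NA, 6, 9, NA, NA, NA, 12, 11, 15, 14,
            3, NA, 6, 5, 8, 7, NA, NA, 12, 11, 14, 13),
  key = fruit,
  index = year
)

# Impute the numeric column per key; gaps longer than maxgap stay NA
filled <- sales |>
  group_by_key() |>
  mutate(kilo = na_kalman(kilo, maxgap = 1)) |>
  ungroup()

kiwi   <- filled$kilo[filled$fruit == "kiwi"]
cherry <- filled$kilo[filled$fruit == "cherry"]
```

With maxgap = 1 only the single-year gaps should be filled, while the two-year and three-year gaps are left as NA.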

Middle-out fabletools reconciliation approach gives a problem with the forecast function

I have a grouped time series of items and their categories, and I would like to forecast sales six months ahead.
I would like to use the intermediate level (category) for the base forecasts, because seasonality and trend may be captured better at that level.
So I keyed my data and would like to use the middle-out approach: total sales are obtained bottom-up and individual items are forecast top-down.
I'm using the fabletools middle_out() function, but when I try to forecast it doesn't work.
This is my code:
library(reshape)
library(tidyverse)
library(tsibble)
library(dplyr)
library(fable)
library(fpp2)
library(forecast)
#read data from csv
#example dataset
set.seed(42) ## for sake of reproducibility
n <- 6
data_example <- data.frame(Date=seq.Date(as.Date("2020-12-01"), as.Date("2021-05-01"), "month"),
No_=sample(1800:1830, n, replace=TRUE),
Category=rep(LETTERS[1:3], n),
Quantity=sample(18:24, n, replace=TRUE))
sell_full <- data_example %>% mutate(Month=yearmonth(Date)) %>% group_by(No_,Category, Month) %>% summarise(Quant = sum(Quantity), .groups = 'keep')
sell_full <- na.omit(sell_full)
#data
#conversion to tsibble for forecastings
sell_full <- as_tsibble(sell_full, key=c(No_, Category), index=Month)
sell_full <- sell_full %>% aggregate_key((Category/No_), Quant= sum(Quant))
#sell_full<- filter(sell_full, !is.na(sell_full$Quant))
sell_full <- sell_full %>% fill_gaps(Quant=0, .full=TRUE)
fit <- sell_full %>%model(ets = ETS(Quant~ error("A") + trend("A") + season("A")))%>% middle_out(split=1)
fc <- forecast(fit, h = "6 months", level=1,lambda="auto")
If I pass method = "mo" to forecast(), as the documentation says, it returns this error:
Error in meanf(object, h = h, level = level, fan = fan, lambda = lambda, :
unused argument (method = "mo")
If I don't pass method to forecast(), it returns this error:
<error/vctrs_error_ptype2>
Error in `vec_compare()`:
! Can't combine `..1` <agg_vec> and `..2` <double>.
---
Backtrace:
1. generics::forecast(fit, h = "6 months", level = 1, lambda = "auto")
2. forecast:::forecast.default(fit, h = "6 months", level = 1, lambda = "auto")
3. forecast:::forecast.ts(object, ...)
4. forecast::meanf(...)
5. forecast::BoxCox(x, lambda)
6. forecast::BoxCox.lambda(x, lower = -0.9)
7. fabletools:::Ops.lst_mdl(x, 0)
11. fabletools:::map2(e1, e2, .Generic)
12. base::mapply(.f, .x, .y, MoreArgs = list(...), SIMPLIFY = FALSE)
13. vctrs:::`<=.vctrs_vctr`(dots[[1L]][[1L]], dots[[2L]][[1L]])
14. vctrs::vec_compare(e1, e2)
The documentation on this is very poor.
Can someone help me?
UPDATE:
As someone suggested, I tried removing some packages; the libraries I load are now:
library(tsibble)
library(dplyr)
library(fable)
library(fpp3)
library(conflicted)
Now the error has changed. When I call forecast() I get this error:
Error in build_key_data_smat(key_data) :
argument "key_data" is missing, with no default
and if I pass key_data = "Category" (Category is the split layer), the error is:
fc <- forecast(fit, h = "6 months",level=1,lambda="auto", key_data= "Category")
Error in -ncol(x) : invalid argument to unary operator
library(conflicted)
library(fpp3)
library(tidyverse)
n <- 6
data_example <- data.frame(Date = seq.Date(as.Date("2020-12-01"), as.Date("2021-05-01"), "month"),
No_ = sample(1800:1830, n, replace = TRUE),
Category = rep(LETTERS[1:3], n),
Quantity = sample(18:24, n, replace = TRUE))
sell_full <- data_example |> mutate(Month = yearmonth(Date)) |> group_by(No_,Category, Month) |> summarise(Quant = sum(Quantity), .groups = 'keep')
sell_full <- ungroup(sell_full)
sell_full <- as_tsibble(sell_full, key = c(No_, Category), index = Month)
sell_full <- sell_full %>% aggregate_key((Category/No_), Quant = sum(Quant))
sell_full <- sell_full %>% fill_gaps(Quant = 0, .full = TRUE)
fit <- sell_full %>% model(ets = ETS(Quant~ error("A") + trend("A")))
fc <- fabletools::forecast(fit, h = "6 months", lambda = "auto")
Thought I'd have a look at the code that generates sell_full.
I added an ungroup(), took out the seasonal term, and took out middle_out(). It runs now and no longer asks for key_data. The ungroup() because it seemed that you were finished with the grouping; the seasonal term because the data do not support it; middle_out() because it causes the prompt for key_data. I spent a bit of time on why middle_out() leads to forecast() asking for key_data, though, hence the comment above.
This led me to try another way to do middle-out:
fit <- sell_full %>% model(ets = ETS(Quant~ error("A") + trend("A"))) |> reconcile(mo = middle_out(ets))
This runs fine. The idea came from fpp3. Hope this helps! :-)

How to identify the name of the column with the maximum of the dataset in R? [duplicate]

I can only find information for finding the max value for each row.
But I need the max value among multiple rows and columns and to find the column name corresponding to it.
e.g if my dataset looks like:
data <- data.frame(Year = c(2001, 2002, 2003),
X = c(3, 2, 45),
Y = c(6, 20, 23),
Z = c(10, 4, 4))
I want my code to return "X" because 45 is the maximum.
I suppose one way to approach this is to turn your wide dataset into a long (tidy) table and then filter for the max value and extract that value name.
library(tidyverse)
df <- read.table(text = "Year X Y Z
2001 3 6 10
2002 2 20 4
2003 45 23 4", header = T)
df %>%
pivot_longer(cols = c("X", "Y", "Z"), names_to = "column") %>%
filter(max(value) == value) %>%
pull(column)
# [1] "X"
If you have a large number of columns, you can pivot from wide to long without naming every column (as I do in the pivot_longer(...) call above) by running this instead:
df %>%
pivot_longer(cols = setdiff(names(.), "Year"), names_to = "column") %>%
filter(max(value) == value) %>%
pull(column)
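A slightly more compact variant of the same idea, as a sketch: pivot_longer() accepts a negative selection (cols = -Year), and dplyr's slice_max() keeps the row or rows holding the largest value. The data frame is rebuilt here so the snippet runs on its own; the name max_col is only illustrative.

```r
library(dplyr)
library(tidyr)

df <- data.frame(Year = c(2001, 2002, 2003),
                 X = c(3, 2, 45),
                 Y = c(6, 20, 23),
                 Z = c(10, 4, 4))

max_col <- df |>
  pivot_longer(cols = -Year, names_to = "column") |>  # every column except Year
  slice_max(value, n = 1) |>                          # keeps ties by default
  pull(column)

max_col
```

Because slice_max() keeps ties by default, several column names come back if the maximum occurs more than once.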
A base R solution:
Assuming that you want to exclude the Year variable from this analysis:
dat <- data.frame(Year = c(2000, 2001, 2002),
X = c(1, 2, 45),
Y = c(3, 4, 5))
dat_ex_year <- dat[, !names(dat) %in% c("Year")]
names(dat_ex_year)[which(dat_ex_year == max(dat_ex_year), arr.ind = TRUE)[,2]]
which gives:
[1] "X"
EDIT: I slightly adjusted the code so that it returns all column names when the maximum value occurs in several columns, e.g. with:
dat <- data.frame(Year = c(2000, 2001, 2002),
X = c(1, 2, 45),
Y = c(3, 45, 5))
the code gives:
[1] "X" "Y"
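To make the base R one-liner easier to follow, here is a sketch that splits it into steps. which(..., arr.ind = TRUE) returns one row per match with its row and column position, and the column positions are then mapped back to names. The drop = FALSE and unique() calls are additions not in the answer above: they guard against a single remaining column and against the maximum appearing twice in the same column.

```r
dat <- data.frame(Year = c(2000, 2001, 2002),
                  X = c(1, 2, 45),
                  Y = c(3, 45, 5))

# Keep every column except Year, as a numeric matrix
vals <- as.matrix(dat[, setdiff(names(dat), "Year"), drop = FALSE])

# One row per cell equal to the overall maximum, with "row" and "col" positions
hits <- which(vals == max(vals), arr.ind = TRUE)

# Translate column positions into column names, dropping repeats
max_cols <- unique(colnames(vals)[hits[, "col"]])
max_cols
```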

Using skimr to create a data frame of summary statistics

I have recently come across the skimr package, which helps create useful summary statistics. I have written the following code to extract summary statistics for numeric columns only. My first question: does skimr offer a more direct way to specify the types of variables I want summary statistics for? My second question: what does append = TRUE actually achieve when I write the my_skim "closure"?
library(skimr)
library(dplyr)
### Creating an example dataset
test.df1 <- data.frame("Year" = sample(2018:2020, 20, replace = TRUE),
"Firm" = head(LETTERS, 5),
"Exporter"= sample(c("Yes", "No"), 20, replace = TRUE),
"Revenue" = sample(100:200, 20, replace = TRUE),
stringsAsFactors = FALSE)
test.df1 <- rbind(test.df1,
data.frame("Year" = c(2018, 2018),
"Firm" = c("Y", "Z"),
"Exporter" = c("Yes", "No"),
"Revenue" = c(NA, NA)))
test.df1 <- test.df1 %>% mutate(Profit = Revenue - sample(20:30, 22, replace = TRUE ))
### Using skimr package to extract summary stats
my_skim <- skim_with(numeric = sfl(minimum = min, maximum = max, hist = NULL), append = TRUE)
test.df1_skim1 <- test.df1 %>%
group_by(Year) %>%
my_skim() %>%
filter(skim_type != "character") %>%
select(-starts_with("character"))
If you only want a summary of the numeric variables, you could set all the other types to NULL, or you could run skim() and use yank() to get the subtable for one type.
From https://docs.ropensci.org/skimr/articles/skimr.html#reshaping-the-results-from-skim-
skim(Orange) %>% yank("numeric")
The append option controls whether the statistics you supply replace the defaults (append = FALSE) or are added to them (append = TRUE).
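To see what append changes, one can compare the columns produced by two custom skimmers. This is a small sketch on the built-in iris data rather than the data from the question; the names skim_append and skim_replace are only illustrative.

```r
library(skimr)

# Same custom statistic, once added to the defaults and once replacing them
skim_append  <- skim_with(numeric = sfl(minimum = min), append = TRUE)
skim_replace <- skim_with(numeric = sfl(minimum = min), append = FALSE)

cols_append  <- names(skim_append(iris))
cols_replace <- names(skim_replace(iris))

# Columns that exist only when the defaults are kept
setdiff(cols_append, cols_replace)
```

With append = TRUE the numeric.minimum column should sit next to the default numeric columns such as numeric.mean; with append = FALSE it should be the only numeric statistic.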

Summarise data for multiple variables of a data.frame in R?

I am trying to compute the upper and lower quartiles of the two variables in my data.frame across the time period of interest. The code below gives me a single value for the upper and the lower quartile.
library(tidyverse)
library(lubridate)
set.seed(50)
FakeData <- data.frame(seq(as.Date("2001-01-01"), to= as.Date("2003-12-31"), by="day"),
A = runif(1095, 0,10),
D = runif(1095,5,15))
colnames(FakeData) <- c("Date", "A","D")
statistics <- FakeData %>%
gather(-Date, key = "Variable", value = "Value") %>%
mutate(Year = year(Date), Month = month(Date)) %>%
filter(between(Month,3,5)) %>%
mutate(NewDate = ymd(paste("2020", Month,day(Date), sep = "-"))) %>%
group_by(Variable, NewDate) %>%
summarise(Upper = quantile(Value,0.75, na.rm = T),
Lower = quantile(Value, 0.25, na.rm = T))
I would like output like the following (Final_Output is what I am interested in):
Output1 <- data.frame(seq(as.Date("2000-03-01"), to= as.Date("2000-05-31"), by="day"),
Upper = runif(92, 0,10), lower = runif(92,5,15), Variable = rep("A",92))
colnames(Output1)[1] <- "Date"
Output2 <- data.frame(seq(as.Date("2000-03-01"), to= as.Date("2000-05-31"), by="day"),
Upper = runif(92, 2,10), lower = runif(92,5,15), Variable = rep("D",92))
colnames(Output2)[1] <- "Date"
Final_Output<- bind_rows(Output1,Output2)
I can propose a data.table solution. In fact, there are several ways to do it.
The final step (computing quartiles by group on the Value variable) could be written as follows if you want two columns, as in your example:
library(data.table)
setDT(statistics)
statistics[, .(p25 = quantile(Value, probs = 0.25), p75 = quantile(Value, probs = 0.75)),
by = c("Variable", "NewDate")]
If you prefer long-formatted output:
statistics[, .(prob = c("p25", "p75"), value = quantile(Value, probs = c(0.25, 0.75))),
by = c("Variable", "NewDate")]
All steps together
If you choose data.table, it's probably better to do all the steps with data.table verbs. I will assume your data have a structure similar to the data frame you generated and arranged, i.e.
statistics <- FakeData %>%
gather(-Date, key = "Variable", value = "Value")
In that case, after converting with setDT(), the mutate and filter steps become
setDT(statistics)
statistics[, `:=`(Year = year(Date), Month = month(Date))]
statistics <- statistics[Month %between% c(3, 5)]
statistics[, NewDate := ymd(paste("2020", Month, day(Date), sep = "-"))]
And choose the final step you prefer, e.g.
statistics[,.('p25' = quantile(get('Value'), probs = 0.25), 'p75' = quantile(get('Value'), probs = 0.75)),
by = c("Variable", "NewDate")]
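The steps above can be put together into one self-contained script. This is a sketch that uses only data.table, so melt() stands in for gather(), and sprintf() with data.table's month() and mday() stands in for lubridate's ymd() and day(); the names dt, long, and res are illustrative.

```r
library(data.table)

set.seed(50)
dt <- data.table(
  Date = seq(as.Date("2001-01-01"), as.Date("2003-12-31"), by = "day"),
  A = runif(1095, 0, 10),
  D = runif(1095, 5, 15)
)

# Wide to long: one row per date and variable
long <- melt(dt, id.vars = "Date", variable.name = "Variable", value.name = "Value")

# Keep March to May and map every date onto the same reference year
long <- long[month(Date) %between% c(3, 5)]
long[, NewDate := as.Date(sprintf("2020-%02d-%02d", month(Date), mday(Date)))]

# Quartiles per variable and calendar day, pooled over the three years
res <- long[, .(Lower = quantile(Value, 0.25, na.rm = TRUE),
                Upper = quantile(Value, 0.75, na.rm = TRUE)),
            by = .(Variable, NewDate)]
res
```

Each group holds the three yearly values for one calendar day, so the result has one row per variable and day (92 days for each of A and D).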
