I'm trying to calculate a rolling beta regression for multiple stocks, using a window of the past 12 months.
My data frame dt contains the columns Stkcd, Date, MKT_ex and r_rf.
I searched through a lot of posts, but I couldn't get any of the suggestions to work for my data frame.
func1 <- . %>% {
roll_regres.fit(x = cbind(1, .$MKT_ex),
y = .$r_rf, width = 12L)$coefs }
out <- dt %>%
group_by(Stkcd) %>%
# make it explicit that data needs to be sorted
arrange(Date, .by_group = TRUE) %>%
do(cbind(reg_col = select(., MKT_ex, r_rf) %>% func1,
date_col = select(., Date))) %>%
ungroup
I get the error message:
Error in roll_cpp(Y = y, X = x, window = width, do_compute_R_sqs =
do_compute_R_sqs, : 'dchdd' failed with code -1
My goal is to get an output that contains the Date, Stkcd (stock number) and the calculated beta (r_rf regressed on MKT_ex).
What did I miss in my code?
I also found this code on the forum, from G. Grothendieck; sadly it doesn't work for my dataset either, and I can't find out why.
rolli <- function(ix) {
data.frame(coef = rollapplyr(ix, width = 12, function(ix) {
coef(lm(y ~ x, data = dat, subset = ix))[2]
}, by = 1), Date = dt$Date[ix][1], Stkcd = dt$Stkcd[ix][1])
}
do.call("rbind", by(1:nrow(dt), dat[c("Date", "Stkcd")], rolli))
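For reference, here is a minimal base R sketch of a 12-month rolling beta per stock. It is not the rollRegres or zoo approach from the posts above. The data are simulated stand-ins with the same column names as in the question, and roll_beta is a hypothetical helper that uses the fact that the slope of a simple regression of y on x equals cov(y, x) / var(x).

```r
set.seed(1)

# Simulated stand-in for dt (assumption: two stocks, 24 monthly rows each)
dt <- data.frame(
  Stkcd  = rep(c("S1", "S2"), each = 24),
  Date   = rep(seq(as.Date("2020-01-01"), by = "month", length.out = 24), 2),
  MKT_ex = rnorm(48)
)
dt$r_rf <- 0.5 * dt$MKT_ex + rnorm(48, sd = 0.1)

# Hypothetical helper: slope of y ~ x over a trailing window of `width` rows
roll_beta <- function(y, x, width = 12L) {
  n <- length(y)
  out <- rep(NA_real_, n)          # NA until a full window is available
  if (n < width) return(out)
  for (i in width:n) {
    idx <- (i - width + 1L):i
    out[i] <- cov(y[idx], x[idx]) / var(x[idx])
  }
  out
}

# Sort by stock and date, then fill in the beta stock by stock
dt <- dt[order(dt$Stkcd, dt$Date), ]
dt$beta <- NA_real_
for (s in unique(dt$Stkcd)) {
  rows <- dt$Stkcd == s
  dt$beta[rows] <- roll_beta(dt$r_rf[rows], dt$MKT_ex[rows])
}

head(dt[!is.na(dt$beta), c("Date", "Stkcd", "beta")])
```

The result keeps Date and Stkcd next to each beta; the first 11 rows of every stock stay NA because no full 12-month window exists yet.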
I have a grouped time series of items and their categories, and I would like to produce a 6-month sales forecast.
I would like to use the intermediate level (category) for the base forecasts, because seasonality and trends may be captured better there.
So I grouped my data by key, and I would like to use the middle-out approach: total sales are forecast bottom-up from the category level, and single items are forecast top-down from it.
I'm using the fabletools middle_out function, but when I try to forecast it doesn't work.
This is my code:
library(reshape)
library(tidyverse)
library(tsibble)
library(dplyr)
library(fable)
library(fpp2)
library(forecast)
#read data from csv
#example dataset
set.seed(42) ## for sake of reproducibility
n <- 6
data_example <- data.frame(Date=seq.Date(as.Date("2020-12-01"), as.Date("2021-05-01"), "month"),
No_=sample(1800:1830, n, replace=TRUE),
Category=rep(LETTERS[1:3], n),
Quantity=sample(18:24, n, replace=TRUE))
sell_full <- data_example %>% mutate(Month=yearmonth(Date)) %>% group_by(No_,Category, Month) %>% summarise(Quant = sum(Quantity), .groups = 'keep')
sell_full <- na.omit(sell_full)
#data
#conversion to tsibble for forecastings
sell_full <- as_tsibble(sell_full, key=c(No_, Category), index=Month)
sell_full <- sell_full %>% aggregate_key((Category/No_), Quant= sum(Quant))
#sell_full<- filter(sell_full, !is.na(sell_full$Quant))
sell_full <- sell_full %>% fill_gaps(Quant=0, .full=TRUE)
fit <- sell_full %>%model(ets = ETS(Quant~ error("A") + trend("A") + season("A")))%>% middle_out(split=1)
fc <- forecast(fit, h = "6 months", level=1,lambda="auto")
If I put method="mo" in the forecast call, as the documentation says, it returns this error:
Error in meanf(object, h = h, level = level, fan = fan, lambda = lambda, :
unused argument (method = "mo")
If I don't pass method to forecast, it returns this error:
<error/vctrs_error_ptype2>
Error in `vec_compare()`:
! Can't combine `..1` <agg_vec> and `..2` <double>.
---
Backtrace:
1. generics::forecast(fit, h = "6 months", level = 1, lambda = "auto")
2. forecast:::forecast.default(fit, h = "6 months", level = 1, lambda = "auto")
3. forecast:::forecast.ts(object, ...)
4. forecast::meanf(...)
5. forecast::BoxCox(x, lambda)
6. forecast::BoxCox.lambda(x, lower = -0.9)
7. fabletools:::Ops.lst_mdl(x, 0)
11. fabletools:::map2(e1, e2, .Generic)
12. base::mapply(.f, .x, .y, MoreArgs = list(...), SIMPLIFY = FALSE)
13. vctrs:::`<=.vctrs_vctr`(dots[[1L]][[1L]], dots[[2L]][[1L]])
14. vctrs::vec_compare(e1, e2)
The documentation on this is very poor.
Can someone help me?
UPDATE:
As someone suggested, I removed some packages. The libraries I load now are:
library(tsibble)
library(dplyr)
library(fable)
library(fpp3)
library(conflicted)
Now the error has changed. When I call forecast, I get this error:
Error in build_key_data_smat(key_data) :
argument "key_data" is missing, with no default
and if I pass key_data = "Category" (Category is the split layer), the error is:
fc <- forecast(fit, h = "6 months",level=1,lambda="auto", key_data= "Category")
Error in -ncol(x) : invalid argument to unary operator
library(conflicted)
library(fpp3)
library(tidyverse)
n <- 6
data_example <- data.frame(Date = seq.Date(as.Date("2020-12-01"), as.Date("2021-05-01"), "month"),
No_ = sample(1800:1830, n, replace = TRUE),
Category = rep(LETTERS[1:3], n),
Quantity = sample(18:24, n, replace = TRUE))
sell_full <- data_example |> mutate(Month = yearmonth(Date)) |> group_by(No_,Category, Month) |> summarise(Quant = sum(Quantity), .groups = 'keep')
sell_full <- ungroup(sell_full)
sell_full <- as_tsibble(sell_full, key = c(No_, Category), index = Month)
sell_full <- sell_full %>% aggregate_key((Category/No_), Quant = sum(Quant))
sell_full <- sell_full %>% fill_gaps(Quant = 0, .full = TRUE)
fit <- sell_full %>% model(ets = ETS(Quant~ error("A") + trend("A")))
fc <- fabletools::forecast(fit, h = "6 months", lambda = "auto")
I thought I'd have a look at the code that generates sell_full.
I added an ungroup, took out the seasonal component, and took out middle_out. It runs now and no longer asks for key_data. The ungroup is there because you seemed to be finished with the grouping. The seasonal component was removed because the data do not support it. middle_out was removed because it triggered the request for key_data. I spent a bit of time on why middle_out leads forecast to ask for key_data, hence the code above.
This led me to try another way to do middle_out:
fit <- sell_full %>% model(ets = ETS(Quant~ error("A") + trend("A"))) |> reconcile(mo = middle_out(ets))
This runs fine. The idea came from fpp3. Hope this helps! :-)
My dataframe looks like this:
Date = c(rep(as.Date(seq(15000,15012)),2))
Group = c(rep("a",13),rep("b",13))
y = c(seq(1,26,1))
x1 = c(seq(0.01,0.26,0.01))
x2 = c(seq(0.02,0.26*2,0.02))
df = data.frame(Group,Date,y,x1,x2)
head(df,3)
Group  Date        y  x1    x2
a      2011-01-26  1  0.01  0.02
a      2011-01-27  2  0.02  0.04
a      2011-01-28  3  0.03  0.06
I would like to run a multiple regression by group (y as the dependent variable, x1 and x2 as the independent variables) over a rolling window, e.g. of width 3.
I tried to do this with the tidyverse and zoo packages using the following code, but it failed.
## define multi-var-linear regression function and get the residual
rsd <- function(df){
lm(formula = y~x1+x2, data = as.data.frame(df), na.action = na.omit) %>%
resid() %>%
return()
}
## apply it by group with rolling window
x <- df %>% group_by(Group) %>%
rollapplyr(. , width = 3, FUN = rsd)
The output of this code is not what I actually want.
Does anyone know how to do multiple regression by group in a rolling window?
Thanks in advance, Giselle
Thanks, Grothendieck and Marcus, for your code!
It really helped me a lot :)
I have appended both versions here:
# Grothendieck method
rsd <- function(df){
lm(formula = y~x1+x2, data = as.data.frame(df), na.action = na.omit) %>%
resid() %>%
return()
}
width <- 5
df_m2 <-
df %>%
group_by(Group) %>%
group_modify(~ {
cbind(., rollapplyr(.[c("y", "x1", "x2")], width, rsd, fill = NA,
by.column = FALSE))
}) %>%
ungroup %>%
select(c("Group","Date","5")) %>%
dplyr::rename(residual_m2 = "5")
# Marcus method
output <- data.frame()
for (i in unique(df$Group)) {
a = df%>% subset(Group==i)
a[,"residual"] = NA
max = nrow(a)
if(max<5){
next
}
for (j in seq(5,max,by=1)) {
b = a %>% slice((j-4):j)
lm_ = lm(y~x1+x2, data = b)
a[j,]$residual = residuals(lm_)[5]
}
output <-
output %>%
rbind(a)
}
Use group_modify and use rollapplyr with the by.column = FALSE argument so that rsd is applied to all columns at once rather than one at a time.
Note that if you use width 3 with two predictors and an intercept, the model has as many parameters as observations, so the residuals will necessarily all be zero; we therefore changed the width to 5.
library(dplyr, exclude = c("lag", "filter"))
library(zoo)
width <- 5
df %>%
group_by(Group) %>%
group_modify(~ {
cbind(., rollapplyr(.[c("y", "x1", "x2")], width, rsd, fill = NA,
by.column = FALSE))
}) %>%
ungroup
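The remark about width 3 can be checked directly. This is a small sketch on simulated data (the data frame d is made up for illustration): with three observations and three coefficients the fit is exact, whereas five observations leave non-zero residuals.

```r
set.seed(1)

# Made-up data: 5 observations, two predictors
d <- data.frame(y = rnorm(5), x1 = rnorm(5), x2 = rnorm(5))

# 3 rows, 3 coefficients (intercept, x1, x2): the fit passes through every point
resid3 <- residuals(lm(y ~ x1 + x2, data = d[1:3, ]))

# 5 rows, 3 coefficients: two residual degrees of freedom remain
resid5 <- residuals(lm(y ~ x1 + x2, data = d))

round(resid3, 10)
round(resid5, 10)
```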
A good old-fashioned for-loop here could be:
for (i in unique(df$Group)){
for (j in (seq(15000,15012, 3))){
lm_ <- lm(formula = df[df$Group== i & df$Date %in% c(j, j+1, j+2), 3] ~ df[df$Group== i & df$Date %in% c(j, j+1, j+2), 4] + df[df$Group== i & df$Date %in% c(j, j+1, j+2), 5], na.action = na.omit)
print(paste('Group', i, 'Dates from', j, 'to', j+3, residuals(lm_)))
}
}
Here is a minimal example.
df <- data.frame("Treatment" = c(rep("A", 2), rep("B", 2)), "Price" = 1:4, "Cost" = 2:5)
I want to summarize the data by treatments for all the variables I have, and put them together, so I define a function to do this for each variable first, and then rbind them later on.
SummarizeFn <- function(x,y,z) {
x %>% group_by(Treatment) %>%
summarize(n = n(), Mean = mean(y), SD = sd(y)) %>%
cbind("Var" = rep(y, 3)) # add a column to show which variable those statistics belong to.
}
SumPrice <- SummarizeFn(df, df$Price, "Price")
However, R tells me that object "Price" is not found. How can I solve this problem?
Also, how can I pass y as a character string indicating that the mean and SD are those of Price?
Price isn't a standalone variable; it only exists as a column of your data frame df, so you need to pass df$Price, as in SummarizeFn(df, df$Price, "Price").
SummarizeFn <- function(x,y,z)
{
df1<-(x %>% group_by(Treatment)
%>% summarize(n = n(), Mean = mean(y), SD = sd(y))
)
df1<- df1 %>% mutate ("Var" = z)
return(df1)
}
SumPrice <- SummarizeFn(df, df$Price,"Price")
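One caveat with passing df$Price as a vector: inside a grouped summarize, mean(y) then sees the whole column rather than the rows of the current group. A possible alternative, sketched here under the assumption that the column name is passed as a string, is to look the column up with dplyr's .data pronoun. The name summarize_var is hypothetical.

```r
library(dplyr)

df <- data.frame(Treatment = c(rep("A", 2), rep("B", 2)), Price = 1:4, Cost = 2:5)

# Hypothetical helper: `var` is a column name given as a character string
summarize_var <- function(data, var) {
  data %>%
    group_by(Treatment) %>%
    summarize(n = n(),
              Mean = mean(.data[[var]]),   # evaluated within each group
              SD = sd(.data[[var]]),
              .groups = "drop") %>%
    mutate(Var = var)                      # label the rows with the variable name
}

sum_price <- summarize_var(df, "Price")

# Stack the summaries for several variables
sum_all <- bind_rows(lapply(c("Price", "Cost"), summarize_var, data = df))
sum_all
```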
I'm collecting time series data from Wikipedia and want to run a change-point analysis on each time series using dplyr. But when I do so I get an error saying the data need to be numeric, even though the class function states it is numeric. Hope you can help.
library(changepoint)
library(dplyr)
library(pageviews)
library(data.table)
articles <- c("Rugby_union", "Football")
foo <- function(x){article_pageviews(project = "en.wikipedia",
article = x,
start = as.Date('2017-01-01'),
end = as.Date("2017-12-31")
, user_type = "user", platform = c("mobile-web"))
}
output<-articles %>% foo
output %>%
select(article, views) %>%
do(cpt.mean(.))
class(output$views)
library(changepoint)
library(dplyr)
library(pageviews)
articles <- c("Rugby_union", "Football")
foo <- function(x){article_pageviews(project = "en.wikipedia", article = x,
start = as.Date('2017-01-01'),
end = as.Date("2017-12-31"),
user_type = "user", platform = c("mobile-web"))
}
output <- articles %>%
foo
df <- as.data.frame(table(output$article))
output1 <- output %>%
dplyr::select(article, views) %>%
dplyr::filter(article == df[1,1])
output2 <- output %>%
dplyr::select(article, views) %>%
dplyr::filter(article == df[2,1])
q <- floor((min(length(output1$views), length(output2$views)))/2 + 1)
cp1 <- changepoint::cpt.mean(data = output1$views, Q = q, method = "BinSeg", penalty = "SIC")
plot(cp1)
cp2 <- changepoint::cpt.mean(data = output2$views, Q = q, method = "BinSeg", penalty = "SIC")
plot(cp2)
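The error in the question most likely comes from handing cpt.mean a whole data frame (including the character column article) instead of a numeric vector. As a shorter variant of the code above, the views can be split by article and one model fitted per piece. This is only a sketch: output_sim is simulated data standing in for the pageviews result, and the default method and penalty of cpt.mean are used.

```r
library(changepoint)

set.seed(1)

# Simulated stand-in for the pageviews output (not real Wikipedia data)
output_sim <- data.frame(
  article = rep(c("Rugby_union", "Football"), each = 100),
  views   = c(rnorm(50, 100, 5), rnorm(50, 200, 5),
              rnorm(50, 300, 5), rnorm(50, 150, 5))
)

# Split the numeric views vector by article and fit one model per article
fits <- lapply(split(output_sim$views, output_sim$article), cpt.mean)

# Estimated change-point locations for each article
lapply(fits, cpts)
```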
I have three large matrices: a numeric one, a count one, and an over-dispersed count one:
library(MASS)  # provides rnegbin and glm.nb
set.seed(1)
numeric.mat <- matrix(rnorm(10000*6000),10000,6000)
count.mat <- matrix(rpois(10000*6000,10),10000,6000)
dispersed.count.mat <- matrix(rnegbin(10000*6000,10,2),10000,6000)
And one corresponding factors data.frame (can be a matrix too):
factors.df <- data.frame(f1 = sample(LETTERS[1:3], 10000, replace = T),
f2 = sample(LETTERS[4:5], 10000, replace = T))
The number of factors is pretty small (in this case only 2 but won't be more than 5 for real data), and the number of levels in each (they're all categorical) is also small (also up to 5).
I'd like to obtain the residuals for fitting a linear, poisson, and negative binomial regression models to each of the columns in each of the matrices, respectively.
So for a single column:
data.df <- factors.df %>%
dplyr::mutate(numeric.y = numeric.mat[,1],
count.y = count.mat[,1],
dispersed.count.y = dispersed.count.mat[,1])
I'd use:
lm(numeric.y ~ f1+f2, data = data.df)$residuals
residuals(object = glm(count.y ~ f1+f2, data = data.df, family = "poisson"), type = 'pearson')
residuals(object = glm.nb(formula = dispersed.count.y ~ f1+f2, data = data.df), type = 'pearson')
for the three regression models, respectively.
Is there a faster way of obtaining these residuals than looping over the columns, for example with do.call and lapply:
do.call(cbind,
lapply(1:ncol(numeric.mat),
function(i)
lm(numeric.y ~ f1+f2,
data = dplyr::mutate(factors.df,
numeric.y = numeric.mat[,i])
)$residuals
))
I'd slightly readjust how the workflow runs and allow it to be easily run in parallel.
library(tidyverse)  # dplyr verbs, purrr::map and the pipe
# Use variables to set the problem size, which makes it easier to change
iter <- 60
iter_samps <- 1000
factors_df <- data.frame(f1 = sample(LETTERS[1:3], iter_samps, replace = T),
f2 = sample(LETTERS[4:5], iter_samps, replace = T))
# using a data.frame in a longer format to hold the data, allows easier splitting
data_df <- rep(list(factors_df), iter) %>%
bind_rows(.id = "id") %>%
mutate(numeric_y = rnorm(iter_samps * iter),
count_y = rpois(iter_samps * iter, 10),
dispersed_count_y = MASS::rnegbin(iter_samps * iter, 10, 2))
# creating function that determines residuals
model_residuals <- function(data) {
data$lm_resid <- lm(numeric_y ~ f1+f2, data = data)$residuals
data$glm_resid <- residuals(object = glm(count_y ~ f1+f2, data = data, family = "poisson"), type = 'pearson')
return(data)
}
# How to run the models not in parallel
data_df %>%
split(.$id) %>%
map(model_residuals) %>%
bind_rows()
To run the models in parallel, you can use multidplyr to do all the annoying work:
library("multidplyr")
test = data_df %>%
partition(id) %>%
cluster_library("tidyverse") %>%
cluster_library("MASS") %>%
cluster_assign_value("model_residuals", model_residuals) %>%
do(results = model_residuals(.)) %>%
collect() %>%
.$results %>%
bind_rows()
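For the linear model only, another option worth trying is to give lm a matrix response: it then fits every column against the same design matrix in one call, and residuals() returns a matrix with one column per response. The sketch below uses small simulated matrices (the sizes n and p are arbitrary) and compares the result with the column-by-column loop. This shortcut does not carry over to glm or glm.nb, which fit one response at a time.

```r
set.seed(1)

n <- 200   # rows (arbitrary small size for illustration)
p <- 5     # columns

factors.df <- data.frame(f1 = sample(LETTERS[1:3], n, replace = TRUE),
                         f2 = sample(LETTERS[4:5], n, replace = TRUE))
numeric.mat <- matrix(rnorm(n * p), n, p)

# One call: all columns of numeric.mat are regressed on f1 + f2 at once
resid_all <- residuals(lm(numeric.mat ~ f1 + f2, data = factors.df))

# Column-by-column version, for comparison
resid_loop <- sapply(seq_len(p), function(i) {
  residuals(lm(numeric.mat[, i] ~ f1 + f2, data = factors.df))
})

dim(resid_all)
all.equal(resid_all, resid_loop, check.attributes = FALSE)
```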