I have a sample data set with the number of sales for 1000 different products in 13 countries over 3 years:
13 Countries = (US, China, UK…)
1000 Products = (Prod A, B, C …)
Number of Sales --> my dependent variable.
Number of Clicks and 3 more variables as independent variables.
I've coded a regression model and it works. As the next step I would like to do a rolling regression: how should I adapt the code for that?
Thanks for your help and many greetings! :)
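For context, a hypothetical mock of raw_data (using the column names from the code below, with far fewer products and countries as stand-ins) could look like this:

library(tidyverse)

# hypothetical mock of raw_data; column names match the regression code below
set.seed(42)
raw_data = expand_grid(
  product = paste0("Prod ", LETTERS[1:10]),  # stand-in for the 1000 products
  country = c("US", "China", "UK"),          # stand-in for the 13 countries
  month   = seq.Date(as.Date("2017-01-01"), as.Date("2019-12-01"), by = "month")
) %>%
  mutate(
    clicks    = rpois(n(), 200),
    needs     = rnorm(n()),
    guesses   = rnorm(n()),
    interests = rnorm(n()),
    sales     = rpois(n(), 50)
  )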
# prepare data
nest_dt = raw_data %>%
  group_by(product, country) %>%
  nest()
# function
lm_function = function(data, formula) {
  lm(formula = formula,
     data = data)
}
# regression
lm_data = nest_dt %>%
  mutate(lm = map(
    .x = data,
    .f = lm_function,
    formula = sales ~ clicks + needs + guesses + interests
  ))
# show solution
solution_data = lm_data %>%
  mutate(solution = map(lm, sw_tidy)) %>%  # sw_tidy() is from the sweep package
  unnest(solution) %>%
  select(-data, -lm)
# where to put?!?
rollapply(lm_function, width=10, roll)
You could try the roll_lm() function from the roll package. See the description here: Package ‘roll’ on CRAN.
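As a minimal sketch (assuming the nested structure nest_dt from the question, and an arbitrary window width of 10), it could slot in like this:

library(roll)

# sketch: fit a rolling regression within each product/country group
roll_lm_function = function(data, width = 10) {
  roll_lm(
    x = as.matrix(data[, c("clicks", "needs", "guesses", "interests")]),
    y = as.matrix(data[, "sales"]),
    width = width
  )
}

roll_data = nest_dt %>%
  mutate(roll = map(data, roll_lm_function))

# rolling coefficients for the first product/country group
roll_data$roll[[1]]$coefficients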
I want to create GLMER models with the SMOTE algorithm and resampling. This means that I need to create a recipe with step_smote() and use fit_resamples(). After reading this post and this post, I learned that I need to use add_model() and add_variables() in order to create a workflow object, and that add_recipe() cannot be used in this process. This means I cannot use step_smote(). The first link addresses this limitation. The question now is how I can use the SMOTE algorithm in this situation. In addition, I want to use a few other step_*() functions such as step_dummy() and step_rm(). In short, I want to use a recipe and cross-validation in GLMER modeling. Is there any way to make this happen? I decided to use the example in this post so that everyone can reproduce the same situation. Note that solutions do not have to follow the tidymodels way.
library(tidyverse)
library(tidymodels)
library(themis)
library(multilevelmod)  # provides the "glmer" engine
library(lme4)           # provides glmerControl()
# I use data from https://juliasilge.com/blog/himalayan-climbing/
members <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/members.csv")
# These are the same processes from the post above.
members_df <- members %>%
  filter(season != "Unknown", !is.na(sex), !is.na(citizenship)) %>%
  select(peak_id, year, season, sex, age, citizenship, hired, success, died) %>%
  mutate(died = case_when(died ~ "died", TRUE ~ "survived")) %>%
  mutate_if(is.character, factor) %>%
  mutate_if(is.logical, as.integer)
# Since the dataset is large, I subset it here for this question.
set.seed(386)
bind_rows(filter(members_df, died == "died"),
          filter(members_df, died == "survived") %>% sample_n(2000)) -> members_df
# Create data sets
set.seed(123)
members_split <- initial_split(members_df, strata = died)
members_train <- training(members_split)
members_test <- testing(members_split)
# Create CV folds
set.seed(123)
members_folds <- vfold_cv(members_train, v = 10, repeats = 5, strata = died)
# Here I want to use glmer. Create the model specification.
logistic_reg() %>%
  set_engine("glmer", family = binomial,
             control = glmerControl(optimizer = "bobyqa",
                                    optCtrl = list(maxfun = 2e5))) %>%
  set_mode("classification") -> members_spec
# As far as I learned from these links, I cannot use a recipe in order to run GLMER. But I create a recipe, anyway.
# The key is that I want to apply SMOTE algorithm to glmer modeling in tidymodels ways.
recipe(died ~ ., data = members_train) %>%
  step_rm(citizenship) %>%
  step_impute_median(age) %>%
  step_other(peak_id) %>%
  step_dummy(all_nominal(), -died) %>%
  step_smote(died) -> members_recipe
# This works as long as users do not need to use the recipes package.
# This is addressed in https://github.com/tidymodels/multilevelmod/issues/4
workflow() %>%
  add_model(spec = members_spec,
            formula = died ~ year + season + sex + age + (1|peak_id)) %>%
  add_variables(outcomes = died,
                predictors = c(year, season, sex, age, peak_id)) -> members_workflow
# This is the WRONG way.
workflow() %>%
  add_model(spec = members_spec,
            formula = died ~ year + season + sex + age + (1|peak_id)) %>%
  add_recipe(members_recipe) -> wrong_workflow
# This is another WRONG way. This causes an error message.
workflow() %>%
  add_model(spec = members_spec,
            formula = died ~ year + season + sex + age + (1|peak_id)) %>%
  add_variables(outcomes = died,
                predictors = c(year, season, sex, age, peak_id)) %>%
  add_recipe(members_recipe)
# Ideally, the workflow object here includes a recipe. But can we make it?
set.seed(777)
fit_resamples(object = wrong_workflow, # This workflow object needs to include a recipe
              resamples = members_folds,
              control = control_resamples(save_pred = TRUE, save_workflow = TRUE)) -> members_res
I am running a stochastic frontier model (using the frontier package) grouped by industry as follows:
data is a panel data frame indexed by individual id and year, with the columns below:
Columns: y1, x1, x2, and x3 are all numerical variables. industry is a character variable.
library(dplyr)
library(frontier)
sfa_out <- data %>%
  group_by(industry) %>%
  do(
    mod <- sfa(log(y1) ~ log(x1) + log(x2) + log(x3),
               ineffDecrease = T,
               truncNorm = F,
               timeEffect = T,
               data = .)
  )
I want mod to store the output of the SFA models estimated per industry group. I don't think SFA-specific knowledge is required here. Thanks.
This did the trick:
library(dplyr)
library(frontier)
library(plm)

sfa_out <- data %>%
  group_by(industry) %>%
  do(
    mod = sfa(log(y1) ~ log(x1) + log(x2) + log(x3),
              ineffDecrease = T,
              truncNorm = F,
              timeEffect = T,
              data = pdata.frame(., index = c("individual id", "year")))
  )
To then display each model you could do this:

# to display all industry models
sfa_out$mod

# to display a specific industry model
sfa_out$mod[[1]]
sfa_out$mod[[2]]
# ...
sfa_out$mod[[n]]

# to get the estimated efficiency measure for the nth model in sfa_out$mod
efficiencies(sfa_out$mod[[n]])
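Building on that, a further sketch (assuming the sfa_out structure above, and that frontier's coef() method applies to the fitted objects) collects the coefficients of every industry model into one named list:

# sketch: coefficients of every industry model in one named list
coefs_by_industry <- setNames(lapply(sfa_out$mod, coef), sfa_out$industry)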
I have a model that uses the previous prediction as a predictor, e.g. target ~ lag(target prediction).
Using purrr::accumulate() I'm able to write a custom function to predict. Here are some silly data and a silly model to illustrate:
### A model that uses a lag prediction as a predictor using purrr::accumulate() ###
library(tidyverse)  # for dplyr, purrr, and the diamonds data from ggplot2

my_diamonds <- diamonds %>%
  group_by(cut) %>%
  mutate(cumprice = cumsum(price)) %>%  # cumulative within groups
  mutate(lag_cumprice = lag(cumprice)) %>%
  mutate(InitialValue = min(cumprice)) %>%
  filter(!is.na(lag_cumprice)) %>%
  select(cut, cumprice, lag_cumprice, x, InitialValue)

silly_model <- glm(formula = cumprice ~ x + lag_cumprice, family = 'poisson', data = my_diamonds)
This model uses the previous prediction as input to the next prediction. I'm able to write a custom function to build up the predictions:
# when predicting we won't have lag_cumprice; instead the result of the previous prediction should be an input to the model:
accPrice <- function(mod, acc, cur) {
  db = cur_data_all()  # grouped data segment
  x = db$x[cur]        # cur is the current row in the data; use it to get 'this' iteration's value of x
  total_exponent <- mod$coefficients['(Intercept)'] +
    (mod$coefficients['x'] * x) +
    (mod$coefficients['lag_cumprice'] * acc)  # acc is the accumulated prediction for cumprice
  exp(total_exponent)  # invert the poisson log link to get the prediction
}
# now predict
my_diamonds <- my_diamonds %>%
  mutate(predicted = accumulate(.x = row_number()[-1], .init = InitialValue %>% unique(), .f = accPrice, mod = silly_model))
So far so good. In this example I used the previous prediction acc as an input.
But now I've created a variant model that uses two lagged variables as predictors:
### now a model with a lag on two variables, not just one ###
my_diamonds2 <- diamonds %>%
  group_by(cut) %>%
  mutate(cumprice = cumsum(price)) %>%  # cumulative within groups
  mutate(lag_cumprice = lag(cumprice)) %>%
  mutate(InitialValue = min(cumprice)) %>%
  mutate(rn = row_number()) %>%
  mutate(cumrn = cumsum(rn)) %>%
  mutate(lag_cumrn = lag(cumrn)) %>%
  filter(!is.na(lag_cumprice)) %>%
  select(cut, cumprice, lag_cumprice, lag_cumrn, x, InitialValue)

silly_model2 <- glm(formula = cumprice ~ x + lag_cumprice + lag_cumrn, family = 'poisson', data = my_diamonds2)
### Stuck after here ###
How can I modify the function accPrice() above to accumulate 2 variables, both lag_cumprice and lag_cumrn as opposed to just lag_cumprice as before?
We can extract the corresponding coefficient from the model and multiply it by the second lagged value. Note that cumrn is computed directly from row numbers rather than predicted, so its lag can be read straight from the data inside the function instead of being accumulated:
accPrice2 <- function(mod, acc, cur) {
  db = cur_data_all()       # grouped data segment
  x = db$x[cur]             # current row's value of x
  acc2 = db$lag_cumrn[cur]  # lag_cumrn is known data, not a prediction, so take it from the current row
  total_exponent <- mod$coefficients['(Intercept)'] +
    (mod$coefficients['x'] * x) +
    (mod$coefficients['lag_cumprice'] * acc) +
    (mod$coefficients['lag_cumrn'] * acc2)
  exp(total_exponent)       # invert the poisson log link
}
my_diamonds2 <- my_diamonds2 %>%
  mutate(predicted = accumulate(.x = row_number()[-1],
                                .init = InitialValue %>% unique(),
                                .f = accPrice2,
                                mod = silly_model2))  # note: silly_model2, the two-lag model
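As a quick sanity check (assuming the pipeline above has run), you can compare the accumulated predictions with the actuals for the first few rows of each group:

my_diamonds2 %>%
  group_by(cut) %>%
  slice_head(n = 3) %>%
  select(cut, cumprice, predicted)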
Once again I would like to thank Daniel Sjoberg and the other collaborators for the constant implementation of functionality in the gtsummary package. For me, it is one of the most efficient suites in R for processing and reporting results in tables and inline.
A while ago I asked for help on including the effect size [90% CI] in the analytical tables generated by the gtsummary package. However, in this new post I intend to switch to the effectsize package for the ES calculation, since it offers a vast repertoire of indices along with their qualitative magnitude. I tried to implement this other package in new code. However, this message is returned:
There was an error for variable 'age':
Error in .deal_with_cohens_d_arguments (x, y, data): Please provide data argument.
I believe I am not configuring the CohenD() function correctly. Could someone please help me with my code?
I copied it below:
library(gtsummary)
library(tidyverse)

CohenD <- function(data, variable, by, ...) {
  # Cohen's d, Hedges's g (correction=TRUE or FALSE) and Glass’s delta
  ES <- effectsize::cohens_d(data[[variable]] ~ as.factor(data[[by]]),
                             ci = .90,
                             pooled_sd = TRUE,
                             paired = FALSE,
                             correction = TRUE)
  # Formatting statistic with CI
  est <- style_sigfig(abs(ES$Cohens_d))
  ci_lower <- style_sigfig(ES$CI_low)
  ci_upper <- style_sigfig(ES$CI_high)
  # Returning estimate with CI together
  str_glue("{est} ({ci_lower}, {ci_upper})")
}
Table <-
  trial %>%
  select(trt, age) %>%
  tbl_summary(by = trt, missing = "no", label = list(age ~ "Age (yrs)"),
              statistic = list(all_continuous() ~ "{mean} ± {sd}"),
              digits = list(all_continuous() ~ c(1, 1))) %>%
  bold_labels() %>%
  italicize_levels() %>%
  add_p(test = everything() ~ t.test, pvalue_fun = partial(style_pvalue, digits = 2)) %>%
  add_stat(
    fns = everything() ~ CohenD,
    fmt_fun = NULL,
    header = "**ES (90% CI)**"
  ) %>%
  modify_footnote(add_stat_1 ~ "Hedges's g (90% CI)") %>%
  modify_header(label = "**Variables**", stat_by = "**{level}** (N= {n})")

Table
Would it be possible to include a new column in Table, or to join to the ES ± CI already provided by this function, the qualitative magnitude of the observed ES (interpreting the value based on a set of rules)? The suggestion comes from this feature:
effectsize::interpret_d(ES$Cohens_d, rules = "cohen1988")
Cheers,
Cristiano
The issue you're experiencing is in your user-defined function CohenD(): it did not like the way you were passing the formula. In the example below, I corrected the syntax. I also included the interpretation of the effect size.
library(gtsummary)
library(tidyverse)
# function that returns both Cohen's D and the 1988 interpretation of its size
CohenD <- function(data, variable, by, ...) {
  # Cohen's d, Hedges's g (correction=TRUE or FALSE) and Glass’s delta
  ES <- effectsize::cohens_d(data[[variable]], factor(data[[by]]),
                             ci = .90,
                             pooled_sd = TRUE,
                             paired = FALSE,
                             correction = TRUE)
  # Formatting statistic with CI
  est <- style_sigfig(ES$Cohens_d)
  ci_lower <- style_sigfig(ES$CI_low)
  ci_upper <- style_sigfig(ES$CI_high)
  # Returning estimate with CI together
  tibble(
    cohen_d = stringr::str_glue("{est} ({ci_lower}, {ci_upper})"),
    interpret_d = stringr::str_glue("{effectsize::interpret_d(ES$Cohens_d, rules = 'cohen1988')}")
  )
}
tbl <-
  trial %>%
  select(trt, age, marker) %>%
  tbl_summary(by = trt, missing = "no", statistic = all_continuous() ~ "{mean} ± {sd}") %>%
  add_p(test = everything() ~ t.test) %>%
  add_stat(fns = everything() ~ CohenD) %>%
  modify_header(cohen_d = "**ES (90% CI)**", interpret_d = "**Interpretation**")
Created on 2021-04-15 by the reprex package (v2.0.0)
I have a data frame "customers" consisting of customer id, month, and total purchases that month.
I'm trying to calculate a running slope for a window of 12 months using robust regression.
I have tried the following:
library(MASS)    # rlm()
library(dplyr)
library(runner)  # runner()

Coef <- function(x) {return(rlm(cbind(x) ~ cbind(1:length(x)))$coefficients[2])}

customer_slope = customers %>%
  mutate(slope = runner(x = total_purchases, k = 12, f = Coef))
I get the following error:
x 'x' is singular: singular fits are not implemented in 'rlm'
If I run a single example, the function returns what I expected:

Coef(c(4,11,7,15,5,14,8,9,14,17,14,13))
# cbind(1:length(x))
#          0.6888112
So I ran into similar problems and finally came to the solution below using slider. This gives a rolling estimate over windows of 3 observations (of course you can change that as you see fit). It doesn't quite get you to your answer (which you could probably get with loops), but it's most of the way there.
library(MASS)
library(dplyr)
library(slider)

dat <- tibble::tibble(customers = c(4,11,7,15,5,14,8,9,14,17,14,13)) %>%
  mutate(t = 1:n() %>% as.numeric())

dat %>%
  # note: k = 12 here is passed through to rlm's psi function (a tuning constant),
  # not a window width; the window is set by .before
  mutate(results = slide_dbl(.x = .,
                             .f = ~rlm(customers ~ t, k = 12, data = .x)$coefficients[2],
                             .before = 2,
                             .complete = TRUE))
It looks like that's the way to go, thanks!
It seems that what caused the singularity was that I didn't change the default .complete from FALSE to TRUE.
So, combined with your suggestion, this is how I made it work (it took about two hours for 3M rows; I did, however, have a more complex group_by involved, which is not shown below):
slope_rlm <- function(x) {
  x = as.numeric(x)
  prep = tibble(data = x) %>% mutate(t = 1:n() %>% as.numeric())
  return(rlm(data ~ t, data = prep)$coefficients[2])
}

customers_rlm = customers %>%
  mutate(cust_rlm_12 = slide_dbl(total_purchases, slope_rlm, .before = 11, .complete = TRUE))
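The grouped version might look like this (a hypothetical sketch: it assumes a single id column identifies each customer, whereas the real grouping was more complex):

customers_rlm = customers %>%
  group_by(id) %>%  # hypothetical customer id column
  mutate(cust_rlm_12 = slide_dbl(total_purchases, slope_rlm, .before = 11, .complete = TRUE)) %>%
  ungroup()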
Consider data with two customers over a 1000-day span. total_purchases is cumulated by customer, and each purchase size is ~ Pois(5).
library(dplyr)
library(MASS)
library(runner)
library(ggplot2)

set.seed(1)
customers <- data.frame(
  id = factor(rep(1:2, length.out = 100)),
  date = seq(Sys.Date(), Sys.Date() + 1000, length.out = 100)
) %>%
  group_by(id) %>%
  mutate(
    total_purchases = cumsum(rpois(n(), lambda = 5))
  )
When calculating a regression in a rolling window, make sure you handle errors coming from insufficient degrees of freedom, singularity, etc. That is why I've put tryCatch() around the rlm() call: if there is any error, the function returns NA for the failing window.
The data below is grouped by id, which means that a model is calculated per customer. The yearly rolling regression should converge to slope = 5 (plus or minus random error).
customers <- customers %>%
  group_by(id) %>%
  mutate(
    slope = runner(
      x = .,
      f = function(x) {
        tryCatch(
          rlm(x$total_purchases ~ seq_len(nrow(x)))$coefficients[2],
          error = function(e) NA
        )
      },
      idx = "date",
      k = "year"
    )
  )
Plotting the slope over time for each customer:

ggplot(customers, aes(x = date, y = slope, color = id, group = id)) +
  geom_line() +
  geom_hline(yintercept = 5, color = "red")