I want to create GLMER models with the SMOTE algorithm and resampling. This means that I need to create a recipe with step_smote() and use fit_resamples(). After reading this post, and this post, I learned that I need to use add_model() and add_variable() in order to create a workflow object, and that add_recipe() cannot be used in this process. This means I cannot use step_smote(). The first link addresses this limitation. Now the question is how I can use the SMOTE algorithm in this situation. In addition, I want to use a few other step_***() functions such as step_dummy() and step_rm(). In short, I want to use a recipe and cross validation in GLMER modeling processes. Is there any way to make this happen? I decided to use the example in this post so that everyone can reproduce the same situation. Note that solutions do not have to be in tidymodels ways.
library(tidyverse)
library(tidymodels)
library(themis)
# I use data from https://juliasilge.com/blog/himalayan-climbing/
members <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/members.csv")
# These are the same processes from the post above.
members_df <- members %>%
filter(season != "Unknown", !is.na(sex), !is.na(citizenship)) %>%
select(peak_id, year, season, sex, age, citizenship, hired, success, died) %>%
mutate(died = case_when(died ~ "died", TRUE ~ "survived")) %>%
mutate_if(is.character, factor) %>%
mutate_if(is.logical, as.integer)
# Since the dataset is large, I subset it here for this question.
set.seed(386)
bind_rows(filter(members_df, died == "died"),
filter(members_df, died == "survived") %>% sample_n(2000)) -> members_df
# Create data sets
set.seed(123)
members_split <- initial_split(members_df, strata = died)
members_train <- training(members_split)
members_test <- testing(members_split)
# Create CV folds
set.seed(123)
members_folds <- vfold_cv(members_train, v = 10, repeats = 5, strata = died)
# Here I wanna use glmer. Create model specification
logistic_reg() %>%
set_engine("glmer", family = binomial,
control = glmerControl(optimizer = "bobyqa",
optCtrl = list(maxfun = 2e5))) %>%
set_mode("classification") -> members_spec
# As far as I learned from these links, I cannot use a recipe in order to run GLMER. But I create a recipe, anyway.
# The key is that I want to apply SMOTE algorithm to glmer modeling in tidymodels ways.
recipe(died ~ ., data = members_train) %>%
step_rm(citizenship) %>%
step_impute_median(age) %>%
step_other(peak_id) %>%
step_dummy(all_nominal(), -died) %>%
step_smote(died) -> members_recipe
# This works as long as users do not have to use the recipe package.
# This is addressed in https://github.com/tidymodels/multilevelmod/issues/4
workflow() %>%
add_model(spec = members_spec,
formula = died ~ year + season + sex + age + (1|peak_id)) %>%
add_variables(outcomes = died,
predictors = c(year, season, sex, age, peak_id)) -> members_workflow
# This is the WRONG way.
workflow() %>%
add_model(spec = members_spec,
formula = died ~ year + season + sex + age + (1|peak_id)) %>%
add_recipe(members_recipe) -> wrong_workflow
# This is another WRONG way. This causes an error message.
workflow() %>%
add_model(spec = members_spec,
formula = died ~ year + season + sex + age + (1|peak_id)) %>%
add_variables(outcomes = died,
predictors = c(year, season, sex, age, peak_id)) %>%
add_recipe(members_recipe)
# Ideally, the workflow object here includes a recipe. But can we make it?
set.seed(777)
fit_resamples(object = wrong_workflow, # This workflow object needs to include a recipe
resamples = members_folds,
control = control_resamples(save_pred = TRUE, save_workflow = TRUE)) -> members_res
Related
I am running a Stochastic Frontier model (using the package frontier) by the group industry as follows:
data is a panel data frame with index year and individual id and columns as below:
Columns: y1, x1, x2and x3 are all numerical variables. industry is a character variable.
library(dplyr)
library(frontier)
sfa_out <- data %>%
group_by(industry) %>%
do(
mod <- sfa(log(y1) ~ log(x1) + log(x2) + log(x3),
ineffDecrease = T,
truncNorm = F,
timeEffect = T,
data = .))
I want mod to store the output of the industry-group SFA estimated models. I don't think SFA-specific knowledge is required here. Thanks.
This did the trick:
library(dplyr)
library(frontier)
library(plm)
sfa_out <- data %>%
group_by(industry) %>%
do(
mod = sfa(log(y1) ~ log(x1) + log(x2) + log(x3),
ineffDecrease = T,
truncNorm = F,
timeEffect = T,
data = pdata.frame(., index = c("individual id", "year"))))
To then display each model you could do this:
# to display all industry models
sfa_out$mod
# to display specific industry model
sfa$mod[[1]]
sfa$mod[[2]]
.
.
.
sfa$mod[[n]]
# to get the estimated efficiency measure for nth model in sfa_out$mod
efficiencies(sfa_out$mod$[[n]])
I am very new to in bayesian analysis and I was trying to practice with an example from tidytuesday (https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-03-29/sports.csv)
I have set my model but when trying to run it the following error message appears:
Error in jags.model(textConnection(jags.script_with), data = dataset, :
RUNTIME ERROR:
Compilation error on line 5.
Dimension mismatch in subset expression of y
Below my approach:
Sports2 =
bind_rows(
sports_clean %>%
select(year, institution_name, sports,
participants = partic_women,
revenue = rev_women,
expenditure = exp_women) %>%
mutate(gender=1), #women
sports_clean %>%
select(year, institution_name, sports,
participants = partic_men,
revenue = rev_men,
expenditure = exp_men) %>%
mutate(gender=0) #men
) %>% na.omit
An example row of the dataset:
Year
institution_name
sports
participants
revenue
expenditure
gender
2015
Alabama A&M Uni
Soccer
21
410717
432648
1
#modeling with regression
set.seed(123)
model_with =
lm(expenditure ~ gender + participants, data=Sports2)
model_with
#dataset for jags model
dataset = list(x=Sports2[,c(4,7)], y=Sports2[,6], n=nrow(Sports2))
#estimation coefficients
dataset$b_guess = model_with$coefficients
#Model
jags.script_with =
"
model{
#likelihood
for (i in 1:n){
y[i] ~ dnorm(mu[i], tau)
mu[i] = intercept + participants*x[i,1]
}
#prioirs
intercept ~ dnorm(bgues[1], 0.1)
participants ~ dnorm(b_guess[2], 0.1)
tau ~ dgamma(0.01,0.01)
#transform
sigma = 1/sqrt(tau)
}
"
#compiling
mod_with = jags.model(textConnection(jags.script_with),
data = dataset,
n.chains = 4, n.adapt = 2000)
I can't figure out how to resolve the issue.
Looking for advice, please.
Thank you in advance!
Edit:
I have removed all (for the model) "unnecessary" parts. This is now the corrected code - unfortunately I cannot figure out why the error still persists.
Compiling model graph
Resolving undeclared variables
Deleting model
Error in jags.model(textConnection(jags.script_with), data = dataset, :
RUNTIME ERROR:
Compilation error on line 5.
Dimension mismatch in subset expression of y
library(rjags)
library(tidyverse)
library(ggplot2)
sports_raw = read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-03-29/sports.csv')
Sports2 =
bind_rows(
sports_raw %>%
select(year, institution_name, sports,
participants = partic_women,
revenue = rev_women,
expenditure = exp_women) %>%
mutate(gender=1), #women
sports_raw %>%
select(year, institution_name, sports,
participants = partic_men,
revenue = rev_men,
expenditure = exp_men) %>%
mutate(gender=0) #men
) %>% na.omit
#modeling with regression
set.seed(123)
model_with =
lm(expenditure ~ gender + participants, data=Sports2)
model_with
#dataset for jags model
dataset = list(x=Sports2[,c(4,7)], y=Sports2[,6], n=nrow(Sports2))
#estimation coefficients
dataset$b_guess = model_with$coefficients
#Model
jags.script_with =
"
model{
#likelihood
for (i in 1:n){
y[i] ~ dnorm(mu[i], tau)
mu[i] = intercept + participants*x[i,1]
}
#prioirs
intercept ~ dnorm(b_guess[1], 0.1)
participants ~ dnorm(b_guess[2], 0.1)
tau ~ dgamma(0.01,0.01)
#transform
sigma = 1/sqrt(tau)
}
"
#compiling
mod_with = jags.model(textConnection(jags.script_with),
data = dataset,
n.chains = 4, n.adapt = 2000)
The problem was that in your original code, you're subsetting a tibble using the [ and unlike in a regular data frame, where it would turn that single column into a vector, the tibble remains a tibble with one variable. The error really states that instead of being a vector as you intend in the model code, the y variable is actually a one-column data frame, which JAGS treats differently from a vector.
library(rjags)
library(tidyverse)
library(ggplot2)
sports_raw = read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-03-29/sports.csv')
Sports2 =
bind_rows(
sports_raw %>%
select(year, institution_name, sports,
participants = partic_women,
revenue = rev_women,
expenditure = exp_women) %>%
mutate(gender=1), #women
sports_raw %>%
select(year, institution_name, sports,
participants = partic_men,
revenue = rev_men,
expenditure = exp_men) %>%
mutate(gender=0) #men
) %>% na.omit
#modeling with regression
set.seed(123)
model_with =
lm(expenditure ~ gender + participants, data=Sports2)
model_with
#dataset for jags model
dataset = list(x=Sports2[,c(4,7)], y=Sports2[,6], n=nrow(Sports2))
dim(dataset$y)
[1] 130748 1
There are two ways to fix this, you can make y a vector in the data:
dataset = list(x=Sports2[,c(4,7)],
y=Sports2$expenditure, n=nrow(Sports2))
dim(dataset$y)
# NULL
length(dataset$y)
# [1] 130748
Or, you could change the likelihood part of your model to acknowledge that y is a one-column matrix:
y[i,1] ~ dnorm(mu[i], tau)
The rest of the model could stay as it is. Do one or the other of these (though not both at the same time) and your model will run.
I am using below code to build and predict model using tidymodels. I am fairly new to tidymodels, so may be I am totally wrong in my approach. But here is what the problem is.
When input datatype for test dataset is different from train, I am getting this error. Otherwise, the code works fine(In cases where train and test data structures are identical). I am assuming that the preprocessing step should have tackled this while processing test data.
If anyone knows/encountered this problem. Please let me know the possible solution.
I search for this issue, but haven't found anything of this sort.
Thanks for looking into it.
Code:
library(tidymodels)
library(dplyr)
mt1 <- mtcars ## assume this is the train data
mt2 <- mtcars ## assume this is the test data
mt2$mpg <- as.character(mt2$mpg) ## just forcing them to be character to reproduce the problem in my actual data
mt2$qsec <- as.character(mt2$qsec)
dp_pipe <- recipe(am ~ .,data=mt1) %>%
update_role(cyl,vs,new_role = "drop_vars") %>%
update_role(mpg,
disp,
drat,wt, qsec, new_role="to_numeric") %>%
step_rm(has_role("drop_vars")) %>%
step_mutate_at(has_role(match = "to_numeric"),fn = as.numeric)
# Cross folds
folds = vfold_cv(mt1, v = 10)
# define parameter grid to be tuned
my_grid = tibble(penalty = 10^seq(-2, -1, length.out = 10))
# define lasso model
lasso_mod = linear_reg(mode = "regression",
penalty = tune(),
mixture = 1) %>%
set_engine("glmnet")
# add everything to a workflow
wf = workflow() %>%
add_model(lasso_mod) %>%
add_recipe(dp_pipe)
# tune the workflow
my_res <- wf %>%
tune_grid(resamples = folds,
grid = my_grid,
control = control_grid(verbose = FALSE, save_pred = TRUE),
metrics = metric_set(rmse))
best_mod = my_res %>% select_best("rmse")
best_mod
final_fitted = finalize_workflow(wf, best_mod) %>% fit(data=mt1)
# predicted for train
final_fitted %>%
predict(mt1)
final_fitted %>%
predict(mt2)
Error at my end:
> Error: ! Can't convert `data$mpg` <character> to match type of `mpg`
> <double>. Run `rlang::last_error()` to see where the error occurred.
I have a sample data set with number of sales for 1000 different products in 13 countries over 3 years:
13 Countries = (US, China, UK…)
1000 Products = (Prod A, B, C …)
Number of Sales --> my dependent variable.
Number of Clicks and 3 more variables as independent variable.
I've coded a regression model and it works. In the next step I would like to do a rolling regression: How do I have to adapt the code for that?
Thanks for your help and many greetings! :)
# prepare data
nest_dt = raw_data %>%
group_by(product, country) %>%
nest()
# function
lm_function = function(data, formula) {
lm(formula = formula,
data = data)
}
# regression
lm_data = nest_data %>%
mutate(lm = map(
.x = data,
.f = lm_function,
formula = sales ~ clicks + needs + guesses + interests
))
# show solution
solution_data = lm_data %>%
mutate(solution = map(lm, sw_tidy)) %>%
unnest(solution) %>%
select(-data, -lm)
# where to put?!?
rollapply(lm_function, width=10, roll)
You could try the roll_lm function from the roll package. See the description here: Package ‘roll’ on Cran
broom::augment outputs only columns from data that are used in formula. This is problematic behavior, because being able to find something like a respondent id can be very helpful at times. Using the newdata argument could be a workaround, but it still doesn't provide a fix when working with nested data.
Additional explanations in-line:
#simulated glm data
glmdata = data.frame(ID=1:100, A=rnorm(100), B=rnorm(100)) %>% mutate(response=rbinom(length(ID),1,1/(1+exp(-2*A-3*B)) ))
#fit model, not including the ID variable
glmfit = glm(response~A+B, glmdata,family='binomial')
#ID variable is contained in glm$data
str(glmfit$data)
#works!
head(glmfit$data$ID)
#use broom::augment
augmented = glmfit %>% augment
#does not work, wth broom?!
augmented$ID
#ok ... I could use the newdata argument
augmented = glmfit %>% augment(newdata=glmdata)
augmented$ID
#however, that is a hacky workaround ....
#... and it does not fix the following scenario:
#Let's say I want to use nest
#simulated glm data
glmdata1 = data.frame(segm=1,ID=1:100, A=rnorm(100), B=rnorm(100)) %>% mutate(response=rbinom(length(ID),1,1/(1+exp(-2*A-3*B)) ))
glmdata2 = data.frame(segm=2,ID=1:100, A=rnorm(100), B=rnorm(100)) %>% mutate(response=rbinom(length(ID),1,1/(1+exp(-3*A-2*B)) ))
glmdata_nest = rbind(glmdata1,glmdata2) %>% group_by(segm) %>% nest
#fit the two models via map
glmfit_nest= glmdata_nest %>% mutate(model=map(data, glm, formula=response~A+B, family='binomial') )
#run augment via map
glmfit_nest_augmented = glmfit_nest %>% mutate(augmented = map(model,augment))
#ID is not here ...
glmfit_nest_augmented$augmented$ID
#ok, so then we have to use map2 ....
glmfit_nest_augmented = glmfit_nest %>% mutate(augmented = map2(model,data,augment,newdata=.y))
#but even this doesn't work
#also, trying to recycling glm$data does not work
glmfit_nest_augmented = glmfit_nest %>% mutate(augmented = map(model,augment,newdata=.$data))
Update:
broom developers deliberately choose this inconsistent behavior
https://github.com/tidymodels/broom/issues/753
Here the .x, and .y goes with the anonymous function call with ~
glmfit_nest_augmented <- glmfit_nest %>%
mutate(augmented = map2(model,data,~ augment(.x, newdata=.y))