How to run GLMER with SMOTE and resampling - r

I want to create GLMER models with the SMOTE algorithm and resampling. This means that I need to create a recipe with step_smote() and use fit_resamples(). After reading this post, and this post, I learned that I need to use add_model() and add_variable() in order to create a workflow object, and that add_recipe() cannot be used in this process. This means I cannot use step_smote(). The first link addresses this limitation. Now the question is how I can use the SMOTE algorithm in this situation. In addition, I want to use a few other step_***() functions such as step_dummy() and step_rm(). In short, I want to use a recipe and cross validation in GLMER modeling processes. Is there any way to make this happen? I decided to use the example in this post so that everyone can reproduce the same situation. Note that solutions do not have to be in tidymodels ways.
library(tidyverse)
library(tidymodels)
library(themis)
# I use data from https://juliasilge.com/blog/himalayan-climbing/
members <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/members.csv")
# These are the same processes from the post above.
members_df <- members %>%
filter(season != "Unknown", !is.na(sex), !is.na(citizenship)) %>%
select(peak_id, year, season, sex, age, citizenship, hired, success, died) %>%
mutate(died = case_when(died ~ "died", TRUE ~ "survived")) %>%
mutate_if(is.character, factor) %>%
mutate_if(is.logical, as.integer)
# Since the dataset is large, I subset it here for this question.
set.seed(386)
bind_rows(filter(members_df, died == "died"),
filter(members_df, died == "survived") %>% sample_n(2000)) -> members_df
# Create data sets
set.seed(123)
members_split <- initial_split(members_df, strata = died)
members_train <- training(members_split)
members_test <- testing(members_split)
# Create CV folds
set.seed(123)
members_folds <- vfold_cv(members_train, v = 10, repeats = 5, strata = died)
# Here I wanna use glmer. Create model specification
logistic_reg() %>%
set_engine("glmer", family = binomial,
control = glmerControl(optimizer = "bobyqa",
optCtrl = list(maxfun = 2e5))) %>%
set_mode("classification") -> members_spec
# As far as I learned from these links, I cannot use a recipe in order to run GLMER. But I create a recipe, anyway.
# The key is that I want to apply SMOTE algorithm to glmer modeling in tidymodels ways.
recipe(died ~ ., data = members_train) %>%
step_rm(citizenship) %>%
step_impute_median(age) %>%
step_other(peak_id) %>%
step_dummy(all_nominal(), -died) %>%
step_smote(died) -> members_recipe
# This works as long as users do not have to use the recipe package.
# This is addressed in https://github.com/tidymodels/multilevelmod/issues/4
workflow() %>%
add_model(spec = members_spec,
formula = died ~ year + season + sex + age + (1|peak_id)) %>%
add_variables(outcomes = died,
predictors = c(year, season, sex, age, peak_id)) -> members_workflow
# This is the WRONG way.
workflow() %>%
add_model(spec = members_spec,
formula = died ~ year + season + sex + age + (1|peak_id)) %>%
add_recipe(members_recipe) -> wrong_workflow
# This is another WRONG way. This causes an error message.
workflow() %>%
add_model(spec = members_spec,
formula = died ~ year + season + sex + age + (1|peak_id)) %>%
add_variables(outcomes = died,
predictors = c(year, season, sex, age, peak_id)) %>%
add_recipe(members_recipe)
# Ideally, the workflow object here includes a recipe. But can we make it?
set.seed(777)
fit_resamples(object = wrong_workflow, # This workflow object needs to include a recipe
resamples = members_folds,
control = control_resamples(save_pred = TRUE, save_workflow = TRUE)) -> members_res

Related

Run several SFA models by a group and store the output in a list

I am running a Stochastic Frontier model (using the package frontier) by the group industry as follows:
data is a panel data frame with index year and individual id and columns as below:
Columns: y1, x1, x2and x3 are all numerical variables. industry is a character variable.
library(dplyr)
library(frontier)
sfa_out <- data %>%
group_by(industry) %>%
do(
mod <- sfa(log(y1) ~ log(x1) + log(x2) + log(x3),
ineffDecrease = T,
truncNorm = F,
timeEffect = T,
data = .))
I want mod to store the output of the industry-group SFA estimated models. I don't think SFA-specific knowledge is required here. Thanks.
This did the trick:
library(dplyr)
library(frontier)
library(plm)
sfa_out <- data %>%
group_by(industry) %>%
do(
mod = sfa(log(y1) ~ log(x1) + log(x2) + log(x3),
ineffDecrease = T,
truncNorm = F,
timeEffect = T,
data = pdata.frame(., index = c("individual id", "year"))))
To then display each model you could do this:
# to display all industry models
sfa_out$mod
# to display specific industry model
sfa$mod[[1]]
sfa$mod[[2]]
.
.
.
sfa$mod[[n]]
# to get the estimated efficiency measure for nth model in sfa_out$mod
efficiencies(sfa_out$mod$[[n]])

Dimension mismatch in subset expression in JAGS

I am very new to in bayesian analysis and I was trying to practice with an example from tidytuesday (https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-03-29/sports.csv)
I have set my model but when trying to run it the following error message appears:
Error in jags.model(textConnection(jags.script_with), data = dataset, :
RUNTIME ERROR:
Compilation error on line 5.
Dimension mismatch in subset expression of y
Below my approach:
Sports2 =
bind_rows(
sports_clean %>%
select(year, institution_name, sports,
participants = partic_women,
revenue = rev_women,
expenditure = exp_women) %>%
mutate(gender=1), #women
sports_clean %>%
select(year, institution_name, sports,
participants = partic_men,
revenue = rev_men,
expenditure = exp_men) %>%
mutate(gender=0) #men
) %>% na.omit
An example row of the dataset:
Year
institution_name
sports
participants
revenue
expenditure
gender
2015
Alabama A&M Uni
Soccer
21
410717
432648
1
#modeling with regression
set.seed(123)
model_with =
lm(expenditure ~ gender + participants, data=Sports2)
model_with
#dataset for jags model
dataset = list(x=Sports2[,c(4,7)], y=Sports2[,6], n=nrow(Sports2))
#estimation coefficients
dataset$b_guess = model_with$coefficients
#Model
jags.script_with =
"
model{
#likelihood
for (i in 1:n){
y[i] ~ dnorm(mu[i], tau)
mu[i] = intercept + participants*x[i,1]
}
#prioirs
intercept ~ dnorm(bgues[1], 0.1)
participants ~ dnorm(b_guess[2], 0.1)
tau ~ dgamma(0.01,0.01)
#transform
sigma = 1/sqrt(tau)
}
"
#compiling
mod_with = jags.model(textConnection(jags.script_with),
data = dataset,
n.chains = 4, n.adapt = 2000)
I can't figure out how to resolve the issue.
Looking for advice, please.
Thank you in advance!
Edit:
I have removed all (for the model) "unnecessary" parts. This is now the corrected code - unfortunately I cannot figure out why the error still persists.
Compiling model graph
Resolving undeclared variables
Deleting model
Error in jags.model(textConnection(jags.script_with), data = dataset, :
RUNTIME ERROR:
Compilation error on line 5.
Dimension mismatch in subset expression of y
library(rjags)
library(tidyverse)
library(ggplot2)
sports_raw = read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-03-29/sports.csv')
Sports2 =
bind_rows(
sports_raw %>%
select(year, institution_name, sports,
participants = partic_women,
revenue = rev_women,
expenditure = exp_women) %>%
mutate(gender=1), #women
sports_raw %>%
select(year, institution_name, sports,
participants = partic_men,
revenue = rev_men,
expenditure = exp_men) %>%
mutate(gender=0) #men
) %>% na.omit
#modeling with regression
set.seed(123)
model_with =
lm(expenditure ~ gender + participants, data=Sports2)
model_with
#dataset for jags model
dataset = list(x=Sports2[,c(4,7)], y=Sports2[,6], n=nrow(Sports2))
#estimation coefficients
dataset$b_guess = model_with$coefficients
#Model
jags.script_with =
"
model{
#likelihood
for (i in 1:n){
y[i] ~ dnorm(mu[i], tau)
mu[i] = intercept + participants*x[i,1]
}
#prioirs
intercept ~ dnorm(b_guess[1], 0.1)
participants ~ dnorm(b_guess[2], 0.1)
tau ~ dgamma(0.01,0.01)
#transform
sigma = 1/sqrt(tau)
}
"
#compiling
mod_with = jags.model(textConnection(jags.script_with),
data = dataset,
n.chains = 4, n.adapt = 2000)
The problem was that in your original code, you're subsetting a tibble using the [ and unlike in a regular data frame, where it would turn that single column into a vector, the tibble remains a tibble with one variable. The error really states that instead of being a vector as you intend in the model code, the y variable is actually a one-column data frame, which JAGS treats differently from a vector.
library(rjags)
library(tidyverse)
library(ggplot2)
sports_raw = read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-03-29/sports.csv')
Sports2 =
bind_rows(
sports_raw %>%
select(year, institution_name, sports,
participants = partic_women,
revenue = rev_women,
expenditure = exp_women) %>%
mutate(gender=1), #women
sports_raw %>%
select(year, institution_name, sports,
participants = partic_men,
revenue = rev_men,
expenditure = exp_men) %>%
mutate(gender=0) #men
) %>% na.omit
#modeling with regression
set.seed(123)
model_with =
lm(expenditure ~ gender + participants, data=Sports2)
model_with
#dataset for jags model
dataset = list(x=Sports2[,c(4,7)], y=Sports2[,6], n=nrow(Sports2))
dim(dataset$y)
[1] 130748 1
There are two ways to fix this, you can make y a vector in the data:
dataset = list(x=Sports2[,c(4,7)],
y=Sports2$expenditure, n=nrow(Sports2))
dim(dataset$y)
# NULL
length(dataset$y)
# [1] 130748
Or, you could change the likelihood part of your model to acknowledge that y is a one-column matrix:
y[i,1] ~ dnorm(mu[i], tau)
The rest of the model could stay as it is. Do one or the other of these (though not both at the same time) and your model will run.

Tidymodels Predict Error in R while predict on test

I am using below code to build and predict model using tidymodels. I am fairly new to tidymodels, so may be I am totally wrong in my approach. But here is what the problem is.
When input datatype for test dataset is different from train, I am getting this error. Otherwise, the code works fine(In cases where train and test data structures are identical). I am assuming that the preprocessing step should have tackled this while processing test data.
If anyone knows/encountered this problem. Please let me know the possible solution.
I search for this issue, but haven't found anything of this sort.
Thanks for looking into it.
Code:
library(tidymodels)
library(dplyr)
mt1 <- mtcars ## assume this is the train data
mt2 <- mtcars ## assume this is the test data
mt2$mpg <- as.character(mt2$mpg) ## just forcing them to be character to reproduce the problem in my actual data
mt2$qsec <- as.character(mt2$qsec)
dp_pipe <- recipe(am ~ .,data=mt1) %>%
update_role(cyl,vs,new_role = "drop_vars") %>%
update_role(mpg,
disp,
drat,wt, qsec, new_role="to_numeric") %>%
step_rm(has_role("drop_vars")) %>%
step_mutate_at(has_role(match = "to_numeric"),fn = as.numeric)
# Cross folds
folds = vfold_cv(mt1, v = 10)
# define parameter grid to be tuned
my_grid = tibble(penalty = 10^seq(-2, -1, length.out = 10))
# define lasso model
lasso_mod = linear_reg(mode = "regression",
penalty = tune(),
mixture = 1) %>%
set_engine("glmnet")
# add everything to a workflow
wf = workflow() %>%
add_model(lasso_mod) %>%
add_recipe(dp_pipe)
# tune the workflow
my_res <- wf %>%
tune_grid(resamples = folds,
grid = my_grid,
control = control_grid(verbose = FALSE, save_pred = TRUE),
metrics = metric_set(rmse))
best_mod = my_res %>% select_best("rmse")
best_mod
final_fitted = finalize_workflow(wf, best_mod) %>% fit(data=mt1)
# predicted for train
final_fitted %>%
predict(mt1)
final_fitted %>%
predict(mt2)
Error at my end:
> Error: ! Can't convert `data$mpg` <character> to match type of `mpg`
> <double>. Run `rlang::last_error()` to see where the error occurred.

Change normal regression model to rolling regression

I have a sample data set with number of sales for 1000 different products in 13 countries over 3 years:
13 Countries = (US, China, UK…)
1000 Products = (Prod A, B, C …)
Number of Sales --> my dependent variable.
Number of Clicks and 3 more variables as independent variable.
I've coded a regression model and it works. In the next step I would like to do a rolling regression: How do I have to adapt the code for that?
Thanks for your help and many greetings! :)
# prepare data
nest_dt = raw_data %>%
group_by(product, country) %>%
nest()
# function
lm_function = function(data, formula) {
lm(formula = formula,
data = data)
}
# regression
lm_data = nest_data %>%
mutate(lm = map(
.x = data,
.f = lm_function,
formula = sales ~ clicks + needs + guesses + interests
))
# show solution
solution_data = lm_data %>%
mutate(solution = map(lm, sw_tidy)) %>%
unnest(solution) %>%
select(-data, -lm)
# where to put?!?
rollapply(lm_function, width=10, roll)
You could try the roll_lm function from the roll package. See the description here: Package ‘roll’ on Cran

broom::augment omits columns from data

broom::augment outputs only columns from data that are used in formula. This is problematic behavior, because being able to find something like a respondent id can be very helpful at times. Using the newdata argument could be a workaround, but it still doesn't provide a fix when working with nested data.
Additional explanations in-line:
#simulated glm data
glmdata = data.frame(ID=1:100, A=rnorm(100), B=rnorm(100)) %>% mutate(response=rbinom(length(ID),1,1/(1+exp(-2*A-3*B)) ))
#fit model, not including the ID variable
glmfit = glm(response~A+B, glmdata,family='binomial')
#ID variable is contained in glm$data
str(glmfit$data)
#works!
head(glmfit$data$ID)
#use broom::augment
augmented = glmfit %>% augment
#does not work, wth broom?!
augmented$ID
#ok ... I could use the newdata argument
augmented = glmfit %>% augment(newdata=glmdata)
augmented$ID
#however, that is a hacky workaround ....
#... and it does not fix the following scenario:
#Let's say I want to use nest
#simulated glm data
glmdata1 = data.frame(segm=1,ID=1:100, A=rnorm(100), B=rnorm(100)) %>% mutate(response=rbinom(length(ID),1,1/(1+exp(-2*A-3*B)) ))
glmdata2 = data.frame(segm=2,ID=1:100, A=rnorm(100), B=rnorm(100)) %>% mutate(response=rbinom(length(ID),1,1/(1+exp(-3*A-2*B)) ))
glmdata_nest = rbind(glmdata1,glmdata2) %>% group_by(segm) %>% nest
#fit the two models via map
glmfit_nest= glmdata_nest %>% mutate(model=map(data, glm, formula=response~A+B, family='binomial') )
#run augment via map
glmfit_nest_augmented = glmfit_nest %>% mutate(augmented = map(model,augment))
#ID is not here ...
glmfit_nest_augmented$augmented$ID
#ok, so then we have to use map2 ....
glmfit_nest_augmented = glmfit_nest %>% mutate(augmented = map2(model,data,augment,newdata=.y))
#but even this doesn't work
#also, trying to recycling glm$data does not work
glmfit_nest_augmented = glmfit_nest %>% mutate(augmented = map(model,augment,newdata=.$data))
Update:
broom developers deliberately choose this inconsistent behavior
https://github.com/tidymodels/broom/issues/753
Here the .x, and .y goes with the anonymous function call with ~
glmfit_nest_augmented <- glmfit_nest %>%
mutate(augmented = map2(model,data,~ augment(.x, newdata=.y))

Resources