Dimension mismatch in subset expression in JAGS - r

I am very new to Bayesian analysis and I was trying to practice with an example from TidyTuesday (https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-03-29/sports.csv).
I have set up my model, but when I try to run it the following error message appears:
Error in jags.model(textConnection(jags.script_with), data = dataset, :
RUNTIME ERROR:
Compilation error on line 5.
Dimension mismatch in subset expression of y
Below is my approach:
Sports2 =
  bind_rows(
    sports_clean %>%
      select(year, institution_name, sports,
             participants = partic_women,
             revenue = rev_women,
             expenditure = exp_women) %>%
      mutate(gender = 1),  # women
    sports_clean %>%
      select(year, institution_name, sports,
             participants = partic_men,
             revenue = rev_men,
             expenditure = exp_men) %>%
      mutate(gender = 0)   # men
  ) %>% na.omit
An example row of the dataset:

year   institution_name   sports   participants   revenue   expenditure   gender
2015   Alabama A&M Uni    Soccer   21             410717    432648        1
#modeling with regression
set.seed(123)
model_with =
lm(expenditure ~ gender + participants, data=Sports2)
model_with
#dataset for jags model
dataset = list(x=Sports2[,c(4,7)], y=Sports2[,6], n=nrow(Sports2))
#estimation coefficients
dataset$b_guess = model_with$coefficients
#Model
jags.script_with =
  "
  model{
    # likelihood
    for (i in 1:n){
      y[i] ~ dnorm(mu[i], tau)
      mu[i] = intercept + participants * x[i,1]
    }
    # priors
    intercept ~ dnorm(bgues[1], 0.1)
    participants ~ dnorm(b_guess[2], 0.1)
    tau ~ dgamma(0.01, 0.01)
    # transform
    sigma = 1/sqrt(tau)
  }
  "
#compiling
mod_with = jags.model(textConnection(jags.script_with),
data = dataset,
n.chains = 4, n.adapt = 2000)
I can't figure out how to resolve the issue.
Looking for advice, please.
Thank you in advance!
Edit:
I have removed all parts that are unnecessary for the model. This is now the corrected code; unfortunately, I cannot figure out why the error still persists.
Compiling model graph
Resolving undeclared variables
Deleting model
Error in jags.model(textConnection(jags.script_with), data = dataset, :
RUNTIME ERROR:
Compilation error on line 5.
Dimension mismatch in subset expression of y
library(rjags)
library(tidyverse)
library(ggplot2)
sports_raw = read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-03-29/sports.csv')
Sports2 =
  bind_rows(
    sports_raw %>%
      select(year, institution_name, sports,
             participants = partic_women,
             revenue = rev_women,
             expenditure = exp_women) %>%
      mutate(gender = 1),  # women
    sports_raw %>%
      select(year, institution_name, sports,
             participants = partic_men,
             revenue = rev_men,
             expenditure = exp_men) %>%
      mutate(gender = 0)   # men
  ) %>% na.omit
#modeling with regression
set.seed(123)
model_with =
lm(expenditure ~ gender + participants, data=Sports2)
model_with
#dataset for jags model
dataset = list(x=Sports2[,c(4,7)], y=Sports2[,6], n=nrow(Sports2))
#estimation coefficients
dataset$b_guess = model_with$coefficients
#Model
jags.script_with =
  "
  model{
    # likelihood
    for (i in 1:n){
      y[i] ~ dnorm(mu[i], tau)
      mu[i] = intercept + participants * x[i,1]
    }
    # priors
    intercept ~ dnorm(b_guess[1], 0.1)
    participants ~ dnorm(b_guess[2], 0.1)
    tau ~ dgamma(0.01, 0.01)
    # transform
    sigma = 1/sqrt(tau)
  }
  "
#compiling
mod_with = jags.model(textConnection(jags.script_with),
data = dataset,
n.chains = 4, n.adapt = 2000)

The problem is that in your original code you are subsetting a tibble with [. Unlike a regular data frame, where selecting a single column this way drops it to a plain vector, a tibble stays a tibble with one column. The error really says that instead of being the vector your model code expects, y is a one-column data frame, which JAGS treats differently from a vector.
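To see the difference in isolation, compare single-bracket subsetting on a plain data frame and on a tibble (a toy example, not the sports data):

# single-bracket extraction of one column behaves differently
df  <- data.frame(a = 1:3)
tbl <- tibble::tibble(a = 1:3)

df[, "a"]   # a plain vector: 1 2 3
tbl[, "a"]  # still a 1-column tibble, not a vector
tbl$a       # or tbl[["a"]]: a plain vector again

The same thing happens with Sports2[, 6] in your code, as the dim() check below shows.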
library(rjags)
library(tidyverse)
library(ggplot2)
sports_raw = read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-03-29/sports.csv')
Sports2 =
  bind_rows(
    sports_raw %>%
      select(year, institution_name, sports,
             participants = partic_women,
             revenue = rev_women,
             expenditure = exp_women) %>%
      mutate(gender = 1),  # women
    sports_raw %>%
      select(year, institution_name, sports,
             participants = partic_men,
             revenue = rev_men,
             expenditure = exp_men) %>%
      mutate(gender = 0)   # men
  ) %>% na.omit
#modeling with regression
set.seed(123)
model_with =
lm(expenditure ~ gender + participants, data=Sports2)
model_with
#dataset for jags model
dataset = list(x=Sports2[,c(4,7)], y=Sports2[,6], n=nrow(Sports2))
dim(dataset$y)
# [1] 130748      1
There are two ways to fix this. You can make y a vector in the data:
dataset = list(x=Sports2[,c(4,7)],
y=Sports2$expenditure, n=nrow(Sports2))
dim(dataset$y)
# NULL
length(dataset$y)
# [1] 130748
Or, you could change the likelihood part of your model to acknowledge that y is a one-column matrix:
y[i,1] ~ dnorm(mu[i], tau)
The rest of the model could stay as it is. Do one or the other of these (though not both at the same time) and your model will run.
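For completeness, here is a minimal sketch of the first option end to end (the as.matrix() call on x and the burn-in/n.iter values are just my choices, not required by the fix):

dataset = list(x = as.matrix(Sports2[, c(4, 7)]),  # participants, gender
               y = Sports2$expenditure,            # now a plain vector
               n = nrow(Sports2))
dataset$b_guess = model_with$coefficients

mod_with = jags.model(textConnection(jags.script_with),
                      data = dataset,
                      n.chains = 4, n.adapt = 2000)
update(mod_with, 1000)  # burn-in
samples = coda.samples(mod_with,
                       variable.names = c("intercept", "participants", "sigma"),
                       n.iter = 5000)
summary(samples)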

Related

How to run GLMER with SMOTE and resampling

I want to create GLMER models with the SMOTE algorithm and resampling. This means that I need to create a recipe with step_smote() and use fit_resamples(). After reading this post and this post, I learned that I need to use add_model() and add_variables() to create a workflow object, and that add_recipe() cannot be used in this process, which means I cannot use step_smote(). The first link addresses this limitation. The question now is how I can use the SMOTE algorithm in this situation. In addition, I want to use a few other step_***() functions such as step_dummy() and step_rm(). In short, I want to use a recipe and cross-validation in a GLMER modeling process. Is there any way to make this happen? I decided to use the example in this post so that everyone can reproduce the same situation. Note that solutions do not have to follow the tidymodels approach.
library(tidyverse)
library(tidymodels)
library(themis)
# I use data from https://juliasilge.com/blog/himalayan-climbing/
members <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/members.csv")
# These are the same processes from the post above.
members_df <- members %>%
filter(season != "Unknown", !is.na(sex), !is.na(citizenship)) %>%
select(peak_id, year, season, sex, age, citizenship, hired, success, died) %>%
mutate(died = case_when(died ~ "died", TRUE ~ "survived")) %>%
mutate_if(is.character, factor) %>%
mutate_if(is.logical, as.integer)
# Since the dataset is large, I subset it here for this question.
set.seed(386)
bind_rows(filter(members_df, died == "died"),
filter(members_df, died == "survived") %>% sample_n(2000)) -> members_df
# Create data sets
set.seed(123)
members_split <- initial_split(members_df, strata = died)
members_train <- training(members_split)
members_test <- testing(members_split)
# Create CV folds
set.seed(123)
members_folds <- vfold_cv(members_train, v = 10, repeats = 5, strata = died)
# Here I wanna use glmer. Create model specification
logistic_reg() %>%
set_engine("glmer", family = binomial,
control = glmerControl(optimizer = "bobyqa",
optCtrl = list(maxfun = 2e5))) %>%
set_mode("classification") -> members_spec
# As far as I learned from these links, I cannot use a recipe in order to run GLMER. But I create a recipe, anyway.
# The key is that I want to apply SMOTE algorithm to glmer modeling in tidymodels ways.
recipe(died ~ ., data = members_train) %>%
step_rm(citizenship) %>%
step_impute_median(age) %>%
step_other(peak_id) %>%
step_dummy(all_nominal(), -died) %>%
step_smote(died) -> members_recipe
# This works as long as users do not have to use the recipe package.
# This is addressed in https://github.com/tidymodels/multilevelmod/issues/4
workflow() %>%
add_model(spec = members_spec,
formula = died ~ year + season + sex + age + (1|peak_id)) %>%
add_variables(outcomes = died,
predictors = c(year, season, sex, age, peak_id)) -> members_workflow
# This is the WRONG way.
workflow() %>%
add_model(spec = members_spec,
formula = died ~ year + season + sex + age + (1|peak_id)) %>%
add_recipe(members_recipe) -> wrong_workflow
# This is another WRONG way. This causes an error message.
workflow() %>%
add_model(spec = members_spec,
formula = died ~ year + season + sex + age + (1|peak_id)) %>%
add_variables(outcomes = died,
predictors = c(year, season, sex, age, peak_id)) %>%
add_recipe(members_recipe)
# Ideally, the workflow object here includes a recipe. But can we make it?
set.seed(777)
fit_resamples(object = wrong_workflow, # This workflow object needs to include a recipe
resamples = members_folds,
control = control_resamples(save_pred = TRUE, save_workflow = TRUE)) -> members_res

Run several SFA models by a group and store the output in a list

I am running a stochastic frontier analysis (SFA) model (using the frontier package) by industry group as follows:
data is a panel data frame with index variables year and individual id, and the columns below:
y1, x1, x2, and x3 are all numeric variables; industry is a character variable.
library(dplyr)
library(frontier)
sfa_out <- data %>%
group_by(industry) %>%
do(
mod <- sfa(log(y1) ~ log(x1) + log(x2) + log(x3),
ineffDecrease = T,
truncNorm = F,
timeEffect = T,
data = .))
I want mod to store the fitted SFA model for each industry group. I don't think SFA-specific knowledge is required here. Thanks.
This did the trick:
library(dplyr)
library(frontier)
library(plm)
sfa_out <- data %>%
group_by(industry) %>%
do(
mod = sfa(log(y1) ~ log(x1) + log(x2) + log(x3),
ineffDecrease = T,
truncNorm = F,
timeEffect = T,
data = pdata.frame(., index = c("individual id", "year"))))
To then display each model you could do this:
# to display all industry models
sfa_out$mod
# to display a specific industry model
sfa_out$mod[[1]]
sfa_out$mod[[2]]
...
sfa_out$mod[[n]]
# to get the estimated efficiency measures for the nth model in sfa_out$mod
efficiencies(sfa_out$mod[[n]])
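For reference, a more current way to write the same thing without do() (a sketch assuming the same data and column names as above) is to nest by industry and map sfa() over the groups:

library(tidyr)
library(purrr)

sfa_out <- data %>%
  group_by(industry) %>%
  nest() %>%
  mutate(mod = map(data, ~ sfa(log(y1) ~ log(x1) + log(x2) + log(x3),
                               ineffDecrease = TRUE,
                               truncNorm = FALSE,
                               timeEffect = TRUE,
                               # index names follow the question's wording
                               data = pdata.frame(.x, index = c("individual id", "year")))))

# efficiencies for the nth industry model, as before
efficiencies(sfa_out$mod[[n]])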

ANOVA on multiple models stored in a tibble

Is it possible to use anova on multiple models stored in a tibble without listing them manually?
As an example, I predict wage from age in the Wage dataset from the ISLR2 library. I have a tibble with polynomial degrees in one column, GLM models in another, and CV errors in a third.
I can use anova through do.call, but it does not show p-values unless I pass test = 'F' as an argument.
library(ISLR2)
library(tidyverse)
library(boot)
GLM <- function(n) {
result <- glm(wage ~ poly(age, n), data = Wage)
return(result)
}
CV <- function(n) {
glm_fit <- glm(wage ~ poly(age, n), data = Wage)
result <- cv.glm(Wage, glm_fit, K = 10)$delta[1]
return(result)
}
set.seed(1)
models <- tibble(polynom_degrees = 1:10) %>%
mutate(linear_model = map(polynom_degrees, GLM)) %>%
mutate(CV_error = map(polynom_degrees, CV)) %>%
mutate(CV_error = unlist(CV_error))
do.call(anova, models$linear_model)
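One way to keep the do.call() pattern and still get p-values is to append test = 'F' to the argument list (a sketch based on the code above):

# pass the ten fitted models plus test = "F" as one argument list
do.call(anova, c(models$linear_model, list(test = "F")))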

Fixed within one class intercepts, fixed within another class regression coefficients, and errors; as columns in a tibble

library(tidyverse)
library(lme4)
library(broom.mixed)
Tibble = tibble(
Class1 = rep(c("TITUS","CAIUS"),27),
Class2 = rep(c("A","A","A",
"B","B","B",
"C","C","C"),6
),
Outcome = rnorm(54,5,2),
Predictor = Outcome + rnorm(54,0,2.5),
alpha = NA,
beta = NA)
lmer(data = Tibble,
Outcome ~ (0 + (Class1) + (0 + Predictor|(Class2)))) %>%
tidy(effects = c("fixed","ran_coefs")) -> model
for(i in 1:54) {
Tibble$alpha[i] <- model %>%
filter(effect == "fixed",
term == str_c("Class1",Tibble$Class1[i])) %>%
pull(estimate)
Tibble$beta[i] <- model %>%
filter(effect == "ran_coefs",
level == Tibble$Class2[i],
term == "Predictor") %>%
pull(estimate)
}
Tibble %>% mutate(
Predicted = (alpha + Predictor*beta),
epsilon = Outcome - Predicted) -> Tibble
Tibble %>% summarise(cor(Predicted,Outcome, method = "kendall"),
)
Key concepts: there is an alpha for each Class1 level and a beta for each Class2 level.
Epsilon is the residual.
I want to make the code above faster without resorting to a for loop.
I am also worried about the regression model, because I plan to run it on a tibble with more than 1 million observations, 600k Class1 levels, and 40k Class2 levels.
Note that this model formula and regression package are only one possible combination; suggestions are welcome. I am most interested in estimating the alphas, not in minimizing the epsilons.
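One way to drop the for loop is to turn the tidied output into two small lookup tables and join them onto Tibble (a sketch that reuses the model object produced by broom.mixed::tidy() above):

alphas <- model %>%
  filter(effect == "fixed") %>%
  mutate(Class1 = str_remove(term, "^Class1")) %>%
  select(Class1, alpha = estimate)

betas <- model %>%
  filter(effect == "ran_coefs", term == "Predictor") %>%
  select(Class2 = level, beta = estimate)

Tibble <- Tibble %>%
  select(-alpha, -beta) %>%        # drop the NA placeholders
  left_join(alphas, by = "Class1") %>%
  left_join(betas, by = "Class2") %>%
  mutate(Predicted = alpha + Predictor * beta,
         epsilon   = Outcome - Predicted)

Whether lmer() itself scales to 600k/40k grouping levels is a separate question; the joins only replace the bookkeeping loop.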

Change normal regression model to rolling regression

I have a sample data set with number of sales for 1000 different products in 13 countries over 3 years:
13 Countries = (US, China, UK…)
1000 Products = (Prod A, B, C …)
Number of Sales --> my dependent variable.
Number of Clicks and 3 more variables as independent variable.
I've coded a regression model and it works. In the next step I would like to do a rolling regression: how do I need to adapt the code for that?
Thanks for your help and many greetings! :)
# prepare data
nest_dt = raw_data %>%
group_by(product, country) %>%
nest()
# function
lm_function = function(data, formula) {
lm(formula = formula,
data = data)
}
# regression
lm_data = nest_dt %>%
mutate(lm = map(
.x = data,
.f = lm_function,
formula = sales ~ clicks + needs + guesses + interests
))
# show solution
solution_data = lm_data %>%
mutate(solution = map(lm, sw_tidy)) %>%
unnest(solution) %>%
select(-data, -lm)
# where to put?!?
rollapply(lm_function, width=10, roll)
You could try the roll_lm() function from the roll package. See the description here: package 'roll' on CRAN.
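A sketch of how that could slot into the nested data from the question (column names taken from the question; width = 10 is just an illustration, and each group's rows are assumed to be in time order):

library(roll)

roll_function = function(df, width = 10) {
  x = as.matrix(df[, c("clicks", "needs", "guesses", "interests")])
  y = df$sales
  roll_lm(x, y, width = width)  # rolling OLS over a window of `width` rows
}

rolling_data = nest_dt %>%
  mutate(roll_fit = map(.x = data, .f = roll_function))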
