I am very new to in bayesian analysis and I was trying to practice with an example from tidytuesday (https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-03-29/sports.csv)
I have set my model but when trying to run it the following error message appears:
Error in jags.model(textConnection(jags.script_with), data = dataset, :
RUNTIME ERROR:
Compilation error on line 5.
Dimension mismatch in subset expression of y
Below my approach:
Sports2 =
bind_rows(
sports_clean %>%
select(year, institution_name, sports,
participants = partic_women,
revenue = rev_women,
expenditure = exp_women) %>%
mutate(gender=1), #women
sports_clean %>%
select(year, institution_name, sports,
participants = partic_men,
revenue = rev_men,
expenditure = exp_men) %>%
mutate(gender=0) #men
) %>% na.omit
An example row of the dataset:
Year
institution_name
sports
participants
revenue
expenditure
gender
2015
Alabama A&M Uni
Soccer
21
410717
432648
1
#modeling with regression
set.seed(123)
model_with =
lm(expenditure ~ gender + participants, data=Sports2)
model_with
#dataset for jags model
dataset = list(x=Sports2[,c(4,7)], y=Sports2[,6], n=nrow(Sports2))
#estimation coefficients
dataset$b_guess = model_with$coefficients
#Model
jags.script_with =
"
model{
#likelihood
for (i in 1:n){
y[i] ~ dnorm(mu[i], tau)
mu[i] = intercept + participants*x[i,1]
}
#prioirs
intercept ~ dnorm(bgues[1], 0.1)
participants ~ dnorm(b_guess[2], 0.1)
tau ~ dgamma(0.01,0.01)
#transform
sigma = 1/sqrt(tau)
}
"
#compiling
mod_with = jags.model(textConnection(jags.script_with),
data = dataset,
n.chains = 4, n.adapt = 2000)
I can't figure out how to resolve the issue.
Looking for advice, please.
Thank you in advance!
Edit:
I have removed all (for the model) "unnecessary" parts. This is now the corrected code - unfortunately I cannot figure out why the error still persists.
Compiling model graph
Resolving undeclared variables
Deleting model
Error in jags.model(textConnection(jags.script_with), data = dataset, :
RUNTIME ERROR:
Compilation error on line 5.
Dimension mismatch in subset expression of y
library(rjags)
library(tidyverse)
library(ggplot2)
sports_raw = read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-03-29/sports.csv')
Sports2 =
bind_rows(
sports_raw %>%
select(year, institution_name, sports,
participants = partic_women,
revenue = rev_women,
expenditure = exp_women) %>%
mutate(gender=1), #women
sports_raw %>%
select(year, institution_name, sports,
participants = partic_men,
revenue = rev_men,
expenditure = exp_men) %>%
mutate(gender=0) #men
) %>% na.omit
#modeling with regression
set.seed(123)
model_with =
lm(expenditure ~ gender + participants, data=Sports2)
model_with
#dataset for jags model
dataset = list(x=Sports2[,c(4,7)], y=Sports2[,6], n=nrow(Sports2))
#estimation coefficients
dataset$b_guess = model_with$coefficients
#Model
jags.script_with =
"
model{
#likelihood
for (i in 1:n){
y[i] ~ dnorm(mu[i], tau)
mu[i] = intercept + participants*x[i,1]
}
#prioirs
intercept ~ dnorm(b_guess[1], 0.1)
participants ~ dnorm(b_guess[2], 0.1)
tau ~ dgamma(0.01,0.01)
#transform
sigma = 1/sqrt(tau)
}
"
#compiling
mod_with = jags.model(textConnection(jags.script_with),
data = dataset,
n.chains = 4, n.adapt = 2000)
The problem was that in your original code, you're subsetting a tibble using the [ and unlike in a regular data frame, where it would turn that single column into a vector, the tibble remains a tibble with one variable. The error really states that instead of being a vector as you intend in the model code, the y variable is actually a one-column data frame, which JAGS treats differently from a vector.
library(rjags)
library(tidyverse)
library(ggplot2)
sports_raw = read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-03-29/sports.csv')
Sports2 =
bind_rows(
sports_raw %>%
select(year, institution_name, sports,
participants = partic_women,
revenue = rev_women,
expenditure = exp_women) %>%
mutate(gender=1), #women
sports_raw %>%
select(year, institution_name, sports,
participants = partic_men,
revenue = rev_men,
expenditure = exp_men) %>%
mutate(gender=0) #men
) %>% na.omit
#modeling with regression
set.seed(123)
model_with =
lm(expenditure ~ gender + participants, data=Sports2)
model_with
#dataset for jags model
dataset = list(x=Sports2[,c(4,7)], y=Sports2[,6], n=nrow(Sports2))
dim(dataset$y)
[1] 130748 1
There are two ways to fix this, you can make y a vector in the data:
dataset = list(x=Sports2[,c(4,7)],
y=Sports2$expenditure, n=nrow(Sports2))
dim(dataset$y)
# NULL
length(dataset$y)
# [1] 130748
Or, you could change the likelihood part of your model to acknowledge that y is a one-column matrix:
y[i,1] ~ dnorm(mu[i], tau)
The rest of the model could stay as it is. Do one or the other of these (though not both at the same time) and your model will run.
Related
I want to create GLMER models with the SMOTE algorithm and resampling. This means that I need to create a recipe with step_smote() and use fit_resamples(). After reading this post, and this post, I learned that I need to use add_model() and add_variable() in order to create a workflow object, and that add_recipe() cannot be used in this process. This means I cannot use step_smote(). The first link addresses this limitation. Now the question is how I can use the SMOTE algorithm in this situation. In addition, I want to use a few other step_***() functions such as step_dummy() and step_rm(). In short, I want to use a recipe and cross validation in GLMER modeling processes. Is there any way to make this happen? I decided to use the example in this post so that everyone can reproduce the same situation. Note that solutions do not have to be in tidymodels ways.
library(tidyverse)
library(tidymodels)
library(themis)
# I use data from https://juliasilge.com/blog/himalayan-climbing/
members <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/members.csv")
# These are the same processes from the post above.
members_df <- members %>%
filter(season != "Unknown", !is.na(sex), !is.na(citizenship)) %>%
select(peak_id, year, season, sex, age, citizenship, hired, success, died) %>%
mutate(died = case_when(died ~ "died", TRUE ~ "survived")) %>%
mutate_if(is.character, factor) %>%
mutate_if(is.logical, as.integer)
# Since the dataset is large, I subset it here for this question.
set.seed(386)
bind_rows(filter(members_df, died == "died"),
filter(members_df, died == "survived") %>% sample_n(2000)) -> members_df
# Create data sets
set.seed(123)
members_split <- initial_split(members_df, strata = died)
members_train <- training(members_split)
members_test <- testing(members_split)
# Create CV folds
set.seed(123)
members_folds <- vfold_cv(members_train, v = 10, repeats = 5, strata = died)
# Here I wanna use glmer. Create model specification
logistic_reg() %>%
set_engine("glmer", family = binomial,
control = glmerControl(optimizer = "bobyqa",
optCtrl = list(maxfun = 2e5))) %>%
set_mode("classification") -> members_spec
# As far as I learned from these links, I cannot use a recipe in order to run GLMER. But I create a recipe, anyway.
# The key is that I want to apply SMOTE algorithm to glmer modeling in tidymodels ways.
recipe(died ~ ., data = members_train) %>%
step_rm(citizenship) %>%
step_impute_median(age) %>%
step_other(peak_id) %>%
step_dummy(all_nominal(), -died) %>%
step_smote(died) -> members_recipe
# This works as long as users do not have to use the recipe package.
# This is addressed in https://github.com/tidymodels/multilevelmod/issues/4
workflow() %>%
add_model(spec = members_spec,
formula = died ~ year + season + sex + age + (1|peak_id)) %>%
add_variables(outcomes = died,
predictors = c(year, season, sex, age, peak_id)) -> members_workflow
# This is the WRONG way.
workflow() %>%
add_model(spec = members_spec,
formula = died ~ year + season + sex + age + (1|peak_id)) %>%
add_recipe(members_recipe) -> wrong_workflow
# This is another WRONG way. This causes an error message.
workflow() %>%
add_model(spec = members_spec,
formula = died ~ year + season + sex + age + (1|peak_id)) %>%
add_variables(outcomes = died,
predictors = c(year, season, sex, age, peak_id)) %>%
add_recipe(members_recipe)
# Ideally, the workflow object here includes a recipe. But can we make it?
set.seed(777)
fit_resamples(object = wrong_workflow, # This workflow object needs to include a recipe
resamples = members_folds,
control = control_resamples(save_pred = TRUE, save_workflow = TRUE)) -> members_res
I am running a Stochastic Frontier model (using the package frontier) by the group industry as follows:
data is a panel data frame with index year and individual id and columns as below:
Columns: y1, x1, x2and x3 are all numerical variables. industry is a character variable.
library(dplyr)
library(frontier)
sfa_out <- data %>%
group_by(industry) %>%
do(
mod <- sfa(log(y1) ~ log(x1) + log(x2) + log(x3),
ineffDecrease = T,
truncNorm = F,
timeEffect = T,
data = .))
I want mod to store the output of the industry-group SFA estimated models. I don't think SFA-specific knowledge is required here. Thanks.
This did the trick:
library(dplyr)
library(frontier)
library(plm)
sfa_out <- data %>%
group_by(industry) %>%
do(
mod = sfa(log(y1) ~ log(x1) + log(x2) + log(x3),
ineffDecrease = T,
truncNorm = F,
timeEffect = T,
data = pdata.frame(., index = c("individual id", "year"))))
To then display each model you could do this:
# to display all industry models
sfa_out$mod
# to display specific industry model
sfa$mod[[1]]
sfa$mod[[2]]
.
.
.
sfa$mod[[n]]
# to get the estimated efficiency measure for nth model in sfa_out$mod
efficiencies(sfa_out$mod$[[n]])
It is possible to use anova on multible models stored in a tibble without listing them manually.
An example prediction of wage from age in Wage dataset from the ISLR2 library. I have a tibble a column for polynomial degrees in one column, GLM models in another and CV errors in the third column.
I can use anova through do.call but it does not show p-values without passing test = 'F' as an argument.
library(ISLR2)
library(tidyverse)
library(boot)
GLM <- function(n) {
result <- glm(wage ~ poly(age, n), data = Wage)
return(result)
}
CV <- function(n) {
glm_fit <- glm(wage ~ poly(age, n), data = Wage)
result <- cv.glm(Wage, glm_fit, K = 10)$delta[1]
return(result)
}
set.seed(1)
models <- tibble(polynom_degrees = 1:10) %>%
mutate(linear_model = map(polynom_degrees, GLM)) %>%
mutate(CV_error = map(polynom_degrees, CV)) %>%
mutate(CV_error = unlist(CV_error))
do.call(anova, models$linear_model)
library(tidyverse)
library(lme4)
library(broom.mixed)
Tibble = tibble(
Class1 = rep(c("TITUS","CAIUS"),27),
Class2 = rep(c("A","A","A",
"B","B","B",
"C","C","C"),6
),
Outcome = rnorm(54,5,2),
Predictor = Outcome + rnorm(54,0,2.5),
alpha = NA,
beta = NA)
lmer(data = Tibble,
Outcome ~ (0 + (Class1) + (0 + Predictor|(Class2)))) %>%
tidy(effects = c("fixed","ran_coefs")) -> model
for(i in 1:54) {
Tibble$alpha[i] <- model %>%
filter(effect == "fixed",
term == str_c("Class1",Tibble$Class1[i])) %>%
pull(estimate)
Tibble$beta[i] <- model %>%
filter(effect == "ran_coefs",
level == Tibble$Class2[i],
term == "Predictor") %>%
pull(estimate)
}
Tibble %>% mutate(
Predicted = (alpha + Predictor*beta),
epsilon = Outcome - Predicted) -> Tibble
Tibble %>% summarise(cor(Predicted,Outcome, method = "kendall"),
)
Key concepts: there is a alpha for each class1- There is a beta for each class2.
Epsilon is the residual.
I want to make the code above faster without recurring to a for cycle.
Also, I am very worried about the regression model, because I plan to do it in a tibble with more than 1 million observation, 600k class1, 40k class2.
Notice that that model formula and the regression package is only one possible combination and you can give suggestions. I am most interested in alpha's estimation, not in minimizing epsilons.
I have a sample data set with number of sales for 1000 different products in 13 countries over 3 years:
13 Countries = (US, China, UK…)
1000 Products = (Prod A, B, C …)
Number of Sales --> my dependent variable.
Number of Clicks and 3 more variables as independent variable.
I've coded a regression model and it works. In the next step I would like to do a rolling regression: How do I have to adapt the code for that?
Thanks for your help and many greetings! :)
# prepare data
nest_dt = raw_data %>%
group_by(product, country) %>%
nest()
# function
lm_function = function(data, formula) {
lm(formula = formula,
data = data)
}
# regression
lm_data = nest_data %>%
mutate(lm = map(
.x = data,
.f = lm_function,
formula = sales ~ clicks + needs + guesses + interests
))
# show solution
solution_data = lm_data %>%
mutate(solution = map(lm, sw_tidy)) %>%
unnest(solution) %>%
select(-data, -lm)
# where to put?!?
rollapply(lm_function, width=10, roll)
You could try the roll_lm function from the roll package. See the description here: Package ‘roll’ on Cran