brms: how do I set a prior on a categorical variable?

I am building a binomial regression model using 2 categorical variables, following an example from the book Statistical Rethinking. In the book, using the rethinking package, we can set a prior on each categorical variable as shown below:
m11.5 <- ulam(
  alist(
    pulled_left ~ dbinom(1, p),
    logit(p) <- a[actor] + b[treatment],
    a[actor] ~ dnorm(0, 1.5),
    b[treatment] ~ dnorm(0, 0.5)
  ),
  data = d, chains = 4, log_lik = TRUE)
I am trying to do the same in brms.
priors <- c(prior(normal(0, 1.5), class = b, coef = "actor"),
            prior(normal(0, 0.5), class = b, coef = "treatment"))
m11.5 <- brm(data = d, family = binomial,
             formula = pulled_left | trials(1) ~ 1 + actor + treatment,
             sample_prior = TRUE, prior = priors,
             cores = 4, chains = 4)
I would like to set the prior for all of the actor levels, and for all of the treatment levels, mentioning each variable only once. However, the above code doesn't go through; brm complains that the priors do not correspond to any model parameter.
Upon using get_prior, I see that the factor levels are split internally, with one coefficient per non-reference level (actor2, ..., actor7 and treatment2, ..., treatment4).
I do not want to specify the prior for each level of the categorical variables by hand. How do I do it? Please advise.

You can build the coefficient names with paste and set multiple coefficients at once:
priors <- c(prior_string("normal(0, 1.5)", class = "b", coef = paste("actor", 2:7, sep = "")),
            prior_string("normal(0, 0.5)", class = "b", coef = paste("treatment", 2:4, sep = "")))


Getting error "invalid type (list) for variable" when running multiple models in a for loop: how to specify outcome/predictors?

For a study I am working on I need to create bootstrapped datasets and inverse probability weights for each dataset, and then run a series of models for each of these datasets/weights. I am attempting to do this with a nested for-loop, where the first part of the loop creates the weights and the nested loop runs a series of models, each with different outcome variables and/or predictors. I am running about 80 models for each bootstrapped dataset, hence the need for a more automated approach. Below is an example of what I am doing with some mock data:
# Creation of mock data (MASS provides polr(), used below)
library(MASS)
data <- data.frame(
  "Severity" = as.factor(c(rep("None", 25), rep("Mild", 25), rep("Moderate", 25), rep("Severe", 25))),
  "Severity2" = as.factor(c(rep("None", 40), rep("Mild", 20), rep("Moderate", 20), rep("Severe", 20))),
  "Weight" = rnorm(100, mean = 160, sd = 30),
  "Age" = rnorm(100, mean = 40, sd = 7),
  "Gender" = as.factor(rbinom(100, size = 1, prob = 0.5)),
  "Tested" = as.factor(rbinom(100, size = 1, prob = 0.4)))
data$Severity <- ifelse(data$Tested == 0, NA, data$Severity)
data$Severity2 <- ifelse(data$Tested == 0, NA, data$Severity2)
data$Severity <- ordered(data$Severity, levels = c("None", "Mild", "Moderate", "Severe"))
data$Severity2 <- ordered(data$Severity2, levels = c("None", "Mild", "Moderate", "Severe"))
# Creating bootstrapped datasets
nboot <- 2
set.seed(10)
boot.samples <- lapply(1:nboot, function(i) {
  data[base::sample(1:nrow(data), replace = TRUE), ]
})
# Create empty list to store results later
coefs <- list()
# Setting up the outcomes/predictors of each of the models I will run
mod1 <- list(outcome = "Severity", preds = c("Weight", "Age"))
mod2 <- list(outcome = "Severity2", preds = c("Weight", "Age", "Gender"))
models <- list(mod1, mod2)
# Running the for-loop
for(i in 1:length(boot.samples)) {
  # Setting up weight creation
  null <- glm(formula = Tested ~ 1, family = "binomial", data = boot.samples[[i]])
  full <- glm(formula = Tested ~ Age, family = "binomial", data = boot.samples[[i]])
  step <- step(null, k = 2, direction = "forward",
               scope = list(lower = null, upper = full), trace = 0)
  pd.combined <- stats::predict(step, type = "response")
  numer.combined <- glm(Tested ~ 1, family = "binomial", data = boot.samples[[i]])
  pn.combined <- stats::predict(numer.combined, type = "response")
  # Creating stabilized weights
  boot.samples[[i]]$ipw <- ifelse(boot.samples[[i]]$Tested == 0,
                                  (1 - pn.combined) / (1 - pd.combined),
                                  pn.combined / pd.combined)
  # Now running each model and storing the coefficients
  for(j in 1:length(models)) {
    outcome <- models[[j]][[1]]    # Set the outcome name
    predictors <- models[[j]][[2]] # Set the predictor names
    model_results <- polr(boot.samples[[i]][, outcome] ~ boot.samples[[i]][, predictors],
                          weights = boot.samples[[i]]$ipw, method = "logistic",
                          Hess = TRUE) # Run the model
    coefs[[j]] <- model_results$coefficients # Store regression model coefficients in list
  }
}
The portion for creating the IPW weights works just fine, but I keep getting an error for the modeling portion that reads:
"Error in model.frame.default(formula = boot.samples[[i]][, outcome] ~ :
invalid type (list) for variable 'boot.samples[[i]][, predictors]'"
Based on the question asked and answered here: Error in model.frame.default ..... : invalid type (list) for variable, I know that the issue is with how I'm calling the outcomes and predictors in the model. I've messed around with lots of different ways to handle this, to no avail. I need to specify the outcome and predictors this way because in my actual models the outcomes and predictors change with each model! Any ideas on how to deal with this would be greatly appreciated!
I've tried something like setting outcome <- boot.samples[[i]][,outcome] outside of the model and then just calling outcome in the model, but that gives me the same error.
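One standard way around this error (a sketch against the mock data above, not tested on the full study) is to build a proper formula object from the stored names with reformulate() and pass the bootstrap sample via data =, so polr() sees column names rather than extracted vectors:
# Inside the inner loop: construct outcome ~ pred1 + pred2 + ...,
# then let polr() resolve the columns (and the ipw weights) in the data.
f <- reformulate(termlabels = predictors, response = outcome)
model_results <- polr(f, data = boot.samples[[i]],
                      weights = ipw, method = "logistic", Hess = TRUE)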

Fitting a multinomial glm for a very large dataset

I have multinomial compositional data for 100 categories from two groups, where each group is represented at two ages:
set.seed(1)
df <- data.frame(group = c(rep("g1", 200), rep("g2", 200)),
                 age = c(rep("a1", 100), rep("a2", 100), rep("a1", 100), rep("a2", 100)),
                 category = rep(paste0("c", 1:100), 4),
                 n = c(rmultinom(1, 7000, pgamma(shape = 0.8, rate = 0.1, q = seq(0.01, 1, 0.01))),
                       rmultinom(1, 5000, pgamma(shape = 0.8, rate = 0.3, q = seq(0.01, 1, 0.01))),
                       rmultinom(1, 1800, pgamma(shape = 0.5, rate = 0.1, q = seq(0.01, 1, 0.01))),
                       rmultinom(1, 1200, pgamma(shape = 0.9, rate = 0.1, q = seq(0.01, 1, 0.01)))),
                 stringsAsFactors = F)
I want to fit a regression model to estimate the interaction effect of the category * group, while controlling for age.
So far, I'm trying to use a multi-category glm (with binomial(link = 'logit')), fitted to a data.frame where I transform df$n (total counts) into binomial (binary) form:
binomial.df <- do.call(rbind, lapply(unique(df$group), function(g){
  do.call(rbind, lapply(unique(dplyr::filter(df, group == g)$age), function(a){
    do.call(rbind, lapply(unique(dplyr::filter(df, group == g)$category), function(t){
      sum.non.category <- sum(dplyr::filter(df, group == g & age == a & category != t)$n)
      sum.category <- sum(dplyr::filter(df, group == g & age == a & category == t)$n)
      data.frame(group = g, age = a, category = t,
                 assigned.category = c(rep(0, sum.non.category), rep(1, sum.category)))
    }))
  }))
}))
binomial.df$group <- factor(binomial.df$group, levels = c("g1","g2"))
binomial.df$age <- factor(binomial.df$age, levels = c("a1","a2"))
binomial.df$category <- factor(binomial.df$category, levels = paste0("c",1:100))
mm.fit <- glm(assigned.category ~ category * group + age,data = binomial.df, family = binomial(link = 'logit'))
Clearly, for data of this size the glm call will run for days or even longer, so I'm looking for a more tractable way. Any ideas?
BTW, I tried using nnet's multinom first:
df$group <- factor(df$group, levels = c("g1","g2"))
df$age <- factor(df$age, levels = c("a1","a2"))
df$category <- factor(df$category, levels = paste0("c",1:100))
mm.fit <- nnet::multinom(n ~ category * group + age, data=df)
But I get:
Error in nnet.default(X, Y, w, mask = mask, size = 0, skip = TRUE, softmax = TRUE, :
too many (22220) weights
You can resolve the nnet::multinom issue by modifying your multinom call to include the argument MaxNWts = 100000:
mm.fit <- nnet::multinom(n ~ category * group + age, data=df, MaxNWts=100000)
To fit large multinomial models in R you could also look into the h2o package:
https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/glm.html
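A rough sketch of that route (assumptions on my part: a local h2o cluster can be started, and weighting each category row by its count n matches your sampling scheme), modelling category as the multinomial response with group and age as predictors:
library(h2o)
h2o.init()
hf <- as.h2o(df)
# Multinomial logit: the per-category coefficients for group play the
# role of the category * group interaction in the binomial expansion.
fit <- h2o.glm(y = "category", x = c("group", "age"),
               training_frame = hf, family = "multinomial",
               weights_column = "n")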

nlme with correlation structure not fitting and crashes R

I have a mixed model with a non-linear term, so I would like to use the nlme() function from the R package nlme instead of lme(). However, switching to nlme(), even without adding anything to the model, causes RStudio and R to crash.
I have found that even generated data, which can easily be fitted using lme, causes this behaviour (on my computer).
Let's start by loading the libraries and setting up a data.frame with the grouping id and spatial coordinate x.
library(nlme)
nid <- 300
nx <- 10
data <- expand.grid(
  x = seq(nx),
  id = seq(nid)
)
Now, let's add correlated error and uncorrelated error as separate columns, as well as a random intercept value per id. The output of arima.sim requires a normalisation step.
data$ec <- c(
  replicate(
    nid,
    as.numeric(
      arima.sim(
        model = list(
          order = c(1, 0, 0),
          ar = 0.5
        ),
        n = nx
      )
    )
  )
)
data$ec <- data$ec / sd(data$ec)
data$eu <- rnorm(nid * nx)
data$random <- rep(rnorm(nid), each = nx)
Now, we can create 3 dependent variables, for 3 models. The first is a mixed model with uncorrelated (regular) error. The second includes an exponential (AR1) correlation structure. The third combines both. I am using an intercept of 1, a random-effect sd of 2, and a total residual sd of 3.
data$y1 <- 1 + 2 * data$random + 3 * data$eu
data$y2 <- 1 + 2 * data$random + 3 * data$ec
data$y3 <- 1 + 2 * data$random + sqrt(8) * data$ec + sqrt(1) * data$eu
All of the following lme models fit without problem, giving the expected result.
l1 <- lme(
  fixed = y1 ~ 1,
  random = ~ 1 | id,
  data = data,
  method = "ML"
)
l2 <- lme(
  fixed = y2 ~ 1,
  random = ~ 1 | id,
  correlation = corExp(
    form = ~ x | id
  ),
  data = data,
  method = "ML"
)
l3 <- lme(
  fixed = y3 ~ 1,
  random = ~ 1 | id,
  correlation = corExp(
    form = ~ x | id,
    nugget = TRUE
  ),
  data = data,
  method = "ML"
)
As far as I know, the following nlme code specifies exactly the same models as above. The first runs without issues, but the ones with a correlation structure crash R / RStudio. No warning or error message is given. Fiddling with the arguments and with nlmeControl does not help, though I do suspect nlmeControl is the place to look for a solution.
nlme(
  model = y1 ~ b0,
  fixed = b0 ~ 1,
  random = b0 ~ 1,
  group = ~ id,
  data = data,
  start = list(
    fixed = fixed.effects(l1),
    random = setNames(random.effects(l1), "b0")
  ),
  method = "ML"
)
nlme(
  model = y2 ~ b0,
  fixed = b0 ~ 1,
  random = b0 ~ 1,
  group = ~ id,
  correlation = corExp(
    form = ~ x
  ),
  data = data,
  start = list(
    fixed = fixed.effects(l2),
    random = setNames(random.effects(l2), "b0")
  ),
  method = "ML"
)
nlme(
  model = y3 ~ b0,
  fixed = b0 ~ 1,
  random = b0 ~ 1,
  group = ~ id,
  correlation = corExp(
    form = ~ x,
    nugget = TRUE
  ),
  data = data,
  start = list(
    fixed = fixed.effects(l3),
    random = setNames(random.effects(l3), "b0")
  ),
  method = "ML"
)
Has anyone experienced this before? Does my example code give the same problem on your computer? What are good strategies to change nlmeControl to attempt to remedy this?
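In case it helps anyone reproduce or debug this: a minimal starting point for experimenting with nlmeControl (an assumption on my side, not a confirmed fix) is to switch the optimiser and turn on verbose output, so you can at least see how far the fit gets before the crash:
# Hypothetical remedy: alternative optimiser plus verbose tracing.
ctrl <- nlmeControl(opt = "nlm", msVerbose = TRUE, maxIter = 100)
nlme(
  model = y2 ~ b0,
  fixed = b0 ~ 1,
  random = b0 ~ 1,
  group = ~ id,
  correlation = corExp(form = ~ x),
  data = data,
  start = list(
    fixed = fixed.effects(l2),
    random = setNames(random.effects(l2), "b0")
  ),
  control = ctrl,
  method = "ML"
)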

mgcv::gamm() and MuMIn::dredge() errors

I've been trying to fit multiple GAMs using the package mgcv within a function, and to crudely select the most appropriate model through model selection procedures. But my function runs the first model and then doesn't seem to recognise the input data dat again.
I get the error
Error in is.data.frame(data) : object 'dat' not found.
I think this is a scoping problem and I've looked here, and here for help but cannot figure it out.
Code and data are as follows (hopefully reproducible):
https://github.com/cwaldock1/Help/blob/master/test_gam.csv
library(mgcv)
library(MuMIn) # for dredge() and get.models()
# Function to fit multiple models
best.mod <- function(dat) {
  # Set up control structure
  ctrl <- list(niterEM = 0, msVerbose = TRUE, optimMethod = "L-BFGS-B")
  # AR(1)
  m1 <- get.models(dredge(gamm(Temp ~ s(Month, bs = "cc") + s(Date, bs = "cr") + Year,
                               data = dat, correlation = corARMA(form = ~ 1 | Year, p = 1),
                               control = ctrl)), subset = 1)[[1]]
  # AR(2)
  m2 <- get.models(dredge(gamm(Temp ~ s(Month, bs = "cc") + s(Date, bs = "cr") + Year,
                               data = dat, correlation = corARMA(form = ~ 1 | Year, p = 2),
                               control = ctrl)), subset = 1)[[1]]
  # AR(3)
  m3 <- get.models(dredge(gamm(Temp ~ s(Month, bs = "cc") + s(Date, bs = "cr") + Year,
                               data = dat, correlation = corARMA(form = ~ 1 | Year, p = 3),
                               control = ctrl)), subset = 1)[[1]]
  ### Select best model to work with based on unselective AIC criteria
  if (AIC(m2$lme) > AIC(m1$lme)) { mod <- m1 } else { mod <- m2 }
  if (AIC(mod$lme) > AIC(m3$lme)) { mod <- m3 }
  return(mod$gam)
}
mod2 <- best.mod(dat = test_gam)
Any help would be greatly appreciated.
Thanks,
Conor
get.models evaluates the calls in the model's formula environment, which in gamm is
(always?) .GlobalEnv, while it should be the function's environment (i.e.
sys.frames(sys.nframe())).
So, instead of
get.models(ms, 1)
use
eval(getCall(ms, 1))
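Applied to the first model in best.mod() above, that pattern would look something like this (a sketch; the same change applies to m2 and m3):
# Evaluate the selected model's call here, in the function's frame,
# so gamm() can find `dat`:
ms1 <- dredge(gamm(Temp ~ s(Month, bs = "cc") + s(Date, bs = "cr") + Year,
                   data = dat, correlation = corARMA(form = ~ 1 | Year, p = 1),
                   control = ctrl))
m1 <- eval(getCall(ms1, 1))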

predicting from flexmix object (R)

I fit some data to a mixture distribution of two Gaussians in flexmix:
data("NPreg", package = "flexmix")
mod <- flexmix(yn ~ x, data = NPreg, k = 2,
               model = list(FLXMRglm(yn ~ x, family = "gaussian"),
                            FLXMRglm(yn ~ x, family = "gaussian")))
the model fit is as follows:
> mod
Call:
flexmix(formula = yn ~ x, data = NPreg, k = 2, model = list(FLXMRglm(yn ~ x, family = "gaussian"),
FLXMRglm(yn ~ x, family = "gaussian")))
Cluster sizes:
1 2
74 126
convergence after 31 iterations
But how do I actually predict from this model?
when I do
pred <- predict(mod, NPreg)
I get a list with the predictions from each of the two components
To get a single prediction, do I have to add in the cluster sizes like this?
single <- (74/200)* pred$Comp.1[,1] + (126/200)*pred$Comp.2[,2]
I use flexmix for prediction in the following way:
pred <- predict(mod, NPreg)
clust <- clusters(mod, NPreg)
result <- cbind(NPreg, data.frame(pred), data.frame(clust))
plot(result$yn, col = c("red", "blue")[result$clust], pch = 16, ylab = "yn")
And the confusion matrix:
table(result$class,result$clust)
For getting the predicted values of yn, I select the component value of the cluster to which a data point belongs.
for(i in 1:nrow(result)){
  result$pred_model1[i] <- result[, paste0("Comp.", result$clust[i], ".1")][i]
  result$pred_model2[i] <- result[, paste0("Comp.", result$clust[i], ".2")][i]
}
The actual vs predicted results show the fit (I'm adding only one of them here, as both of your component models are the same; you would use pred_model2 for the second model).
library(ggplot2) # for qplot()
qplot(result$yn, result$pred_model1, xlab = "Actual", ylab = "Predicted") + geom_abline()
RMSE <- sqrt(mean((result$yn - result$pred_model1)^2))
gives a root mean square error of 5.54.
This answer is based on many SO answers I read through while working with flexmix. It worked well for my problem.
You may also be interested in visualizing the two distributions. My model was the following, which shows some overlap, as the ratios of the components are not close to 1.
Call:
flexmix(formula = yn ~ x, data = NPreg, k = 2,
model = list(FLXMRglm(yn ~ x, family = "gaussian"),
FLXMRglm(yn ~ x, family = "gaussian")))
prior size post>0 ratio
Comp.1 0.481 102 129 0.791
Comp.2 0.519 98 171 0.573
'log Lik.' -1312.127 (df=13)
AIC: 2650.255 BIC: 2693.133
I also generate a density distribution with histograms to visualize both components. This was inspired by an SO answer from the maintainer of betareg.
a <- subset(result, clust == 1)
b <- subset(result, clust == 2)
hist(a$yn, col = hcl(0, 50, 80), main = "", xlab = "", freq = FALSE, ylim = c(0, 0.06))
hist(b$yn, col = hcl(240, 50, 80), add = TRUE, main = "", xlab = "", freq = FALSE, ylim = c(0, 0.06))
ys <- seq(0, 50, by = 0.1)
lines(ys, dnorm(ys, mean = mean(a$yn), sd = sd(a$yn)), col = hcl(0, 80, 50), lwd = 2)
lines(ys, dnorm(ys, mean = mean(b$yn), sd = sd(b$yn)), col = hcl(240, 80, 50), lwd = 2)
# Joint histogram
p <- prior(mod)
hist(result$yn, freq = FALSE, main = "", xlab = "", ylim = c(0, 0.06))
lines(ys, p[1] * dnorm(ys, mean = mean(a$yn), sd = sd(a$yn)) +
        p[2] * dnorm(ys, mean = mean(b$yn), sd = sd(b$yn)))
You can pass an additional argument to your prediction call.
pred <- predict(mod, NPreg, aggregate = TRUE)[[1]][,1]
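If I understand the flexmix semantics correctly (worth verifying against the package documentation), aggregate = TRUE weights each component's prediction by the fitted component priors, i.e. a cleaner version of the manual mixture attempted in the question:
# Should roughly match predict(mod, NPreg, aggregate = TRUE)[[1]][,1],
# using the fitted priors instead of the hand-coded 74/200 and 126/200:
p <- prior(mod)
manual <- p[1] * pred$Comp.1[, 1] + p[2] * pred$Comp.2[, 1]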
