Consider the following data frame:
set.seed(5678)
sub_df<- data.frame(clustersize= rep(1, 4),
lepsp= c("A", "B", "C", "D"),
dens= round(runif(4, c(0, 1)), 3),
db= sample(1:10, 4, replace=TRUE))
Let's say I wanted to run the following Bayes linear model which returns samples, an mc.array object:
library("rjags")
library("coda")
dataForJags <- list(dens=sub_df$dens, db=sub_df$db, N=length(sub_df$dens))
model<-"model{
for(i in 1:N){
dens[i] ~ dnorm(mu[i], tau)
# identity
mu[i] <- int + beta1*db[i]
}
tau ~ dgamma(0.1,0.1)
int ~ dnorm(0, 0.001)
beta1 ~ dnorm(0, 0.001)
}"
##compile
mod1 <- jags.model(textConnection(model),data= dataForJags,n.chains=2)
##samples returns a list of mcarray objects
samples<-jags.samples(model= mod1,variable.names=c("beta1",
"int","mu","tau"),n.iter=100000)
Given that samples$beta1[,,] represents random samples from the posterior distribution of the parameters of the jags model, then to summarize, my next step would be to calculate the mean and the 95% credible intervals of the posterior distribution. So I would use:
coeff_output<- round(quantile(samples$beta1[,,],probs=c(0.5,0.025,0.975)),3)
Now, let's say my actual data frame has multiple levels of clustersize.
set.seed(5672)
df<- data.frame(clustersize= c(rep(1, 4), rep(2,4), rep(3, 3)),
lepsp= c("A", "B", "C", "D", "B", "C", "D", "E", "A", "D", "F"),
dens= round(runif(11, c(0, 1)), 3),
db= sample(1:10, 11, replace=TRUE))
How would I run this model for each level of clustersize separately and compile the output into a single result data frame using a forloop or apply function? For each level of clustersize, the resulting mc.array object samples should be output to result_list and the coeff_output should be output to a data frame result_coeff.
Below I calculate the output for each clustersize separately, to produce the expected result list and data frame.
#clustersize==1
sub_df1<- data.frame(clustersize= rep(1, 4),
lepsp= c("A", "B", "C", "D"),
dens= round(runif(4, c(0, 1)), 3),
db= sample(1:10, 4, replace=TRUE))
dataForJags <- list(dens=sub_df$dens, db=sub_df$db, N=length(sub_df$dens))
model<-"model{
for(i in 1:N){
dens[i] ~ dnorm(mu[i], tau)
mu[i] <- int + beta1*db[i]
}
tau ~ dgamma(0.1,0.1)
int ~ dnorm(0, 0.001)
beta1 ~ dnorm(0, 0.001)
}"
mod1 <- jags.model(textConnection(model),data= dataForJags,n.chains=2)
samples1<-jags.samples(model= mod1,variable.names=c("beta1",
"int","mu","tau"),n.iter=100000)
coeff_output1<-
data.frame(as.list(round(quantile(samples1$beta1[,,],probs=c(0.5,0.025,0.975)),3)))
#clustersize==2
sub_df2<- data.frame(clustersize= rep(2,4),
lepsp= c( "B", "C", "D", "E"),
dens= round(runif(4, c(0, 1)), 3),
db= sample(1:10, 4, replace=TRUE))
dataForJags <- list(dens=sub_df$dens, db=sub_df$db, N=length(sub_df$dens))
model<-"model{
for(i in 1:N){
dens[i] ~ dnorm(mu[i], tau)
mu[i] <- int + beta1*db[i]
}
tau ~ dgamma(0.1,0.1)
int ~ dnorm(0, 0.001)
beta1 ~ dnorm(0, 0.001)
}"
mod1 <- jags.model(textConnection(model),data= dataForJags,n.chains=2)
samples2<-jags.samples(model= mod1,variable.names=c("beta1",
"int","mu","tau"),n.iter=100000)
coeff_output2<-
data.frame(as.list(round(quantile(samples2$beta1[,,],probs=c(0.5,0.025,0.975)),3)))
#clustersize==3
sub_df3<- data.frame(clustersize= rep(3, 3),
lepsp= c("A", "D", "F"),
dens= round(runif(3, c(0, 1)), 3),
db= sample(1:10, 3, replace=TRUE))
dataForJags <- list(dens=sub_df$dens, db=sub_df$db, N=length(sub_df$dens))
model<-"model{
for(i in 1:N){
dens[i] ~ dnorm(mu[i], tau)
mu[i] <- int + beta1*db[i]
}
tau ~ dgamma(0.1,0.1)
int ~ dnorm(0, 0.001)
beta1 ~ dnorm(0, 0.001)
}"
mod1 <- jags.model(textConnection(model),data= dataForJags,n.chains=2)
samples3<-jags.samples(model= mod1,variable.names=c("beta1",
"int","mu","tau"),n.iter=100000)
coeff_output3<-
data.frame(as.list(round(quantile(samples3$beta1[,,],probs=c(0.5,0.025,0.975)),3)))
Desired final output:
result_list<- list(samples1, samples2, samples3)
result_coeff<-rbind(coeff_output1, coeff_output2, coeff_output3)
Here is a link to the actual data frame. The solution should be able to process a large dataframe with clustersizes up to 600.
download.file("https://drive.google.com/file/d/1ZYIQtb_QHbYsInDGkta-5P2EJrFRDf22/view?usp=sharing",temp)
There are a few issues to consider here, which are caused by the scale of what you're trying to do. You are creating over 550 different jags.sample objects with 100000 iterations each, and then trying to store all of them in a single list. On most machines, this will cause memory issues: the output is simply too large.
There are at least two ways that we can deal with this:
Take measures to reduce the memory usage of our input data by as much as possible.
Tune our JAGS output so that it does not save so many iterations from each chain.
I've made a number of modifications to your code that should allow it to work with your actual dataset.
Creating Input Data:
In your original code, clustersize and db both have the data type numeric, even though they only need to be integers. The numeric type takes 8 bytes, while the integer type only takes 4 bytes. If we coerce these two columns to the integer type, we can actually reduce the memory size of the list of dataframes in the next step by about 30%.
library("tidyverse")
#### Load Raw Data ####
df <- read_csv("example.csv") %>%
select(-1) %>%
mutate(clustersize = as.integer(clustersize),
db = as.integer(db))
Initial JAGS Tuning
You are using far too many iterations for each of your chains; niter = 100000 is extremely high. You should also be specifying a burn-in period using n.burn, an adaptation period using n.adapt, and a thinning parameter using thin. The thinning parameter is especially important here - this directly reduces the number of iterations that we are saving from each chain. A thinning parameter of 50 means that we are only saving every 50th result.
There are post-hoc methods for selecting your thinning parameters, burn-in, and adaptation period, but that discussion is beyond the scope of SO. For some basic information on what all of these arguments do, there is an excellent answer here: https://stackoverflow.com/a/38875637/9598813. For now, I've provided values that will allow this code to run on your entire dataset, but I recommend that you carefully select the values that you use for your final analysis.
Using tidybayes
The following solution uses the tidybayes package. This provides a clean output and allows us to neatly row-bind all of the coefficient summaries into a single dataframe. Note that we use coda.samples() instead of jags.samples(), because this provides a more universal MCMC object that we can pass to spread_draws(). We also use dplyr::group_split() which is slightly more computationally efficient than split().
library("rjags")
library("coda")
library("tidybayes")
set.seed(5672)
result <- df %>% group_split(clustersize) %>% map(~{
dataForJags <- list(dens=.x$dens, db=.x$db, N=length(.x$dens))
# Declare model structure
mod1 <- jags.model(textConnection(model),
data=dataForJags,
n.chains=2)
# samples returns a list of mcmc objects
samples<-coda.samples(model=mod1,
variable.names=c("beta1","int","mu","tau"),
n.burn=10000,
n.adapt=5000,
n.iter=25000,
thin=50
)
# Extract individual draws
samp <- spread_draws(samples, beta1)
# Summarize 95% credible intervals
coeff_output <- spread_draws(samples, beta1) %>%
median_qi(beta1)
list(samples = samp, coeff_output = coeff_output)
}) %>% transpose()
# List of sample objects
result$samples
# Dataframe of coefficient estimates and 95% credible intervals
result_coeff <- bind_rows(result$coeff_output, .id = "clustersize")
You can use map from purrr package and split over the different clustersize:
library(rjags)
library(coda)
library(purrr)
set.seed(5678)
set.seed(5672)
df<- data.frame(clustersize= c(rep(1, 4), rep(2,4), rep(3, 3)),
lepsp= c("A", "B", "C", "D", "B", "C", "D", "E", "A", "D", "F"),
dens= round(runif(11, c(0, 1)), 3),
db= sample(1:10, 11, replace=TRUE))
model<-"model{
for(i in 1:N){
dens[i] ~ dnorm(mu[i], tau)
# identity
mu[i] <- int + beta1*db[i]
}
tau ~ dgamma(0.1,0.1)
int ~ dnorm(0, 0.001)
beta1 ~ dnorm(0, 0.001)
}"
# split data for different clustersize and calculate result
result <- df %>% split(.$clustersize) %>% map(~{
dataForJags <- list(dens=.x$dens, db=.x$db, N=length(.x$dens))
##compile
mod1 <- jags.model(textConnection(model),data= dataForJags,n.chains=2)
##samples returns a list of mcarray objects
samples<-jags.samples(model= mod1,variable.names=c("beta1","int","mu","tau"),n.iter=100000)
coeff_output<- data.frame(as.list(round(quantile(samples$beta1[,,],probs=c(0.5,0.025,0.975)),3)))
list(samples = samples, coeff_output = coeff_output)
}) %>% transpose()
result$samples
result$coeff_output
Note the use of purrr::transpose to transform the final result in a list for samples and a list for coefs as per you request.
Related
I'm running a glmmTMB model with various truncated count distributions (truncated_poisson, truncated_compois, truncated_nbinom1, truncated_nbinom2). When I predict from the model, the values seem to be lower than expected, as if the prediction is not accounting for the truncation. Where am I going wrong? A toy example is provided, showing that predicted values are lower than observed means.
Any advice would be appreciated. Extra points if the advice can extend to the other truncated count distributions (see above) and if it shows how to correctly get the 95% confidence band around the estimated values in these cases.
library(dplyr)
library( extraDistr)
library(glmmTMB)
set.seed(1)
df <- data.frame(Group = rep(c("a", "b"), each = 20), N = rtpois(40, 1, a = 0), ran = "a") %>%
mutate(N = ifelse(N == 0, 1, N))
m <- glmmTMB(N ~ Group + (1|ran), data = df, family = "truncated_poisson")
df %>% group_by(Group) %>% summarize(mean(N))
predict(m, newdata = data.frame(Group = c("a", "b"), ran = NA), type = "response")
I think the main issue is probably that you're using a slightly older version of glmmTMB (< 1.1.5, where a bug was fixed, see e.g. e.g. https://github.com/glmmTMB/glmmTMB/issues/860).
sample data
streamlined slightly (we don't need to include a random effect for this example), and adding a truncated nbinom2.
library(dplyr)
library(extraDistr)
library(glmmTMB)
set.seed(1)
df <- data.frame(Group = rep(c("a", "b"), each = 20),
Np = rtpois(40, 1, a = 0))
## clunky trunc nbinom generator
tnb <- rep(0, 40)
z <- (tnb==0)
while(any(z)) {
tnb[z] <- rnbinom(sum(z), mu = 1, size = 1)
z <- (tnb==0)
}
df$Nnb <- tnb
## summarize
df %>% group_by(Group) %>% summarize(across(starts_with("N"), mean))
## Group Np Nnb
## 1 a 1.75 1.8
## 2 b 1.45 2.35
fit models
m1 <- glmmTMB(Np ~ Group, data = df, family = "truncated_poisson")
m2 <- update(m1, Nnb ~ ., family = truncated_nbinom2)
Predicting with se.fit = TRUE will give you standard errors for the predictions, from which you can compute confidence intervals (assuming Normality/Wald intervals/blah blah blah ...) ...
pfun <- function(m, level = 0.95) {
pp <- predict(m, newdata = data.frame(Group = c("a", "b")),
type = "response",
se.fit = TRUE)
list(est = unname(pp$fit),
lwr = unname(pp$fit + qnorm((1-level)/2)*pp$se.fit),
upr = unname(pp$fit + qnorm((1+level)/2)*pp$se.fit))
}
pfun(m1)
pfun(m2)
I want to use bs function for numerical variables in my dataset when fitting a logistic regression model.
df <- data.frame(a = c(0,1), b = c(0,1), d = c(0,1), e = c(0,1),
f= c("m","f"), output = c(0,1))
library(splines)
model <- glm(output~ bs(a, df=2)+ bs(b, df=2)+ bs(d, df=2)+ bs(e, df=2)+
factor(f) ,
data = df,
family = "binomial")
In my actual dataset, I need to apply bs() to way more columns than this example. Is there a way I can do this without writing all the terms?
We can use some string manipulation with sprintf, together with reformulate:
predictors <- c("a", "b", "d", "e")
bspl.terms <- sprintf("bs(%s, df = 2)", predictors)
other.terms <- "factor(f)"
form <- reformulate(c(bspl.terms, other.terms), response = "output")
#output ~ bs(a, df = 2) + bs(b, df = 2) + bs(d, df = 2) + bs(e,
# df = 2) + factor(f)
If you want to use a different df and degree for each spline, it is also straightforward (note that df can not be smaller than degree).
predictors <- c("a", "b", "d", "e")
dof <- c(3, 4, 3, 6)
degree <- c(2, 2, 2, 3)
bspl.terms <- sprintf("bs(%s, df = %d, degree = %d)", predictors, dof, degree)
other.terms <- "factor(f)"
form <- reformulate(c(bspl.terms, other.terms), response = "output")
#output ~ bs(a, df = 3, degree = 2) + bs(b, df = 4, degree = 2) +
# bs(d, df = 3, degree = 2) + bs(e, df = 6, degree = 3) + factor(f)
Prof. Ben Bolker: I was going to something a little bit fancier, something like predictors <- setdiff(names(df)[sapply(df, is.numeric)], "output").
Yes. This is good for safety. And of course, an automatic way if OP wants to include all numerical variables other than "output" as predictors.
I am trying to simulate data and adjust the model for choice-based conjoint analysis (mlogit). However, I am getting the error (Error in solve.default (H, g [! Fixed])). I believe it is because sometimes I have more than one choice for the same respondent, but I understand that this is one of the possibilities of this model and so I could not find a solution, someone who has worked with discrete choice and used the mlogit package can help me? Below is my code:
rm(list = ls())
cat("\014")
library(dplyr)
library(conjoint)
set.seed(0)
n <- 1000
#- create dummy data
data = expand.grid(Cor = c("black", "white"),
Brand = c("X", "Y"),
Price = c("low", "high"))
levn <- rbind("black", "white","X", "Y", "low", "high")
data$trat <- c("A", "B", "C", "D", "E", "F", "G", "H")
UA <- 7
UB <- 6.5
UC <- 6
UD <- 5.5
UE <- 5
UF <- 4.5
UG <- 4
UH <- 3.5
data$utility <- c(UA, UB, UC, UD, UE, UF, UG, UH)
data <- bind_rows(replicate(n, data, simplify = FALSE))
erro <- rnorm(n)
data$erro <- erro
data$determinist <- floor(rowSums(data[,5:6]))
data$id <- rep(1:n, each = 8)
data <- data %>% group_by(id) %>% mutate(determinist = (determinist == max(determinist)))
data$choice <- ifelse(data$determinist=="TRUE",1,0)
library(mlogit)
cbc.mlogit <- mlogit.data(data=data, choice="choice", shape="long", varying=1:3, alt.levels=paste("pos", 1:8), id.var="id")
cbc.ml <- mlogit(choice ~ 0 + Cor + Brand + Price, data = cbc.mlogit)
Using the nnet package I got a solution, however, I don't think it's correct, because in the choice-based joint analysis the intercept is zero.
library("nnet")
model <- multinom(choice ~ Cor + Brand + Price, data = cbc.mlogit)
summary(model)
Already researched other posts and could not solve, any help is welcome!
I have a model structured as follows, and I would like to extract the predicted values while ignoring the random effect. As specified in ?predict.gam and here, I am using the exclude argument, but I am getting an error. Where is my mistake?
dt <- data.frame(n1 = runif(500, min=0, max=1),
n2 = rep(1:10,50),
n3 = runif(500, min=0, max=2),
n4 = runif(500, min=0, max=2),
c1 = factor(rep(c("X","Y"),250)),
c2 = factor(rep(c("a", "b", "c", "d", "e"), 100)))
mod = gam(n1 ~
s(n2, n3, n4, by=c1) +
s(c2, bs="re"),
data=dt)
newd=data.table(expand.grid(n1=seq(min(dt$n1), max(dt$n1), 0.5),
n2=1:10,
n3=seq(min(dt$n3), max(dt$n3), 0.5),
n4=seq(min(dt$n4), max(dt$n4), 0.5),
c1=c("X", "Y")))
newd$pred <- predict.gam(mod, newd, exclude = "s(c2)")
In predict.gam(mod, newd, exclude = "s(c2)"): not all required variables have been supplied in newdata!
exclude does not work in the way as you assumed. You still need to provide all variables in your newd for predict.gam. See my this answer for what is behind predict.gam.
Here is what you need to do:
## pad newd with an arbitrary value for variable c2
newd$c2 <- "a"
## termwise prediction
pt <- predict.gam(mod, newd, type = "terms", exclude = "s(c2)")
## linear predictor without random effect
lp_no_c2 <- rowSums(pt) + attr(pt, "constant")
I'm currently migrating from matlab to R, and trying to find out if what I want to do is possible.
I want to estimate a non-linear model in R where the observations are US states. The wrinkle is that one of the independent variables is a state-level index over counties, calculated using a parameter to be estimated, i.e. the model looks like this:
log(Y_s) = log(phi) + log(f(theta, X_cs)) + u_s
where Y_s is a state-level variable and X_cs is a vector containing county-level observations of a variable within the state, and f() returns a scalar value of the index calculated for the state.
So far I've tried using R's nls function while transforming the data as it's passed to the function. Abstracting from the details of the index, a simpler version of the code looks like this:
library(dplyr)
state <- c("AK", "AK", "CA", "CA", "MA", "MA", "NY", "NY")
Y <- c(3, 3, 5, 5, 6, 6, 4, 4)
X <- c(4, 5, 2, 3, 3, 5, 3, 7)
Sample <- data.frame(state, Y, X)
f <- function(data, theta) {
output <- data %>%
group_by(state) %>%
summarise(index = mean(X**theta),
Y = mean(Y))
}
model <- nls(Y ~ log(phi) + log(index),
data = f(Sample, theta),
start = list(phi = exp(3), theta = 1.052))
This returns an error, telling me that the gradient is singular. My guess is it's because R can't see how the parameter theta should be used in the formula.
Is there a way to do this using nls? I know I could define the criterion function to be minimised manually, i.e. log(Y_s) - log(phi) - log(f(theta, X_cs)), and use a minimisation routine to estimate the parameter values. But I want to use the postestimation features of nls, like having a confidence interval for the parameter estimates. Any help much appreciated.
Sorry, I refuse to install that ginormous meta package. Thus, I use base R:
state <- c("AK", "AK", "CA", "CA", "MA", "MA", "NY", "NY")
Y <- c(3, 3, 5, 5, 6, 6, 4, 4)
X <- c(4, 5, 2, 3, 3, 5, 3, 7)
Sample <- data.frame(state, Y, X)
f <- function(X, state, theta) {
ave(X, state, FUN = function(x) mean(x^theta))
}
model <- nls(Y ~ log(phi) + log(f(X, state, theta)),
data = Sample, weights = 1/ave(X, state, FUN = length),
start = list(phi = exp(3), theta = 1.052))
summary(model)
#Formula: Y ~ log(phi) + log(f(X, state, theta))
#
#Parameters:
# Estimate Std. Error t value Pr(>|t|)
#phi 2336.867 4521.510 0.517 0.624
#theta -2.647 1.632 -1.622 0.156
#
#Residual standard error: 0.7791 on 6 degrees of freedom
#
#Number of iterations to convergence: 11
#Achieved convergence tolerance: 3.722e-06