I am trying to use k-fold CV to evaluate a model fitted with brms, and I feel like I'm missing something. As a reproducible example, my data are structured as a binary response (0, 1) that depends on the length of an individual. Here is some code to generate and plot data similar to those I am working with:
library(brms)
library(tidyverse)
library(loo)
length <- seq(0, 100, by = 1)
n_fish_per_length <- 10
a0 <- -48
a1 <- 2
a2 <- -0.02
prob <- plogis(a0 + a1 * length + a2 * length^2)
plot(length, prob , type = 'l')
sim_data <-
expand_grid(fish_id = seq_len(n_fish_per_length),
length = length) %>%
mutate(prob_use = plogis(a0 + a1 * length + a2 * length^2)) %>%
mutate(is_carp = rbinom(n = n(), size = 1, prob= prob_use))
ggplot(sim_data, aes(x = length, y = is_carp)) +
geom_jitter(width = 0, height = 0.05) +
geom_smooth(method = "glm", formula = y ~ x + I(x^2),
method.args = list(family = binomial(link = "logit")))
I then use brms to run my model.
Bayes_Model_Binary <- brm(formula = is_carp ~ length + I(length^2),
data=sim_data,
family = bernoulli(link = "logit"),
warmup = 2500,
iter = 5000,
chains = 4,
inits= "0",
cores=4,
seed = 123)
summary(Bayes_Model_Binary)
I'd like to use kfold CV to evaluate the model. I can use something like this:
kfold(Bayes_Model_Binary, K = 10, chains = 1, save_fits = TRUE)
but the response in my data is highly imbalanced (~18% = 1, ~82% = 0) and my reading suggests that I need to use stratified k-fold CV to account for this. If I use:
sim_data$fold <- kfold_split_stratified(K = 10, x = sim_data$is_carp)
the data are split the way I would expect but I'm not sure what the best way is to move forward with the CV process from here. I saw this post https://mc-stan.org/loo/articles/loo2-elpd.html, but I'm not sure how to modify this to work with a brmsfit object. Alternatively, it appears that I should be able to use:
kfold(Bayes_Model_Binary, K = 10, folds = 'stratified', group = sim_data$is_carp)
but this throws an error, likely because is_carp is the response rather than a predictor in the model. What would my group be in this context? Am I missing or misinterpreting something here? I'm assuming there is a very simple solution that I am overlooking, but I appreciate any thoughts.
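To make the idea of stratification concrete, here is a minimal base-R sketch (not the loo or brms implementation) that assigns fold ids separately within each class and then checks the share of 1s in every test fold. The simulated response `y` and the 0.19 rate are stand-ins for `sim_data$is_carp`.

```r
set.seed(1)
K <- 10
y <- rbinom(1010, size = 1, prob = 0.19)  # stand-in for sim_data$is_carp

# Assign fold ids within each class so every fold gets a near-equal share of 0s and 1s
fold <- integer(length(y))
for (cls in unique(y)) {
  idx <- which(y == cls)
  fold[idx] <- sample(rep(seq_len(K), length.out = length(idx)))
}

# Proportion of 1s in each test fold, next to the overall proportion
prop_by_fold <- tapply(y, fold, mean)
print(round(prop_by_fold, 3))
print(round(mean(y), 3))
```

The brms documentation for kfold() also describes `folds` as accepting a vector of fold ids with one entry per observation, so a vector like `fold` (or the output of kfold_split_stratified()) may be usable there directly; treat that as something to verify against your installed version.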
After some additional digging and learning how to access information about each fold in the analysis, I was able to determine that the structure of the data (proportion of 0s and 1s in the response) is maintained using the default settings in the kfold() function. To do this I used the following code.
First, save the kfold CV analysis as an object.
kfold1 <- kfold(Bayes_Model_Binary, K = 10, save_fits = TRUE)
kfold1$fits is a list of the model fitting results and the observations used in the test data set (omitted) for each fold.
From this information, I created a loop to print the proportion of observations in each training data set where is_carp = 1 (could also do this for each test data set) with the following code.
for(i in 1:10){
print(length(which(sim_data$is_carp[-kfold1$fits[i, ]$omitted] == 1)) /
nrow(sim_data[-kfold1$fits[i, ]$omitted, ]))
}
[1] 0.1859186
[1] 0.1925193
[1] 0.1991199
[1] 0.1914191
[1] 0.1881188
[1] 0.1848185
[1] 0.1936194
[1] 0.1980198
[1] 0.190319
[1] 0.1870187
and it's easy to then compare these proportions with the proportion of observations where is_carp = 1 from the original data set.
length(which(sim_data$is_carp == 1)) / nrow(sim_data)
[1] 0.1910891
I'm trying to estimate the price elasticity of demand for each customer using a GAM, with a model like this:
$$\ln D = \ln P + \ln P \cdot \sum_{i=1}^{20} f(X_i)$$
$$PED = \frac{\partial \ln D}{\partial \ln P} = 1 + \sum_{i=1}^{20} f(X_i)$$
where $D$ is demand, $P$ is the rate, $PED$ is the price elasticity of demand, and the $X_i$ are the customer's variables.
Since $PED$ is not observable, I want to estimate it from the model fitted for log demand using a GAM, but I am having some difficulty working out how to do that.
I tried to extract each spline to calculate PED, but I failed. I know there is a package called gratia with a derivatives function, but I don't understand how to use it to calculate PED.
Once the model to estimate demand is created, I will need to estimate the price elasticity of demand for each customer, but for these customers I don't have the rate variable, only the 20 personal variables.
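As a small numerical illustration of the formula above: because the model is linear in $\ln P$, the derivative with respect to $\ln P$ does not depend on the rate itself, only on the smooth terms. The sketch below checks this with a central finite difference. The three smooth contributions are made-up numbers, not output from any fitted model, and `log_demand` is a hypothetical helper.

```r
# Hypothetical values of f(X_i) for one customer (made-up numbers, 3 terms instead of 20)
f_sum <- sum(c(-0.8, -0.3, 0.1))

# Model from the text: ln D = ln P + ln P * sum_i f(X_i)
log_demand <- function(log_p) log_p + log_p * f_sum

# Central finite difference of ln D with respect to ln P at an arbitrary rate
eps <- 1e-5
log_p0 <- log(2.5)
ped_numeric <- (log_demand(log_p0 + eps) - log_demand(log_p0 - eps)) / (2 * eps)

# Closed form from the text: PED = 1 + sum_i f(X_i)
ped_analytic <- 1 + f_sum

print(c(numeric = ped_numeric, analytic = ped_analytic))
```

Changing `log_p0` leaves both numbers unchanged, which is why, under this model form, the elasticity can in principle be evaluated from the customer variables alone.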
I read some links:
https://stats.stackexchange.com/questions/495775/first-derivative-of-fitted-gam-changes-according-to-specified-model-distribution
https://stats.stackexchange.com/questions/590167/how-can-i-calculate-a-derivative-of-a-global-smooth-and-group-level-smooths-with
https://stats.stackexchange.com/questions/32013/what-is-the-mathematical-model-formula-corresponding-to-this-gam-model-fit-in-r
I would really appreciate any explanation, advice, or other ways to model my data.
Thanks
EDIT
What I've tried:
library(mgcv)
library(marginaleffects)
# create the dataset
A <- sample(x = 0:1000, size = 5000, replace = TRUE)
B <- sample(x = 0:1000, size = 5000, replace = TRUE)
C <- sample(x = 0:1000, size = 5000, replace = TRUE)
D <- sample(x = 0:1000, size = 5000, replace = TRUE)
log.R <- log(rbeta(5000, 5,10)*10) #log rate
log.Y <- log(rgamma(5000, 10, 20)*10000) #log demand
mydata <- data.frame(A, B, C, D, log.R, log.Y)
#the model
model <- gam(log.Y ~ s(A, by=log.R) + s(B, by=log.R) + s(C, by=log.R) + s(D, by=log.R), data = mydata, method = "REML")
mfx <- marginaleffects(model, variables = "log.R", eps = 10^-5)
head(mfx)
mfx returns a 'dydx' column. Is this the elasticity for the data used to fit the model?
And when I apply this model to new data, I get an error:
newdat = data.frame(A = 750, B = 500, C = 398, D = 740)
marginaleffects(model, variables = "log.R", eps = 10^-5, newdata= newdat, slope = 'dydx')
Error: There is no valid predictor variable. Please change the `variables` argument or supply a new data frame to the `newdata` argument.
What should I do?
I'm working with the train() function from the caret package to fit multiple regression and ML models and compare their fit. I'd like to write a function that iterates through all model types and enters the best fit into a data frame. The biggest issue is that caret doesn't provide all the model fit statistics that I'd like, so they need to be derived from the raw output. Based on my exploration, there doesn't seem to be a standardized way in which caret outputs each model's fit.
Another post (sorry, I don't have a link) created this function, which pulls from fit$results and fit$bestTune to get the precalculated RMSE, R^2, etc.
get_best_result <- function(caret_fit) {
best = which(rownames(caret_fit$results) == rownames(caret_fit$bestTune))
best_result = caret_fit$results[best, ]
rownames(best_result) = NULL
best_result
}
One example of another fit statistic I need to calculate from the raw output is BIC. The two functions below do that. The residuals (y_actual - y_predicted) are needed, along with the number of x variables (k) and the number of rows used in the prediction (n). k and n must be derived from the output, not the original dataset, because a model may drop x variables (feature selection) or rows (omitting NAs) depending on its algorithm.
calculate_MSE <- function(residuals){
# residuals can be replaced with y_actual-y_predicted
mse <- mean(residuals^2)
return(mse)
}
calculate_BIC <- function(n, mse, k){
BIC <- n*log(mse)+k*log(n)
return(BIC)
}
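For reference, here is a minimal sketch of how these two functions fit together when the residuals are available, using a plain lm() fit on the built-in mtcars data rather than a caret object. The choice of mtcars and of the predictors is purely illustrative.

```r
calculate_MSE <- function(residuals) mean(residuals^2)
calculate_BIC <- function(n, mse, k) n * log(mse) + k * log(n)

# Plain linear model as a stand-in for a fitted model (illustrative data)
fit <- lm(mpg ~ wt + hp, data = mtcars)

res <- residuals(fit)        # y_actual - y_predicted
n <- length(res)             # rows actually used in the fit
k <- length(coef(fit)) - 1   # predictors, excluding the intercept

bic <- calculate_BIC(n = n, mse = calculate_MSE(res), k = k)
print(bic)
```

For a caret fit, the same residuals could be rebuilt as observed minus predict(fit, newdata), under the assumption that the rows passed to predict() are the ones the model was trained on.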
The real question is: is there a standardized output of caret::train() for the x variables and for y_actual, y_predicted, or the residuals?
I tried fit$finalModel$model and other methods but to no avail.
Here is a reproducible example along with the function I'm using. Please consider the functions above a part of this reproducible example.
library(caret)
library(rlist)
library(data.table)
# data
df <- data.frame(y1 = rnorm(50, 0, 1),
y2 = rnorm(50, .25, 1.5),
x1 = rnorm(50, .4, .9),
x2 = rnorm(50, 0, 1.1),
x3 = rnorm(50, 1, .75))
missing_index <- sample(1:50, 7, replace = F)
df[missing_index,] <- NA
# function to fit models and pull results
fitModels <- function(df, Ys, Xs, models){
# empty list
results <- list()
# number of for loops
loops_counter <- 0
# for every y
for(y in 1:length(Ys)){
# for every model
for(m in 1:length(models)){
# track loops
loops_counter <- loops_counter + 1
# fit the model
set.seed(1) # seed for reproducibility
fit <- tryCatch(train(as.formula(paste(Ys[y], paste(Xs, collapse = ' + '),
sep = ' ~ ')),
data = df,
method = models[m],
na.action = na.omit,
tuneLength = 10),
error = function(e) {return(NA)})
# pull results
results[[loops_counter]] <- c(Y = Ys[y],
model = models[m],
sample_size = nrow(fit$finalModel$model),
RMSE = get_best_result(fit)[[2]],
R2 = get_best_result(fit)[[3]],
MAE = get_best_result(fit)[[4]],
BIC = calculate_BIC(n = length(fit$finalModel),
mse = calculate_MSE(fit$finalModel$residuals),
k = length(fit$finalModel$xNames)))
}
}
# list bind
results_df <- list.rbind(results)
return(results_df)
}
linear_models <- c('lm', 'glmnet', 'ridge', 'lars', 'enet')
fits <- fitModels(df, c('y1', 'y2'), c('x1', 'x2', 'x3'), linear_models)
Hi I'm working on a decision tree.
tree1=tree(League.binary~TME.factor+APM.factor+Wmd.factor,starcraft)
The tree shows a partitioning based solely on APM.factor, and the leaves aren't pure. Here's a screenshot:
I tried creating a tree with a subset with 300 of the 3395 observations and it used more than one variable. What went wrong in the first case? Did it not need the extra two variables so it used only one?
Try playing with the tree.control() parameters, for example setting a very small minsize together with mindev = 0, so that the tree keeps splitting until the leaves hold only one or two observations (overfit), e.g.:
model = tree(y ~ X1 + X2, data = data, control = tree.control(nobs=n, minsize = 2, mindev=0))
Also, try the same thing with the rpart package, which is the "new" version of tree, and see what results you get. You can also inspect the importance of the variables. Here is a syntax example:
install.packages("rpart")
install.packages("rpart.plot")
library(rpart)
library(rpart.plot)
## fit tree
### alt1: class
model = rpart(y ~ X1 + X2, data=data, method = "class")
### alt2: reg
model = rpart(y ~ X1 + X2, data=data, control = rpart.control(maxdepth = 30, minsplit = 1, minbucket = 1, cp=0))
## show model
print(model)
rpart.plot(model, cex=0.5)
## importance
model$variable.importance
Note that since trees do binary splits, it is possible that a single variable explains most or all of the SSR (for regression). Try plotting the response against each regressor to see if there is any significant relation to anything other than the variable the tree is using.
In case you want to run the examples above, here is a data simulation (put it at the beginning of the code):
n = 12000
X1 = runif(n, -100, 100)
X2 = runif(n, -100, 100)
## 1. SQUARE DATA
# y = ifelse( (X1< -50) | (X1>50) | (X2< -50) | (X2>50), 1, 0)
## 2. CIRCLE DATA
y = ifelse(sqrt(X1^2+X2^2)<=50, 0, 1)
## 3. LINEAR BOUNDARY DATA
# y = ifelse(X2<=-X1, 0, 1)
# Create
color = ifelse(y==0,"red","green")
data = data.frame(y,X1,X2,color)
# Plot
data$color = as.character(data$color)
plot(data$X2 ~ data$X1, col = data$color, type='p', pch=15)
I have a linear model with the exchange rate as the dependent variable and 7 independent variables (e.g. inflation, interest rate, etc.). I have quarterly data from 1993Q1 to 2011Q4.
I would like to run a rolling-window regression (with the model above) with a window size of 60 (1993Q1-2007Q4) and use the estimated regression to forecast the rest of the sample. I would also like to compare this model with a random walk model (the exchange rate follows a random walk). Finally, I would like to perform the dm.test and the clarkwest test (the latter does not run). Is my code right?
X = embed(data)
X = as.data.frame(X)
install.packages("foreach")
library(foreach)
w_size=60
n_windows = nrow(X) - w_size #until 2007Q4
forecasts = foreach(i=1:n_windows, .combine = rbind) %do%{
# = Select data for the window (in and out-of-sample) = #
X_in = X[i:(w_size + i - 1), ] # = change to X[1:(w_size + i - 1), ] for expanding window
X_out = X[w_size + i, ]
# = Regression Model = #
m1 = lm(V1 ~ V2+V3+V4+V5+V6+V7+V8, data = X_in)
f1 = predict(m1, X_out)
# = Random Walk = #
f2 = tail(X_in$V1, 1)
return(c(f1, f2))
}
e1 = tail(X[ ,"V1"], nrow(forecasts)) - forecasts[ ,1]
e2 = tail(X[ ,"V1"], nrow(forecasts)) - forecasts[ ,2]
library(tseries)
library(forecast)
dm.test(e1,e2, "l") #p-value is more than 5% for all the cases( two.sided, greater, less)
clarkwest(e1,e2)
It seems like the clarkwest() function is not supported anymore. I recently wrote my own function, CW. Note that I used ordinary standard errors, not Newey-West corrected ones.
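A minimal sketch of what such a function might look like, assuming the usual MSPE-adjusted statistic for nested models and plain (not Newey-West) standard errors. The name `cw_test` and its arguments are hypothetical, and this is not the CW function referred to above; the data are simulated only to show the call.

```r
# y: realised values; f_small: forecasts from the parsimonious model (e.g. random walk);
# f_large: forecasts from the larger model that nests it
cw_test <- function(y, f_small, f_large) {
  e_small <- y - f_small
  e_large <- y - f_large
  # Adjusted loss differential: the large model's MSPE is corrected for estimation noise
  f_adj <- e_small^2 - (e_large^2 - (f_small - f_large)^2)
  n <- length(f_adj)
  stat <- mean(f_adj) / (sd(f_adj) / sqrt(n))  # t-statistic with ordinary standard errors
  list(statistic = stat, p.value = 1 - pnorm(stat))  # one-sided test
}

# Simulated stand-ins, only to demonstrate usage
set.seed(42)
y <- rnorm(16)
f_rw <- rep(0, 16)
f_lm <- 0.5 * y + rnorm(16, sd = 0.5)
res <- cw_test(y, f_rw, f_lm)
print(res)
```

With the objects from the question, the call would presumably be cw_test(tail(X[, "V1"], nrow(forecasts)), forecasts[, 2], forecasts[, 1]), since column 2 holds the random walk forecasts and column 1 the regression forecasts.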
To investigate your loop you could try:
i=1
X_in = X[i:(w_size + i - 1), ] # = change to X[1:(w_size + i - 1), ] for expanding window
X_out = X[w_size + i, ]
# = Regression Model = #
m1 = lm(V1 ~ V2+V3+V4+V5+V6+V7+V8, data = X_in)
f1 = predict(m1, X_out)
# = Random Walk = #
f2 = tail(X_in$V1, 1)
Here you can see the composition the loop creates when i=1
There are two things I need to do. Firstly, I would like to be able to create new variables in a coda mcmc object, calculated from existing variables, so that I can run chain diagnostics on the new variables. Secondly, I would like to be able to index single variables in some of the coda plot functions while still viewing all chains.
Toy data. Bayesian t-test on the sleep data using JAGS and rjags.
library(rjags)
library(ggplot2)
data(sleep)
# read in data
y <- sleep$extra
x <- as.numeric(as.factor(sleep$group))
nTotal <- length(y)
nGroup <- length(unique(x))
mY <- mean(y)
sdY <- sd(y)
# make dataList
dataList <- list(y = y, x = x, nTotal = nTotal, nGroup = nGroup, mY = mY, sdY = sdY)
# model string
modelString <- "
model{
for (oIdx in 1:nTotal) {
y[oIdx] ~ dnorm(mu[x[oIdx]], 1/sigma[x[oIdx]]^2)
}
for (gIdx in 1:nGroup) {
mu[gIdx] ~ dnorm(mY, 1/sdY)
sigma[gIdx] ~ dunif(sdY/10, sdY*10)
}
}
"
writeLines(modelString, con = "tempModel.txt")
# chains
# 1. adapt
jagsModel <- jags.model(file = "tempModel.txt",
data = dataList,
n.chains = 3,
n.adapt = 1000)
# 2. burn-in
update(jagsModel, n.iter = 1000)
# 3. generate
codaSamples <- coda.samples(model = jagsModel,
variable.names = c("mu", "sigma"),
thin = 15,
n.iter = 10000*15/3)
Problem one
If I convert the coda object to a dataframe I can calculate the difference between the estimates for the two groups and plot this new variable, like so...
df <- as.data.frame(as.matrix(codaSamples))
names(df) <- gsub("\\[|\\]", "", names(df), perl = T) # remove brackets
df$diff <- df$mu1 - df$mu2
ggplot(df, aes(x = diff)) +
geom_histogram(bins = 100, fill = "skyblue") +
geom_vline(xintercept = mean(df$diff), colour = "red", size = 1, linetype = "dashed")
...but how do I get a traceplot? I can get one for existing variables within the coda object like so...
traceplot(codaSamples[[1]][,1])
...but I would like to be able to get them for the new diff variable.
Problem Two
Which brings me to the second problem. I would like to be able to get a traceplot (among other things) for individual variables. As I have shown above I can get them for a single variable if I only want to see one chain but I'd like to see all chains. I can see all chains for all variables in the model with the simple
plot(codaSamples)
...but what if I don't want or need to see all variables? What if I just want to see the trace and/or density plots for one, or even two, variables (but not all variables) but with all chains in the plot?
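One possible approach to both problems, sketched under the assumption that the object is an ordinary coda mcmc.list: rebuild each chain with an extra column for the derived variable, and select variables by column name on the mcmc.list itself so that all chains are kept. The chains below are simulated stand-ins for codaSamples, and `with_diff` is a hypothetical name.

```r
library(coda)

# Simulated stand-in for codaSamples: 3 chains, 2 monitored variables, thinning of 15
set.seed(1)
make_chain <- function() {
  mcmc(cbind("mu[1]" = rnorm(200, 0.8), "mu[2]" = rnorm(200, 2.3)), thin = 15)
}
codaSamples <- mcmc.list(make_chain(), make_chain(), make_chain())

# Problem one: add a derived variable to every chain, keeping start and thinning
with_diff <- mcmc.list(lapply(codaSamples, function(chain) {
  draws <- as.matrix(chain)
  mcmc(cbind(draws, diff = draws[, "mu[1]"] - draws[, "mu[2]"]),
       start = start(chain), thin = thin(chain))
}))

# Problem two: column indexing on the mcmc.list keeps all chains
traceplot(with_diff[, "diff"])              # one variable, all chains
plot(with_diff[, c("mu[1]", "diff")])       # trace and density for two variables
```

Because `with_diff` is again an mcmc.list, other coda diagnostics should accept it (or a column subset of it) in the same way as the original samples.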