I'm using Stan (specifically, rstan) to fit a Bayesian univariate linear regression $y = \beta_0 + \beta_1 x + \varepsilon$, and I'm trying to use the coda package to visualize the resulting trajectories and distributions for the $\beta$s. However, this produces the error Error in plot.new() : figure margins too large. traceplot and densplot seem to work fine; the problem seems to be with plot.mcmc, which is supposed to produce a nice panel output. You can see an example of the expected output here, on the slide "Traceplots and Density Plots".
Here's a minimal (non-)working example using the mtcars dataset:
library(rstan)
library(coda)
stanmodel <- "
data { // Data block: Exogenously given information
int<lower=1> N; // Sample size
vector<lower=1>[N] y; // Response or output.
// [N] means this is a vector of length N
vector<lower=0, upper=1>[N] x; // The single regressor; either 0 or 1
}
parameters { // Parameter block: Unobserved variables to be estimated
vector[2] beta; // Regression coefficients
real<lower=0> sigma; // Standard deviation of the error term
}
model { // Model block: Connects data to parameters
vector[N] yhat; // Regression estimate for y
yhat <- beta[1] + x*beta[2];
// Priors
beta ~ normal(0, 10);
// To plot in R: plot(function (x) {dnorm(x, 0, 10)}, -30, 30)
sigma ~ cauchy(0, 5); // With sigma bounded at 0, this is half-cauchy
// http://en.wikipedia.org/wiki/Cauchy_distribution
// To plot in R: plot(function (x) {dcauchy(x, 0, 5)}, 0, 10)
// Likelihood
y ~ normal(yhat, sigma); // yhat is the estimator, plus the N(0, sigma^2) error
// Note that Stan uses standard deviation
}
"
# Designate data
nobs <- nrow(mtcars)
y <- mtcars$mpg
x <- mtcars$am # Simple regression version doesn't include constant
data <- list(
N = nobs, # Sample size or number of observations
y = y, # The response or output
x = x # The single variable regressor, transmission type
)
# Set a seed for the random number generator
set.seed(123456)
# Run the model
bayes = stan(
model_code = stanmodel,
data = data, # Use the model and data we just defined
iter = 12000, # We're going to take 12,000 draws from the posterior,
warmup = 2000, # But throw away the first 2,000
thin = 10, # And only keep every tenth draw.
chains = 3 # And we'll repeat this in each of 3 chains.
)
# Use the coda library to visualize parameter trajectories and distributions
param_samples <-
as.data.frame(bayes)[,c('beta[1]', 'beta[2]')]
plot(as.mcmc(param_samples))
I am trying to write my own gradient boosting algorithm. I understand there are existing packages like gbm and xgboost, but I wanted to understand how the algorithm works by writing my own.
I am using the iris data set, and my outcome is Sepal.Length (continuous). My loss function is mean(1/2*(y-yhat)^2) (basically the mean squared error with 1/2 in front), so the corresponding negative gradient, the pseudo-residual, is just y - yhat. I'm initializing the predictions at 0.
library(rpart)
data(iris)
#Define gradient
grad.fun <- function(y, yhat) {return(y - yhat)}
mod <- list()
grad_boost <- function(data, learning.rate, M, grad.fun) {
# Initialize fit to be 0
fit <- rep(0, nrow(data))
grad <- grad.fun(y = data$Sepal.Length, yhat = fit)
# Initialize model
mod[[1]] <- fit
# Loop over a total of M iterations
for(i in 1:M){
# Fit base learner (tree) to the gradient
tmp <- data$Sepal.Length
data$Sepal.Length <- grad
base_learner <- rpart(Sepal.Length ~ ., data = data, control = rpart.control(maxdepth = 2)) # control must be an rpart.control() list, not a string
data$Sepal.Length <- tmp
# Fitted values by fitting current model
fit <- fit + learning.rate * as.vector(predict(base_learner, newdata = data))
# Update gradient
grad <- grad.fun(y = data$Sepal.Length, yhat = fit)
# Store current model (index is i + 1 because index 1 contains the initialized estimates)
mod[[i + 1]] <- base_learner
}
return(mod)
}
With this, I split up the iris data set into a training and testing data set and fit my model to it.
train.dat <- iris[1:100, ]
test.dat <- iris[101:150, ]
learning.rate <- 0.001
M = 1000
my.model <- grad_boost(data = train.dat, learning.rate = learning.rate, M = M, grad.fun = grad.fun)
Now I calculate the predicted values from my.model. For my.model, the fitted values are 0 (the vector of initial estimates) + learning.rate * predictions from tree 1 + learning.rate * predictions from tree 2 + ... + learning.rate * predictions from tree M.
yhats.mymod <- apply(sapply(2:length(my.model), function(x) learning.rate * predict(my.model[[x]], newdata = test.dat)), 1, sum)
# Calculate RMSE
> sqrt(mean((test.dat$Sepal.Length - yhats.mymod)^2))
[1] 2.612972
I have a few questions:
Does my gradient boosting algorithm look right?
Did I calculate the predicted values yhats.mymod correctly?
Yes, this looks correct. At each step you are fitting to the pseudo-residuals, which are computed as the negative derivative of the loss with respect to the current fit. You have correctly derived this gradient at the start of your question, and even bothered to get the factor of 2 right.
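To spell that out: with $L(y, \hat{y}) = \tfrac{1}{2}(y - \hat{y})^2$, the pseudo-residual is $-\partial L/\partial \hat{y} = y - \hat{y}$, which is exactly what grad.fun returns.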
This also looks correct. You are aggregating across the models, weighted by learning rate, just as you did during training.
But to address something that was not asked, I noticed that your training setup has a few quirks.
The iris dataset is split equally between 3 species (setosa, versicolor, virginica), and these are adjacent in the data. Your training data has all of the setosa and versicolor rows, while the test set has all of the virginica examples. There is no overlap, so you are testing on a species the model never saw during training, which will lead to out-of-sample problems. It is preferable to balance your training and test sets to avoid this, e.g. with the random split sketched below.
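A minimal sketch of such a split (random sampling instead of taking the first 100 rows):
set.seed(1)                      # for reproducibility
idx <- sample(nrow(iris), 100)   # 100 random row indices
train.dat <- iris[idx, ]         # all three species now appear in both sets
test.dat  <- iris[-idx, ]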
The combination of learning rate and model count looks too low to me. The unexplained fraction of the signal shrinks as (1 - lr)^n, so with lr = 1e-3 and n = 1000 you can only model 1 - (1 - 0.001)^1000 ≈ 63.2% of the data's magnitude. That is, even if every tree predicts every sample correctly, you would be estimating only 63.2% of the correct value. Initializing the fit with an average, instead of 0s, would help, since then the shrinkage acts as a regression to the mean instead of just a drag toward 0.
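A minimal sketch of that change inside grad_boost (only the differing lines shown; init is a name introduced here):
init <- mean(data$Sepal.Length)   # start from the training mean rather than 0
fit  <- rep(init, nrow(data))
mod[[1]] <- fit                   # the stored "model 1" is now this constant fit
Predictions then have to start from the same constant, i.e. add init to the learning.rate-weighted tree predictions in your aggregation step.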
I am fitting a Weibull model to discrete values using JAGS in R. I have no problem fitting a Weibull to continuous data, but I run into trouble when I switch to discrete values.
Here is some data, and code to fit a weibull model in JAGS:
#draw data from a weibull distribution
y <- rweibull(200, shape = 1, scale = 0.9)
#y <- round(y)
#load jags, specify a jags model.
library(runjags)
j.model ="
model{
for (i in 1:N){
y[i] ~ dweib(shape[i], scale[i])
shape[i] <- b1
scale[i] <- b2
}
#priors
b1 ~ dnorm(0, .0001) I(0, )
b2 ~ dnorm(0, .0001) I(0, )
}
"
#load data as list
data <- list(y=y, N = length(y))
#run jags model.
jags.out <- run.jags(j.model,
data=data,
n.chains=3,
monitor=c('b1','b2')
)
summary(jags.out)
This model fits fine. However, if I transform the y values to discrete values using y <- round(y) and run the same model, it fails with the error Error in node y[7], Node inconsistent with parents. The particular number of the node changes every time I try, but it's always a low number.
I know I can make this run by adding a very small number to all of my values; however, this does not account for the fact that the data are discrete. I know discrete Weibull distributions exist, but how can I implement one in JAGS?
You can use the 'ones trick' to implement a discrete Weibull distribution in JAGS. Using the pmf here, we can make a function to generate some data:
pmf_weib <- function(x, scale, shape){
exp(-(x/scale)^shape) - exp(-((x+1)/scale)^shape)
}
# probability of getting 0 through 200 with scale = 7 and shape = 4
probs <- pmf_weib(seq(0,200), 7, 4)
y <- sample(0:200, 100, TRUE, probs ) # sample from those probabilities
For the 'ones trick' to work, you generally have to divide your new pmf by some large constant to ensure that the probability passed to dbern stays between 0 and 1. While it appears that the pmf of the discrete Weibull already ensures this, we have still added a large constant C in the model anyway. So, here is what the model looks like now:
j.model ="
data{
C <- 10000
for(i in 1:N){
ones[i] <- 1
}
}
model{
for (i in 1:N){
discrete_weib[i] <- exp(-(y[i]/scale)^shape) - exp(-((y[i]+1)/scale)^shape)
ones[i] ~ dbern(discrete_weib[i]/C)
}
#priors
scale ~ dnorm(0, .0001) I(0, )
shape ~ dnorm(0, .0001) I(0, )
}
"
Note that we 1) added a vector of ones and a large constant C in the data block, 2) computed the pmf of the discrete Weibull for each observation, and 3) ran that scaled probability through a Bernoulli trial.
You can fit the model with essentially the same code you have above; only the data list and the monitored parameters change. A minimal sketch of the call, reusing the y simulated above:
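data <- list(y = y, N = length(y))
jags.out <- run.jags(j.model,
                     data = data,
                     n.chains = 3,
                     monitor = c('scale', 'shape'))
summary(jags.out)
The summary shows that the model successfully recovered the parameter values (scale = 7 and shape = 4):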
Lower95 Median Upper95 Mean SD Mode MCerr MC%ofSD SSeff
scale 6.968277 7.289216 7.629413 7.290810 0.1695400 NA 0.001364831 0.8 15431
shape 3.843055 4.599420 5.357713 4.611583 0.3842862 NA 0.003124576 0.8 15126
This is practically a repeat of this question. However, I want to ask a very specific question about plotting the decision boundary line based on the perceptron coefficients I got from a rudimentary "manual" coding experiment. As you can see, the coefficients extracted from a logistic regression result in a nice decision boundary line:
based on the glm() results:
(Intercept) test1 test2
1.718449 4.012903 3.743903
The coefficients on the perceptron experiment are radically different:
bias test1 test2
9.131054 19.095881 20.736352
To facilitate an answer, here is the data, and here is the code:
# DATA PRE-PROCESSING:
dat = read.csv("perceptron.txt", header=F)
dat[,1:2] = apply(dat[,1:2], MARGIN = 2, FUN = function(x) scale(x)) # scaling the data
data = data.frame(rep(1,nrow(dat)), dat) # introducing the "bias" column
colnames(data) = c("bias","test1","test2","y")
data$y[data$y==0] = -1 # Turning 0/1 dependent variable into -1/1.
data = as.matrix(data) # Turning data.frame into matrix to avoid mmult problems.
# PERCEPTRON:
set.seed(62416)
no.iter = 1000 # Number of loops
theta = rnorm(ncol(data) - 1) # Starting a random vector of coefficients.
theta = theta/sqrt(sum(theta^2)) # Normalizing the vector.
h = theta %*% t(data[,1:3]) # Performing the first f(theta^T X)
for (i in 1:no.iter){ # We will recalculate 1,000 times
for (j in 1:nrow(data)){ # Each time we go through each example.
if(h[j] * data[j, 4] < 0){ # If the hypothesis disagrees with the sign of y,
theta = theta + (sign(data[j,4]) * data[j, 1:3]) # We + or - the example from theta.
}
else
theta = theta # Else we let it be.
}
h = theta %*% t(data[,1:3]) # Calculating h() after iteration.
}
theta # Final coefficients
mean(sign(h) == data[,4]) # Accuracy
QUESTION: How to plot the boundary line (as I did above using the logistic regression coefficients) if we only have the perceptron coefficients?
Well... It turns out that it is exactly the same as in the case of logistic regression, despite the widely different coefficients: pick the minimum and maximum of the abscissa (test1), add a slight margin, calculate the corresponding test2 values on the decision boundary (where $0 = \theta_0 + \theta_1\,\text{test1} + \theta_2\,\text{test2}$, i.e. $\text{test2} = -(\theta_0 + \theta_1\,\text{test1})/\theta_2$), and draw the line between the points:
palette(c("tan3","purple4"))
plot(test2 ~ test1, col = as.factor(y), pch = 20, data=data,
main="College admissions")
(x = c(min(data[,2])-.2, max(data[,2])+ .2))
(y = c((-1/theta[3]) * (theta[2] * x + theta[1])))
lines(x, y, lwd=3, col=rgb(.7,0,.2,.5))
Perceptron weights are calculated so that when theta^T X > 0 the example is classified as positive, and when theta^T X < 0 it is classified as negative. This means theta^T X = 0 is the decision boundary for the perceptron.
The same logic applies to logistic regression, except the decision rule is now sigmoid(theta^T X) > 0.5, which holds exactly when theta^T X > 0.
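A quick numerical sanity check of that equivalence (a sketch; sigmoid is a helper defined here, not part of the code above):
sigmoid <- function(t) 1 / (1 + exp(-t))   # the logistic function
sigmoid(0)   # 0.5: the logistic rule flips exactly at theta^T X = 0,
             # the same point where the perceptron's sign() flips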
I am trying to get a perceptron algorithm for classification working but I think something is missing. This is the decision boundary achieved with logistic regression:
The red dots got into college, after performing better on tests 1 and 2.
This is the data, and this is the code for the logistic regression in R:
dat = read.csv("perceptron.txt", header=F)
colnames(dat) = c("test1","test2","y")
plot(test2 ~ test1, col = as.factor(y), pch = 20, data=dat)
fit = glm(y ~ test1 + test2, family = "binomial", data = dat)
coefs = coef(fit)
(x = c(min(dat[,1])-2, max(dat[,1])+2))
(y = c((-1/coefs[3]) * (coefs[2] * x + coefs[1])))
lines(x, y)
The code for the "manual" implementation of the perceptron is as follows:
# DATA PRE-PROCESSING:
dat = read.csv("perceptron.txt", header=F)
dat[,1:2] = apply(dat[,1:2], MARGIN = 2, FUN = function(x) scale(x)) # scaling the data
data = data.frame(rep(1,nrow(dat)), dat) # introducing the "bias" column
colnames(data) = c("bias","test1","test2","y")
data$y[data$y==0] = -1 # Turning 0/1 dependent variable into -1/1.
data = as.matrix(data) # Turning data.frame into matrix to avoid mmult problems.
# PERCEPTRON:
set.seed(62416)
no.iter = 1000 # Number of loops
theta = rnorm(ncol(data) - 1) # Starting a random vector of coefficients.
theta = theta/sqrt(sum(theta^2)) # Normalizing the vector.
h = theta %*% t(data[,1:3]) # Performing the first f(theta^T X)
for (i in 1:no.iter){ # We will recalculate 1,000 times
for (j in 1:nrow(data)){ # Each time we go through each example.
if(h[j] * data[j, 4] < 0){ # If the hypothesis disagrees with the sign of y,
theta = theta + (sign(data[j,4]) * data[j, 1:3]) # We + or - the example from theta.
}
else
theta = theta # Else we let it be.
}
h = theta %*% t(data[,1:3]) # Calculating h() after iteration.
}
theta # Final coefficients
mean(sign(h) == data[,4]) # Accuracy
With this, I get the following coefficients:
bias test1 test2
9.131054 19.095881 20.736352
and an accuracy of 88%, consistent with that calculated with the glm() logistic regression function: mean(sign(predict(fit))==data[,4]) gives 89%. Logically, there is no way of linearly classifying all of the points, as is obvious from the plot above. In fact, iterating only 10 times and plotting the accuracy shows that ~90% is reached after just one iteration:
Since this is in line with the training classification performance of logistic regression, the code is likely not conceptually wrong.
QUESTIONS: Is it OK to get coefficients so different from the logistic regression:
(Intercept) test1 test2
1.718449 4.012903 3.743903
This is really more of a CrossValidated question than a StackOverflow question, but I'll go ahead and answer.
Yes, it's normal and expected to get very different coefficients, because the magnitudes of the coefficients can't be compared directly between these 2 techniques.
With the logit (logistic) model you're assuming a binomial distribution with a logit link, and the coefficients are maximum-likelihood estimates that are only meaningful in that context.
None of this is true for the perceptron model. Its weights are only determined up to a positive scale factor: multiplying theta by any positive constant leaves every sign(theta^T X) classification, and hence the decision boundary, unchanged. The interpretation of the coefficients is thus totally different.
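You can verify that scale invariance directly with the objects from the code above (a quick sketch):
h1 <- theta %*% t(data[, 1:3])          # scores with the fitted weights
h2 <- (10 * theta) %*% t(data[, 1:3])   # scores with arbitrarily rescaled weights
all(sign(h1) == sign(h2))               # TRUE: identical classifications either way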
Now, none of this says anything about which model is better; there aren't comparable performance metrics in your question that would allow us to determine that. To judge that, you should do cross-validation or at least use a holdout sample.
I'm trying a simple Gamma GLM in Stan with R, but the R session crashes immediately.
Generate the data:
set.seed(1)
library(rstan)
N<-500 #sample size
dat<-data.frame(x1=runif(N,-1,1),x2=runif(N,-1,1))
#the model
X<-model.matrix(~.,dat)
K<-dim(X)[2] #number of regression params
#the regression slopes
betas<-runif(K,-1,1)
shape <- 10
#simulate gamma data
mus<-exp(X%*%betas)
y<-rgamma(500,shape=shape,rate=shape/mus)
This is my Stan model:
model_string <- "
data {
int<lower=0> N; //the number of observations
int<lower=0> K; //the number of columns in the model matrix
matrix[N,K] X; //the model matrix
vector[N] y; //the response
}
parameters {
vector[K] betas; //the regression parameters
real<lower=0, upper=1000> shape; //the shape parameter
}
model {
y ~ gamma(shape, (shape/exp(X * betas)));
}"
When I run this model, R immediately crashes:
m <- stan(model_code = model_string, data = list(X=X, K=3, N=500, y=y), chains = 1, cores=1)
Update: I think the problem is somewhere in the vectorization, since I can get a running model when I pass every column of X as a separate vector.
Update 2: this also works:
for(i in 1:N)
y[i] ~ gamma(shape, (shape / exp(X[i,] * betas)));
The problem with the original code is that there is no operator defined in Stan for a scalar divided by a vector, which is what
shape / exp(X * betas)
asks for. You can work around it by expanding the scalar into a vector with rep_vector() and using the elementwise division operator ./ :
rep_vector(shape, N) ./ exp(X * betas)
or, failing that, fall back to the explicit loop from your second update.
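A sketch of the corrected model block with that fix in place (everything else as in the question):
model_string <- "
data {
  int<lower=0> N;                  //the number of observations
  int<lower=0> K;                  //the number of columns in the model matrix
  matrix[N,K] X;                   //the model matrix
  vector[N] y;                     //the response
}
parameters {
  vector[K] betas;                 //the regression parameters
  real<lower=0, upper=1000> shape; //the shape parameter
}
model {
  // rep_vector(shape, N) expands the scalar shape to a length-N vector,
  // so the elementwise division ./ is well defined
  y ~ gamma(shape, rep_vector(shape, N) ./ exp(X * betas));
}"
m <- stan(model_code = model_string, data = list(X=X, K=3, N=500, y=y), chains = 1, cores=1)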