R: Implementing my own gradient boosting algorithm

I am trying to write my own gradient boosting algorithm. I understand there are existing packages like gbm and xgboost, but I wanted to understand how the algorithm works by writing my own.
I am using the iris data set, and my outcome is Sepal.Length (continuous). My loss function is mean(1/2*(y-yhat)^2) (basically the mean squared error with the 1/2 in front), so the corresponding negative gradient, which the trees are fit to, is just the residual y - yhat. I'm initializing the predictions at 0.
library(rpart)
data(iris)

# Define the gradient of the loss with respect to the current fit
grad.fun <- function(y, yhat) {
  return(y - yhat)
}

grad_boost <- function(data, learning.rate, M, grad.fun) {
  mod <- list()
  # Initialize fit to be 0
  fit <- rep(0, nrow(data))
  grad <- grad.fun(y = data$Sepal.Length, yhat = fit)
  # Initialize model
  mod[[1]] <- fit
  # Loop over a total of M iterations
  for (i in 1:M) {
    # Fit base learner (tree) to the gradient
    tmp <- data$Sepal.Length
    data$Sepal.Length <- grad
    base_learner <- rpart(Sepal.Length ~ ., data = data,
                          control = rpart.control(maxdepth = 2))
    data$Sepal.Length <- tmp
    # Fitted values from the current ensemble
    fit <- fit + learning.rate * as.vector(predict(base_learner, newdata = data))
    # Update gradient
    grad <- grad.fun(y = data$Sepal.Length, yhat = fit)
    # Store current model (index is i + 1 because i = 1 contains the initialized estimates)
    mod[[i + 1]] <- base_learner
  }
  return(mod)
}
With this, I split the iris data set into training and testing sets and fit my model to the training data.
train.dat <- iris[1:100, ]
test.dat <- iris[101:150, ]
learning.rate <- 0.001
M <- 1000
my.model <- grad_boost(data = train.dat, learning.rate = learning.rate, M = M, grad.fun = grad.fun)
Now I calculate the predicted values from my.model. For my.model, the fitted values are 0 (the vector of initial estimates) + learning.rate * predictions from tree 1 + learning.rate * predictions from tree 2 + ... + learning.rate * predictions from tree M.
yhats.mymod <- apply(sapply(2:length(my.model), function(x) learning.rate * predict(my.model[[x]], newdata = test.dat)), 1, sum)
# Calculate RMSE
sqrt(mean((test.dat$Sepal.Length - yhats.mymod)^2))
[1] 2.612972
I have a few questions:

1. Does my gradient boosting algorithm look right?
2. Did I calculate the predicted values yhats.mymod correctly?

1. Yes, this looks correct. At each step you are fitting to the pseudo-residuals, which are computed as the (negative) derivative of the loss with respect to the current fit. You derived this gradient correctly at the start of your question, and even bothered to get the factor of 2 right.
2. This also looks correct. You are aggregating across the models, weighted by the learning rate, just as you did during training.
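As a side note, apply(..., 1, sum) over the sapply result is equivalent to rowSums, which is a little shorter and faster:

# Equivalent, more idiomatic aggregation of the per-tree predictions
yhats.mymod <- rowSums(sapply(2:length(my.model), function(x)
  learning.rate * predict(my.model[[x]], newdata = test.dat)))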
But to address something that was not asked, I noticed that your training setup has a few quirks.
The iris dataset is split equally between 3 species (setosa, versicolor, virginica), and these are adjacent in the data. Your training data has all of the setosa and versicolor examples, while the test set has all of the virginica examples. There is no overlap in species, so the model is evaluated entirely on a group it never saw during training, which makes out-of-sample performance look worse than it should. It is preferable to shuffle before splitting so that both sets contain all three species, for example as sketched below.
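A minimal sketch of a shuffled split (the seed is arbitrary):

set.seed(1)
shuffled <- iris[sample(nrow(iris)), ]
train.dat <- shuffled[1:100, ]
test.dat <- shuffled[101:150, ]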
The combination of learning rate and model count looks too low to me. The unexplained part of the signal shrinks at best like (1 - lr)^n, so after n rounds you can model at most a fraction 1 - (1 - lr)^n of the data magnitude. With lr = 1e-3 and n = 1000 that fraction is about 63.2%. That is, even if every tree predicts every sample perfectly, you would still be estimating only 63.2% of the correct value. Initializing the fit with the average of the outcome, instead of 0s, would help, since then the effect is a regression to the mean instead of just a drag. See the sketch below.
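To make both points concrete, here is a quick check of that fraction plus a sketch of the mean initialization (my addition, assuming the grad_boost function above with its initialization line changed accordingly):

lr <- 0.001; n <- 1000
1 - (1 - lr)^n  # ~0.632: the largest fraction of the signal reachable

# Inside grad_boost, replace fit <- rep(0, nrow(data)) with:
#   fit <- rep(mean(data$Sepal.Length), nrow(data))
# and add the training mean back when predicting:
init <- mean(train.dat$Sepal.Length)
yhats.mymod <- init + rowSums(sapply(2:length(my.model), function(x)
  learning.rate * predict(my.model[[x]], newdata = test.dat)))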

Related

How to specify zero-inflated negative binomial model in JAGS

I'm currently working on constructing a zero-inflated negative binomial model in JAGS to model yearly change in abundance using count data and am currently a bit lost on how best to specify the model. I've included an example of the base model I'm using below. The main issue I'm struggling with is that in the model output I'm getting poor convergence (high Rhat values, low Neff values) and the 95% credible intervals are huge. I realize that without seeing/running the actual data there's probably not much anyone can help with but I thought I'd at least try and see if there are any obvious errors in the way I have the basic model specified. I also tried fitting a variety of other model types (regular negative binomial, Poisson, and zero-inflated Poisson) but decided to go with the ZINB since it had the lowest DIC scores of all the models and also makes the most intuitive sense to me, given my data structure.
library(R2jags)
# Create example dataframe
years <- c(1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2)
sites <- c(1,1,1,2,2,2,3,3,3,1,1,1,2,2,2,3,3,3)
months <- c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3)
# Count data
day1 <- floor(runif(18,0,7))
day2 <- floor(runif(18,0,7))
day3 <- floor(runif(18,0,7))
day4 <- floor(runif(18,0,7))
day5 <- floor(runif(18,0,7))
df <- as.data.frame(cbind(years, sites, months, day1, day2, day3, day4, day5))
# Put count data into array
y <- array(NA,dim=c(2,3,3,5))
for (m in 1:2) {
  for (k in 1:3) {
    sel.rows <- df$years == m & df$months == k
    y[m,k,,] <- as.matrix(df)[sel.rows, 4:8]
  }
}
# JAGS model
sink("model1.txt")
cat("
model {
# PRIORS
for(m in 1:2){
r[m] ~ dunif(0,50)
}
t.int ~ dlogis(0,1)
b.int ~ dlogis(0,1)
p.det ~ dunif(0,1)
# LIKELIHOOD
# ECOLOGICAL SUBMODEL FOR TRUE ABUNDANCE
for (m in 1:2) {
zero[m] ~ dbern(pi[m])
pi[m] <- ilogit(mu.binary[m])
mu.binary[m] <- t.int
for (k in 1:3) {
for (i in 1:3) {
N[m,k,i] ~ dnegbin(p[m,k,i], r)
p[m,k,i] <- r[m] / (r[m] + (1 - zero[m]) * lambda.count[m,k,i]) - 1e-10 * zero[m]
lambda.count[m,k,i] <- exp(mu.count[m,k,i])
log(mu.count[m,k,i]) <- b.int
# OBSERVATIONAL SUBMODEL FOR DETECTION
for (j in 1:5) {
y[m,k,i,j] ~ dbin(p.det, N[m,k,i])
}#j
}#i
}#k
}#m
}#END", fill=TRUE)
sink()
win.data <- list(y = y)
Nst <- apply(y,c(1,2,3),max)+1
inits <- function()list(N = Nst)
params <- c("N")
nc <- 3
nt <- 1
ni <- 50000
nb <- 5000
out <- jags(win.data, inits, params, "model1.txt",
n.chains = nc, n.thin = nt, n.iter = ni, n.burnin = nb,
working.directory = getwd())
print(out)
The way that I have tended to specify zero-inflated models is to model the data as being Poisson distributed, with a mean that is either zero, if that individual is part of the zero-inflated group, or distributed according to a gamma distribution otherwise. Something like:
Obs[i] ~ dpois(lambda[i] * is_zero[i])
is_zero[i] ~ dbern(zero_prob)
lambda[i] ~ dgamma(k, k/mean)
Something similar to this was first used in this paper: https://www.researchgate.net/publication/5231190_The_distribution_of_the_pathogenic_nematode_Nematodirus_battus_in_lambs_is_zero-inflated
These models usually converge OK, although the performance is not as good as for simpler models of course. You also need to make sure to supply initial values for is_zero so that the model starts with all individuals with positive counts in the appropriate group.
In your case, you have multiple timepoints, so you need to decide if the zero-inflation is fixed over time points (i.e. an individual cannot switch to or from zero-inflated group over time), or if each observation is completely independent with respect to zero-inflation status. You also need to decide if you want to have co-variates of year/month/site affecting the mean count (i.e. the gamma part) or the probability of a positive count (i.e. the zero-inflation part). For the former, you need to index mean (in my formulation) by i and then use a GLM-like formula (probably using log link) to relate this to the appropriate covariates. For the latter, you need to index zero_prob by i and then use a GLM-like formula (probably using logit link) to relate this to the appropriate covariates. It is also possible to do both, but if you try to use the same covariates in both parts then you can expect convergence problems!
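For concreteness, a sketch of both kinds of covariate indexing in the same shorthand notation as above (month[i] and the coefficients a0, a1, b0, b1 are illustrative placeholders; priors omitted):

Obs[i] ~ dpois(lambda[i] * is_zero[i])
is_zero[i] ~ dbern(zero_prob[i])
logit(zero_prob[i]) <- a0 + a1 * month[i]  # zero-inflation part
lambda[i] ~ dgamma(k, k / mean[i])
log(mean[i]) <- b0 + b1 * month[i]         # count (gamma) part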
It would arguably be better to replace the separate Poisson-Gamma distributions with a single Negative Binomial distribution using the 'ecology parameterisation' with mean and k. This is not currently implemented in JAGS, but I will add it for the next update.
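In the meantime, the built-in dnegbin can be re-expressed in terms of mean and k using the standard probability-size relationship (a sketch, not a new distribution):

Obs[i] ~ dnegbin(prob[i], k)
prob[i] <- k / (k + mean[i])  # gives E(Obs) = mean[i], Var = mean[i] + mean[i]^2/k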

How to convert one-fold cross-validation to K-fold cross-validation in R

I have a GAM model for which I would like to calculate AUC, TSS (True Skill Statistic) and RMSE through 5-fold cross-validation in R. Unfortunately, the caret package does not support GAM and therefore cannot be used. As I didn’t find any alternative, I tried to build the code for cross-validation myself, and it works well, with the only problem that it is only one-fold cross-validation. Could anybody help me to make this 5-fold? Sorry if this is an elementary question, I am new to R.
library(mgcv)  # gam()
library(pROC)  # roc(), auc()
library(caret) # RMSE(), confusionMatrix()

sample <- sample(c(TRUE, FALSE), nrow(DF), replace=TRUE, prob=c(0.8,0.2))
train <- DF[sample, ]
test <- DF[!sample, ]
predicted <- predict(GAM, test, type="response") # GAM was fitted to train earlier
# Calculating RMSE
RMSE(test$Y, predicted)
# Calculating AUC
auc(test$Y, predicted)
GAM_TSS <- gam(Y ~ X1 + X2 + X3 + X4 + s(X5, k = 3), data = train, family = "binomial")
test$pred <- predict(GAM_TSS, type="response", newdata=test)
roc.curve <- roc(test$Y, test$pred, ci=T)
plot(roc.curve)
threshold <- 0.1
CM <- confusionMatrix(factor(test$pred>threshold), factor(test$P_A==1), positive="TRUE")
CM <- CM$byClass
Sensitivity <- CM[['Sensitivity']]
Specificity <- CM[['Specificity']]
# Calculating TSS
TSS <- Sensitivity + Specificity - 1
TSS
I have come across precisely this problem with GAM in the past. My approach was to create a vector to split data randomly into parts as equally sized as possible, then loop through the fold ids as follows:
k <- 5
FoldID <- rep(1:k, ceiling(nrow(modelData)/k))
length(FoldID) <- nrow(modelData)
FoldID <- sample(FoldID, replace = FALSE)
for (fold in 1:k) {
  train_data <- modelData[FoldID != fold, ]
  val_data <- modelData[FoldID == fold, ]
  # Create training model and predictions
  # Calculate RMSE etc.
  # Add a line with this fold's validation results to a dataframe
}
# Calculate column means of your validation results frame
I will leave you to fill in the gaps to suit your own requirements; a filled-in sketch follows below. It would also be a good idea to add an outer loop (outside the FoldID creation) for repeats.
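For example, the gaps might be filled in for the GAM in the question like this (a sketch, assuming modelData holds the binary outcome Y and predictors X1 to X5, with the 0.1 threshold copied from the question):

library(mgcv)
library(pROC)

k <- 5
FoldID <- rep(1:k, ceiling(nrow(modelData)/k))
length(FoldID) <- nrow(modelData)
FoldID <- sample(FoldID, replace = FALSE)

threshold <- 0.1
results <- data.frame(fold = 1:k, RMSE = NA, AUC = NA, TSS = NA)
for (fold in 1:k) {
  train_data <- modelData[FoldID != fold, ]
  val_data <- modelData[FoldID == fold, ]
  fit <- gam(Y ~ X1 + X2 + X3 + X4 + s(X5, k = 3),
             data = train_data, family = "binomial")
  pred <- predict(fit, newdata = val_data, type = "response")
  results$RMSE[fold] <- sqrt(mean((val_data$Y - pred)^2))
  results$AUC[fold] <- as.numeric(auc(val_data$Y, pred))
  # TSS from sensitivity and specificity at the chosen threshold
  sens <- mean(pred[val_data$Y == 1] > threshold)
  spec <- mean(pred[val_data$Y == 0] <= threshold)
  results$TSS[fold] <- sens + spec - 1
}
colMeans(results[, -1])  # mean RMSE, AUC and TSS across the 5 folds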

Two methods of recovering fitted values from a Bayesian Structural Time Series model yield different results

Two conceptually plausible methods of retrieving in-sample predictions (or "conditional expectations") of y[t] given y[t-1] from a bsts model yield different results, and I don't understand why.
One method uses the prediction errors returned by bsts (defined as e=y[t] - E(y[t]|y[t-1]); source: https://rdrr.io/cran/bsts/man/one.step.prediction.errors.html):
library(bsts)
get_yhats1 <- function(fit){
  # One-step prediction errors are defined as e = y[t] - E(y[t]|y[t-1])
  # (see the bsts documentation linked above), so recover yhat as y - e
  bsts.pred.errors <- bsts.prediction.errors(fit, burn=SuggestBurn(0.1, fit))$in.sample
  predictions <- t(apply(bsts.pred.errors, 1, function(e){fit$original.series-e}))
  return(predictions)
}
Another sums the contributions of all model components at time t.
get_yhats2 <- function(fit){
  burn <- SuggestBurn(0.1, fit)
  X <- fit$state.contributions
  niter <- dim(X)[1]
  ncomp <- dim(X)[2]
  nobs <- dim(X)[3]
  # initialize the prediction matrix with zeros
  predictions <- matrix(data = 0, nrow = niter - burn, ncol = nobs)
  for (comp in seq_len(ncomp)) {
    # pull out the state contributions for this component as a
    # (niter - burn) x nobs matrix
    compX <- X[-seq_len(burn), comp, ]
    # accumulate the predictions across components
    predictions <- predictions + compX
  }
  return(predictions)
}
Fit a model:
## Air passengers data
data("AirPassengers")
# 11 years, monthly data (timestep=monthly) --> 132 observations
Y <- stats::window(AirPassengers, start=c(1949,1), end=c(1959,12))
y <- log(Y)
ss <- AddLocalLinearTrend(list(), y)
ss <- AddSeasonal(ss, y, nseasons=12, season.duration=1)
bsts.model <- bsts(y, state.specification=ss, niter=500, family='gaussian')
Compute and compare predictions using each of the functions:
p1 <- get_yhats1(bsts.model)
p2 <- get_yhats2(bsts.model)
# Compare predictions for t=1:5, first MCMC iteration:
p1[1,1:5]; p2[1,1:5]
I'm the author of bsts.
The 'prediction errors' in bsts come from the filtering distribution. That is, they come from p(state | past data). The state contributions come from the smoothing distribution, i.e. p(state | all data). The filtering distribution looks backward in time, while the smoothing distribution looks both forward and backward. One typically needs the filtering distribution while using a fitted model, and the smoothing distribution while fitting the model in the first place.
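One way to see this in the example above (a sketch added for illustration, not part of the original answer): the smoothed fit should track the observed series more closely than the filtered one-step fit, because it conditions on the whole series.

y_obs <- as.numeric(bsts.model$original.series)
rmse <- function(p) sqrt(mean((colMeans(p) - y_obs)^2))
rmse(p1)  # filtered one-step fit: conditions only on the past, larger errors
rmse(p2)  # smoothed fit: conditions on all of the data, smaller errors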

Manual Perceptron example in R - are the results acceptable?

I am trying to get a perceptron algorithm for classification working but I think something is missing. This is the decision boundary achieved with logistic regression:
The red dots got into college, after performing better on tests 1 and 2.
This is the data, and this is the code for the logistic regression in R:
dat = read.csv("perceptron.txt", header=F)
colnames(dat) = c("test1","test2","y")
plot(test2 ~ test1, col = as.factor(y), pch = 20, data=dat)
fit = glm(y ~ test1 + test2, family = "binomial", data = dat)
coefs = coef(fit)
(x = c(min(dat[,1])-2, max(dat[,1])+2))
(y = c((-1/coefs[3]) * (coefs[2] * x + coefs[1])))
lines(x, y)
The code for the "manual" implementation of the perceptron is as follows:
# DATA PRE-PROCESSING:
dat = read.csv("perceptron.txt", header=F)
dat[,1:2] = apply(dat[,1:2], MARGIN = 2, FUN = function(x) scale(x)) # scaling the data
data = data.frame(rep(1, nrow(dat)), dat) # introducing the "bias" column
colnames(data) = c("bias","test1","test2","y")
data$y[data$y==0] = -1 # Turning the 0/1 dependent variable into -1/1.
data = as.matrix(data) # Turning the data.frame into a matrix to avoid mmult problems.

# PERCEPTRON:
set.seed(62416)
no.iter = 1000 # Number of loops
theta = rnorm(ncol(data) - 1) # Starting with a random vector of coefficients.
theta = theta / sqrt(sum(theta^2)) # Normalizing the vector.
h = theta %*% t(data[,1:3]) # Performing the first f(theta^T X)
for (i in 1:no.iter){ # We will recalculate 1,000 times.
  for (j in 1:nrow(data)){ # Each time we go through each example.
    if (h[j] * data[j, 4] < 0){ # If the hypothesis disagrees with the sign of y,
      theta = theta + (sign(data[j, 4]) * data[j, 1:3]) # we add or subtract the example from theta;
    } else {
      theta = theta # else we leave theta as it is.
    }
  }
  h = theta %*% t(data[,1:3]) # Recalculating h() after each full pass.
}
theta # Final coefficients
mean(sign(h) == data[,4]) # Accuracy
With this, I get the following coefficients:
bias test1 test2
9.131054 19.095881 20.736352
and an accuracy of 88%, consistent with that calculated with the glm() logistic regression function: mean(sign(predict(fit)) == data[,4]) gives 89%. Logically, there is no way of linearly classifying all of the points, as is obvious from the plot above. In fact, iterating only 10 times and plotting the accuracy, ~90% is reached after just 1 iteration.
Since this is in line with the training classification performance of logistic regression, the code is likely not conceptually wrong.
QUESTIONS: Is it OK to get coefficients so different from the logistic regression:
(Intercept) test1 test2
1.718449 4.012903 3.743903
This is really more of a CrossValidated question than a StackOverflow question, but I'll go ahead and answer.
Yes, it's normal and expected to get very different coefficients, because you can't directly compare the magnitude of the coefficients between these two techniques.
With the logit (logistic) model you're fitting a binomial likelihood through a logit link, so the coefficients live on the log-odds scale and are only meaningful in that context. You've also got an intercept term in the logit.
None of this is true for the perceptron model. Its coefficients only matter up to direction: scaling theta by any positive constant leaves every prediction sign(theta %*% x) unchanged, so the magnitude is essentially arbitrary and grows with the number of mistake-driven updates. The interpretation of the coefficients is thus totally different.
Now, that's not saying anything about which model is better. There aren't comparable performance metrics in your question that would allow us to determine that. To determine that, you should do cross-validation or at least use a holdout sample, for example as sketched below.
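A minimal holdout sketch (my addition, assuming the dat data frame from the question; only the glm half is shown, and the perceptron would be trained on the same train rows and scored on the same test rows):

set.seed(1)
idx <- sample(nrow(dat), round(0.8 * nrow(dat)))
train <- dat[idx, ]
test <- dat[-idx, ]
fit <- glm(y ~ test1 + test2, family = "binomial", data = train)
pred <- predict(fit, newdata = test, type = "response") > 0.5
mean(pred == test$y) # held-out accuracy for the logistic model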

Issues with predict function in R

I'm having issues with using the predict() function in R and I hope that I can get some help. Consider a dataset with two columns: 1) Y, 2) X.
My goal is to fit a natural spline fit and get a 95% CI and to mark points outside of the 95% CI as outlier. Here is what I do:
1) Initially no point in the dataset is marked as outlier.
2) I fit my ns fit and using its 95% CI, I mark the points outside of the CI as outlier
3) I then exclude the initially marked outliers, fit another ns, and using its 95% CI, I mark outliers again.
Issue:
Suppose my initial dataset has 1000 obs. I mark some outliers in the first round and get 23 outliers. Then I fit another ns (call it fit.ns) using the remaining 977 non-outliers. I then use ALL X's (all 1000) to get predicted values based on this new fit, but I get a warning AND an error that newdata in my predict function has 1000 obs while the fit has 977. The predicted values returned also have 977 values, NOT 1000.
My predict() code:
# Fitting a Natural Spline Fit (df = 3 by default)
fit.ns <- lm(data.ns$IBI ~ ns(data.ns$Time, knots = data.ns$Time[knots]))
# Getting Fitted Values and 95% CI:
fit.ns.values <- predict(fit.ns, newdata = data.frame(Time = data.temp$Time),
                         interval = "prediction", level = 1 - 0.05) # ??? PROBLEM
I really appreciate your help.
Seems that I cannot upload the dataset, but my code is:
library(splines)
ns.knot <- 10
for (i in 1:2){
  # Exclude outliers so that the ns fit does not get affected by outliers
  data.ns <- data.temp[data.temp$OutlierInd == 0, ]
  data.ns$BeatNum <- 1:nrow(data.ns) # BeatNum is like a row number for me and is an auxiliary variable
  # Placeholders for Natural Spline results:
  data.temp$IBI.NSfit <- rep(NA, nrow(data.temp))
  data.temp$IBI.NSfit.L95 <- rep(NA, nrow(data.temp))
  data.temp$IBI.NSfit.U95 <- rep(NA, nrow(data.temp))
  # Defining the knots in the n.s.:
  knots <- (data.ns$BeatNum)[seq(ns.knot, (length(data.ns$BeatNum) - ns.knot), by = ns.knot)]
  # Fitting a Natural Spline Fit (df = 3 by default)
  fit.ns <- lm(data.ns$IBI ~ ns(data.ns$Time, knots = data.ns$Time[knots]))
  # Getting Fitted Values and 95% CI:
  fit.ns.values <- predict(fit.ns, newdata = data.frame(Time = data.temp$Time),
                           interval = "prediction", level = 1 - 0.05) # ??? PROBLEM
  data.temp$IBI.NSfit <- fit.ns.values[, 1]
  data.temp$IBI.NSfit.L95 <- fit.ns.values[, 2]
  data.temp$IBI.NSfit.U95 <- fit.ns.values[, 3]
  # Updating OutlierInd based on the Natural Spline 95% CI:
  data.temp$OutlierInd <- ifelse(data.temp$IBI < data.temp$IBI.NSfit.U95 &
                                 data.temp$IBI > data.temp$IBI.NSfit.L95, 0, 1)
}
Finally, I found the solution:
When I fit the model, I should use the "data =" option. In other words, instead of the command below,
# Fitting a Natural Spline Fit (df = 3 by default)
fit.ns <- lm(data.ns$IBI ~ ns(data.ns$Time, knots = data.ns$Time[knots]))
I should use the command below instead:
# Fitting a Natural Spline Fit (df = 3 by default)
fit.ns <- lm(IBI ~ ns(Time, knots = Time[knots]), data = data.ns)
Then the predict function will work, because the variables in the formula (IBI and Time) are now looked up in the data/newdata arguments instead of being hard-coded to the columns of data.ns.
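With the model fit that way, the predict call from the loop returns values for all rows of data.temp:

fit.ns.values <- predict(fit.ns, newdata = data.frame(Time = data.temp$Time),
                         interval = "prediction", level = 0.95)
nrow(fit.ns.values) # now equals nrow(data.temp), not nrow(data.ns)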
I wanted to add a comment but my rep level doesn't allow that.
Anyway, I think it is a well-documented point that predict() looks up the exact variable names that were used in the fitting call. So naming your variables consistently is the best way to get around this error, in my experience.
So, in the case above, define a data frame just for your fitting purposes, like this:
library(splines)
# Fit part
fit.data <- data.frame(y = rnorm(30), x = rnorm(30))
fit.ns <- lm(y ~ ns(x, 3), data = fit.data)
# Predict part
pred.data <- data.frame(x = rnorm(10))
pred.fit <- predict(fit.ns, newdata = pred.data, interval = "confidence", level = 0.95)
IMHO, this should get rid of your error.
