How to recover fitted values from a BSTS Poisson model (in R)?

I am trying to recover in-sample predictions (fitted values) from a bsts model with a Poisson response, using the bsts package in R. The following code results in the error "Prediction errors are not supported for Poisson or logit models."
library(bsts)
data("AirPassengers")
# 11 years, monthly data (timestep=monthly) --> 132 observations
Y <- stats::window(AirPassengers, start=c(1949,1), end=c(1959,12))
y <- log10(Y)
ss <- AddLocalLinearTrend(list(), y)
ss <- AddSeasonal(ss, y, nseasons=12, season.duration=1)
bsts.model <- bsts(Y, state.specification=ss, niter=150, family='poisson')
bsts.prediction.errors(bsts.model)
Is there a way to retrieve predictions on model-training data with a poisson model in bsts?

One way to do it is to extract the contribution of each model component at time t and sum them.
get_yhats2 <- function(fit){
  burn <- SuggestBurn(0.1, fit)
  X <- fit$state.contributions
  niter <- dim(X)[1]
  ncomp <- dim(X)[2]
  nobs <- dim(X)[3]
  # initialize the predictions matrix with zeros
  predictions <- matrix(data = 0, nrow = niter - burn, ncol = nobs)
  comps <- seq_len(ncomp)
  for (comp in comps) {
    # pull out the state contributions for this component, dropping the
    # burn-in draws, as a (niter - burn) x nobs matrix
    compX <- X[-seq_len(burn), comp, ]
    # accumulate the predictions across each component
    predictions <- predictions + compX
  }
  return(predictions)
}
get_yhats2(bsts.model)
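However, I also posted a related question (included further down this page as "Two methods of recovering fitted values from a Bayesian Structural Time Series model yield different results"), showing that this method did not necessarily match my expectations even in the Gaussian case.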
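For a Poisson model, one extra step may be needed. The sketch below assumes that bsts models the log of the Poisson rate, so that the summed state contributions are on the log scale and must be exponentiated to get fitted means on the count scale. The helper name poisson_fitted_draws is hypothetical, and the function takes the contributions array directly so it can be tried on a toy array without fitting a model. It also ignores any exposure term.

```r
# Hypothetical helper: sum state contributions over components and map the
# result from the (assumed) log scale back to the count scale.
# state_contributions: array with dimensions (MCMC draw, component, time),
# laid out like fit$state.contributions in the code above.
poisson_fitted_draws <- function(state_contributions, burn = 0) {
  n_draws <- dim(state_contributions)[1]
  keep <- setdiff(seq_len(n_draws), seq_len(burn))
  kept <- state_contributions[keep, , , drop = FALSE]
  # sum over the component dimension, keeping draws x time
  log_rate <- apply(kept, c(1, 3), sum)
  exp(log_rate)
}

# Toy array: 4 draws, 2 components, 3 time points
toy <- array(0, dim = c(4, 2, 3))
toy[, 1, ] <- log(2)
toy[, 2, ] <- log(3)
toy_fitted <- poisson_fitted_draws(toy, burn = 1)

# With a fitted model (not run here):
# draws <- poisson_fitted_draws(bsts.model$state.contributions,
#                               burn = SuggestBurn(0.1, bsts.model))
# yhat  <- colMeans(draws)   # posterior mean fitted value per time point
```

These values are conditional on the whole series, so they are smoothed fits and not one-step-ahead predictions.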

Related

How to convert one-fold cross-validation to K-fold cross-validation in R

I have a GAM model for which I would like to calculate AUC, TSS (True Skill Statistic) and RMSE through 5-fold cross-validation in R. Unfortunately, the caret package does not support GAM and therefore cannot be used. As I didn’t find any alternative, I tried to build the code for cross-validation myself, and it works well, with the only problem that it is only one-fold cross-validation. Could anybody help me to make this 5-fold? Sorry if this is an elementary question, I am new to R.
sample <- sample(c(TRUE, FALSE), nrow(DF), replace=TRUE, prob=c(0.8,0.2))
train <- DF[sample, ]
test <- DF[!sample, ]
predicted <- predict(GAM, test, type="response")
# Calculating RMSE
RMSE(test$Y, predicted)
# Calculating AUC
auc(test$Y, predicted)
GAM_TSS <- gam(Y ~ X1 + X2 + X3 + X4 + s(X5, k = 3), train, family = "binomial")
test$pred <- predict(GAM_TSS, type="response", newdata=test)
roc.curve <- roc(test$Y, test$pred, ci=T)
plot(roc.curve)
threshold <- 0.1
CM <- confusionMatrix(factor(test$pred>threshold), factor(test$P_A==1), positive="TRUE")
CM <- CM$byClass
Sensitivity <- CM[['Sensitivity']]
Specificity <- CM[['Specificity']]
# Calculating TSS
TSS = Sensitivity + Specificity - 1
TSS
I have come across precisely this problem with GAM in the past. My approach was to create a vector to split data randomly into parts as equally sized as possible, then loop through the fold ids as follows:
k <- 5
FoldID <- rep(1:k, ceiling(nrow(modelData)/k))
length(FoldID) <- nrow(modelData)
FoldID <- sample(FoldID, replace = FALSE)
for(fold in 1:k){
  train_data <- modelData[FoldID != fold, ]
  val_data <- modelData[FoldID == fold, ]
  # Create training model and predictions
  # Calculate RMSE data etc.
  # Add a line with fold validation results to a dataframe
}
# Calculate column means of your validation results frame
I will leave you to fill in the gaps to suit your own requirements. It would also be a good idea to add an outer loop (outside the FoldID creation) for repeats.
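As one possible way of filling in those gaps, here is a minimal sketch on simulated data. The data frame, the model formula and the 0.5 threshold are all made up for illustration, the GAM is fitted with mgcv::gam, and AUC and TSS are computed by hand so that no extra packages are needed. Swap in your own data, formula and metric functions.

```r
library(mgcv)

set.seed(42)
n <- 500
# Simulated stand-in for the real data (hypothetical columns X1, X5, Y)
modelData <- data.frame(X1 = rnorm(n), X5 = runif(n))
modelData$Y <- rbinom(n, 1, plogis(-0.5 + modelData$X1 + sin(2 * pi * modelData$X5)))

# AUC via the rank (Mann-Whitney) formula
auc_rank <- function(obs, pred) {
  r <- rank(pred)
  n1 <- sum(obs == 1)
  n0 <- sum(obs == 0)
  (sum(r[obs == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

k <- 5
threshold <- 0.5
FoldID <- sample(rep(seq_len(k), length.out = nrow(modelData)))
results <- data.frame(fold = seq_len(k), RMSE = NA_real_, AUC = NA_real_, TSS = NA_real_)

for (fold in seq_len(k)) {
  train_data <- modelData[FoldID != fold, ]
  val_data <- modelData[FoldID == fold, ]
  # Refit the model on the training folds only
  fit <- gam(Y ~ X1 + s(X5, k = 3), data = train_data, family = binomial)
  pred <- as.numeric(predict(fit, newdata = val_data, type = "response"))
  pos <- val_data$Y == 1
  sensitivity <- mean(pred[pos] > threshold)
  specificity <- mean(pred[!pos] <= threshold)
  results$RMSE[fold] <- sqrt(mean((val_data$Y - pred)^2))
  results$AUC[fold] <- auc_rank(val_data$Y, pred)
  results$TSS[fold] <- sensitivity + specificity - 1
}

# Average each metric over the folds
cv_means <- colMeans(results[, c("RMSE", "AUC", "TSS")])
cv_means
```

The key point is that the model is refitted inside the loop, so every prediction is made for rows the model has not seen.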

Generating a dependent binary outcome variable that is affected by the independent variable values in R

I generated data as shown below and classified it with the extreme learning machine (ELM) method, evaluating classification performance with AUC and accuracy. I want these AUC and accuracy values to depend strongly on the first 5 variables of the independent-variable matrix X. In other words, I first generate the binary outcome Y and obtain the AUC and accuracy by classifying. Then, keeping Y exactly as originally generated, I change the values of the first 5 variables of X and reclassify, and I expect the AUC and accuracy to change substantially. With my current approach, the Y values are not sufficiently affected by changes in the first 5 variables of X. How can I generate Y so that they are?
install.packages("MASS")
install.packages("elmNNRcpp")
install.packages("pROC")
library(MASS)
library(elmNNRcpp)
library(pROC)
# Data gen
p=30
n=50
pr <- seq(0.7, 0.4, length.out = p)
pr[1] <- 1
covmat <- toeplitz(pr)
mu= rep(0,p)
X_ <- data.frame(mvrnorm(n, mu = mu, Sigma = covmat))
X <- unname(as.matrix(sample(X_)))
vCoef = rnorm(ncol(X))
vProb =exp(X%*%vCoef)/(1+exp(X%*%vCoef))
Y <- rbinom(nrow(X), 1, vProb)
mydata= data.frame(cbind(X,Y))
# Classification
trainIndex <- sample(1:nrow(mydata), size=0.7*nrow(mydata))
trainSet <- mydata[trainIndex,]
testSet <-mydata[-trainIndex,]
xtrain <- as.matrix(trainSet[, 1:(length(trainSet)-1)])
ytrain <- as.matrix(trainSet[, length(trainSet)])
xtest <- as.matrix(testSet[, 1:(length(testSet)-1)])
ytest <- as.matrix(testSet[, length(testSet)])
model=elm_train(xtrain, ytrain, nhid=50 , actfun='relu')
pred.class=elm_predict(model,xtest, normalize=TRUE)
roc.model=roc(as.factor(ytest),as.numeric(pred.class), direction=c("auto"))
#best cut off
cut.opt=InformationValue::optimalCutoff(actuals=ytest, predictedScores= pred.class)
class.model=ifelse(pred.class<cut.opt,c("0"),c("1")) # my classes are coded short:1, long:0
crosstab=table(factor(class.model),factor(ytest))
conf.matrix=caret::confusionMatrix(crosstab, positive="1")
#Predictive performance
AUC = roc.model$auc
Accuracy = conf.matrix$overall["Accuracy"]
AUC
Accuracy
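One possible approach, offered as a sketch and not a definitive answer, is to let only the first 5 columns of X enter the linear predictor, with coefficients large relative to the noise. It assumes "strongly affected" means that Y is generated from those columns alone. The name signal_strength and its value are arbitrary, and the column shuffling from the question is left out for clarity.

```r
library(MASS)

set.seed(123)
p <- 30
n <- 50
pr <- seq(0.7, 0.4, length.out = p)
pr[1] <- 1
X <- mvrnorm(n, mu = rep(0, p), Sigma = toeplitz(pr))

# Only the first 5 coefficients are non-zero (hypothetical strength)
signal_strength <- 3
vCoef <- c(rep(signal_strength, 5), rep(0, p - 5))
vProb <- plogis(drop(X %*% vCoef))
Y <- rbinom(n, 1, vProb)

# Changing columns 6..p leaves the success probabilities untouched ...
X_other <- X
X_other[, 6:p] <- 0
prob_other <- plogis(drop(X_other %*% vCoef))

# ... while permuting the rows of columns 1..5 changes them
X_first <- X
X_first[, 1:5] <- X[sample(n), 1:5]
prob_first <- plogis(drop(X_first %*% vCoef))
mean(abs(vProb - prob_first))
```

Because Y is kept fixed while the first 5 columns are altered, a classifier trained on the altered X loses the signal that generated Y, which is the effect described in the question.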

Two methods of recovering fitted values from a Bayesian Structural Time Series model yield different results

Two conceptually plausible methods of retrieving in-sample predictions (or "conditional expectations") of y[t] given y[t-1] from a bsts model yield different results, and I don't understand why.
One method uses the prediction errors returned by bsts (defined as e=y[t] - E(y[t]|y[t-1]); source: https://rdrr.io/cran/bsts/man/one.step.prediction.errors.html):
library(bsts)
get_yhats1 <- function(fit){
  # One-step prediction errors are defined as e = y[t] - yhat
  # (source: https://rdrr.io/cran/bsts/man/one.step.prediction.errors.html)
  # Recover yhat as y - e
  bsts.pred.errors <- bsts.prediction.errors(fit, burn=SuggestBurn(0.1, fit))$in.sample
  predictions <- t(apply(bsts.pred.errors, 1, function(e){fit$original.series-e}))
  return(predictions)
}
Another sums the contributions of all model components at time t.
get_yhats2 <- function(fit){
  burn <- SuggestBurn(0.1, fit)
  X <- fit$state.contributions
  niter <- dim(X)[1]
  ncomp <- dim(X)[2]
  nobs <- dim(X)[3]
  # initialize the predictions matrix with zeros
  predictions <- matrix(data = 0, nrow = niter - burn, ncol = nobs)
  comps <- seq_len(ncomp)
  for (comp in comps) {
    # pull out the state contributions for this component, dropping the
    # burn-in draws, as a (niter - burn) x nobs matrix
    compX <- X[-seq_len(burn), comp, ]
    # accumulate the predictions across each component
    predictions <- predictions + compX
  }
  return(predictions)
}
Fit a model:
## Air passengers data
data("AirPassengers")
# 11 years, monthly data (timestep=monthly) --> 132 observations
Y <- stats::window(AirPassengers, start=c(1949,1), end=c(1959,12))
y <- log(Y)
ss <- AddLocalLinearTrend(list(), y)
ss <- AddSeasonal(ss, y, nseasons=12, season.duration=1)
bsts.model <- bsts(y, state.specification=ss, niter=500, family='gaussian')
Compute and compare predictions using each of the functions:
p1 <- get_yhats1(bsts.model)
p2 <- get_yhats2(bsts.model)
# Compare predictions for t=1:5, first MCMC iteration:
p1[1,1:5]; p2[1,1:5]
I'm the author of bsts.
The 'prediction errors' in bsts come from the filtering distribution. That is, they come from p(state | past data). The state contributions come from the smoothing distribution, i.e. p(state | all data). The filtering distribution looks backward in time, while the smoothing distribution looks both forward and backward. One typically needs the filtering distribution while using a fitted model, and the smoothing distribution while fitting the model in the first place.
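To make the distinction concrete, here is a small base-R sketch of a local level model with known variances. It is not how bsts works internally (bsts draws states by MCMC), and all names and values are made up for illustration. It computes one-step-ahead state predictions with a Kalman filter (past data only) and smoothed state estimates with a backward pass (all data), and the two differ in the same way that get_yhats1 and get_yhats2 differ.

```r
set.seed(1)
n <- 100
sigma2_obs <- 1.0     # observation noise variance (assumed known)
sigma2_state <- 0.1   # state innovation variance (assumed known)
mu <- cumsum(rnorm(n, sd = sqrt(sigma2_state)))   # latent random-walk level
y <- mu + rnorm(n, sd = sqrt(sigma2_obs))

a_pred <- numeric(n); P_pred <- numeric(n)   # E, Var of state given y[1..t-1]
a_filt <- numeric(n); P_filt <- numeric(n)   # E, Var of state given y[1..t]
a_pred[1] <- 0
P_pred[1] <- 10                              # vague prior on the initial level

# Forward pass: Kalman filter
for (t in seq_len(n)) {
  K <- P_pred[t] / (P_pred[t] + sigma2_obs)          # Kalman gain
  a_filt[t] <- a_pred[t] + K * (y[t] - a_pred[t])    # update with y[t]
  P_filt[t] <- (1 - K) * P_pred[t]
  if (t < n) {
    a_pred[t + 1] <- a_filt[t]                       # random walk: predict
    P_pred[t + 1] <- P_filt[t] + sigma2_state
  }
}

# Backward pass: Rauch-Tung-Striebel smoother, E and Var given y[1..n]
a_smooth <- a_filt
P_smooth <- P_filt
for (t in (n - 1):1) {
  J <- P_filt[t] / P_pred[t + 1]
  a_smooth[t] <- a_filt[t] + J * (a_smooth[t + 1] - a_pred[t + 1])
  P_smooth[t] <- P_filt[t] + J^2 * (P_smooth[t + 1] - P_pred[t + 1])
}

# y - (one-step prediction error) corresponds to a_pred;
# summed state contributions correspond to a_smooth
head(cbind(one_step = a_pred, smoothed = a_smooth), 5)
```

The smoothed variances are never larger than the filtered ones, and the two estimates coincide only at the final time point, where there is no future data left to use.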

R: implementing my own gradient boosting algorithm

I am trying to write my own gradient boosting algorithm. I understand there are existing packages like gbm and xgboost, but I wanted to understand how the algorithm works by writing my own.
I am using the iris data set, and my outcome is Sepal.Length (continuous). My loss function is mean(1/2*(y-yhat)^2) (basically the mean squared error with 1/2 in front), so my corresponding gradient is just the residual y - yhat. I'm initializing the predictions at 0.
library(rpart)
data(iris)
#Define gradient
grad.fun <- function(y, yhat) {return(y - yhat)}
mod <- list()
grad_boost <- function(data, learning.rate, M, grad.fun) {
  # Initialize fit to be 0
  fit <- rep(0, nrow(data))
  grad <- grad.fun(y = data$Sepal.Length, yhat = fit)
  # Initialize model
  mod[[1]] <- fit
  # Loop over a total of M iterations
  for(i in 1:M){
    # Fit base learner (tree) to the gradient
    tmp <- data$Sepal.Length
    data$Sepal.Length <- grad
    base_learner <- rpart(Sepal.Length ~ ., data = data, control = ("maxdepth = 2"))
    data$Sepal.Length <- tmp
    # Fitted values by fitting current model
    fit <- fit + learning.rate * as.vector(predict(base_learner, newdata = data))
    # Update gradient
    grad <- grad.fun(y = data$Sepal.Length, yhat = fit)
    # Store current model (index is i + 1 because i = 1 contains the initialized estimates)
    mod[[i + 1]] <- base_learner
  }
  return(mod)
}
With this, I split up the iris data set into a training and testing data set and fit my model to it.
train.dat <- iris[1:100, ]
test.dat <- iris[101:150, ]
learning.rate <- 0.001
M = 1000
my.model <- grad_boost(data = train.dat, learning.rate = learning.rate, M = M, grad.fun = grad.fun)
Now I calculate the predicted values from my.model. For my.model, the fitted values are 0 (vector of initial estimates) + learning.rate * predictions from tree 1 + learning rate * predictions from tree 2 + ... + learning.rate * predictions from tree M.
yhats.mymod <- apply(sapply(2:length(my.model), function(x) learning.rate * predict(my.model[[x]], newdata = test.dat)), 1, sum)
# Calculate RMSE
> sqrt(mean((test.dat$Sepal.Length - yhats.mymod)^2))
[1] 2.612972
I have a few questions
1. Does my gradient boosting algorithm look right?
2. Did I calculate the predicted values yhats.mymod correctly?
1. Yes, this looks correct. At each step you are fitting to the pseudo-residuals, which are computed as the negative derivative of the loss with respect to the fit. You have correctly derived this gradient at the start of your question, and even bothered to get the factor of 2 right.
2. This also looks correct. You are aggregating across the models, weighted by learning rate, just as you did during training.
But to address something that was not asked, I noticed that your training setup has a few quirks.
The iris dataset is split equally between 3 species (setosa, versicolor, virginica) and these are adjacent in the data. Your training data has all of the setosa and versicolor, while the test set has all of the virginica examples. There is no overlap, which will lead to out-of-sample problems. It is preferable to balance your training and test sets to avoid this.
The combination of learning rate and model count looks too low to me. The remaining residual shrinks as (1-lr)^n, so with lr = 1e-3 and n = 1000 you can model at most 1 - (1 - 0.001)^1000 ≈ 63.2% of the data magnitude. That is, even if every model predicts every sample correctly, you would be estimating 63.2% of the correct value. Initializing the fit with an average, instead of 0s, would help, since then the effect is a regression to the mean instead of just a drag.
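Both points can be checked with a short sketch. It is a simplified rewrite, not the code from the question: the helper names are hypothetical, the split is stratified by species, the tree depth is set through rpart.control, and lr = 0.01 with M = 100 is used so that it runs quickly while giving roughly the same shrinkage factor as lr = 0.001 with M = 1000.

```r
library(rpart)

# Fraction of the target reachable after M steps if every tree were perfect
shrinkage_reached <- function(lr, M) 1 - (1 - lr)^M
shrinkage_reached(0.001, 1000)   # about 0.632

# Gradient boosting for squared-error loss with a chosen starting value
boost <- function(data, lr, M, init) {
  y <- data$Sepal.Length
  fit <- rep(init, nrow(data))
  trees <- vector("list", M)
  work <- data
  for (m in seq_len(M)) {
    work$Sepal.Length <- y - fit   # pseudo-residuals
    trees[[m]] <- rpart(Sepal.Length ~ ., data = work,
                        control = rpart.control(maxdepth = 2))
    fit <- fit + lr * as.vector(predict(trees[[m]], newdata = data))
  }
  list(init = init, lr = lr, trees = trees)
}

predict_boost <- function(model, newdata) {
  contrib <- vapply(model$trees,
                    function(tr) as.vector(predict(tr, newdata = newdata)),
                    numeric(nrow(newdata)))
  model$init + model$lr * rowSums(contrib)
}

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# Stratified split: 35 training rows per species, the rest for testing
set.seed(1)
idx <- unlist(lapply(split(seq_len(nrow(iris)), iris$Species),
                     function(i) sample(i, 35)))
train.dat <- iris[idx, ]
test.dat <- iris[-idx, ]

m_zero <- boost(train.dat, lr = 0.01, M = 100, init = 0)
m_mean <- boost(train.dat, lr = 0.01, M = 100, init = mean(train.dat$Sepal.Length))

rmse_zero <- rmse(test.dat$Sepal.Length, predict_boost(m_zero, test.dat))
rmse_mean <- rmse(test.dat$Sepal.Length, predict_boost(m_mean, test.dat))
c(zero_init = rmse_zero, mean_init = rmse_mean)
```

With the zero start, the predictions still sit well below the observed values after all M steps, whereas the mean start only leaves the deviations from the mean under-fitted.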

Nonlinear regression with sampling weights (package survey)

I would like to estimate the coefficients of a nonlinear model with a binary dependent variable. The nonlinearity arises because two regressors, A and B, depend on a subset of the dataset and on the two parameters lambda1 and lambda2 respectively:
y = alpha + beta1 * A(lambda1) + beta2 * B(lambda2) + delta * X + epsilon
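where, for each observation i (following the function A defined below), we have
A_i(lambda1) = sum_{k=1}^{a_i - 1} w_ik * R_ik, with w_ik = (a_i - k)^lambda1 / sum_{j=1}^{a_i - 1} (a_i - j)^lambda1
and a and the Rs are variables in the data.frame. The regressor B(lambda2) is defined in a similar way.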
Moreover, I need to include what in Stata are known as pweights, i.e. survey weights or sampling weights. For this reason, I'm working with the R package survey by Thomas Lumley.
First, I create a function for A (and B), i.e.:
A <- function(l1){
  R <- as.matrix(data[,1:(80)])
  a <- data[,169]
  N = length(a)
  var <- numeric(N)
  for (i in 1:N) {
    ai <- rep(a[i],a[i]-1) # vector of a(i)
    k <- 1:(a[i]-1) # numbers from 1 to a(i)-1
    num <- (ai-k)^l1
    den <- sum((ai-k)^l1)
    w <- num/den
    w <- c(w,rep(0,dim(R)[2]-length(w)))
    var[i] <- R[i,] %*% w
  }
  return(var)
}
B <- function(l2){
  C <- as.matrix(data[,82:(161-1)])
  a <- data[,169]
  N = length(a)
  var <- numeric(N)
  for (i in 1:N) {
    ai <- rep(a[i],a[i]-1) # vector of a(i)
    k <- 1:(a[i]-1) # numbers from 1 to a(i)-1
    num <- (ai-k)^l2
    den <- sum((ai-k)^l2)
    w <- num/den
    w <- c(w,rep(0,dim(C)[2]-length(w)))
    var[i] <- C[i,] %*% w
  }
  return(var)
}
But the problem is that I don't know how to include the nonlinear regressors in the model (or in the survey design, using the function svydesign):
d_test <- svydesign(id=~1, data = data, weights = ~data$hw0010)
Because, when I try to estimate the model:
# loglikelihood function:
LLsvy <- function(y, model, lambda1, lambda2){
  aux1 <- y * log(pnorm(model))
  aux2 <- (1-y) * log(1-pnorm(model))
  LL <- (aux1) + (aux2)
  return(LL)
}
fit <- svymle(loglike=LLsvy,
              formulas=list(~y, model = ~ A(lambda1)+B(lambda2)+X,lambda1=~1,lambda2=~1),
              design=d_test,
              start=list(c(0,0,0,0),c(lambda1=11),c(lambda2=8)),
              na.action="na.exclude")
I get the error message:
Error in eval(expr, envir, enclos) : object 'lambda1' not found
I think that the problem is in including the nonlinear part, because everything works fine if I fix A and B for some lambda1 and lambda2 (so that the model becomes linear):
lambda1=11
lambda2=8
data$A <- A(lambda1)
data$B <- B(lambda2)
d_test <- svydesign(id=~1, data = data, weights = ~data$hw0010)
LLsvylin <- function(y, model){
  aux1 <- y * log(pnorm(model))
  aux2 <- (1-y) * log(1-pnorm(model))
  LL <- (aux1) + (aux2)
  return(LL)
}
fitlin <- svymle(loglike=LLsvylin,
                 formulas=list(~y, model = ~A+B+X),
                 design=d_test,
                 start=list(0,0,0,0),
                 na.action="na.exclude")
On the contrary, if I don't use the sampling weights, I can easily estimate my nonlinear model using the function mle from package stats4 or the function mle2 from package bbmle.
To sum up,
how can I combine sampling weights (svymle) while estimating a nonlinear model (which I can do using mle or mle2)?
=========================================================================
A problem with the nonlinear part of the model arises also when using the function svyglm (with fixed lambda1 and lambda2, in order to get good starting values for svymle):
lambda1=11
lambda2=8
model0 = y ~ A(lambda1) + B(lambda2) + X
probit1 = svyglm(formula = model0,
                 data = data,
                 family = binomial(link=probit),
                 design = d_test)
Because I get the error message:
Error in svyglm.survey.design(formula = model0, data = data, family = binomial(link = probit), :
all variables must be in design= argument
This isn't what svymle does -- it's for generalised linear models, which have linear predictors and a potentially complicated likelihood or loss function. You want non-linear weighted least squares, with a simple loss function but complicated predictors.
There isn't an implementation of design-weighted nonlinear least squares in the survey package, probably because no-one has previously asked for one. You could try emailing the package author.
The upcoming version 4 of the survey package will have a function svynls, so if you know how to fit your model without sampling weights using nls you will be able to fit it with sampling weights.
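As a starting point for the unweighted nls fit mentioned above, here is a minimal sketch on simulated data. It uses base R nls with its weights argument, which gives weighted point estimates only. Its standard errors are not design-based, so it is not a substitute for svynls. The model form, the variable names and the weights are all invented for illustration.

```r
set.seed(1)
n <- 500
# Simulated data: y depends on x through a nonlinear parameter lambda
sim <- data.frame(x = runif(n, 0, 6))
sim$y <- 1 + 2 * exp(-0.5 * sim$x) + rnorm(n, sd = 0.05)
sim$w <- runif(n, 0.5, 2)   # hypothetical sampling weights

# Weighted nonlinear least squares: minimises sum(w * residual^2)
fit_w <- nls(y ~ alpha + beta * exp(-lambda * x),
             data = sim,
             start = list(alpha = 0.5, beta = 1.5, lambda = 0.3),
             weights = sim$w)
est <- coef(fit_w)
est
```

Once a model of this form fits with nls, the same formula and starting values are what you would expect to carry over to a design-weighted fit.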
