GBM poisson regression confidence intervals - r

If I have a gbm poisson regression model as follows:
# My data
set.seed(0)
df <- data.frame(count = rpois(100,1),
pred1 = rnorm(100, 10, 1),
pred2 = rnorm(100, 0, 1),
pred3 = rnorm(100, 0, 1))
# My Split
library(rsample)  # provides initial_split(), training(), testing()
split <- initial_split(df)
# My model
library(gbm)
m <- gbm(
formula = count ~ .,
distribution ="poisson",
data = training(split))
And I make a prediction:
# My prediction
p <- predict(m,
n.trees=m$n.trees,
testing(split),
type="response")
I'd like to generate some confidence intervals around the values of p. I cannot seem to find a way of doing this when I use m to predict on the test data set or a new dataset (where the predictor variables have identical underlying distributions).
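Since predict() on a gbm object returns only point predictions, one possible workaround is a percentile bootstrap: refit the model on resampled training rows and take quantiles of the resulting predictions. The following is a minimal sketch under stated assumptions, not built-in gbm functionality. The helper name boot_gbm_ci, the number of resamples and the base-R split (standing in for initial_split) are all illustrative choices.

```r
library(gbm)

set.seed(0)
df <- data.frame(count = rpois(100, 1),
                 pred1 = rnorm(100, 10, 1),
                 pred2 = rnorm(100, 0, 1),
                 pred3 = rnorm(100, 0, 1))

# Base-R train/test split (assumption: stands in for rsample::initial_split)
idx   <- sample(nrow(df), size = 75)
train <- df[idx, ]
test  <- df[-idx, ]

# Hypothetical helper: percentile bootstrap of gbm predictions
boot_gbm_ci <- function(train, newdata, n_boot = 200, level = 0.95,
                        n_trees = 100) {
  # Each column of preds holds the predictions from one bootstrap refit
  preds <- replicate(n_boot, {
    b   <- train[sample(nrow(train), replace = TRUE), ]
    fit <- gbm(count ~ ., distribution = "poisson", data = b,
               n.trees = n_trees, verbose = FALSE)
    predict(fit, newdata = newdata, n.trees = n_trees, type = "response")
  })
  alpha <- (1 - level) / 2
  data.frame(lower  = apply(preds, 1, quantile, probs = alpha),
             median = apply(preds, 1, median),
             upper  = apply(preds, 1, quantile, probs = 1 - alpha))
}

ci <- boot_gbm_ci(train, test)
head(ci)
```

The band describes how much the predicted mean moves across resamples. It is not a prediction interval for individual counts, and it does not correct for any bias in the boosted fit.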

Related

Cross validation PCA with different correlation matrix in R

I want to evaluate in terms of MSE, AIC, and Adjusted R squared, various Principal Component models based on different correlation coefficients (e.g., Pearson, Kendall) in R. I have created the following function however I can't find the way to "force" the function to perform principal component regression based on the given correlation matrix cor1 and cor2. Therefore, I end up with the exact same results. Can someone help me?
library(caret)
set.seed(123)
df <- data.frame(Y = rnorm(100), X1 = rnorm(100), X2 = rnorm(100), X3 = rnorm(100), X4 = rnorm(100), X5 = rnorm(100))
X <- df[,-1]
Y <- df[,1]
# compute Pearson's and Kendall's correlation matrices
cor1 <- cor(X, method = "pearson")
cor2 <- cor(X, method = "kendall")
# define function to compute PCA with cross-validation and return MSE, AIC, and adjusted R-squared
pca_cv_mse_aic_r2 <- function(X, Y, cor_mat, ncomp, nfolds) {
# create empty vectors to store results
mse <- rep(0, ncomp)
aic <- rep(0, ncomp)
adj_r2 <- rep(0, ncomp)
# loop over the number of components
for (i in 1:ncomp) {
# perform PCA with cross-validation
pca <- caret::train(X, Y, method = "pcr", preProc = c("center", "scale"),
tuneLength = nfolds, trControl = trainControl(method = "cv", number = nfolds),
tuneGrid = data.frame(ncomp = i))
# compute MSE, AIC, and adjusted R-squared
pred <- predict(pca, newdata = X)
mse[i] <- mean((pred - Y)^2)
aic[i] <- AIC(lm(Y ~ pred + 1))
adj_r2[i] <- summary(lm(Y ~ pred))$adj.r.squared
}
# return a list of results
return(list(mse = mse, aic = aic, adj_r2 = adj_r2))
}
# compute the MSE, AIC, and adjusted R-squared of PCA models with different correlation matrices and numbers of components
results1 <- pca_cv_mse_aic_r2(X, Y, cor1, 5, 10)
results2 <- pca_cv_mse_aic_r2(X, Y, cor2, 5, 10)
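The function above never uses cor_mat, which is why both calls return identical results. One possible way to make the correlation matrix matter is to eigendecompose it yourself, project the standardized predictors onto the leading eigenvectors, and regress Y on those scores. This is a sketch under assumptions: pcr_from_cor is a hypothetical helper, and the metrics are in-sample (no cross-validation).

```r
set.seed(123)
df <- data.frame(Y = rnorm(100), X1 = rnorm(100), X2 = rnorm(100),
                 X3 = rnorm(100), X4 = rnorm(100), X5 = rnorm(100))
X <- df[, -1]
Y <- df[, 1]

# Hypothetical helper: principal component regression driven by a
# user-supplied correlation matrix (in-sample metrics only)
pcr_from_cor <- function(X, Y, cor_mat, ncomp) {
  Z       <- scale(as.matrix(X))                      # standardize predictors
  loading <- eigen(cor_mat, symmetric = TRUE)$vectors # components of cor_mat
  res <- t(sapply(seq_len(ncomp), function(k) {
    scores <- Z %*% loading[, seq_len(k), drop = FALSE]
    fit    <- lm(Y ~ scores)
    c(ncomp  = k,
      mse    = mean(residuals(fit)^2),
      aic    = AIC(fit),
      adj_r2 = summary(fit)$adj.r.squared)
  }))
  as.data.frame(res)
}

res_pearson <- pcr_from_cor(X, Y, cor(X, method = "pearson"), ncomp = 5)
res_kendall <- pcr_from_cor(X, Y, cor(X, method = "kendall"), ncomp = 5)
```

When all five components are kept, both sets of eigenvectors span the same space, so the two fits coincide. Differences only show up when fewer components are retained.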

Poisson regression model - how to add regression line with specific value of coefficient?

So, I have my poisson regression model: (mvdiff = market value diff, participations = participations in world cup)
mod <- glm(goals~participations+MVdiff, family = poisson)
plot(MVdiff, jitter(goals , 0.2), pch=17)
Now I would like to include a regression function line into my plot: as a function of MVdiff, with the value of 9.2 for participations (as in 9.2 is the mean of the participations in the world cup).
Here my try:
curve(exp(mod$coefficients[1]+mod$coefficients[2]*x+9.2+mod$coefficients[3]),
lwd=3,col="red",add=TRUE)
But this doesn't quite work out. Is there a way to properly add the value of 9.2 into my coefficient variable participation?
To get the model output at a fixed value of an independent variable, use predict with the value of that variable fixed:
pred_df <- data.frame(MVdiff = seq(-1500, 1500), participations = 9.2)
predictions <- predict(mod, newdata = pred_df , type = "response")
Now plot this as a line over your data:
plot(MVdiff, jitter(goals, 0.2), pch = 16)
lines(pred_df$MVdiff, predictions, col = "red")
Data used
Obviously, we don't have your data, so I had to create my own for the example with the following code:
set.seed(123)
participations <- rpois(500, 9.2)
MVdiff <- rnorm(500, 0, 600)
goals <- rpois(500, 1 + MVdiff/3000)
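For comparison, the curve() attempt from the question can also be repaired by hand: with a log link, 9.2 has to be multiplied by the participations coefficient (not added), and x has to multiply the MVdiff coefficient. A minimal sketch using the simulated data above; reg_line is a hypothetical helper name.

```r
set.seed(123)
participations <- rpois(500, 9.2)
MVdiff <- rnorm(500, 0, 600)
goals  <- rpois(500, 1 + MVdiff / 3000)
mod <- glm(goals ~ participations + MVdiff, family = poisson)

b <- coef(mod)
# Log link: mean = exp(intercept + b_participations * 9.2 + b_MVdiff * x)
reg_line <- function(x) {
  exp(b["(Intercept)"] + b["participations"] * 9.2 + b["MVdiff"] * x)
}

plot(MVdiff, jitter(goals, 0.2), pch = 17)
curve(reg_line(x), lwd = 3, col = "red", add = TRUE)
```

This should trace the same line as the predict() approach, since predict() with type = "response" applies the same inverse link to the same linear predictor.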

Is there a function to simulate data from a robust linear regression or quantile regression model?

Are there any functions that can simulate the response variable from a robust linear model or a quantile regression model like there is for linear models (i.e. stats::simulate.lm)?
If not, is there a way to adapt code to do this for either model?
Here is an example of the kind of data and models I am dealing with:
#Data
df <- data.frame(Response = c(1:30 + rnorm(n = 30)), Covariate = c(seq(from = 10, to = 1.3, by = -0.3)))
#Robust linear regression
fit.rlm <- MASS::rlm(Response ~ Covariate, data = df)
#Quantile regression
fit.qr <- quantreg::rq(Response ~ Covariate, data = df, tau = c(0.025,0.5,0.975))
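Whether a dedicated simulate method exists for these classes is left open here. One way to adapt the idea for the robust fit is residual resampling: add residuals drawn with replacement to the fitted values. This is only a sketch. simulate_resid is a hypothetical helper, it assumes exchangeable errors, and it covers only the rlm fit, not the multi-tau quantile regression.

```r
library(MASS)

set.seed(1)
df <- data.frame(Response = 1:30 + rnorm(n = 30),
                 Covariate = seq(from = 10, to = 1.3, by = -0.3))
fit.rlm <- rlm(Response ~ Covariate, data = df)

# Hypothetical helper: simulate responses by resampling residuals
simulate_resid <- function(fit, nsim = 1) {
  mu  <- fitted(fit)
  res <- residuals(fit)
  # One column per simulated response vector
  sims <- replicate(nsim, mu + sample(res, length(res), replace = TRUE))
  as.data.frame(sims)
}

sims <- simulate_resid(fit.rlm, nsim = 5)
dim(sims)
```

Resampling residuals keeps any outliers the robust fit downweighted, so the simulated responses have the same heavy tails as the observed data rather than Gaussian noise.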

Confidence intervals for predictions from logistic regression

In R predict.lm computes predictions based on the results from linear regression and also offers to compute confidence intervals for these predictions. According to the manual, these intervals are based on the error variance of fitting, but not on the error intervals of the coefficient.
On the other hand predict.glm which computes predictions based on logistic and Poisson regression (amongst a few others) doesn't have an option for confidence intervals. And I even have a hard time imagining how such confidence intervals could be computed to provide a meaningful insight for Poisson and logistic regression.
Are there cases in which it is meaningful to provide confidence intervals for such predictions? How can they be interpreted? And what are the assumptions in these cases?
The usual way is to compute a confidence interval on the scale of the linear predictor, where things will be more normal (Gaussian) and then apply the inverse of the link function to map the confidence interval from the linear predictor scale to the response scale.
To do this you need two things:
call predict() with type = "link", and
call predict() with se.fit = TRUE.
The first produces predictions on the scale of the linear predictor; the second returns the standard errors of those predictions. For example:
## Working example data
foo <- mtcars[, c("mpg", "vs")]
names(foo) <- c("x", "y")
mod <- glm(y ~ x, data = foo, family = binomial)
preddata <- with(foo, data.frame(x = seq(min(x), max(x), length.out = 100)))
preds <- predict(mod, newdata = preddata, type = "link", se.fit = TRUE)
preds is then a list with components fit and se.fit.
The confidence interval on the linear predictor is then
critval <- 1.96 ## approx 95% CI
upr <- preds$fit + (critval * preds$se.fit)
lwr <- preds$fit - (critval * preds$se.fit)
fit <- preds$fit
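critval is chosen from a t or z (normal) distribution as required, with the coverage required. As a rule of thumb, families with a fixed dispersion (binomial, Poisson) use z, while families whose dispersion is estimated (Gaussian, Gamma, the quasi families) use t on the residual degrees of freedom; this matches what summary.glm reports. The 1.96 is the value of the Gaussian distribution giving 95% coverage: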
> qnorm(0.975) ## 0.975 as this is upper tail, 2.5% also in lower tail
[1] 1.959964
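The choice between the two distributions can be sketched in code. This is an illustration only: crit_value is a hypothetical helper, and the rule it encodes (z for binomial and Poisson, t otherwise) is the rule of thumb that summary.glm follows when labelling its test statistics.

```r
# Hypothetical helper: pick the critical value to match how the family
# treats dispersion (fixed for binomial/poisson, estimated otherwise)
crit_value <- function(mod, level = 0.95) {
  p <- 1 - (1 - level) / 2
  if (family(mod)$family %in% c("binomial", "poisson")) {
    qnorm(p)
  } else {
    qt(p, df = df.residual(mod))
  }
}

foo <- mtcars[, c("mpg", "vs")]
names(foo) <- c("x", "y")
mod_bin   <- glm(y ~ x, data = foo, family = binomial)
mod_gauss <- glm(x ~ y, data = foo, family = gaussian)

crit_value(mod_bin)    # normal quantile
crit_value(mod_gauss)  # t quantile on the residual degrees of freedom
```

With 30 residual degrees of freedom the t quantile is only slightly larger than 1.96, so the distinction matters mostly for small samples.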
Now for fit, upr and lwr we need to apply the inverse of the link function to them.
fit2 <- mod$family$linkinv(fit)
upr2 <- mod$family$linkinv(upr)
lwr2 <- mod$family$linkinv(lwr)
Now you can plot all three and the data.
preddata$lwr <- lwr2
preddata$upr <- upr2
library(ggplot2)
ggplot(data=foo, mapping=aes(x=x,y=y)) + geom_point() +
stat_smooth(method="glm", method.args=list(family=binomial)) +
geom_line(data=preddata, mapping=aes(x=x, y=upr), col="red") +
geom_line(data=preddata, mapping=aes(x=x, y=lwr), col="red")
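The same recipe carries over unchanged to a Poisson GLM with a log link. Below is a compact sketch that wraps the steps into one function; the simulated data, the model name pmod and the helper glm_ci are assumptions made for illustration.

```r
set.seed(42)
sim <- data.frame(x = runif(200, 0, 2))
sim$y <- rpois(200, lambda = exp(0.3 + 0.8 * sim$x))
pmod <- glm(y ~ x, data = sim, family = poisson)

# Hypothetical helper: Wald interval on the link scale, back-transformed
# to the response scale with the family's inverse link
glm_ci <- function(mod, newdata, level = 0.95) {
  crit <- qnorm(1 - (1 - level) / 2)
  p    <- predict(mod, newdata = newdata, type = "link", se.fit = TRUE)
  inv  <- family(mod)$linkinv
  data.frame(fit = inv(p$fit),
             lwr = inv(p$fit - crit * p$se.fit),
             upr = inv(p$fit + crit * p$se.fit))
}

newd <- data.frame(x = seq(0, 2, length.out = 5))
glm_ci(pmod, newd)
```

Because the interval is built on the log scale and then exponentiated, the lower bound can never fall below zero, which a symmetric interval on the response scale would not guarantee.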
I stumbled upon Liu WenSui's method, which uses a bootstrap or a simulation approach to build prediction intervals for Poisson estimates.
Example from the author:
pkgs <- c('doParallel', 'foreach')
lapply(pkgs, require, character.only = TRUE)
registerDoParallel(cores = 4)
data(AutoCollision, package = "insuranceData")
df <- rbind(AutoCollision, AutoCollision)
mdl <- glm(Claim_Count ~ Age + Vehicle_Use, data = df, family = poisson(link = "log"))
new_fake <- df[1:5, 1:2]
boot_pi <- function(model, pdata, n, p) {
odata <- model$data
lp <- (1 - p) / 2
up <- 1 - lp
set.seed(2016)
seeds <- round(runif(n, 1, 1000), 0)
boot_y <- foreach(i = 1:n, .combine = rbind) %dopar% {
set.seed(seeds[i])
bdata <- odata[sample(seq(nrow(odata)), size = nrow(odata), replace = TRUE), ]
bpred <- predict(update(model, data = bdata), type = "response", newdata = pdata)
rpois(length(bpred), lambda = bpred)
}
boot_ci <- t(apply(boot_y, 2, quantile, c(lp, up)))
return(data.frame(pred = predict(model, newdata = pdata, type = "response"), lower = boot_ci[, 1], upper = boot_ci[, 2]))
}
boot_pi(mdl, new_fake, 1000, 0.95)
sim_pi <- function(model, pdata, n, p) {
odata <- model$data
yhat <- predict(model, type = "response")
lp <- (1 - p) / 2
up <- 1 - lp
set.seed(2016)
seeds <- round(runif(n, 1, 1000), 0)
sim_y <- foreach(i = 1:n, .combine = rbind) %dopar% {
set.seed(seeds[i])
sim_y <- rpois(length(yhat), lambda = yhat)
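# model$x only worked through partial matching of model$xlevels, so this keeps the factor predictors
sdata <- data.frame(y = sim_y, odata[names(model$xlevels)])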
refit <- glm(y ~ ., data = sdata, family = poisson)
bpred <- predict(refit, type = "response", newdata = pdata)
rpois(length(bpred),lambda = bpred)
}
sim_ci <- t(apply(sim_y, 2, quantile, c(lp, up)))
return(data.frame(pred = predict(model, newdata = pdata, type = "response"), lower = sim_ci[, 1], upper = sim_ci[, 2]))
}
sim_pi(mdl, new_fake, 1000, 0.95)

Linear regression in R (normal and logarithmic data)

I want to carry out a linear regression in R for data in a normal and in a double logarithmic plot.
For normal data the dataset might be the following:
lin <- data.frame(x = c(0:6), y = c(0.3, 0.1, 0.9, 3.1, 5, 4.9, 6.2))
plot (lin$x, lin$y)
There I want to calculate and draw a linear regression line using only data points 2, 3 and 4.
For double logarithmic data the dataset might be the following:
data = data.frame(
x=c(1:15),
y=c(
1.000, 0.742, 0.623, 0.550, 0.500, 0.462, 0.433,
0.051, 0.043, 0.037, 0.032, 0.028, 0.025, 0.022, 0.020
)
)
plot (data$x, data$y, log="xy")
Here I want to draw the regression line for the datasets 1:7 and for 8:15.
How can I calculate the slope and the y-offset as well as the parameters of the fit (R^2, p-value)?
How is it done for normal and for logarithmic data?
Thanks for your help,
Sven
In R, linear least squares models are fitted via the lm() function. Using the formula interface we can use the subset argument to select the data points used to fit the actual model, for example:
lin <- data.frame(x = c(0:6), y = c(0.3, 0.1, 0.9, 3.1, 5, 4.9, 6.2))
linm <- lm(y ~ x, data = lin, subset = 2:4)
giving:
R> linm
Call:
lm(formula = y ~ x, data = lin, subset = 2:4)
Coefficients:
(Intercept) x
-1.633 1.500
R> fitted(linm)
2 3 4
-0.1333333 1.3666667 2.8666667
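The question also asks for R^2 and a p-value. These can be pulled from summary() of the fitted model; a minimal sketch follows, where the object name s is arbitrary.

```r
lin  <- data.frame(x = 0:6, y = c(0.3, 0.1, 0.9, 3.1, 5, 4.9, 6.2))
linm <- lm(y ~ x, data = lin, subset = 2:4)

s <- summary(linm)
coef(linm)                        # intercept (y-offset) and slope
s$r.squared                       # R^2
s$coefficients["x", "Pr(>|t|)"]   # p-value for the slope
```

With only three points there is a single residual degree of freedom, so the p-value should be read with caution.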
As for the double log, you have two choices I guess; i) estimate two separate models as we did above, or ii) estimate via ANCOVA. The log transformation is done in the formula using log().
Via two separate models:
dat <- data  # dat is the double-log data frame from the question (called data there)
logm1 <- lm(log(y) ~ log(x), data = dat, subset = 1:7)
logm2 <- lm(log(y) ~ log(x), data = dat, subset = 8:15)
Or via ANCOVA, where we need an indicator variable
dat <- transform(dat, ind = factor(1:15 <= 7))
logm3 <- lm(log(y) ~ log(x) * ind, data = dat)
You might ask if these two approaches are equivalent? Well they are and we can show this via the model coefficients.
R> coef(logm1)
(Intercept) log(x)
-0.0001487042 -0.4305802355
R> coef(logm2)
(Intercept) log(x)
0.1428293 -1.4966954
So the two slopes are -0.4306 and -1.4967 for the separate models. The coefficients for the ANCOVA model are:
R> coef(logm3)
(Intercept) log(x) indTRUE log(x):indTRUE
0.1428293 -1.4966954 -0.1429780 1.0661152
How do we reconcile the two? Well, the way I set up ind, logm3 is parametrised to give more directly the values estimated from logm2; the intercepts of logm2 and logm3 are the same, as are the coefficients for log(x). To get the values equivalent to the coefficients of logm1, we need to do a manipulation, first for the intercept:
R> coefs <- coef(logm3)
R> coefs[1] + coefs[3]
(Intercept)
-0.0001487042
where the coefficient for indTRUE is the difference between the intercept of group 1 and the intercept of group 2. And for the slope:
R> coefs[2] + coefs[4]
log(x)
-0.4305802
which is the same as we got for logm1 and is based on the slope for group 2 (coefs[2]) modified by the difference in slope for group 1 (coefs[4]).
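The equivalence can also be checked programmatically rather than by eye. A self-contained sketch, rebuilding the data from the question under the name dat:

```r
dat <- data.frame(
  x = 1:15,
  y = c(1.000, 0.742, 0.623, 0.550, 0.500, 0.462, 0.433,
        0.051, 0.043, 0.037, 0.032, 0.028, 0.025, 0.022, 0.020))
dat$ind <- factor(1:15 <= 7)

logm1 <- lm(log(y) ~ log(x), data = dat, subset = 1:7)
logm2 <- lm(log(y) ~ log(x), data = dat, subset = 8:15)
logm3 <- lm(log(y) ~ log(x) * ind, data = dat)

coefs <- coef(logm3)
# Group 2 (ind == FALSE) is the reference level of the ANCOVA model
all.equal(unname(coefs[1:2]), unname(coef(logm2)))
# Group 1 = reference coefficients plus the indTRUE adjustments
all.equal(unname(c(coefs[1] + coefs[3], coefs[2] + coefs[4])),
          unname(coef(logm1)))
```

Both comparisons should return TRUE, because a model with a full interaction fits a separate intercept and slope for each group.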
As for plotting, an easy way is via abline() for simple models. E.g. for the normal data example:
plot(y ~ x, data = lin)
abline(linm)
For the log data we might need to be a bit more creative, and the general solution here is to predict over the range of data and plot the predictions:
pdat <- with(dat, data.frame(x = seq(from = head(x, 1), to = tail(x,1),
by = 0.1)))
pdat <- transform(pdat, yhat = c(predict(logm1, pdat[1:70,, drop = FALSE]),
predict(logm2, pdat[71:141,, drop = FALSE])))
We can then plot on the original scale by exponentiating yhat:
plot(y ~ x, data = dat)
lines(exp(yhat) ~ x, data = pdat, subset = 1:70, col = "red")
lines(exp(yhat) ~ x, data = pdat, subset = 71:141, col = "blue")
or on the log scale:
plot(log(y) ~ log(x), data = dat)
lines(yhat ~ log(x), data = pdat, subset = 1:70, col = "red")
lines(yhat ~ log(x), data = pdat, subset = 71:141, col = "blue")
This general solution works well for the more complex ANCOVA model too. Here I create a new pdat as before and add in an indicator
pdat <- with(dat, data.frame(x = seq(from = head(x, 1), to = tail(x,1),
by = 0.1)[1:140],
ind = factor(rep(c(TRUE, FALSE), each = 70))))
pdat <- transform(pdat, yhat = predict(logm3, pdat))
Notice how we get all the predictions we want from the single call to predict() because of the use of ANCOVA to fit logm3. We can now plot as before:
plot(y ~ x, data = dat)
lines(exp(yhat) ~ x, data = pdat, subset = 1:70, col = "red")
lines(exp(yhat) ~ x, data = pdat, subset = 71:140, col = "blue")  # this pdat has 140 rows
#Split the data into two groups
data1 <- data[1:7, ]
data2 <- data[8:15, ]
#Perform the regression
model1 <- lm(log(y) ~ log(x), data1)
model2 <- lm(log(y) ~ log(x), data2)
summary(model1)
summary(model2)
#Plot it
with(data, plot(x, y, log="xy"))
lines(1:7, exp(predict(model1, data.frame(x = 1:7))))
lines(8:15, exp(predict(model2, data.frame(x = 8:15))))
In general, splitting the data into different groups and running different models on different subsets is unusual, and probably bad form. You may want to consider adding a grouping variable
data$group <- factor(rep(letters[1:2], times = 7:8))
and running some sort of model on the whole dataset, e.g.,
model_all <- lm(log(y) ~ log(x) * group, data)
summary(model_all)
