Using `ordinal::clmm` model to make predictions on new data - r

I have some repeated measures, ordinal response data:
dat <- data.frame(
id = factor(sample(letters[1:5], 50, replace = T)),
response = factor(sample(1:7, 50, replace = T), ordered = T),
x1 = runif(n = 50, min = 1, max = 10),
x2 = runif(n = 50, min = 100, max = 1000)
)
I have built the following model:
library(ordinal)
model <- clmm(response ~ x1 + x2 + (1|id), data = dat)
I have some new data:
new_dat <- data.frame(
id = factor(sample(letters[1:5], 5, replace = T)),
x1 = runif(n = 5, min = 1, max = 10),
x2 = runif(n = 5, min = 100, max = 1000)
)
I want to be able to use the model to predict the probability of each level of dat$response occurring for new_dat, whilst still also accounting for id.
Unfortunately, predict() does not work for clmm objects. It does work for clmm2 objects, but it ignores any random effects.
What I want to achieve is something similar to Figure 3 of the ordinal package's clmm2 tutorial, which is produced with this code:
library(ordinal)
fm2 <- clmm2(rating ~ temp + contact, random=judge, data=wine, Hess=TRUE, nAGQ=10)
pred <- function(eta, theta, cat = 1:(length(theta)+1), inv.link = plogis){
Theta <- c(-1e3, theta, 1e3)
sapply(cat, function(j)
inv.link(Theta[j+1] - eta) - inv.link(Theta[j] - eta))
}
mat <- expand.grid(judge = qnorm(0.95) * c(-1, 0, 1) * fm2$stDev,
contact = c(0, fm2$beta[2]),
temp = c(0, fm2$beta[1]))
pred.mat <- pred(eta=rowSums(mat), theta=fm2$Theta)
lab <- paste("contact=", rep(levels(wine$contact), 2), ", ", "temp=", rep(levels(wine$temp), each=2), sep="")
par(mfrow=c(2, 2))
for(k in c(1, 4, 7, 10)) {
plot(1:5, pred.mat[k,], lty=2, type = "l", ylim=c(0,1),
xlab="Bitterness rating scale", axes=FALSE,
ylab="Probability", main=lab[ceiling(k/3)], las=1)
axis(1); axis(2)
lines(1:5, pred.mat[k+1, ], lty=1)
lines(1:5, pred.mat[k+2, ], lty=3)
legend("topright",
c("avg. judge", "5th %-tile judge", "95th %-tile judge"),
lty=1:3, bty="n")
}
Except that my model contains multiple continuous covariates (as opposed to binary factors).
How can I use the model data to predict the probability of each level of dat$response occurring for new_dat, whilst still also accounting for id?
Many thanks.
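For what it's worth, here is a minimal sketch of one way to adapt the tutorial's pred() function to this model. It is untested and assumes the clmm fit exposes the thresholds and fixed-effect slopes as model$Theta and model$beta (analogous to the clmm2 fit above; check str(model), or split them out of coef(model) otherwise), and that ranef(model)$id returns the conditional modes of the id intercepts:
re <- ranef(model)$id                        # assumed: data.frame of conditional modes, one row per id level
eta <- model$beta["x1"] * new_dat$x1 +
  model$beta["x2"] * new_dat$x2 +
  re[as.character(new_dat$id), 1]            # add each new observation's id-specific intercept
prob.mat <- pred(eta = eta, theta = model$Theta)  # rows: new_dat rows; columns: response levels 1-7
For a population-average prediction (a new, unseen id), set the random-effect term to 0 instead, or to qnorm(0.95) times the random-intercept SD with both signs to bracket typical ids, as in the tutorial figure.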

Related

adding labels and legend in graph_model function, reghelper package

I would like to add a legend and labels to my graph created using the reghelper package.
This is the code:
dv1 = runif(n = 100, min = 1, max = 7)
dv2 = runif(n = 100, min = 1, max = 7)
dv3 = runif(n = 100, min = 1, max = 7)
country <- rep(c("India", "US", "Poland"), length.out = 100)
df <- data.frame(country, dv1, dv2, dv3)
library(reghelper)
dv1 <- as.numeric(dv1)
dv2 <- as.numeric(dv2)
dv3 <- as.numeric(dv3)
country <- as.factor(country)
lm0 <- lm(dv1 ~ dv2 * dv3 + country, data = df, na.action = na.exclude)
summary(lm0)
graph_model(lm0, y=dv1, x=dv2, lines=country, split=dv3)
What should I add to the code to add a title, x and y labels, and legend labels?
Thank you in advance!
graph_model() produces a ggplot2 object, so you can manipulate it with the usual ggplot2 functions.
For example:
library(ggplot2)
p <- graph_model(lm0, y=dv1, x=dv2, lines=country, split=dv3) +
labs(x='New X', y='New Y', title='Title')
update_labels(p, list(colour="Legend"))

show x and y label in corrplot given residuals of chisq.test

Here is some reproducible code:
set.seed(42)
df <- data.frame(
x = round(runif(n = 100, min = 1, max = 3), 0)
, y = round(runif(n = 100, min = 1, max = 3), 0)
)
chisq <- chisq.test(
df$x
,df$y
, simulate.p.value = TRUE
)
chisq
options(repr.plot.width = 5, repr.plot.height = 5)
library(corrplot)
corrplot(chisq$residuals, is.cor = FALSE, xlab = "x", ylab = "y")
I would like the corrplot to show the labels x and y. I thought xlab and ylab would work, but they do not. Any ideas? Thanks!
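One possible workaround (a rough sketch; corrplot() draws with base graphics and ignores xlab/ylab): reserve some margin room via corrplot's mar argument and add the axis titles afterwards with mtext():
library(corrplot)
corrplot(chisq$residuals, is.cor = FALSE, mar = c(1, 3, 3, 1))  # leave room in the left and top margins
mtext("y", side = 3, line = 1)  # column labels of the residual matrix are drawn on top
mtext("x", side = 2, line = 1)  # row labels are drawn on the left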

R GLM: Modify coefficients of an existing glm model

I have been trying to adjust the coefficients of an existing glm model but the predictions don't seem to change. The idea is to enhance an existing logistic model by incorporating 'qualitative' parameters in the quantitative coefficients (see 'adj model' block). I replicated the problem below.
I really appreciate any help. Thank you!
set.seed(100)
#create sim data (correlated)
input_size <- 200
scale <- 10000
y_var = sample(0:1, input_size, replace = TRUE)
input_data <- cbind.data.frame(y_var, x1 = sample(1:1000, input_size, replace = TRUE) + (y_var*200), x2 = sample(1:50, input_size, replace = TRUE) - (y_var*30))
cor(input_data)
#build log-reg model
reg1 <- glm(input_data$y ~ input_data$x1 + input_data$x2, data = input_data, family = "binomial")
reg1$coefficients
#test log-reg model
input_test <- cbind.data.frame(x1 = sample(1:1000, input_size, replace = TRUE) + (y_var*400), x2 = sample(1:50, input_size, replace = TRUE) - (y_var*10))
y_predict <- predict(reg1, input_test, type="response")
#adjust log-reg model
adj_coeff <- round(c(intercept = reg1$coefficients[1], x1 = reg1$coefficients[2] * 3, x2 = -reg1$coefficients[3] * 0.5), 4)
reg2 <- reg1
reg2$coefficients <- as.numeric(adj_coeff)
reg2$coefficients
#visualize predictions of the log-reg models
y2_predict <- predict(reg1, input_test, type="response")
plot(y_predict, type = "p", lwd = 2)
lines(y2_predict, type = "p", pch = 3, col = "orange")
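A hedged note on this reproducible example, in case it helps: y2_predict is computed from reg1 rather than reg2, and because the formula refers to input_data$x1 / input_data$x2 directly, predict() ignores input_test entirely. A minimal corrected sketch of the same adjustment idea:
# Refit with plain column names so that predict() honours newdata
reg1 <- glm(y_var ~ x1 + x2, data = input_data, family = "binomial")
adj_coeff <- c(reg1$coefficients[1], reg1$coefficients[2] * 3, -reg1$coefficients[3] * 0.5)
reg2 <- reg1
reg2$coefficients <- as.numeric(adj_coeff)              # overwrite with the adjusted values
y_predict  <- predict(reg1, newdata = input_test, type = "response")
y2_predict <- predict(reg2, newdata = input_test, type = "response")  # note: reg2, not reg1
plot(y_predict, lwd = 2)
points(y2_predict, pch = 3, col = "orange")             # the two sets of predictions now differ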

How to extract the Prediction Intervals of a Gaussian Process Regression via caret kernlab package?

I am trying to use a Gaussian Process Regression (GPR) model to predict hourly streamflow discharges in a river. I've got good results applying caret::train() with the kernlab "gaussprRadial" method (thanks, Kuhn!).
Since uncertainty quantification is one of the main inherent advantages of GPR, I would like to know if anyone could help me access the prediction interval for the test dataset.
I'll put an extract of the code I've been working on. Since my real data are huge (and, honestly, I don't know how to post them here), I'll illustrate with data(airquality). The main goal in this particular example is to predict airquality$Ozone, using the lagged variables of airquality$Temp as predictors.
rm(list = ls())
data(airquality)
airquality = na.omit(as.data.frame(airquality)); str(airquality)
library(tidyverse)
library(magrittr)
airquality$Ozone %>% plot(type = 'l')
lines(airquality$Temp, col = 2)
legend("topleft", legend = c("Ozone", "Temperature"),
col=c(1, 2), lty = 1:1, cex = 0.7, text.font = 4, inset = 0.01,
box.lty=0, lwd = 1)
attach(airquality)
df_lags <- airquality %>%
mutate(Temp_lag1 = lag(n = 1L, Temp)) %>%
na.omit()
ESM_train = data.frame(df_lags[1:81, ]) # Training Observed 75% dataset
ESM_test = data.frame(df_lags[82:nrow(df_lags), ]) # Testing Observed 25% dataset
grid_gaussprRadial = expand.grid(.sigma = c(0.001, 0.01, 0.05, 0.1, 0.5, 1, 2)) # Sigma parameters searching for GPR
# TRAIN MODEL ############################
# Tuning set
library(caret)
set.seed(111)
cvCtrl <- trainControl(
method ="repeatedcv",
repeats = 1,
number = 20,
allowParallel = TRUE,
verboseIter = TRUE,
savePredictions = "final")
# Train (aprox. 4 seconds time-simulation)
attach(ESM_train)
set.seed(111)
system.time(Model_train <- caret::train(Ozone ~ Temp + Temp_lag1,
trControl = cvCtrl,
data = ESM_train,
metric = "MAE", # Using MAE since I intend minimum values are my focus
preProcess = c("center", "scale"),
method = "gaussprRadial", # Setting RBF kernel function
tuneGrid = grid_gaussprRadial,
maxit = 1000,
linout = 1)) # Regression type
plot(Model_train)
Model_train
ESM_results_train <- Model_train$resample %>% mutate(Model = "") # K-fold Training measures
# Select the interested TRAIN data and arrange them as dataframe
Ozone_Obs_Tr = Model_train$pred$obs
Ozone_sim = Model_train$pred$pred
Resid = Ozone_Obs_Tr - Ozone_sim
train_results = data.frame(Ozone_Obs_Tr,
Ozone_sim,
Resid)
# Plot Obs x Simulated train results
library(ggplot2)
ggplot(data = train_results, aes(x = Ozone_Obs_Tr, y = Ozone_sim)) +
geom_point() +
geom_abline(intercept = 0, slope = 1, color = "black")
# TEST MODEL ############################
# From "ESM_test" dataframe, we predict ESM Ozone time series, adding it in "ESM_forecasted" dataframe
ESM_forecasted = ESM_test %>%
mutate(Ozone_Pred = predict(Model_train, newdata = ESM_test, variance.model = TRUE))
str(ESM_forecasted)
# Select the interested TEST data and arrange them as a dataframe
Ozone_Obs = ESM_forecasted$Ozone
Ozone_Pred = ESM_forecasted$Ozone_Pred
# Plot Obs x Predicted TEST results
ggplot(data = ESM_forecasted, aes(x = Ozone_Obs, y = Ozone_Pred)) +
geom_point() +
geom_abline(intercept = 0, slope = 1, color = "black")
# Model performance #####
library(hydroGOF)
gof_TR = gof(Ozone_sim, Ozone_Obs_Tr)
gof_TEST = gof(Ozone_Pred,Ozone_Obs)
Performances = data.frame(
Train = gof_TR,
Test = gof_TEST
); Performances
# Plot the TEST prediction
attach(ESM_forecasted)
plot(Ozone_Obs, type = "l", xlab = "", ylab = "", ylim = range(Ozone_Obs, Ozone_Pred))
lines(Ozone_Pred , col = "coral2", lty = 2, lwd = 2)
legend("top", legend = c("Ozone Obs Test", "Ozone Pred Test"),
col=c(1, "coral2"), lty = 1:2, cex = 0.7, text.font = 4, inset = 0.01, box.lty=0, lwd = 2)
These last lines plot the observed and predicted Ozone test series.
The next, and last, step would be to extract the prediction intervals, which are based on a Gaussian distribution around each prediction point, and plot them together with this last plot.
The caret::train() approach returned better predictions than, for instance, using kernlab::gausspr() or tgp::bgp() directly. For both of those I could find the prediction interval.
For example, the prediction intervals from tgp::bgp() can be picked up with:
Upper_Bound <- Ozone_Pred$ZZ.q2 # roughly Ozone_Pred + 2 * sigma
Lower_Bound <- Ozone_Pred$ZZ.q1 # roughly Ozone_Pred - 2 * sigma
Therefore, via caret::train(), I hope the required standard deviations can be found by typing something like
Model_train$...
or maybe, with
Ozone_Pred$...
Moreover, at https://stats.stackexchange.com/questions/414079/can-mad-median-absolute-deviation-or-mae-mean-absolute-error-be-used-to-calc,
Stephan Kolassa explained that we could estimate prediction intervals through the MAE, or even the RMSE. But I am not sure whether that applies here, since the MAE I got is just a comparison between observed and predicted Ozone data in this example.
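For reference, that RMSE-based suggestion amounts to something like the following rough sketch, using the cross-validated training residuals computed above (it assumes approximately Gaussian, homoscedastic residuals, which may not hold for streamflow data):
rmse  <- sqrt(mean(Resid^2))                       # RMSE of the cross-validated training predictions
upper <- ESM_forecasted$Ozone_Pred + 1.96 * rmse   # approximate 95% prediction interval
lower <- ESM_forecasted$Ozone_Pred - 1.96 * rmse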
This solution is very important to me! I think I am close to obtaining my main results, but I don't know what else to try.
Thanks a lot, friends!
I don't really know how the caret framework works, but getting a prediction interval for a GP regression with a Gaussian likelihood is easy enough to do manually.
First we just need a function for the squared exponential kernel, also called the radial basis function kernel, which is what you were using. sf here is the scale factor (unused in the kernlab implementation), and ell is the length scale, called sigma in the kernlab implementation:
covSEiso <- function(x1, x2 = x1, sf = 1.0, ell = 1.0) {
sf <- sf^2
ell <- -0.5 * (1 / (ell^2))
n <- nrow(x1)
m <- nrow(x2)
d <- ncol(x1)
result <- matrix(0, nrow = n, ncol = m)
for ( j in 1:m ) {
for ( i in 1:n ) {
result[i, j] <- sf * exp(ell * sum((x1[i, ] - x2[j, ])^2))
}
}
return(result)
}
I'm not sure what your code says about which length scale to use; below I will use a length scale of 25 and scale factor of 50 (obtained via GPML's hyperparameter optimization routines). Then we use the covSEiso() function above to get the relevant covariances, and the rest is application of basic Gaussian identities. I would refer you to Chapter 2 of Rasmussen and Williams (2006) (graciously provided for free online).
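For reference, the Gaussian identities being applied below are the standard GP predictive equations, with $K$ the training covariance, $K_*$ the test-by-training covariance (Kstar), $K_{**}$ the test covariance (Ktest), and $\sigma_n^2$ the observation-noise variance:
$$\bar{f}_* = K_* (K + \sigma_n^2 I)^{-1} y, \qquad \operatorname{cov}(f_*) = K_{**} - K_* (K + \sigma_n^2 I)^{-1} K_*^{\top}$$
A prediction interval for new observations then adds the noise variance $\sigma_n^2$ to the diagonal of $\operatorname{cov}(f_*)$ before taking square roots.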
data(airquality)
library(tidyverse)
library(magrittr)
df_lags <- airquality %>%
mutate(Temp_lag1 = lag(n = 1L, Temp)) %>%
na.omit()
ESM_train <- data.frame(df_lags[1:81, ]) # Training Data 75% dataset
ESM_test <- data.frame(df_lags[82:nrow(df_lags), ]) # Testing Data 25% dataset
## For convenience I'll define separately the training and test inputs
X <- ESM_train[ , c("Temp", "Temp_lag1")]
Xstar <- ESM_test[ , c("Temp", "Temp_lag1")]
## Get the kernel manually
K <- covSEiso(X, ell = 25, sf = 50)
## We also need covariance between the test cases
Kstar <- covSEiso(Xstar, X, ell = 25, sf = 50)
Ktest <- covSEiso(Xstar, ell = 25, sf = 50)
## Now the 95% credible region for the posterior is
predictive_mean <- Kstar %*% solve(K + diag(nrow(K))) %*% ESM_train$Ozone
predictive_var <- Ktest - (Kstar %*% solve(K + diag(nrow(K))) %*% t(Kstar))
## Then for the prediction interval we only need to add the observation noise
z <- sqrt(diag(predictive_var)) + 25
interval_high <- predictive_mean + 2 * z
interval_low <- predictive_mean - 2 * z
Then we can check out the prediction intervals by plotting them (the plotting code at the end of this answer produces these plots).
This all is pretty easy to do via my gpmlr package (available on GitHub), which can call GPML from R if you have Octave installed:
data(airquality)
library(tidyverse)
library(magrittr)
library(gpmlr)
df_lags <- airquality %>%
mutate(Temp_lag1 = lag(n = 1L, Temp)) %>%
na.omit()
ESM_train <- data.frame(df_lags[1:81, ]) # Training Data 75% dataset
ESM_test <- data.frame(df_lags[82:nrow(df_lags), ]) # Testing Data 25% dataset
X <- as.matrix(ESM_train[ , c("Temp", "Temp_lag1")])
y <- ESM_train$Ozone
Xs <- as.matrix(ESM_test[ , c("Temp", "Temp_lag1")])
ys <- ESM_test$Ozone
hyp0 <- list(mean = numeric(), cov = c(0, 0), lik = 0)
hyp <- set_hyperparameters(hyp0, "infExact", "meanZero", "covSEiso","likGauss",
X, y)
gp_res <- gp(hyp, "infExact", "meanZero", "covSEiso", "likGauss", X, y, Xs, ys)
predictive_mean <- gp_res$YMU
interval_high <- gp_res$YMU + 2 * sqrt(gp_res$YS2)
interval_low <- gp_res$YMU - 2 * sqrt(gp_res$YS2)
Then just plot the predictions (the same plotting code works for either approach, since both produce predictive_mean, interval_high, and interval_low):
plot(NULL, xlab = "", ylab = "", xaxt = "n", yaxt = "n",
xlim = range(ESM_test$Temp), ylim = range(c(interval_high, interval_low)))
axis(1, tick = FALSE, line = -0.75)
axis(2, tick = FALSE, line = -0.75)
mtext("Temp", 1, 1.5)
mtext("Ozone", 2, 1.5)
idx <- order(ESM_test$Temp)
polygon(c(ESM_test$Temp[idx], rev(ESM_test$Temp[idx])),
c(interval_high[idx], rev(interval_low[idx])),
border = NA, col = "#80808080")
lines(ESM_test$Temp[idx], predictive_mean[idx])
points(ESM_test$Temp, ESM_test$Ozone, pch = 19)
plot(NULL, xlab = "", ylab = "", xaxt = "n", yaxt = "n",
xlim = range(ESM_test$Temp_lag1), ylim = range(c(interval_high, interval_low)))
axis(1, tick = FALSE, line = -0.75)
axis(2, tick = FALSE, line = -0.75)
mtext("Temp_lag1", 1, 1.5)
mtext("Ozone", 2, 1.5)
idx <- order(ESM_test$Temp_lag1)
polygon(c(ESM_test$Temp_lag1[idx], rev(ESM_test$Temp_lag1[idx])),
c(interval_high[idx], rev(interval_low[idx])),
border = NA, col = "#80808080")
lines(ESM_test$Temp_lag1[idx], predictive_mean[idx])
points(ESM_test$Temp_lag1, ESM_test$Ozone, pch = 19)

lme4::glmer.nb function produces "Error in family$family : $ operator not defined for this S4 class" depending on the order I run models

library(lme4)
dummy <- as.data.frame(cbind(speed = rpois(100, 10), pop = rep(1:4, each = 25), season = rep(1:2, each = 50), id = seq(1, 100, by = 1)))
dummy2 <- as.data.frame(cbind(speed = c(rnbinom(50, 10, 0.6), rnbinom(50, 10, 0.1)), pop = rep(1:4, each = 25), season = rep(1:2, each = 50), id = seq(1, 100, by = 1)))
poisson <- glmer(speed~pop*season + (1|id),
data=dummy, family="poisson")
neg.bin <- glmer.nb(speed ~ pop*season + (1|id),
data=dummy2, control=glmerControl(optimizer="bobyqa"))
When I run a script creating a Poisson model before a negative binomial model using the lme4 package, I get the following error when running the neg.bin model:
Error in family$family : $ operator not defined for this S4 class
However, if I run the models in the opposite order, I don't get the error message.
library(lme4)
dummy <- as.data.frame(cbind(speed = rpois(100, 10), pop = rep(1:4, each = 25), season = rep(1:2, each = 50), id = seq(1, 100, by = 1)))
dummy2 <- as.data.frame(cbind(speed = c(rnbinom(50, 10, 0.6), rnbinom(50, 10, 0.1)), pop = rep(1:4, each = 25), season = rep(1:2, each = 50), id = seq(1, 100, by = 1)))
neg.bin <- glmer.nb(speed ~ pop*season + (1|id),
data=dummy2, control=glmerControl(optimizer="bobyqa"))
poisson <- glmer(speed~pop*season + (1|id),
data=dummy, family="poisson")
The neg.bin model example does have convergence warnings, but the same pattern happens with my actual models, which converge fine. How is running the Poisson model first affecting the neg.bin model?
It is because you have masked the R function poisson. The following works fine (apart from a convergence warning for the negative binomial fit):
library(lme4)
set.seed(0)
dummy <- as.data.frame(cbind(speed = rpois(100, 10), pop = rep(1:4, each = 25), season = rep(1:2, each = 50), id = seq(1, 100, by = 1)))
dummy2 <- as.data.frame(cbind(speed = c(rnbinom(50, 10, 0.6), rnbinom(50, 10, 0.1)), pop = rep(1:4, each = 25), season = rep(1:2, each = 50), id = seq(1, 100, by = 1)))
## use a different name for your model, say `poisson_fit`
poisson_fit <- glmer(speed~pop*season + (1|id),
data=dummy, family="poisson")
negbin_fit <- glmer.nb(speed ~ pop*season + (1|id),
data=dummy2, control=glmerControl(optimizer="bobyqa"))
Here is the issue. Among the very first few lines of glmer.nb there is one line:
mc$family <- quote(poisson)
So if you mask poisson, the correct poisson function from the stats package cannot be found.
Ben has just fixed this issue by replacing that line with:
mc$family <- quote(stats::poisson)
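A tiny standalone illustration of the masking behaviour and why the namespace-qualified form avoids it (hypothetical objects, not lme4 internals):
poisson <- lm(dist ~ speed, data = cars)  # an object named `poisson` now masks stats::poisson
fam <- eval(quote(poisson))               # finds the lm object, not the family generator
class(fam)                                # "lm"
fam2 <- eval(quote(stats::poisson))       # namespace-qualified lookup still finds the family
class(fam2())                             # "family"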
My original observation about family = "poisson" and the match.fun mechanism is not the real issue here. It only explains why, in routines like glm and mgcv::gam, it is legitimate to pass the family as a string.
