Predicting from a flexmix object (R)

I fit some data to a mixture distribution of two Gaussians in flexmix:
data("NPreg", package = "flexmix")
mod <- flexmix(yn ~ x, data = NPreg, k = 2,
model = list(FLXMRglm(yn ~ x, family= "gaussian"),
FLXMRglm(yn ~ x, family = "gaussian")))
The model fit is as follows:
> mod
Call:
flexmix(formula = yn ~ x, data = NPreg, k = 2, model = list(FLXMRglm(yn ~ x, family = "gaussian"),
FLXMRglm(yn ~ x, family = "gaussian")))
Cluster sizes:
1 2
74 126
convergence after 31 iterations
But how do I actually predict from this model?
When I do
pred <- predict(mod, NPreg)
I get a list with the predictions from each of the two components.
To get a single prediction, do I have to weight them by the cluster sizes like this?
single <- (74/200)* pred$Comp.1[,1] + (126/200)*pred$Comp.2[,2]

I use flexmix for prediction in the following way:
pred = predict(mod, NPreg)
clust = clusters(mod,NPreg)
result = cbind(NPreg,data.frame(pred),data.frame(clust))
plot(result$yn,col = c("red","blue")[result$clust],pch = 16,ylab = "yn")
And the confusion matrix:
table(result$class,result$clust)
For getting the predicted values of yn, I select the component value of the cluster to which a data point belongs.
for(i in 1:nrow(result)){
result$pred_model1[i] = result[,paste0("Comp.",result$clust[i],".1")][i]
result$pred_model2[i] = result[,paste0("Comp.",result$clust[i],".2")][i]
}
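The row-by-row loop above can also be written with matrix indexing. This is only a minimal sketch on made-up numbers: pred_mat and clust are hypothetical stand-ins for one model's per-component predictions and for the cluster labels returned by clusters().

```r
# Hypothetical per-component predictions for three observations
# (rows = observations, columns = components).
pred_mat <- cbind(Comp.1 = c(1, 2, 3),
                  Comp.2 = c(10, 20, 30))

# Hypothetical hard cluster assignment for each observation.
clust <- c(1, 2, 1)

# Indexing with a two-column matrix (row, column) picks one cell per row:
# the prediction of the component each observation was assigned to.
picked <- pred_mat[cbind(seq_along(clust), clust)]
picked
```

Indexing a matrix with a two-column (row, column) matrix is standard base R behaviour, so no loop is needed.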
The plot of actual vs. predicted values shows the fit. Only pred_model1 is plotted here because both of your models are the same; use pred_model2 for the second model.
library(ggplot2)
qplot(result$yn, result$pred_model1, xlab = "Actual", ylab = "Predicted") + geom_abline()
RMSE = sqrt(mean((result$yn-result$pred_model1)^2))
gives a root mean square error of 5.54.
This answer is based on many SO answers I read through while working with flexmix. It worked well for my problem.
You may also be interested in visualizing the two distributions. My model was the following; the components overlap somewhat, as the ratios are not close to 1.
Call:
flexmix(formula = yn ~ x, data = NPreg, k = 2,
model = list(FLXMRglm(yn ~ x, family = "gaussian"),
FLXMRglm(yn ~ x, family = "gaussian")))
prior size post>0 ratio
Comp.1 0.481 102 129 0.791
Comp.2 0.519 98 171 0.573
'log Lik.' -1312.127 (df=13)
AIC: 2650.255 BIC: 2693.133
I also plot histograms with density curves to visualize both components. This was inspired by an SO answer from the maintainer of betareg.
a = subset(result, clust == 1)
b = subset(result, clust == 2)
hist(a$yn, col = hcl(0, 50, 80), main = "",xlab = "", freq = FALSE, ylim = c(0,0.06))
hist(b$yn, col = hcl(240, 50, 80), add = TRUE,main = "", xlab = "", freq = FALSE, ylim = c(0,0.06))
ys = seq(0, 50, by = 0.1)
lines(ys, dnorm(ys, mean = mean(a$yn), sd = sd(a$yn)), col = hcl(0, 80, 50), lwd = 2)
lines(ys, dnorm(ys, mean = mean(b$yn), sd = sd(b$yn)), col = hcl(240, 80, 50), lwd = 2)
# Joint Histogram
p <- prior(mod)
hist(result$yn, freq = FALSE,main = "", xlab = "",ylim = c(0,0.06))
lines(ys, p[1] * dnorm(ys, mean = mean(a$yn), sd = sd(a$yn)) +
p[2] * dnorm(ys, mean = mean(b$yn), sd = sd(b$yn)))

You can pass an additional argument to your prediction call.
pred <- predict(mod, NPreg, aggregate = TRUE)[[1]][,1]
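To make the idea of a single aggregated prediction concrete, here is a minimal base-R sketch. It is not flexmix's internal code; it assumes that the aggregated value is a weighted average of the component predictions, with mixing weights that sum to one (for a fitted model these might come from prior(mod)). The objects comp_pred and w are made-up illustration values.

```r
# Hypothetical per-component predictions for three observations
# (rows = observations, columns = mixture components).
comp_pred <- cbind(Comp.1 = c(10, 12, 14),
                   Comp.2 = c(30, 28, 26))

# Hypothetical mixing weights (assumption: they sum to 1).
w <- c(0.4, 0.6)

# Weighted average across components: sum over k of w[k] * comp_pred[, k].
aggregated <- as.vector(comp_pred %*% w)
aggregated
```

Unlike picking the component of the assigned cluster, this blends both components for every observation.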


How to select appropriate sin() terms to fit a time series using R

I want to fit a time series with sin() functions because it appears periodic (it has crests and troughs). So far I have only guessed the periods, e.g., 1 month, 2 months, ..., 1 year, 2 years. Is there a function in R to estimate the multiple periods in a data series?
Below is an example that I want to fit with a combination of sin() functions. The expression in lm() is an attempt after several guesses (red line in the figure below). How can I find the sin() terms with appropriate periods?
t <- 1:365
y <- c(-1,-1.3,-1.6,-1.8,-2.1,-2.3,-2.5,-2.7,-2.9,-3,-2,-1.1,-0.3,0.5,1.1,1.6,2.1,2.5,2.8,3.1,3.4,3.7,4.2,4.6,5,5.3,5.7,5.9,6.2,5.8,5.4,5,4.6,4.2,3.9,3.6,3.4,3.1,2.9,2.8,2.6,2.5,2.3,1.9,1.5,1.1,0.8,0.5,0.2,0,-0.1,-0.3,-0.4,-0.5,-0.5,-0.6,-0.7,-0.8,-0.9,-0.8,-0.6,-0.3,-0.1,0.1,0.4,0.6,0.9,1.1,1.3,1.5,1.7,2.1,2.4,2.7,3,3.3,3.5,3.8,4.3,4.7,5.1,5.5,5.9,6.2,6.4,6.6,6.7,6.8,6.8,6.9,7,6.9,6.8,6.7,
6.5,6.4,6.4,6.3,6.2,6,5.9,5.7,5.6,5.5,5.4,5.4,5.1,4.9,4.8,4.6,4.5,4.4,4.3,3.9,3.6,3.3,3,2.8,2.6,2.4,2.6,2.5,2.4,2.3,2.3,2.2,2.2,2.3,2.4,2.4,2.5,2.5,2.6,2.6,2.4,2.1,1.9,1.8,1.6,1.4,1.3,1,0.7,0.5,0.2,0,-0.2,-0.4,-0.2,-0.1,0.1,0.1,0.1,0.1,0.1,0.1,0,0,-0.1,-0.1,-0.2,-0.2,-0.3,-0.3,-0.4,-0.5,-0.5,-0.6,-0.7,-0.7,-0.8,-0.8,-0.8,-0.9,-0.9,-0.9,-1.3,-1.6,-1.9,-2.1,-2.3,-2.6,-2.9,-2.9,-2.9,-2.9,
-2.9,-3,-3,-3,-2.8,-2.7,-2.5,-2.4,-2.3,-2.2,-2.1,-2,-2,-1.9,-1.9,-1.8,-1.8,-1.8,-1.9,-1.9,-2,-2.1,-2.2,-2.2,-2.3,-2.4,-2.5,-2.6,-2.7,-2.8,-2.9,-2.9,-2.9,-2.9,-2.9,-2.9,-2.9,-2.9,-2.9,-2.9,-2.8,-2.8,-2.7,-2.7,-2.6,-2.6,-2.8,-3,-3.1,-3.3,-3.4,-3.5,-3.6,-3.5,-3.4,-3.3,-3.3,-3.2,-3,-2.9,-2.8,-2.8,-2.7,-2.6,-2.6,-2.6,-2.5,-2.6,-2.7,-2.8,-2.8,-2.9,-3,-3,-3,-3,-2.9,-2.9,-2.9,-2.9,-2.9,-2.8,
-2.7,-2.6,-2.5,-2.4,-2.3,-2.3,-2.1,-1.9,-1.8,-1.7,-1.5,-1.4,-1.3,-1.5,-1.7,-1.8,-1.9,-2,-2.1,-2.2,-2.4,-2.5,-2.6,-2.7,-2.8,-2.8,-2.9,-3.1,-3.2,-3.3,-3.4,-3.5,-3.5,-3.6,-3.6,-3.5,-3.4,-3.3,-3.2,-3.1,-3,-2.7,-2.3,-2,-1.8,-1.5,-1.3,-1.1,-0.9,-0.7,-0.6,-0.5,-0.3,-0.2,-0.1,-0.3,-0.5,-0.6,-0.7,-0.8,-0.9,-1,-1.1,-1.1,-1.2,-1.2,-1.2,-1.2,-1.2,-0.8,-0.4,-0.1,0.2,0.5,0.8,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0.6,0.3,0,-0.2,-0.5,-0.7,-0.8)
dt <- data.frame(t = t, y = y)
plot(x = dt$t, y = dt$y)
lm <- lm(y ~ sin(2*3.1416/365*t)+cos(2*3.1416/365*t)+
sin(2*2*3.1416/365*t)+cos(2*2*3.1416/365*t)+
sin(2*4*3.1416/365*t)+cos(2*4*3.1416/365*t)+
sin(2*5*3.1416/365*t)+cos(2*5*3.1416/365*t)+
sin(2*6*3.1416/365*t)+cos(2*6*3.1416/365*t)+
sin(2*0.5*3.1416/365*t)+cos(2*0.5*3.1416/365*t),
data = dt)
summary(lm)$adj.r.squared
plot(dt$y); lines(predict(lm), type = "l", col = "red")
Package forecast has the fourier function, which generates Fourier series terms from time series objects so that they can be used as regressors.
For example:
library(forecast)
dt$y <- ts(dt$y, frequency = 365)
lm<- lm(y ~ fourier(y, K=6), dt)
plot(dt$t, dt$y); lines(predict(lm), type = "l", col = "red")
Following my comment to the question,
In catastrophic-failure's answer replace Mod by Re as in SleuthEye's answer. Then call nff(y, 20, col = "red").
I realized that there is another correction to function nff to be made:
substitute length(x) or xlen for the magic number 73.
Here is the function corrected.
nff = function(x = NULL, n = NULL, up = 10L, plot = TRUE, add = FALSE, main = NULL, ...){
#The direct transformation
#The first frequency is DC, the rest are duplicated
dff = fft(x)
#The time
xlen <- length(x)
t = seq_along(x)
#Upsampled time
nt = seq(from = 1L, to = xlen + 1L - 1/up, by = 1/up)
#New spectrum
ndff = array(data = 0, dim = c(length(nt), 1L))
ndff[1] = dff[1] #Always, it's the DC component
if(n != 0){
ndff[2:(n+1)] <- dff[2:(n+1)] #The positive frequencies always come first
#The negative ones are trickier
ndff[(length(ndff) - n + 1):length(ndff)] <- dff[(xlen - n + 1L):xlen]
}
#The inverses
indff = fft(ndff/xlen, inverse = TRUE)
idff = fft(dff/xlen, inverse = TRUE)
if(plot){
if(!add){
plot(x = t, y = x, pch = 16L, xlab = "Time", ylab = "Measurement",
main = ifelse(is.null(main), paste(n, "harmonics"), main))
lines(y = Re(idff), x = t, col = adjustcolor(1L, alpha = 0.5))
}
lines(y = Re(indff), x = nt, ...)
}
ret = data.frame(time = nt, y = Mod(indff))
return(ret)
}
y <- c(-1,-1.3,-1.6,-1.8,-2.1,-2.3,-2.5,-2.7,-2.9,-3,-2,-1.1,-0.3,0.5,1.1,1.6,2.1,2.5,2.8,3.1,3.4,3.7,4.2,4.6,5,5.3,5.7,5.9,6.2,5.8,5.4,5,4.6,4.2,3.9,3.6,3.4,3.1,2.9,2.8,2.6,2.5,2.3,1.9,1.5,1.1,0.8,0.5,0.2,0,-0.1,-0.3,-0.4,-0.5,-0.5,-0.6,-0.7,-0.8,-0.9,-0.8,-0.6,-0.3,-0.1,0.1,0.4,0.6,0.9,1.1,1.3,1.5,1.7,2.1,2.4,2.7,3,3.3,3.5,3.8,4.3,4.7,5.1,5.5,5.9,6.2,6.4,6.6,6.7,6.8,6.8,6.9,7,6.9,6.8,6.7,
6.5,6.4,6.4,6.3,6.2,6,5.9,5.7,5.6,5.5,5.4,5.4,5.1,4.9,4.8,4.6,4.5,4.4,4.3,3.9,3.6,3.3,3,2.8,2.6,2.4,2.6,2.5,2.4,2.3,2.3,2.2,2.2,2.3,2.4,2.4,2.5,2.5,2.6,2.6,2.4,2.1,1.9,1.8,1.6,1.4,1.3,1,0.7,0.5,0.2,0,-0.2,-0.4,-0.2,-0.1,0.1,0.1,0.1,0.1,0.1,0.1,0,0,-0.1,-0.1,-0.2,-0.2,-0.3,-0.3,-0.4,-0.5,-0.5,-0.6,-0.7,-0.7,-0.8,-0.8,-0.8,-0.9,-0.9,-0.9,-1.3,-1.6,-1.9,-2.1,-2.3,-2.6,-2.9,-2.9,-2.9,-2.9,
-2.9,-3,-3,-3,-2.8,-2.7,-2.5,-2.4,-2.3,-2.2,-2.1,-2,-2,-1.9,-1.9,-1.8,-1.8,-1.8,-1.9,-1.9,-2,-2.1,-2.2,-2.2,-2.3,-2.4,-2.5,-2.6,-2.7,-2.8,-2.9,-2.9,-2.9,-2.9,-2.9,-2.9,-2.9,-2.9,-2.9,-2.9,-2.8,-2.8,-2.7,-2.7,-2.6,-2.6,-2.8,-3,-3.1,-3.3,-3.4,-3.5,-3.6,-3.5,-3.4,-3.3,-3.3,-3.2,-3,-2.9,-2.8,-2.8,-2.7,-2.6,-2.6,-2.6,-2.5,-2.6,-2.7,-2.8,-2.8,-2.9,-3,-3,-3,-3,-2.9,-2.9,-2.9,-2.9,-2.9,-2.8,
-2.7,-2.6,-2.5,-2.4,-2.3,-2.3,-2.1,-1.9,-1.8,-1.7,-1.5,-1.4,-1.3,-1.5,-1.7,-1.8,-1.9,-2,-2.1,-2.2,-2.4,-2.5,-2.6,-2.7,-2.8,-2.8,-2.9,-3.1,-3.2,-3.3,-3.4,-3.5,-3.5,-3.6,-3.6,-3.5,-3.4,-3.3,-3.2,-3.1,-3,-2.7,-2.3,-2,-1.8,-1.5,-1.3,-1.1,-0.9,-0.7,-0.6,-0.5,-0.3,-0.2,-0.1,-0.3,-0.5,-0.6,-0.7,-0.8,-0.9,-1,-1.1,-1.1,-1.2,-1.2,-1.2,-1.2,-1.2,-0.8,-0.4,-0.1,0.2,0.5,0.8,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0.6,0.3,0,-0.2,-0.5,-0.7,-0.8)
res <- nff(y, 20, col = "red")
str(res)
#> 'data.frame': 3650 obs. of 2 variables:
#> $ time: num 1 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 ...
#> $ y : num 1.27 1.31 1.34 1.37 1.4 ...
Created on 2022-10-17 with reprex v2.0.2
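The same FFT can also help decide which periods to put into the sin()/cos() terms in the first place. The following is a minimal sketch on a synthetic signal whose two periods (120 and 30 time steps) are chosen by hand as an assumption; it computes a raw periodogram and reads off the strongest harmonics. With real data the record rarely contains a whole number of cycles, so the peaks are smeared over neighbouring harmonics and should be read as rough guides only.

```r
# Synthetic signal with two known periods (assumed example data).
n <- 360
t <- seq_len(n)
y <- 3 * sin(2 * pi * t / 120) + 1.5 * sin(2 * pi * t / 30)

# Raw periodogram from the FFT of the centred series.
spec <- Mod(fft(y - mean(y)))^2 / n

# Harmonic k means k full cycles over the record; skip the DC term
# (index 1) and the mirrored upper half of the spectrum.
k <- 1:(n %/% 2)
power <- spec[k + 1]

# Periods of the two strongest harmonics.
top <- k[order(power, decreasing = TRUE)[1:2]]
periods <- n / top
periods
# 120 30
```

The periods found this way could then be used in terms such as sin(2 * pi * t / 120) and cos(2 * pi * t / 120) inside lm().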
The functions sinusoid and mvrm from package BNSP allow one to specify the number of harmonics and if that number is too high, the algorithm can remove some of the unnecessary terms and avoid overfitting.
library(BNSP)
library(ggplot2) # provides geom_point() and aes() used in the plot options
# Specify the model
model <- y ~ sinusoid(t, harmonics = 20, amplitude = 1, period = 365)
# Fit the model
m1 <- mvrm(formula = model, data = dt, sweeps = 5000, burn = 3000, thin = 2, seed = 1, StorageDir = getwd())
# ggplot
plotOptionsM <- list(geom_point(data = dt, aes(x = t, y = y)))
plot(x = m1, term = 1, plotOptions = plotOptionsM, intercept = TRUE, quantiles = c(0.005, 0.995), grid = 100)
In this particular example, 19 of the 20 harmonics appear to be important.

How can I perform bootstrap to find the confidence interval for a k-nn model in R?

I have a training df with 2 columns like
a b
1 1000 20
2 1008 13
...
n ... ...
I am required to find a 95% CI for the estimate of 'b' at a specific 'a' value, with a 'k' value of my choice, and to compare that CI with the ones obtained for other values of 'k'. How can I perform the bootstrap with 1000 replicates, given that I must use a fitted knn model for the training data with kernel = 'gaussian' and k restricted to the range 1-20?
I have found that the best k for this model is k = 5, and I had a go at the bootstrap, but it doesn't work:
library(kknn)
library(boot)
boot.kn = function(formula, data, indices)
{
# Create a bootstrapped version
d = data[indices,]
# Fit a model for bs
fit.kn = fitted(train.kknn(formula,data, kernel= "gaussian", ks = 5))
# Do I even need this complicated block
target = as.character(fit.kn$terms[[2]])
rv = my.pred.stats(fit.kn, d[,target])
return(rv)
}
bs = boot(data=df, statistic=boot.kn, R=1000, formula=b ~ a)
boot.ci(bs,conf=0.95,type="bca")
Please let me know if you need more information. Thank you.
Here is a way to regress b on a with the k-nearest neighbors algorithm.
First, a data set. This is a subset of the iris data set, keeping the first two columns. One row is removed to later be the new data.
i <- which(iris$Sepal.Length == 5.3)
df1 <- iris[-i, 1:2]
newdata <- iris[i, 1:2]
names(df1) <- c("a", "b")
names(newdata) <- c("a", "b")
Now load the packages to be used and determine the optimal value for k with package kknn.
library(caret)
library(kknn)
library(boot)
fit <- kknn::train.kknn(
formula = b ~ a,
data = df1,
kmax = 15,
kernel = "gaussian",
distance = 1
)
k <- fit$best.parameters$k
k
#[1] 9
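And bootstrap predictions for the new point a = 5.3.
boot.kn <- function(data, indices, formula, newdata, k){
# Resample the rows, refit on the resample, predict at the new point
d <- data[indices, ]
# NB: k is not forwarded here, so knnreg() uses its default (k = 5);
# use knnreg(formula, data = d, k = k) to apply the tuned k
fit <- knnreg(formula, data = d)
predict(fit, newdata = newdata)
}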
set.seed(2021)
R <- 1e4
bs <- boot(df1, boot.kn, R = R, formula = b ~ a, newdata = newdata, k = k)
ci <- boot.ci(bs, level = 0.95, type = "bca")
ci
#BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
#Based on 10000 bootstrap replicates
#
#CALL :
#boot.ci(boot.out = bs, type = "bca", level = 0.95)
#
#Intervals :
#Level BCa
#95% ( 3.177, 3.740 )
#Calculations and Intervals on Original Scale
Plot the results.
old_par <- par(mfrow = c(2, 1),
oma = c(5, 4, 0, 0) + 0.1,
mar = c(1, 1, 1, 1) + 0.1)
hist(bs$t, main = "Histogram of bootstrap values")
abline(v = 3.7, col = "red")
abline(v = mean(bs$t), col = "blue")
abline(v = ci$bca[4:5], col = "blue", lty = "dashed")
plot(b ~ a, df1)
points(5.3, 3.7, col = "red", pch = 19)
points(5.3, mean(bs$t), col = "blue", pch = 19)
arrows(x0 = 5.3, y0 = ci$bca[4],
x1 = 5.3, y1 = ci$bca[5],
col = "blue", angle = 90, code = 3)
par(old_par)
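For readers who want to see the mechanics without caret or boot, here is a minimal base-R sketch of the same idea. It swaps in simpler techniques than the answer above and these are assumptions of the sketch: knn_predict is a hypothetical helper that does unweighted k-NN regression on a single predictor (not the gaussian-kernel kknn model), and the interval is a plain percentile interval rather than BCa.

```r
# Minimal k-NN regression for one predictor: average the responses
# of the k training points closest to x0.
knn_predict <- function(train_x, train_y, x0, k) {
  nearest <- order(abs(train_x - x0))[seq_len(k)]
  mean(train_y[nearest])
}

set.seed(1)
x  <- iris$Sepal.Length
y  <- iris$Sepal.Width
x0 <- 5.3    # point at which the prediction is wanted
k  <- 5      # number of neighbours (chosen by hand here)
R  <- 1000   # bootstrap replicates

# Resample rows with replacement, refit, and predict at x0 each time.
boot_pred <- replicate(R, {
  idx <- sample(seq_along(x), replace = TRUE)
  knn_predict(x[idx], y[idx], x0, k)
})

# Percentile 95% interval for the prediction at x0.
ci <- quantile(boot_pred, c(0.025, 0.975))
ci
```

Repeating the last two steps for several values of k gives the intervals to compare.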

Simple Logistic Regression in a Loop?

I have a bunch of features for a multivariate logistic regression, but I want to test each feature individually with separate univariate logistic regressions.
I'm trying to do a loop like so
features <- c("f1","f2","f3","f4")
out <- list()
for (f in features) {
mod <- train(form = positive ~ f,
data = training,
method = "glm",
metric = "ROC",
family = "binomial")
out <- append(out,mod)
}
I'm getting an error saying variable lengths differ (found for 'f'). I think it's not recognizing f as the column name? How can I fix this?
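For future reference, here is an answer with a reprex that uses the solution proposed by @Rorschach: build the formula from the column name with as.formula(paste(...)), because in positive ~ f the symbol f is taken literally as a variable name instead of being replaced by its value.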
x <- runif(50, min = 0, max = 100)
z <- runif(50, min = 0, max = 100)
a <- runif(50, min = 0, max = 100)
b <- runif(50, min = 0, max = 100)
positive <- rbinom(50,1, 0.4)
training <- as.data.frame(cbind(x,z,a,b,positive = positive))
training$positive <- factor(training$positive)
library(caret)
features <- c("x","z","a","b")
out <- list()
for (f in features) {
mod <- train(form = as.formula(paste("positive ~ ", f)),
data = training,
method = "glm",
family = "binomial")
out[[f]] <- mod # store each model under its feature name; append() would splice the model's elements into the list
}
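The same pattern can be sketched without caret, using base R's glm() and reformulate() to build each formula from a column name. This is only an illustration on simulated data; the data frame and the feature names x and z are made up, and plain glm() is swapped in for train().

```r
set.seed(42)

# Simulated training data (hypothetical columns).
training <- data.frame(
  x = runif(50, min = 0, max = 100),
  z = runif(50, min = 0, max = 100),
  positive = factor(rbinom(50, 1, 0.4))
)

features <- c("x", "z")

# One univariate logistic regression per feature, kept in a named list.
models <- lapply(setNames(features, features), function(f) {
  glm(reformulate(f, response = "positive"),
      data = training, family = binomial)
})

# Slope estimate of each single-feature model.
slopes <- sapply(models, function(m) unname(coef(m)[2]))
slopes
```

Because the list is named, an individual fit can be retrieved with models[["x"]].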

How to extract the Prediction Intervals of a Gaussian Process Regression via caret kernlab package?

I am trying to use a Gaussian Process Regression (GPR) model to predict hourly streamflow discharges in a river. I've got good results applying the caret::kernlab train() function (thanks Kuhn!).
Since uncertainty quantification is one of the main inherent advantages of GPR, I would like to know if anyone could help me access the prediction intervals for the test dataset.
I'll post an extract of the code I've been working on. Since my real data are huge (and, sincerely, I don't know how to put them here), I'll use data(airquality) as an example. The main goal in this particular example is to predict airquality$Ozone, using the lagged values of airquality$Temp as predictors.
rm(list = ls())
data(airquality)
airquality = na.omit(as.data.frame(airquality)); str(airquality)
library(tidyverse)
library(magrittr)
airquality$Ozone %>% plot(type = 'l')
lines(airquality$Temp, col = 2)
legend("topleft", legend = c("Ozone", "Temperature"),
col=c(1, 2), lty = 1:1, cex = 0.7, text.font = 4, inset = 0.01,
box.lty=0, lwd = 1)
attach(airquality)
df_lags <- airquality %>%
mutate(Temp_lag1 = lag(n = 1L, Temp)) %>%
na.omit()
ESM_train = data.frame(df_lags[1:81, ]) # Training Observed 75% dataset
ESM_test = data.frame(df_lags[82:nrow(df_lags), ]) # Testing Observed 25% dataset
grid_gaussprRadial = expand.grid(.sigma = c(0.001, 0.01, 0.05, 0.1, 0.5, 1, 2)) # Sigma parameters searching for GPR
# TRAIN MODEL ############################
# Tuning set
library(caret)
set.seed(111)
cvCtrl <- trainControl(
method ="repeatedcv",
repeats = 1,
number = 20,
allowParallel = TRUE,
verboseIter = TRUE,
savePredictions = "final")
# Train (approx. 4 seconds run time)
attach(ESM_train)
set.seed(111)
system.time(Model_train <- caret::train(Ozone ~ Temp + Temp_lag1,
trControl = cvCtrl,
data = ESM_train,
metric = "MAE", # Using MAE since I intend minimum values are my focus
preProcess = c("center", "scale"),
method = "gaussprRadial", # Setting RBF kernel function
tuneGrid = grid_gaussprRadial,
maxit = 1000,
linout = 1)) # Regression type
plot(Model_train)
Model_train
ESM_results_train <- Model_train$resample %>% mutate(Model = "") # K-fold Training measures
# Select the interested TRAIN data and arrange them as dataframe
Ozone_Obs_Tr = Model_train$pred$obs
Ozone_sim = Model_train$pred$pred
Resid = Ozone_Obs_Tr - Ozone_sim
train_results = data.frame(Ozone_Obs_Tr,
Ozone_sim,
Resid)
# Plot Obs x Simulated train results
library(ggplot2)
ggplot(data = train_results, aes(x = Ozone_Obs_Tr, y = Ozone_sim)) +
geom_point() +
geom_abline(intercept = 0, slope = 1, color = "black")
# TEST MODEL ############################
# From "ESM_test" dataframe, we predict ESM Ozone time series, adding it in "ESM_forecasted" dataframe
ESM_forecasted = ESM_test %>%
mutate(Ozone_Pred = predict(Model_train, newdata = ESM_test, variance.model = TRUE))
str(ESM_forecasted)
# Select the interested TEST data and arrange them as a dataframe
Ozone_Obs = ESM_forecasted$Ozone
Ozone_Pred = ESM_forecasted$Ozone_Pred
# Plot Obs x Predicted TEST results
ggplot(data = ESM_forecasted, aes(x = Ozone_Obs, y = Ozone_Pred)) +
geom_point() +
geom_abline(intercept = 0, slope = 1, color = "black")
# Model performance #####
library(hydroGOF)
gof_TR = gof(Ozone_sim, Ozone_Obs_Tr)
gof_TEST = gof(Ozone_Pred,Ozone_Obs)
Performances = data.frame(
Train = gof_TR,
Test = gof_TEST
); Performances
# Plot the TEST prediction
attach(ESM_forecasted)
plot(Ozone_Obs, type = "l", xlab = "", ylab = "", ylim = range(Ozone_Obs, Ozone_Pred))
lines(Ozone_Pred , col = "coral2", lty = 2, lwd = 2)
legend("top", legend = c("Ozone Obs Test", "Ozone Pred Test"),
col=c(1, "coral2"), lty = 1:2, cex = 0.7, text.font = 4, inset = 0.01, box.lty=0, lwd = 2)
These last lines generate the following plot:
The next, and last, step would be to extract the prediction intervals, which is based on a gaussian distribution around each prediction point, to plot it together with this last plot.
The caret::kernlab train() appliance returned better prediction than, for instance, just kernlab::gaussprRadial(), or even tgp::bgp() packages. For both of them I could find the prediction interval.
For example, to pick up the prediction intervals via tgp::bgp(), it could be done typing:
Upper_Bound <- Ozone_Pred$ZZ.q2 #Ozone_Pred + 2 * sigma
Lower_Bound <- Ozone_Pred$ZZ.q1 #Ozone_Pred - 2 * sigma
Therefore, via caret::kernlab train(), I hope the required standard deviations could be found typing something as
Model_train$...
or maybe, with
Ozone_Pred$...
Moreover, at https://stats.stackexchange.com/questions/414079/can-mad-median-absolute-deviation-or-mae-mean-absolute-error-be-used-to-calc,
Stephan Kolassa explained that we could estimate the prediction intervals through MAE, or even RMSE. But I don't understand whether this applies to my case, since the MAE I got is just the comparison between observed and predicted Ozone data in this example.
Please, this solution is very important to me! I think I am close to obtaining my main results, but I don't know what else to try.
Thanks a lot, friends!
I don't really know how the caret framework works, but getting a prediction interval for a GP regression with a Gaussian likelihood is easy enough to do manually.
First we just need a function for the squared exponential kernel, also called the radial basis function kernel, which is what you were using. Here sf is the scale factor (unused in the kernlab implementation), and ell is the length scale; kernlab parameterizes the same kernel with an inverse width sigma instead, where sigma = 1 / (2 * ell^2):
covSEiso <- function(x1, x2 = x1, sf = 1.0, ell = 1.0) {
sf <- sf^2
ell <- -0.5 * (1 / (ell^2))
n <- nrow(x1)
m <- nrow(x2)
d <- ncol(x1)
result <- matrix(0, nrow = n, ncol = m)
for ( j in 1:m ) {
for ( i in 1:n ) {
result[i, j] <- sf * exp(ell * sum((x1[i, ] - x2[j, ])^2))
}
}
return(result)
}
I'm not sure what your code says about which length scale to use; below I will use a length scale of 25 and scale factor of 50 (obtained via GPML's hyperparameter optimization routines). Then we use the covSEiso() function above to get the relevant covariances, and the rest is application of basic Gaussian identities. I would refer you to Chapter 2 of Rasmussen and Williams (2006) (graciously provided for free online).
data(airquality)
library(tidyverse)
library(magrittr)
df_lags <- airquality %>%
mutate(Temp_lag1 = lag(n = 1L, Temp)) %>%
na.omit()
ESM_train <- data.frame(df_lags[1:81, ]) # Training Data 75% dataset
ESM_test <- data.frame(df_lags[82:nrow(df_lags), ]) # Testing Data 25% dataset
## For convenience I'll define separately the training and test inputs
X <- ESM_train[ , c("Temp", "Temp_lag1")]
Xstar <- ESM_test[ , c("Temp", "Temp_lag1")]
## Get the kernel manually
K <- covSEiso(X, ell = 25, sf = 50)
## We also need covariance between the test cases
Kstar <- covSEiso(Xstar, X, ell = 25, sf = 50)
Ktest <- covSEiso(Xstar, ell = 25, sf = 50)
## Observation noise standard deviation (assumed value)
sn <- 25
## Posterior predictive mean and covariance of the latent function
Kinv <- solve(K + sn^2 * diag(nrow(K)))
predictive_mean <- Kstar %*% Kinv %*% ESM_train$Ozone
predictive_var <- Ktest - (Kstar %*% Kinv %*% t(Kstar))
## For the prediction interval, add the noise variance before taking the square root
z <- sqrt(diag(predictive_var) + sn^2)
interval_high <- predictive_mean + 2 * z
interval_low <- predictive_mean - 2 * z
Then we can check out the prediction intervals
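The Gaussian identities used above can be tried out on a toy problem. The sketch below is self-contained base R on simulated one-dimensional data. The hyperparameters sf, ell and sn are fixed by hand as assumptions (they are not tuned), and se_kernel is a hypothetical vectorized helper, not the covSEiso function above. The observation noise enters as a variance, both on the diagonal of the training covariance and inside the square root of the predictive standard deviation.

```r
# Squared exponential kernel between the rows of two numeric matrices.
se_kernel <- function(A, B, sf, ell) {
  sq <- outer(rowSums(A^2), rowSums(B^2), "+") - 2 * A %*% t(B)
  sf^2 * exp(-0.5 * pmax(sq, 0) / ell^2)
}

set.seed(1)
X  <- matrix(seq(0, 10, length.out = 30), ncol = 1)  # training inputs
y  <- sin(X[, 1]) + rnorm(30, sd = 0.2)              # noisy observations
Xs <- matrix(seq(0, 10, length.out = 50), ncol = 1)  # test inputs

sf <- 1; ell <- 1; sn <- 0.2   # hyperparameters fixed by hand (assumption)

K   <- se_kernel(X, X, sf, ell) + sn^2 * diag(nrow(X))  # noisy training covariance
Ks  <- se_kernel(Xs, X, sf, ell)                        # test/training covariance
Kss <- se_kernel(Xs, Xs, sf, ell)                       # test covariance

mu    <- Ks %*% solve(K, y)                    # predictive mean
v_lat <- diag(Kss - Ks %*% solve(K, t(Ks)))    # variance of the latent function
sd_y  <- sqrt(pmax(v_lat, 0) + sn^2)           # add noise variance, then take the root

lower <- mu - 1.96 * sd_y
upper <- mu + 1.96 * sd_y
```

Using solve(K, y) rather than solve(K) %*% y avoids forming the explicit inverse.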
This is all pretty easy to do via my gpmlr package (available on GitHub), which can call GPML from R if you have Octave installed:
data(airquality)
library(tidyverse)
library(magrittr)
library(gpmlr)
df_lags <- airquality %>%
mutate(Temp_lag1 = lag(n = 1L, Temp)) %>%
na.omit()
ESM_train <- data.frame(df_lags[1:81, ]) # Training Data 75% dataset
ESM_test <- data.frame(df_lags[82:nrow(df_lags), ]) # Testing Data 25% dataset
X <- as.matrix(ESM_train[ , c("Temp", "Temp_lag1")])
y <- ESM_train$Ozone
Xs <- as.matrix(ESM_test[ , c("Temp", "Temp_lag1")])
ys <- ESM_test$Ozone
hyp0 <- list(mean = numeric(), cov = c(0, 0), lik = 0)
hyp <- set_hyperparameters(hyp0, "infExact", "meanZero", "covSEiso","likGauss",
X, y)
gp_res <- gp(hyp, "infExact", "meanZero", "covSEiso", "likGauss", X, y, Xs, ys)
predictive_mean <- gp_res$YMU
interval_high <- gp_res$YMU + 2 * sqrt(gp_res$YS2)
interval_low <- gp_res$YMU - 2 * sqrt(gp_res$YS2)
Then just plot the predictions, as above:
plot(NULL, xlab = "", ylab = "", xaxt = "n", yaxt = "n",
xlim = range(ESM_test$Temp), ylim = range(c(interval_high, interval_low)))
axis(1, tick = FALSE, line = -0.75)
axis(2, tick = FALSE, line = -0.75)
mtext("Temp", 1, 1.5)
mtext("Ozone", 2, 1.5)
idx <- order(ESM_test$Temp)
polygon(c(ESM_test$Temp[idx], rev(ESM_test$Temp[idx])),
c(interval_high[idx], rev(interval_low[idx])),
border = NA, col = "#80808080")
lines(ESM_test$Temp[idx], predictive_mean[idx])
points(ESM_test$Temp, ESM_test$Ozone, pch = 19)
plot(NULL, xlab = "", ylab = "", xaxt = "n", yaxt = "n",
xlim = range(ESM_test$Temp_lag1), ylim = range(c(interval_high, interval_low)))
axis(1, tick = FALSE, line = -0.75)
axis(2, tick = FALSE, line = -0.75)
mtext("Temp_lag1", 1, 1.5)
mtext("Ozone", 2, 1.5)
idx <- order(ESM_test$Temp_lag1)
polygon(c(ESM_test$Temp_lag1[idx], rev(ESM_test$Temp_lag1[idx])),
c(interval_high[idx], rev(interval_low[idx])),
border = NA, col = "#80808080")
lines(ESM_test$Temp_lag1[idx], predictive_mean[idx])
points(ESM_test$Temp_lag1, ESM_test$Ozone, pch = 19)

Fitting a nonlinear function with "missing level" in mle2 (WARNING: ecologist with computer)

I am looking to optimize the fit of a model that describes the amount of litter collected in a network of .5m^2 'litter traps' in a plot of mapped trees of known diameter and species. The model of choice has two factors, allometric scaling of litter production, and exponential decay in litter travel distance.
tree1.litter = alpha*gamma^2 * DBH^Beta/(2*pi) * exp(-gamma*z-delta*DBH)
However, our trap data contains input from multiple trees (this is the "missing level" referred to in title):
Obs.Litter = tree1.litter + tree2.litter + ... + treej.litter + error
So far I've had very mixed results, even on simulated data. It seems like with enough combinations of diameters and distances the functions should be reasonably well constrained. This analysis has been performed in an article I'm copy-catting. I've also tried the analysis on log(Obs.Litter), which I think is the way to go, but I am not sure that the way I've coded the log version would be expected to perform any better.
At this point I suppose I'm just looking for any sort of advice (code based or conceptual) from someone more experienced with fitting nonlinear regressions or with model fitting problems that involve this type of "hidden process". Code for data simulation and the various likelihoods is included below. I've had a bit more success estimating these parameters with a Bayesian hierarchical model in OpenBUGS, but only with informative priors.
library(plyr)
########################
##Generate Data#########
########################
alpha = 5
Beta = 2
gamma = .2
delta = .02
n = 600 #Number of trees
N.trap = 45 #Number of litter traps
D = rlnorm(n, 2)+5 #generate diameters
Z = runif(n, 0, 25) #generate distances
trap.id = sort(sample(1:N.trap, size = n, replace = T)) #assign trees to traps
tree.lit = (2*pi)^-1*alpha*gamma^2*D^Beta * exp(-gamma*Z-delta*D) #generate litter
log.tree.lit = -log(2*pi) + log(alpha) + 2*log(gamma) + Beta*log(D) - gamma*Z - delta*D
litter = data.frame(D=D, Z = Z, trap.id = trap.id, tree.lit = tree.lit)
data = ddply(litter, .(trap.id), summarize, trap.lit = sum(tree.lit), n.trees=length(trap.id) )
trap.lit = data[,2]
#####################################
##### Maximum Likelihood Optimization
#####################################
library(bbmle)
log.Litter.Func<-function(alpha, Beta, gamma, delta, sigma, D, Z, N.trap, trap.id, Obs.Litter){
log.Expected.Litter.tree = -log(2*pi) + log(alpha) + 2*log(gamma) + Beta*log(D) -gamma*Z - delta*D
log.Expected.Litter.trap = rep(0, N.trap)
for(i in 1:N.trap){
log.Expected.Litter.trap[i] <- sum(exp(log.Expected.Litter.tree[trap.id==i]))
}
-sum(dlnorm(log(Obs.Litter), log.Expected.Litter.trap, sd=sigma, log=T))
}
Litter.Func<-function(alpha, Beta, gamma, delta, sigma, D, Z, N.trap, trap.id, Obs.Litter){
Expected.Litter.tree = 1/(2*pi) * alpha * gamma^2 * D^Beta *exp(-gamma*Z - delta*D)
Expected.Litter.trap = rep(0, N.trap)
for(i in 1:N.trap){
Expected.Litter.trap[i] <- sum(Expected.Litter.tree[trap.id==i])
}
-sum(dnorm(Obs.Litter, Expected.Litter.trap, sd=sigma, log=T))
}
log.fit<-mle2(log.Litter.Func,
start = list(alpha = 5,gamma = .2,Beta = 2,delta = .02, sigma = 1),
#upper = list(alpha = 20,gamma = 1,Beta = 4,delta = .2,sigma = 20),
#lower = list(alpha = .002,gamma = .002,Beta = .0002,delta = .000000002,sigma = .020),
#method="L-BFGS-B",
data=list(D=D, Z=Z, N.trap=N.trap, trap.id=litter$trap.id, Obs.Litter=trap.lit)
)
fit<-mle2(Litter.Func,
start = list(alpha = 5,gamma = .2,Beta = 2,delta = .02, sigma = 1),
#upper = list(alpha = 20,gamma = 1,Beta = 4,delta = .2,sigma = 20),
#lower = list(alpha = .002,gamma = .002,Beta = .0002,delta = .000000002,sigma = .020),
#method="L-BFGS-B",
data=list(D = D,Z = Z,N.trap=N.trap, trap.id=litter$trap.id,Obs.Litter = trap.lit)
)
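As a point of comparison, here is one way the log-scale likelihood might be coded. It is a sketch under stated assumptions, not a tested solution to the estimation problem. The simulated traps get multiplicative lognormal noise with an assumed log-scale sd of 0.1, the expected trap total is the log of the summed per-tree expectations, base R's optim() is swapped in for mle2(), and the helper names log_sum_exp and nll are hypothetical.

```r
set.seed(1)

# Simulate trees and traps with the parameter values used in the question.
alpha <- 5; Beta <- 2; gamma <- 0.2; delta <- 0.02
n <- 600; N.trap <- 45
D <- rlnorm(n, 2) + 5
Z <- runif(n, 0, 25)
trap.id <- sort(sample(1:N.trap, size = n, replace = TRUE))
tree.lit <- alpha * gamma^2 * D^Beta / (2 * pi) * exp(-gamma * Z - delta * D)
trap.mean <- tapply(tree.lit, trap.id, sum)

# Assumption: multiplicative lognormal error, sd 0.1 on the log scale.
Obs.Litter <- trap.mean * exp(rnorm(length(trap.mean), 0, 0.1))

# Numerically stable log(sum(exp(v))).
log_sum_exp <- function(v) {
  m <- max(v)
  m + log(sum(exp(v - m)))
}

# Negative log-likelihood; positive parameters are optimized on the log scale.
nll <- function(par) {
  a <- exp(par[1]); B <- par[2]; g <- exp(par[3])
  d <- exp(par[4]); s <- exp(par[5])
  log.tree <- -log(2 * pi) + log(a) + 2 * log(g) + B * log(D) - g * Z - d * D
  # Log of the expected trap total: sum over trees on the natural scale.
  log.trap <- tapply(log.tree, trap.id, log_sum_exp)
  -sum(dnorm(log(Obs.Litter), mean = log.trap, sd = s, log = TRUE))
}

start <- c(log(5), 2, log(0.2), log(0.02), log(1))
fit <- optim(start, nll, control = list(maxit = 2000))  # Nelder-Mead by default
fit$value
```

Working with log(alpha), log(gamma), log(delta) and log(sigma) keeps those parameters positive without box constraints.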
