Error in plotting SVM classification graph - r

I'm using the support vector machine from the e1071 package to classify my data and want to visualize how the machine actually does the classification. However, when using the plot.svm function, I get an error that I can't resolve.
Script:
library("e1071")
data <- read.table("2010223_11042_complete")
names(data) <- c("Class","V1", "V2")
model <- svm(Class~.,data, type = "C-classification", kernel = "linear")
plot(model,data,fill=TRUE, grid=200, svSymbol=4, dataSymbol=1, color.palette=terrain.colors)
Output:
plot(model,data,fill=TRUE, grid=200, svSymbol=4, dataSymbol=1, color.palette=terrain.colors)
Error in rect(0, levels[-length(levels)], 1, levels[-1L], col = col) :
cannot mix zero-length and non-zero-length coordinates
Traceback:
traceback()
4: rect(0, levels[-length(levels)], 1, levels[-1L], col = col)
3: filled.contour(xr, yr, matrix(as.numeric(preds), nr = length(xr),
byrow = TRUE), plot.axes = {
axis(1)
axis(2)
colind <- as.numeric(model.response(model.frame(x, data)))
dat1 <- data[-x$index, ]
dat2 <- data[x$index, ]
coltmp1 <- symbolPalette[colind[-x$index]]
coltmp2 <- symbolPalette[colind[x$index]]
points(formula, data = dat1, pch = dataSymbol, col = coltmp1)
points(formula, data = dat2, pch = svSymbol, col = coltmp2)
}, levels = 1:(length(levels(preds)) + 1), key.axes = axis(4,
1:(length(levels(preds))) + 0.5, labels = levels(preds),
las = 3), plot.title = title(main = "SVM classification plot",
xlab = names(lis)[2], ylab = names(lis)[1]), ...)
2: plot.svm(model, data, fill = TRUE, grid = 200, svSymbol = 4,
dataSymbol = 1, color.palette = terrain.colors)
1: plot(model, data, fill = TRUE, grid = 200, svSymbol = 4,
dataSymbol = 1, color.palette = terrain.colors)
Part of my (4488 lines long) data file:
-1 0 23.532
+1 1 61.1157
+1 1 61.1157
+1 1 61.1157
-1 1 179.03
-1 0 17.0865
-1 0 27.6201
-1 0 17.0865
-1 0 27.6201
-1 1 89.6398
-1 0 42.7418
-1 1 89.6398
Since I'm just starting with R, I have no idea what this error means or how to deal with it, and I haven't found anything useful elsewhere.

Without being sure what exactly causes the problem, I would try to transform the Class column to a factor (so defining the type as C-classification will no longer be necessary) using something like this:
data$Class <- as.factor(data$Class)
or in one step:
model <- svm(as.factor(Class)~.,data, kernel = "linear")
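Putting that together, a minimal corrected sketch (assuming the same whitespace-separated data file as above; with Class as a factor, plot.svm has class levels to build the filled regions from):
library(e1071)
## Read the data and make the class label a factor so svm() runs
## C-classification and plot.svm() can colour the decision regions.
data <- read.table("2010223_11042_complete")
names(data) <- c("Class", "V1", "V2")
data$Class <- as.factor(data$Class)
model <- svm(Class ~ ., data = data, kernel = "linear")
plot(model, data, fill = TRUE, grid = 200,
     svSymbol = 4, dataSymbol = 1, color.palette = terrain.colors)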

Related

How to select appropriate sin() terms to fit a time series using R

I want to fit a time series with sin() functions because it shows several periodic components (crests and troughs). So far I have only guessed the periods, e.g. 1 month, 2 months, ..., 1 year, 2 years. Is there a function in R to estimate the multiple periods in a data series?
Below is an example that I want to fit with a combination of sin() terms. The expression in lm() is an attempt after several guesses (red line in the figure below). How can I find the sin() terms with appropriate periods?
t <- 1:365
y <- c(-1,-1.3,-1.6,-1.8,-2.1,-2.3,-2.5,-2.7,-2.9,-3,-2,-1.1,-0.3,0.5,1.1,1.6,2.1,2.5,2.8,3.1,3.4,3.7,4.2,4.6,5,5.3,5.7,5.9,6.2,5.8,5.4,5,4.6,4.2,3.9,3.6,3.4,3.1,2.9,2.8,2.6,2.5,2.3,1.9,1.5,1.1,0.8,0.5,0.2,0,-0.1,-0.3,-0.4,-0.5,-0.5,-0.6,-0.7,-0.8,-0.9,-0.8,-0.6,-0.3,-0.1,0.1,0.4,0.6,0.9,1.1,1.3,1.5,1.7,2.1,2.4,2.7,3,3.3,3.5,3.8,4.3,4.7,5.1,5.5,5.9,6.2,6.4,6.6,6.7,6.8,6.8,6.9,7,6.9,6.8,6.7,
6.5,6.4,6.4,6.3,6.2,6,5.9,5.7,5.6,5.5,5.4,5.4,5.1,4.9,4.8,4.6,4.5,4.4,4.3,3.9,3.6,3.3,3,2.8,2.6,2.4,2.6,2.5,2.4,2.3,2.3,2.2,2.2,2.3,2.4,2.4,2.5,2.5,2.6,2.6,2.4,2.1,1.9,1.8,1.6,1.4,1.3,1,0.7,0.5,0.2,0,-0.2,-0.4,-0.2,-0.1,0.1,0.1,0.1,0.1,0.1,0.1,0,0,-0.1,-0.1,-0.2,-0.2,-0.3,-0.3,-0.4,-0.5,-0.5,-0.6,-0.7,-0.7,-0.8,-0.8,-0.8,-0.9,-0.9,-0.9,-1.3,-1.6,-1.9,-2.1,-2.3,-2.6,-2.9,-2.9,-2.9,-2.9,
-2.9,-3,-3,-3,-2.8,-2.7,-2.5,-2.4,-2.3,-2.2,-2.1,-2,-2,-1.9,-1.9,-1.8,-1.8,-1.8,-1.9,-1.9,-2,-2.1,-2.2,-2.2,-2.3,-2.4,-2.5,-2.6,-2.7,-2.8,-2.9,-2.9,-2.9,-2.9,-2.9,-2.9,-2.9,-2.9,-2.9,-2.9,-2.8,-2.8,-2.7,-2.7,-2.6,-2.6,-2.8,-3,-3.1,-3.3,-3.4,-3.5,-3.6,-3.5,-3.4,-3.3,-3.3,-3.2,-3,-2.9,-2.8,-2.8,-2.7,-2.6,-2.6,-2.6,-2.5,-2.6,-2.7,-2.8,-2.8,-2.9,-3,-3,-3,-3,-2.9,-2.9,-2.9,-2.9,-2.9,-2.8,
-2.7,-2.6,-2.5,-2.4,-2.3,-2.3,-2.1,-1.9,-1.8,-1.7,-1.5,-1.4,-1.3,-1.5,-1.7,-1.8,-1.9,-2,-2.1,-2.2,-2.4,-2.5,-2.6,-2.7,-2.8,-2.8,-2.9,-3.1,-3.2,-3.3,-3.4,-3.5,-3.5,-3.6,-3.6,-3.5,-3.4,-3.3,-3.2,-3.1,-3,-2.7,-2.3,-2,-1.8,-1.5,-1.3,-1.1,-0.9,-0.7,-0.6,-0.5,-0.3,-0.2,-0.1,-0.3,-0.5,-0.6,-0.7,-0.8,-0.9,-1,-1.1,-1.1,-1.2,-1.2,-1.2,-1.2,-1.2,-0.8,-0.4,-0.1,0.2,0.5,0.8,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0.6,0.3,0,-0.2,-0.5,-0.7,-0.8)
dt <- data.frame(t = t, y = y)
plot(x = dt$t, y = dt$y)
lm <- lm(y ~ sin(2*3.1416/365*t)+cos(2*3.1416/365*t)+
sin(2*2*3.1416/365*t)+cos(2*2*3.1416/365*t)+
sin(2*4*3.1416/365*t)+cos(2*4*3.1416/365*t)+
sin(2*5*3.1416/365*t)+cos(2*5*3.1416/365*t)+
sin(2*6*3.1416/365*t)+cos(2*6*3.1416/365*t)+
sin(2*0.5*3.1416/365*t)+cos(2*0.5*3.1416/365*t),
data = dt)
summary(lm)$adj.r.squared
plot(dt$y); lines(predict(lm), type = "l", col = "red")
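One way to estimate the dominant periods before choosing the sin() terms is a raw periodogram. A minimal sketch using base R's spec.pgram() (my own aside, not part of the original code; frequencies are in cycles per observation, so the periods come out in days here):
## Periodogram of the series; the largest peaks indicate the dominant periods.
sp <- spec.pgram(dt$y, taper = 0, detrend = TRUE, plot = FALSE)
top <- order(sp$spec, decreasing = TRUE)[1:5]
data.frame(period_days = round(1 / sp$freq[top], 1), power = sp$spec[top])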
The forecast package has the fourier function (see here), which allows you to model Fourier series terms based on time series objects.
For example:
library(forecast)
dt$y <- ts(dt$y, frequency = 365)
lm <- lm(y ~ fourier(y, K = 6), data = dt)
plot(dt$t, dt$y); lines(predict(lm), type = "l", col = "red")
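If you are unsure how many harmonic pairs K to include, a small sketch (my addition, assuming the ts object dt$y created above) that compares fits for several values of K by adjusted R-squared:
## Compare Fourier fits with an increasing number of harmonic pairs.
Ks <- 1:10
fits <- lapply(Ks, function(k) lm(y ~ fourier(y, K = k), data = dt))
data.frame(K = Ks, adj_r2 = sapply(fits, function(f) summary(f)$adj.r.squared))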
Following my comment to the question: in catastrophic-failure's answer, replace Mod by Re as in SleuthEye's answer, then call nff(y, 20, col = "red").
I realized there is another correction to be made to the nff function:
substitute length(x) (or xlen) for the magic number 73.
Here is the corrected function.
nff = function(x = NULL, n = NULL, up = 10L, plot = TRUE, add = FALSE, main = NULL, ...){
  #The direct transformation
  #The first frequency is DC, the rest are duplicated
  dff = fft(x)
  #The time
  xlen <- length(x)
  t = seq_along(x)
  #Upsampled time
  nt = seq(from = 1L, to = xlen + 1L - 1/up, by = 1/up)
  #New spectrum
  ndff = array(data = 0, dim = c(length(nt), 1L))
  ndff[1] = dff[1] #Always, it's the DC component
  if(n != 0){
    ndff[2:(n+1)] <- dff[2:(n+1)] #The positive frequencies always come first
    #The negative ones are trickier
    ndff[(length(ndff) - n + 1):length(ndff)] <- dff[(xlen - n + 1L):xlen]
  }
  #The inverses
  indff = fft(ndff/xlen, inverse = TRUE)
  idff = fft(dff/xlen, inverse = TRUE)
  if(plot){
    if(!add){
      plot(x = t, y = x, pch = 16L, xlab = "Time", ylab = "Measurement",
           main = ifelse(is.null(main), paste(n, "harmonics"), main))
      lines(y = Re(idff), x = t, col = adjustcolor(1L, alpha = 0.5))
    }
    lines(y = Re(indff), x = nt, ...)
  }
  ret = data.frame(time = nt, y = Mod(indff))
  return(ret)
}
y <- c(-1,-1.3,-1.6,-1.8,-2.1,-2.3,-2.5,-2.7,-2.9,-3,-2,-1.1,-0.3,0.5,1.1,1.6,2.1,2.5,2.8,3.1,3.4,3.7,4.2,4.6,5,5.3,5.7,5.9,6.2,5.8,5.4,5,4.6,4.2,3.9,3.6,3.4,3.1,2.9,2.8,2.6,2.5,2.3,1.9,1.5,1.1,0.8,0.5,0.2,0,-0.1,-0.3,-0.4,-0.5,-0.5,-0.6,-0.7,-0.8,-0.9,-0.8,-0.6,-0.3,-0.1,0.1,0.4,0.6,0.9,1.1,1.3,1.5,1.7,2.1,2.4,2.7,3,3.3,3.5,3.8,4.3,4.7,5.1,5.5,5.9,6.2,6.4,6.6,6.7,6.8,6.8,6.9,7,6.9,6.8,6.7,
6.5,6.4,6.4,6.3,6.2,6,5.9,5.7,5.6,5.5,5.4,5.4,5.1,4.9,4.8,4.6,4.5,4.4,4.3,3.9,3.6,3.3,3,2.8,2.6,2.4,2.6,2.5,2.4,2.3,2.3,2.2,2.2,2.3,2.4,2.4,2.5,2.5,2.6,2.6,2.4,2.1,1.9,1.8,1.6,1.4,1.3,1,0.7,0.5,0.2,0,-0.2,-0.4,-0.2,-0.1,0.1,0.1,0.1,0.1,0.1,0.1,0,0,-0.1,-0.1,-0.2,-0.2,-0.3,-0.3,-0.4,-0.5,-0.5,-0.6,-0.7,-0.7,-0.8,-0.8,-0.8,-0.9,-0.9,-0.9,-1.3,-1.6,-1.9,-2.1,-2.3,-2.6,-2.9,-2.9,-2.9,-2.9,
-2.9,-3,-3,-3,-2.8,-2.7,-2.5,-2.4,-2.3,-2.2,-2.1,-2,-2,-1.9,-1.9,-1.8,-1.8,-1.8,-1.9,-1.9,-2,-2.1,-2.2,-2.2,-2.3,-2.4,-2.5,-2.6,-2.7,-2.8,-2.9,-2.9,-2.9,-2.9,-2.9,-2.9,-2.9,-2.9,-2.9,-2.9,-2.8,-2.8,-2.7,-2.7,-2.6,-2.6,-2.8,-3,-3.1,-3.3,-3.4,-3.5,-3.6,-3.5,-3.4,-3.3,-3.3,-3.2,-3,-2.9,-2.8,-2.8,-2.7,-2.6,-2.6,-2.6,-2.5,-2.6,-2.7,-2.8,-2.8,-2.9,-3,-3,-3,-3,-2.9,-2.9,-2.9,-2.9,-2.9,-2.8,
-2.7,-2.6,-2.5,-2.4,-2.3,-2.3,-2.1,-1.9,-1.8,-1.7,-1.5,-1.4,-1.3,-1.5,-1.7,-1.8,-1.9,-2,-2.1,-2.2,-2.4,-2.5,-2.6,-2.7,-2.8,-2.8,-2.9,-3.1,-3.2,-3.3,-3.4,-3.5,-3.5,-3.6,-3.6,-3.5,-3.4,-3.3,-3.2,-3.1,-3,-2.7,-2.3,-2,-1.8,-1.5,-1.3,-1.1,-0.9,-0.7,-0.6,-0.5,-0.3,-0.2,-0.1,-0.3,-0.5,-0.6,-0.7,-0.8,-0.9,-1,-1.1,-1.1,-1.2,-1.2,-1.2,-1.2,-1.2,-0.8,-0.4,-0.1,0.2,0.5,0.8,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0.6,0.3,0,-0.2,-0.5,-0.7,-0.8)
res <- nff(y, 20, col = "red")
str(res)
#> 'data.frame': 3650 obs. of 2 variables:
#> $ time: num 1 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 ...
#> $ y : num 1.27 1.31 1.34 1.37 1.4 ...
Created on 2022-10-17 with reprex v2.0.2
The functions sinusoid and mvrm from package BNSP allow one to specify the number of harmonics and if that number is too high, the algorithm can remove some of the unnecessary terms and avoid overfitting.
# Specify the model
model <- y ~ sinusoid(t, harmonics = 20, amplitude = 1, period = 365)
# Fit the model
m1 <- mvrm(formula = model, data = dt, sweeps = 5000, burn = 3000, thin = 2, seed = 1, StorageDir = getwd())
# ggplot
plotOptionsM <- list(geom_point(data = dt, aes(x = t, y = y)))
plot(x = m1, term = 1, plotOptions = plotOptionsM, intercept = TRUE, quantiles = c(0.005, 0.995), grid = 100)
In this particular example, 19 of the 20 harmonics appear to be important.

How can I improve the quality/graphics of my R plot for a Naive Bayes classifier visual

I tried a Naive Bayes classifier to see whether, given a person's age and estimated salary, I can predict if they would purchase a particular vehicle. The plot I got in the visualisation section does not look smooth or clean; there are white lines running across it. I'm assuming the graphics/resolution is the problem, but I am not sure.
This is a snippet of what the dataset looks like
Age EstimatedSalary Purchased
19 19000 0
35 20000 0
26 43000 0
27 57000 0
19 76000 0
27 58000 0
Here is the code
# Loading the data set
data <- read.csv(" *A csv sheet on people's age, salaries and whether or not they will purchase a certain vehicle* ")
data <- data[, 3:5]
attach(data)
# Encoding the dependent variable
data$Purchased <- factor(data$Purchased, levels = c(0, 1))
attach(data)
# Splitting the dataset
library(caTools)
set.seed(404)
split <- sample.split(Purchased, SplitRatio = 0.75)
train_set <- subset(data, split == T)
test_set <- subset(data, split == F)
# Feature scaling
train_set[-3] <- scale(train_set[-3])
test_set[-3] <- scale(test_set[-3])
# Training the model
library(e1071)
classifier <- naiveBayes(x = train_set[-3], y = train_set$Purchased)
# Predicting test results
y_pred <- predict(classifier, newdata = test_set[-3])
# Construct the confusion matrix
(cm <- table(test_set[, 3], y_pred))
Below is the code that I used to visualise the results
# Visualising the results
library(ElemStatLearn)
set <- test_set
x1 <- seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
x2 <- seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set <- expand.grid(x1, x2)
colnames(grid_set) <- c("Age", "EstimatedSalary")
y_grid <- predict(classifier, newdata = grid_set)
plot(set[, -3], main = "Naive Bayes: Test set", xlab = "Age", ylab = "EstimatedSalary", xlim = range(x1), ylim = range(x2))
contour(x1, x2, matrix(as.numeric(y_grid), length(x1), length(x2)), add = T)
points(grid_set, pch = ".", col = ifelse(y_grid == 1, "Springgreen3", "tomato"))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, "green4", "red3"))
Naive Bayes classifier plot on the test set predictions
I would like to know the reason for the white lines running up and down the plot, and why it does not look smooth.
So I figured out what was giving me the weird lines and the low-quality look: adding the cex argument, with cex = 5, to the points() call that draws the grid solved this.
Revised block of code
set <- test_set
x1 <- seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
x2 <- seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set <- expand.grid(x1, x2)
colnames(grid_set) <- c("Age", "EstimatedSalary")
y_grid <- predict(classifier, newdata = grid_set)
plot(set[, -3], main = "Naive Bayes: Test set", xlab = "Age", ylab = "EstimatedSalary", xlim = range(x1), ylim = range(x2))
contour(x1, x2, matrix(as.numeric(y_grid), length(x1), length(x2)), add = T)
points(grid_set, pch = ".", col = ifelse(y_grid == 1, "Springgreen3", "tomato"), cex = 5)
points(set, pch = 21, bg = ifelse(set[, 3] == 1, "green4", "red3"))
The revised line of code in the above block
points(grid_set, pch = ".", col = ifelse(y_grid == 1, "Springgreen3", "tomato"), cex = 5)
That said, I would still like to know why this happened, because the R documentation for these functions and their parameters was not that clear to me.
I would appreciate any help!
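The white lines appear because the coloured regions are built from thousands of tiny grid dots: with pch = "." at the default size, adjacent grid points (0.01 apart on the scaled axes) do not quite touch on the device, so the background shows through as streaks, and cex = 5 simply makes each dot large enough to overlap its neighbours. A hedged alternative sketch that paints the regions with image() instead of a point cloud (reusing x1, x2, y_grid and set from above):
## Fill the decision regions with image() rather than a cloud of grid points.
pred_grid <- matrix(as.numeric(y_grid), length(x1), length(x2))
image(x1, x2, pred_grid, col = c("tomato", "springgreen3"),
      xlab = "Age", ylab = "EstimatedSalary", main = "Naive Bayes: Test set")
contour(x1, x2, pred_grid, add = TRUE)
points(set[, 1], set[, 2], pch = 21, bg = ifelse(set[, 3] == 1, "green4", "red3"))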

How to extract the Prediction Intervals of a Gaussian Process Regression via caret kernlab package?

I am trying to use a Gaussian process regression (GPR) model to predict hourly streamflow discharges in a river. I've got good results applying caret::train() with the kernlab method (thanks Kuhn!).
Since uncertainty quantification is one of the main inherent advantages of GPR, I would like to know if anyone could help me access the prediction intervals for the test dataset.
I'll include an extract of the code I've been working on. Since my real data are huge (and, honestly, I don't know how to post them here), I'll use data(airquality) as an example. The main goal in this particular example is to predict airquality$Ozone, using lagged variables of airquality$Temp as predictors.
rm(list = ls())
data(airquality)
airquality = na.omit(as.data.frame(airquality)); str(airquality)
library(tidyverse)
library(magrittr)
airquality$Ozone %>% plot(type = 'l')
lines(airquality$Temp, col = 2)
legend("topleft", legend = c("Ozone", "Temperature"),
col=c(1, 2), lty = 1:1, cex = 0.7, text.font = 4, inset = 0.01,
box.lty=0, lwd = 1)
attach(airquality)
df_lags <- airquality %>%
mutate(Temp_lag1 = lag(n = 1L, Temp)) %>%
na.omit()
ESM_train = data.frame(df_lags[1:81, ]) # Training Observed 75% dataset
ESM_test = data.frame(df_lags[82:nrow(df_lags), ]) # Testing Observed 25% dataset
grid_gaussprRadial = expand.grid(.sigma = c(0.001, 0.01, 0.05, 0.1, 0.5, 1, 2)) # Sigma parameters searching for GPR
# TRAIN MODEL ############################
# Tuning set
library(caret)
set.seed(111)
cvCtrl <- trainControl(
method ="repeatedcv",
repeats = 1,
number = 20,
allowParallel = TRUE,
verboseIter = TRUE,
savePredictions = "final")
# Train (aprox. 4 seconds time-simulation)
attach(ESM_train)
set.seed(111)
system.time(Model_train <- caret::train(Ozone ~ Temp + Temp_lag1,
trControl = cvCtrl,
data = ESM_train,
metric = "MAE", # Using MAE since I intend minimum values are my focus
preProcess = c("center", "scale"),
method = "gaussprRadial", # Setting RBF kernel function
tuneGrid = grid_gaussprRadial,
maxit = 1000,
linout = 1)) # Regression type
plot(Model_train)
Model_train
ESM_results_train <- Model_train$resample %>% mutate(Model = "") # K-fold Training measures
# Select the interested TRAIN data and arrange them as dataframe
Ozone_Obs_Tr = Model_train$pred$obs
Ozone_sim = Model_train$pred$pred
Resid = Ozone_Obs_Tr - Ozone_sim
train_results = data.frame(Ozone_Obs_Tr,
Ozone_sim,
Resid)
# Plot Obs x Simulated train results
library(ggplot2)
ggplot(data = train_results, aes(x = Ozone_Obs_Tr, y = Ozone_sim)) +
geom_point() +
geom_abline(intercept = 0, slope = 1, color = "black")
# TEST MODEL ############################
# From "ESM_test" dataframe, we predict ESM Ozone time series, adding it in "ESM_forecasted" dataframe
ESM_forecasted = ESM_test %>%
mutate(Ozone_Pred = predict(Model_train, newdata = ESM_test, variance.model = TRUE))
str(ESM_forecasted)
# Select the interested TEST data and arrange them as a dataframe
Ozone_Obs = ESM_forecasted$Ozone
Ozone_Pred = ESM_forecasted$Ozone_Pred
# Plot Obs x Predicted TEST results
ggplot(data = ESM_forecasted, aes(x = Ozone_Obs, y = Ozone_Pred)) +
geom_point() +
geom_abline(intercept = 0, slope = 1, color = "black")
# Model performance #####
library(hydroGOF)
gof_TR = gof(Ozone_sim, Ozone_Obs_Tr)
gof_TEST = gof(Ozone_Pred,Ozone_Obs)
Performances = data.frame(
Train = gof_TR,
Test = gof_TEST
); Performances
# Plot the TEST prediction
attach(ESM_forecasted)
plot(Ozone_Obs, type = "l", xlab = "", ylab = "", ylim = range(Ozone_Obs, Ozone_Pred))
lines(Ozone_Pred , col = "coral2", lty = 2, lwd = 2)
legend("top", legend = c("Ozone Obs Test", "Ozone Pred Test"),
col=c(1, "coral2"), lty = 1:2, cex = 0.7, text.font = 4, inset = 0.01, box.lty=0, lwd = 2)
These last lines generate the following plot:
The next, and last, step would be to extract the prediction intervals, which are based on a Gaussian distribution around each prediction point, and plot them together with this last plot.
The caret::train() call with the kernlab method returned better predictions than, for instance, kernlab::gaussprRadial() alone, or even the tgp::bgp() package. For both of those I could find the prediction interval.
For example, to pick up the prediction intervals via tgp::bgp(), it can be done by typing:
Upper_Bound <- Ozone_Pred$ZZ.q2 # approx. Ozone_Pred + 2 * sigma
Lower_Bound <- Ozone_Pred$ZZ.q1 # approx. Ozone_Pred - 2 * sigma
Therefore, via caret::train() with kernlab, I hope the required standard deviations can be found by typing something like
Model_train$...
or maybe
Ozone_Pred$...
Moreover, at https://stats.stackexchange.com/questions/414079/can-mad-median-absolute-deviation-or-mae-mean-absolute-error-be-used-to-calc,
Stephan Kolassa explained that we could estimate prediction intervals from the MAE, or even the RMSE. But I don't know whether that applies here, since the MAE I got is just a comparison between observed and predicted Ozone data in this example.
Please, this solution is very important to me! I think I am close to obtaining my main results, but I no longer know what to try.
Thanks a lot, friends!
I don't really know how the caret framework works, but getting a prediction interval for a GP regression with a Gaussian likelihood is easy enough to do manually.
First we just need a function for the squared exponential kernel, also called the radial basis function kernel, which is what you were using. sf here is the scale factor (unused in the kernlab implementation), and ell is the length scale, called sigma in the kernlab implementation:
covSEiso <- function(x1, x2 = x1, sf = 1.0, ell = 1.0) {
  sf <- sf^2
  ell <- -0.5 * (1 / (ell^2))
  n <- nrow(x1)
  m <- nrow(x2)
  d <- ncol(x1)
  result <- matrix(0, nrow = n, ncol = m)
  for ( j in 1:m ) {
    for ( i in 1:n ) {
      result[i, j] <- sf * exp(ell * sum((x1[i, ] - x2[j, ])^2))
    }
  }
  return(result)
}
I'm not sure what your code says about which length scale to use; below I will use a length scale of 25 and scale factor of 50 (obtained via GPML's hyperparameter optimization routines). Then we use the covSEiso() function above to get the relevant covariances, and the rest is application of basic Gaussian identities. I would refer you to Chapter 2 of Rasmussen and Williams (2006) (graciously provided for free online).
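For reference, the predictive mean and covariance computed below are the standard GP identities from that chapter, with K, K_* and K_** the train-train, test-train and test-test covariance matrices and sigma_n^2 the observation-noise variance:
\mu_* = K_* (K + \sigma_n^2 I)^{-1} y
\Sigma_* = K_{**} - K_* (K + \sigma_n^2 I)^{-1} K_*^{\top}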
data(airquality)
library(tidyverse)
library(magrittr)
df_lags <- airquality %>%
mutate(Temp_lag1 = lag(n = 1L, Temp)) %>%
na.omit()
ESM_train <- data.frame(df_lags[1:81, ]) # Training Data 75% dataset
ESM_test <- data.frame(df_lags[82:nrow(df_lags), ]) # Testing Data 25% dataset
## For convenience I'll define separately the training and test inputs
X <- ESM_train[ , c("Temp", "Temp_lag1")]
Xstar <- ESM_test[ , c("Temp", "Temp_lag1")]
## Get the kernel manually
K <- covSEiso(X, ell = 25, sf = 50)
## We also need covariance between the test cases
Kstar <- covSEiso(Xstar, X, ell = 25, sf = 50)
Ktest <- covSEiso(Xstar, ell = 25, sf = 50)
## Now the 95% credible region for the posterior is
predictive_mean <- Kstar %*% solve(K + diag(nrow(K))) %*% ESM_train$Ozone
predictive_var <- Ktest - (Kstar %*% solve(K + diag(nrow(K))) %*% t(Kstar))
## Then for the prediction interval we only need to add the observation noise
z <- sqrt(diag(predictive_var)) + 25
interval_high <- predictive_mean + 2 * z
interval_low <- predictive_mean - 2 * z
Then we can check out the prediction intervals
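For example, a quick sketch against the Temp input, using the quantities computed above (it mirrors the gpmlr plots further below):
## Plot the manual predictive mean and 95% prediction interval against Temp.
idx <- order(ESM_test$Temp)
plot(ESM_test$Temp, ESM_test$Ozone, pch = 19, xlab = "Temp", ylab = "Ozone",
     ylim = range(c(interval_low, interval_high, ESM_test$Ozone)))
polygon(c(ESM_test$Temp[idx], rev(ESM_test$Temp[idx])),
        c(interval_high[idx], rev(interval_low[idx])),
        border = NA, col = "#80808080")
lines(ESM_test$Temp[idx], predictive_mean[idx])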
This is all pretty easy to do via my gpmlr package (available on GitHub), which can call GPML from R if you have Octave installed:
data(airquality)
library(tidyverse)
library(magrittr)
library(gpmlr)
df_lags <- airquality %>%
mutate(Temp_lag1 = lag(n = 1L, Temp)) %>%
na.omit()
ESM_train <- data.frame(df_lags[1:81, ]) # Training Data 75% dataset
ESM_test <- data.frame(df_lags[82:nrow(df_lags), ]) # Testing Data 25% dataset
X <- as.matrix(ESM_train[ , c("Temp", "Temp_lag1")])
y <- ESM_train$Ozone
Xs <- as.matrix(ESM_test[ , c("Temp", "Temp_lag1")])
ys <- ESM_test$Ozone
hyp0 <- list(mean = numeric(), cov = c(0, 0), lik = 0)
hyp <- set_hyperparameters(hyp0, "infExact", "meanZero", "covSEiso","likGauss",
X, y)
gp_res <- gp(hyp, "infExact", "meanZero", "covSEiso", "likGauss", X, y, Xs, ys)
predictive_mean <- gp_res$YMU
interval_high <- gp_res$YMU + 2 * sqrt(gp_res$YS2)
interval_low <- gp_res$YMU - 2 * sqrt(gp_res$YS2)
Then just plot the predictions, as above:
plot(NULL, xlab = "", ylab = "", xaxt = "n", yaxt = "n",
xlim = range(ESM_test$Temp), ylim = range(c(interval_high, interval_low)))
axis(1, tick = FALSE, line = -0.75)
axis(2, tick = FALSE, line = -0.75)
mtext("Temp", 1, 1.5)
mtext("Ozone", 2, 1.5)
idx <- order(ESM_test$Temp)
polygon(c(ESM_test$Temp[idx], rev(ESM_test$Temp[idx])),
c(interval_high[idx], rev(interval_low[idx])),
border = NA, col = "#80808080")
lines(ESM_test$Temp[idx], predictive_mean[idx])
points(ESM_test$Temp, ESM_test$Ozone, pch = 19)
plot(NULL, xlab = "", ylab = "", xaxt = "n", yaxt = "n",
xlim = range(ESM_test$Temp_lag1), ylim = range(c(interval_high, interval_low)))
axis(1, tick = FALSE, line = -0.75)
axis(2, tick = FALSE, line = -0.75)
mtext("Temp_lag1", 1, 1.5)
mtext("Ozone", 2, 1.5)
idx <- order(ESM_test$Temp_lag1)
polygon(c(ESM_test$Temp_lag1[idx], rev(ESM_test$Temp_lag1[idx])),
c(interval_high[idx], rev(interval_low[idx])),
border = NA, col = "#80808080")
lines(ESM_test$Temp_lag1[idx], predictive_mean[idx])
points(ESM_test$Temp_lag1, ESM_test$Ozone, pch = 19)
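If you would rather stay inside kernlab instead of calling GPML, kernlab::gausspr() can also return predictive standard deviations when it is fitted with variance.model = TRUE. A hedged sketch (I have not checked whether caret passes that argument through, so the model is refitted directly, and type = "sdeviation" in predict() is assumed to be available in your kernlab version):
library(kernlab)
## Refit directly with kernlab so the variance model is available.
gp_k <- gausspr(Ozone ~ Temp + Temp_lag1, data = ESM_train, variance.model = TRUE)
pred_mean <- predict(gp_k, ESM_test)                      # predictive mean
pred_sd <- predict(gp_k, ESM_test, type = "sdeviation")   # predictive standard deviation
interval_high <- pred_mean + 2 * pred_sd
interval_low <- pred_mean - 2 * pred_sd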

Why does my for loop for my curve return `Inf` in R?

I'm trying to loop over 2 curves in R but only one of the curves shows. More importantly, I call the axis() command twice (for the x and y axes) but I get the following errors:
Error in axis(1, at = success/trials) : no locations are finite
'to' must be a finite number
Question
I'm wondering (a) why only my first curve is showing and (b) why I'm getting Inf/-Inf in my axis() calls?
Here is my R code:
success = c(10, 50) ; trials = 100
for(i in 1:length(success)){
  success = success[i]
  col = (1:length(success))[i]
  c = curve(dbinom(success, trials, x), 0, 1, add = ifelse(i > 1, T, F), ty = "l",
            xlim = c(0, 1), col = col, yaxt = "n")
  text(success/trials, max(c$y), bquote(bolditalic(H[.(i)])), pos = 3, xpd = T)
  axis(1, at = success/trials)
}
axis(2, at = seq(0, max(c$y), len = 7), labels = round(seq(0, max(c$y), len = 7), 2), las = 1 )
In your first pass through the loop you set success to success[1] and in the second pass through the loop you set success to success[2] but success was at that point length 1, so it becomes NA.
Similarly, the second plot doesn't show because col becomes NA for the same reason.
To fix this, use a different name for the list of successes and the variable in the loop. For example, use successes = c(10,50) and then use successes when setting success and col inside the loop. (Alternatively for col, just set col = i.)
Applying that to your code:
successes = c(10, 50) ; trials = 100
for(i in 1:length(successes)){
  success = successes[i]
  col = i
  c = curve(dbinom(success, trials, x), 0, 1, add = ifelse(i > 1, T, F), ty = "l",
            xlim = c(0, 1), col = col, yaxt = "n")
  text(success/trials, max(c$y), bquote(bolditalic(H[.(i)])), pos = 3, xpd = T)
  axis(1, at = success/trials)
}
axis(2, at = seq(0, max(c$y), len = 7), labels = round(seq(0, max(c$y), len = 7), 2), las = 1 )
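To see concretely why the original loop produced NA, note what the first assignment does to success:
success <- c(10, 50)
success <- success[1]  # success is now the single value 10
success[2]             # returns NA: on the second pass the index runs past the end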

Predicted probabilities using the bartMachine R package are failure probabilities

If I run a BART model for classification using bartMachine, the returned p_hat_train values correspond to failure probabilities rather than the success probabilities returned by the original implementation of BART in the BayesTree R package.
Here is an example with a simulated binary response:
library(bartMachine)
library(BayesTree)
library(logitnorm)
N = 1000
X <- rnorm(N, 0, 1)
p_true <- invlogit(1.5*X)
y <- rbinom(N, 1, p_true)
## bartMachine
fit <- bartMachine(data.frame(X), as.factor(y), num_burn_in = 200,
num_iterations_after_burn_in = 500)
p_hat <- fit$p_hat_train
## BayesTree
fit2 <- bart(X, as.factor(y), ntree = 50, ndpost = 500)
p_hat2 <- apply(pnorm(fit2$yhat.train), 2, mean)
par(mfrow = c(2,2))
plot(p_hat, p_true, main = 'p_hat_train with bartMachine')
abline(0, 1, col = 'red')
plot(1 - p_hat, p_true, main = '1 - p_hat_train with bartMachine')
abline(0, 1, col = 'red')
plot(p_hat2, p_true, main = 'pnorm(yhat.train) with BayesTree')
abline(0, 1, col = 'red')
Inspecting the iris example from ?bartMachine suggests that bartMachine is estimating the probability that an observation is classified as the first level of the y variable, which in your example happens to be 0. To get your desired result, you'll need to specify levels when you convert y to a factor, i.e.
fit <- bartMachine(data.frame(X), factor(y, levels = c("1", "0")),
num_burn_in = 200,
num_iterations_after_burn_in = 500)
We can see what's going on when we inspect the code for build_bart_machine:
if (class(y) == "factor" & length(y_levels) == 2) {
  java_bart_machine = .jnew("bartMachine.bartMachineClassificationMultThread")
  y_remaining = ifelse(y == y_levels[1], 1, 0)
  pred_type = "classification"
}
And looking at the output from bartMachine (using your original specification) shows what's going on:
head(cbind(fit$model_matrix_training_data, y))
# X y_remaining y
# 1 -0.85093975 0 1
# 2 0.20955263 1 0
# 3 0.66489564 0 1
# 4 -0.09574123 1 0
# 5 -1.22480134 1 0
# 6 -0.36176273 1 0
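Alternatively, if you prefer not to refit, the second panel of the plot above already shows that with the default level ordering the success probability is simply the complement:
## P(y = 1) under bartMachine's default ordering, where p_hat_train is P(y = 0).
p_hat_success <- 1 - fit$p_hat_train
plot(p_hat_success, p_true, main = '1 - p_hat_train recovers P(y = 1)')
abline(0, 1, col = 'red')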
