I'd like to optimize three parameters (gamma, cost and epsilon) in eps-regression (SVR) using GA in R. Here's what I've done.
library(e1071)
data(Ozone, package="mlbench")
a<-na.omit(Ozone)
index<-sample(1:nrow(a), trunc(nrow(a)/3))
trainset<-a[index,]
testset<-a[-index,]
model<-svm(V4 ~ .,data=trainset, cost=0.1, gamma=0.1, epsilon=0.1, type="eps-regression", kernel="radial")
error<-model$residuals
rmse <- function(error) # root mean squared error
{
sqrt(mean(error^2))
}
rmse(error)
Here, I set cost, gamma and epsilon to 0.1 each, but I don't think these are the best values. So, I'd like to employ a genetic algorithm (GA) to optimize these parameters.
GA <- ga(type = "real-valued", fitness = rmse,
min = c(0.1,3), max = c(0.1,3),
popSize = 50, maxiter = 100)
Here, I used RMSE as the fitness function, but I think the fitness function has to include the parameters that are to be optimized. However, in SVR the objective function seems too complicated to write out in R code; I tried to find it for a LONG time, but to no avail. If you know both SVR and GA, or have experience optimizing SVR parameters using GA, please help me.
In such an application, one passes the parameters whose values are to be optimized (in your case, cost, gamma and epsilon) as arguments of the fitness function, which then runs the model fitting and evaluation and uses a measure of model performance as the fitness. Therefore, the explicit form of the SVR objective function is not directly relevant.
In the implementation below, I used 5-fold cross-validation to estimate the RMSE for a given set of parameters. In particular, since package GA maximizes the fitness function, I have written the fitness for a given set of parameter values as minus the average RMSE over the cross-validation folds. Hence, the maximum fitness that can be attained is zero.
Here it is:
library(e1071)
library(GA)
data(Ozone, package="mlbench")
Data <- na.omit(Ozone)
# Setup the data for cross-validation
K = 5 # 5-fold cross-validation
fold_inds <- sample(1:K, nrow(Data), replace = TRUE)
lst_CV_data <- lapply(1:K, function(i) list(
    train_data = Data[fold_inds != i, , drop = FALSE],
    test_data = Data[fold_inds == i, , drop = FALSE]))
# Given the values of parameters 'cost', 'gamma' and 'epsilon', return the rmse of the model over the test data
evalParams <- function(train_data, test_data, cost, gamma, epsilon) {
    # Train
    model <- svm(V4 ~ ., data = train_data, cost = cost, gamma = gamma, epsilon = epsilon, type = "eps-regression", kernel = "radial")
    # Test: root mean squared error on the held-out fold
    rmse <- sqrt(mean((predict(model, newdata = test_data) - test_data$V4) ^ 2))
    return (rmse)
}
# Fitness function (to be maximized)
# Parameter vector x is: (cost, gamma, epsilon)
fitnessFunc <- function(x, Lst_CV_Data) {
    # Retrieve the SVM parameters
    cost_val <- x[1]
    gamma_val <- x[2]
    epsilon_val <- x[3]
    # Use cross-validation to estimate the RMSE for each split of the dataset
    rmse_vals <- sapply(Lst_CV_Data, function(in_data) with(in_data,
        evalParams(train_data, test_data, cost_val, gamma_val, epsilon_val)))
    # As fitness measure, return minus the average RMSE (over the cross-validation folds),
    # so that by maximizing fitness we are minimizing the RMSE
    return (-mean(rmse_vals))
}
# Range of the parameter values to be tested
# Parameters are: (cost, gamma, epsilon)
theta_min <- c(cost = 1e-4, gamma = 1e-3, epsilon = 1e-2)
theta_max <- c(cost = 10, gamma = 2, epsilon = 2)
# Run the genetic algorithm
results <- ga(type = "real-valued", fitness = fitnessFunc, lst_CV_data,
    names = names(theta_min),
    min = theta_min, max = theta_max,
    popSize = 50, maxiter = 100)
summary(results)
which produces results like the following (for the range of parameter values that I specified, which may require fine-tuning based on the data; since the fold assignment is random and not seeded, the exact numbers will vary from run to run):
GA results:
Iterations = 100
Fitness function value = -14.66315
Solution =
cost gamma epsilon
[1,] 2.643109 0.07910103 0.09864132
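Once the GA has finished, you would typically refit the SVR with the selected parameters. Here is a minimal sketch, assuming you want the final model trained on all of Data; ga() returns an S4 object whose solution slot holds the best parameter vector(s) found, with columns named as given in the names argument:
# Refit the SVR with the GA-selected parameters (take the first row if several solutions tie)
best <- results@solution[1, ]
final_model <- svm(V4 ~ ., data = Data,
                   cost = best["cost"], gamma = best["gamma"], epsilon = best["epsilon"],
                   type = "eps-regression", kernel = "radial")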
I'm performing an elastic-net logistic regression on a dataset using the glmnet package in R.
I'm tuning the Alpha by cross-validation. See R code below:
for (j in 1:length(a)) {
    for (i in 1:nAlphaIterations) {
        set.seed(i)
        cv <- cv.glmnet(X_dataset, Y_dataset, nfolds = 10, family = "binomial",
                        type.measure = "deviance", standardize = TRUE,
                        parallel = FALSE, alpha = a[j])
        currAlpha <- data.frame(
            cvm = cv$cvm[cv$lambda == cv$lambda.1se],
            lambda.1se = cv$lambda.1se,
            alpha = a[j],
            seed = i)
        allAlpha <- rbind(allAlpha, currAlpha) # all current alphas combined
    }
    print(j)
}
However, I am not sure which type of measure I should use for determining the alpha. Deviance? Or is AUC better?
Assuming your goal is to achieve high classification accuracy, you can use type.measure = "class" inside the cv.glmnet() call, which cross-validates on misclassification error. To address the issue of random data splits raised in the comments, you can fix the folds by creating a vector of fold indices, one per observation, and passing it via the foldid argument, as suggested in https://glmnet.stanford.edu/articles/glmnet.html#introduction-1:
Users can explicitly control the fold that each observation is assigned to via the foldid argument. This is useful, for example, in using cross-validation to select a value for $\alpha$
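For example, a minimal sketch of fixing the folds across the whole alpha grid (reusing X_dataset, Y_dataset and the alpha vector a from your question, which are not defined here):
library(glmnet)
set.seed(42)
# one fold label per observation, reused for every alpha so the CV curves are comparable
foldid <- sample(rep(1:10, length.out = nrow(X_dataset)))
cv_list <- lapply(a, function(alpha_j)
    cv.glmnet(X_dataset, Y_dataset, foldid = foldid, family = "binomial",
              type.measure = "class", standardize = TRUE, alpha = alpha_j))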
I have a list of data: frequencies x1, x2, ..., xn (i.e. 10, 20, ..., 5000) and intensities y1, y2, ..., yn (i.e. 0, 0, ..., 50, 60, 50, ..., 0). I want to fit a normal distribution to these data.
I found some websites online, such as http://www.di.fc.ul.pt/~jpn/r/distributions/fitting.html, which use a procedure like
my_data <- rnorm(250, mean=1, sd=0.45)# unkonwn distribution parameters
fit <- fitdistr(my_data, densfun="normal")
but obviously those methods won't work for data in this form.
How to fit the above data to a normal distribution?
You can use maximum likelihood estimation via mle() from the stats4 package to solve this problem. Here is how you would do that:
library(stats4)
my_data <- rnorm(250, mean=1, sd=0.45) # unknown distribution parameters
logLik <- function(sigma, mu){
    ll <- vapply(my_data,
                 function(x) dnorm(x, mean = mu, sd = sigma),
                 FUN.VALUE = numeric(1))
    -sum(log(ll))
}
mle(logLik, start = list(sigma = 1, mu = 1))
mle requires a negative log-likelihood function (its minuslogl argument), which it uses to determine the optimal parameters (which, in the case of a normal distribution, are mu (mean) and sigma (standard deviation)). It minimizes the negative sum of the log-likelihoods, -sum(log(ll)), as part of a numerical procedure to find the best parameters for the distribution. It then returns the estimated parameters:
Call:
mle(minuslogl = logLik, start = list(sigma = 1, mu = 1))
Coefficients:
sigma mu
0.4595003 0.9724402
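If, as in the question, the data are frequency/intensity pairs rather than raw samples, one possible adaptation is to weight each point's log-density by its intensity. This is a minimal sketch under that interpretation; freq and intensity are hypothetical stand-ins for the question's x and y values:
library(stats4)
freq <- seq(10, 5000, by = 10)                         # x-axis values, as in the question
intensity <- dnorm(freq, mean = 2500, sd = 400) * 1e5  # fake intensities, for illustration only
negLL <- function(mu, sigma) {
    if (sigma <= 0) return(1e10)  # guard: keep the optimizer away from invalid sd values
    # intensity-weighted negative log-likelihood of a normal over the frequency axis
    -sum(intensity * dnorm(freq, mean = mu, sd = sigma, log = TRUE))
}
mle(negLL, start = list(mu = mean(freq), sigma = sd(freq)))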
In GAM (and GLM, for that matter), we're fitting a conditional likelihood model. So after fitting the model, for a new input x and response y, I should be able to compute the predictive probability or density of a specific value of y given x. I might want to do this to compare the fit of various models on validation data, for example. Is there a convenient way to do this with a fitted GAM in mgcv? Otherwise, how do I figure out the exact form of the density that is used so I can plug in the parameters appropriately?
As a specific example, consider a negative binomial GAM:
## From ?negbin
library(mgcv)
set.seed(3)
n<-400
dat <- gamSim(1,n=n)
g <- exp(dat$f/5)
## negative binomial data...
dat$y <- rnbinom(g,size=3,mu=g)
## fit with theta estimation...
b <- gam(y~s(x0)+s(x1)+s(x2)+s(x3),family=nb(),data=dat)
And now I want to compute the predictive probability of, say, y=7, given x=(.1,.2,.3,.4).
Yes. mgcv is doing (empirical) Bayesian estimation, so you can obtain a predictive distribution. For your example, here is how.
# prediction on the link (with standard error)
l <- predict(b, newdata = data.frame(x0 = 0.1, x1 = 0.2, x2 = 0.3, x3 = 0.4), se.fit = TRUE)
# By standard large-sample GLM theory, the estimated link value is approximately normally distributed;
# with the `log` link used here, the fitted mean `mu` is therefore (approximately) log-normal
p.mu <- function (mu) dlnorm(mu, l[[1]], l[[2]])
# joint density of `y` and `mu`
p.y.mu <- function (y, mu) dnbinom(y, size = 3, mu = mu) * p.mu(mu)
# marginal probability (not density as negative binomial is discrete) of `y` (integrating out `mu`)
# I have carefully written this function so it can take vector input
p.y <- function (y) {
scalar.p.y <- function (scalar.y) integrate(p.y.mu, lower = 0, upper = Inf, y = scalar.y)[[1]]
sapply(y, scalar.p.y)
}
Now, since you want the probability of y = 7 conditional on the specified new data, use
p.y(7)
# 0.07810065
In general, this numerical-integration approach is not easy. For example, if another link function such as sqrt() is used for the negative binomial, the distribution of the fitted mean `mu` is not as straightforward (though also not difficult to derive).
Now I offer a sampling-based, or Monte Carlo, approach. This is most similar to a Bayesian procedure.
N <- 1000 # sample size
set.seed(0)
## draw N samples from the posterior of `mu`
sample.mu <- b$family$linkinv(rnorm(N, l[[1]], l[[2]]))
## draw N samples from the likelihood `Pr(y | mu)`
sample.y <- rnbinom(N, size = 3, mu = sample.mu)
## Monte Carlo estimation for `Pr(y = 7)`
mean(sample.y == 7)
# 0.076
Remark 1
Note that, being empirical Bayes, all of the methods above are conditional on the estimated smoothing parameters. If you want something closer to a "full Bayes" treatment, set unconditional = TRUE in predict().
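For example, a minimal sketch reusing the prediction call from above:
## same prediction as before, but with smoothing-parameter uncertainty
## folded into the reported standard error
l2 <- predict(b, newdata = data.frame(x0 = 0.1, x1 = 0.2, x2 = 0.3, x3 = 0.4),
              se.fit = TRUE, unconditional = TRUE)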
Remark 2
Perhaps some people assume the solution is as simple as this:
mu <- predict(b, newdata = data.frame(x0 = 0.1, x1 = 0.2, x2 = 0.3, x3 = 0.4), type = "response")
dnbinom(7, size = 3, mu = mu)
Such a result is conditional on the estimated regression coefficients (treated as fixed, without uncertainty), so mu is fixed rather than random. This is not a predictive distribution: a predictive distribution would integrate out the uncertainty of model estimation.
I want to calculate a variance-covariance matrix of parameters. The parameters are obtained by a non-linear least squares fit.
library(minpack.lm)
library(numDeriv)
# variables
t <- seq(0.1,20,0.3)
a <- 20
b <- 14
c <- 0.4
jitter <- rnorm(length(t),0,0.5)
Hobs <- a+b*exp(-c*t)+jitter
# function definitions
Hhat <- function(parList, t) {parList$a + parList$b*exp(-parList$c*t)}
Hhatde <- function(par, t) {par[1] + par[2]*exp(-par[3]*t)}
residFun <- function(par, t, observed) observed - Hhat(par,t)
# initial conditions
parStart = list(a = 20, b = 10 ,c = 0.5)
# nls.lm
library(minpack.lm)
out1 <- nls.lm(par = parStart, fn = residFun, observed = Hobs,
t = t, control = nls.lm.control(nprint=0))
I wish to calculate manually what is given back via vcov(out1). I tried the following, but sigma and vcov(out1) don't seem to be the same:
J <- jacobian(Hhatde, c(19.9508523,14.6586555,0.4066367 ), method="Richardson",
method.args=list(),t=t)
sigma <- solve((t(J)%*%J))
vcov(out1)
Now, trying to do it with the Hessian, I can't get it to work; see the error message below.
# hessian
H <- hessian(Hhatde, x = c(19.9508523,14.6586555,0.4066367 ), method="complex", method.args=list(),t=t)
Error in hessian.default(Hhatde, x = c(19.9508523, 14.6586555, 0.4066367), :
Richardson method for hessian assumes a scalar valued function.
How do I get my hessian() call to work? I am not very strong on the math here, hence the trial-and-error approach.
vcov(out1) returns an estimate of the scaled variance-covariance matrix of the parameters in your model. The inverse of the cross-product of the Jacobian, solve(crossprod(J)), returns an estimate of the unscaled variance-covariance matrix. The scaling factor is the estimated variance of the errors. So, to calculate the scaled variance-covariance matrix (up to some rounding error) using the Jacobian and the residuals from your model:
df = length(Hobs) - length(out1$par) # degrees freedom
se_var = sum(out1$fvec^2) / df # estimated error variance
var_cov = se_var * solve(crossprod(J)) # scaled variance-covariance
print(var_cov)
print(vcov(out1))
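Regarding the hessian() error: numDeriv's hessian() expects a scalar-valued function, while Hhatde returns a vector of fitted values. A minimal sketch of one workaround (rss is a hypothetical helper introduced here) is to differentiate the scalar residual sum of squares instead; under the usual Gauss-Newton approximation its Hessian is about twice crossprod(J):
# scalar objective: residual sum of squares at the given parameter values
rss <- function(par, t, observed) sum((observed - Hhatde(par, t))^2)
H <- hessian(rss, x = c(19.9508523, 14.6586555, 0.4066367), t = t, observed = Hobs)
# for small residuals, H / 2 approximates crossprod(J),
# so se_var * solve(H / 2) should roughly reproduce vcov(out1)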
To brush up on non-linear regression and non-linear least squares, you might wish to check out Seber & Wild's Nonlinear regression, or Bates & Watts' Nonlinear regression analysis and its applications. John Fox also has a short online appendix that you may find helpful.
I want to do the following two steps:
1. Based on a given time series, calibrate an AR(1) process, i.e. estimate its parameters.
2. Based on the estimated parameters, simulate an AR(1) process.
Here was my approach:
set.seed(123)
#Just generate random AR(1) time series; based on this, I want to estimate the parameters
ts_AR <- arima.sim(n=10000, list(ar=c(0.5)))
#1. Estimate parameters with arima()
model_AR <- arima(ts_AR, order=c(1,0,0))
#Looks actually good
model_AR
Series: ts_AR
ARIMA(1,0,0) with non-zero mean
Coefficients:
ar1 intercept
0.4891 -0.0044
s.e. 0.0087 0.0195
sigma^2 estimated as 0.9974: log likelihood=-14176.35
AIC=28358.69 AICc=28358.69 BIC=28380.32
#2. Simulate based on model
arima.sim(model=model_AR, n = 100)
Error in arima.sim(model = model_AR, n = 100) :
'ar' part of model is not stationary
I'm not the biggest time-series expert, but I'm pretty sure that an AR(1) process with a persistence parameter below one should be stationary. However, the error message tells me something different. So am I doing something stupid here? If so, why, and what should I do to simulate the AR(1) process based on my estimated parameters? Or can't you just pass the output of arima() as the model input to arima.sim()? In that case, however, I don't understand how I get such an error message... I would expect something like "model input cannot be read. It should be something like ..."
It's not the clearest interface in the world, but the model argument is meant to be a list giving the AR and/or MA coefficients, not an actual fitted arima model.
arima.sim(model=as.list(coef(model_AR)), n=100)
This will create a simulated series with AR coefficient .489 as estimated from your starting data. Note that the intercept is ignored.
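If you also want the simulated series to fluctuate around the estimated mean rather than zero, one option is a small sketch like the following (using model_AR from above; sim is a hypothetical name). Note that the "intercept" reported by arima() for an AR model with non-zero mean is actually the series mean:
# simulate from the estimated AR coefficient only, then shift by the estimated mean
sim <- arima.sim(model = list(ar = unname(coef(model_AR)["ar1"])), n = 100)
sim <- sim + unname(coef(model_AR)["intercept"])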
I don't think you are using the right approach, since there is uncertainty about your coefficient estimate. The best way to achieve what you want is to incorporate that uncertainty into the generation process; there are probably parametric ways to do that, but I think the bootstrap can be handy here.
Let's generate the AR process first:
set.seed(123)
ts_AR <- arima.sim(n = 10000, list(ar = 0.5))
We'll define two helper functions that will be used in the bootstrap. The first one generates the statistics we need (here the coefficient of the AR process and the actual time series), and the second one implements our resampling scheme (it is based on residuals).
ar_fun <- function(ts) c(ar = coef(arima(ts, order = c(1, 0, 0),
                                         include.mean = FALSE)), ts = ts)
ar_sim <- function(res, n.sim, ran.args) {
    rg <- function(n, res) sample(res, n, replace = TRUE)
    ts <- ran.args$ts
    model <- ran.args$model
    arima.sim(model = model, n = n.sim,
              rand.gen = rg, res = c(res))
}
Now we can start our simulation
ar_fit <- arima(ts_AR, order = c(1, 0, 0), include.mean = FALSE)
ts_res <- residuals(ar_fit)
ts_res <- ts_res - mean(ts_res)
ar_model <- list(ar = coef(ar_fit))
require(boot)
set.seed(1)
ar_boot <- tsboot(ts_res, ar_fun,
                  R = 99, sim = "model",
                  n.sim = 100, orig.t = FALSE,
                  ran.gen = ar_sim,
                  ran.args = list(ts = ts_AR, model = ar_model))
If you want to get all the coefficients generated and the associated time series:
coefmat <- apply(ar_boot$t, 1, "[", 1)
seriesmat <- apply(ar_boot$t, 1, "[", -1)
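As a usage sketch with the objects above, you could then summarize the bootstrap distribution of the AR coefficient and look at one of the simulated series:
# bootstrap 95% interval for the AR(1) coefficient
quantile(coefmat, c(0.025, 0.975))
# each column of seriesmat is one bootstrap-simulated series of length n.sim
ts.plot(seriesmat[, 1])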
You can get more details in the help file of tsboot and in Bootstrap Methods and Their Application, chapter 8.