I would like to understand how the R kknn package calculates weights, distances, and class probabilities for binary classification problems. In the R code below, there are three observations in the training sample and one observation in the holdout sample. The two predictor variables are height and weight. With Euclidean distance, the distances from the holdout observation to each observation in the training sample are:
sqrt((6-8)^2 + (4-5)^2) = 2.24
sqrt((6-3)^2 + (4-7)^2) = 4.24
sqrt((6-7)^2 + (4-3)^2) = 1.41.
With k=3 and with equal weights, I get a probability for the holdout as:
(1/3 * 1) + (1/3 * 0) + (1/3 * 1) = 0.67.
With k=2 and with equal weights, I get a probability for the holdout as:
(1/2 * 1) + (1/2 * 1) = 1.00.
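These numbers can be verified quickly in base R (the snippet below is just a manual check, separate from the kknn code that follows):
# Manual check of the Euclidean distances and equal-weight kNN probabilities
train <- data.frame(class = c(1, 0, 1), height = c(8, 3, 7), weight = c(5, 7, 3))
d <- sqrt((6 - train$height)^2 + (4 - train$weight)^2)  # holdout: height = 6, weight = 4
round(d, 2)                        # 2.24 4.24 1.41
mean(train$class)                  # k = 3, equal weights: 0.67
mean(train$class[order(d)][1:2])   # k = 2, equal weights: 1.00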
I would like to understand how the R kknn package makes these same calculations with the "triangular," "gaussian," and "inverse" weights (and more generally).
library(kknn)
training <- data.frame(class = c(1, 0, 1), height = c(8, 3, 7), weight = c(5, 7, 3))
holdouts <- data.frame(class = 1, height = 6, weight = 4)
triangular_kernel <- kknn(class ~., training, holdouts, distance = 2, kernel = "triangular", k = 3)
triangular_kernel[["fitted.values"]]
triangular_kernel[["W"]]
triangular_kernel[["D"]]
gaussian_kernel <- kknn(class ~., training, holdouts, distance = 2, kernel = "gaussian", k = 3)
gaussian_kernel[["fitted.values"]]
gaussian_kernel[["W"]]
gaussian_kernel[["D"]]
inverse_kernel <- kknn(class ~., training, holdouts, distance = 2, kernel = "inv", k = 3)
inverse_kernel[["fitted.values"]]
inverse_kernel[["W"]]
inverse_kernel[["D"]]
Calling kknn::kknn prints the source code for the kknn function in the console. With it, one can go through the function line by line to see what it does.
Distance
kknn calls compiled C code, dmEuclid, to compute the distances. To obtain its source code, we follow this guide, running the following code in R:
untar(download.packages(pkgs = "kknn", destdir = ".", type = "source")[,2])
and then open the src directory of the extracted kknn source (kknn_1.3.1) in your working directory (getwd()) and open dm.C in any text editor. Scroll about halfway down to find dmEuclid. To test the exact outputs of dmEuclid, you could install the build tools, open the file as C++ in RStudio (selecting the file type from the dropdown menu), and run the code with different inputs.
Following the function outputs, in your case dmtmp$dm contains
3.779645e-01 1.133893e+00 1.000000e+150 3.685210e-156
Per your choice of k, the first 3 values are kept as the distances D. The (k+1)-th value is used as maxdist; because it is smaller than 1e-06 in your case, the package floors it at maxdist = 1e-06.
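In code terms, the logic described above looks roughly like this (a paraphrase of the relevant kknn lines, not a verbatim copy):
D <- matrix(dmtmp$dm, nrow = p, ncol = k + 1)  # p holdout rows, k + 1 candidate distances
maxdist <- D[, k + 1]                          # the (k + 1)-th distance
maxdist[maxdist < 1e-06] <- 1e-06              # floored at 1e-06, as in this example
D <- D[, 1:k]                                  # the first k values are kept as D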
Weights
The kknn function uses the following section to allocate a weight scheme, per your defined kernel.
W <- D/maxdist
W <- pmin(W, 1 - (1e-06))
W <- pmax(W, 1e-06)
At this point your W values are larger than 1, so W is coerced to approximately 1 (1 - 1e-06).
if (kernel == "inv")
    W <- 1/W
if (kernel == "triangular")
    W <- 1 - W
if (kernel == "gaussian") {
    alpha = 1/(2 * (k + 1))
    qua = abs(qnorm(alpha))
    W = W * qua
    W = dnorm(W, sd = 1)
}
The explanation for these kernel definitions can be found in the paper linked by gowerc.
W is then converted to a matrix, W <- matrix(W, p, k), with 1 row (p = 1) and 3 columns (k = 3).
Fitted value
In your case p = 1, k = 3, and cl = c(1, 0, 1).
C <- matrix(dmtmp$cl, nrow = p, ncol = k + 1)
C <- C[, 1:k] + 1
CL <- matrix(cl[C], nrow = p, ncol = k)
W <- matrix(W, p, k)
fit <- rowSums(W * CL)/pmax(rowSums(W), 1e-06)
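Putting the pieces together for your triangular-kernel example, here is a minimal manual sketch (my own reconstruction, not kknn's verbatim code); the neighbour ordering in CL is illustrative, and since all three scaled distances exceed 1, every weight collapses to about 1e-06 and the fit reduces to 2/3:
# Manual reconstruction of the triangular-kernel fitted value
D <- c(3.779645e-01, 1.133893e+00, 1e+150)   # the k = 3 distances kept as D
maxdist <- max(3.685210e-156, 1e-06)         # the (k+1)-th distance, floored at 1e-06
W <- D/maxdist
W <- pmin(W, 1 - 1e-06)
W <- pmax(W, 1e-06)
W <- 1 - W                                   # triangular kernel
CL <- c(1, 1, 0)                             # neighbour classes (order illustrative)
fit <- sum(W * CL)/max(sum(W), 1e-06)
fit                                          # about 0.667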
Related
I am trying to figure out how to sample from a custom density in rJAGS but am running into issues. Having searched the site, I saw that there is a zeros (or ones) trick based on BUGS code that can be employed, but I am having a hard time implementing it in rJAGS. I think I am doing it correctly but keep getting the following error:
Error in jags.model(model1.spec, data = list(x = x, N = N), n.chains = 4, :
Error in node dpois(lambda)
Length mismatch in Node::setValue
Here is my rJAGS code for reproducibility:
library(rjags)
set.seed(4)
N = 100
x = rexp(N, 3)
L = quantile(x, prob = 1) # Censoring point
censor = ifelse(x <= L, 1, 0) # Censoring indicator
x[censor == 1] <- L
model1.string <-"
model {
for (i in 1:N){
x[i] ~ dpois(lambda)
lambda <- -N*log(1-exp(-(1/mu)))
}
mu ~ dlnorm(mup, taup)
mup <- log(.0001)
taup <- 1/49
R <- 1 - exp(-(1/mu) * .0001)
}
"
model1.spec<-textConnection(model1.string)
jags <- jags.model(model1.spec,
data = list('x' = x,
'N' = N),
n.chains=4,
n.adapt=100)
Here, my negative log likelihood of the density I am interested in is -N*log(1-exp(-(1/mu))). Is there an obvious mistake in the code?
Using the zeros trick, the variable on the left-hand side of the dpois() relationship has to be an N-length vector of zeros. The variable x should show up in the likelihood somewhere. Here is an example using the normal distribution.
set.seed(519)
N <- 100
x <- rnorm(100, mean=3)
z <- rep(0, N)
C <- 10
pi <- pi
model1.string <-"
model {
for (i in 1:N){
lambda[i] <- pow(2*pi*sig2, -0.5) * exp(-.5*pow(x[i]-mu, 2)/sig2)
loglam[i] <- log(lambda[i]) + C
z[i] ~ dpois(loglam[i])
}
mu ~ dnorm(0,.1)
tau ~ dgamma(1,.1)
sig2 <- pow(tau, -1)
sumLL <- sum(log(lambda[]))
}
"
model1.spec<-textConnection(model1.string)
set.seed(519)
jags <- jags.model(model1.spec,
data = list('x' = x,
'z' = z,
'N' = N,
'C' = C,
'pi' = pi),
inits = function()list(tau = 1, mu = 3),
n.chains=4,
n.adapt=100)
samps1 <- coda.samples(jags, c("mu", "sig2"), n.iter=1000)
summary(samps1)
Iterations = 101:1100
Thinning interval = 1
Number of chains = 4
Sample size per chain = 1000
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
mu 4.493 2.1566 0.034100 0.1821
sig2 1.490 0.5635 0.008909 0.1144
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
mu 0.6709 3.541 5.218 5.993 7.197
sig2 0.7909 0.999 1.357 1.850 2.779
I am using the glmnet package in R, and not(!) the caret package, for my binary ElasticNet regression. I have come to the point where I would like to compare models (e.g. lambda set to lambda.1se or lambda.min, and models where k-fold is set to 5 or 10). But I have not yet managed to compute the AICc or BIC for my models. How do I do that? I tried this and this, but it did not work for me; I only get an empty list.
Code:
set.seed(123)
foldid <- sample(rep(seq(10), length.out = nrow(x.train)))
list.of.fits.df <- list()
for (i in 0:10){
fit.name <- paste0("alpha", i/10)
list.of.fits.df[[fit.name]] <- cv.glmnet(x.train, y.train, type.measure = c("auc"), alpha = i/10, family = "binomial", nfolds = 10, foldid = foldid, parallel = TRUE)
}
best.fit <- coef(list.of.fits.df[[fit.name]], s = list.of.fits.df[[fit.name]]$lambda.1se)
best.fit.min <- coef(list.of.fits.df[[fit.name]], s = list.of.fits.df[[fit.name]]$lambda.min)
#AICc & BIC
#???
How can I find the AICc and BIC for my best fit model?
You can alter the solution given in this answer slightly to obtain the desired result. The reason it doesn't work "out of the box" is that cv.glmnet returns the results of several fits; the individual fit objects are stored in x$glmnet.fit, and we can use this to create a simple function for calculating AICc and BIC.
glmnet_cv_aicc <- function(fit, lambda = 'lambda.1se'){
  # index of the chosen lambda (lambda.1se or lambda.min) in the lambda path
  whlm <- which(fit$lambda == fit[[lambda]])
  with(fit$glmnet.fit,
       {
         # nulldev - deviance at the chosen lambda, i.e. nulldev * dev.ratio
         tLL <- nulldev - nulldev * (1 - dev.ratio)[whlm]
         k <- df[whlm]   # number of non-zero coefficients
         n <- nobs       # number of observations
         return(list('AICc' = - tLL + 2 * k + 2 * k * (k + 1) / (n - k - 1),
                     'BIC' = log(n) * k - tLL))
       })
}
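In other words (my reading of the code above, not a formula quoted from the glmnet documentation): with tLL = nulldev - deviance at the chosen lambda, k the number of non-zero coefficients (df), and n = nobs, the function returns
AICc = -tLL + 2*k + 2*k*(k + 1)/(n - k - 1)
BIC = k*log(n) - tLL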
All we'll then have to do is provide the model and get our estimated AICc.
best.aicc <- glmnet_cv_aicc(list.of.fits.df[[fit.name]])
best.aicc.min <- glmnet_cv_aicc(list.of.fits.df[[fit.name]], 'lambda.min')
For a reproducible example, one could use one of the many examples provided in help(glmnet):
n = 500
p = 30
nzc = trunc(p/10)
x = matrix(rnorm(n * p), n, p)
beta3 = matrix(rnorm(30), 10, 3)
beta3 = rbind(beta3, matrix(0, p - 10, 3))
f3 = x %*% beta3
p3 = exp(f3)
p3 = p3/apply(p3, 1, sum)
g3 = glmnet:::rmult(p3)
set.seed(10101)
cvfit = cv.glmnet(x, g3, family = "multinomial")
print(glmnet_cv_aicc(cvfit))
# Output
#$AICc
#[1] -556.2404
#
#$BIC
#[1] -506.3058
print(glmnet_cv_aicc(cvfit, 'lambda.min'))
# Output
#$AICc
#[1] -601.0234
#
#$BIC
#[1] -506.4068
I am currently using R in combination with Stan to conduct MCMC sampling for obtaining the posterior distribution of a certain demand variable d, given historical demand dH and a currently observed variable x (so the formulation is figuring out P(d|dH, x), which is proportional to P(x|d)P(d|dH)).
My question
I found it really weird that the sampling process shows MCMC jumping back and forth between warmup and sampling (isn't it the case that the first n iterations are always the warmup stage, followed by the actual sampling stage?). At the same time, it skipped Chain 1 completely (?!). Below is the picture of the progress it shows:
My code
for(i in 1:365){
nrow = nrow(rte_m[[i]]);
ncol = ncol(rte_m[[i]]);
A <- as.matrix(rte_m[[i]]);
sigma_x <- as.vector(sample.int(10, nrow(kf_vect[[i]]), replace=TRUE))
sigma_y <- as.vector(eps_vect[[i]])
yH <- as.vector(dh_vect[[i]]);
yT <- yH + as.vector(eps_vect[[i]]);
epsilon <- sample.int(10, nrow(kf_vect[[i]]), replace=TRUE)
x <- as.vector(as.matrix(rte_m[[i]])%*%yT) + epsilon
iterations = 500;
#input data into a list called stan_data
stan_data = list(nrow = nrow, ncol = ncol,
yH = yH,
x = x, epsilon = epsilon,
A = A, sigma_x = sigma_x, sigma_y = sigma_y);
#input it into our Stan model file "stamodeling.stan"
stanmodel1 <- stan_model(file = "stamodeling.stan",
model_name = "stanmodel1");
#MCMC sampling
stanfit <- sampling(stanmodel1, data = list(ncol = ncol,nrow = nrow,
yH = yH,
x=x, epsilon = epsilon,
A = A, sigma_x = sigma_x, sigma_y = sigma_y)
,iter=iterations, warmup = 200, chains = 4, cores = 2);
Stan modeling file and data files: attached in the original post (not reproduced here).
What's happening isn't that a given chain is switching between warmup and sampling; rather, the progress messages from the various chains are being interspersed with one another.
So, for example, when you see the following:
[Iteration:] 50/500 [0%] (Warmup)
[Iteration:] 50/500 [0%] (Warmup)
You're actually seeing two messages, one from Chain A and the second from Chain B.
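If the interleaved progress lines are a nuisance, one option (a sketch based on the sampling call in the question; refresh is rstan's argument for controlling progress reporting) is to run the chains sequentially or to reduce the output:
# Sequential chains print their progress in order; refresh = 0 would silence it entirely
stanfit <- sampling(stanmodel1, data = stan_data,
                    iter = iterations, warmup = 200,
                    chains = 4, cores = 1, refresh = 50)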
My task is to simulate a compound Poisson process defined as
X(t) = \sum_{i=1}^{N(t)} Y_i,
where N(t) is a Poisson process and the Y_i are Gamma(shape, scale) distributed. This is my R code:
# parameter for Poisson distribution.
lambda = 1
# parameters for Gamma distribution.
shape = 7.5
scale = 1
comp.pois = function(t.max, lambda) {
stopifnot(t.max >= 0 && t.max %% 1 == 0)
# offset ns by 1 because first y is 0.
# generate N(t), that is number of arrivals until time t.
ns = cumsum(rpois(n = t.max, lambda = lambda)) + 1
# generate gamma distributed random variables Y_i.
ys = c(0, rgamma(n = max(ns), shape = shape, scale = scale))
# generate all X(t) for t <= t.max.
return(c(0, cumsum(x = ys[ns])))
}
Compute a random sample of X(10) and compare means and variances.
# sample size.
size = 1000
t = 10
# ts is a vector of sample values for X(10).
ts = sapply(1:size, function(i) comp.pois(t, lambda)[t])
# sample mean and variance:
(mean.s = mean(ts))
(var.s = var(ts))
# theoretical mean and variance:
(mean.t = lambda * t * shape * scale)
(var.t = (shape + 1) * shape * scale^2)
output:
> # sample:
> (mean.s = mean(ts))
[1] 63.38403
> (var.s = var(ts))
[1] 184.3264
> # theoretical:
> (mean.t = lambda * t * shape * scale)
[1] 75
> (var.t = (shape + 1) * shape * scale^2)
[1] 63.75
This variance is gigantic, but I cannot spot my mistake. Please help. Thank you.
EDIT:
I used the following algorithm to generate the N(t). I don't know why it is supposed to be better. I took it from Rizzo, Maria L. Statistical computing with R. CRC Press, 2007. The mean is good, but the variance is even worse. I tried sampling from the Gamma distribution only once for the entire simulation (although I'm pretty sure this does not reflect the problem very well) and the mean was off by around 10-40 for t = 10. When resampling for every X(t) (which is what the following code does), the mean is very exact. As pointed out, the variance is horrifying. This is probably not a good solution, but I suppose it is as good as it gets.
lambda = 3
shape = 6
scale = 2
size = 10000
eps = 1e-8
t = 10
# with probability 1-eps, n or less gamma distributed random variables are needed.
n = qpois(1-eps, lambda = lambda * t)
# sample from the gamma distribution. Not sure if it's ok to use the same sample every time.
# with this, the mean is off by about 10%.
# ys = c(rgamma(n = n, shape = shape, scale = scale))
# the interarrival times are exponentially distributed with rate lambda.
pp.exp = function (t0) {
# not sure how many Tn are needed :/
Tn = rexp(1000, lambda)
Sn = cumsum(Tn)
return(min(which(Sn > t0)) - 1)
}
# generate N(t) which follow the poisson process.
ns = sapply(1:size, function (i) pp.exp(t))
# generate X(t) as in the problem description.
xs = sapply(ns, function (n) {
ys = c(rgamma(n = n, shape = shape, scale = scale))
sum(ys[1:n])
})
output (t=10) in this case:
> # compare mean and variance of 'size' samples of X(t) for verification.
> # sample:
> (mean.s = mean(xs))
[1] 359.864
> (var.s = var(xs))
[1] 4933.277
> # theoretical:
> (mean.t = lambda * t * shape * scale)
[1] 360
> (var.t = (shape + 1) * shape * scale^2)
[1] 168
I am using 'KFAS' package from R to estimate a state-space model with the Kalman filter. My measurement and transition equations are:
y_t = Z_t * x_t + \eps_t (measurement)
x_t = T_t * x_{t-1} + R_t * \eta_t (transition),
with \eps_t ~ N(0,H_t) and \eta_t ~ N(0,Q_t).
So, I want to estimate the variances H_t and Q_t, but also T_t, the AR(1) coefficient. My code is as follows:
library(KFAS)
set.seed(100)
eps <- rt(200, 4, 1)
meas <- as.matrix((arima.sim(n=200, list(ar=0.6), innov = rnorm(200)*sqrt(0.5)) + eps),
ncol=1)
Zt <- 1
Ht <- matrix(NA)
Tt <- matrix(NA)
Rt <- 1
Qt <- matrix(NA)
ss_model <- SSModel(meas ~ -1 + SSMcustom(Z = Zt, T = Tt, R = Rt,
Q = Qt), H = Ht)
fit <- fitSSM(ss_model, inits = c(0,0.6,0), method = 'L-BFGS-B')
But it returns: "Error in is.SSModel(do.call(updatefn, args = c(list(inits, model), update_args)),: System matrices (excluding Z) contain NA or infinite values, covariance matrices contain values larger than 1e+07"
The NA definitions for the variances work well, as documented in the package's paper. However, it seems this cannot be done for the AR coefficients. Does anyone know how I can do this?
Note that I am aware of the SSMarima function, which eases the definition of the transition equation as ARIMA models. Although I am able to estimate the AR(1) coef. and Q_t this way, I still cannot estimate the \eps_t variance (H_t). Moreover, I am migrating my Kalman filter codes from EViews to R, so I need to learn SSMcustom for other models that are more complicated.
Thanks!
It seems that you are missing something in your example, as your error message comes from the function fitSSM. If you want to use fitSSM to estimate general state space models, you need to provide your own model-updating function; the default behaviour can only handle NAs in the covariance matrices H and Q. The main goal of fitSSM is just to get you started with simple cases. For complex models and/or large data, I would recommend writing your own objective function (with the help of the logLik method) and using your favourite numerical optimization routines manually for maximum performance. Something like this:
library(KFAS)
set.seed(100)
eps <- rt(200, 4, 1)
meas <- as.matrix((arima.sim(n=200, list(ar=0.6), innov = rnorm(200)*sqrt(0.5)) + eps),
ncol=1)
Zt <- 1
Ht <- matrix(NA)
Tt <- matrix(NA)
Rt <- 1
Qt <- matrix(NA)
ss_model <- SSModel(meas ~ -1 + SSMcustom(Z = Zt, T = Tt, R = Rt,
Q = Qt), H = Ht)
objf <- function(pars, model, estimate = TRUE) {
model$H[1] <- pars[1]
model$T[1] <- pars[2]
model$Q[1] <- pars[3]
if (estimate) {
-logLik(model)
} else {
model
}
}
opt <- optim(c(1, 0.5, 1), objf, method = "L-BFGS-B",
lower = c(0, -0.99, 0), upper = c(100, 0.99, 100), model = ss_model)
ss_model_opt <- objf(opt$par, ss_model, estimate = FALSE)
Same with fitSSM:
updatefn <- function(pars, model) {
model$H[1] <- pars[1]
model$T[1] <- pars[2]
model$Q[1] <- pars[3]
model
}
fit <- fitSSM(ss_model, c(1, 0.5, 1), updatefn, method = "L-BFGS-B",
lower = c(0, -0.99, 0), upper = c(100, 0.99, 100))
identical(ss_model_opt, fit$model)
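Either way, the estimated model can then be passed to KFS for filtering and smoothing; a minimal sketch using the objects created above:
# Run the Kalman filter and smoother with the estimated parameters
out <- KFS(ss_model_opt)   # identical to KFS(fit$model), per the check above
out$alphahat               # smoothed estimates of the state x_t
opt$par                    # estimated H, T (the AR coefficient) and Q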