kernel matrix computation outside SVM training in kernlab (R)

I was developing a new algorithm that generates a modified kernel matrix for training an SVM and encountered a strange problem.
For testing purposes I was comparing the SVM models learned via the kernelMatrix interface and via the normal kernel interface. For example:
# Model with the kernel matrix computed inside ksvm
svp1 <- ksvm(x, y, type = "C-svc", kernel = vanilladot(), scaled = FALSE)
# Model with the kernel matrix computed outside ksvm
K <- kernelMatrix(vanilladot(), x)
svp2 <- ksvm(K, y, type = "C-svc")
identical(nSV(svp1), nSV(svp2))
Note that I have turned scaling off, as I am not sure how to perform scaling on a kernel matrix.
From my understanding, both svp1 and svp2 should return the same model. However, I observed that this is not true for a few datasets, for example glass0 from KEEL.
What am I missing here?

I think this has to do with the same issue posted here. kernlab appears to handle the ksvm calculation differently when explicitly using vanilladot(), because its class is 'vanillakernel' instead of 'kernel'.
If you define your own vanilladot kernel with a class of 'kernel' instead of 'vanillakernel', the code will be equivalent for both:
# Plain linear (vanilla dot) kernel, deliberately classed as "kernel"
# so that kernlab uses its generic kernel code path
kfunction.k <- function() {
  k <- function(x, y) { crossprod(x, y) }
  class(k) <- "kernel"
  k
}
l <- 0.1; C <- 1/(2*l)  # defined here but not passed on to ksvm below
svp1 <- ksvm(x, y, type = "C-svc", kernel = kfunction.k(), scaled = FALSE)
K <- kernelMatrix(kfunction.k(), x)
svp2 <- ksvm(K, y, type = "C-svc", kernel = "matrix", scaled = FALSE)
identical(nSV(svp1), nSV(svp2))
It's worth noting that svp1 and svp2 both differ from their values in the original code because of this change.
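You can inspect the class difference that selects the two code paths directly (a quick check using the kfunction.k defined above):
class(vanilladot())   # "vanillakernel" - kernlab's specialized code path
class(kfunction.k())  # "kernel" - the generic kernel code path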

'Invalid parent values' error when running JAGS from R

I am running a simple generalized linear model, calling JAGS from R. The response is negative-binomially distributed. The model is being fitted to data on counts of fish, with the majority of the individual counts ('C' in the data set below) being zeros.
I initially ran the model with one covariate, temperature ('Temp'). About half of the time the model ran, and the other half of the time it gave me the error 'Error in node C[###] Invalid parent values'. The value for C[###] in the error message changes with each successive attempt to run the model.
Since my success at running the model was inconsistent, I tried adding another covariate, salinity ('Salt'). Then the model would not run at all, with the same error message as above.
Any ideas or suggestions on the source of the error are greatly appreciated.
I suspect that the initial values for the dispersion parameter, r, may be the issue. Ideally I would add several more covariates to the model once this error is addressed.
The data set and code are immediately below. For the sake of getting the data to load properly on this website, I have omitted 662 of the 672 total values; even with the reduced data set (n = 10 instead of n = 672) the problem remains.
Thank you.
setwd("C:/Users/John/Desktop")
library('coda')
library('rjags')
library('R2jags')
set.seed(1000000000)
#data
n=10
C=c(0,0,0,0,0,1,0,0,0,1)
Temp=c(0,29.3,25.3,28.7,28.7,24.4,25.1,25.1,24.2,23.3)
Salt=c(6,6,0,6,6,0,12,12,6,12)
sink("My Model.txt")
cat("
model {
r~dunif(0,10)
beta0~dunif (-20,20)
beta1~dunif (-20,20)
beta2~dunif (-20,20)
for (i in 1:n) {
C[i] ~ dnegbin(p[i], r)
p[i] <- r/(r+lambda[i])
log(lambda[i]) <- mu[i]
mu[i] <- beta0 + beta1*Temp[i] + beta2*Salt[i]
}
}
", fill=TRUE)
sink()
# the objects must exist in the workspace under these names,
# since bugs.data below passes the variable names as strings
n=n; C=C; Temp=Temp; Salt=Salt
#bundle data
bugs.data = list(
  "n",
  "C",
  "Temp",
  "Salt")
#parameters to monitor
params<-c(
"r",
"beta0",
"beta1",
"beta2")
#initial values
inits <- function(){list(
r=floor(runif(1,0,5)),
beta0=runif(1,-5,5),
beta1=runif(1,-5,5),
beta2=runif(1,-5,5))}
model.file <- 'My Model.txt'
jagsfit <- jags(data=bugs.data, inits=inits, params, n.iter=1000, n.thin=10, n.burnin=100, model.file)
print(jagsfit, digits=5)
This works fine for me most of the time, but it would fail with the error you describe whenever the inits function samples a value of 0 for r, which you have made more likely by using floor() (I am not sure why you did that: r is not restricted to integers, but it is strictly positive). Also, every time you run the model you will get different initial values (unless you set a random seed in R), which is making your life more complicated than it needs to be. I generally recommend picking fixed (and probably overdispersed) initial values, such as r=0.01 and r=10 for the two chains in your example.
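For example, a minimal sketch with fixed starting values for two chains (the beta values are illustrative, not from the original post):
inits <- list(
  list(r = 0.01, beta0 = -2, beta1 = -2, beta2 = -2),  # chain 1
  list(r = 10,   beta0 =  2, beta1 =  2, beta2 =  2)   # chain 2
)
jagsfit <- jags(data = bugs.data, inits = inits, params, n.chains = 2,
                n.iter = 1000, n.thin = 10, n.burnin = 100, model.file)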
However, JAGS can pick usable initial values for this model by itself, as you can see by not providing your own inits, e.g.:
library('runjags')
listdata <- lapply(bugs.data, get)    # fetch the objects named in bugs.data
names(listdata) <- unlist(bugs.data)  # name the list elements accordingly
run.jags(model.file, params, listdata)
I would also have a think about the prior you are using for r: it could well have a bigger effect on your posterior than intended. Another (not necessarily better) option is something like a gamma prior.
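For example, the prior line in the model block could be swapped for (hyperparameters chosen purely for illustration):
r ~ dgamma(0.01, 0.01)  # vague gamma prior on the dispersion parameter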
Matt

How to weight a kernel function in ksvm of kernlab package in R?

I am rewriting an RBF kernel function:
install.packages("kernlab")
library(kernlab)
rbf <- function(x, y) {
  gamma <- 0.5
  # Frobenius norm of the difference, weighted by gamma
  exp(-gamma * norm(as.matrix(x) - as.matrix(y), "f"))
}
class(rbf) <- "kernel"
The data I am using (10 observations, 3 variables, with the fourth column as the target):
data<-matrix(1:40,nrow=10,ncol=4)
train<-data[1:(0.6*nrow(data)), ]
test<-data[((0.6*nrow(data))+1):nrow(data), ]
Modelling with the custom kernel function using ksvm, and predicting:
k_rbf <- ksvm(train[,ncol(train)]~.,data=train,C=0.1,type="eps-svr",epsilon=0.01,kernel=rbf)
ksvm_rbf<-predict(k_rbf, test)
So far the function works well, but I want to further extend the custom RBF kernel with a similarity function. Normally, the RBF kernel is:
K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)
The RBF kernel with the added seasonality that I designed looks like this:
K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2) * exp(-(min(d, S - d))^2), with d = |(t_i - t_j) mod S|
where x_i and x_j are two objects representing the time series at timestamps t_i and t_j respectively, and S is the seasonal period.
So, adding a new column with the row index for each row:
t<-1:nrow(data)
data_t<-cbind(t,data)
train_t<-data_t[1:(0.6*nrow(data_t)), ]
test_t<-data_t[((0.6*nrow(data_t))+1):nrow(data_t), ]
Adding the similarity part to the kernel I have built:
sea_rbf <- function(x, y) {
gamma<-0.5
S<-3
n_x<-x[1] # row ID of X
n_y<-y[1] # row ID of y
x<-x[2:4]
y<-y[2:4]
d<-abs((n_x-n_y)%% S)
sea<-min(d,S-d)
value <-exp(-0.5*norm((as.matrix(x)-as.matrix(y)),"f"))*exp(-sea^2)
return (value )}
class(sea_rbf) <- "kernel"
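As a quick sanity check of the intended input layout, the kernel can be called directly on two hand-made rows (illustrative values; the first element is the row index added above):
sea_rbf(c(1, 10, 11, 12), c(4, 20, 21, 22))  # single kernel value for rows 1 and 4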
k_rbf_t <- ksvm(train_t[,ncol(train_t)]~.,data=train_t,C=0.1,type="eps-svr",epsilon=0.01,kernel=sea_rbf)
ksvm_rbf_t<-predict(k_rbf_t, test_t)
There is no error during training or prediction, but when I ran debug() to inspect the details, the behavior was not what I expected. For example, the line n_x <- x[1] in the function never fetched the row ID from the raw data.
Do I have a wrong understanding of how ksvm uses a custom kernel function?
Any help is welcome!
Thanks a lot!

leave-one-out cross validation with knn in R

I have defined my training and test sets as follows:
colon_samp <-sample(62,40)
colon_train <- colon_data[colon_samp,]
colon_test <- colon_data[-colon_samp,]
And the KNN function:
knn_colon <- knn(train = colon_train[1:12533], test = colon_test[1:12533], cl = colon_train$class, k=2)
Here is my LOOCV loop for KNN:
newColon_train <- data.frame(colon_train, id = 1:nrow(colon_train))
id <- unique(newColon_train$id)
loo_colonKNN <- NULL
for (i in id) {
  knn_colon <- knn(train = newColon_train[newColon_train$id != i, ],
                   test  = newColon_train[newColon_train$id == i, ],
                   cl    = newColon_train[newColon_train$id != i, ]$Y)
  loo_colonKNN[[i]] <- knn_colon
}
print(loo_colonKNN)
When I print loo_colonKNN it gives me 40 predictions (i.e. the 40 training-set predictions); however, I would like it to give me 62 predictions (all n samples in the original dataset). How might I go about doing this?
Thank you.
You would simply call the knn function again, using a different test parameter:
[...]
knn_colon2 <- knn(train = newColon_train[newColon_train$id != i, ],
                  test  = newColon_test[newColon_test$id == i, ],
                  cl    = newColon_train[newColon_train$id != i, ]$Y)
This is caused by KNN being a non-parametric, instance-based model: the data itself is the model, hence "training" just means holding the data for "later" prediction and does not require any computationally intensive model-fitting procedure. Consequently it is unproblematic to call the training procedure multiple times and apply it to multiple test sets.
But be aware that the idea of CV is to evaluate only on the held-out partition each time, so looking at all samples at once is probably not what you want to do. Also, instead of coding this yourself, you might be better off using e.g. the knn.cv function or the caret framework, which provide APIs for partitioning, resampling, etc. in one place and are therefore quite convenient for such tasks.
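For example, a minimal LOOCV sketch with knn.cv from the class package (assuming, as in the question's first knn call, that the label column is colon_data$class):
library(class)
loo_pred <- knn.cv(train = colon_data[, 1:12533], cl = colon_data$class, k = 2)
table(loo_pred, colon_data$class)  # LOOCV confusion table over all 62 samples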

computing ridge estimate manually in R, simple

I'm trying to learn about ridge regression, and I am using R. From what I understand, beta.r1 and beta.r2 in the code below should be the same.
library(MASS)
n=50
v1=runif(n)
v2=v1+2
V=cbind(1,v1,v2)
w=3+v1+v2
I=diag(3)
lambda=2 #arbitrarily chosen
beta.r1=solve(t(V)%*%V+lambda*I)%*%t(V)%*%w  # ridge estimate by hand
#Using library(MASS); note that lm.ridge has no 'Inter' argument,
#so Inter=FALSE is silently ignored and an intercept is still fitted
fit=lm.ridge(w~v1+v2,lambda=2, Inter=FALSE)
beta.r2=coef(fit)
#Shouldn't beta.r1 and beta.r2 be the same?
I think it is the variable scaling performed in the lm.ridge code (which you can inspect by typing lm.ridge into your R console) that likely causes the differences. The code scales each variable by its root-mean-square value:
Xscale <- drop(rep(1/n, n) %*% X^2)^0.5
X <- X/rep(Xscale, rep(n, p))
Your code does not perform any variable scaling.
The variable scaling is hinted at on the ?lm.ridge help page in the description of what is returned by lm.ridge:
scales: scalings used on the X matrix.
Therefore you can access the scaling used by lm.ridge:
fit$scales
# v1 v2
# 0.2650311 0.2650311
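As a minimal sketch, you can reproduce the lm.ridge slopes by hand, assuming its usual centering plus root-mean-square scaling (the variable names here are my own):
X <- cbind(v1, v2)
p <- ncol(X)
Xm <- colMeans(X); wm <- mean(w)
Xc <- sweep(X, 2, Xm)                      # center the predictors
Xscale <- drop(rep(1/n, n) %*% Xc^2)^0.5   # RMS scaling, as in lm.ridge
Xs <- Xc / rep(Xscale, rep(n, p))
b.scaled <- solve(t(Xs) %*% Xs + lambda * diag(p), t(Xs) %*% (w - wm))
b.manual <- drop(b.scaled) / Xscale        # back to the original scale
c(wm - sum(Xm * b.manual), b.manual)       # intercept plus slopes
coef(lm.ridge(w ~ v1 + v2, lambda = 2))    # should match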

Plot in SVM model (e1071 Package) using DocumentTermMatrix

I am trying to create a plot for my model created using svm() from the e1071 package.
My code to build the model, predict, and build the confusion matrix is:
ptm <- proc.time()
svm.classifier = svm(x = train.set.list[[0.999]][["0_0.1"]],
y = train.factor.list[[0.999]][["0_0.1"]],
kernel ="linear")
pred = predict(svm.classifier, test.set.list[[0.999]][["0_0.1"]], decision.values = TRUE)
time[["svm"]] = proc.time() - ptm
confmatrix = confusionMatrix(pred,test.factor.list[[0.999]][["0_0.1"]])
confmatrix
train.set.list and test.set.list contain the train and test sets for several conditions, and train.factor.list and test.factor.list hold the true labels for each set. The train and test sets are both DocumentTermMatrix objects.
Then I tried to plot my data with:
plot(svm.classifier, train.set.list[[0.999]][["0_0.1"]])
but I got the message:
"Error in plot.svm(svm.classifier, train.set.list[[0.999]][["0_0.1"]]) :
missing formula."
What am I doing wrong? The confusion matrix seems fine to me, even without using the formula interface in the svm() call.
Without runnable code, it's hard to say exactly what the problem is. My guess, given
?plot.svm
which says
formula: formula selecting the visualized two dimensions. Only needed if more than two input variables are used.
is that your data has more than two predictors. You should specify in your plot function:
plot(svm.classifier, train.set.list[[0.999]][["0_0.1"]], predictor1 ~ predictor2)  # placeholders for two of your term names
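For a self-contained illustration (my own toy example on iris, which has four predictors), the formula picks the two plotted dimensions and slice fixes the remaining ones:
library(e1071)
m <- svm(Species ~ ., data = iris, kernel = "linear")
plot(m, iris, Petal.Width ~ Petal.Length,
     slice = list(Sepal.Width = 3, Sepal.Length = 4))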
