R: library(ks) - bug or not?

In my program I need a function that builds a continuous density estimate (defined everywhere on the reals) from an arbitrary sample. I chose library(ks) and found that it sometimes produces a buggy object; other functions (e.g. plot) crash the R session while accessing it.
Could someone please check whether this is a bug in the package, a problem with my R build, or whether I'm doing something wrong?
The code to reproduce the crash:
library(ks)
set.seed(8192)
density_generator <- function(s) {
  # this function returns a kernel density estimate built on the sample 's'
  hpi1 <- hpi(x = s)               # calculating the h parameter for kernel estimation
  fhat.pi1 <- kde(x = s, H = hpi1) # generating the density object
  fhat.pi1
}

## testing the density generator
conditional_density_object <- density_generator(c(1, 2, 3, 4, 5))
foo <- function(z) { predict(conditional_density_object, x = z) }
y <- seq(from = -7, to = 11, by = 0.01)  # the R session fails for some of these parameters
plot(y, sapply(y, foo), pch = ".")
Other parameters, e.g. y <- seq(from = -7, to = 5, by = 0.01), do not crash the R session.
If everything works OK for the range (-7, 11), please try other (larger) numbers; maybe this is system-specific.
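A hedged diagnostic one might try (my addition, not part of the original report): evaluate predict() over the suspect range without plotting, to see whether the crash happens inside predict.kde() itself or only once plot() touches the result.
y <- seq(from = -7, to = 11, by = 0.01)
vals <- sapply(y, foo)     # a crash here would implicate predict.kde()
summary(vals)              # look for NA / NaN / Inf values before plotting
any(!is.finite(vals))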

Related

Double discrete integration of periodic function with R: doubly integrated function contains linear artifact

I need to integrate a signal from an accelerometer in order to get speed and position over time.
I'm trying the code on some code-generated acceleration data:
1) square wave
2) sawtooth
3) sine
The speed function obtained is fine; the problem is with the position function obtained by integrating the speed. In each case (square wave, sawtooth, sine) the doubly discrete-integrated function shows a linear term superposed on the expected oscillating one.
I've performed this discrete integration both with the diffinv() function and with this custom function I've written:
# function that, given a function sampled at some time values, calculates its
# primitive via the trapezoidal rule
calculatePrimitive <- function(f_t, time, initialValue) {
  F_t <- numeric(length(f_t))
  F_t[1] <- initialValue
  for (i in 2:length(f_t)) {
    F_t[i] <- F_t[i - 1] + ((f_t[i] + f_t[i - 1]) / 2) * (time[i] - time[i - 1])
  }
  F_t
}
The result is the same no matter which function I use to perform the discrete integration, and it is shown in the attached graphs for cases 1) to 3).
I don't understand why this happens: no matter what the acceleration data are, the linear artifact appears whenever the discrete integration is applied to data that were themselves obtained by discrete integration.
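To illustrate where the linear term can come from (my sketch, not part of the original question): for a(t) = sin(2*pi*t) with v(0) = 0, the integrated velocity (1 - cos(2*pi*t))/(2*pi) has a nonzero mean, and integrating that constant offset again produces exactly the linear ramp described above.
time <- seq(0, 10, by = 0.01)
a <- sin(2 * pi * time)                         # zero-mean oscillating acceleration
v <- calculatePrimitive(a, time, 0)             # velocity: oscillation plus a constant offset
p <- calculatePrimitive(v, time, 0)             # position: the offset integrates to a linear ramp
p2 <- calculatePrimitive(v - mean(v), time, 0)  # removing the offset removes the ramp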

R masking conflict: packages bnlearn and sna

I'm having problems using the R packages bnlearn and sna together. The following example is straightforward:
library(bnlearn)
data("asia")
# build network
a <- hc(asia)
# output
a
The output is as expected:
Bayesian network learned via Score-based methods
model:
[A][S][T][L|S][B|S][E|T:L][X|E][D|B:E]
nodes: 8
arcs: 7
undirected arcs: 0
directed arcs: 7
average markov blanket size: 2.25
average neighbourhood size: 1.75
average branching factor: 0.88
learning algorithm: Hill-Climbing
score: BIC (disc.)
penalization coefficient: 4.258597
tests used in the learning procedure: 77
optimized: TRUE
Once I load the sna package, I receive something completely different:
library(sna)
#output
a
I get:
Biased Net Model
Parameters:
Error in matrix(c(x$d, x$pi, x$sigma, x$rho), ncol = 1) :
'data' must be of a vector type, was 'NULL'
As I don't really call any functions (just want to get the output of a), I don't think that using the :: operator can help.
I wonder if the problem is masking of an internal function that I can't really influence. Any help would be great!
This is somewhat similar to other Q&As, except that in this case there is an implicit call to print rather than an explicit function call. It is this print function that is getting masked.
To print a, you can either type a in the terminal or be explicit and type print(a). To get the nice print layout of the bn object, the author has written a print method, and this is what is dispatched when typing either a or print(a). (To see the object without this specific printing you can use print.default(a).) After noting that class(a) == "bn", you can look for the print method by using methods("print"), or by typing bnlearn:::print and then <tab> to see the available functions: this leads to a (non-exported) function bnlearn:::print.bn.
So, long story short, the sna package also has a print.bn method, for objects of class "bn" (biased net), and it is this function that masks the one from bnlearn.
So if you load sna after bnlearn, you can still get the nice printing either by explicitly calling bnlearn:::print.bn(a), or by redefining the print method, print.bn <- bnlearn:::print.bn, and it should print as expected.
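Putting the two workarounds into a minimal snippet (assuming a was built as in the question):
library(bnlearn)
library(sna)                     # sna's print.bn now masks bnlearn's

bnlearn:::print.bn(a)            # call the bnlearn method explicitly

print.bn <- bnlearn:::print.bn   # or restore it as the dispatched method
a                                # prints via bnlearn again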

Passing a list to a function in R for using in optimization

I want to program the maximum likelihood estimation of a gamma distribution in R; so far I have done the following:
library(stats4)
x<-scan("http://www.cmc.edu/pages/faculty/MONeill/Math152/Handouts/gamma-arrivals.txt")
loglike2 <- function(LL) {
  alpha <- LL$a
  beta <- LL$b
  (alpha - 1) * sum(log(x)) - n * alpha * log(beta) - n * lgamma(alpha)
}
mle(loglike2, start = list(a = 0.5, b = 0.5))
but when I try to run it, the following message appears:
Error in mle(loglike2, start = list(a = 0.5, b = 0.5)) :
some named arguments in 'start' are not arguments to the supplied log-likelihood
What am I doing wrong?
From the error message it sounds like mle needs to be able to see the variable names listed in start= in the function call itself.
loglike2 <- function(a, b) {
  alpha <- a
  beta <- b
  (alpha - 1) * sum(log(x)) - n * alpha * log(beta) - n * lgamma(alpha)
}
mle(loglike2, start = list(a = 0.5, b = 0.5))
If that doesn't work you should post a reproducible example with all variables defined and also explicitly indicate which package the mle function is coming from.
The error message is unfortunately cryptic: it indicates missing values, owing to the fact that alpha and beta have to be positive while mle optimizes over the real numbers. Hence you need to transform the parameters over which the function is being optimized, like so:
library(stats4)
x<-scan("http://www.cmc.edu/pages/faculty/MONeill/Math152/Handouts/gamma-arrivals.txt")
loglike <- function(alpha, beta) {
  (alpha - 1) * sum(log(x)) - n * alpha * log(beta) - n * lgamma(alpha)
}

# transform the parameters so they are positive
fit <- mle(function(alpha, beta) loglike(exp(alpha), exp(beta)),
           start = list(alpha = log(0.5), beta = log(0.5)))

# of course you have to exponentiate the estimates too
exp(coef(fit))
Note that the error now is that you are using n in loglike(), which you have not defined. If you define n, you then get an error stating Lapack routine dgesv: system is exactly singular: U[1,1] = 0, which is caused either by a not very good guess for the starting values of alpha and beta or (more likely) by loglike() not having a minimum. (I think your deleted post from last night had a slightly different formula which I was able to get working, but I wasn't able to respond because the post was deleted...)
FYI, if you want to inspect the alpha and beta values that cause the errors, you can use the scoping assignment operator <<- to post the most recently used parameters to the environment in which loglike() is defined, as in:
loglike <- function(alpha, beta) {
  g <<- c(alpha, beta)
  (alpha - 1) * sum(log(x)) - n * alpha * log(beta) - n * lgamma(alpha)
}
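For completeness, here is a hedged sketch of how the pieces above might be assembled (my addition, not part of the original answer): it defines n, negates the objective since stats4::mle expects the negative log-likelihood, and adds the -sum(x)/beta term of the gamma log-likelihood that the posted formula omits; treat both changes as assumptions about what was intended.
library(stats4)
x <- scan("http://www.cmc.edu/pages/faculty/MONeill/Math152/Handouts/gamma-arrivals.txt")
n <- length(x)

# negative log-likelihood of a gamma(shape = alpha, scale = beta) sample;
# the -sum(x)/beta term is an assumption about the intended model
negloglike <- function(alpha, beta) {
  -((alpha - 1) * sum(log(x)) - sum(x) / beta - n * alpha * log(beta) - n * lgamma(alpha))
}

# optimize over log-parameters so alpha and beta stay positive
fit <- mle(function(alpha, beta) negloglike(exp(alpha), exp(beta)),
           start = list(alpha = log(0.5), beta = log(0.5)))
exp(coef(fit))   # back-transform the estimates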

Equation of rbfKernel in kernlab is different from the standard?

I have observed that kernlab uses the rbf kernel as
rbf(x,y) = exp(-sigma * euclideanNorm(x-y)^2)
but according to this wiki link, the rbf kernel should be of the form
rbf(x,y) = exp(-euclideanNorm(x-y)^2/(2*sigma^2))
which is also more intuitive, since two close samples with a large kernel sigma value should then lead to a higher similarity.
I am not sure what e1071's svm uses (the native libsvm code?).
I hope someone can enlighten me on why there is a difference. I caught this because I was initially using e1071 but switched to ksvm and saw inconsistent results between the two.
A small example for comparison:
library(kernlab)
set.seed(123)
x <- rnorm(3)
y <- rnorm(3)
sigma <- 100

rbf <- rbfdot(sigma = sigma)
rbf(x, y)                             # kernlab's value
exp(-sum((x - y)^2) / (2 * sigma^2))  # "textbook" value
I would expect the kernel value to be close to 1 (since x and y come from a distribution with sigma = 1, while the kernel sigma is 100). This is observed only in the second case.
I came across that discrepancy too and wound up digging into the source to figure out whether there was a typo in the documentation or what exactly was going on, since sigma in the context of Gaussians traditionally appears as the standard deviation in the denominator, right?
Here's the relevant source
**kernlab\R\kernels.R**
## Define the kernel objects,
## functions with an additional slot for the kernel parameter list.
## kernel functions take two vector arguments and return a scalar (dot product)
rbfdot <- function(sigma = 1)
{
  rval <- function(x, y = NULL)
  {
    if(!is(x,"vector")) stop("x must be a vector")
    if(!is(y,"vector") && !is.null(y)) stop("y must a vector")
    if (is(x,"vector") && is.null(y)){
      return(1)
    }
    if (is(x,"vector") && is(y,"vector")){
      if (!length(x)==length(y))
        stop("number of dimension must be the same on both data points")
      return(exp(sigma*(2*crossprod(x,y) - crossprod(x) - crossprod(y))))
      # sigma/2 or sigma ??
    }
  }
  return(new("rbfkernel", .Data=rval, kpar=list(sigma=sigma)))
}
You can observe from their comment sigma/2 or sigma ?? that they may have been a bit unsure about which convention to adopt; the presence of a /2 would be consistent with the 1/2 factor in the standard-deviation form /(2*sigma^2), but I can only speculate about this.
Now another corroborating piece of evidence is in the help page for ?rbfdot, which reads:
sigma The inverse kernel width used by the Gaussian the Laplacian,
the Bessel and the ANOVA kernel
And that is consistent with the form they use, with sigma in the numerator, since if it sat in the denominator it would scale proportionally with the width of the Gaussian. So it indeed looks like they settled on the convention that the Wikipedia article describes as the gamma form, where it says
An equivalent, but simpler, definition involves a parameter gamma =
-1/(2*sigma^2)
So the difference just seems to be a matter of adopting different but equivalent conventions. One motivation for this particular convention (which someone may confirm in a comment) may be code reuse and consistency: as you can see, the parameter is shared by three other kernel forms whose parameters may more traditionally sit in the numerator. I'm not sure on that point, however, since I've never used those alternate kernels and am unfamiliar with them.
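As a quick sanity check of that equivalence (my addition): to reproduce the textbook kernel exp(-||x-y||^2 / (2*s^2)) with kernlab, pass sigma = 1/(2*s^2) to rbfdot.
library(kernlab)
set.seed(123)
x <- rnorm(3)
y <- rnorm(3)

s <- 1                                  # "textbook" standard deviation
rbf <- rbfdot(sigma = 1 / (2 * s^2))    # kernlab's inverse-width parameterization
rbf(x, y)                               # matches the line below
exp(-sum((x - y)^2) / (2 * s^2))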

R: SVM performance using custom kernel (user defined kernel) is not working in kernlab

I'm trying to use a user-defined kernel. I know that kernlab offers user-defined (custom) kernel functions in R. I used the spam data included in the kernlab package
(number of variables = 57, number of examples = 4601)
I defined the kernel's form as:
kp <- function(d, e) {
  as <- v * d
  bs <- v * e
  cs <- as - bs
  cs <- as.matrix(cs)
  exp(-(norm(cs, "F")^2) / 2)
}
class(kp) <- "kernel"
It is a transformed Gaussian kernel, where v is a vector of continuously varied values, the inverses of the standard deviations of the individual variables, for example:
v = (0.1666667, ..., 0.1666667)
The training set is 60% of the spam data (preserving the proportions of the different classes).
If an observation's type is spam, its label is set to 1 for training the SVM.
m=ksvm(xtrain,ytrain,type="C-svc",kernel=kp,C=10)
But this step is not working; the call just hangs and never returns.
So I ask: why? Is it because the number of examples is too big? Is there any other R package that can train SVMs with a user-defined kernel?
First, your kernel looks like a classic RBF kernel, with v = 1/sigma, so why do you use it? You could use the built-in RBF kernel and simply set the sigma parameter. In particular, instead of using the Frobenius norm on matrices you could use the classic Euclidean norm on the vectorized matrices.
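To make that point concrete (my addition, not part of the original answer): with a constant v, as in the question's example, kp() reduces to the built-in RBF kernel with sigma = v^2/2.
library(kernlab)

v <- 1/6                          # constant scaling, as in the question
d <- rnorm(5); e <- rnorm(5)
kp(d, e)                          # the custom kernel defined above (reads v from the workspace)
rbfdot(sigma = v^2 / 2)(d, e)     # built-in RBF gives the same value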
Second - this is working just fine.
> xtrain = as.matrix( c(1,2,3,4) )
> ytrain = as.factor( c(0,0,1,1) )
> v= 0.01
> m=ksvm(xtrain,ytrain,type="C-svc",kernel=kp,C=10)
> m
Support Vector Machine object of class "ksvm"
SV type: C-svc (classification)
parameter : cost C = 10
Number of Support Vectors : 4
Objective Function Value : -39.952
Training error : 0
There are at least two reasons why you are still waiting for results:
RBF kernels induce the hardest optimization problems for an SVM (especially for large C)
user-defined kernels are far less efficient than built-in ones
As I am not sure whether ksvm actually optimizes the user-defined kernel computation (in fact, I'm pretty sure it does not), you could try to build the kernel matrix (K[i,j] = K(x_i, x_j), where x_i is the i-th training vector) and provide ksvm with it. You can achieve this with:
K <- kernelMatrix(kp,xtrain)
m <- ksvm(K,ytrain,type="C-svc",kernel='matrix',C=10)
Precomputing the kernel matrix can take quite a long time, but then the optimization itself will be much faster, so it is a good method if you want to test many different C values (which you certainly should do). Unfortunately this requires O(n^2) memory, so if you use more than 100,000 vectors, you will need a really large amount of RAM.
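A short sketch of that workflow (my addition): precompute K once and reuse it while scanning over C; error() is kernlab's training-error accessor, used here only for illustration.
K <- kernelMatrix(kp, xtrain)      # computed once, O(n^2) memory
for (C in c(0.1, 1, 10, 100)) {
  m <- ksvm(K, ytrain, type = "C-svc", kernel = "matrix", C = C)
  cat("C =", C, " training error =", error(m), "\n")
}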
