I am trying to fit my own distribution to my data, find the optimum parameters of the distribution to match the data, and ultimately find the FWHM of the peak in the distribution. From what I've read, the package fitdistrplus is the way to do this. I know the data take the shape of a Lorentzian peak on a quadratic background.
[Plot of the raw data]
The raw data used:
data = c(0,2,5,4,5,4,3,3,2,2,0,4,4,2,5,5,3,3,4,4,4,3,3,5,5,6,6,8,4,0,6,5,7,5,6,3,2,1,7,0,7,9,5,7,5,3,5,5,4,1,4,8,10,2,5,8,7,14,7,5,8,4,2,2,6,5,4,6,5,7,5,4,8,5,4,8,11,9,4,8,11,7,8,6,9,5,8,9,10,8,4,5,8,10,9,12,10,10,5,5,9,9,11,19,17,9,17,10,17,18,11,14,15,12,11,14,12,10,10,8,7,13,14,17,18,16,13,16,14,17,20,15,12,15,16,18,24,23,20,17,21,20,20,23,20,15,20,28,27,26,20,17,19,27,21,28,32,29,20,19,24,19,19,22,27,28,23,37,41,42,34,37,29,28,28,27,38,32,37,33,23,29,55,51,41,50,44,46,53,63,49,50,47,54,54,43,45,58,54,55,67,52,57,67,69,62,62,65,56,72,75,88,87,77,70,71,84,85,81,84,75,78,80,82,107,102,98,82,93,98,90,94,118,107,113,103,99,103,96,108,114,136,126,126,124,130,126,113,120,107,107,106,107,136,143,135,151,132,117,118,108,120,145,140,122,135,153,157,133,130,128,109,106,122,133,132,150,156,158,150,137,147,150,146,144,144,149,171,185,200,194,204,211,229,225,235,228,246,249,238,214,228,250,275,311,323,327,341,368,381,395,449,474,505,529,585,638,720,794,896,919,1008,1053,1156,1134,1174,1191,1202,1178,1236,1200,1130,1094,1081,1009,949,890,810,760,690,631,592,561,515,501,489,467,439,388,377,348,345,310,298,279,253,257,259,247,237,223,227,217,210,213,197,197,192,195,198,201,202,211,193,203,198,202,174,164,162,173,170,184,170,168,175,170,170,168,162,149,139,145,151,144,152,155,170,156,149,147,158,171,163,146,151,150,147,137,123,127,136,149,147,124,137,133,129,130,128,139,137,147,141,123,112,136,147,126,117,116,100,110,120,105,91,100,100,105,92,88,78,95,75,75,82,82,80,83,83,66,73,80,76,69,81,93,79,71,80,90,72,72,63,57,53,62,65,49,51,57,73,54,56,78,65,52,58,49,47,56,46,43,50,43,40,39,36,45,28,35,36,43,48,37,36,35,39,31,24,29,37,26,22,36,33,24,31,31,20,30,28,23,21,27,26,29,21,20,22,18,19,19,20,21,20,25,18,12,18,20,20,13,14,21,20,16,18,12,17,20,24,21,20,18,11,17,12,5,11,13,16,13,13,12,12,9,15,13,15,11,12,11,8,13,16,16,16,14,8,8,10,11,11,17,15,15,9,9,13,12,3,11,14,11,14,13,8,7,7,15,12,8,12,14,9,5,2,10,8)
I have worked out the functions defining the density and the cumulative distribution:
dFF <- function(x,a,b,c,A,gamma,pos) a + b*x + (c*x^2) + ((A/pi)*(gamma/(((x-pos)^2) + (gamma^2))))  # density: quadratic background + Lorentzian peak
pFF <- function(x,a,b,c,A,gamma,pos) a*x + (b/2)*(x^2) + (c/3)*(x^3) + A/2 + (A/pi)*(atan((x - pos)/gamma))  # antiderivative; the Cauchy part is its CDF
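As a quick visual check of these definitions (my sketch, not part of the original workflow), dFF can be overlaid on the counts using the starting values that appear below:

x <- seq_along(data)
plot(x, data, col = "red", pch = 20)
lines(x, dFF(x, a = 0, b = 0.3, c = -0.0004, A = 70000, gamma = 13, pos = 331), lwd = 2)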
I believe these to be correct. From what I understand, a distribution fit should be possible from these definitions alone via fitdist (or mledist):
fitdist(data,'FF', start = list(0,0.3,-0.0004,70000,13,331))
mledist(data,'FF', start = list(0,0.3,-0.0004,70000,13,331))
In the first case this returns:

function cannot be evaluated at initial parameters
Error in fitdist(data, "FF", start = list(0, 0.3, -4e-04, 70000, 13, 331)) :
  the function mle failed to estimate the parameters, with the error code 100

and in the second I just get a list of NA values for the estimates.
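The error here means the density (and hence the log-likelihood) cannot be evaluated at the starting point; evaluating dFF directly on the observations shows why (a diagnostic sketch of mine, not a fix):

dens <- dFF(data, 0, 0.3, -0.0004, 70000, 13, 331)
range(dens)                 # the quadratic background goes negative at large x
any(!is.finite(log(dens)))  # TRUE, so the mle step fails immediately

Note also that fitdist treats data as a sample drawn from the density, not as y-values of a curve, so the density would additionally need to integrate to 1.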
I then wrote a function returning the quantiles of the distribution, in order to use the other fitting methods (qmefit):
qFF <- function(p,a,b,c,A,gamma,pos)
{
  qList = c()
  axis = seq(1,600,1)
  aF = dFF(axis,a,b,c,A,gamma,pos)
  arr = histogramCpp(aF) # change data to a histogram format
  for(element in 1:length(p)){
    q = quantile(arr, p[element], names = FALSE)
    qList = c(qList, q)
  }
  return(qList)
}
Part of this code calls a C++ function, compiled with the Rcpp library:
#include <Rcpp.h>
#include <vector>
#include <math.h>
using namespace Rcpp;

// [[Rcpp::export]]
std::vector<int> histogramCpp(NumericVector x) {
  std::vector<int> arr;
  double number, fractpart, intpart;
  for(int i = 0; i < x.size(); i++){   // was i <= 600, which read past the end of x
    number = x[i];
    fractpart = modf(number, &intpart); // round to the nearest integer count
    if(fractpart < 0.5){
      number = (int) intpart;
    } else {
      number = (int) (intpart + 1);
    }
    for(int j = 1; j <= number; j++){
      arr.push_back(i + 1);             // 1-based, matching axis = seq(1,600,1)
    }
  }
  return arr;
}
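Assuming the C++ above is saved as histogramCpp.cpp (the file name is mine), it can be compiled and qFF exercised like this:

library(Rcpp)
sourceCpp("histogramCpp.cpp")  # exposes histogramCpp() to the R session
p <- seq(0, 1, by = 0.001)
q <- qFF(p, 0, 0.3, -0.0004, 70000, 13, 331)
plot(p, q, type = 'l')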
This C++ function just turns the data into a histogram format: if the first element of the input vector is 4, then '1' is added 4 times to the returned vector, and so on. This also seems to work, as sensible values are returned.

[Plot of the quantile function for probabilities from 0 to 1 in steps of 0.001]
The 'qmefit' method can then be attempted through the fitdist function:
fitdist(data,'FF', start = list(0,0.3,-0.0004,70000,13,331), method = 'qme', probs = c(0,0.3,0.4,0.5,0.7,0.9))
I chose the 'probs' values more or less at random, as I don't fully understand their meaning. This either crashes the R session outright or, after a brief stutter, returns a list of NA values as estimates along with the line <std::bad_alloc : std::bad_alloc>.
I am not sure if I am making a basic mistake here and any help or recommendations are appreciated.
In the end I managed to find a workaround for this using the rPython package and lmfit from Python. It solved my issue and might be useful for others with the same problem. The R code was as follows:
library(rPython)
python.load("pyFit.py")          # load the script below
python.assign("row", pos)        # starting estimate of the peak position
python.assign("vals", vals)      # the raw counts
python.exec("FWHM,ERROR,FIT = fitDist(row,vals)")
FWHM = python.get("FWHM")
ERROR = python.get("ERROR")
cFIT = python.get("FIT")
and the called Python code (pyFit.py) was:
from lmfit import Model
import numpy as np

def cauchyDist(x, a, b, c, d, e, f, g, A, gamma, pos):
    # 6th-order polynomial background + Lorentzian (Cauchy) peak
    return (a + b*x + c*pow(x,2) + d*pow(x,3) + e*pow(x,4) + f*pow(x,5)
            + g*pow(x,6)
            + (A/np.pi)*(gamma/(pow(x-pos,2) + pow(gamma,2))))

def fitDist(row, vals):
    gmod = Model(cauchyDist)
    x = np.arange(0, 600)
    result = gmod.fit(vals, x=x, a=0, b=0.3, c=-0.0004, d=0, e=0, f=0, g=0,
                      A=70000, gamma=13, pos=row)
    # write the fit report out, then parse the gamma line back in
    newFile = open('fitData.txt', 'w')
    newFile.write(result.fit_report())
    newFile.close()
    with open('fitData.txt', 'r') as inF:
        for line in inF:
            if 'gamma:' in line:
                j = line.split()
    FWHM = float(j[1])   # fitted gamma (note: the half width; FWHM of a Lorentzian is 2*gamma)
    error = float(j[3])
    fit = result.best_fit
    fit = fit.tolist()
    return FWHM, error, fit
I increased the order of the polynomial background to obtain a better fit to the data, and returned the FWHM, its error, and the values of the fit. There are likely much better ways of achieving this, but the final fit is what I needed.
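For readers who would rather stay in R: since this is really a curve fit to binned counts rather than a maximum-likelihood fit to a sample, nls() can play the same role as lmfit. A minimal sketch using the original quadratic-background dFF and the same starting values (convergence is sensitive to these):

x <- seq_along(data)
cfit <- nls(data ~ dFF(x, a, b, c, A, gamma, pos),
            start = list(a = 0, b = 0.3, c = -0.0004, A = 70000, gamma = 13, pos = 331))
2 * coef(cfit)["gamma"]  # for this parameterisation gamma is the HWHM, so FWHM = 2*gamma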
[Final fit: red points are the raw data, the black line is the fitted distribution.]
I wanted to use a user-defined kernel function for ksvm in R, so as practice I tried to write a vanilladot kernel and compare it with the "vanilladot" built into kernlab.
I wrote my kernel as follows.
#
###vanilla kernel with class "kernel"
#
kfunction.k <- function(){
  k <- function(x, y){ crossprod(x, y) }
  class(k) <- "kernel"
  k
}
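(x and y are not shown in the question; for a self-contained run, a toy two-class problem along these lines will do. This construction is mine:)

library(kernlab)
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = -1), ncol = 2),
           matrix(rnorm(100, mean =  1), ncol = 2))
y <- rep(c(-1, 1), each = 50)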
l<-0.1 ; C<-1/(2*l)
###use kfunction.k
tmp<-ksvm(x,factor(y),scaled=FALSE, type = "C-svc", kernel=kfunction.k(), C = C)
alpha(tmp)[[1]]
ind<-alphaindex(tmp)[[1]]
x.s<-x[ind,] ; y.s<-y[ind]
w.class.k<-t(alpha(tmp)[[1]]*y.s)%*%x.s
w.class.k
I thought the result of this operation would be equal to that of the following.
However, it isn't.
#
###use "vanilladot"
#
l<-0.1 ; C<-1/(2*l)
tmp1<-ksvm(x,factor(y),scaled=FALSE, type = "C-svc", kernel="vanilladot", C = C)
alpha(tmp1)[[1]]
ind1<-alphaindex(tmp1)[[1]]
x.s<-x[ind1,] ; y.s<-y[ind1]
w.tmp1<-t(alpha(tmp1)[[1]]*y.s)%*%x.s
w.tmp1
I think this problem is related to the kernel class.
When the class is set to "kernel", the problem occurs.
However, when the class is set to "vanillakernel", the result of ksvm using the user-defined kernel is equal to that of ksvm using the "vanilladot" built into kernlab.
#
###vanilla kernel with class "vanillakernel"
#
kfunction.v.k <- function(){
  k <- function(x, y){ crossprod(x, y) }
  class(k) <- "vanillakernel"
  k
}
# The only difference between kfunction.k and kfunction.v.k is "class(k)".
l<-0.1 ; C<-1/(2*l)
###use kfunction.v.k
tmp<-ksvm(x,factor(y),scaled=FALSE, type = "C-svc", kernel=kfunction.v.k(), C = C)
alpha(tmp)[[1]]
ind<-alphaindex(tmp)[[1]]
x.s<-x[ind,] ; y.s<-y[ind]
w.class.v.k<-t(alpha(tmp)[[1]]*y.s)%*%x.s
w.class.v.k
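Putting the three weight vectors side by side makes the pattern clear:

rbind(w.class.k, w.class.v.k, w.tmp1)
# w.class.v.k matches w.tmp1, while w.class.k differs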
I don't understand why the result differs from "vanilladot" when the class is set to "kernel".
Is there an error in my operation?
First, it seems like a really good question!
Now to the point. In the source of ksvm we can find where the line is drawn between using a user-defined kernel and the built-ins:
if (type(ret) == "spoc-svc") {
if (!is.null(class.weights))
weightedC <- class.weights[weightlabels] * rep(C,
nclass(ret))
else weightedC <- rep(C, nclass(ret))
yd <- sort(y, method = "quick", index.return = TRUE)
xd <- matrix(x[yd$ix, ], nrow = dim(x)[1])
count <- 0
if (ktype == 4)
K <- kernelMatrix(kernel, x)
resv <- .Call("tron_optim", as.double(t(xd)), as.integer(nrow(xd)),
as.integer(ncol(xd)), as.double(rep(yd$x - 1,
2)), as.double(K), as.integer(if (sparse) xd@ia else 0),
as.integer(if (sparse) xd@ja else 0), as.integer(sparse),
as.integer(nclass(ret)), as.integer(count), as.integer(ktype),
as.integer(7), as.double(C), as.double(epsilon),
as.double(sigma), as.integer(degree), as.double(offset),
as.double(C), as.double(2), as.integer(0), as.double(0),
as.integer(0), as.double(weightedC), as.double(cache),
as.double(tol), as.integer(10), as.integer(shrinking),
PACKAGE = "kernlab")
reind <- sort(yd$ix, method = "quick", index.return = TRUE)$ix
alpha(ret) <- t(matrix(resv[-(nclass(ret) * nrow(xd) +
1)], nclass(ret)))[reind, , drop = FALSE]
coef(ret) <- lapply(1:nclass(ret), function(x) alpha(ret)[,
x][alpha(ret)[, x] != 0])
names(coef(ret)) <- lev(ret)
alphaindex(ret) <- lapply(sort(unique(y)), function(x)
which(alpha(ret)[,
x] != 0))
xmatrix(ret) <- x
obj(ret) <- resv[(nclass(ret) * nrow(xd) + 1)]
names(alphaindex(ret)) <- lev(ret)
svindex <- which(rowSums(alpha(ret) != 0) != 0)
b(ret) <- 0
param(ret)$C <- C
}
The important parts are two things. First, if we provide ksvm with our own kernel, then ktype = 4 (while for vanillakernel, ktype = 0), and this makes two changes (see the class-inspection snippet below):
in the case of a user-defined kernel, the kernel matrix is computed instead of actually using the kernel
the tron_optim routine is run with the information regarding the kernel
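The class driving this dispatch is easy to inspect; kernlab decides which path to take from the class of the kernel object it receives (a quick check using the functions from the question):

class(kfunction.k())  # "kernel"        -> user-defined path, ktype = 4
class(vanilladot())   # "vanillakernel" -> built-in linear path, ktype = 0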
Now, in svm.cpp we can find the tron routines, and in tron_run (called from tron_optim) we see that the LINEAR kernel has a separate optimization routine:
if (param->kernel_type == LINEAR)
{
    /* lots of code here */
    while (Cpj < Cp)
    {
        totaliter += s.Solve(l, prob->x, minus_ones, y, alpha, w,
                             Cpj, Cnj, param->eps, sii, param->shrinking,
                             param->qpsize);
        /* lots of code here */
    }
    totaliter += s.Solve(l, prob->x, minus_ones, y, alpha, w, Cp, Cn,
                         param->eps, sii, param->shrinking, param->qpsize);
    delete[] w;
}
else
{
    Solver_B s;
    s.Solve(l, BSVC_Q(*prob,*param,y), minus_ones, y, alpha, Cp, Cn,
            param->eps, sii, param->shrinking, param->qpsize);
}
As you can see, the linear case is treated in a more complex, more detailed way. There is an inner optimization loop calling the solver many times. It would require a really deep analysis of the actual optimization being performed here, but at this point one can answer your question as follows:
There is no error in your operation
kernlab's SVM has a separate routine for training with the linear kernel, selected based on the type of kernel passed to the code; changing "kernel" to "vanillakernel" made ksvm think it was actually working with vanillakernel, and so it performed this separate optimization routine
It does not seem to be a bug, in fact, as the linear SVM is genuinely very different from the kernelized version in terms of efficient optimization techniques. The number of heuristics and numerical issues that have to be taken care of is really big. As a result, some approximations are required and can lead to different results. While for a rich feature space (like those induced by the RBF kernel) it should not really matter, for simple kernels like linear ones these simplifications can lead to significant output changes.
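If you want to see how far apart the two routines actually land, comparing decision values is a quick check (a sketch; it reuses x, y, C, and the kernels from the question):

tmp.k <- ksvm(x, factor(y), scaled = FALSE, type = "C-svc", kernel = kfunction.k(), C = C)
f1 <- predict(tmp.k, x, type = "decision")  # generic "kernel" path (ktype = 4)
f2 <- predict(tmp1, x, type = "decision")   # built-in vanilladot path (ktype = 0)
max(abs(as.vector(f1) - as.vector(f2)))     # quantifies the gap between the two solutions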