Error in declaring data in rstan for graded response model - r
I am trying to use Stan, specifically through rstan, to fit a graded response model. Luo and Jiao (2018), available here, provide Stan code for doing so. Here is their code, edited only to include more white space:
data{
int<lower=2, upper=4> K; //number of categories
int <lower=0> n_student;
int <lower=0> n_item;
int<lower=1,upper=K> Y[n_student,n_item];
}
parameters {
vector[n_student] theta;
real<lower=0> alpha [n_item];
ordered[K-1] kappa[n_item]; //category difficulty
real mu_kappa; //mean of the prior distribution of category difficulty
real<lower=0> sigma_kappa; //sd of the prior distribution of category difficulty
}
model{
alpha ~ cauchy(0,5);
theta ~ normal(0,1);
for (i in 1: n_item){
for (k in 1:(K-1)){
kappa[i,k] ~ normal(mu_kappa,sigma_kappa);
}}
mu_kappa ~ normal(0,5);
sigma_kappa ~ cauchy(0,5);
for (i in 1:n_student){
for (j in 1:n_item){
Y[i,j] ~ ordered_logistic(theta[i]*alpha[j],kappa[j]);
}}
}
generated quantities {
vector[n_item] log_lik[n_student];
for (i in 1: n_student){
for (j in 1: n_item){
log_lik[i, j] = ordered_logistic_log (Y[i, j],theta[i]*alpha[j],kappa[j]);
}}
}
However, when I try to use this code, the parser throws an error. Here is the R code to reproduce the error:
library("rstan")
n <- 100
m <- 10
K <- 4
example_responses <- sample(x = 1:4, size = n * m, replace = TRUE)
example_responses <- matrix(example_responses, nrow = n, ncol = m)
example_dat <- list(K = K,
n_student = n,
n_item = m,
Y = example_responses)
fit <- stan(file = "~/grm.stan", data = example_dat)
Here is the error I receive:
SYNTAX ERROR, MESSAGE(S) FROM PARSER:
error in 'modelf6471b3f018_grm' at line 2, column 21
-------------------------------------------------
2:
3: data {
4: int<lower=2, upper=4> K; // number of categories
^
5: int<lower=0> n_student;
-------------------------------------------------
PARSER EXPECTED: <one of the following:
a variable declaration, beginning with type,
(int, real, vector, row_vector, matrix, unit_vector,
simplex, ordered, positive_ordered,
corr_matrix, cov_matrix,
cholesky_corr, cholesky_cov
or '}' to close variable declarations>
Error in stanc(file = file, model_code = model_code, model_name = model_name, :
failed to parse Stan model 'grm' due to the above error.
I've tried going through the code and the Stan manual to see what the issue is with the data declaration, but I can't find a problem with it. The declaration appears to be very similar to a declaration example in the Stan Language Reference:
int<lower = 1> N;
Can anyone tell me what I'm missing?
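Your code contains non-standard characters in some of the whitespace (non-breaking spaces are a common example), including right after K; on the first declaration in the data block, which is where the parser error points. Retyping that whitespace with ordinary spaces should resolve the error.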
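One way to locate such characters is to scan the model file for anything outside the ASCII range. The following is a minimal sketch in base R. The Stan fragment and the helper name to_plain_spaces are made up for illustration, and it assumes the offending character is a non-breaking space (U+00A0); for a real file, stan_lines would come from readLines("~/grm.stan", encoding = "UTF-8").

```r
# Illustrative Stan fragment with a non-breaking space (U+00A0) right after "K;".
stan_lines <- c(
  "data {",
  "  int<lower=2, upper=4> K;\u00a0// number of categories",
  "  int<lower=0> n_student;",
  "}"
)

# For each line, the column positions of characters above the ASCII range.
non_ascii_cols <- lapply(stan_lines, function(s) which(utf8ToInt(s) > 127L))
bad_lines <- which(lengths(non_ascii_cols) > 0)

for (i in bad_lines) {
  cat(sprintf("line %d, column(s) %s\n",
              i, paste(non_ascii_cols[[i]], collapse = ", ")))
}

# Hypothetical helper: replace non-breaking spaces (code point 160) with plain spaces.
to_plain_spaces <- function(s) {
  cp <- utf8ToInt(s)
  cp[cp == 160L] <- 32L
  intToUtf8(cp)
}

cleaned <- vapply(stan_lines, to_plain_spaces, character(1), USE.NAMES = FALSE)
```

The cleaned lines could then be written back with writeLines(). The tools package also ships showNonASCIIfile(), which prints the offending lines of a file directly.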
Multiple forcings in a multi-patch ode model - R package desolve and compiled C code
I am trying to create an SEIR model with multiple patches using the deSolve package in R. At each time step, individuals move between patches and can infect individuals in other patches. I also have an external forcing parameter that is specific to each patch (representing different environmental conditions). I have this working in base R, but given the number of patches and compartments and the duration of the model, I am trying to convert it to compiled code to speed it up.
I have the different patches working, but I am struggling with how to incorporate a different forcing parameter for each patch. When forcings are provided, an automatic check, checkforcings (https://rdrr.io/cran/deSolve/src/R/forcings.R), does not allow a matrix with more than two columns, and I am not sure what the best workaround is. Should I write my own ode and checkforcings functions to override this? Restructure the forcings data once it gets into C? My final model has 195 patches, so I would prefer to be able to automate this rather than write out thousands of equations or hundreds of functions.
It is also fine if the answer is just "do this in a different language", but I would appreciate insight into which language to switch to. Julia, maybe? Below is code for a very simple example that highlights this "different forcings in different patches" problem.
R Code

# Packages #########################################################
library(deSolve)
library(ggplot2); theme_set(theme_bw())
library(tidyr)
library(dplyr)

# Initial Parameters and things ####################################
times <- 1:500
n_patch <- 2
patch_ind <- 100
state_names <- (c("S", "I"))
n_state <- length(state_names)

x <-rep(0, n_patch*n_state)
names(x) <- unlist(lapply(state_names, function(x) paste(x, stringr::str_pad(seq(n_patch), width = 3, side = "left", pad =0), sep = "_")))

#start with infected individuals in patch 1
x[startsWith(names(x), "S")] <- patch_ind
x['S_001'] <- x['S_001'] - 5
x['I_001'] <- x['I_001'] + 5
x['I_002'] <- x['I_002'] + 20

params <- c(gamma = 0.1, betam = 0.2)

#seasonality
forcing <- data.frame(times = times, rain = rep(rep(c(0.95,1.05), each = 50), 5))
new_approx_fun <- function(rain.column, t){
  approx_col <- approxfun(rain.column, rule = 2)
  return(approx_col(t))
}
rainfall2 <- data.frame(P1 = forcing$rain, P2 = forcing$rain+0.01)

# model in R
r.mod2 <- function(t,x,params){
  # turn state.vec into matrix
  # columns are different states, rows are different patches
  states <- matrix(x, nrow = n_patch, ncol = n_state, byrow = F)
  S <- states[,1]
  I <- states[,2]
  N <- rowSums(states[,1:2])
  with(as.list(params),{
    #seasonal forcing
    rain <- as.numeric(apply(as.matrix(rainfall2), MARGIN = 2, FUN = new_approx_fun, t = t))
    dS <- gamma*I - rain*betam*S*I/N
    dI <- rain*betam*S*I/N - gamma*I
    return(list(c(dS, dI), rain))
  })
}

out.R2 <- data.frame(ode(y = x, times =times, func = r.mod2, parms = params))

#create seasonality for C
ftime <- seq(0, max(times), by = 0.1)
rain.ft <- approx(times, rainfall2$P1, xout = ftime, rule = 2)$y
forcings2 <- cbind(ftime, rain.ft, rain.ft +0.01)

# C model
system("R CMD SHLIB ex-patch-season-multi.c")
dyn.load(paste("ex-patch-season-multi", .Platform$dynlib.ext, sep = ""))
out.dll <- data.frame(ode(y = x, times = times, func = "derivsc",
                          dllname = "ex-patch-season-multi", initfunc = "parmsc",
                          parms = params, forcings = forcings2, initforc = "forcc",
                          nout = 1, outnames = "rain"))

C code

#include <R.h>
#include <math.h>
#include <Rmath.h>
// this is for testing to try and get different forcing for each patch //

/*define parameters, pay attention to order */
static double parms[2];
static double forc[1];

#define gamma parms[0]
#define betam parms[1]
//define forcing
#define rain forc[0]

/* initialize parameters */
void parmsc(void (* odeparms)(int *, double *)){
  int N=2;
  odeparms(&N, parms);
}

/* forcing */
void forcc(void (* odeforcs)(int *, double *))
{
  int N=1;
  odeforcs(&N, forc);
}

/* model function */
void derivsc(int *neq, double *t, double *y, double *ydot, double *yout, int *ip){
  //use for-loops for patches
  //define all variables at start of block
  int npatch=2;
  double S[npatch];
  double I[npatch];
  double N[npatch];
  int i;
  for(i=0; i<npatch; i++){
    S[i] = y[i];
  };
  for(i=0; i <npatch; i++){
    int ind = npatch+i;
    I[i] = y[ind];
  };
  for(i=0; i<npatch; i++){
    N[i] = S[i] + I[i];
  };
  //use for loops for equations
  {
    // Susceptible
    for(i=0; i<npatch; i++){
      ydot[i] = gamma*I[i] - rain*betam*I[i]*S[i]/N[i] ;
    };
    //infected
    for(i=0; i<npatch; i++){
      int ind=npatch+i;
      ydot[ind] = rain*betam*I[i]*S[i]/N[i] - gamma*I[i];
    };
  };
  yout[0] = rain;
}
The standard way to supply multiple forcings to compiled code in the deSolve package is described on the lsoda help page:
forcings: only used if ‘dllname’ is specified: a list with the forcing function data sets, each present as a two-columned matrix
Such a list can be created automatically in a script. Other approaches are possible with some creative C or Fortran programming.
For more complex models, I would recommend the rodeo package. It lets you specify dynamic models in tabular form (CSV, LibreOffice, Excel), including parameters and forcing functions. The package's code generator then produces fast Fortran code that can be solved with deSolve. An overview can be found in a paper by Kneis et al. (2017), https://doi.org/10.1016/j.envsoft.2017.06.036, and a more extended tutorial at https://dkneis.github.io/.
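As a rough illustration of building such a list in a script, the sketch below creates one two-column (time, value) matrix per patch with lapply. The patch count, the per-patch offset and the names rain_001, rain_002, ... are assumptions made for the example, not part of the question's model.

```r
# Sketch: one two-column (time, value) matrix per patch, collected in a list.
n_patch <- 3                          # hypothetical; the question's full model has 195
times   <- 1:500
ftime   <- seq(0, max(times), by = 0.1)

# Shared seasonal signal as in the question: alternating blocks of 0.95 and 1.05.
rain      <- rep(rep(c(0.95, 1.05), each = 50), 5)
rain_base <- approx(times, rain, xout = ftime, rule = 2)$y

# Hypothetical per-patch offset so that every patch gets its own series.
patch_offset <- 0.01 * (seq_len(n_patch) - 1)

forcings_list <- lapply(seq_len(n_patch), function(i) {
  cbind(time = ftime, rain = rain_base + patch_offset[i])
})
names(forcings_list) <- sprintf("rain_%03d", seq_len(n_patch))
```

The list would then be passed as forcings = forcings_list in the ode() call. On the C side, the forc array and the count N set in the forcing initializer would presumably need to match the length of the list, with forc[i] holding the forcing for patch i.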
Error in Stan Code when variable is clearly defined
I am getting the following error in my Stan code:
SYNTAX ERROR, MESSAGE(S) FROM PARSER:
No matches for:
  gpareto_lcdf(real, real, real)
Available argument signatures for gpareto_lcdf:
  gpareto_lcdf(vector, real, real)
error in 'modelafda6ff99d79_gpd' at line 54, column 50
-------------------------------------------------
52: for (i in 1:n) {
53: if (censored[i]) {
54: target += gpareto_lcdf(value[i] | k, sigma);
                                                 ^
55: } else {
-------------------------------------------------
Error in stanc(file = file, model_code = model_code, model_name = model_name, :
failed to parse Stan model 'gpd' due to the above error.
In my RStudio version, it seems to be complaining about the sigma parameter and not being able to find a match for it. I don't understand why this is an issue given that sigma is defined in my gpareto_lcdf. Here is the code that I am using:
functions {
  real gpareto_lpdf(vector y, real k, real sigma) {
    // generalised Pareto log pdf
    int N = rows(y);
    real inv_k = inv(k);
    if (k<0 && max(y)/sigma > -inv_k)
      reject("k<0 and max(y)/sigma > -1/k; found k, sigma =", k, sigma)
    if (sigma<=0)
      reject("sigma<=0; found sigma =", sigma)
    if (fabs(k) > 1e-15)
      return -(1+inv_k)*sum(log1p((y) * (k/sigma))) -N*log(sigma);
    else
      return -sum(y)/sigma -N*log(sigma); // limit k->0
  }
  real gpareto_lcdf(vector y, real k, real sigma) {
    // generalised Pareto log cdf
    real inv_k = inv(k);
    if (k<0 && max(y)/sigma > -inv_k)
      reject("k<0 and max(y)/sigma > -1/k; found k, sigma =", k, sigma)
    if (sigma<=0)
      reject("sigma<=0; found sigma =", sigma)
    if (fabs(k) > 1e-15)
      return sum(log1m_exp((-inv_k)*(log1p((y) * (k/sigma)))));
    else
      return sum(log1m_exp(-(y)/sigma)); // limit k->0
  }
}
data {
  // the input data
  int<lower = 1> n;
  real<lower = 0> value[n];
  int<lower = 0, upper = 1> censored[n];
  // parameters for the prior
  real<lower = 0> a;
  real<lower = 0> b;
}
parameters {
  real k;
  real sigma;
}
model {
  // prior
  k ~ gamma(a, b);
  sigma ~ gamma(a,b);
  // likelihood
  for (i in 1:n) {
    if (censored[i]) {
      target += gpareto_lcdf(value[i] | k, sigma);
    } else {
      target += gpareto_lpdf(value[i] | k, sigma);
    }
  }
}
Clearly sigma is defined in the gpareto_lcdf and so I am unsure why Stan is complaining about this.
Your code in the likelihood section of the model block doesn't match the way you have defined the gpareto...() functions in the functions block. The gpareto functions take a vector as the first argument, but you are looping through and trying to pass a single element of value each time. That's why you get the error that the data types you are passing to gpareto_lcdf() do not match the "signature" of the function. The function expects the first argument to be a vector, the second to be a real, and the third to be a real. But you are passing three reals.
The error has nothing to do with sigma. The ^ symbol is pointing to the entire function call to gpareto_lcdf() and just happens to be pointing near where the word sigma is, but the error isn't related to sigma.
To fix this error, you would need to do one of the following:
1. Redefine the gpareto() functions to take three real arguments and keep your loop in the model block as is.
2. Rewrite your model block to not use a loop and instead be vectorized.
I'm not sure the vectorization will work with the condition you have in the model block, so you may be forced to go with the first solution. I would recommend posting this question on the Stan forum, where you may get a better answer.
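To make the first option concrete, here is a sketch in R rather than Stan of what per-observation (scalar) versions of the two functions compute, and how a loop over censored and uncensored values would add them up. The names gpareto_lpdf1, gpareto_lcdf1 and loglik are hypothetical, and the formulas simply mirror the ones in the question's functions block for a single y. A Stan version would change the first argument's type from vector to real in the same spirit.

```r
# Scalar generalised Pareto log density: the question's gpareto_lpdf for one y.
gpareto_lpdf1 <- function(y, k, sigma) {
  stopifnot(length(y) == 1, sigma > 0)
  if (abs(k) > 1e-15) {
    -(1 + 1 / k) * log1p(y * k / sigma) - log(sigma)
  } else {
    -y / sigma - log(sigma)                         # limit k -> 0
  }
}

# Scalar generalised Pareto log CDF; log(-expm1(z)) equals log(1 - exp(z)) for z < 0.
gpareto_lcdf1 <- function(y, k, sigma) {
  stopifnot(length(y) == 1, sigma > 0)
  if (abs(k) > 1e-15) {
    log(-expm1(-(1 / k) * log1p(y * k / sigma)))
  } else {
    log(-expm1(-y / sigma))                         # limit k -> 0
  }
}

# Loop over observations, as in the question's model block.
loglik <- function(value, censored, k, sigma) {
  total <- 0
  for (i in seq_along(value)) {
    total <- total + if (censored[i] == 1) {
      gpareto_lcdf1(value[i], k, sigma)
    } else {
      gpareto_lpdf1(value[i], k, sigma)
    }
  }
  total
}

ll <- loglik(value = c(1, 1), censored = c(1, 0), k = 0.5, sigma = 2)
```

Because each call now receives a single number, the loop and the function signatures agree, which is the mismatch the parser was reporting.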
RStan Error: “Exception: variable does not exist… failed to create the sampler; sampling not done”
Hey everyone, I'm having trouble fitting a Stan model using RStan. I feel like there's an issue with how I've formatted my data. My data was originally in a data.frame, as it was generated using a mechanistic model. I re-formatted my data as a list, but I'm new to this, so there's a fair chance I messed up. Here's the story so far (P.S. I'm sorry if this post isn't formatted properly!):
library(rstan)
library(gdata)
library(bayesplot)

write("// Stan Model; code the actual model in Stan
functions{
  real[] dZ_dt(real t, // Time
               real[] Z, // System state {Parasite, Host}
               real[] alpha, // Parameters
               real[] x_r, // Unused data (?)
               int[] x_i){
    real P = Z[1]; // System state coded as an array, such that Z = (P,H)
    real H = Z[2];
    real r = alpha[1]; // Parameters of the system, in order they appear
    real O = alpha[2];
    real h = alpha[3];
    real b = alpha[4];
    real c = alpha[5];
    real u = alpha[6];
    real dP_dt = P*r - H*(O*P/1 + O*P*h); // Deterministic mechanistic model
    real dH_dt = b + H*(c*(O*P/1 + O*P*h)-u);
    return{dP_dt,dH_dt}; // Return the system state
  }
}
data{
  int<lower=0>N; // Define N as non-negative integer
  real ts[N]; // Assigns time points to N (I think?)
  real y_init[2];
  real<lower=0>y[N,2]; // Define y as real and non-negative
}
parameters{
  real<lower=0>alpha[6]; // Make all items in alpha non-neg
  real<lower=0>Z_init[2]; // Initial population size non-neg
  real<lower=0>sigma[2]; // Error term non-negative
}
transformed parameters{
  real Z[N,2] = integrate_ode_rk45(dZ_dt,Z_init,0,ts,alpha,rep_array(0.0,0),rep_array(0,0),1e-6,1e-5,2000);
}
model{
  alpha[{1}]~uniform(0,10);
  alpha[{2}]~uniform(0,1);
  alpha[{3}]~uniform(0,60);
  alpha[{4}]~uniform(0,100);
  alpha[{5}]~uniform(0,1);
  alpha[{6}]~uniform(0,1);
  sigma~lognormal(-1,1);
  Z_init~lognormal(log(10),1);
  for (k in 1:2){
    y_init[k]~lognormal(log(Z_init[k]),sigma[k]);
    y[ ,k]~lognormal(log(Z[ ,k]),sigma[k]);
  }
}",
"Stan_Model_TypeII.stan")

stanc("Stan_Model_TypeII.stan") # To check that we wrote a file (I think we did?)
Stan_Model_TypeII <- stan_model("Stan_Model_TypeII.stan")

# Squeezing the data into a form that Stan gets
N <- length(Stoch_Data_TypeII$t)-1 # N is 1952 which makes sense bc length of DF is 1953
ts <- 1:N
y_init <- c(Stoch_Data_TypeII$P[1], Stoch_Data_TypeII$H[1]) # Initial states, P = 1; H = 18
y <- as.matrix(Stoch_Data_TypeII[2:(N+1),2:3])
y <- cbind(y[,2],y[,1]); # This worked, sick; where y[,1] is H, and y[,2] is P
Stan_StochData_TypeII <- list(N,ts,y_init,y)

# Fitting the data to the model
fit <- stan(file = "Stan_Model_TypeII.stan",
            data = Stan_StochData_TypeII,
            warmup = 500, iter = 1000, chains = 2, cores = 1, thin = 1,
            algorithm = "HMC",
            diagnostic_file = "TypeII_Fitting_Output.R",
            seed = 1996, verbose = TRUE)
And here's the progress made on the model:
TRANSLATING MODEL 'Stan_Model_TypeII' FROM Stan CODE TO C++ CODE NOW.
successful in parsing the Stan model 'Stan_Model_TypeII'.
CHECKING DATA AND PREPROCESSING FOR MODEL 'Stan_Model_TypeII' NOW.
COMPILING MODEL 'Stan_Model_TypeII' NOW.
STARTING SAMPLER FOR MODEL 'Stan_Model_TypeII' NOW.
And here's what the error code reads:
Error in new_CppObject_xp(fields$.module, fields$.pointer, ...) :
Exception: variable does not exist; processing stage=data initialization; variable name=N; base type=int (in 'model1b35f2189f6_Stan_Model_TypeII' at line 25)
failed to create the sampler; sampling not done
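The error names the variable N at the data-initialization stage, which is consistent with rstan not finding an element called N in the data list. rstan matches list elements to the data block by name, and list(N, ts, y_init, y) creates an unnamed list. The sketch below shows the difference. The small data frame is a hypothetical stand-in for Stoch_Data_TypeII with placeholder values, and selecting the columns as c("P", "H") is an assumption about the intended order.

```r
# Hypothetical stand-in for Stoch_Data_TypeII (columns t, P, H); placeholder values only.
Stoch_Data_TypeII <- data.frame(t = 0:5,
                                P = c(1, 2, 3, 4, 5, 6),
                                H = c(18, 17, 16, 15, 14, 13))

N      <- length(Stoch_Data_TypeII$t) - 1
ts     <- 1:N
y_init <- c(Stoch_Data_TypeII$P[1], Stoch_Data_TypeII$H[1])
y      <- unname(as.matrix(Stoch_Data_TypeII[2:(N + 1), c("P", "H")]))

# Unnamed list, as in the question: names() is NULL, so no element is called "N".
unnamed <- list(N, ts, y_init, y)

# Named list: each element carries the name used in the Stan data block.
Stan_StochData_TypeII <- list(N = N, ts = ts, y_init = y_init, y = y)
```

Passing the named list as data = Stan_StochData_TypeII would let the sampler find N, ts, y_init and y. Whether the model then samples well is a separate matter.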
R fit user defined distribution
I am trying to fit my own distribution to my data, find the optimum parameters of the distribution to match the data and ultimately find the FWHM of the peak in the distribution. From what I've read, the package fitdistrplus is the way to do this. I know the data takes the shape of a lorentzian peak on a quadratic background. plot of the data: plot of raw data The raw data used: data = c(0,2,5,4,5,4,3,3,2,2,0,4,4,2,5,5,3,3,4,4,4,3,3,5,5,6,6,8,4,0,6,5,7,5,6,3,2,1,7,0,7,9,5,7,5,3,5,5,4,1,4,8,10,2,5,8,7,14,7,5,8,4,2,2,6,5,4,6,5,7,5,4,8,5,4,8,11,9,4,8,11,7,8,6,9,5,8,9,10,8,4,5,8,10,9,12,10,10,5,5,9,9,11,19,17,9,17,10,17,18,11,14,15,12,11,14,12,10,10,8,7,13,14,17,18,16,13,16,14,17,20,15,12,15,16,18,24,23,20,17,21,20,20,23,20,15,20,28,27,26,20,17,19,27,21,28,32,29,20,19,24,19,19,22,27,28,23,37,41,42,34,37,29,28,28,27,38,32,37,33,23,29,55,51,41,50,44,46,53,63,49,50,47,54,54,43,45,58,54,55,67,52,57,67,69,62,62,65,56,72,75,88,87,77,70,71,84,85,81,84,75,78,80,82,107,102,98,82,93,98,90,94,118,107,113,103,99,103,96,108,114,136,126,126,124,130,126,113,120,107,107,106,107,136,143,135,151,132,117,118,108,120,145,140,122,135,153,157,133,130,128,109,106,122,133,132,150,156,158,150,137,147,150,146,144,144,149,171,185,200,194,204,211,229,225,235,228,246,249,238,214,228,250,275,311,323,327,341,368,381,395,449,474,505,529,585,638,720,794,896,919,1008,1053,1156,1134,1174,1191,1202,1178,1236,1200,1130,1094,1081,1009,949,890,810,760,690,631,592,561,515,501,489,467,439,388,377,348,345,310,298,279,253,257,259,247,237,223,227,217,210,213,197,197,192,195,198,201,202,211,193,203,198,202,174,164,162,173,170,184,170,168,175,170,170,168,162,149,139,145,151,144,152,155,170,156,149,147,158,171,163,146,151,150,147,137,123,127,136,149,147,124,137,133,129,130,128,139,137,147,141,123,112,136,147,126,117,116,100,110,120,105,91,100,100,105,92,88,78,95,75,75,82,82,80,83,83,66,73,80,76,69,81,93,79,71,80,90,72,72,63,57,53,62,65,49,51,57,73,54,56,78,65,52,58,49,47,56,46,43,50,43,40,39,36,45,28,35,36,43,48,37,
36,35,39,31,24,29,37,26,22,36,33,24,31,31,20,30,28,23,21,27,26,29,21,20,22,18,19,19,20,21,20,25,18,12,18,20,20,13,14,21,20,16,18,12,17,20,24,21,20,18,11,17,12,5,11,13,16,13,13,12,12,9,15,13,15,11,12,11,8,13,16,16,16,14,8,8,10,11,11,17,15,15,9,9,13,12,3,11,14,11,14,13,8,7,7,15,12,8,12,14,9,5,2,10,8)
I have calculated the equations which define the distribution and cumulative distribution:
dFF <- function(x,a,b,c,A,gamma,pos) a + b*x + (c*x^2) + ((A/pi)*(gamma/(((x-pos)^2) + (gamma^2))))
pFF <- function(x,a,b,c,A,gamma,pos) a*x + (b/2)*(x^2) + (c/3)*(x^3) + A/2 + (A/pi)*(atan((x - pos)/gamma))
I believe these to be correct. From what I understand, a distribution fit should be possible using just these definitions using the fitdist (or mledist) method:
fitdist(data,'FF', start = list(0,0.3,-0.0004,70000,13,331))
mledist(data,'FF', start = list(0,0.3,-0.0004,70000,13,331))
In the first case this returns the statement
function cannot be evaluated at initial parameters>
Error in fitdist(data, "FF", start = list(0, 0.3, -4e-04, 70000, 13, 331)):the function mle failed to estimate the parameters, with the error code 100
and in the second I just get a list of 'NA' values for the estimates.
I then calculated a function to give the quantile distribution values to use the other fitting methods (qmefit):
qFF <- function(p,a,b,c,A,gamma,pos) {
  qList = c()
  axis = seq(1,600,1)
  aF = dFF(axis,a,b,c,A,gamma,pos)
  arr = histogramCpp(aF) # change data to a histogram format
  for(element in 1:length(p)){
    q = quantile(arr,p[element], names=FALSE)
    qList = c(qList,q)
  }
  return(qList)
}
Part of this code requires calling the c++ function (by using the library Rcpp):
#include <Rcpp.h>
#include <vector>
#include <math.h>
using namespace Rcpp;

// [[Rcpp::export]]
std::vector<int> histogramCpp(NumericVector x) {
  std::vector<int> arr;
  double number, fractpart, intpart;
  for(int i = 0; i <= 600; i++){
    number = (x[i]);
    fractpart = modf(number , &intpart);
    if(fractpart < 0.5){
      number = (int) intpart;
    }
    if(fractpart >= 0.5){
      number = (int) (intpart+1);
    }
    for(int j = 1; j <= number; j++){
      arr.push_back(i);
    }
  }
  return arr;
}
This c++ method just turns the data into a histogram format. If the first element of the vector describing the data is 4 then '1' is added 4 times to the returned vector, etc. This also seems to work, as sensible values are returned.
[Figure: plot of the quantile function; quantiles returned for probabilities from 0 to 1 in steps of 0.001]
The 'qmefit' method can then be attempted through the fitdist function:
fitdist(data,'FF', start = list(0,0.3,-0.0004,70000,13,331), method = 'qme', probs = c(0,0.3,0.4,0.5,0.7,0.9))
I chose the 'probs' values randomly as I don't fully understand their meaning. This either straight-up crashes the R session or, after a brief stuttering, returns a list of 'NA' values as estimates and the line
<std::bad_alloc : std::bad_alloc>
I am not sure if I am making a basic mistake here and any help or recommendations are appreciated.
In the end I managed to find a work-around for this using the rPython package and lmfit from python. It solved my issue and might be useful for others with the same issue. The R-code was as follows:
library(rPython)
python.load("pyFit.py")
python.assign("row",pos)
python.assign("vals",vals)
python.exec("FWHM,ERROR,FIT = fitDist(row,vals)")
FWHM = python.get("FWHM")
ERROR = python.get("ERROR")
cFIT = python.get("FIT")
and the called python code was:
from lmfit import Model, minimize, Parameters, fit_report
from sklearn import mixture
import numpy as np
import matplotlib.pyplot as plt
import math

def cauchyDist(x,a,b,c,d,e,f,g,A,gamma,pos):
    return a + b*x + c*pow(x,2) + d*pow(x,3) + e*pow(x,4) + f*pow(x,5) + g*pow(x,6) + (A/np.pi)*(gamma/((pow((x-pos),2)) + (pow(gamma,2))))

def fitDist(row, vals):
    gmod = Model(cauchyDist)
    x = np.arange(0,600)
    result = gmod.fit(vals, x=x, a = 0, b = 0.3, c = -0.0004, d = 0, e = 0, f= 0, g = 0, A = 70000, gamma = 13, pos = row)
    newFile = open('fitData.txt', 'w')
    newFile.write(result.fit_report())
    newFile.close()
    with open('fitData.txt', 'r') as inF:
        for line in inF:
            if 'gamma:' in line:
                j = line.split()
    inF.close()
    FWHM = float(j[1])
    error = float(j[3])
    fit = result.best_fit
    fit = fit.tolist()
    return FWHM, error, fit
I increased the order of the polynomial to obtain a better fit for the data and returned the FWHM, its error and the values for the fit. There are likely much better ways of achieving this, but the final fit is as I needed.
[Figure: final fit. Red data points are raw data, the black line is the fitted distribution.]
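One detail may be worth checking when reading the width off the fit. For the Lorentzian term as written, (A/pi)*gamma/((x-pos)^2 + gamma^2), the peak height is A/(pi*gamma) and the curve falls to half that height at x = pos - gamma and x = pos + gamma. Under this parameterization gamma is therefore the half width at half maximum, and the full width would be 2*gamma. The short R check below illustrates this for the peak term alone, ignoring the polynomial background. The numbers are the starting values from the question, used only as an example.

```r
# Lorentzian peak term only (no polynomial background).
lorentz <- function(x, A, gamma, pos) (A / pi) * (gamma / ((x - pos)^2 + gamma^2))

# Starting values from the question, used here purely as an example.
A <- 70000; gamma <- 13; pos <- 331

peak       <- lorentz(pos, A, gamma, pos)          # height at the centre
half_left  <- lorentz(pos - gamma, A, gamma, pos)  # height one gamma to the left
half_right <- lorentz(pos + gamma, A, gamma, pos)  # height one gamma to the right

fwhm <- 2 * gamma                                  # full width at half maximum
```

If the value parsed from the 'gamma:' line of the fit report is the gamma of this formula, doubling it (and its error) would give the full width.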
(in R) Why is result of ksvm using user-defined linear kernel different from that of ksvm using "vanilladot"?
I wanted to use a user-defined kernel function for ksvm in R. So, as practice, I tried to write a vanilladot kernel and compare it with the "vanilladot" kernel that is built into "kernlab". I wrote my kernel as follows.
#
###vanilla kernel with class "kernel"
#
kfunction.k <- function(){
  k <- function (x,y){crossprod(x,y)}
  class(k) <- "kernel"
  k}

l<-0.1 ; C<-1/(2*l)

###use kfunction.k
tmp<-ksvm(x,factor(y),scaled=FALSE, type = "C-svc", kernel=kfunction.k(), C = C)
alpha(tmp)[[1]]
ind<-alphaindex(tmp)[[1]]
x.s<-x[ind,] ; y.s<-y[ind]
w.class.k<-t(alpha(tmp)[[1]]*y.s)%*%x.s
w.class.k
I thought the result of this operation would be equal to that of the following. However, it isn't.
#
###use "vanilladot"
#
l<-0.1 ; C<-1/(2*l)
tmp1<-ksvm(x,factor(y),scaled=FALSE, type = "C-svc", kernel="vanilladot", C = C)
alpha(tmp1)[[1]]
ind1<-alphaindex(tmp1)[[1]]
x.s<-x[ind1,] ; y.s<-y[ind1]
w.tmp1<-t(alpha(tmp1)[[1]]*y.s)%*%x.s
w.tmp1
I think this problem may be related to the kernel class. It occurs when the class is set to "kernel". However, when the class is set to "vanillakernel", the result of ksvm using the user-defined kernel is equal to that of ksvm using the "vanilladot" kernel built into kernlab.
#
###vanilla kernel with class "vanillakernel"
#
kfunction.v.k <- function(){
  k <- function (x,y){crossprod(x,y)}
  class(k) <- "vanillakernel"
  k}
# The only difference between kfunction.k and kfunction.v.k is "class(k)".

l<-0.1 ; C<-1/(2*l)

###use kfunction.v.k
tmp<-ksvm(x,factor(y),scaled=FALSE, type = "C-svc", kernel=kfunction.v.k(), C = C)
alpha(tmp)[[1]]
ind<-alphaindex(tmp)[[1]]
x.s<-x[ind,] ; y.s<-y[ind]
w.class.v.k<-t(alpha(tmp)[[1]]*y.s)%*%x.s
w.class.v.k
I don't understand why the result differs from "vanilladot" when the class is set to "kernel". Is there an error in my operation?
First, it seems like a really good question! Now to the point. In the ksvm source we can find where the line is drawn between a user-defined kernel and the built-ins:
if (type(ret) == "spoc-svc") {
    if (!is.null(class.weights))
        weightedC <- class.weights[weightlabels] * rep(C, nclass(ret))
    else weightedC <- rep(C, nclass(ret))
    yd <- sort(y, method = "quick", index.return = TRUE)
    xd <- matrix(x[yd$ix, ], nrow = dim(x)[1])
    count <- 0
    if (ktype == 4)
        K <- kernelMatrix(kernel, x)
    resv <- .Call("tron_optim", as.double(t(xd)), as.integer(nrow(xd)),
        as.integer(ncol(xd)), as.double(rep(yd$x - 1, 2)), as.double(K),
        as.integer(if (sparse) xd@ia else 0),
        as.integer(if (sparse) xd@ja else 0),
        as.integer(sparse), as.integer(nclass(ret)), as.integer(count),
        as.integer(ktype), as.integer(7), as.double(C), as.double(epsilon),
        as.double(sigma), as.integer(degree), as.double(offset),
        as.double(C), as.double(2), as.integer(0), as.double(0),
        as.integer(0), as.double(weightedC), as.double(cache),
        as.double(tol), as.integer(10), as.integer(shrinking),
        PACKAGE = "kernlab")
    reind <- sort(yd$ix, method = "quick", index.return = TRUE)$ix
    alpha(ret) <- t(matrix(resv[-(nclass(ret) * nrow(xd) + 1)], nclass(ret)))[reind, , drop = FALSE]
    coef(ret) <- lapply(1:nclass(ret), function(x) alpha(ret)[, x][alpha(ret)[, x] != 0])
    names(coef(ret)) <- lev(ret)
    alphaindex(ret) <- lapply(sort(unique(y)), function(x) which(alpha(ret)[, x] != 0))
    xmatrix(ret) <- x
    obj(ret) <- resv[(nclass(ret) * nrow(xd) + 1)]
    names(alphaindex(ret)) <- lev(ret)
    svindex <- which(rowSums(alpha(ret) != 0) != 0)
    b(ret) <- 0
    param(ret)$C <- C
}
Two things are important here. If we provide ksvm with our own kernel, then ktype=4 (while for vanillakernel, ktype=0), and this makes two changes:
1. For a user-defined kernel, the kernel matrix is computed instead of actually using the kernel.
2. The tron_optim routine is run with the information regarding the kernel.
Now, in svm.cpp we can find the tron routines, and in tron_run (called from tron_optim) we see that the LINEAR kernel has a separate optimization routine:
if (param->kernel_type == LINEAR)
{
    /* lots of code here */
    while (Cpj < Cp)
    {
        totaliter += s.Solve(l, prob->x, minus_ones, y, alpha, w, Cpj, Cnj, param->eps, sii, param->shrinking, param->qpsize);
        /* lots of code here */
    }
    totaliter += s.Solve(l, prob->x, minus_ones, y, alpha, w, Cp, Cn, param->eps, sii, param->shrinking, param->qpsize);
    delete[] w;
}
else
{
    Solver_B s;
    s.Solve(l, BSVC_Q(*prob,*param,y), minus_ones, y, alpha, Cp, Cn, param->eps, sii, param->shrinking, param->qpsize);
}
As you can see, the linear case is treated in a more complex, more detailed way. There is an inner optimization loop calling the solver many times. It would require a really deep analysis of the actual optimization being performed here, but at this step one can answer your question in the following way:
1. There is no error in your operation.
2. kernlab's svm has a separate routine for training an SVM with a linear kernel, which is selected based on the type of kernel passed to the code. Changing "kernel" to "vanillakernel" made ksvm think it was actually working with vanillakernel, and so it performed this separate optimization routine.
This does not seem to be a bug, as linear SVM is in fact very different from the kernelized version in terms of efficient optimization techniques. The number of heuristics and numerical issues that have to be taken care of is really big. As a result, some approximations are required and can lead to different results. While for a rich feature space (like those induced by the RBF kernel) it should not really matter, for simple kernels like the linear one these simplifications can lead to significant changes in output.
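The role of the class attribute can be illustrated with a small sketch in base R. This is not kernlab's code: kernel_type_code and make_kernel are hypothetical helpers, and only the codes 0 (vanillakernel) and 4 (user-defined) are taken from the discussion above. The point is that two kernel objects computing exactly the same inner product can still be routed to different solver paths purely because of their class.

```r
# Hypothetical helper, NOT kernlab's implementation: pick a code from the class.
kernel_type_code <- function(kernel) {
  if (inherits(kernel, "vanillakernel")) 0L else 4L  # 0 = built-in linear, 4 = user-defined
}

# Build the same linear kernel function with a chosen class attribute.
make_kernel <- function(class_name) {
  k <- function(x, y) crossprod(x, y)
  class(k) <- class_name
  k
}

k_user    <- make_kernel("kernel")
k_vanilla <- make_kernel("vanillakernel")

# Identical arithmetic on the same inputs.
x <- c(1, 2, 3)
y <- c(4, 5, 6)
same_value <- identical(k_user(x, y), k_vanilla(x, y))

# Different code path selected from the class alone.
codes <- c(user = kernel_type_code(k_user), vanilla = kernel_type_code(k_vanilla))
```

Here same_value is TRUE while the two codes differ, which mirrors why relabelling the class in the question changed the fitted coefficients without changing the kernel function itself.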