Separating circles using kernel PCA - r

I am trying to reproduce a simple example of using kernel PCA. The objective is to separate out the points from two concentric circles.
Creating the data:
library(dplyr)  # provides %>% and select(), used below
circle <- data.frame(radius = rep(c(0, 1), 500) + rnorm(1000, sd = 0.05),
phi = runif(1000, 0, 2 * pi),
group = rep(c("A", "B"), 500))
#
circle <- transform(circle,
x = radius * cos(phi),
y = radius * sin(phi),
z = rnorm(length(radius))) %>% select(group, x, y, z)
TFRAC = 0.75
#
train <- sample(1:1000, TFRAC * 1000)
circle.train <- circle[train,]
circle.test <- circle[-train,]
> head(circle.train)
group x y z
491 A -0.034216 -0.0312062 0.70780
389 A 0.052616 0.0059919 1.05942
178 B -0.987276 -0.3322542 0.75297
472 B -0.808646 0.3962935 -0.17829
473 A -0.032227 0.0027470 0.66955
346 B 0.894957 0.3381633 1.29191
I have split the data up into training and testing sets because I have the intention (once I get this working!) of testing the resulting model.
In principle, kernel PCA should allow me to separate the two classes. Other discussions of this example have used the Radial Basis Function (RBF) kernel, so I adopted it too. In R, kernel PCA is implemented in the kernlab package.
library(kernlab)
circle.kpca <- kpca(~ ., data = circle.train[, -1], kernel = "rbfdot", kpar = list(sigma = 10), features = 1)
I requested only the first component and specified the RBF kernel. This is the result:
There has definitely been a major transformation of the data, but the transformed data is not what I was expecting (which would be a nice, clean separation of the two classes). I have tried fiddling with the value of the parameter sigma and, although the results do vary dramatically, I still didn't get what I was expecting. I assume that sigma is related to the parameter gamma mentioned here, possibly via the relationship given here (without the negative sign?).
I'm pretty sure that I am making a naive rookie error here and I would really appreciate any pointers which would get me onto the right track.
Thanks,
Andrew.

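Try sigma = 20. I think you will get the answer you are looking for. The sigma in kernlab is what is usually called gamma for the RBF kernel: kernlab computes exp(-sigma * ||x - y||^2), so its sigma is inversely related to the squared bandwidth of the more common form exp(-||x - y||^2 / (2 * s^2)).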
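To make the relationship between the two parameterizations concrete, here is a minimal base R sketch. The function names rbf_gamma and rbf_bandwidth and the two sample points are made up for illustration; the first form follows the formula documented for kernlab's rbfdot, but the code does not call kernlab itself.

```r
# Two common ways to write the RBF (Gaussian) kernel.
# rbf_gamma uses a rate parameter, as in exp(-sigma * ||x - y||^2).
rbf_gamma <- function(x, y, gamma) exp(-gamma * sum((x - y)^2))
# rbf_bandwidth uses a standard-deviation-like bandwidth s.
rbf_bandwidth <- function(x, y, s) exp(-sum((x - y)^2) / (2 * s^2))

x <- c(0.05, 0.01)   # hypothetical point near the inner cluster
y <- c(0.90, 0.35)   # hypothetical point on the outer circle

gamma <- 20                  # plays the role of kernlab's sigma
s <- sqrt(1 / (2 * gamma))   # equivalent bandwidth

k_gamma <- rbf_gamma(x, y, gamma)
k_bandwidth <- rbf_bandwidth(x, y, s)
all.equal(k_gamma, k_bandwidth)   # the two forms give the same kernel value
```

A larger rate parameter therefore corresponds to a narrower kernel, which is why increasing sigma in kernlab makes the kernel more local.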

Related

Memory problems while using lm.circular()

I am trying to run circular regression in R using the circular package. My dataset is somewhat large, ~85000 rows and 6 variables. When I try to run the model, I get an error message reading "Error: cannot allocate vector of size 53.3 Gb." I am more of a statistician than a programmer, so I can't figure out how to fix this. It seems odd that it attempts such a large memory allocation, as my dataset is not that large. I have attached a fictional dataset and code below. Thank you.
library(circular)
set.seed(12)
n = 80000
df <- data.frame(y = rnorm(n,2,.2),
x1 = rnorm(n,100,2),
x2 = rnorm(n,0,1),
x3 = rnorm(n,9,.2),
x4 = rnorm(n,0,1),
x5 = rnorm(n,1,.1))
y <- circular(df$y, type = "angles", units = "radians")
x <- model.matrix(y ~., data = df)
m1 <- lm.circular(y = y, x = x, type = "c-l", init = c(1,.01,.5,.5,.5,.5))
The implementation tries to set up some diagonal matrices of size n x n using
A <- diag(k * A1(k), nrow = n)
g.p <- diag(apply(x, 1, function(row, betaPrev) 2/(1 + (t(betaPrev) %*%
row)^2), betaPrev = betaPrev), nrow = n)
(in circular:::LmCircularclRad) without using any sparse matrix tricks. For your example with n = 80000, each of those matrices holds 80000^2 doubles at 8 bytes apiece, roughly 50 GB of memory, and that allocation fails.
I don't think there's anything you can do to avoid this, other than suggesting a more efficient way to carry out the required calculations. Linear algebra with diagonal matrices can usually be done with much less memory, but you'll have to look closely at this code to see whether that's the case here.
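As an illustration of the general point (not a patch for the circular package), the following base R sketch estimates the size of a dense n x n matrix and shows on a toy example that multiplying by a diagonal matrix can be replaced by row scaling. The small dimensions and variable names are chosen here purely for demonstration.

```r
n <- 80000
# Memory for a dense n x n matrix of doubles (8 bytes each), in GiB.
dense_gib <- n^2 * 8 / 2^30

# Toy example: the dense diagonal matrix is not needed for the products.
set.seed(1)
m <- 5
d <- runif(m)                         # diagonal entries
X <- matrix(rnorm(m * 3), nrow = m)

via_diag <- diag(d, nrow = m) %*% X   # allocates an m x m matrix
via_scaling <- d * X                  # recycling scales row i of X by d[i]
all.equal(via_diag, via_scaling)

# The same idea for a weighted cross product t(X) %*% D %*% X.
xtdx_dense <- t(X) %*% diag(d, nrow = m) %*% X
xtdx_cheap <- crossprod(X, d * X)
all.equal(xtdx_dense, xtdx_cheap)
```

The scaled versions need only O(n) extra memory for the vector d, instead of O(n^2) for the diagonal matrix.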

Determine what is the break point for the slope change in R [migrated]

I'm trying to implement a "change point" analysis, or a multiphase regression using nls() in R.
Here's some fake data I've made. The formula I want to use to fit the data is:
$y = \beta_0 + \beta_1x + \beta_2\max(0,x-\delta)$
What this is supposed to do is fit the data up to a certain point with a certain intercept and slope ($\beta_0$ and $\beta_1$), then, after a certain x value ($\delta$), augment the slope by $\beta_2$. That's what the whole max thing is about. Before the $\delta$ point, it'll equal 0, and $\beta_2$ will be zeroed out.
So, here's my function to do this:
changePoint <- function(x, b0, slope1, slope2, delta){
b0 + (x*slope1) + (max(0, x-delta) * slope2)
}
And I try to fit the model this way
nls(y ~ changePoint(x, b0, slope1, slope2, delta),
data = data,
start = c(b0 = 50, slope1 = 0, slope2 = 2, delta = 48))
I chose those starting parameters because I know they are the true parameter values, since I made the data up.
However, I get this error:
Error in nlsModel(formula, mf, start, wts) :
singular gradient matrix at initial parameter estimates
Have I just made unfortunate data? I tried fitting this on real data first, and was getting the same error, and I just figured that my initial starting parameters weren't good enough.
At first I thought the problem might result from the fact that max is not vectorized, but that's not true. It does make changePoint a pain to work with, hence the following modification:
changePoint <- function(x, b0, slope1, slope2, delta) {
b0 + (x*slope1) + (sapply(x-delta, function (t) max(0, t)) * slope2)
}
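For comparison, base R's pmax does the element-wise maximum directly, so the sapply wrapper is not strictly needed. The sketch below uses a renamed function, change_point, to avoid clashing with the definitions above. Note that in the scalar max form the last term is the same constant for every observation, which makes it collinear with the intercept; whether that alone explains the nls error in the question is not verified here.

```r
x <- 40:60
delta <- 48

scalar_term <- max(0, x - delta)    # a single number: the largest of 0 and all x - delta
vector_term <- pmax(0, x - delta)   # one value per element of x

# Vectorized change-point model built on pmax.
change_point <- function(x, b0, slope1, slope2, delta) {
  b0 + slope1 * x + slope2 * pmax(0, x - delta)
}
y_hat <- change_point(x, b0 = 50, slope1 = 0, slope2 = 2, delta = 48)

# Same values as the sapply-based version.
all.equal(vector_term, sapply(x - delta, function(t) max(0, t)))
```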
This R-help mailing list post describes one way in which this error may result: the rhs of the formula is overparameterized, such that changing two parameters in tandem gives the same fit to the data. I can't see how that is true of your model, but maybe it is.
In any case, you can write your own objective function and minimize it. The following function gives the squared error for data points (x,y) and a certain value of the parameters (the weird argument structure of the function is to account for how optim works):
sqerror <- function (par, x, y) {
sum((y - changePoint(x, par[1], par[2], par[3], par[4]))^2)
}
Then we say:
optim(par = c(50, 0, 2, 48), fn = sqerror, x = x, y = data)
And see:
$par
[1] 54.53436800 -0.09283594 2.07356459 48.00000006
Note that for my fake data (x <- 40:60; data <- changePoint(x, 50, 0, 2, 48) + rnorm(21, 0, 0.5)) there are lots of local minima, so the result depends on the initial parameter values you give. I suppose if you wanted to take this seriously you'd call the optimizer many times with random initial parameters and examine the distribution of results.
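A minimal sketch of that multi-start idea, assuming a vectorized model function and uniformly drawn starting values. The ranges for the starts, the number of restarts, and the names change_point, sq_error, and starts are arbitrary choices made for this illustration.

```r
# Vectorized change-point model and squared-error objective.
change_point <- function(x, b0, slope1, slope2, delta) {
  b0 + slope1 * x + slope2 * pmax(0, x - delta)
}
sq_error <- function(par, x, y) {
  sum((y - change_point(x, par[1], par[2], par[3], par[4]))^2)
}

set.seed(1)
x <- 40:60
y <- change_point(x, 50, 0, 2, 48) + rnorm(length(x), 0, 0.5)

# Draw random starting values (ranges are assumptions for this example).
n_starts <- 50
starts <- cbind(b0     = runif(n_starts, 40, 60),
                slope1 = runif(n_starts, -1, 1),
                slope2 = runif(n_starts, 0, 4),
                delta  = runif(n_starts, min(x), max(x)))

# Run optim() once per start and collect the objective values.
fits <- lapply(seq_len(n_starts), function(i) {
  optim(par = starts[i, ], fn = sq_error, x = x, y = y)
})
values <- vapply(fits, function(f) f$value, numeric(1))
best <- fits[[which.min(values)]]

best$par         # parameters of the best run
summary(values)  # spread of objective values across starts
```

A wide spread in values indicates that many starts ended in different local optima, in which case only the best few runs are worth inspecting.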
Just wanted to add that you can do this with many other packages. If you want to get an estimate of uncertainty around the change point (something nls cannot do), try the mcp package.
# Simulate the data
df = data.frame(x = 1:100)
df$y = c(rnorm(20, 50, 5), rnorm(80, 50 + 1.5*(df$x[21:100] - 20), 5))
# Fit the model
model = list(
y ~ 1, # Intercept
~ 0 + x # Joined slope
)
library(mcp)
fit = mcp(model, df)
Let's plot it with a prediction interval (green line). The blue density is the posterior distribution for the change point location:
# Plot it
plot(fit, q_predict = TRUE)
You can inspect individual parameters in more detail using plot_pars(fit) and summary(fit).

Numerically solving Lotka-Volterra ODE in R

Disclaimer: Cross-post on Stack Computational Science
Aim: I am trying to numerically solve a Lotka-Volterra ODE in R, using the sde.sim() function in the sde package. I would like to use sde.sim() in order to eventually transform this system into an SDE. So initially, I started with a simple ODE system (the Lotka-Volterra model) without a noise term.
The Lotka-Volterra ODE system:
$\frac{dx}{dt} = \alpha x - \beta x y, \quad \frac{dy}{dt} = \delta x y - \gamma y$
with initial values x = 10 and y = 10.
The parameter values for alpha, beta, delta and gamma are 1.1, 0.4, 0.1 and 0.4 respectively (mimicking this example).
Attempt to solve problem:
library(sde)
d <- expression((1.1 * x[0] - 0.4 * x[0] * x[1]), (0.1 * x[0] * x[1] - 0.4 * x[1]))
s <- expression(0, 0)
X <- sde.sim(X0=c(10,10), T = 10, drift=d, sigma=s)
plot(X)
However, this does not seem to generate a nice cyclic behavior of the predator and prey population.
Expected Output
I used the deSolve package in R to generate the expected output.
library(deSolve)
alpha <- 1.1
beta <- 0.4
delta <- 0.1
gamma <- 0.4
yini <- c(X = 10, Y = 10)
# Right-hand side of the Lotka-Volterra system in the form deSolve expects
Lot_Vol <- function (t, y, parms) {
  with(as.list(y), {
    dX <- alpha * X - beta * X * Y
    dY <- delta * X * Y - gamma * Y
    list(c(dX, dY))
  })
}
times <- seq(from = 0, to = 100, by = 0.01)
out <- ode(y = yini, times = times, func = Lot_Vol, parms = NULL)
plot(y=out[, "X"], x = out[, "time"], type = 'l', col = "blue", xlab = "Time", ylab = "Animals (#)")
lines(y=out[, "Y"], x = out[, "time"], type = 'l', col = "red")
Question
I think something might be wrong with the drift function, but I am not sure what. What is going wrong in the attempt to solve this system of ODEs with sde.sim()?
Assuming that not specifying a method selects the first one in the list, and that all other unspecified parameters take their default values, you are performing the Euler method with step size h=0.1.
As is well known, for a system whose trajectories are convex closed curves, the Euler method produces an outward spiral. As a first-order method, its error should grow to a size of about T*h=10*0.1=1. A more pessimistic estimate puts the error at (exp(L*T)-1)*h/L; with L=3 in some adapted norm this gives a scale of 3.5e11.
Exploring the actual error e(t)=c(t)*h of the Euler method, one gets the following plots. On the left are the errors of the components, and on the right the trajectories for various step sizes of the Euler method. The error coefficient function c(t) in the left plots is scaled down by the factor (exp(L*t)-1)/L to get comparable values over large time intervals; the value L=0.06 gave the best balance.
One can see that the actual error
abs(e(t))<30*h*(exp(L*t)-1)/L
is in-between the linear and exponential error models, but closer to the linear one.
To reduce the error, you have to decrease the step size. In the call of sde.sim, this is achieved by setting the parameter N=5000 or larger to get a step size h=10/5000=0.002, so that you can hope to be correct in the first two digits with an error bound of 30*h*T=0.6. In the SDE case you accumulate Gaussian noise of size sqrt(h) in every step, so the truncation error of O(h^2) is a rather small perturbation of the random number.
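The effect of the step size can be checked without any package by coding the explicit Euler method directly. The sketch below is an illustration in base R, not what sde.sim does internally. It measures how far the numerical solution drifts from the quantity delta*x - gamma*log(x) + beta*y - alpha*log(y), which is constant along exact solutions of this system. The names p, lv_rhs, lv_invariant, and euler are hypothetical helpers.

```r
# Parameters from the question, collected in a list.
p <- list(alpha = 1.1, beta = 0.4, delta = 0.1, gamma = 0.4)

# Right-hand side of the Lotka-Volterra system; s = c(x, y).
lv_rhs <- function(s, p) {
  c(p$alpha * s[1] - p$beta * s[1] * s[2],
    p$delta * s[1] * s[2] - p$gamma * s[2])
}

# Quantity conserved by the exact solution.
lv_invariant <- function(s, p) {
  p$delta * s[1] - p$gamma * log(s[1]) + p$beta * s[2] - p$alpha * log(s[2])
}

# Explicit Euler method with n_steps steps on [0, t_end]; returns the final state.
euler <- function(s0, t_end, n_steps, p) {
  h <- t_end / n_steps
  s <- s0
  for (i in seq_len(n_steps)) {
    s <- s + h * lv_rhs(s, p)
  }
  s
}

s0 <- c(10, 10)
v0 <- lv_invariant(s0, p)
drift_coarse <- lv_invariant(euler(s0, 10, 100, p), p) - v0    # h = 0.1
drift_fine   <- lv_invariant(euler(s0, 10, 5000, p), p) - v0   # h = 0.002
c(coarse = drift_coarse, fine = drift_fine)
```

A drift away from zero means the numerical trajectory has left the closed orbit it started on; the drift shrinks as the step size is reduced.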

Sample from a custom likelihood function

I have the following likelihood function which I used in a rather complex model (in practice on a log scale):
library(plyr)
dcustom=function(x,sd,L,R){
R. = (log(R) - log(x))/sd
L. = (log(L) - log(x))/sd
ll = pnorm(R.) - pnorm(L.)
return(ll)
}
df=data.frame(Range=seq(100,500),sd=rep(0.1,401),L=200,U=400)
df=mutate(df, Likelihood = dcustom(Range, sd,L,U))
with(df,plot(Range,Likelihood,type='l'))
abline(v=200)
abline(v=400)
In this function, the sd is predetermined and L and R are "observations" (very much like the endpoints of a uniform distribution), so all 3 of them are given. The above function provides a large likelihood (1) if the model estimate x (derived parameter) is in between the L-R range, a smooth likelihood decrease (between 0 and 1) near the bounds (of which the sharpness is dependent on the sd), and 0 if it is too much outside.
This function works very well to obtain estimates of x, but now I would like to do the inverse: draw a random x from the above function. If I would do this many times, I would generate a histogram that follows the shape of the curve plotted above.
The ultimate goal is to do this in C++, but I think it would be easier for me if I could first figure out how to do this in R.
There's some useful information online that helps me start (http://matlabtricks.com/post-44/generate-random-numbers-with-a-given-distribution, https://stats.stackexchange.com/questions/88697/sample-from-a-custom-continuous-distribution-in-r) but I'm still not entirely sure how to do it and how to code it.
I presume (not sure at all!) the steps are:
transform likelihood function into probability distribution
calculate the cumulative distribution function
inverse transform sampling
Is this correct and if so, how do I code this? Thank you.
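The three steps listed above can be sketched in base R with a numerical approximation on a grid. This is a minimal illustration under stated assumptions: the grid range 50 to 800, its resolution, and the names grid, dens, cdf, and draws are choices made here, and the likelihood is treated as an unnormalized density in x.

```r
# Likelihood from the question, treated as an unnormalized density in x.
dcustom <- function(x, sd, L, R) {
  pnorm((log(R) - log(x)) / sd) - pnorm((log(L) - log(x)) / sd)
}

# Step 1: evaluate the density on a fine grid that covers both tails.
grid <- seq(50, 800, length.out = 7501)
dens <- dcustom(grid, sd = 0.1, L = 200, R = 400)

# Step 2: cumulative distribution function, normalized to end at 1.
cdf <- cumsum(dens)
cdf <- cdf / cdf[length(cdf)]

# Step 3: inverse transform sampling, mapping uniform draws through the inverse CDF.
set.seed(1)
u <- runif(10000)
idx <- findInterval(u, cdf) + 1          # first grid index whose CDF exceeds u
draws <- grid[pmin(idx, length(grid))]

hist(draws, breaks = 100, main = "Inverse transform sample")
abline(v = c(200, 400))
```

The draws are restricted to grid points, so the grid spacing (0.1 here) sets the resolution of the sample. The same three steps carry over to C++ with a binary search in place of findInterval.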
One idea might be to use the Metropolis-Hastings algorithm to obtain a sample from the distribution, given all the other parameters and your likelihood.
# Metropolis-Hastings algorithm
set.seed(2018)
n_sample <- 100000
posterior_sample <- rep(NA, n_sample)
x <- 300 # starting value: I chose 300 based on your likelihood plot
for (i in 1:n_sample){
  lik <- dcustom(x = x, sd = 0.1, L = 200, R = 400)
  # propose a value for x (you can adjust the step size with the sd)
  x.proposed <- x + rnorm(1, 0, sd = 20)
  lik.proposed <- dcustom(x = x.proposed, sd = 0.1, L = 200, R = 400)
  r <- lik.proposed/lik # this is the acceptance ratio
  # accept the proposed value with probability min(1, r)
  if (runif(1) < r) {
    x <- x.proposed
  }
  # record the current state on every iteration; a rejected proposal repeats the previous value
  posterior_sample[i] <- x
}
# plotting the density
approximate_distr <- na.omit(posterior_sample)
d <- density(approximate_distr)
plot(d, main = "Sample from distribution")
abline(v=200)
abline(v=400)
# If you now want to sample just a few values (for example, 5) you could use
sample(approximate_distr,5)
#[1] 281.7310 371.2317 378.0504 342.5199 412.3302

`smooth.spline` severely underfits long (periodic) time series

I would like to smooth very long, noisy data, in R. But I have found that for highly periodic data, out-of-the-box smooth.spline() quickly breaks down and the smoothed data begins to exhibit ringing.
Consider a cosine time series (with or without noise)
t <- seq(0,100*2*pi,length.out=3000)
y <- cos(t)# + rnorm(length(t), 0,0.05)
y100_s <- smooth.spline(y)$y
plot( y~t, type="l" )
lines( y100_s~t, col="blue" )
We can examine the effect of adding more values to smooth.spline(),
# rms increases as points are added to smooth.spline
rms <- sapply( seq(250,3000,by=250), function(i)
sqrt( mean( (y[1:i] - smooth.spline(y[1:i])$y)^2 )) )
plot(rms)
Even at lower frequencies the fit is ringing (optional).
t <- seq(0,50*2*pi,length.out=3000)
y <- cos(t)# + rnorm(length(t), 0,0.05)
y50_s <- smooth.spline(y)$y
require(pracma)
peaks <- list(findpeaks(y50_s),findpeaks(-y50_s))
plot( y~t, type="l" )
lines( y50_s~t, col="red" )
lines( peaks[[1]][,1]~t[peaks[[1]][,2]], type="l" )
lines( -peaks[[2]][,1]~t[peaks[[2]][,2]], type="l" )
After exploring for a bit, this behaviour appears to be a function of the spar argument, but I can't set this to a small enough value to eliminate the effect. This might be an obvious consequence of spline fitting, and a fault of relying on out-of-the-box methods, but I would appreciate some insight. Are there controls I can specify in smooth.spline(), or alternative recommendations/strategies for smoothing?
I don't know whether you are always fitting a periodic signal. If that is the case, using a periodic spline from mgcv::gam is much better. However, let's forget about this issue for the moment.
If your data have high-frequency oscillation, you have to choose a sufficient number of knots, i.e., a decent density of knots; otherwise you end up over-smoothing (i.e., under-fitting).
Have a look at your example:
t <- seq(0, 100 * 2 * pi, length.out = 3000)
y <- cos(t) # + rnorm(length(t), 0, 0.05)
fit <- smooth.spline(t, y)
You have n = 3000 data points. By default, smooth.spline uses far fewer knots than data points when n > 49. The exact number is chosen by the service routine .nknots.smspl, but there is no optimality justification for its choice, so it is up to you to judge whether it is reasonable. Let's check:
fit$fit$nk - 2L ## or `.nknots.smspl(3000)`
# [1] 194
fit$df
# [1] 194
It uses only 194 knots, and the model ends up with 194 degrees of freedom, i.e., with no penalization effect. As I said earlier, you just end up under-fitting:
plot(t, y, type = "l", col = "gray")
lines(fit, col = 2)
Ideally, penalized regression ends up with degrees of freedom substantially smaller than the number of knots. It is often forgotten that penalization is used to fix the over-fitting problem of the original non-penalized regression. If we don't see any penalization effect, then the original non-penalized model is under-fitting the data, so increase the number of knots until we reach an over-fitting status. If you don't want to think about this, set all.knots = TRUE. A univariate smoothing spline is computationally very cheap, at O(n) cost. Even if you use all data points as knots, you won't run into efficiency problems.
fit <- smooth.spline(t, y, all.knots = TRUE)
fit$fit$nk - 2L
# [1] 3000
fit$df
# [1] 3000
Oh, we still do not see the effect of penalization. Why? Because the data are not noisy. You did not add noise to your y, so by using all knots we are doing interpolation. Add some noise to y to see what I mean about penalization.
set.seed(0)
t <- seq(0, 100 * 2 * pi, length.out = 3000)
y <- cos(t) + rnorm(length(t), 0, 0.05)
fit <- smooth.spline(t, y, all.knots = TRUE)
fit$fit$nk
# [1] 3002
fit$df
# [1] 705.0414
Note how much smaller 705 is than 3000. Let's have a look at the fitted spline:
plot(t, y, type = "l", col = "gray")
lines(fit, col = 2)
There is neither under-fitting nor over-fitting; penalization results in optimal trade-off between bias and variance.
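To put a number on the difference, the following sketch compares the default knot selection with all.knots = TRUE by the root-mean-square error against the known noise-free signal. It reuses the simulated data from above; the helper rms and the variable names are introduced here for illustration.

```r
set.seed(0)
t <- seq(0, 100 * 2 * pi, length.out = 3000)
signal <- cos(t)                                  # known truth in this simulation
y <- signal + rnorm(length(t), 0, 0.05)

fit_default <- smooth.spline(t, y)                    # reduced knot set
fit_all     <- smooth.spline(t, y, all.knots = TRUE)  # one knot per unique t

# Root-mean-square error of the fitted curve against the true signal.
rms <- function(fit) sqrt(mean((predict(fit, t)$y - signal)^2))
c(default = rms(fit_default), all_knots = rms(fit_all))
```

With real data the true signal is unknown, so a comparison like this is only possible in simulation; in practice the effective degrees of freedom (fit$df) relative to the number of knots is the more useful diagnostic.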
