How to interpolate these signal data with a polynomial? - r

I am trying to find the coefficients of a polynomial in R, but I am not sure what order the polynomial should be.
I have data:
x=seq(6, 174, by=8)
y=rep(c(-1,1),11)
Now I want to find the (obviously) non-linear function that passes through all these points. The function values should stay in the interval [-1, 1], and each of these points should be understood as the vertex of a parabola.
EDIT
Actually this is not example data, I just need exactly this function for exactly these points.
I tried to describe it with polynomials up to degree 25 and then gave up; with polynomials it seems only possible to approximate the curve, not to obtain it exactly.
Comments suggested using a sine curve. Does someone know how to get the exact trigonometric function?

Your data have a strong characteristic: they look like samples from a sinusoid signal. With the restriction that y is constrained to [-1, 1], we know for sure the amplitude is 1, so let's assume a sine function:
y = sin((2 * pi / T) * x + phi)
where T is the period and phi is the phase. The period of your data is evident: 2 * 8 = 16. To get phi, just use the fact that when x = 6, y = -1. That is,
sin(12 * pi / T + phi) = -1
which gives one solution: phi = -pi/2 - 12 * pi / T.
Here we go:
T <- 16
phi <- -pi/2 - 12 * pi / T
f <- function(x) sin(x * pi / 8 + phi)
plot(x, y)
x0 <- seq(6, 174, by = 0.2)
y0 <- f(x0)
lines(x0, y0, col = 2)
Your original intention to use a polynomial is not impossible, but it can't be an ordinary polynomial. An ordinary polynomial is unbounded: it tends to Inf or -Inf as x tends to Inf or -Inf.
A local polynomial is possible. Since you say each point should be understood as the vertex of a parabola, you seem to expect a smooth function. A cubic spline is then ideal. Specifically, we don't want a natural cubic spline but a periodic cubic spline. The spline function from the stats package can help us:
int <- spline(x[-1], y[-1], method = "periodic", xout = x0)
Note, I have dropped the first datum: with the "periodic" method, spline wants y to have the same value on both ends. Once we drop the first datum, the y values are 1 on both ends.
plot(x, y)
lines(int, col = 2)
I did not compare the spline interpolation with the sinusoid function. They can't be exactly the same, but in statistical modelling we can use either one to model the underlying cyclic signal / effect.
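If you do want to quantify how close the two are, here is a quick check of my own, reusing x, y, f and x0 from the code above (restricted to the spline's knot range [14, 174] to avoid any out-of-range evaluation):
xc <- x0[x0 >= 14]
sp <- spline(x[-1], y[-1], method = "periodic", xout = xc)
max(abs(sp$y - f(xc)))  ## maximum absolute discrepancy between the two interpolants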

Related

How can I use cubic splines for extrapolation?

I am looking to use natural cubic splines to interpolate between some data points using stats::splinefun(). The documentation states:
"These interpolation splines can also be used for extrapolation, that is prediction at points outside the range of ‘x’. Extrapolation makes little sense for ‘method = "fmm"’; for natural splines it is linear using the slope of the interpolating curve at the nearest data point."
I have attempted to replicate the spline function in Excel as a review, which is working fine except that I can't replicate the extrapolation approach. Example data and code below:
library(stats)
# Example data
x <- c(1,2,3,4,5,6,7,8,9,10,12,15,20,25,30,40,50)
y <- c(7.1119,5.862,5.4432,5.1458,4.97,4.8484,4.7726,4.6673,4.5477,4.437,4.3163,4.1755,4.0421,3.9031,3.808,3.6594,3.663)
df <- data.frame(x,y)
# Create spline functions
splinetest <- splinefun(x = df$x, y = df$y, method = "natural")
# Create dataframe of coefficients
splinetest_coef <- environment(splinetest)$z
splinetest_coefdf <- data.frame(i = 0:16, x = splinetest_coef$x, a = splinetest_coef$y, b = splinetest_coef$b, c = splinetest_coef$c, d = splinetest_coef$d)
# Calculate extrapolated value at 51
splinetest(51)
# Result:
# [1] 3.667414
Question: How is this result calculated?
Expected result using linear extrapolation from x = 40 and x = 50 is 3.663 + (51 - 50) * (3.663 - 3.6594) / (50 - 40) = 3.66336
The spline coefficients at the last knot (x = 50) are a = 3.663 and b = 0.00441355...
Therefore splinetest(51) is calculated as 3.663 + 0.00441355 * (51 - 50).
How is 0.00441355 calculated in this function?
Linear extrapolation is not done by computing the slope between a particular pair of points, but by using the estimated derivative at the boundary (the "nearest data point" in R's documentation). The derivative at any point can be calculated directly from the spline function, e.g. to calculate the estimated first derivative at the upper boundary:
splinetest(max(df$x), deriv = 1)
[1] 0.004413552
This agrees with your manual back-calculation of the slope used to do the extrapolation.
As pointed out in the comments, plotting the end of the curve/data set with curve(splinetest, from = 30, to = 60); points(x, y) clearly illustrates the difference between the derivative at the boundary (x = 50) and the line based on the last two data points (i.e. (y(x=50) - y(x=40)) / 10).
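To confirm this numerically, here is a small sketch (reusing splinetest and df from the question) that rebuilds the extrapolated value from the boundary value and boundary derivative:
## linear extrapolation from the upper boundary: y(50) + y'(50) * (51 - 50)
x_max <- max(df$x)
splinetest(x_max) + splinetest(x_max, deriv = 1) * (51 - x_max)
## [1] 3.667414, identical to splinetest(51)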

Estimation of noise parameters from data

I'm interested in extracting noise parameters from (x, y) data, where x are known input values, y are the corresponding signals, and I typically know quite well the functional form of the noise-generating processes.
###
library(MASS)
n <- 100000
x <- runif(n, min = 50, max = 1000)
y1 <- rnorm(n, 0, 5) + rnegbin(x / 4, theta = 100)
plot(x[seq(1, n, by = 100)], y1[seq(1, n, by = 100)]) # plot every 100th datapoint only, for speed
y2 <- x / 7.5 + rnorm(n, 50, 2) + rnorm(n, 0, 0.1 * x / 7.5)
plot(x[seq(1, n, by = 100)], y2[seq(1, n, by = 100)])
y3 <- x + rnorm(n, 0, 5) + rnorm(n, 0, 0.1 * x)
plot(x[seq(1, n, by = 100)], y3[seq(1, n, by = 100)])
y4 <- rnorm(n, 0, 5) + (rnegbin((x + rnorm(n, 0, 5)), theta = 100)) / 2
plot(x[seq(1, n, by = 100)], y4[seq(1, n, by = 100)])
###
In the (x, y1) example, the y data are (noisy) counts generated from the input values x according to a negative binomial model (theta = 100), plus some constant (homoskedastic) Gaussian noise (avg = 0, stdev = 5). Also, somewhere in the process before the counts are generated there is a step that scales the signal down by a factor of 4.
In the (x, y2) example, the signals are generated by a different kind of process where the response factor is 1 / 7.5: the y values are first scaled down by that factor, then constant Gaussian background noise (avg = 50, stdev = 2) is added, plus more Gaussian noise that is proportional to y (= x / 7.5).
I suppose my main question is: is there a function (or functions) that takes (x, y) data, as well as a user-specified noise model, as inputs and spits out the best estimates for (in the case of example y1) the four parameters: theta, constant noise avg, constant noise stdev, and the scaling factor?
As a simplification, as in example (x, y3), I may be able to scale the signals manually before the fitting so that the signals are (on average) directly proportional to the input. The algorithm should then not consider any scaling factor but assume that, on average, y is directly proportional to x (slope = 1), eliminating this parameter from consideration and thus simplifying the fitting.
There is an even more complicated example, (x, y4). Here the signals are produced by a complex process: the signal is initially proportional to x (no scaling), constant Gaussian noise is added (avg = 0, stdev = 5), then noisy counts are generated from that already noisy input, the signals are divided by 2, and then more constant Gaussian noise (avg = 0, stdev = 5) is added on top.
All of this seems doable in a data-processing-oriented language like R, but I haven't been able to find a suitable solution. I am not a very experienced programmer in general (or in R specifically), and I have no experience in developing algorithms, so my perceptions may be totally off; in that case, my apologies.
The noise levels and exact models in these toy examples may not be realistic, but I'm just trying to illustrate that I'd like a way to model somewhat complex, user-specified, multi-parameter noise scenarios. In a lot of cases I know to high accuracy what kind of processes generated the real-world data of interest, so that is not the problem: I can write down the function. The bottleneck is knowing which R function(s)/packages are appropriate (and whether something like this is available in R at all) and how to use them correctly so I can extract the parameters (my main interest) that characterize these processes.
In case there is no general solution that covers a lot of ground, I'm of course also interested in algorithms and functions that specialize in just one type of data (e.g., just linear regression). Finally, if fitting complex noise models is not an option, I'd also be interested in starting with something simpler, like fitting a negative binomial alone; in some cases the other noise sources may be negligible, so that may do. Like this one:
y5 <- rnegbin(x, theta = 100)
Thank you!
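For that last, simplest case, here is a minimal sketch of what such a fit could look like (my own illustration, not from the post; it assumes MASS's rnegbin and the mu/size parameterization of stats::dnbinom, with the slope fixed at 1 as described above):
library(MASS)
set.seed(42)
n <- 100000
x <- runif(n, min = 50, max = 1000)
y5 <- rnegbin(x, theta = 100)  ## counts with mean x and negative binomial noise
## negative log-likelihood for theta, with the mean fixed at mu = x (slope = 1)
negll <- function(log_theta) -sum(dnbinom(y5, mu = x, size = exp(log_theta), log = TRUE))
## optimize on the log scale so theta stays positive
fit <- optim(log(10), negll, method = "BFGS")
exp(fit$par)  ## estimate of theta; should be close to 100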

Numerically solving Lotka-Volterra ODE in R

Disclaimer: cross-posted on Computational Science Stack Exchange.
Aim: I am trying to numerically solve a Lotka-Volterra ODE in R, using the sde.sim() function in the sde package. I would like to use sde.sim() in order to eventually transform this system into an SDE. So initially I started with a simple ODE system (the Lotka-Volterra model) without a noise term.
The Lotka-Volterra ODE system:
dx/dt = alpha * x - beta * x * y
dy/dt = delta * x * y - gamma * y
with initial values x = 10 and y = 10.
The parameter values for alpha, beta, delta and gamma are 1.1, 0.4, 0.1 and 0.4 respectively (mimicking this example).
Attempt to solve problem:
library(sde)
d <- expression((1.1 * x[0] - 0.4 * x[0] * x[1]), (0.1 * x[0] * x[1] - 0.4 * x[1]))
s <- expression(0, 0)
X <- sde.sim(X0=c(10,10), T = 10, drift=d, sigma=s)
plot(X)
However, this does not seem to generate a nice cyclic behavior of the predator and prey population.
Expected Output
I used the deSolve package in R to generate the expected output.
library(deSolve)
alpha <- 1.1
beta <- 0.4
delta <- 0.1
gamma <- 0.4
yini <- c(X = 10, Y = 10)
Lot_Vol <- function(t, y, parms) {
  with(as.list(y), {
    dX <- alpha * X - beta * X * Y
    dY <- delta * X * Y - gamma * Y
    list(c(dX, dY))
  })
}
times <- seq(from = 0, to = 100, by = 0.01)
out <- ode(y = yini, times = times, func = Lot_Vol, parms = NULL)
plot(y=out[, "X"], x = out[, "time"], type = 'l', col = "blue", xlab = "Time", ylab = "Animals (#)")
lines(y=out[, "Y"], x = out[, "time"], type = 'l', col = "red")
Question
I think something might be wrong with the drift function; however, I am not sure what. What is going wrong in this attempt to solve the system of ODEs with sde.sim()?
Assuming that not specifying a method takes the first in the list, and that all other non-specified parameters take default values, you are performing the Euler method with step size h=0.1.
As is known, on a system whose exact trajectories are convex, concentric closed curves, the Euler method produces an outward spiral. As a first-order method, the error should grow to a size of about T*h = 10*0.1 = 1. Or, taking the more pessimistic estimate, the error has size (exp(L*T) - 1) * h / L; with L = 3 in some adapted norm, this gives a scale of 3.5e11.
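For concreteness, that pessimistic bound is easy to evaluate (a one-line check of my own):
## (exp(L*T) - 1) * h / L with L = 3, T = 10, h = 0.1
(exp(3 * 10) - 1) * 0.1 / 3
## [1] 3.562217e+11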
Exploring the actual error e(t) = c(t) * h of the Euler method gives the plots below (not reproduced here): on the left, the errors of the components; on the right, the trajectories for various step sizes in the Euler method. The error coefficient c(t) in the left plots is scaled down by the factor (exp(L*t) - 1) / L to get comparable values over large time intervals; the value L = 0.06 gave the best balance.
One can see that the actual error
abs(e(t))<30*h*(exp(L*t)-1)/L
is in-between the linear and exponential error models, but closer to the linear one.
To reduce the error, you have to decrease the step size. In the call to sde.sim, this is achieved by setting the parameter N = 5000 or larger, giving a step size h = 10/5000 = 0.002, so that you can hope to be correct in the first two digits, with an error bound of 30*h*T = 0.6. In the SDE case you accumulate Gaussian noise of size sqrt(h) in every step, so that the truncation error of O(h^2) is a rather small perturbation of the random number.
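For instance, a sketch reusing the drift d and sigma s from the question, assuming the call otherwise works as intended and changing only the number of steps:
## same call as in the question, but with N = 5000 steps, i.e. h = 10/5000 = 0.002
X <- sde.sim(X0 = c(10, 10), T = 10, drift = d, sigma = s, N = 5000)
plot(X)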

How can I plot this transcendental equation in R?

This is the Kepler equation in terms of the angle theta:
M=2*atan(tan(theta/2)*c)-e*sin(2*atan(tan(theta/2)*c))
where e=0.2056 and c=sqrt((1-e)/(1+e))
M goes from 0 to 2pi.
My X value is M and my Y value is theta. What code should I use to plot theta(M)?
Adjusted answer to make the range of M be (0, 2*pi).
Your equation:
M=2*atan(tan(theta/2)*c)-e*sin(2*atan(tan(theta/2)*c))
defines M as a function of theta. It may be that in actual use you will know M and need to compute theta, but to get a plot, there is no need for an analytic formula for theta as a function of M. You just need a series of x-y values. So, you can generate a sequence of thetas, compute M and plot them, like this:
e=0.2056
c=sqrt((1-e)/(1+e))
theta = seq(0,2*pi, 0.1)
M=2*atan(tan(theta/2)*c)-e*sin(2*atan(tan(theta/2)*c))
M[M<0] = M[M<0] + 2*pi
plot(M, theta, pch=20)
If you need to be able to compute values of theta from a given M, you can approximate the inverse function like this.
THETA = approxfun(M, theta)
plot(M, THETA(M), type="l", ylab="theta")
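If you need theta for a given M more accurately than the linear interpolation from approxfun, one option (my own sketch, reusing e and c from above) is to solve the equation numerically with uniroot; since the adjusted M is monotone in theta on (0, 2*pi), the root is unique for any M0 strictly inside that interval:
## solve M(theta) = M0 for theta, using the same branch adjustment as above
kepler_theta <- function(M0) {
  g <- function(theta) {
    Mi <- 2 * atan(tan(theta / 2) * c) - e * sin(2 * atan(tan(theta / 2) * c))
    if (Mi < 0) Mi <- Mi + 2 * pi  ## shift the negative branch into (0, 2*pi)
    Mi - M0
  }
  uniroot(g, interval = c(1e-8, 2 * pi - 1e-8))$root
}
kepler_theta(pi)  ## theta at M = pi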

Calculate the volume under a plot of kernel bivariate density estimation

I need to calculate a measure called mutual information. First of all, I need to calculate another measure, called entropy, for example, the joint entropy of x and y:
-∬p(x,y)·log p(x,y)dxdy
So, to calculate p(x,y), I used a kernel density estimator (the function kde2d), and it returned the Z values (the estimated density of x and y in each window).
So by now I have a 100 x 100 matrix of Z values, which is my p(x,y). But I have to integrate it, by finding the volume under the surface (a double integral), and I haven't found a way to do that. The function quad2d, which computes a double quadrature, didn't work, because I only have a numerical matrix p(x,y), not a function, and it gives me a constant...
Does anyone know a way to find that volume / calculate the double integral?
(Image of the persp3d plot omitted.)
Thanks everybody!
Once you have the results from kde2d, it is very straightforward to compute a numerical integral. The example session below sketches how to do it.
As you know, a numerical double integral is just a 2D summation. By default, kde2d takes range(x) and range(y) as the 2D domain. I see that you got a 100 x 100 matrix, so I think you set n = 100 when calling kde2d. Now den$x and den$y define a 100 x 100 grid, with den$z giving the density on each grid cell. It is easy to compute the size of each grid cell (they are all equal); then we do three steps:
find the normalizing constant: although in theory the density sums up (or integrates) to 1, after discretization it only approximates 1, so we first compute this normalizing constant for later rescaling;
the integrand for entropy is z * log(z); since z is a 100 x 100 matrix, this is also a matrix. You simply sum up its entries and multiply by the cell size cell_size; this gives a non-normalized entropy;
rescale the non-normalized entropy to get the normalized one.
## sample data: bivariate normal, with covariance/correlation 0
set.seed(123); x <- rnorm(1000, 0, 2) ## marginal variance: 4
set.seed(456); y <- rnorm(1000, 0, 2) ## marginal variance: 4
## load MASS
library(MASS)
## domain:
xlim <- range(x)
ylim <- range(y)
## 2D Kernel Density Estimation
den <- kde2d(x, y, n = 100, lims = c(xlim, ylim))
##persp(den$x,den$y,den$z)
z <- den$z ## extract density
## den$x, den$y expands a 2D grid, with den$z being density on each grid cell
## numerical integration is straighforward, by aggregation over all cells
## the size of each grid cell (a rectangular cell) is:
cell_size <- (diff(xlim) / 100) * (diff(ylim) / 100)
## normalizing constant; ideally should be 1, but actually only close to 1 due to discretization
norm <- sum(z) * cell_size
## your integrand: z * log(z) * (-1):
integrand <- z * log(z) * (-1)
## get numerical integral by summation:
entropy <- sum(integrand) * cell_size
## self-normalization:
entropy <- entropy / norm
Verification
The above code gives entropy of 4.230938. Now, Wikipedia - Multivariate normal distribution gives entropy formula:
(k / 2) * (1 + log(2 * pi)) + (1 / 2) * log(det(Sigma))
For the above bivariate normal distribution, we have k = 2. We have Sigma (covariance matrix):
4 0
0 4
whose determinant is 16. Hence, the theoretical value is:
(1 + log(2 * pi)) + (1 / 2) * log(16) = 4.224171
Good match!
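That arithmetic is easy to double-check in R (a one-liner of my own, using the same Sigma):
## theoretical entropy of a bivariate normal with Sigma = diag(c(4, 4))
(1 + log(2 * pi)) + 0.5 * log(det(diag(c(4, 4))))
## [1] 4.224171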
