How to check whether the curve is C1 class?

How can I check whether a curve is of class C1 or C2?
Example:
x = [1, 2, 3, ..., 1500]
y = [0.56, 1, 12, 41, ..., 11, 0.11, 3, 23, 95]
Is this curve a C1 class "function"?
Thank you very much.

MATLAB vectors contain samples of the function, not the function itself. Sampled data is always discrete, not continuous.
There are infinitely many functions with the same samples. In particular, there are always both continuous and discontinuous functions with those samples, so there is no way to determine whether the underlying function is C1 from the samples alone.
Example of a continuous function: the Fourier (or DCT) reconstructed estimate.
Example of a discontinuous function: the Fourier reconstructed estimate, plus a sawtooth wave whose period equals the sampling interval, so that it vanishes at every sample point.
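As a toy illustration (with made-up functions, not your data), here is a minimal C++ sketch of two different functions, one smooth and one discontinuous, that produce exactly the same samples at integer x:
#include <cmath>
#include <cstdio>

// Two different functions that agree on every integer sample point:
// f is smooth, while g adds a sawtooth that vanishes at the integers,
// so g is discontinuous yet has identical samples.
double f(double x) { return std::sin(x); }
double saw(double x) { return x - std::floor(x); } // period 1, zero at integers
double g(double x) { return f(x) + saw(x); }

int main()
{
    for (int i = 0; i < 5; ++i)
        std::printf("x=%d  f=%.6f  g=%.6f\n", i, f((double)i), g((double)i));
    return 0;
}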

You can't tell from the data you're given; you have to decide how to represent a function built from it.
For example, if I plot those values as a histogram, the result is discontinuous (it jumps at each point). If I use straight-line interpolation between points, it is C0 continuous. If I use a smooth interpolation such as a spline, I can get C1 continuity, and so on, depending on how I choose to represent the function built from your arrays of data.

While technically you can't check whether the data corresponds to a C1 or C2 curve, you can do something that still might be useful.
C1 means a continuous 1st derivative. So if you calculate the derivative numerically and see big jumps in it, you might suspect that the underlying curve is not C1. (You can't actually guarantee that, but you can guarantee that it is either not C1 or has a derivative outside some bounds.) Conversely, if you don't see any big jumps, then there is a C1 curve with bounded derivative that fits the data; it is just not necessarily the same curve that actually generated the data.
You can do something similar with the numerically calculated second derivative to assess C2 status. (Note that if it's not C1 then it can't be C2, so if the first test fails you can forget about the second.)
Here's roughly how I'd do it in C++ for the C1 case with evenly spaced x points. (If the points are not evenly spaced you'll need to tweak the calculation of s.)
#include <cmath>
#include <iostream>

int main()
{
    const double y[] = { 0.56, 1, 12, 41, /* ... */ 11, 0.11, 3, 23, 95 };
    const unsigned int N = sizeof(y) / sizeof(y[0]);

    double max_abs_slope = 0;
    double sum_abs_slope = 0;
    double sum_abs_slope_sq = 0;
    unsigned int imax = 0;
    for( unsigned int i = 0; i < N-1; ++i )
    {
        // With unit spacing, the finite difference is the slope estimate.
        double s = std::fabs( y[i+1] - y[i] );
        sum_abs_slope += s;
        sum_abs_slope_sq += s*s;
        if( s > max_abs_slope ) { max_abs_slope = s; imax = i; }
    }
    // There are M = N-1 slope samples; we expect the max to be within
    // three std-dev of their average.
    const unsigned int M = N - 1;
    double stddev = std::sqrt( (M*sum_abs_slope_sq - sum_abs_slope*sum_abs_slope)
                               / (double)(M*(M-1)) );
    if( ( max_abs_slope - sum_abs_slope/M ) > 3 * stddev )
    {
        std::cout << "There's an unexpectedly large jump in interval " << imax << std::endl;
    }
    else
    {
        std::cout << "It seems smooth" << std::endl;
    }
    return 0;
}
However, you might use a different threshold than 3*stddev: you might pick an actual limit based on your knowledge of the underlying problem, or you might choose to be stricter (using a value greater than 3) or less strict (less than 3).
I've not tested this code, so it may not run or may be buggy.
I've also not checked that 3*stddev makes sense for any particular curves.
This is very much caveat emptor.

Related

How can I write an arbitrary continuous distribution in Julia, or at least simulate sampling from one?

Suppose I have an arbitrary probability distribution function (PDF) defined as a function f, for example:
using Random, Distributions

# PDF with parameter θ ∈ (0,1), uniform over
# segments [-1,0) and [0,1], zero elsewhere
f = θ -> x ->
    (-1 <= x < 0) ? θ^2 :
    (0 <= x <= 1) ? 1-θ^2 :
    0
How can I sample values from a random variable with this PDF in Julia? (or alternatively, how can I at least simulate sampling from such a random variable?)
I.e., I want the equivalent of rand(Normal(), 10) for 10 values from a (standard) normal distribution, but I want to use the function f to define the distribution used (something like rand(f(0.4), 10), but this doesn't work).
(There is already an answer for discrete distributions at How can I write an arbitrary discrete distribution in Julia?; however, I want to use a continuous distribution. There are some details on creating a sampler at https://juliastats.org/Distributions.jl/v0.14/extends.html which I think might be useful, but I don't understand how to apply them. Also, in R I've used the inverse CDF technique as described at https://blogs.sas.com/content/iml/2013/07/22/the-inverse-cdf-method.html for simulating such random variables, but I am unsure how it might best be implemented in Julia.)
The first problem is that what you've provided is not a complete specification of a probability distribution, since it doesn't say anything about the distribution within the interval [-1, 0) or within the interval [0, 1]. So for the purposes of this answer I'm going to assume your probability distribution function is uniform on each of these intervals. Given that, I would argue the most Julian way to implement your own distribution is to create a new subtype, in this case of ContinuousUnivariateDistribution. Example code follows:
using Distributions

struct MyDistribution <: ContinuousUnivariateDistribution
    theta::Float64
    function MyDistribution(theta::Float64)
        !(0 <= theta <= 1) && error("Invalid theta: $(theta)")
        new(theta)
    end
end

function Distributions.rand(d::MyDistribution)::Float64
    # With probability theta^2 draw uniformly from [-1, 0),
    # otherwise draw uniformly from [0, 1].
    if rand() < d.theta^2
        x = rand() - 1
    else
        x = rand()
    end
    return x
end

function Distributions.quantile(d::MyDistribution, p::Real)::Float64
    !(0 <= p <= 1) && error("Invalid probability input: $(p)")
    # Invert the piecewise-linear CDF.
    if p < d.theta^2
        x = -1.0 + (p / d.theta^2)
    else
        x = (p - d.theta^2) / (1 - d.theta^2)
    end
    return x
end
In the above code I have implemented rand and quantile methods for the new distribution, which is the minimum needed to make function calls like rand(MyDistribution(0.4), 20) to sample 20 random numbers from the new distribution. See here for a list of other methods you may want to add to your new distribution type (depending on your use case, perhaps you won't bother).
Note that if efficiency is an issue, you may want to look into methods that let you minimise the number of d.theta^2 operations, e.g. Distributions.sampler. Alternatively, you could just store theta^2 internally in MyDistribution but always display the underlying theta. Up to you really.
Finally, you don't really need the type annotations on the function outputs. I've just included them for clarity.

Don't understand code that gets random number between two values

So, I understand this seems like a stupid question. That said, in one of my classes concerned with teaching proper code, I came upon this: Min + (Math.random() * ((Max - Min) + 1)). Essentially the code says: add your minimum value to a random number between 0.0 and 1.0 multiplied by the maximum value minus the minimum value plus 1. The book presents this code as a basis for retrieving a random value within certain parameters, e.g. Max = 40, Min = 20 would get a value between 20 and 40.
The thing is, I know what the code is saying and doing. I was using this to generate a random character by casting the result to (char) and using 'a' and 'z' as the values. However, I don't understand how, mathematically speaking, this even works. I understand it makes me look like a pretty poor programmer; I never claimed to be great or brilliant. I know algebra and some basic higher math concepts, but there are some basic formulas like this that leave me scratching my head.
In terms of programming logic this isn't so much an issue for me, but seeing concepts like this, I'm confused; I don't get the mathematical logic of this code. Am I missing anything? I mean, with a random value between 0.0 and 1.0, I don't see how it produces a value between the minimum and maximum value. Would anybody be willing to give me a layman's explanation of how this works?
It is called linear interpolation (or sometimes linear extrapolation, depending on whether you are enlarging or shrinking the dynamic range). Anyway, the idea behind changing the dynamic range is this:
Let's have:
x = < x0 , x1 > // input range
and we want to change it to:
y = < y0 , y1 > // output range
So let me derive it step by step:
// equation                         range            operation
y = x                            // < x0 , x1 >      -x0
y = x-x0                         // < 0 , x1-x0 >    /(x1-x0)
y = (x-x0)/(x1-x0)               // < 0 , 1 >        *(y1-y0)
y = (y1-y0)(x-x0)/(x1-x0)        // < 0 , y1-y0 >    +y0
y = y0 + (y1-y0)(x-x0)/(x1-x0)   // < y0 , y1 >
Now, I suspect x = Math.random() returns values x = <0,1>, and we want the result in <y0,y1> = <min,max>, so:
y = min + (max-min)(x-0)/(1-0)
y = min + (max-min)*x
The +1 either results in the range <min,max+1>, or, since your Math.random() is always < 1, it compensates for the half-open range <min,max) so that after truncation you get integers from min to max inclusive. Hard to say without context (I do not code in your language; assuming Java or something similar, as I am more of a C++ guy).
For simplicity: linear interpolation/extrapolation obtains values between two edges/points/values linearly, with some parameter t = <0,1>:
x(t) = x0 + (x1-x0)*t
if (t=0)   then x(0)   = x0
if (t=1)   then x(1)   = x1
if (t=0.5) then x(0.5) = the middle between x0 and x1
If t = <0,1> we are talking about linear interpolation. If t is outside this range we are talking about linear extrapolation (the equation is the same).
Linear means that when you sample points/values with a constant t step, the resulting values also have a constant distance between them, and they all lie on a single line.
Hope it is clear now.
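Here is a minimal C++ sketch of the whole formula, where std::rand()/(RAND_MAX+1.0) stands in for a Java-style Math.random() in [0,1):
#include <cstdlib>
#include <cstdio>

// Map a parameter t in [0,1) onto [x0, x1) by linear interpolation.
double lerp(double x0, double x1, double t) { return x0 + (x1 - x0) * t; }

int main()
{
    std::srand(42);
    double t = std::rand() / (RAND_MAX + 1.0);     // t in [0,1), like Math.random()
    // For integer results in [20, 40], stretch by (Max - Min + 1) and truncate.
    int v = (int)lerp(20.0, 20.0 + (40 - 20 + 1), t);
    std::printf("t = %f -> v = %d\n", t, v);
    return 0;
}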
Imagine a rubber fiber spanned between points 0 and 1 (a line segment).
Sprinkle some dye drops on it: you have generated random values on the 0..1 interval.
Now fix the left point and stretch this fiber until its length becomes Max - Min.
Then shift it right by Min.
You can now see the colored points (random values) on the interval Min..Max.
In general, this is a linear transformation of one interval (0..1) into another (Min..Max). Note that the initial interval could be arbitrary.

How to compare two curves (arrays of points)

I have 2 arrays of points (x, y); with those points I can draw 2 curves.
Does anyone have ideas on how to calculate how similar those curves are?
You can always calculate the area between those two curves. (This is a bit easier if the endpoints match.) The curves are similar if the area is small, not so similar if the area is not small.
Note that I did not define 'small'. That was intentional. Then again, you didn't define 'similar'.
Edit
Sometimes area isn't the best metric. For example, consider the functions f(x)=0 and g(x)=1e6*sin(x). If the range of x is some integer multiple of 2*pi, the (signed) area between these curves is zero. A function that oscillates between plus and minus one million is not a good approximation of f(x)=0.
A better metric is needed. Here are a couple, with a sketch of both after the list. Note: I am assuming here that the x values are identical in the two sets; the only things that differ are the y values.
Sum of squares. For each x value, compute delta_y_i = y1_i - y2_i and accumulate delta_y_i^2. This metric is the basis for least-squares optimization, where the goal is to minimize the sum of the squares of the errors. It is a widely used approach because it is often fairly easy to implement.
Maximum deviation. Find the value of |y1_i - y2_i| that is largest over all x values. This metric is the basis for a lot of the implementations of the functions in the math library, where the goal is to minimize the maximum error. Those math library implementations are approximations of the true function. As a consumer of such an approximation, I typically care more about the worst thing the approximation will do to my application than about how it behaves on average.
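A minimal C++ sketch of both metrics on made-up data (the vectors here are hypothetical):
#include <vector>
#include <cmath>
#include <algorithm>
#include <cstdio>

// Sum-of-squares and maximum-deviation metrics for two curves that
// share the same x values (only the y values differ).
int main()
{
    std::vector<double> y1 = {0.0, 1.0, 4.0, 9.0};
    std::vector<double> y2 = {0.1, 0.9, 4.2, 8.8};
    double sum_sq = 0.0, max_dev = 0.0;
    for (size_t i = 0; i < y1.size(); ++i) {
        double d = y1[i] - y2[i];
        sum_sq += d * d;                          // accumulate delta_y_i^2
        max_dev = std::max(max_dev, std::fabs(d)); // track worst deviation
    }
    std::printf("sum of squares = %g, max deviation = %g\n", sum_sq, max_dev);
    return 0;
}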
You might want to consider using Dynamic Time Warping (DTW) or the Fréchet distance.
Dynamic Time Warping
Dynamic Time Warping sums the differences along the entire curve. It can handle two arrays of different sizes. Here is a snippet from Wikipedia showing how the code might look (a runnable translation follows below). This solution uses a two-dimensional array. The cost is the distance between two points, and the final value DTW[n, m] contains the cumulative distance.
int DTWDistance(s: array [1..n], t: array [1..m]) {
    DTW := array [0..n, 0..m]
    for i := 1 to n
        DTW[i, 0] := infinity
    for i := 1 to m
        DTW[0, i] := infinity
    DTW[0, 0] := 0
    for i := 1 to n
        for j := 1 to m
            cost := d(s[i], t[j])
            DTW[i, j] := cost + minimum(DTW[i-1, j  ],  // insertion
                                        DTW[i  , j-1],  // deletion
                                        DTW[i-1, j-1])  // match
    return DTW[n, m]
}
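Here is a minimal, self-contained C++ translation of that pseudocode, using |s[i] - t[j]| as the local cost d (the sample sequences are made up):
#include <vector>
#include <cmath>
#include <algorithm>
#include <limits>
#include <cstdio>

// DTW over two 1-D sequences; DTW[n][m] is the cumulative distance.
double dtw_distance(const std::vector<double>& s, const std::vector<double>& t)
{
    const size_t n = s.size(), m = t.size();
    const double inf = std::numeric_limits<double>::infinity();
    std::vector<std::vector<double>> DTW(n + 1, std::vector<double>(m + 1, inf));
    DTW[0][0] = 0.0;
    for (size_t i = 1; i <= n; ++i) {
        for (size_t j = 1; j <= m; ++j) {
            double cost = std::fabs(s[i-1] - t[j-1]);
            DTW[i][j] = cost + std::min({DTW[i-1][j],     // insertion
                                         DTW[i][j-1],     // deletion
                                         DTW[i-1][j-1]}); // match
        }
    }
    return DTW[n][m];
}

int main()
{
    std::vector<double> a = {0, 1, 2, 3}, b = {0, 1, 1, 2, 3};
    std::printf("DTW = %f\n", dtw_distance(a, b));
    return 0;
}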
DTW is similar to Jacopson's answer.
Fréchet Distance
The Fréchet distance measures the farthest the curves get from each other: all other matched points on the curves are closer together than this distance. This approach is typically illustrated with a dog and its owner, as shown here:
Fréchet Distance Example.
Depending on your arrays, you can compare the distances of the points and use the maximum.
I assume a curve is an array of 2D points over the real numbers, and that the size of the array is N; I call p[i] the i-th point of the curve, where i goes from 0 to N-1.
I also assume that the two curves have the same size and that it is meaningful to compare the i-th point of the first curve with the i-th point of the second curve.
I call Delta, a real number, the result of the comparison of the two curves.
Delta can be computed as follows:
Delta = 0;
for( i = 0; i < N; i++ ) {
    Delta = Delta + distance(p[i], q[i]);
}
where p contains points from the first curve and q contains points from the second curve.
Now you have to choose a suitable distance function depending on your problem: the function takes two points as arguments and returns a real number.
For example, distance can be the usual distance between two points in the plane (the Pythagorean theorem, see http://en.wikipedia.org/wiki/Euclidean_distance).
An example of the method in C++:
#include <numeric>
#include <vector>
#include <cmath>
#include <iostream>
#include <functional>
#include <stdexcept>
#include <initializer_list>

typedef double Real_t;

class Point
{
public:
    Point() {}
    Point( std::initializer_list<Real_t> args ) : x(args.begin()[0]), y(args.begin()[1]) {}
    Point( const Real_t& xx, const Real_t& yy ) : x(xx), y(yy) {}
    Real_t x, y;
};

typedef std::vector< Point > Curve;

// Euclidean distance between two points.
Real_t point_distance( const Point& a, const Point& b )
{
    return std::hypot( a.x-b.x, a.y-b.y );
}

// Sum of point-wise distances between two curves of equal size.
Real_t curve_distance( const Curve& c1, const Curve& c2 )
{
    if ( c1.size() != c2.size() ) throw std::invalid_argument("size mismatch");
    return std::inner_product( c1.begin(), c1.end(), c2.begin(), Real_t(0),
                               std::plus< Real_t >(), point_distance );
}

int main(int,char**)
{
    Curve c1{{0,0},
             {1,1},
             {2,4},
             {3,9}};
    Curve c2{{0.1,-0.1},
             {1.1,0.9},
             {2.1,3.9},
             {3.1,8.9}};
    std::cout << curve_distance(c1,c2) << "\n";
    return 0;
}
If your two curves have different sizes then you have to think about how to extend the previous method; for example, you can reduce the size of the longer curve by means of a suitable algorithm (the Ramer–Douglas–Peucker algorithm can be a starting point) in order to match the size of the shorter curve.
I have just described a very simple method; you can also take different approaches. For example, you can fit two curves to the two sets of points and then work with the curves expressed as mathematical functions.
This can also be solved by thinking in terms of distributions, especially if the position of a value is interchangeable within an array.
You could then calculate the mean and the standard deviation (and other distribution characteristics) for both arrays and compare those characteristics, as in the sketch below.
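A minimal C++ sketch of that idea (the input vectors are made up; only mean and standard deviation are compared):
#include <vector>
#include <cmath>
#include <cstdio>

// Compare two curves via distribution characteristics: the differences
// in mean and standard deviation of the y values (order-insensitive).
double mean(const std::vector<double>& v)
{
    double s = 0;
    for (double x : v) s += x;
    return s / v.size();
}

double stdev(const std::vector<double>& v)
{
    double m = mean(v), s = 0;
    for (double x : v) s += (x - m) * (x - m);
    return std::sqrt(s / (v.size() - 1)); // sample standard deviation
}

int main()
{
    std::vector<double> y1 = {0, 1, 4, 9}, y2 = {0.1, 0.9, 4.2, 8.8};
    std::printf("dmean = %g, dstd = %g\n",
                std::fabs(mean(y1) - mean(y2)),
                std::fabs(stdev(y1) - stdev(y2)));
    return 0;
}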

Combining two normal random variables

Suppose I have the following 2 random variables:
X where mean = 6 and stdev = 3.5
Y where mean = -42 and stdev = 5
I would like to create a new random variable Z based on the first two, knowing that X happens 90% of the time and Y happens 10% of the time.
It is easy to calculate the mean for Z: 0.9 * 6 + 0.1 * (-42) = 1.2.
But is it possible to generate random values for Z in a single function?
Of course, I could do something along these lines:
if (randIntBetween(1,10) > 1)
    GenerateRandomNormalValue(6, 3.5);
else
    GenerateRandomNormalValue(-42, 5);
But I would really like to have a single function that would act as a probability density function for such a random variable (Z), which is not necessarily normal.
Sorry for the crappy pseudo-code.
Thanks for your help!
Edit: here is one concrete question:
Let's say we add 5 consecutive values from Z. What would be the probability of ending up with a number higher than 10?
But I would really like to have a single function that would act as a probability density function for such a random variable (Z) that is not necessarily normal.
Okay, if you want the density, here it is:
rho = 0.9 * density_of_x + 0.1 * density_of_y
But you cannot sample from this density unless you 1) compute its CDF (cumbersome, but not infeasible) and 2) invert it (you will need a numerical solver for this). Or you can do rejection sampling (or variants, e.g. importance sampling), which is costly and cumbersome to get right.
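For reference, a minimal C++ sketch of that density as code (the normal_pdf helper is written out by hand and the printed point is arbitrary):
#include <cmath>
#include <cstdio>

const double PI = 3.14159265358979323846;

// Density of a normal distribution with mean mu and std-dev sigma.
double normal_pdf(double x, double mu, double sigma)
{
    double z = (x - mu) / sigma;
    return std::exp(-0.5 * z * z) / (sigma * std::sqrt(2.0 * PI));
}

// The mixture density from above: rho = 0.9*density_of_x + 0.1*density_of_y.
double rho(double x)
{
    return 0.9 * normal_pdf(x, 6.0, 3.5) + 0.1 * normal_pdf(x, -42.0, 5.0);
}

int main()
{
    std::printf("rho(1.2) = %g\n", rho(1.2));
    return 0;
}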
So you should go for the "if" statement (i.e. call the generator 3 times), unless you have a very strong reason not to (if you are using quasi-random sequences, for instance).
If a random variable is denoted x = (mean, stdev), then the following algebra applies:
number * x = ( number*mean, number*stdev )
x1 + x2 = ( mean1+mean2, sqrt(stdev1^2+stdev2^2) )
So for the case of X = (mx,sx), Y = (my,sy), the linear combination is:
Z = w1*X + w2*Y = (w1*mx,w1*sx) + (w2*my,w2*sy) =
    ( w1*mx+w2*my, sqrt( (w1*sx)^2+(w2*sy)^2 ) ) =
    ( 1.2, 3.19 )
(Note that this is the distribution of the weighted sum of two independent normal variables, which is itself normal; the 90%/10% mixture described in the question has the same mean but a different, non-normal distribution.)
Link: Normal Distribution; look for the Miscellaneous section, item 1.
PS. Sorry for the weird notation. The new standard deviation is calculated by something similar to the Pythagorean theorem: it is the square root of the sum of squares.
This is the form of the distribution:
ListPlot[BinCounts[Table[If[RandomReal[] < .9,
                            RandomReal[NormalDistribution[6, 3.5]],
                            RandomReal[NormalDistribution[-42, 5]]], {1000000}],
                   {-60, 20, .1}],
         PlotRange -> Full, DataRange -> {-60, 20}]
It is NOT normal, as you are not adding normal variables but choosing one or the other with a certain probability.
Edit
This is the curve for adding five variables with this distribution:
The upper and lower peaks represent taking one of the distributions alone, and the middle peak accounts for the mixing.
The most straightforward and generically applicable solution is to simulate the problem:
Run the piecewise function you have 1,000,000 times (just a high number), generate a histogram of the results (by splitting them into bins), and divide the count for each bin by N (1,000,000 in my example). This will leave you with an approximation of the PDF of Z at every given bin; a sketch follows.
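A minimal C++ sketch of that simulation, which also estimates the edit's question about the sum of 5 consecutive values (the seed and trial count are arbitrary):
#include <random>
#include <cstdio>

// Monte-Carlo sketch: sample the 90/10 mixture of N(6,3.5) and N(-42,5),
// then estimate P(sum of 5 consecutive draws > 10).
int main()
{
    std::mt19937 gen(12345);
    std::bernoulli_distribution pick(0.9);
    std::normal_distribution<double> X(6.0, 3.5), Y(-42.0, 5.0);
    auto draw_z = [&]() { return pick(gen) ? X(gen) : Y(gen); };

    const int trials = 1000000;
    int hits = 0;
    for (int t = 0; t < trials; ++t) {
        double sum = 0.0;
        for (int k = 0; k < 5; ++k) sum += draw_z();
        if (sum > 10.0) ++hits;
    }
    std::printf("P(sum of 5 > 10) ~= %f\n", (double)hits / trials);
    return 0;
}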
There are lots of unknowns here, but essentially you just want to add the two (or more) probability functions to one another.
For any given probability density function, you could calculate a random number with that density by calculating the area under the probability curve (the integral) and then generating a random number between 0 and that area. Then move along the curve until the accumulated area is equal to your random number, and use that location as your value.
This process can then be generalized to any function (or sum of two or more functions).
Elaboration:
Suppose you have a distribution function f(x) defined on the range 0 to 1. You could calculate a random number based on the distribution by calculating the integral of f(x) from 0 to 1, giving you the area under the curve; let's call it A.
Now, generate a random number between 0 and A; let's call that number r. You then need to find a value t such that the integral of f(x) from 0 to t is equal to r. That t is your random number.
This process can be used for any probability density function f(x), including the sum of two (or more) probability density functions.
I'm not sure what your functions look like, so I'm not sure whether you can find analytic solutions for all of this, but in the worst case you could use numeric techniques to approximate the effect, as in the sketch below.
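A numeric C++ sketch of that procedure, with a made-up unnormalized density f on [0, 1] (trapezoid-rule integration plus bisection; all names here are hypothetical):
#include <cmath>
#include <cstdio>
#include <random>

// Assumed density shape, nonnegative on [0, 1]; replace with your own f.
double f(double x) { return 1.0 + std::sin(3.0 * x); }

// Trapezoid-rule approximation of the integral of f over [a, b].
double area(double a, double b, int steps = 1000)
{
    double h = (b - a) / steps, s = 0.5 * (f(a) + f(b));
    for (int i = 1; i < steps; ++i) s += f(a + i * h);
    return s * h;
}

int main()
{
    std::mt19937 gen(42);
    double A = area(0.0, 1.0);                       // total area under f
    std::uniform_real_distribution<double> U(0.0, A);
    double r = U(gen);                               // random area target
    double lo = 0.0, hi = 1.0;                       // bisect for area(0, t) == r
    for (int it = 0; it < 50; ++it) {
        double mid = 0.5 * (lo + hi);
        (area(0.0, mid) < r ? lo : hi) = mid;
    }
    std::printf("sampled t = %f\n", 0.5 * (lo + hi));
    return 0;
}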

Can I force two components in a three-way linear regression to be positive?

I'm sorry if I'm not using the correct mathematical terms, but I hope you'll understand what I'm trying to accomplish.
My problem:
I'm using linear regression (currently the least squares method) on the values from two vectors x and y against the result z. This is to be done in MATLAB, and I'm using the \-operator to perform the regression. My dataset will contain a few thousand observations (up to about 50000 at most).
The x values will be in the range 10-300 (most between 60 and 100) and the y values in the 1-3 range.
My code looks like this:
X = [ones(size(x,1),1) x y];
parameters = X\z;
The output parameters then holds the three factors a0, a1 and a2, which are used in this formula:
a0 * 1 + a1 * x_i + a2 * y_i = z_i
This works as expected, although I want the two parameters a1 and a2 to ALWAYS be positive values, even when the vector z is negative (which means that a0 will be negative, of course), since this is what the real model looks like (z is always positively correlated to x and y). Is this possible using the least squares method? I'm also open to other algorithms for linear regression.
Let me try to rephrase to clarify. According to your model, z is always positively correlated with x and y. However, sometimes when you solve the linear regression, a coefficient comes out negative.
If you are right about the data, this should only happen when the correct coefficient is small and noise happens to push it negative. You could just set it to zero, but then the means wouldn't match properly.
In that case the correct solution is as jpalacek says, but explained in more detail here:
Regress z against x and y. If both coefficients are positive, take the result.
If a1 is negative, assume it should be zero. Regress z against y alone. If a2 is positive, take a1 as 0 and take a0 and a2 from this regression.
If a2 is also negative, assume it should be zero too. Regress z against 1 alone, and take the result as a0. Let a1 and a2 be 0.
This should give you what you want.
The simple solution is to use a tool designed to solve it, namely lsqlin from the Optimization Toolbox. Set a lower-bound constraint for two of the three parameters.
Thus, assuming x, y, and z are all COLUMN vectors:
A = [ones(length(x),1),x,y];
lb = [-inf, 0, 0];
a = lsqlin(A,z,[],[],[],[],lb);
This constrains only the second and third unknown parameters.
Without the Optimization Toolbox, use lsqnonneg, which is part of MATLAB itself. Here too the solution is easy enough:
A = [ones(length(x),1),x,y];
a = lsqnonneg(A,z);
Your model will be
z = a(1) + a(2)*x + a(3)*y
If a(1) is essentially zero, i.e. within a tolerance of zero, then assume the first parameter was constrained by the bound at zero. In that case, solve a second problem with the sign changed on the column of ones in A:
A(:,1) = -1;
a = lsqnonneg(A,z);
If this solution has a(1) significantly non-zero, then the second solution must be better than the first. Your model will now be
z = -a(1) + a(2)*x + a(3)*y
It costs you at most two calls to lsqnonneg, and the second call is only made some fraction of the time (lacking any information about your problem, the odds are about 50%).
