Generating random numbers from various distributions in CUDA - r

I am playing around with doing MCMC on the GPU, and need implementations for various samplers, written for CUDA.
Most of the posts I've seen on StackOverflow relate to uniform, binomial, and normal sampling. Are there any libraries that allow me the simplicity and variety of the d-p-q-r functions in R (See this page)?
I would like to be able to sample from Gamma, Normal, Binomial, and the inverse distributions used in Bayesian problems (inverse chi square, inverse gamma), and would prefer not to have to write my own using inverse probability transforms and acceptance-rejection sampling.

For the Gamma distribution, this is what I use at the moment.
It is the GSL function modified to work with CuRAND.
__device__ double ran_gamma (curandState localState, const double a, const double b){
/* assume a > 0 */
if (a < 1){
double u = curand_uniform_double(&localState);
return ran_gamma (localState, 1.0 + a, b) * pow (u, 1.0 / a);
}
{
double x, v, u;
double d = a - 1.0 / 3.0;
double c = (1.0 / 3.0) / sqrt (d);
while (1){
do{
x = curand_normal_double(&localState);
v = 1.0 + c * x;
} while (v <= 0);
v = v * v * v;
u = curand_uniform_double(&localState);
if (u < 1 - 0.0331 * x * x * x * x)
break;
if (log (u) < 0.5 * x * x + d * (1 - v + log (v)))
break;
}
return b * d * v;
}
}

Related

Simplex noise magic numbers explanation

I was reading The Book of Shader's chapter about simplex noise (click for full code), and had difficulty understanding a few magic numbers used here. This will not be a bug related thread, but should make sense under SO's community criteria.
See these lines:
vec3 p = permute( permute( i.y + vec3(0.0, i1.y, 1.0 ))
+ i.x + vec3(0.0, i1.x, 1.0 ));
// random numbers for gradient generation
// element wise: 0.5 - x ^ 4
// use max clamp element wise: if x < 0 then m = 0 (i.e. the gradient from the vertex is 0)
// x1, x2, x3: 3 verteces of triangle simplex
// dot product is the distance from v to simpelx verteces
vec3 m = max(0.5 - vec3(dot(x0,x0), dot(x12.xy,x12.xy), dot(x12.zw,x12.zw)), 0.0);
m = m*m ;
m = m*m ;
vec3 x = 2.0 * fract(p * C.www) - 1.0; // gradient? 2 * fract(x / 41) - 1, in [-1, 1]
vec3 h = abs(x) - 0.5; // in [-0.5, 0.5]
vec3 ox = floor(x + 0.5); // in [-1, 1]
vec3 a0 = x - ox;
// (x - ox) ^ 2 + (abs(x) - .5) ^ 2
m *= 1.79284291400159 - 0.85373472095314 * ( a0 * a0 + h * h ); // ???
Meanwhile I understand in another java implementation they used a precomputed 2d array table and a random index to look up gradients, the above lines don't make much sense for me. I guess m stands for weights for each vertex's gradient contribution, and the rest remains a puzzle.
Hope there could be resources / comments help me out on understanding this snippet.

math behind normal compression

below codes(rewritten by C#) are used to compress unit normal vector from Wild Magic 5.17,could someone explain some math behind them or share some related refs ? I can figure out the octant bits setting, but the mantissa packing and unpacking seem complex ...
codes gist
some of codes here
// ...
public static ushort CompressNormal(Vector3 normal)
{
var x = normal.x;
var y = normal.y;
var z = normal.z;
Debug.Assert(MathUtil.IsSame(x * x + y * y + z * z, 1));
// Determine octant.
ushort index = 0;
if (x < 0.0)
{
index |= 0x8000;
x = -x;
}
if (y < 0.0)
{
index |= 0x4000;
y = -y;
}
if (z < 0.0)
{
index |= 0x2000;
z = -z;
}
// Determine mantissa.
ushort usX = (ushort)Mathf.Floor(gsFactor * x);
ushort usY = (ushort)Mathf.Floor(gsFactor * y);
ushort mantissa = (ushort)(usX + ((usY * (255 - usY)) >> 1));
index |= mantissa;
return index;
}
// ...
Author wanted to use 13 bits.
Trivial way: 6 bits for x component + 6 bits for y - occupies only 12 bits, so he invented approach to assign ~90 (lsb) units for x and and ~90 (msb) units for y (90*90~2^13).
I have no idea why he uses quadratic formula for y-component - this way gives slightly different distribution of approximated values between smaller and larger values - but why specifically for y?
I've asked Mr. Eberly (author of Wild Magic) and he gives the ref, desc in short, codes above try to map (x, y) to an index of triangular array (index is from 0 to N * (N + 1) / 2 - 1)
more details are in the related doc here,
btw, another solution here with a different compress method.

Implementing triangle, sawtooth and reverse sawtooth formulas in program code

I've looked at the formulas for these waves but I can't figure out how to implement them. I was able to figure you the SINE and SQUARE waves:
float x = note.frequency / AppSettings::sampleRate;
float theta_increment = 2.0f * M_PI * x;
float value = 0;
if(waveType == SINE){
value = sin(theta_increment);
}
else if (waveType == SQUARE){
value = sin(note.theta);
value = (value > 0) - (value < 0);
}
The formula I tried was based on this example and the explanation from wiki:
square(t) = sgn(sin(2πt))
// this is how I tried to implement it
theta_increment - floor(theta_increment - 0.5f);
But this generates a very low sounding tone and the frequency change doesn't seem to have any effect (not one that I can hear anyway). So cans someone help me with implementing sawtooth and triangle? Some explanation would be very helpful to because unlike sine and square I don't understand these formulas very well.
Delphi code. I hope that formulas are clear. Frequencies and magnitudes are consistent.
w := 1.0; // angular frequency
for i := 0 to 999 do begin
t := i * 2 * Pi / 400 - 3/2 * Pi; // just X-axis scale
wt := w * t;
f := wt / (2.0 * Pi); //frequency
sn := sin(wt); // sine wave
saw := 2.0 * (f - Floor(f)) - 1.0; //sawtooth
f := f + 0.25;
tr := Abs(4 * (f - Floor(f + 0.5))) - 1.0; //triangle
Series1.AddXY(t, sn);
Series2.AddXY(t, saw);
Series3.AddXY(t, tr);
end;
Result:

Calculate bessel function in MATLAB using Jm+1=2mj(m) -j(m-1) formula

I tried to implement bessel function using that formula, this is the code:
function result=Bessel(num);
if num==0
result=bessel(0,1);
elseif num==1
result=bessel(1,1);
else
result=2*(num-1)*Bessel(num-1)-Bessel(num-2);
end;
But if I use MATLAB's bessel function to compare it with this one, I get too high different values.
For example if I type Bessel(20) it gives me 3.1689e+005 as result, if instead I type bessel(20,1) it gives me 3.8735e-025 , a totally different result.
such recurrence relations are nice in mathematics but numerically unstable when implementing algorithms using limited precision representations of floating-point numbers.
Consider the following comparison:
x = 0:20;
y1 = arrayfun(#(n)besselj(n,1), x); %# builtin function
y2 = arrayfun(#Bessel, x); %# your function
semilogy(x,y1, x,y2), grid on
legend('besselj','Bessel')
title('J_\nu(z)'), xlabel('\nu'), ylabel('log scale')
So you can see how the computed values start to differ significantly after 9.
According to MATLAB:
BESSELJ uses a MEX interface to a Fortran library by D. E. Amos.
and gives the following as references for their implementation:
D. E. Amos, "A subroutine package for Bessel functions of a complex
argument and nonnegative order", Sandia National Laboratory Report,
SAND85-1018, May, 1985.
D. E. Amos, "A portable package for Bessel functions of a complex
argument and nonnegative order", Trans. Math. Software, 1986.
The forward recurrence relation you are using is not stable. To see why, consider that the values of BesselJ(n,x) become smaller and smaller by about a factor 1/2n. You can see this by looking at the first term of the Taylor series for J.
So, what you're doing is subtracting a large number from a multiple of a somewhat smaller number to get an even smaller number. Numerically, that's not going to work well.
Look at it this way. We know the result is of the order of 10^-25. You start out with numbers that are of the order of 1. So in order to get even one accurate digit out of this, we have to know the first two numbers with at least 25 digits precision. We clearly don't, and the recurrence actually diverges.
Using the same recurrence relation to go backwards, from high orders to low orders, is stable. When you start with correct values for J(20,1) and J(19,1), you can calculate all orders down to 0 with full accuracy as well. Why does this work? Because now the numbers are getting larger in each step. You're subtracting a very small number from an exact multiple of a larger number to get an even larger number.
You can just modify the code below which is for the Spherical bessel function. It is well tested and works for all arguments and order range. I am sorry it is in C#
public static Complex bessel(int n, Complex z)
{
if (n == 0) return sin(z) / z;
if (n == 1) return sin(z) / (z * z) - cos(z) / z;
if (n <= System.Math.Abs(z.real))
{
Complex h0 = bessel(0, z);
Complex h1 = bessel(1, z);
Complex ret = 0;
for (int i = 2; i <= n; i++)
{
ret = (2 * i - 1) / z * h1 - h0;
h0 = h1;
h1 = ret;
if (double.IsInfinity(ret.real) || double.IsInfinity(ret.imag)) return double.PositiveInfinity;
}
return ret;
}
else
{
double u = 2.0 * abs(z.real) / (2 * n + 1);
double a = 0.1;
double b = 0.175;
int v = n - (int)System.Math.Ceiling((System.Math.Log(0.5e-16 * (a + b * u * (2 - System.Math.Pow(u, 2)) / (1 - System.Math.Pow(u, 2))), 2)));
Complex ret = 0;
while (v > n - 1)
{
ret = z / (2 * v + 1.0 - z * ret);
v = v - 1;
}
Complex jnM1 = ret;
while (v > 0)
{
ret = z / (2 * v + 1.0 - z * ret);
jnM1 = jnM1 * ret;
v = v - 1;
}
return jnM1 * sin(z) / z;
}
}

Trilateration with limits?

I'm in need of help solving an issue, the problem came up doing one of my small robot experiments, the basic idea, is that each little robot has the ability to approximate the distance, from themselves to an object, however the approximate I'm getting is way too rough, and I'm hoping to calculate something more accurate.
So:
Input: A list of vertex (v_1, v_2, ... v_n), a vertex v_* (robots)
Output: The coordinates for the unknown vertex v_* (object)
Each vertex v_1 to v_n's coordinates are well known (supplied by calling getX() and getY() on the vertex), and its possible to get the approximate range to v_* by calling; getApproximateDistance(v_*), function getApproximateDistance() returns two variables variables, that is; minDistance and maxDistance. - The actual distance lies in between these.
So what I've been trying to do to obtain the coordinates for v_*, is to use trilateration, however I can't seem to find a formula for doing trilateration with limits (lower and upperbound), so that's really what I'm looking for (not really good enough at math, to figure it out myself).
Note: is triangulation the way to go instead?
Note: I would possibly love to know a way to do, performance/accuracy trade-offs.
An example of data:
[Vertex . `getX()` . `getY()` . `minDistance` . `maxDistance`]
[`v_1` . 2 . 2 . 0.5 . 1 ]
[`v_2` . 1 . 2 . 0.3 . 1 ]
[`v_3` . 1.5 . 1 . 0.3 . 0.5]
Picture to show data: http://img52.imageshack.us/img52/6414/unavngivetcb.png
It's obvious that the approximate for v_1 can be better, than [0.5; 1], as the figure that the above data creates is small cut of a annulus (limited by v_3), however how would I calculate that, and possibly find the approximate within that figure (this figure is possibly concave)?
Would this be better suited for MathOverflow?
I would go for a simple discrete approach. The implicit formula for an annulus is trivial and the intersection of multiple annulus if the number of them is high can be computed somewhat efficently with a scanline based approach.
For getting high accuracy with a fast computation an option could be using a multiresolution approach (i.e. first starting in low-res and then recomputing in high-res only samples that are close to a valid point.
A small python toy I wrote can generate a 400x400 pixel image of the intersection area in about 0.5 secs (this is the kind of computation that would get a 100x speedup if done with C).
# x, y, r0, r1
data = [(2.0, 2.0, 0.5, 1.0),
(1.0, 2.0, 0.3, 1.0),
(1.5, 1.0, 0.3, 0.5)]
x0 = max(x - r1 for x, y, r0, r1 in data)
y0 = max(y - r1 for x, y, r0, r1 in data)
x1 = min(x + r1 for x, y, r0, r1 in data)
y1 = min(y + r1 for x, y, r0, r1 in data)
def hit(x, y):
for cx, cy, r0, r1 in data:
if not (r0**2 <= ((x - cx)**2 + (y - cy)**2) <= r1**2):
return False
return True
res = 400
step = 16
white = chr(255)
grey = chr(192)
black = chr(0)
img = [black] * (res * res)
# Low-res pass
cells = {}
for i in xrange(0, res, step):
y = y0 + i * (y1 - y0) / res
for j in xrange(0, res, step):
x = x0 + j * (x1 - x0) / res
if hit(x, y):
for h in xrange(-step*2, step*3, step):
for v in xrange(-step*2, step*3, step):
cells[(i+v, j+h)] = True
# High-res pass
for i in xrange(0, res, step):
for j in xrange(0, res, step):
if cells.get((i, j), False):
img[i * res + j] = grey
img[(i + step - 1) * res + j] = grey
img[(i + step - 1) * res + (j + step - 1)] = grey
img[i * res + (j + step - 1)] = grey
for v in xrange(step):
y = y0 + (i + v) * (y1 - y0) / res
for h in xrange(step):
x = x0 + (j + h) * (x1 - x0) / res
if hit(x, y):
img[(i + v)*res + (j + h)] = white
open("result.pgm", "wb").write(("P5\n%i %i 255\n" % (res, res)) +
"".join(img))
Another interesting option could be using a GPU if available. Starting from a white picture and drawing in black the exterior of each annulus will leave at the end the intersection area in white.
For example with Python/Qt the code for doing this computation is simply:
img = QImage(res, res, QImage.Format_RGB32)
dc = QPainter(img)
dc.fillRect(0, 0, res, res, QBrush(QColor(255, 255, 255)))
dc.setPen(Qt.NoPen)
dc.setBrush(QBrush(QColor(0, 0, 0)))
for x, y, r0, r1 in data:
xa1 = (x - r1 - x0) * res / (x1 - x0)
xb1 = (x + r1 - x0) * res / (x1 - x0)
ya1 = (y - r1 - y0) * res / (y1 - y0)
yb1 = (y + r1 - y0) * res / (y1 - y0)
xa0 = (x - r0 - x0) * res / (x1 - x0)
xb0 = (x + r0 - x0) * res / (x1 - x0)
ya0 = (y - r0 - y0) * res / (y1 - y0)
yb0 = (y + r0 - y0) * res / (y1 - y0)
p = QPainterPath()
p.addEllipse(QRectF(xa0, ya0, xb0-xa0, yb0-ya0))
p.addEllipse(QRectF(xa1, ya1, xb1-xa1, yb1-ya1))
p.addRect(QRectF(0, 0, res, res))
dc.drawPath(p)
and the computation part for an 800x800 resolution image takes about 8ms (and I'm not sure it's hardware accelerated).
If only the barycenter of the intersection is to be computed then there is no memory allocation at all. For example a "brute-force" approach is just a few lines of C
typedef struct TReading {
double x, y, r0, r1;
} Reading;
int hit(double xx, double yy,
Reading *readings, int num_readings)
{
while (num_readings--)
{
double dx = xx - readings->x;
double dy = yy - readings->y;
double d2 = dx*dx + dy*dy;
if (d2 < readings->r0 * readings->r0) return 0;
if (d2 > readings->r1 * readings->r1) return 0;
readings++;
}
return 1;
}
int computeLocation(Reading *readings, int num_readings,
int resolution,
double *result_x, double *result_y)
{
// Compute bounding box of interesting zone
double x0 = -1E20, y0 = -1E20, x1 = 1E20, y1 = 1E20;
for (int i=0; i<num_readings; i++)
{
if (readings[i].x - readings[i].r1 > x0)
x0 = readings[i].x - readings[i].r1;
if (readings[i].y - readings[i].r1 > y0)
y0 = readings[i].y - readings[i].r1;
if (readings[i].x + readings[i].r1 < x1)
x1 = readings[i].x + readings[i].r1;
if (readings[i].y + readings[i].r1 < y1)
y1 = readings[i].y + readings[i].r1;
}
// Scan processing
double ax = 0, ay = 0;
int total = 0;
for (int i=0; i<=resolution; i++)
{
double yy = y0 + i * (y1 - y0) / resolution;
for (int j=0; j<=resolution; j++)
{
double xx = x0 + j * (x1 - x0) / resolution;
if (hit(xx, yy, readings, num_readings))
{
ax += xx; ay += yy; total += 1;
}
}
}
if (total)
{
*result_x = ax / total;
*result_y = ay / total;
}
return total;
}
And on my PC can compute the barycenter with resolution = 100 in 0.08 ms (x=1.50000, y=1.383250) or with resolution = 400 in 1.3ms (x=1.500000, y=1.383308). Of course a double-step speedup could be implemented even for the barycenter-only version.
I would switch from "max/min" to trying to minimize an error function. That gets you to the problem discussed at Finding a point that best fits the intersection of n spheres which is more tractable than intersecting a series of complicated shapes. (And what if one robot's sensor is messed up and it gives an impossible value? That variation will still usually give a reasonable answer.)
Not sure about your case, but in a typical robotics application you're going to be reading sensors periodically and crunching the data. If that's the case, you're trying to estimate the location based on noisy data and that's a common problem. As a simple (less rigorous) method, you could take the existing position and adjust it toward or away from each known point. Take the measured distance to target minus the present distance to target, multiply that delta (error) by some value between 0 and 1, and move your estimated position that much toward the target. Repeat for each target. Then repeat each time you get a new set of measurements. The multiplier will have an effect like a low-pass filter, smaller values will give you a more stable position estimate with slower response to movement. For the distance, use the average of the min and max. If you can put tighter bounds on the range to one target, you can increase the multiplier closer to 1 for just that target.
This is of course a crude position estimator. The math guys can probably be more rigorous, but also more complicated. The solution is definitely not anything to do with intersecting areas and working with geometric shapes.

Resources