How to get a "random" number in OpenCL

I'm looking to get a random number in OpenCL. It doesn't have to be truly random or even that random. Just something simple and quick.
I see there are a ton of truly random, parallelized, fancy-pants random algorithms in OpenCL that are thousands and thousands of lines long. I do NOT need anything like that. A simple 'random()' would be fine, even if it is easy to see patterns in it.
I see there is a Noise function? Is there an easy way to use that to get a random number?

I was working on this "no random" issue for the last few days and came up with three different approaches:
1. Xorshift - I created a generator based on this one. All you have to do is provide one uint2 number (seed) for the whole kernel, and every work item will compute its own random number.
// 'randoms' is a uint2 passed to the kernel; 'globalID' is the work item's global id
uint seed = randoms.x + globalID;
uint t = seed ^ (seed << 11);
uint result = randoms.y ^ (randoms.y >> 19) ^ (t ^ (t >> 8));
2. Java random - I used the code from the .next(int bits) method to generate a random number. This time you have to provide one ulong number as the seed.
// 'randoms' is a ulong passed to the kernel
ulong seed = randoms + globalID;
seed = (seed * 0x5DEECE66DL + 0xBL) & ((1L << 48) - 1);
uint result = seed >> 16;
3. Just generate everything on the CPU and pass it to the kernel in one big buffer.
I tested all three approaches (generators) in my evolutionary algorithm computing a Minimum Dominating Set in graphs.
I like the numbers generated by the first one, but it looks like my evolutionary algorithm doesn't.
The second generator produces numbers that have some visible pattern, but my evolutionary algorithm likes it that way anyway, and the whole thing runs a little faster than with the first generator.
But the third approach shows that it's absolutely fine to just provide all the numbers from the host (CPU). At first I thought that generating (in my case) 1536 int32 numbers and passing them to the GPU on every kernel call would be too expensive (to compute and transfer to the GPU). But it turns out it is as fast as my previous attempts, and the CPU load stays under 5%.
BTW, I also tried MWC64X Random, but after I installed a new GPU driver the function mul_hi started causing build failures (even the whole AMD Kernel Analyzer crashed).
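To sanity-check the first two generators without a GPU, they can be emulated on the host. The following is a minimal Python sketch, not the answerer's code: the function names are hypothetical, and the masks stand in for the 32-bit and 64-bit wraparound that OpenCL's uint and ulong provide automatically.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF
MASK48 = (1 << 48) - 1


def xorshift_item(randoms_x, randoms_y, global_id):
    """Emulate the xorshift-style snippet for one work item (uint arithmetic)."""
    seed = (randoms_x + global_id) & MASK32
    t = seed ^ ((seed << 11) & MASK32)
    return (randoms_y ^ (randoms_y >> 19) ^ (t ^ (t >> 8))) & MASK32


def java_lcg_item(randoms, global_id):
    """Emulate one java.util.Random-style LCG step for one work item."""
    seed = (randoms + global_id) & MASK64
    seed = (seed * 0x5DEECE66D + 0xB) & MASK48
    return seed >> 16  # top 32 bits of the 48-bit state


if __name__ == "__main__":
    for gid in range(4):
        print(gid, xorshift_item(12345, 67890, gid), java_lcg_item(12345, gid))
```

Printing a few neighbouring work items this way shows that a single LCG step maps consecutive ids to evenly spaced states, which may be one source of the visible pattern mentioned above.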

The following is the algorithm used by the java.util.Random class, according to the docs:
(seed * 0x5DEECE66DL + 0xBL) & ((1L << 48) - 1)
See the documentation for its various implementations. Passing the work item's id in as the seed and looping a few times should produce decent randomness.
Another method would be to perform some arbitrary operations that are fairly certain to overflow:
long rand= yid*xid*as_float(xid-yid*xid);
rand*=rand<<32^rand<<16|rand;
rand*=rand+as_double(rand);
with xid = get_global_id(0) and yid = get_global_id(1).

I am currently implementing a Realtime Path Tracer. You might already know that Path Tracing requires many many random numbers.
Before generating random numbers on the GPU I simply generated them on the CPU (using rand(), which sucks) and passed them to the GPU.
That quickly became a bottleneck.
Now I am generating the random numbers on the GPU with the Park-Miller Pseudorandom Number Generator (PRNG).
It is extremely simple to implement and achieves very good results.
I took thousands of samples (in the range of 0.0 to 1.0) and averaged them together.
The resulting value was very close to 0.5 (which is what you would expect). Between different runs the deviation from 0.5 was around 0.002, which is consistent with a uniform distribution.
Here's a paper describing the algorithm: http://www.cems.uwe.ac.uk/~irjohnso/coursenotes/ufeen8-15-m/p1192-parkmiller.pdf
And here's a paper about the above algorithm optimized for CUDA (which can easily be ported to OpenCL): http://www0.cs.ucl.ac.uk/staff/ucacbbl/ftp/papers/langdon_2009_CIGPU.pdf
Here's an example of how I'm using it:
int rand(int* seed) // 1 <= *seed < m
{
    int const a = 16807; // i.e. 7**5
    int const m = 2147483647; // i.e. 2**31-1
    // Widen to 64 bits before multiplying so the product cannot overflow.
    *seed = (int)(((long)(*seed) * a) % m);
    return(*seed);
}
kernel void random_number_kernel(global int* seed_memory)
{
    int global_id = get_global_id(1) * get_global_size(0) + get_global_id(0); // Get the global id in 1D.
    // Since the Park-Miller PRNG generates a SEQUENCE of random numbers
    // we have to keep track of the previous random number, because the next
    // random number will be generated using the previous one.
    int seed = seed_memory[global_id];
    int random_number = rand(&seed); // Generate the next random number in the sequence.
    seed_memory[global_id] = seed; // Save the seed for the next time this kernel gets enqueued.
}
The code serves just as an example. I have not tested it.
The array "seed_memory" is filled with rand() only once, before the first execution of the kernel. After that, all random number generation happens on the GPU. I think it's also possible to use the global id instead of initializing the array with rand(), as long as the seed is never 0 (for example, global id + 1), because a seed of 0 stays at 0 forever.
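A host-side reference can help verify a port of this generator. Below is a minimal Python sketch (function names are hypothetical) of the same recurrence. Park and Miller's paper gives a check value for it: starting from a seed of 1, the 10000th generated number should be 1043618065.

```python
A = 16807        # 7**5
M = 2147483647   # 2**31 - 1


def park_miller_next(seed):
    """Return the next value of the minimal standard generator (1 <= seed < M)."""
    if not 1 <= seed < M:
        raise ValueError("seed must satisfy 1 <= seed < 2**31 - 1")
    return (seed * A) % M


def to_unit_float(value):
    """Map a generator value to the open interval (0, 1)."""
    return value / M


if __name__ == "__main__":
    seed = 1
    for _ in range(10000):
        seed = park_miller_next(seed)
    print(seed)
```

Python integers do not overflow, so this version avoids the 32-bit multiplication pitfall that a kernel has to handle by widening to 64 bits.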

It seems OpenCL does not provide such functionality. However, some people have done some research on that and provide BSD licensed code for producing good random numbers on GPU.

This is my version of OpenCL float pseudorandom noise, using a trigonometric function:
// noise values in the range 0.0 to 1.0
static float noise3D(float x, float y, float z) {
    float ptr = 0.0f;
    return fract(sin(x*112.9898f + y*179.233f + z*237.212f) * 43758.5453f, &ptr);
}
__kernel void fillRandom(float seed, __global float* buffer, int length) {
    int gi = get_global_id(0);
    float fgi = (float)gi / length; // normalized index in [0, 1)
    buffer[gi] = noise3D(fgi, 0.0f, seed);
}
You can generate 1D or 2D noise by passing normalized index coordinates to noise3D as the first parameters, and the random seed (generated on the CPU, for example) as the last parameter.
(The original answer showed noise pictures generated with this kernel and different seeds.)

GPUs don't have good sources of randomness, but this can be easily overcome by seeding a kernel with a random seed from the host. After that, you just need an algorithm that can work with a massive number of concurrent threads.
This link describes a Mersenne Twister implementation using OpenCL: Parallel Mersenne Twister. You can also find an implementation in the NVIDIA SDK.

I had the same problem.
www.thesalmons.org/john/random123/papers/random123sc11.pdf
You can find the documentation here.
http://www.thesalmons.org/john/random123/releases/latest/docs/index.html
You can download the library here:
http://www.deshawresearch.com/resources_random123.html

Why not? You could just write a kernel that generates random numbers, though that would need more kernel calls and eventually passing the random numbers as an argument to the other kernel that needs them.

There is no built-in way to generate random numbers in a kernel. The simplest option is to generate the random numbers on the host (CPU), then transfer them to the GPU through buffers and use them in the kernel.

Related

Using vector types to improve OpenCL kernel performance

I have the following OpenCL kernel, which copies values from one buffer to another, optionally inverting the value (the 'invert' arg can be 1 or -1):-
__kernel void extraction(__global const short* src_buff, __global short* dest_buff, const int record_len, const int invert)
{
int i = get_global_id(0); // Index of record in buffer
int j = get_global_id(1); // Index of value in record
dest_buff[(i* record_len) + j] = src_buff[(i * record_len) + j] * invert;
}
The source buffer contains one or more "records", each containing N (record_len) short values. All records in the buffer are of equal length, and record_len is always a multiple of 32.
The global size is 2D (number of records in the buffer, record length), and I chose this as it seemed to make best use of the GPU parallel processing, with each thread being responsible for copying just one value in one record in the buffer.
(The local work size is set to NULL by the way, allowing OpenCL to determine the value itself).
After reading about vectors recently, I was wondering if I could use these to improve on the performance? I understand the concept of vectors but I'm not sure how to use them in practice, partly due to lack of good examples.
I'm sure the kernel's performance is pretty reasonable already, so this is mainly out of curiosity to see what difference it would make using vectors (or other more suitable approaches).
At the risk of being a bit naive here, could I simply change the two buffer arg types to short16, and change the second value in the 2-D global size from "record length" to "record length / 16"? Would this result in each kernel thread copying a block of 16 short values between the buffers?

Your naive assumption is basically correct, though you may want to add a hint to the compiler that this kernel is optimized for the vector type (Section 6.7.2 of the spec). In your case, you would add
__attribute__((vec_type_hint(short16)))
above your kernel function. So in your example, you would have
__attribute__((vec_type_hint(short16)))
__kernel void extraction(__global const short16* src_buff, __global short16* dest_buff, const int record_len, const int invert)
{
int i = get_global_id(0); // Index of record in buffer
int j = get_global_id(1); // Index of value in record
// Cast the scalar so it matches the vector's element type (short).
dest_buff[(i * record_len) + j] = src_buff[(i * record_len) + j] * (short)invert;
}
You are correct in that your 2nd global dimension should be divided by 16, and your record_len should also be divided by 16. Also, if you were to specify the local size instead of giving it NULL, you would also want to divide that by 16.
There are some other things to consider though.
You might think choosing the largest vector size should provide the best performance, especially with such a simple kernel. But in my experience, that is rarely the optimal size. You may try asking clGetDeviceInfo for CL_DEVICE_PREFERRED_VECTOR_WIDTH_SHORT, but for me this is rarely accurate (also, it may give you 1, meaning the compiler will try auto-vectorization or the device doesn't have vector hardware). It is best to try different vector sizes and see which is fastest.
If your device supports auto-vectorization, and you want to give it a go, it may help to remove your record_len parameter and replace it with get_global_size(1) so the compiler/driver can take care of dividing record_len by whatever vector size it picks. I would recommend doing this anyway, assuming record_len is equal to the global size you gave that dimension.
Also, you gave NULL to the local size argument so that the implementation picks a size automatically. It is guaranteed to pick a size that works, but it will not necessarily pick the optimal size.
Lastly, for general OpenCL optimizations, you may want to take a look at the NVIDIA OpenCL Best Practices Guide for NVidia hardware, or the AMD APP SDK OpenCL User Guide for AMD GPU hardware. The NVidia one is from 2009, and I'm not sure how much their hardware has changed since. Notice though that it actually says:
The CUDA architecture is a scalar architecture. Therefore, there is no performance
benefit from using vector types and instructions. These should only be used for
convenience.
Older AMD hardware (pre-GCN) benefited from using vector types, but AMD suggests not using them on GCN devices (see mogu's comment). Also if you are targeting a CPU, it will use AVX hardware if available.
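The size bookkeeping for the short16 version (second global dimension and record_len both divided by the vector width) can be checked on the host before touching the kernel. This is a small Python sketch with hypothetical helper names; it only models the indexing and does not call OpenCL.

```python
def vectorized_ndrange(num_records, record_len, vec_width=16):
    """Return (global_size, vec_record_len) for a kernel that uses short<vec_width>."""
    if record_len % vec_width != 0:
        raise ValueError("record_len must be a multiple of the vector width")
    vec_record_len = record_len // vec_width
    return (num_records, vec_record_len), vec_record_len


def covered_shorts(global_size, vec_record_len, vec_width=16):
    """List every short index touched when each work item copies one vector."""
    covered = []
    for i in range(global_size[0]):        # record index
        for j in range(global_size[1]):    # vector index within the record
            vec_index = i * vec_record_len + j
            first = vec_index * vec_width
            covered.extend(range(first, first + vec_width))
    return covered


if __name__ == "__main__":
    size, vec_len = vectorized_ndrange(num_records=3, record_len=64)
    print(size, vec_len, len(covered_shorts(size, vec_len)))
```

If the kernel were still given the undivided record_len, the computed vector index would run past the end of the buffer, which is why both values have to be scaled together.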

Method to do final sum with reduction

This is a follow-up to my first question, explained at this link.
As a reminder, I would like to apply a method that can do multiple sum reductions with OpenCL (my GPU device only supports OpenCL 1.2). I need to compute the sum reduction of an array to check the convergence criterion at each iteration of the main loop.
Currently, I have a version that does only one sum reduction (i.e. one iteration). In this version, for simplicity, I use a sequential CPU loop to add up the partial sums and get the final value of the sum.
Following the advice on my previous question, my issue is that I don't know how to perform the final sum by calling the NDRangeKernel function a second time (i.e. executing the kernel code a second time).
Indeed, with a second call, I will always face the same problem of summing the partial sums (themselves computed by the first call of NDRangeKernel): it seems to be a recursive issue.
Let's take an example from the above figure: if input array size is 10240000 and WorkGroup size is 16, we get 10000*2^10/2^4 = 10000*2^6 = 640000 WorkGroups.
So after the first call, I get 640000 partial sums: how do I deal with the final summation of all these partial sums? If I call the kernel code again with, for example, WorkGroup size = 16 and global size = 640000, I will get nWorkGroups = 640000/16 = 40000 partial sums, so I have to call the kernel code one more time and repeat this process until nWorkGroups < WorkGroup size.
Maybe I didn't understand the second stage very well, mostly this part of the kernel code from "two-stage reduction" (at this link; I think this is the case of searching for the minimum in the input array):
__kernel
void reduce(__global float* buffer,
__local float* scratch,
__const int length,
__global float* result) {
int global_index = get_global_id(0);
float accumulator = INFINITY;
// Loop sequentially over chunks of input vector
while (global_index < length) {
float element = buffer[global_index];
accumulator = (accumulator < element) ? accumulator : element;
global_index += get_global_size(0);
}
// Perform parallel reduction
...
Could someone explain what the above kernel code snippet does?
Is there a relation with the second stage of the reduction, i.e. the final summation?
Feel free to ask me for more details if my issue is unclear.
Thanks

As mentioned in the comment: The statement
if input array size is 10240000 and WorkGroup size is 16, we get 10000*2^10/2^4 = 10000*2^6 = 640000 WorkGroups.
is not correct. You can choose an "arbitrary" work group size, and an "arbitrary" number of work groups. The numbers to choose here may be tailored to the target device. For example, the device may have a certain local memory size. This can be queried with clGetDeviceInfo:
cl_ulong localMemSize = 0;
clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
sizeof(cl_ulong), &localMemSize, nullptr);
This may be used to compute the size of a local work group, considering the fact that each work group will require
sizeof(cl_float) * workGroupSize
bytes of local memory.
Similarly, the number of work groups may be derived from other device specific parameters.
The key point regarding the reduction itself is that the work group size does not limit the size of the array that can be processed. I also had some difficulties with understanding the algorithm as a whole, so I tried to explain it here, hoping that a few images may be worth a thousand words:
As you can see, the number of work groups and the work group size are fixed and independent of the input array length: Even though I'm using 3 work groups with a size of 8 in the example (giving a global size of 24), an array of length 64 can be processed. This is mainly due to the first loop, which just walks through the input array, with a "step size" that is equal to the global work size (24 here). The result will be one accumulated value for each of the 24 threads. These are then reduced in parallel.
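The flow described above (strided accumulation per work item, a parallel reduction inside each work group, then a final pass over the few per-group results) can be simulated sequentially. This is a Python sketch under the same numbers as the example (3 groups of 8, array of length 64); the function name is hypothetical, it sums instead of taking the minimum, and it assumes a power-of-two group size.

```python
def two_stage_sum(data, num_groups=3, group_size=8):
    """Simulate a two-stage sum reduction; returns (total, per-group partial sums)."""
    global_size = num_groups * group_size

    # Stage 1a: each work item walks the input with a stride of global_size.
    accumulators = []
    for global_id in range(global_size):
        acc = 0.0
        index = global_id
        while index < len(data):
            acc += data[index]
            index += global_size
        accumulators.append(acc)

    # Stage 1b: tree reduction inside each work group ("local memory").
    partial_sums = []
    for group in range(num_groups):
        scratch = accumulators[group * group_size:(group + 1) * group_size]
        offset = group_size // 2
        while offset > 0:
            for local_id in range(offset):
                scratch[local_id] += scratch[local_id + offset]
            offset //= 2
        partial_sums.append(scratch[0])

    # Stage 2: only num_groups values remain, so a tiny final pass is enough.
    return sum(partial_sums), partial_sums


if __name__ == "__main__":
    total, partials = two_stage_sum(list(range(64)))
    print(total, partials)
```

Because the number of partial sums equals the number of work groups chosen up front, the second stage does not grow with the input, which removes the "recursive" chain of kernel calls described in the question.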

Inaccurate results with OpenCL Reduction example

I am working with the OpenCL reduction example provided by Apple here
After a few days of dissecting it, I understand the basics; I've converted it to a version that runs more or less reliably in C++ (openFrameworks) and finds the largest number in the input set.
However, in doing so, a few questions have arisen:
Why are multiple passes used? The most I have been able to cause the reduction to require is two, with the latter pass taking only a very small number of elements and so being very unsuitable for an OpenCL process (i.e. wouldn't it be better to stick to a single pass and then process the results of that on the CPU?)
When I set the 'count' number of elements to a very high number (24M and up) and the type to a float4, I get inaccurate (or totally wrong) results. Why is this?
In the OpenCL kernels, can anyone explain what is being done here:
while (i < n){
int a = LOAD_GLOBAL_I1(input, i);
int b = LOAD_GLOBAL_I1(input, i + group_size);
int s = LOAD_LOCAL_I1(shared, local_id);
STORE_LOCAL_I1(shared, local_id, (a + b + s));
i += local_stride;
}
as opposed to what is being done here?
#define ACCUM_LOCAL_I1(s, i, j) \
{ \
int x = ((__local int*)(s))[(size_t)(i)]; \
int y = ((__local int*)(s))[(size_t)(j)]; \
((__local int*)(s))[(size_t)(i)] = (x + y); \
}
Thanks!
S

To answer the first 2 questions:
why are multiple passes used?
Reducing millions of elements to a few thousand can be done in parallel with a device utilization of almost 100%. But the final step is quite tricky. So, instead of doing everything in one shot and leaving many threads idle, Apple's implementation does a first-pass reduction, then adapts the work items to the new, smaller reduction problem, and finally completes it.
It is a very specific optimization for OpenCL, but it may not pay off in C++.
when I set the 'count' number of elements to a very high number (24M and up) and the type to a float4, I get inaccurate (or totally wrong) results. Why is this?
A 32-bit float has a 24-bit significand (23 stored bits plus an implicit leading one). Values around 24M = 1.43 x 2^24 are therefore spaced 2^(24-23) = 2 apart, so each result can be off by up to +/-1.
That means, if you do:
float A=24000000;
float B= A + 1; //~1 error here
The rounding error is of the same order as the values being added, so you get big errors if you repeat that in a loop!
You may not see this on a CPU if the sum is accumulated at a higher precision than 32-bit float (for example in a double, or in x87 extended-precision registers), because the extra significand bits absorb these errors. The same effect appears again once the values approach the limit of that wider format, but that is not the typical case for normal "counting" integers.
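The rounding effect can be reproduced on the host by forcing every intermediate result through a 32-bit float. This is a minimal Python sketch (helper names are hypothetical) that uses the struct module to round to single precision.

```python
import struct


def to_float32(x):
    """Round a Python float (a double) to the nearest 32-bit float value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def float32_add(a, b):
    """Add two numbers as a 32-bit float unit would, rounding the result."""
    return to_float32(to_float32(a) + to_float32(b))


def naive_float32_sum(values):
    """Sequentially accumulate values in a 32-bit float accumulator."""
    total = 0.0
    for value in values:
        total = float32_add(total, value)
    return total


if __name__ == "__main__":
    print(float32_add(24000000.0, 1.0))                 # the +1 is lost
    print(naive_float32_sum([16777216.0] + [1.0] * 100))  # stuck at 2**24
    print(sum([16777216.0] + [1.0] * 100))              # double accumulator
```

Once the running total reaches 2^24, adding 1.0 no longer changes it in single precision, while the double-precision sum keeps counting.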

The problem is with the precision of 32 bit floats. You're not the first person to ask about this either. OpenCL reduction result wrong with large floats

Use of qsrand, random method that is not random

I'm having a strange problem here, and I can't manage to find a good explanation to it, so I thought of asking you guys :
Consider the following method :
int MathUtility::randomize(int Min, int Max)
{
qsrand(QTime::currentTime().msec());
if (Min > Max)
{
int Temp = Min;
Min = Max;
Max = Temp;
}
return ((rand()%(Max-Min+1))+Min);
}
I won't explain to you gurus what this method actually does; I'll explain my problem instead:
I realised that when I call this method in a loop, I sometimes get the same random number over and over again... For example, this snippet...
for(int i=0; i<10; ++i)
{
int Index = MathUtility::randomize(0, 1000);
qDebug() << Index;
}
...will produce something like :
567
567
567
567...etc...
I also realised that if I don't call qsrand every time, but only once during my application's lifetime, it works perfectly...
My question : Why ?

Because if you call randomize more than once in a millisecond (which is rather likely at current CPU clock speeds), you are seeding the RNG with the same value. This is guaranteed to produce the same output from the RNG.
Random-number generators are only meant to be seeded once. Seeding them multiple times does not make the output extra random, and in fact (as you found) may make it much less random.
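The effect is easy to reproduce in isolation. The sketch below uses Python's random module rather than Qt (the function names are hypothetical): creating a freshly seeded generator for every call with an unchanged seed returns the same value each time, while seeding once yields a varying sequence.

```python
import random


def reseed_every_call(seed, low, high):
    """Mimic the buggy pattern: a freshly seeded generator for every number."""
    rng = random.Random(seed)
    return rng.randint(low, high)


def seed_once(seed, count, low, high):
    """Seed a single generator once and draw several numbers from it."""
    rng = random.Random(seed)
    return [rng.randint(low, high) for _ in range(count)]


if __name__ == "__main__":
    same_millisecond = 123  # stands in for an unchanged msec() value
    print([reseed_every_call(same_millisecond, 0, 1000) for _ in range(5)])
    print(seed_once(same_millisecond, 5, 0, 1000))
```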

If you make the call fast enough the value of QTime::currentTime().msec() will not change, and you're basically re-seeding qsrand with the same seed, causing the next random number generated to be the same as the prior one.

If you call the Qt function qsrand to initialize the seed, you must call the Qt function qrand to generate a random number, not the rand function from the standard library. The seed initialization for the rand function is srand.
Sorry for the dig up.

What you see is the effect of pseudo-randomness. A generator is meant to be seeded with the time once, after which it generates a sequence of numbers. Since you are pulling a series of random numbers very quickly after each other, you are re-seeding the randomizer with the same number until the next millisecond. And while a millisecond seems like a short time, consider the amount of calculations you're doing in that time.

Modern Qt with C++11:
#include <random>
#include <QDateTime>
int getRand(int min, int max){
    // Seed the generator once (static), not on every call; re-seeding within
    // the same millisecond would return the same number again.
    static std::mt19937 gen(static_cast<unsigned>(QDateTime::currentMSecsSinceEpoch()));
    std::uniform_int_distribution<> uid(min, max);
    return uid(gen);
}

Two problems:
1. As others have pointed out, the generator is being seeded multiple times.
2. This is not a very good method to generate random numbers within a given range. (In fact it's very, very bad for most generators.)
You are assuming that the low-order bits from the generator are uniformly distributed. This is not the case with many generators, where most of the randomness is in the high-order bits.
By using the remainder after division you are in effect throwing out the randomness.
You should scale using multiplication and division, not the modulo operator.
e.g.
my_number = start_required + (generator_output * range_required) / generator_maximum;
If generator_output is in [0, generator_maximum],
my_number will be in [start_required, start_required + range_required].
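A scaling helper along these lines might look like the following Python sketch. It is a variant of the formula above, not the same expression: it divides by generator_maximum + 1 so that the top of the range is not reached by only a single generator output. The function name is hypothetical.

```python
from collections import Counter


def scale_to_range(generator_output, generator_maximum, low, high):
    """Map an output in [0, generator_maximum] to an integer in [low, high]."""
    span = high - low + 1
    return low + (generator_output * span) // (generator_maximum + 1)


if __name__ == "__main__":
    # Feed every possible output of a 15-bit generator through the mapping
    # and count how often each die face comes up.
    counts = Counter(scale_to_range(x, 32767, 1, 6) for x in range(32768))
    print(sorted(counts.items()))
```

With integer arithmetic the buckets differ in size by at most one generator output, and the result depends mainly on the high-order bits of the generator.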

I ran into the same behaviour and solved it by just calling rand() without calling srand() each time.
But I only use it for testing my application. It just runs in a loop, so I don't need the sequence to change between runs.
But if you are going to write some kind of game, it isn't a good approach, because your random sequence will be the same on every run.

How to test randomness (case in point - Shuffling)

First off, this question is ripped out from this question. I did it because I think this part is bigger than a sub-part of a longer question. If it offends, please pardon me.
Assume that you have an algorithm that generates randomness. Now how do you test it?
Or to be more direct - Assume you have an algorithm that shuffles a deck of cards, how do you test that it's a perfectly random algorithm?
To add some theory to the problem -
A deck of cards can be shuffled in 52! (52 factorial) different ways. Take a deck of cards, shuffle it by hand and write down the order of all cards. What is the probability that you would have gotten exactly that shuffle? Answer: 1 / 52!.
What is the chance that you, after shuffling, will get A, K, Q, J ... of each suit in a sequence? Answer 1 / 52!
So, just shuffling once and looking at the result will give you absolutely no information about your shuffling algorithm's randomness. Shuffle twice and you have more information, three times even more...
How would you black box test a shuffling algorithm for randomness?

Statistics. The de facto standard for testing RNGs is the Diehard suite (originally available at http://stat.fsu.edu/pub/diehard). Alternatively, the Ent program provides tests that are simpler to interpret but less comprehensive.
As for shuffling algorithms, use a well-known algorithm such as Fisher-Yates (a.k.a "Knuth Shuffle"). The shuffle will be uniformly random so long as the underlying RNG is uniformly random. If you are using Java, this algorithm is available in the standard library (see Collections.shuffle).
It probably doesn't matter for most applications, but be aware that most RNGs do not provide sufficient degrees of freedom to produce every possible permutation of a 52-card deck (explained here).
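The point about degrees of freedom comes down to counting bits: a generator whose state is smaller than log2(52!) bits cannot reach every permutation. A short Python sketch (the function name is hypothetical) makes the number concrete.

```python
import math


def bits_needed_for_permutations(n_cards):
    """Bits of generator state needed to address every ordering of n_cards."""
    return math.log2(math.factorial(n_cards))


if __name__ == "__main__":
    needed = bits_needed_for_permutations(52)
    print(f"52! needs about {needed:.1f} bits of state")
    for seed_bits in (32, 48, 64):
        print(seed_bits, "bit seed covers all shuffles:", seed_bits >= needed)
```

A 48-bit state such as the one in java.util.Random therefore falls far short of the roughly 226 bits required.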

Here's one simple check that you can perform. It uses generated random numbers to estimate Pi. It's not proof of randomness, but poor RNGs typically don't do well on it (they will return something like 2.5 or 3.8 rather than ~3.14).
Ideally this would be just one of many tests that you would run to check randomness.
Something else that you can check is the standard deviation of the output. The expected standard deviation for a uniformly distributed population of values in the range 0..n approaches n/sqrt(12).
/**
* This is a rudimentary check to ensure that the output of a given RNG
* is approximately uniformly distributed. If the RNG output is not
* uniformly distributed, this method will return a poor estimate for the
* value of pi.
* @param rng The RNG to test.
* @param iterations The number of random points to generate for use in the
* calculation. This value needs to be sufficiently large in order to
* produce a reasonably accurate result (assuming the RNG is uniform).
* Less than 10,000 is not particularly useful. 100,000 should be sufficient.
* @return An approximation of pi generated using the provided RNG.
*/
public static double calculateMonteCarloValueForPi(Random rng,
int iterations)
{
// Assumes a quadrant of a circle of radius 1, bounded by a box with
// sides of length 1. The area of the square is therefore 1 square unit
// and the area of the quadrant is (pi * r^2) / 4.
int totalInsideQuadrant = 0;
// Generate the specified number of random points and count how many fall
// within the quadrant and how many do not. We expect the number of points
// in the quadrant (expressed as a fraction of the total number of points)
// to be pi/4. Therefore pi = 4 * ratio.
for (int i = 0; i < iterations; i++)
{
double x = rng.nextDouble();
double y = rng.nextDouble();
if (isInQuadrant(x, y))
{
++totalInsideQuadrant;
}
}
// From these figures we can deduce an approximate value for Pi.
return 4 * ((double) totalInsideQuadrant / iterations);
}
/**
* Uses Pythagoras' theorem to determine whether the specified coordinates
* fall within the area of the quadrant of a circle of radius 1 that is
* centered on the origin.
* @param x The x-coordinate of the point (must be between 0 and 1).
* @param y The y-coordinate of the point (must be between 0 and 1).
* @return True if the point is within the quadrant, false otherwise.
*/
private static boolean isInQuadrant(double x, double y)
{
double distance = Math.sqrt((x * x) + (y * y));
return distance <= 1;
}
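The standard-deviation check mentioned earlier in this answer can be sketched in a few lines. This Python version (the function name and the sample count are assumptions chosen for illustration) compares the measured standard deviation of values drawn from 0..n against the expected n/sqrt(12).

```python
import math
import random


def uniform_stddev_ratio(rng, n, samples):
    """Return measured stddev divided by n/sqrt(12); about 1.0 if uniform."""
    values = [rng.random() * n for _ in range(samples)]
    mean = sum(values) / samples
    variance = sum((v - mean) ** 2 for v in values) / samples
    return math.sqrt(variance) / (n / math.sqrt(12))


if __name__ == "__main__":
    ratio = uniform_stddev_ratio(random.Random(1), 100.0, 100000)
    print(f"measured / expected standard deviation: {ratio:.4f}")
```

A ratio far from 1 suggests the output is not uniform; a ratio near 1 is only a necessary condition, not proof of randomness.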

First, it is impossible to know for sure if a certain finite output is "truly random" since, as you point out, any output is possible.
What can be done, is to take a sequence of outputs and check various measurements of this sequence against what is more likely. You can derive a sort of confidence score that the generating algorithm is doing a good job.
For example, you could check the output of 10 different shuffles. Assign a number 0-51 to each card, and take the average of the card in position 6 across the shuffles. The convergent average is 25.5, so you would be surprised to see a value of 1 here. You could use the central limit theorem to get an estimate of how likely each average is for a given position.
But we shouldn't stop here! Because this algorithm could be fooled by a system that only alternates between two shuffles that are designed to give the exact average of 25.5 at each position. How can we do better?
We expect a uniform distribution (equal likelihood for any given card) at each position, across different shuffles. So among the 10 shuffles, we could try to verify that the choices 'look uniform.' This is basically just a reduced version of the original problem. You could check that the standard deviation looks reasonable, that the min is reasonable, and the max value as well. You could also check that other values, such as the closest two cards (by our assigned numbers), also make sense.
But we also can't just add various measurements like this ad infinitum, since, given enough statistics, any particular shuffle will appear highly unlikely for some reason (e.g. this is one of very few shuffles in which cards X,Y,Z appear in order). So the big question is: which is the right set of measurements to take? Here I have to admit that I don't know the best answer. However, if you have a certain application in mind, you can choose a good set of properties/measurements to test, and work with those -- this seems to be the way cryptographers handle things.

There's a lot of theory on testing randomness. For a very simple test on a card shuffling algorithm you could do a lot of shuffles and then run a chi squared test that the probability of each card turning up in any position was uniform. But that doesn't test that consecutive cards aren't correlated so you would also want to do tests on that.
Volume 2 of Knuth's Art of Computer Programming gives a number of tests that you could use in sections 3.3.2 (Empirical tests) and 3.3.4 (The Spectral Test) and the theory behind them.

The only way to test for randomness is to write a program that attempts to build a predictive model for the data being tested, and then use that model to try to predict future data, and then showing that the uncertainty, or entropy, of its predictions tend towards maximum (i.e. the uniform distribution) over time. Of course, you'll always be uncertain whether or not your model has captured all of the necessary context; given a model, it'll always be possible to build a second model that generates non-random data that looks random to the first. But as long as you accept that the orbit of Pluto has an insignificant influence on the results of the shuffling algorithm, then you should be able to satisfy yourself that its results are acceptably random.
Of course, if you do this, you might as well use your model generatively, to actually create the data you want. And if you do that, then you're back at square one.

Shuffle a lot, and then record the outcomes (if I'm reading this correctly). I remember seeing comparisons of "random number generators". They just test it over and over, then graph the results.
If it is truly random, the graph will be mostly even.

I'm not fully following your question. You say
Assume that you have an algorithm that generates randomness. Now how do you test it?
What do you mean? If you're assuming you can generate randomness, there's no need to test it.
Once you have a good random number generator, creating a random permutation is easy (e.g. Call your cards 1-52. Generate 52 random numbers assigning each one to a card in order, and then sort according to your 52 randoms) . You're not going to destroy the randomness of your good RNG by generating your permutation.
The difficult question is whether you can trust your RNG. Here's a sample link to people discussing that issue in a specific context.

Testing 52! possibilities is of course impossible. Instead, try your shuffle on smaller numbers of cards, like 3, 5, and 10. Then you can test billions of shuffles and use a histogram and the chi-square statistical test to check that each permutation comes up roughly equally often.
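As a rough illustration of the small-deck approach, the following Python sketch counts how often each permutation of 3 cards appears and computes the chi-square statistic against a uniform expectation. The function names, the seed and the trial count are assumptions; the "biased" shuffle is the well-known flawed variant that swaps every position with any position, included only for contrast.

```python
import random
from collections import Counter
from math import factorial


def fisher_yates(cards, rng):
    """Unbiased shuffle: swap position i with a random position in 0..i."""
    deck = list(cards)
    for i in range(len(deck) - 1, 0, -1):
        j = rng.randint(0, i)
        deck[i], deck[j] = deck[j], deck[i]
    return tuple(deck)


def biased_shuffle(cards, rng):
    """Flawed shuffle: swap every position with a random position in 0..n-1."""
    deck = list(cards)
    for i in range(len(deck)):
        j = rng.randint(0, len(deck) - 1)
        deck[i], deck[j] = deck[j], deck[i]
    return tuple(deck)


def chi_square_of_shuffle(shuffle, n_cards, trials, rng):
    """Chi-square statistic of permutation counts versus a uniform expectation."""
    counts = Counter(shuffle(range(n_cards), rng) for _ in range(trials))
    expected = trials / factorial(n_cards)
    unseen = factorial(n_cards) - len(counts)  # permutations that never appeared
    seen_part = sum((c - expected) ** 2 / expected for c in counts.values())
    return seen_part + unseen * expected


if __name__ == "__main__":
    rng = random.Random(2024)
    print("fisher-yates:", chi_square_of_shuffle(fisher_yates, 3, 60000, rng))
    print("biased:      ", chi_square_of_shuffle(biased_shuffle, 3, 60000, rng))
```

With 3 cards there are 6 permutations and so 5 degrees of freedom; the 5% critical value for that case is about 11.07, and the flawed shuffle lands far above it.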

No code so far, therefore I copy-paste a testing part from my answer to the original question.
// ...
int main() {
typedef std::map<std::pair<size_t, Deck::value_type>, size_t> Map;
Map freqs;
Deck d;
const size_t ntests = 100000;
// compute frequencies of events: card at position
for (size_t i = 0; i < ntests; ++i) {
d.shuffle();
size_t pos = 0;
for(Deck::const_iterator j = d.begin(); j != d.end(); ++j, ++pos)
++freqs[std::make_pair(pos, *j)];
}
// if Deck.shuffle() is correct then all frequencies must be similar
for (Map::const_iterator j = freqs.begin(); j != freqs.end(); ++j)
std::cout << "pos=" << j->first.first << " card=" << j->first.second
<< " freq=" << j->second << std::endl;
}
This code does not test randomness of underlying pseudorandom number generator. Testing PRNG randomness is a whole branch of science.

For a quick test, you can always try compressing it. Once it doesn't compress, then you can move onto other tests.
I've tried dieharder but it refuses to work for a shuffle. All tests fail. It is also really stodgy; it won't let you specify the range of values you want or anything like that.
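The compression idea can be tried with a general-purpose compressor. This is a minimal Python sketch using zlib (the function name is hypothetical): output that compresses well clearly contains structure, while output that does not compress has merely passed a first, weak test.

```python
import os
import zlib


def compression_ratio(data):
    """Compressed size divided by original size (about 1.0 if incompressible)."""
    return len(zlib.compress(data, 9)) / len(data)


if __name__ == "__main__":
    random_bytes = os.urandom(100000)
    patterned_bytes = bytes(range(256)) * 400
    print("os.urandom:", round(compression_ratio(random_bytes), 3))
    print("patterned: ", round(compression_ratio(patterned_bytes), 3))
```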

Pondering it myself, what I would do is something like:
Setup (pseudocode)
// A card has a Number 0-51 and a Position 0-51
int[][] statMatrix = new int[52][52]; // Assume all are set to 0 as starting values
ShuffleCards();
ForEach (card in Cards) {
statMatrix[card.Position][card.Number]++;
}
This gives us a 52x52 matrix indicating how many times a card has ended up at a certain position. Repeat the shuffle-and-count step a large number of times (I would start with 1000, but people better at statistics than me may give a better number).
Analyze the matrix
If we have perfect randomness and perform the shuffle an infinite number of times, then for each card and for each position the number of times the card ended up in that position is the same as for any other card. Saying the same thing in a different way:
statMatrix[position][card] / numberOfShuffles = 1/52.
So I would calculate how far from that number we are.
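The pseudocode above translates fairly directly into a runnable check. In this Python sketch the function names are hypothetical, and random.shuffle stands in for the shuffle under test; replace it with your own algorithm to test that instead.

```python
import random


def position_matrix(num_shuffles, rng, deck_size=52):
    """Count how often each card (column) lands at each position (row)."""
    matrix = [[0] * deck_size for _ in range(deck_size)]
    deck = list(range(deck_size))
    for _ in range(num_shuffles):
        rng.shuffle(deck)  # shuffle under test
        for position, card in enumerate(deck):
            matrix[position][card] += 1
    return matrix


def max_relative_deviation(matrix, num_shuffles):
    """Largest deviation of any cell from num_shuffles / deck_size, as a fraction."""
    expected = num_shuffles / len(matrix)
    return max(abs(count - expected) / expected
               for row in matrix for count in row)


if __name__ == "__main__":
    shuffles = 20000
    matrix = position_matrix(shuffles, random.Random(7))
    print("largest relative deviation:", max_relative_deviation(matrix, shuffles))
```

The deviation shrinks roughly with the square root of the number of shuffles, so 1000 shuffles of a 52-card deck give quite noisy cells (about 19 expected hits each).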
