qrand is not generating a random number - qt

I have a Qt app that runs two additional threads.
Inside the threads I use the qrand() function to generate a random number. The following code gets the number, where m_FluctuationMax is a double.
int fluctuate = qrand() % (int)(m_FluctuationMax * 100);
I tried adding the following code in the main thread, and also inside the thread classes.
QTime now = QTime::currentTime();
qsrand(now.msec());
Now the problem is that the generated values are always the same each time the application is started.
Shouldn't they be different, since the seed is set from currentTime()?
Thanks

I had my qsrand() in the thread/class constructor. When I moved it to the run() function, it started producing different values on each run. Not sure why it would not work from the constructor though. Thanks everyone for your help.

This may help anyone who happened to have a similar problem:
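// Requires <QTime>, <QtGlobal>, <array> and <iostream>.
// qsrand() takes a uint, so cast the millisecond count accordingly.
qsrand(static_cast<uint>(QTime::currentTime().msecsSinceStartOfDay()));
std::array<int, 5> arr = {qrand(), qrand(), qrand(), qrand(), qrand()};
for (auto i : arr)
    std::cout << i << std::endl;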

"I had my qsrand() in the thread/class constructor. When I moved it to the run() function, it started to work randomly. Not sure why it would not work from the constructor though."
qsrand() stores the seed in thread-local storage. That seed is really the pseudorandom number generator's state, which is also updated on each call to qrand(). The constructor of a thread class runs in the thread that creates the object, whereas run() executes in the new thread. If you seed the PRNG outside the thread where you will be using it, that seed does not influence the outcome. Thread-local storage usually defaults to zero, so you get the same sequence of pseudorandom numbers every time because the effective seed is always the same.
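The per-thread behaviour described above can be reproduced without Qt. The following is a minimal sketch in standard C++, assuming a thread_local std::mt19937 as a stand-in for qrand()'s per-thread state; it is not Qt's implementation, and the names tl_engine, seed_this_thread, draw_below and first_value_in_worker are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <thread>

// One engine per thread. Every thread's copy starts from the same default
// seed, which mimics the "same values on every run" symptom.
thread_local std::mt19937 tl_engine;

// Seeds the engine of the calling thread only.
void seed_this_thread(std::uint32_t seed) { tl_engine.seed(seed); }

// Counterpart of `qrand() % n`: draws from the calling thread's engine.
std::uint32_t draw_below(std::uint32_t bound) {
    assert(bound > 0);
    return static_cast<std::uint32_t>(tl_engine()) % bound;
}

// Starts a worker thread and returns the first raw value it draws.
// When seed_in_worker is true the worker seeds its own engine first,
// which corresponds to calling qsrand() inside run().
std::uint32_t first_value_in_worker(bool seed_in_worker, std::uint32_t seed) {
    std::uint32_t result = 0;
    std::thread worker([&] {
        if (seed_in_worker) seed_this_thread(seed);
        result = static_cast<std::uint32_t>(tl_engine());
    });
    worker.join();
    return result;
}
```

Calling seed_this_thread() in the creating thread and then first_value_in_worker(false, 0) yields the same value no matter which seed was used, because the worker never sees that seed. Passing true makes the worker seed itself, and different seeds then give different sequences.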

The first thing I'd check is the value of now.msec(). It returns only the millisecond part of the current time, and the documentation states:
Note that the accuracy depends on the accuracy of the underlying operating system; not all systems provide 1-millisecond accuracy.
It may be that your platform always returns the same value for msec(). If that's the case, you could try using minutes and seconds combined somehow (assuming you're not running your code multiple times every second).
You haven't stated which platform you're running on but the Qt source code only supports sub-second resolution if either Q_OS_WIN or Q_OS_UNIX is set.
Keep in mind that the random numbers are per-thread so you should probably do the qsrand in each thread, lest it be automatically seeded with 1.

Related

How do I stop all 262,144 kernels if I find my answer

I am using pyopencl to find a certain pixel in a 512 x 512 (262,144 pixels) image. I launch a global size of (512, 512) when I run my program and compare each pixel's neighbors to a known group of neighbors. I am doing image synthesis. I don't want to wait around for the remaining kernels to run if I find my group of pixels within a kernel. Is there a way to terminate the rest of the running kernels from within a kernel program?
Thanks
Tim
When you queue a kernel with many work items, it gets divided up into work groups and threads which keep the GPU busy. Really large global sizes start as many threads as they can and issue new ones when the old ones finish. So you could find the smallest global size that still performs well, and queue many of those (instead of one large one), but also be checking on the results of the previous ones you queued (use events to know when they are done, and read back memory to get their results). When you get the correct answer, stop queueing kernels.
so instead of this:
    queue entire job (say, 4096 x 4096)
do:
    do
    {
        queue some work (say, 32 x 32)
        check if any of the prior work queued is done and check if it got the answer
    }
    while (work remains AND answer not found)
You'll need to figure out the right tradeoff between the size of the smaller jobs and the overhead of checking their results versus extra work done.
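The chunking loop above can be sketched on the host side as follows. This is plain C++ with no OpenCL calls: the matches callback is a hypothetical placeholder for "run the kernel on one tile and read back its result", and SearchResult and search_in_tiles are names invented for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <optional>
#include <utility>

struct SearchResult {
    std::optional<std::pair<std::size_t, std::size_t>> hit;  // (x, y) if found
    std::size_t tiles_run = 0;                               // tiles actually processed
};

// Scans a width x height domain in tile x tile chunks and stops submitting
// chunks as soon as one of them reports a match.
SearchResult search_in_tiles(std::size_t width, std::size_t height, std::size_t tile,
                             const std::function<bool(std::size_t, std::size_t)>& matches) {
    assert(tile > 0);
    SearchResult r;
    for (std::size_t ty = 0; ty < height && !r.hit; ty += tile) {
        for (std::size_t tx = 0; tx < width && !r.hit; tx += tile) {
            ++r.tiles_run;  // "queue some work"
            // "check if it got the answer": stop scanning at the first match
            for (std::size_t y = ty; y < ty + tile && y < height && !r.hit; ++y)
                for (std::size_t x = tx; x < tx + tile && x < width && !r.hit; ++x)
                    if (matches(x, y)) r.hit = std::make_pair(x, y);
        }
    }
    return r;
}
```

With a 512 x 512 domain and 32 x 32 tiles there are 256 tiles in total, but a match located in the second tile means only two tiles are ever run. A real implementation would keep several tiles in flight and poll their events, so it would typically overshoot by a few tiles.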
Your question touches on a fundamental problem of parallelism:
what should happen when one of your parallel threads already has the answer to the problem?
OpenCL does not let you control a kernel's execution once it has been launched, not even from the host. This is a real limitation, but it is how it has to be: if the work-items could not run freely, detached from one another, the execution would not be fully parallel.
The only solution is to split the computation into small parts and check the completion of each of them. But sometimes the parts are already very small (in your case, 512x512 is quite small).
In your specific case I would process everything (512x512), after that I would use another kernel to get the final results out of the 512x512 set.
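My first thought is to have some sort of global memory flag that each work-item can read and set. This approach requires atomicity, so make sure to use the atomic_ functions, which take a pointer to the flag rather than its value.
__kernel void t(__global int *Data,
                volatile __global int *Flag) {
    // atomic_max(Flag, 0) returns the current flag value and leaves it
    // unchanged as long as the flag is never negative.
    if (atomic_max(Flag, 0) == 0) {
        // perform calc on Data
        if (PixelsFound) {
            // Set the flag to +1. atomic_inc returns the old value, so its
            // result must not be written back to *Flag.
            atomic_inc(Flag);
        }
    }
}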
Community, feel free to comment if this is known not to work!
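For comparison, here is a host-side analogue of the flag idea in standard C++, using std::atomic and std::thread instead of OpenCL work-items. It is only a sketch under that assumption: find_with_stop_flag is a hypothetical name, and each thread strides through the data the way work-items would be spread over a global range.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Each worker checks a shared atomic flag before every item and sets it when
// it finds the target. Returns the index of the target, or -1 if it is absent.
long find_with_stop_flag(const std::vector<int>& data, int target, unsigned num_threads) {
    assert(num_threads > 0);
    std::atomic<bool> found{false};
    std::atomic<long> found_index{-1};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            for (std::size_t i = t; i < data.size(); i += num_threads) {
                if (found.load(std::memory_order_acquire)) return;  // someone else finished
                if (data[i] == target) {
                    found_index.store(static_cast<long>(i));
                    found.store(true, std::memory_order_release);
                    return;
                }
            }
        });
    }
    for (auto& w : workers) w.join();
    return found_index.load();
}
```

As in the kernel version, the flag only prevents work that has not started yet. Threads that are already past the check finish their current item, so the saving depends on how early the match is found.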

How to use async_work_group_copy in OpenCL?

I would like to understand how to correctly use the async_work_group_copy() call in OpenCL. Let's have a look at a simplified example:
__kernel void test(__global float *x) {
    __local float xcopy[GROUP_SIZE];
    int globalid = get_global_id(0);
    int localid = get_local_id(0);
    event_t e = async_work_group_copy(xcopy, x + globalid - localid, GROUP_SIZE, 0);
    wait_group_events(1, &e);
}
The reference http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/async_work_group_copy.html says "Perform an async copy of num_elements gentype elements from src to dst. The async copy is performed by all work-items in a work-group and this built-in function must therefore be encountered by all work-items in a workgroup executing the kernel with the same argument values; otherwise the results are undefined."
But that doesn't answer my questions...
I would like to know if the following assumptions are correct:
1. The call to async_work_group_copy() must be executed by all work-items in the group.
2. The call must be made so that the source address is identical for all work-items and points to the first element of the memory area to be copied.
3. My source address is based on the global work-item id of the first work-item in the work-group, so I have to subtract the local id to make the address identical for all work-items.
4. Is the third parameter really the number of elements (not the size in bytes)?
Bonus questions:
a. Can I just use barrier(CLK_LOCAL_MEM_FENCE) instead of wait_group_events() and ignore the return value? If so, would that be probably faster?
b. Does a local copy also make sense for processing on CPUs or is that overhead as they share a cache anyway?
Regards,
Stefan
One of the main reasons for this function existing is to allow the driver/kernel compiler to efficiently copy the memory without the developer having to make assumptions about the hardware.
You describe what memory you need copied as if it were a single-threaded copy, and async_work_group_copy gets it done for you using the parallel hardware.
For your specific questions:
1. I have never seen async_work_group_copy used by only some of the work-items in a group, and I always assumed this is because it is required. I think the blocking nature of wait_group_events forces all work-items to be part of the copy.
2. Yes. Source (and destination) addresses need to be the same for all work-items.
3. You could subtract your local id to get the correct address, but I find that basing the address on the group id (get_group_id) solves this problem as well.
4. Yes. The third parameter is the number of elements, not the size in bytes.
a. No. Without the event-based wait, you will find that the barrier is reached almost immediately by the work-items, and the data won't necessarily have been copied. This makes sense because some OpenCL hardware might not even use the compute units at all to do the actual copy operation.
b. I think that CPU OpenCL implementations might guarantee L1 cache usage when you use local memory. The only way to know for sure if this performs better is to benchmark your application with various settings.
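The two ways of computing the source offset mentioned in this answer give the same result, which a short host-side check in C++ can confirm. The sketch assumes a one-dimensional range with no global offset and a global size that is a multiple of the local size; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Start of a work-group's slice, derived from a work-item's own ids.
std::size_t base_from_ids(std::size_t global_id, std::size_t local_id) {
    assert(local_id <= global_id);
    return global_id - local_id;
}

// The same start, derived from the group id alone.
std::size_t base_from_group(std::size_t group_id, std::size_t local_size) {
    return group_id * local_size;
}

// Checks every work-item of a 1-D range with no global offset.
bool bases_agree(std::size_t num_groups, std::size_t local_size) {
    for (std::size_t g = 0; g < num_groups; ++g)
        for (std::size_t l = 0; l < local_size; ++l) {
            std::size_t global_id = g * local_size + l;
            if (base_from_ids(global_id, l) != base_from_group(g, local_size)) return false;
        }
    return true;
}
```

Either form therefore satisfies the requirement that all work-items in a group pass the same source address. The group-id form is just shorter to write inside the kernel.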

Why is there a CL_DEVICE_MAX_WORK_GROUP_SIZE?

I'm trying to understand the architecture of OpenCL devices such as GPUs, and I fail to see why there is an explicit bound on the number of work items in a local work group, i.e. the constant CL_DEVICE_MAX_WORK_GROUP_SIZE.
It seems to me that this should be taken care of by the compiler, i.e. if a (one-dimensional for simplicity) kernel is executed with local workgroup size 500 while its physical maximum is 100, and the kernel looks for example like this:
__kernel void test(__global float* input) {
    int i = get_global_id(0);
    someCode(i);
    barrier(CLK_LOCAL_MEM_FENCE);
    moreCode(i);
    barrier(CLK_LOCAL_MEM_FENCE);
    finalCode(i);
}
then it could be converted automatically to an execution with work group size 100 on this kernel:
__kernel void test(__global float* input) {
    int i = get_global_id(0);
    someCode(5*i);
    someCode(5*i+1);
    someCode(5*i+2);
    someCode(5*i+3);
    someCode(5*i+4);
    barrier(CLK_LOCAL_MEM_FENCE);
    moreCode(5*i);
    moreCode(5*i+1);
    moreCode(5*i+2);
    moreCode(5*i+3);
    moreCode(5*i+4);
    barrier(CLK_LOCAL_MEM_FENCE);
    finalCode(5*i);
    finalCode(5*i+1);
    finalCode(5*i+2);
    finalCode(5*i+3);
    finalCode(5*i+4);
}
However, it seems that this is not done by default. Why not? Is there a way to make this process automated (other than writing a pre-compiler for it myself)? Or is there an intrinsic problem which can make my method fail on certain examples (and can you give me one)?
I think that the origin of the CL_DEVICE_MAX_WORK_GROUP_SIZE lies in the underlying hardware implementation.
Multiple threads run simultaneously on the compute units, and every one of them needs to keep state (for call, jmp, etc.). Most implementations use a stack for this, and if you look at the AMD Evergreen family, there is a hardware limit on the number of stack entries that are available (every stack entry has subentries). In essence, this limits the number of threads every compute unit can handle simultaneously.
As for whether the compiler could do this transformation: it could work, but it would mean recompiling the kernel, which isn't always possible. I can imagine situations where developers dump the compiled kernel for each platform in a binary format and ship it with their software just for "not so open-source" reasons.
Those constants are queried from the device by the compiler in order to determine a suitable work group size at compile-time (where compiling of course refers to compiling the kernel). I might be getting you wrong, but it seems you're thinking of setting those values by yourself, which wouldn't be the case.
The responsibility is within your code to query the system capabilities to be prepared for whatever hardware it will run on.

How to properly write a SIGPROF handler that invokes AsyncGetCallTrace?

I am writing a short and simple profiler (in C), which is intended to print out stack traces for threads in various Java clients at regular intervals. I have to use the undocumented function AsyncGetCallTrace instead of GetStackTrace to minimize intrusion and allow for stack traces regardless of thread state. The source code for the function can be found here: http://download.java.net/openjdk/jdk6/promoted/b20/openjdk-6-src-b20-21_jun_2010.tar.gz
in hotspot/src/share/vm/prims/forte.cpp. I found some man pages documenting JVMTI, signal handling, and timing, as well as a blog with details on how to set up the AsyncGetCallTrace call: http://jeremymanson.blogspot.com/2007/05/profiling-with-jvmtijvmpi-sigprof-and.html
What this blog is missing is the code to actually invoke the function within the signal handler (the author assumes the reader can do this on his/her own). I am asking for help in doing exactly this. I am not sure how and where to create the struct ASGCT_CallTrace (and the internal struct ASGCT_CallFrame), as defined in the aforementioned file forte.cpp. The struct ASGCT_CallTrace is one of the parameters passed to AsyncGetCallTrace, so I do need to create it, but I don't know how to obtain the correct values for its fields: JNIEnv *env_id, jint num_frames, and JVMPI_CallFrame *frames. Furthermore, I do not know what the third parameter passed to AsyncGetCallTrace (void* ucontext) is supposed to be?
The above problem is the main one I am having. However, other issues I am faced with include:
SIGPROF doesn't seem to be raised by the timer exactly at the specified intervals, but rather a bit less frequently. That is, if I set the timer to send a SIGPROF every second (1 sec, 0 usec), then in a 5 second run, I am getting fewer than 5 SIGPROF handler outputs (usually 1-3)
SIGPROF handler outputs do not appear at all during a Thread.sleep in the Java code. So, if a SIGPROF is to be sent every second, and I have Thread.sleep(5000);, I will not get any handler outputs during the execution of that code.
Any help would be appreciated. Additional details (as well as parts of code and sample outputs) will be posted upon request.
I finally got a positive result, but since little discussion was spawned here, my own answer will be brief.
The ASGCT_CallTrace structure (and the underlying ASGCT_CallFrame array) can simply be declared in the signal handler, thus existing only on the stack:
ASGCT_CallTrace trace;
JNIEnv *env;
global_VM_pointer->AttachCurrentThread((void **) &env, NULL);
trace.env_id = env;
trace.num_frames = 0;
ASGCT_CallFrame storage[25];
trace.frames = storage;
The following gets the uContext:
ucontext_t uContext;
getcontext(&uContext);
And then the call is just:
AsyncGetCallTrace(&trace, 25, &uContext);
I am sure there are some other nuances that I had to take care of in the process, but I did not really document them. I am not sure I can disclose the full current code I have, which successfully requests and obtains stack traces of any Java program asynchronously at fixed intervals. But if someone is interested in or stuck on the same problem, I am now able to help (I think).
On the other two issues:
[1] If a thread is sleeping and a SIGPROF is generated, the thread handles that signal only after waking up. This is normal, since it is the thread's job to handle the signal.
[2] The timer imperfections do not seem to appear anymore. Perhaps I mis-measured.

Should I care about thread safety of a static int (4 bytes) variable in ASP.NET?

I have the feeling that I should not care about thread-safe access to a
public static int MyVar = 12;
in ASP.NET.
I read and write this variable from various user threads. Let's suppose this variable stores the number of clicks on a certain button/link.
My theory is that no two threads can read/write this variable at the same time. It's just a simple 4-byte variable.
I do care about thread safety, but only for reference objects, List instances, or other types that take more cycles to read/update.
Am I wrong in my assumption?
EDIT
I understand this depends on my scenario, but that wasn't the point of the question. The question is: can thread-safe code be written with a static int variable without using the lock keyword?
Writing correct code is my problem. The answer seems to be: yes, if you write correct, simple code that is not too complicated, you can create thread-safe functions without the lock keyword.
If one thread simply sets the value and another thread reads the value, then a lock is not necessary; the read and write are atomic. But if multiple threads might be updating it and are also reading it to do the update (e.g., increment), then you definitely do need some kind of synchronization. If only one thread is ever going to update it even for an increment, then I would argue that no synchronization is necessary.
Edit (three years later) It might also be desirable to add the volatile keyword to the declaration to ensure that reads of the value always get the latest value (assuming that matters in the application).
The concept of thread 'safety' is too vague to be meaningful unfortunately. If you're asking whether you can read and write to it from multiple threads without the program crashing during the operation, the answer is almost certainly yes. If you're also asking if the variable is guaranteed to either be the old value or the new value without ever storing any broken intermediate values, the answer for this data type is again almost certainly yes.
But if your question is "will my program work correctly if I access this from multiple threads", then the answer depends entirely on what your program is doing. For example, if you run the following pseudo code in 2 threads repeatedly in most programming languages, eventually you'll hit the assertion.
if MyVar >= 1:
    MyVar = MyVar - 1
assert MyVar >= 0
Primitives like int are thread-safe in the sense that reads/writes are atomic. But as with most any type, it's left to you to do proper checking with more complex operations. For example, if (x > 0) x--; would be problematic in a multi-threaded scenario because x might change in between the if condition check and decrement.
A simple read or write on a field of 32 bits or less is always atomic. But you should post your read/write code so that we can check whether it is thread-safe.
Check out this post: http://msdn.microsoft.com/en-us/magazine/cc163929.aspx
It explains why you need to synchronize access to the integers in this scenario
Try Interlocked.Increment() or Interlocked.Add() and you'll be right. Your code complexity will be the same but you truly won't have to worry. If you're not worried about losing a few clicks in your counter, you can continue as you are.
Reading or writing integers is atomic. However, reading and then writing is not atomic. So, if you have one thread that writes and many that read, you may be able to get away without locks.
However, even though the operations are atomic, there are still potential multi-threading issues. In order for one thread to be guaranteed that another thread can see values it writes, you need a memory barrier. Otherwise, the compiler can optimize the code so that the variable stays in a register (or even optimize the operation away completely), so changes would be invisible from one thread to another.
You can establish a memory barrier explicitly (volatile or Thread.MemoryBarrier), or with the Interlocked class -- or with the lock statement (Monitor).
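The distinction the answers draw between an atomic read or write and an atomic read-modify-write can be illustrated with a small sketch. It is written in C++ rather than C#, on the assumption that std::atomic's fetch_add plays the role of Interlocked.Increment and compare_exchange_weak the role of Interlocked.CompareExchange; count_clicks and decrement_if_positive are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Increments a shared counter from several threads. fetch_add is an atomic
// read-modify-write, so no increment is lost.
int count_clicks(unsigned num_threads, int clicks_per_thread) {
    assert(clicks_per_thread >= 0);
    std::atomic<int> clicks{0};
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < num_threads; ++t)
        threads.emplace_back([&] {
            for (int i = 0; i < clicks_per_thread; ++i) clicks.fetch_add(1);
        });
    for (auto& th : threads) th.join();
    return clicks.load();
}

// Thread-safe version of `if (x > 0) x--;`. The decrement is applied only if
// x still holds the value that passed the check; otherwise it retries.
bool decrement_if_positive(std::atomic<int>& x) {
    int current = x.load();
    while (current > 0) {
        if (x.compare_exchange_weak(current, current - 1)) return true;
        // On failure, current has been refreshed with the latest value.
    }
    return false;
}
```

With a plain int and `++`, some increments in count_clicks could be lost under contention. The compare-and-exchange loop closes the gap between the check and the decrement, so the value can never drop below zero.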
