How should I view global and local work sizes? - OpenCL

I've been using OpenCL for a little while now for hobby purposes. I was wondering if someone could explain how I should view global and local work sizes. I've been playing around with them for a bit, but I cannot seem to wrap my head around the concept.
I have this piece of code, where the kernel is run with a global work size of 8 and a local work size of 4:
__kernel void foo(__global int *bar)
{
    bar[get_global_id(0)] = get_local_id(0);
}
The resulting contents of bar look like this:
{0, 1, 2, 3, 0, 1, 2, 3, 4}
I know why this happens given the work sizes I've used, but I can't seem to wrap my head around how I should think about it.
Does this mean that there are 4 threads working locally and 8 globally, so I have 4 * 8 threads running in total? And if so, what makes those 4 working locally special?
Or does it mean the kernel body simply has two counters, one local and one global? But then what is the point of that?
I know I might be a bit vague and my question might seem dumb, but I don't know how I can use this more optimally or how I should view it.

Global size is the total number of work items.
Work groups subdivide this total workload, and local size defines the size of each group within the global size.
So for a global work size of 8 and a local size of 4, each in 1 dimension, you will have 2 groups. Your get_global_id(0) will be different for each thread: 0…7. get_local_id(0) will return 0…3 for the 4 different threads within each group. This is what you're seeing in indices 0 through 7 of your output.
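As a reminder of how the two IDs relate (assuming no global work offset is used), the following identity holds for every work item:
get_global_id(0) == get_group_id(0) * get_local_size(0) + get_local_id(0)
So, for example, the work item with global ID 5 is local ID 1 within group 1.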
This also means that if your global work size is 8, only the first 8 items of bar will be set by your kernel. So anything beyond that (the value 4 at index 8 in your output) is undefined.
Does this mean that there are 4 threads working locally and 8 globally, so I have 4 * 8 threads running in total? And if so, what makes those 4 working locally special?
You're overthinking it. There are 8 threads in total. They are subdivided into 2 groups of 4 threads. What is "local" about the threads in those groups is that they share access to the same local memory. Threads which are not in the same group can only "communicate" via global memory.
Using local memory can hugely improve efficiency for some workloads:
It's very fast.
Threads in a work group can use barriers to ensure they are in lock-step, i.e. they can wait for one another to guarantee another thread has written to a specific local memory location. (Threads in different groups cannot wait for each other.)
But:
Local memory is small (typically a few dozen KiB per compute unit), and using all of it in one group usually has further efficiency penalties.
Local memory must be filled with data inside the kernel, and its contents are lost when the kernel completes. (Except for device-scheduled kernels in OpenCL 2.x.)
There are tight limits on group size due to hardware limitations.
So if you are not using local memory, work groups and therefore local work size are essentially irrelevant to you.
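To make the local-memory point concrete, here is a minimal sketch (kernel and buffer names are illustrative) of the usual pattern: each work item stages one element in local memory, a barrier synchronizes the group, and then one work item combines the group's elements. The __local buffer would be sized on the host with something like clSetKernelArg(kernel, 2, local_size * sizeof(float), NULL).
__kernel void sum_groups(__global const float *in,
                         __global float *group_sums,
                         __local float *scratch)
{
    int lid = get_local_id(0);

    // Stage one element per work item in fast local memory.
    scratch[lid] = in[get_global_id(0)];

    // Wait until every work item in this group has written its element.
    barrier(CLK_LOCAL_MEM_FENCE);

    // Work item 0 of each group adds up the group's elements.
    if (lid == 0) {
        float total = 0.0f;
        for (int i = 0; i < (int)get_local_size(0); ++i)
            total += scratch[i];
        group_sums[get_group_id(0)] = total;
    }
}
Without the local buffer and the barrier, the summing work item would have to re-read every element from much slower global memory.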

Related

JupyterLab Kernel Restarts when I load too much data

I'm running a Notebook on JupyterLab. I am loading in some large Monte Carlo chains as numpy arrays which have the shape (500000, 150). I have 10 chains which I load into a list in the following way:
import numpy as np

chains = []
for i in range(10):
    chain = np.loadtxt('my_chain_{}.txt'.format(i))
    chains.append(chain)
If I load 5 chains then all works well. If I try to load 10 chains, after about 6 or 7 I get the error:
Kernel Restarting
The kernel for my_code.ipynb appears to have died. It will restart automatically.
I have tried loading the chains in different orders to make sure there is not a problem with any single chain. It always fails when loading number 6 or 7 no matter the order, so I think the chains themselves are fine.
I have also tried to load 5 chains in one list and then, in the next cell, load the other 5, but the failure still happens when I get to chain 6 or 7, even when I split it up like this.
So it seems like the problem is that I'm loading too much data into the Notebook, or something like that. Does this seem right? Is there a workaround?
It is indeed possible that you are running out of memory, though it's unlikely that your whole system is running out of memory (unless it's a very small system). It is typical behavior that if Jupyter exceeds its memory limits, the kernel dies and restarts; see here, here and here.
Consider that if you are using the float64 datatype by default, the memory usage (in megabytes) per array is:
N_rows * N_cols * 64 / 8 / 1024 / 1024
For N_rows = 500000 and N_cols = 150, that's 572 megabytes per array. You can verify this directly using numpy's dtype and nbytes attributes (noting that the output is in bytes, not bits):
chain = np.loadtxt('my_chain_{}.txt'.format(i))
print(chain.dtype)
print(chain.nbytes / 1024 / 1024)
If you are trying to load 10 of these arrays, that's about 6 gigabytes.
One workaround is increasing the memory limits for Jupyter, per the posts referenced above. Another simple workaround is using a less memory-intensive floating point datatype. If you don't really need the digits of accuracy afforded by float64 (see here), you can simply use a smaller floating point representation, e.g. float32:
chain = np.loadtxt('my_chain_{}.txt'.format(i), dtype=np.float32)
chains.append(chain)
Given that you can get to 6 or 7 already, halving the data usage of each chain should be enough to get you to 10.
You could be running out of memory.
Try to load the chains one by one and then concatenate them.
chains = []
for i in range(10):
    chain = np.loadtxt('my_chain_{}.txt'.format(i))
    chains.append(chain)
    if i > 0:
        # Fold the newly loaded chain into the first array and drop it from
        # the list, so the list never holds more than two arrays at once.
        chains[0] = np.concatenate((chains[0], chains[1]), axis=0)
        chains.pop(1)

Foreach in R: optimise RAM & CPU use by sorting tasks (objects)?

I have ~200 .Rds datasets that I perform various operations on, in different scripts, as part of a pipeline. In most of these scripts I've begun with a for loop and upgraded to a foreach. My problem is that the dataset objects vary greatly in size (the original post included a histogram of file sizes in MB).
So if I optimise the number of cores used (I have a 12-core, 16 GB RAM machine at the office and a 16-core, 32 GB RAM machine at home), the loop will whip through the first 90 or so files without incident, but then the larger files bunch up and max out the total RAM allocation (remember that .Rds files are compressed, so objects are larger in RAM than on disk, but the variability in file size at least gives an indication of the problem). This causes workers to crash and typically leaves me with 1 to 3 cores running through the remainder of the big files (using .errorhandling = "pass"). I'm thinking it would be great to optimise the number of cores based on the number and RAM size of the jobs and the total available RAM, and figured others might have been in a similar dilemma and developed strategies to address it. Some approaches I've thought of but not tried:
Approach 1: first loop or list through the files on disk, potentially by opening and closing them, use object.size() to get their sizes in RAM, sort largest to smallest, cut the list halfway, reverse the order of the second half, and intersperse them: smallest, biggest, 2nd smallest, 2nd biggest, etc. 2 workers (or any even-numbered multiple) should therefore be working on the 'mean' RAM usage. However: worker 1 will finish its job faster than any other job in the stack and then go on to job 3, the 2nd smallest, likely finish that really quickly as well, and then take job 4, the second largest, while worker 2 is still on the largest. That means that by job 4 this approach has the machine processing the 2 largest RAM objects concurrently, the opposite of what we want.
Approach 2: sort objects by size-in-RAM, small to large. Starting from object 1, iteratively add subsequent objects' RAM usage until either the core count or the total available RAM would be exceeded. Run a foreach on that batch. Repeat. This would work but requires some convoluted coding (probably a for loop wrapper around the foreach which passes the foreach its task list each time). Also, if there are a lot of tasks which won't exceed the RAM (per my example), the cores limit on batch size means all 12 or 16 have to complete before the next 12 or 16 are started, introducing inefficiency.
Approach 3: sort small to large as in approach 2. Run foreach with all cores. This will churn through the small ones with maximum efficiency until the tasks get bigger, at which point workers will start to crash, reducing the number of workers sharing the RAM and thus increasing the chance that the remaining workers can continue. Conceptually this means cores-1 tasks fail and need to be re-run, but the code is easy and should run fast. I already have code that checks the output directory and removes tasks from the jobs list if they've already been completed, which means I could just re-run this approach; however, I should anticipate further losses, and therefore further reruns, unless I lower the core count.
Approach 4: as 3, but somehow close a worker (reduce the core count) BEFORE the task is assigned, so a task doesn't have to trigger a RAM overrun and fail in order to reduce the worker count. This would also mean not having to restart RStudio.
Approach 5: ideally there would be some intelligent queueing system in foreach that would do all of this for me, but beggars can't be choosers! Conceptually this would be similar to 4, above: for each worker, don't start the next task until there's sufficient RAM available.
Any thoughts appreciated from folks who've run into similar issues. Cheers!
I've thought a bit about this too.
My problem is a bit different: I don't get any crashes, but rather slowdowns due to swapping when there isn't enough RAM.
Things that may work:
randomize the order of iterations so that the workload is approximately evenly distributed (without needing to know the timings in advance)
similar to approach 5, add some barriers (make workers wait with a while loop and Sys.sleep()) while there is not enough memory (e.g. as determined via the {memuse} package).
Things I do in practice:
always store the results of iterations in foreach loops and test whether they have already been computed (i.e. the RDS file already exists)
skip some iterations if needed
rerun the "intensive" iterations using fewer cores

OpenCL - Setting up local memory for a large dataset

I have the following problem. I have 6000 * 1000 elements that I need to work on in parallel (for the most part). However, at some point in the kernel, those 6000 items have to be summed together.
When I tried to set up my kernel with globalThreads = 6000 * 1000 and localThreads = 6000, it threw an error (CL_INVALID_WORK_GROUP_SIZE). It seems that the maximum number of work items in a work group is limited.
How can I work around this problem?
You can't set the local size that high. Most hardware can only do 128 to 1024 or so local threads (clGetDeviceInfo with CL_DEVICE_MAX_WORK_GROUP_SIZE will tell you the limit for your device). You can leave the local size NULL and the runtime will pick a size for you, but if your global size is not a multiple of your device's preferred work group size, this might not give you optimal performance. For top performance you can experiment with different local sizes and then specify both, but in OpenCL 1.x the global size must be a multiple of the local size. Round the global size up and then check the work item index in your kernel to see whether it is below your real work size.
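As a rough sketch of the query and the round-up (error handling omitted; a device handle is assumed to exist, and the local size of 256 is only an example):
size_t max_wg_size;
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                sizeof(max_wg_size), &max_wg_size, NULL);

size_t real_work = 6000 * 1000;
size_t local  = 256;   /* anything <= max_wg_size; tune for your device */
size_t global = ((real_work + local - 1) / local) * local;   /* round up to a multiple of local */
global and local are then passed to clEnqueueNDRangeKernel, and inside the kernel any work item with get_global_id(0) >= real_work (passed in as a kernel argument) simply returns.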

Optimal Local/Global worksizes in OpenCL

I am wondering how to choose optimal local and global work sizes for different devices in OpenCL.
Is there any universal rule for AMD, NVIDIA and Intel GPUs?
Should I analyze the physical build of the devices (number of multiprocessors, number of streaming processors per multiprocessor, etc.)?
Does it depend on the algorithm/implementation? I ask because I saw that some libraries (like ViennaCL), in order to find good values, simply test many combinations of local/global work sizes and choose the best one.
NVIDIA recommends that your (local) work-group size be a multiple of 32 (equal to one warp, which is their atomic unit of execution, meaning that 32 threads/work-items are scheduled together atomically). AMD, on the other hand, recommends a multiple of 64 (equal to one wavefront). I'm unsure about Intel, but you can find this type of information in their documentation.
So say you are doing some computation and you have 2300 work-items (the global size); 2300 is divisible by neither 64 nor 32. If you don't specify the local size, OpenCL may choose a poor local size for you. When your local size is not a multiple of the atomic unit of execution, you get idle threads, which leads to bad device utilization. Thus, it can be beneficial to add some "dummy" threads so that your global size becomes a multiple of 32/64, and then use a local size of 32/64 (the global size has to be divisible by the local size). For 2300 you can add 4 dummy threads/work-items, because 2304 is divisible by 32. In the actual kernel, you can write something like:
int globalID = get_global_id(0);
if (globalID >= realNumberOfThreads)
    globalID = 0;
This makes the four extra threads do the same work as thread 0 (it is often faster to do some redundant work than to have many idle threads).
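On the host side, the padded launch for this example would look something like the sketch below (the kernel argument index for realNumberOfThreads is illustrative, and error handling is omitted):
int real_number_of_threads = 2300;
size_t local = 32;
size_t global = 2304;   /* next multiple of 32 above 2300 */

clSetKernelArg(kernel, 1, sizeof(int), &real_number_of_threads);
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);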
Hope that answered your question. GL HF!
If your processing uses little memory (e.g. to store kernel private state), you can choose the most intuitive global size for your problem and let OpenCL choose the local size for you.
See my answer here: https://stackoverflow.com/a/13762847/145757
If memory management is a central part of your algorithm and will have a great impact on performance, you should indeed go a little further and first check the maximum local size (which depends on the local/private memory usage of your kernel) using clGetKernelWorkGroupInfo, which in turn will determine your global size.
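For reference, a minimal sketch of that query (assuming kernel and device handles already exist):
size_t kernel_wg_size;
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(kernel_wg_size), &kernel_wg_size, NULL);
/* kernel_wg_size is the largest local size this particular kernel can be
   launched with on this device, given its register and memory usage. */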

Work_dim in NDRange

I cannot understand what work_dim is for in clEnqueueNDRangeKernel().
So, what is the difference between work_dim = 1 and work_dim = 2?
And why are work items grouped into work groups?
Is a work item or a work group a thread running on the device (or neither)?
Thanks in advance!
work_dim is the number of dimensions for the clEnqueueNDRangeKernel() execution.
If you specify work_dim = 1, then the global and local work sizes are unidimensional. Thus, inside the kernels you can only access info in the first dimension, e.g. get_global_id(0), etc.
If you specify work_dim = 2 or 3, then you must also specify 2- or 3-dimensional global and local work sizes; in that case, you can access info inside the kernels in 2 or 3 dimensions, e.g. get_global_id(1) or get_group_id(2).
In practice you can do everything in 1D, but when dealing with 2D or 3D data it may be simpler to use 2/3-dimensional kernels directly; for example, with 2D data such as an image, if each thread/work-item is to deal with a single pixel, it could handle the pixel at coordinates (x, y), with x = get_global_id(0) and y = get_global_id(1).
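As a small sketch (kernel name and data layout are illustrative, assuming an 8-bit grayscale image stored row by row), a 2-dimensional kernel enqueued with work_dim = 2 and a global size of {width, height} might look like:
__kernel void invert(__global uchar *img, int width, int height)
{
    int x = get_global_id(0);   /* column */
    int y = get_global_id(1);   /* row    */

    /* Guard in case the global size was rounded up past the image bounds. */
    if (x < width && y < height)
        img[y * width + x] = 255 - img[y * width + x];
}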
A work-item is a thread, while work-groups are groups of work-items/threads.
I believe the division into work-groups and work-items is related to the hardware architecture of GPUs and other accelerators (e.g. Cell/BE); you can map the execution of work-groups onto GPU Streaming Multiprocessors (in NVIDIA terminology) or SPUs (in IBM/Cell terminology), while the corresponding work-items run inside the execution units of those Streaming Multiprocessors and/or SPUs. It's not uncommon to have a work group size of 1 when executing kernels on a CPU (e.g. for a quad-core, you would have 4 work groups, each with one work item), though in my experience it's usually better to have more work groups than CPU cores.
Check the OpenCL reference manual, as well as the OpenCL manual for whichever device you are programming. The quick reference card is also very helpful.
