-module(demo).
-export([factorial/1]).
factorial(0) -> 1;
factorial(N) ->
N * factorial(N-1).
The factorial is not tail recursive but why is it not overflowing the stack? I am able to get factorial of 100,000 without stack overflow but takes some time to compute.
An Erlang process's "stack" is not stored in the stack given by the system to the process (which is usually a few megabytes) but in the heap. As far as I know, it will grow unbounded until the system refuses to give the VM more memory.
The size includes 233 words for the heap area (which includes the stack). The garbage collector increases the heap as needed.
The main (outer) loop for a process must be tail-recursive. Otherwise, the stack grows until the process terminates.
Source
If you monitor the Erlang VM process in an process monitor like Activity Monitor on OSX or top on other UNIX-like systems, you'll see that the memory usage will keep on increasing until the calculation is complete, at which point a part of the memory (the one where the "stack" is stored) will be released (this happens gradually over a few seconds after the function returns for me).
Related
Context:
The need is to simulate a net of related discrete elements (complex electronic circuit). Thus each component receive input from several other components and output to several others.
The intended design is to have a kernel, with a configuration argument defining which component it shall represent. Each component of the circuit is represented by a work-item, and all the circuit will fit in a single work-group (or adequate splitting of the circuit will be done so each work-group can manage all the components as work-items).
The problem:
Is it possible, and in case how? to have some work-items waiting for other work-items data?
A work-item generate an output to an array (at a data-driven position). Another work-item needs to wait for this to happens before to start making it processing.
The net has no loops, thus, it is not possible that a single work-item needs to run twice.
Attempts:
In the following example, each component can have a maximum of one single input (to simplify) making the circuit a tree where the input to the circuit is the root, and the 3 outputs are leafs.
inputIndex modelize this tree, by indicating for each component which other component provide it input. The first component take itself as input but the kernel manage this case (for simplification).
result save the result of each component (voltage, intensity, etc.)
inputModified indicate if the given component already calculated his output.
// where the data come from (index in result)
constant int inputIndex[5]={0,0, 0, 2, 2};
kernel void update_component(
local int *result, // each work-item result.
local int *inputModified // If all inputs are ready (one only for this example)
) {
int id = get_local_id(0);
int size = get_local_size(0);
int barrierCount = 0;
// inputModified is a boolean indicating if the input is ready
inputModified[id]=(id!=0 ? 0 : 1);
// make sure all input are false by default (except the first input).
barrier(CLK_LOCAL_MEM_FENCE);
// Wait until all inputs are ready (only one in this example)
while( !inputModified[inputIndex[id]] && size > barrierCount++)
{
// If the input is not ready, wait for it
barrier(CLK_LOCAL_MEM_FENCE);
}
// all inputs are ready, compute output
if (id!=0) result[id] = result[inputIndex[id]]+1;
else result[0]=42;
// make sure any other work-item depending on this is unblocked
inputModified[id]=1;
// Even if finished, we needs to "barrier" for other working items.
while (size > barrierCount++)
{
barrier(CLK_LOCAL_MEM_FENCE);
}
}
This example has N barriers for N components, making it worse than a sequential solution.
Note: this is only the kernel, as the C++ minimal host is quite long. In case of need, I could find a way to add it.
Question:
Is it possible to efficiently, and by the kernel itself to have the different work-items waiting for their data to be provided by other work-items? Or what solution would be efficient?
This problem is (for me) not trivial to explain and I am far from expert in OpenCL. Please, be patient and feel free to ask if anything is unclear.
From documentation of barrier
https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/barrier.html
If barrier is inside a loop, all work-items must execute the barrier for each iteration of the loop before any are allowed to continue
execution beyond the barrier.
But a while loop (containing a barrier) in the kernel has this condition:
inputModified[inputIndex[id]]
this can change its behavior with id of thread and lead to undefined behavior. Besides, another barrier before that
barrier(CLK_LOCAL_MEM_FENCE);
already synchronizes all work-items in the work-group so the while loop is redundant, even if it works.
Also the last barrier loop is redundant
while (size > barrierCount++)
{
barrier(CLK_LOCAL_MEM_FENCE);
}
when kernel ends, it does synchronize all workitems.
If you are meant to send some message to out-of-workgroup workitems, then you can only use atomic variables. Even when using atomics, you should not assume any working/issuing order between any two workitems.
Your question
how? to have some work-items waiting for other work-items data? A
work-item generate an output to an array (at a data-driven position).
Another work-item needs to wait for this to happens before to start
making it processing. The net has no loops, thus, it is not possible
that a single work-item needs to run twice.
can be answered with an OpenCL 2.x feature "dynamic parallelism" which lets a workitem spawn new workgroups/kernels inside kernel. It is much more efficient than waiting on a spin-wait loop and absolutely more hardware-independent than relying on number of in-flight threads a GPU supports (when GPU can't handle that many in-flight threads, any spin-wait will dead-lock, order of threads does not matter).
When you use barrier, you don't need to inform other threads about "inputModified". Data of result is already visible within workgroup.
If you can't use OpenCL v2.x, then you should process a tree using BFS:
start 1 workitem for top node
process it and prepare K outputs and push them into a queue
end kernel
start K workitems (each pop elements from queue)
process them and prepare N outputs and push them into queue
end kernel
repeat until queue doesn't have any more elements
Number of kernel calls is equal to maximum depth of tree, not number of nodes.
If you need a quicker synchronization than "kernel launches", then use a single workgroup for whole tree, use barrier instead of kernel recalls. Or, process first few steps on CPU, have multiple sub-trees and send them to different OpenCL workgroups. Perhaps computing on CPU until there are N sub-trees where N=compute units of GPU could be better for workgroup-barrier based faster asynchronous computing of sub-trees.
There is also a barrierless, atomicless and single-kernel-call way for this. Start tree from bottom and go up.
Map all deepest level child nodes to workitems. Move each of them to the top while recording their path(node id, etc) within their private memory / some other fast memory. Then have them traverse back top-down through that recorded path, computing on the move, without any synchronizations nor even atomics. This is less work efficient than barrier/kernel-call versions but the lack of barrier and being on totally asynchronous paths should make it fast enough.
If tree has 10 depth, this means 10 node pointers to save, not so much for private registers. If tree depth is about 30 40 then use local memory with less threads in each workgroup; if it is even more, then allocate global memory.
But you may need to sort the workitems on their spatiality / tree's topology to make them work together faster with less branching.
This way looks simplest to me, so I suggest you to try this barrierless version first.
If you want only data-visibility per workitem instead of group or kernel, use fence: https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/mem_fence.html
For a university project I'm currently working on a slurm supercomputer cluster and have written a number of C programs using MPI.
While profiling one of them I have observed that the time elapsed between an MPI_Send and an accompanying MPI_Recv operation is a mostly linear function of the message length. However, at around 32 MiB there is a sudden jump in latency from around 10ms to around 20ms. This happens both for two processes on the same node and two processes on separate nodes.
Now I would like to find out why this happens. I'm aware that this is not an MPI intrinsic phenomenon but must be related to the underlying hardware setup, but I'm not sure where to begin looking for an explanation. What are some possible explanations for this and how could I check whether they apply in my case?
I'm trying to get something to work but I run out of ideas so I figured I would ask here.
I have a kernel that has a large global size (usually 5 Million)
Each of the threads can require up to 1Mb of global memory (exact size not known in advance)
So i figured... ok, on my typical target GPU I have 6Gb and I can run 2880 threads in parrallel, more than enough right ?
My idea is to create a big buffer (well actually 2 because of the max buffer size limitation...)
Each thread pointing to a specific global memory area (with the coalescence and stuff, but you get the idea...)
My problem is, How do I know which thread is currenctly being run (in the kernel code) to point to the right memory area ?
I did find the cl_arm_get_core_id extension but this only gives me the workgroup, not the acutal thread being used, plus this does not seem to be available on all GPUs, since it's an extension.
I have the option to have work_group_size = nb_compute_units / nb_cores and have the offset to be arm_get_core_id() * work_group_size + global_id() % work_group_size
But maybe this group size is not optimal, and the portability issue still exists.
I can also enqueue a lot of kernels calls with global size 2880, and there I obviously know where to point to with the global Id.
But won't this lead to a lot of overhead because of the 5Million / 2880 kernel calls ? Plus any work group that finishes before the others will be idle until all workgroups for this call have finished their job.
Any ideas to do this properly are very welcome !
Well, you are storing 1MB per WI for temporal computations (because you are not saving them, otherwise your wouldn't have memory).
Then, why not simply let it spill to global memory? Does the compiler complain? If it does complain, then you need other approaches:
One possibility is to create a queue (just a boolean array), of the memory zones empty for usage by the WorkGroups. And every time a new workgroup is launched it takes an empty slot and sets the boolean to "used" state. You can do this with atomic_cmpxchg() atomic operation.
It may introduce a small overhead to launch each WG, but it would be probably negligible if each WI is needing 1MB of global memory.
Here you have a small example of how to do atomic_cmpxchg() LINK
I have some queries regarding how data transfer happens between work items and global memory. Let us consider the following highly inefficient memory bound kernel.
__kernel void reduceURatios(__global myreal *coef, __global myreal *row, myreal ratio)
{
size_t gid = get_global_id(0);//line no 1
myreal pCoef = coef[gid];//line no 2
myreal pRow = row[gid];//line no 3
pCoef = pCoef - (pRow * ratio);//line no 4
coef[gid] = pCoef;//line no 5
}
Do all work items in a work group begin executing line no 1 at the
same time?
Do all work items in a work group begin executing line no 2 at the
same time?
Suppose different work items in a work group finish executing line
no 4 at different times. Do the early finished ones wait so that,
all work items transfer the data to global memory at the same time
in line no 5?
Do all work items exit the compute unit simultaneously such that
early finished work items have to wait until all work items have
finished executing?
Suppose each kernel has to perform 2 reads from global memory. Is it
better to execute these statements one after the other or is it
better to execute some computation statements between the 2 read
executions?
The above shown kernel is memory bound for GPU. Is there any way by
which performance can be improved?
Are there any general guidelines to avoid memory bounds?
Find my answers below: (thanks sharpneli for the good comment of AMD GPUs and warps)
Normally YES. But depends on the hardware. You can't directly expect that behavior and design your algorithm on this "ordered execution". That's why barriers and mem_fences exists. For example, some GPU execute in order only a sub-set of the WG's WI. In CPU it is even possible that they run completely free of order.
Same as answer 1.
As in the answer 1, they will really unlikely finish at different times, so YES. However you have to bear in mind that this is a good feature, since 1 big write to memory is more efficient than a lot of small writes.
Typically YES (see answer 1 as well)
It is better to intercalate the reads with operations, but the compiler will already account for this and reorder the operation order to hide the latency of reading/writting effects. Of course the compiler will never move around code that can change the result value. Unless you disable manually the compiler optimizations this is a typical behavior of OpenCL compilers.
NO, it can't be improved in any way from the kernel point of view.
The general rule is, each memory cell of the input is used by more than one WI?
NO (1 global->1 private) (this is the case of your kernel in the question)
Then that memory is global->private, and there is no way to improve it, don't use local memory since it will be a waste of time.
YES (1 global-> X private)
Try to move the global memory lo local memory first, then read directly from local to private for each WI. Depending on the reuse amount (maybe only 2 WIs use the same global data) it may not even be worth if the computation amount is already high. You have to consider the tradeoff between extra memory usage and global access gain. For image procesing it is typically a good idea, for other types of processes not so much.
NOTE: The same process applies if you try to write to global memory. It is always better to operate in local memory by many WI before writing to global. But if each WI writes to an unique address in global, then write directly.
as you know there are two kind of process, i/o bound and cpu bound...
i need a cpu bound program that never terminates itself...
for example; is it like i wanted?
while(1){
for(int i=0;i<1000; i++);
}
for(;;); should do it!
First of all, why do you want a never terminating CPU bound program?
And yes, that would work, but you don't really need the inner for-loop. The while-loop will run forever on its own (assuming the compiler doesn't optimize it away).
There aren't only two kinds of processes. Even if you consider what resource is bounding the performance, there are more than two. The classic other ones are bandwidth, memory, database connections -- any finite resource or blocking one can be a bottleneck.
But, yes, your process is CPU-bound -- you can see that by looking at your task manager (Windows) or Activity Monitor (Mac) or top (Linux) and seeing it take 100% of your CPU.
Those programs probably aren't CPU bound.
I suggest implementing the Sieve of Eratosthenes, or something like that. How about a program that takes a number (say 42), divides it by Pi 1000 times, multiplies it by Pi 1000 times, subtracts the result from the original number, adds it to a variable and increments a counter. Then repeat that indefinitely. I suppose you might overflow one of the numeric values, but that should be fixable / preventable.