We were recently reading the BEAM Book as part of a reading group.
In appendix B.3.3 it states that the call_last instruction has the following behavior:

Deallocate "Deallocate" words of stack, then do a tail recursive call to the function of arity "Arity" in the same module at label "Label".
Based on our current understanding, a tail-recursive call would imply that the stack memory allocated by the current call can simply be reused.
As such, we were wondering what is actually being deallocated from the stack, and why it needs to be deallocated before the tail-recursive call instead of jumping straight into it.
In asm for CPUs, an optimized tailcall is just a jump to the function's entry point, i.e. running the whole function body as a loop body in the case of tail-recursion. (No return address is pushed, so when you reach the base case there's just one return back to the ultimate parent.)
I'm going to take a wild guess that Erlang / BEAM bytecode is remotely similar, even though I know nothing about it specifically.
When execution reaches the top of a function, it doesn't know whether it got there by a recursive call or by a call from another function, so it would have to allocate more stack space if it needed any.
If you want to reuse already-allocated stack space, you'd have to further optimize the tail-recursion into an actual loop inside the function body, not recursion at all anymore.
Or to put it another way, to tailcall anything, you need the callstack in the same state it was in on function entry. Jumping instead of calling loses the opportunity to do any cleanup after the called function returns, because it returns to your caller, not to you.
But can't we just put the stack cleanup in the recursion base case that does actually return instead of tailcalling? Yes, but that only works if the "tailcall" targets a point in this function after the allocation of new space is already done, not the entry point that external callers use. Those two changes are exactly what turning tail-recursion into a loop means.
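To make that last point concrete, here's an illustrative sketch in plain C (nothing BEAM-specific; just the shape of the transformation): the tail-recursive form can compile to a jump, and turning it into a real loop is what lets the same stack frame be reused.

/* Tail-recursive: the recursive call is the last thing that happens. */
int sum_rec(int n, int acc) {
    if (n == 0) return acc;          /* base case: the only real return */
    return sum_rec(n - 1, acc + n);  /* tail call: can compile to a jump */
}

/* The loop a compiler can turn it into: same frame reused every pass. */
int sum_loop(int n, int acc) {
    for (;;) {
        if (n == 0) return acc;
        acc += n;                    /* update the "arguments" in place... */
        n -= 1;
    }                                /* ...and jump back past the entry */
}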
(Disclaimer: This is a guess)
A tail-recursive call does not mean that the function cannot have made other calls before it or used the stack in the meantime. In that case, the stack allocated for those calls must be deallocated before performing the tail-recursive call. call_last deallocates the surplus stack and then behaves like call_only.
You can see an example if you erlc -S the following code:
-module(test).
-compile(export_all).

fun1([]) ->
    ok;
fun1([1|R]) ->
    fun1(R).

funN() ->
    A = list(),
    B = list(),
    fun1([A, B]).

list() ->
    [1,2,3,4].
I've annotated the relevant parts:
{function, fun1, 1, 2}.
  {label,1}.
    {line,[{location,"test.erl",4}]}.
    {func_info,{atom,test},{atom,fun1},1}.
  {label,2}.
    {test,is_nonempty_list,{f,3},[{x,0}]}.
    {get_list,{x,0},{x,1},{x,2}}.
    {test,is_eq_exact,{f,1},[{x,1},{integer,1}]}.
    {move,{x,2},{x,0}}.
    {call_only,1,{f,2}}. % No stack allocated, so nothing to deallocate
  {label,3}.
    {test,is_nil,{f,1},[{x,0}]}.
    {move,{atom,ok},{x,0}}.
    return.

{function, funN, 0, 5}.
  {label,4}.
    {line,[{location,"test.erl",10}]}.
    {func_info,{atom,test},{atom,funN},0}.
  {label,5}.
    {allocate_zero,1,0}.          % Allocate 1 slot on the stack
    {call,0,{f,7}}.               % Leaves the result in {x,0} (register 0)
    {move,{x,0},{y,0}}.           % Move the result from {x,0} to the stack because the next call needs {x,0} free
    {call,0,{f,7}}.               % Leaves the result in {x,0} (register 0)
    {test_heap,4,1}.
    {put_list,{x,0},nil,{x,0}}.   % Create a list with only the last value: [B]
    {put_list,{y,0},{x,0},{x,0}}. % Prepend A (from the stack), creating [A, B] ([A | [B]]) in {x,0}
    {call_last,1,{f,2},1}.        % Tail call, deallocating the 1 stack slot first

{function, list, 0, 7}.
  {label,6}.
    {line,[{location,"test.erl",15}]}.
    {func_info,{atom,test},{atom,list},0}.
  {label,7}.
    {move,{literal,[1,2,3,4]},{x,0}}.
    return.
EDIT:
To actually answer your questions:
Each Erlang process's memory is used for both its stack and its heap, which share the same memory block and grow towards each other from opposite ends (the process's GC triggers when they meet).
"Allocating" in this case means increasing the space used for the stack, and if that space is not going to be used anymore, it must be deallocated (returned to the memory block) so it can be used again later (either as heap or as stack).
I am trying to use this code, but the kernel exits after executing the loop only once.
If I remove the "while(...)" line, the loop runs, but the results are of course a mess.
If I declare "volatile __global uint *g_barrier", it freezes the PC with a black screen for a while and then the program deadlocks.
__kernel void Some_Kernel(__global uint *g_barrier)
{
    uint i, t;
    for (i = 1; i < MAX; i++) {
        // some useful code here
        barrier(CLK_GLOBAL_MEM_FENCE);
        if (get_local_id(0) == 0) atomic_add(g_barrier, 1);
        t = i * get_num_groups(0);
        while (*g_barrier < t); // try to sync it all
    }
}
You seem to be expecting all work groups to be scheduled to run in parallel. OpenCL does not guarantee this to happen. Some work groups may not start until some other work groups have entirely completed running the kernel.
Moreover, barriers only synchronise within a work group. Atomic operations on global memory are atomic with regard to other work groups too, but there is no guarantee about order.
If you need other work groups to complete some code before running some other code, you will need to enqueue each of those chunks of work separately on a serial command queue (or appropriately connect them using events on an out-of-order queue). So for your example code, you need to remove your for and while loops, and enqueue your kernel MAX-1 times and pass i as a kernel argument.
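A minimal host-side sketch of that approach, assuming an in-order command queue and that queue, kernel, global_size, and MAX are already set up on the host (those names, and i being the kernel's second argument, are assumptions for illustration):

/* Run the kernel MAX-1 times; an in-order queue serialises the launches. */
for (cl_uint i = 1; i < MAX; i++) {
    clSetKernelArg(kernel, 1, sizeof(cl_uint), &i);  /* pass i (assumed arg 1) */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           &global_size, NULL, 0, NULL, NULL);
}
clFinish(queue);  /* block until every iteration has completed */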
Depending on the capabilities of your device and the size of your data set, your other option is to submit only one large work group, though this is unlikely to give you good performance unless you have a lot of such smaller tasks which are independent from one another.
(I will point out that there is a good chance your question suffers from the XY problem - you have not stated the overall problem your code is trying to solve. So there may be better solutions than the ones I have suggested.)
Some friends and I are trying to build a simulation of an amusement park, and we have almost everything done except one thing: we need to implement a barrier for synchronisation, and for that we need a communicator that encompasses every process except the one with rank zero. I'm using MPI_Group_excl() to specify that the group should not contain process zero. Here is the fragment of my code that creates the group and the communicator:
MPI_Group nonzero_group, world;
MPI_Comm_group(MPI_COMM_WORLD, &world);
int zero[1];
zero[0] = 0;
MPI_Group_excl(world, 1, zero, &nonzero_group);
MPI_Comm nonzero;
MPI_Comm_create(MPI_COMM_WORLD, world, &nonzero);
But when I test my program with an MPI_Bcast() from process 1 to all processes in the 'nonzero' communicator, process zero also takes part in the broadcast and receives the buffer.
How can I make a group that has all processes from 1 to N but not process zero?
That can be achieved with MPI_Comm_split()
int rank;
MPI_Comm comm;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_split(MPI_COMM_WORLD, (0 == rank) ? MPI_UNDEFINED : 0, 0, &comm);
On MPI_COMM_WORLD rank 0, comm is MPI_COMM_NULL; on the other ranks, it is the valid communicator you expect.
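A hedged usage sketch of what follows (the payload here is hypothetical): since every rank passes the same key, relative order is preserved, so world rank 1 becomes rank 0 of the new communicator.

/* Only ranks that got a real communicator take part. */
if (comm != MPI_COMM_NULL) {
    int buf = 42;                      /* hypothetical payload */
    /* World rank 1 is rank 0 of comm, since relative order is kept. */
    MPI_Bcast(&buf, 1, MPI_INT, 0, comm);
    MPI_Comm_free(&comm);              /* release the communicator when done */
}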
I am solving this problem: I am implementing cyclic mapping. I have 4 processors, so one task is mapped onto processor 1 (the root) and the other three are workers. As input I have several integers, e.g. 0-40, and I want each worker to receive its share (in this case, 10 integers per worker), do some computation, and save the result.
I am using MPI_Send to send the integers from the root, but I don't know how to receive several numbers from the same process (the root). Also, I send each int with the buffer size fixed at 1; when the number is e.g. 12, it does bad things. How do I check the length of an int?
Any advice would be appreciated. Thanks.
I'll assume you're working in C++, though your question doesn't say. Anyway, let's look at the arguments of MPI_Send:
MPI_SEND(buf, count, datatype, dest, tag, comm)
The second argument specifies how many data items you want to send. This call basically means "buf points to a place in memory where there are count values, all of them of type datatype, one after the other: send them". This lets you send the contents of an entire array, like this:
int values[10];
for (int i = 0; i < 10; i++)
    values[i] = i;
MPI_Send(values, 10, MPI_INT, 1, 0, MPI_COMM_WORLD);
This will start reading memory at the start of values, and keep reading until 10 MPI_INTs have been read. (MPI_INT is the MPI datatype for a C int; MPI_INTEGER is its Fortran counterpart.)
For your case of distributing numbers between processes, this is how you do it with MPI_Send:
int values[40];
for (int i = 0; i < 40; i++)
    values[i] = i;
for (int i = 1; i < 4; i++) // start at rank 1: don't send to ourselves
    MPI_Send(values + 10*i, 10, MPI_INT, i, 0, MPI_COMM_WORLD);
However, this is such a common operation in distributed computing that MPI gives it its very own function, MPI_Scatter. Scatter does exactly what you want: it takes one array and divides it up evenly among all the processes that call it. This is a collective communication call, which is a slightly advanced topic, so if you're just learning MPI (which it sounds like you are), feel free to skip it until you're comfortable using MPI_Send and MPI_Recv.
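For reference, here is a minimal sketch of the same distribution using MPI_Scatter; note that, unlike the send loop above, the root also receives a chunk of its own:

int values[40];     /* only meaningful on the root */
int my_values[10];  /* each rank's share, including the root's */
int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0)
    for (int i = 0; i < 40; i++)
        values[i] = i;
MPI_Scatter(values, 10, MPI_INT,    /* send 10 ints to each rank */
            my_values, 10, MPI_INT, /* each rank receives 10 ints */
            0, MPI_COMM_WORLD);     /* root is rank 0 */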
I have declared an int value in my main, and every process has initialized this value. All of them store a value which I want to write to the screen after the computation is finished. Is broadcast a solution? E.g., how would I implement it?
int i;
int value = 0; /* assume initialized, as described above */
int numtasks, myrank, left, right;
MPI_Status status;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
left = myrank - 1; if (left < 0) left = numtasks - 1;
right = myrank + 1; if (right >= numtasks) right = 0;
if (myrank == 0) {
    value = value + myrank;
    MPI_Send(&value, 1, MPI_INT, right, 99, MPI_COMM_WORLD);
    MPI_Recv(&value, 1, MPI_INT, left, 99, MPI_COMM_WORLD, &status);
}
else if (myrank == (numtasks - 1)) {
    MPI_Recv(&value, 1, MPI_INT, left, 99, MPI_COMM_WORLD, &status);
    value = value + myrank;
    MPI_Send(&value, 1, MPI_INT, right, 99, MPI_COMM_WORLD);
}
else {
    MPI_Recv(&value, 1, MPI_INT, left, 99, MPI_COMM_WORLD, &status);
    value = value + myrank;
    MPI_Send(&value, 1, MPI_INT, right, 99, MPI_COMM_WORLD);
}
These should form a logical ring. I do one computation (the sum of all ranks), and process 0 gets the result. I want this result (6 for 4 processes) to be printed by each of the processes after the computation. But I don't see exactly how and where to use the barrier.
There is also one more thing: after all N-1 sends (where N is the number of processes), I should have the sum of all ranks in each of the processes. In my code I get this sum only in process 0... It might be a bad approach :-(
Some more detail about the structure of your code would help, but it sounds like you can just use MPI_Barrier. Your processes don't need to exchange any data; they just have to wait until everyone has reached the point in your code where you want the printing to happen, which is exactly what Barrier does.
EDIT: In the code you posted, the Barrier would go at the very end (after the if statement), followed by a printf of value.
However, your code will not compute the total sum of all ranks in all nodes, since process i only receives the summed ranks of the first i-1 processes. If you want each process to have the total sum at the end, then replacing Barrier with Broadcast is indeed the best option. (In fact, the entire if statement AND the Broadcast could be replaced by a single MPI_Reduce() call, but that wouldn't really help you learn MPI. :) )
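For illustration, a minimal sketch of the broadcast variant, placed after the if statement in your code (this assumes, as in your ring, that value on rank 0 holds the final sum):

MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD); /* rank 0 shares the sum */
printf("rank %d: sum = %d\n", myrank, value);     /* every rank prints it  */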