In Metal, does one vertex shader complete before the next vertex shader executes? - vertex-shader

Suppose a Metal vertex shader A updates a buffer buf. Also suppose I have a second vertex shader B that is encoded after A. Can B use the results in buf or is it possible that B will begin executing before A has finished, meaning the contents of the buffer are not ready?

The second vertex shader B is free to execute before vertex shader A if they are encoded in the same MTLRenderCommandEncoder. If you'd like to read the output of A in B, then they must be encoded with separate MTLRenderCommandEncoders (a sketch of that pattern is given after the quoted documentation below).
Note, however, that the same is not true of compute dispatches within an MTLComputeCommandEncoder. The relevant part of the documentation states:
Executing a Compute Command
To encode a command to execute a compute function, call the dispatchThreadgroups:threadsPerThreadgroup: method of MTLComputeCommandEncoder and specify the threadgroup dimensions and the number of threadgroups. You can query the threadExecutionWidth and maxTotalThreadsPerThreadgroup properties of MTLComputePipelineState to optimize the execution of the compute function on this device.
For most efficient execution of the compute function, set the total number of threads specified by the threadsPerThreadgroup argument to the dispatchThreadgroups:threadsPerThreadgroup: method to a multiple of threadExecutionWidth. The total number of threads in a threadgroup is the product of the components of threadsPerThreadgroup: threadsPerThreadgroup.width * threadsPerThreadgroup.height * threadsPerThreadgroup.depth. The maxTotalThreadsPerThreadgroup property specifies the maximum number of threads that can be in a single threadgroup to execute this compute function on the device.
Compute commands are executed in the order in which they are encoded into the command buffer. A compute command finishes execution when all threadgroups associated with the command finish execution and all results are written to memory. Because of this sequencing, the results of a compute command are available to any commands encoded after it in the command buffer.
To end encoding commands for a compute command encoder, call the endEncoding method of MTLComputeCommandEncoder. After ending the previous command encoder, you can create a new command encoder of any type to encode additional commands into the command buffer.
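For the render case specifically, here is a minimal sketch of the encoder split, written against the metal-cpp C++ wrapper; the pipeline states, render pass descriptors, buffer, and draw parameters are all assumed placeholders, and the only point being illustrated is the endEncoding() boundary between the two encoders:

#include <Metal/Metal.hpp>

// Sketch only: encode vertex shader A (writes buf) and vertex shader B (reads buf)
// with two separate render command encoders on the same command buffer.
void encodeAThenB(MTL::CommandBuffer* cmdBuf,
                  MTL::RenderPipelineState* pipelineA,   // contains vertex shader A
                  MTL::RenderPipelineState* pipelineB,   // contains vertex shader B
                  MTL::RenderPassDescriptor* passA,
                  MTL::RenderPassDescriptor* passB,
                  MTL::Buffer* buf)
{
    // Encoder 1: vertex shader A writes its results into buf.
    MTL::RenderCommandEncoder* encA = cmdBuf->renderCommandEncoder(passA);
    encA->setRenderPipelineState(pipelineA);
    encA->setVertexBuffer(buf, 0, 0);
    encA->drawPrimitives(MTL::PrimitiveTypeTriangle, 0, 3);
    encA->endEncoding();   // ending the encoder establishes the ordering boundary

    // Encoder 2: vertex shader B can now rely on the contents of buf.
    MTL::RenderCommandEncoder* encB = cmdBuf->renderCommandEncoder(passB);
    encB->setRenderPipelineState(pipelineB);
    encB->setVertexBuffer(buf, 0, 0);
    encB->drawPrimitives(MTL::PrimitiveTypeTriangle, 0, 3);
    encB->endEncoding();
}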

Related

Is hierarchical parallelism possible with MPI libraries?

I'm writing a computational code with MPI. The software has a few parts, each computing a different part of the problem. Each part is written with MPI and can therefore be run as an independent module. Now I want to combine these parts so that they run together within one program, with all parts of the code running in parallel while each part itself also runs in parallel.
For example: total number of nodes = 10, part 1 runs with 6 nodes and part 2 runs with 4 nodes, and both run together.
Is there a way I can mpirun with 10 nodes and MPI_Init each part with the desired number of nodes, without rewriting the overall program to allocate processes to each part of the code?
This is not straightforward.
One option is to use an external launcher program that calls MPI_Comm_spawn() (twice) to start your sub-programs. The drawback is that the launcher itself occupies one slot.
The other option needs some rewriting: since all the tasks end up in the same MPI_COMM_WORLD, it is up to them to call MPI_Comm_split() based on who they are, and then use the resulting communicator instead of MPI_COMM_WORLD (see the sketch below).
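A minimal sketch of the MPI_Comm_split() route, written in C++ against the MPI C API; the 6/4 split and the run_part1/run_part2 entry points are hypothetical placeholders:

#include <mpi.h>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int world_rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Example split: ranks 0-5 form part 1, ranks 6-9 form part 2.
    const int color = (world_rank < 6) ? 0 : 1;

    MPI_Comm part_comm = MPI_COMM_NULL;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &part_comm);

    // Each module now uses part_comm everywhere it previously used MPI_COMM_WORLD.
    if (color == 0) {
        // run_part1(part_comm);   // hypothetical entry point of module 1
    } else {
        // run_part2(part_comm);   // hypothetical entry point of module 2
    }

    MPI_Comm_free(&part_comm);
    MPI_Finalize();
    return 0;
}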

How to merge variables from parallel flows in Activiti?

Currently I have a sub-process which uses the fork/join mechanism to create parallel flows. Let's assume that there are two flows: A and B. Each of these flows takes a complex object CONTEXT as an input variable. Each flow also does some calculation and updates CONTEXT internally. As output, each flow returns the updated CONTEXT. The problem is that at the join point, the last result of CONTEXT overrides the previous one. Let's assume that flow A finishes first with result CONTEXT_1 and flow B then returns CONTEXT_2. The final result will be CONTEXT_2, and all changes from flow A will be lost.
The question is: how do I merge the results from the two flows?
UPDATE:
From my observations, a variable (CONTEXT) passed from the SuperProcess to the SubProcess is copied (CONTEXT'), and after the SubProcess finishes, the new value of the passed variable (CONTEXT') takes the place of the original (CONTEXT).
In the examples below I assume that all passed variables have the same name.
Example:
1. SuperProcess P1 (variable: CONTEXT) calls SubProcess P2 (variables are passed by copy);
2. SubProcess P2 (variable: CONTEXT') creates two parallel flows (tasks) A and B (variables are passed by copy);
3. Task A (variable: CONTEXT_1) updates the value of the variable, finishes execution and returns the variable;
3.1. CONTEXT_1 takes the place of variable CONTEXT', so P2 sees only this new value because the variable names are the same;
4. Meanwhile Task B (variable: CONTEXT_2) is still working; after some time it updates the variable, finishes execution and returns the variable;
4.1. CONTEXT_2 takes the place of CONTEXT_1, so P2 sees only this new value because the variable names are the same;
5. SubProcess P2 (variable: CONTEXT_2) finishes execution and returns the new variable to the SuperProcess.
Result: CONTEXT_1 is lost.
My desired scenario:
1. SuperProcess P1 (variable: CONTEXT) calls SubProcess P2 (variables are passed by copy);
2. SubProcess P2 (variable: CONTEXT') creates two parallel flows (tasks) A and B (variables are passed by copy);
3. Task A (variable: CONTEXT_1) updates the value of the variable, finishes execution and returns the variable;
3.1. CONTEXT_1 and CONTEXT are merged into CONTEXT_M1; in other words, only the new changes from CONTEXT_1 are applied to CONTEXT;
4. Meanwhile Task B (variable: CONTEXT_2) is still working; after some time it updates the variable, finishes execution and returns the variable;
4.1. CONTEXT_2 and CONTEXT_M1 are merged into CONTEXT_M2; in other words, only the new changes from CONTEXT_2 are applied to CONTEXT_M1, so the previous update is not lost;
5. SubProcess P2 (variable: CONTEXT_M2) finishes execution and returns the new variable to the SuperProcess.
Result: CONTEXT_M2. All changes are preserved.
After a couple of days of investigation we figured out that copying variables from the SuperProcess to the SubProcess is the default behavior (link):
"You can pass process variables to the sub process and vice versa. The
data is copied into the subprocess when it is started and copied back
into the main process when it ends."
As a workaround, we pass the variables into the SubProcess under different names and merge them with the source variables after the SubProcess finishes; a language-agnostic sketch of that merge step is given below.
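The merge itself is not Activiti API; the following is only an illustration (written in C++) of "apply to the base context only the keys a flow actually changed", with the Context alias and the mergeChanges name invented for this sketch:

#include <map>
#include <string>

using Context = std::map<std::string, std::string>;   // simplified stand-in for the CONTEXT object

// Apply to `base` only the entries of `flowResult` that differ from `original`,
// i.e. the changes that one parallel flow actually made.
void mergeChanges(Context& base, const Context& original, const Context& flowResult)
{
    for (const auto& [key, value] : flowResult) {
        const auto it = original.find(key);
        if (it == original.end() || it->second != value) {
            base[key] = value;   // new or modified entries win, untouched entries are left alone
        }
    }
}

// Usage at the join point: start from CONTEXT, fold in CONTEXT_1, then CONTEXT_2.
//   Context merged = context;                    // CONTEXT
//   mergeChanges(merged, context, context1);     // -> CONTEXT_M1
//   mergeChanges(merged, context, context2);     // -> CONTEXT_M2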
When you say merge, what do you mean exactly? What is your desired behavior at the merge point?
If you want to maintain both contexts, then use a map with the execution ID as the key; however, I doubt that is what you want.
Greg

Append OpenCL results to a list / Reduce the solution space

I have an OpenCL kernel with multiple work items. Let's assume for discussion that I have a 2-D workspace with x*y elements working on an equally sized, but sparse, array of input elements. Few of these input elements produce a result that I want to keep; most don't. I want to enqueue another kernel that takes only the kept results as input.
Is it possible in OpenCL to append results to some kind of list so they can be passed as input to another kernel, or is there a better way to reduce the volume of the solution space? Furthermore: is this even a good question to ask given the OpenCL programming model?
What I would do if the amount of result data is a small percentage (i.e. 0-10%) is use local atomics and global atomics with a global counter.
Data interface between kernel 1 <----> kernel 2:
int counter;                 // used by the atomics to know where to write
data_type results[counter];  // used to store the results
Kernel 1:
- Create a kernel function that does the operation on the data.
- Work items that do produce a result: save the result to local memory, and ensure no data races occur by using local atomics on a local counter.
- Use work item 0 to save all the local results back to global memory using global atomics.
Kernel 2:
- Work items with an index lower than "counter" do work; the others just return.
A sketch of this pattern is given below.
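A hedged sketch of the two kernels, written as OpenCL C source held in a C++ raw string; the work-group size, the produces_result() predicate and the float element type are placeholders, and the host is expected to zero counter before launching kernel 1:

static const char* kCompactKernelsSrc = R"CLC(
#define WG_SIZE 256                       // must match the work-group size used at enqueue time

int produces_result(float v) { return v > 0.0f; }   // placeholder predicate

__kernel void kernel1_compact(__global const float* input,
                              __global float* results,
                              __global int* counter)   // zeroed by the host beforehand
{
    __local float local_results[WG_SIZE];
    __local int   local_count;
    __local int   group_base;

    int   lid = get_local_id(0);
    float v   = input[get_global_id(0)];

    if (lid == 0) local_count = 0;
    barrier(CLK_LOCAL_MEM_FENCE);

    int slot = -1;
    if (produces_result(v)) {
        slot = atomic_add(&local_count, 1);   // local atomic: slot inside this work-group
        local_results[slot] = v;
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    if (lid == 0)                             // one global atomic per work-group
        group_base = atomic_add(counter, local_count);
    barrier(CLK_LOCAL_MEM_FENCE);

    if (slot >= 0)                            // compacted write of the kept results
        results[group_base + slot] = local_results[slot];
}

__kernel void kernel2_consume(__global const float* results,
                              __global const int* counter)
{
    int gid = get_global_id(0);
    if (gid >= *counter) return;              // only the first `counter` items do work
    // ... process results[gid] ...
}
)CLC";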

Is there a size limit on a variable in MPI_Bcast?

I have the latest MPICH2 (3.0.4) compiled with the Intel Fortran compiler on a quad-core, dual-CPU (Intel Xeon) machine.
I am running into an MPI_Bcast problem where I am unable to broadcast the array
gpsi(1:201,1:381,1:38,1:20,1:7)
which makes it an array of size 407410920. When I try to broadcast this array I get the following error:
Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(1525)......: MPI_Bcast(buf=0x7f506d811010, count=407410920,
MPI_DOUBLE_PRECISION, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast_impl(1369).:
MPIR_Bcast_intra(1160):
MPIR_SMP_Bcast(1077)..: Failure during collective
rank 1 in job 31 Grace_52261 caused collective abort of all ranks
exit status of rank 1: killed by signal 9
MPI launch string is: mpiexec -n 2 %B/tvdbootstrap
Testing MPI configuration with 'mpich2version'
Exit value was 127 (expected 0), status: execute_command_t::exited
Launching MPI job with command: mpiexec -n 2 %B/tvdbootstrap
Server args: -callback 127.0.0.1:4142 -set_pw 65f76672:41f20a5c
So the question: is there a limit on the size of a variable in MPI_Bcast, or is my array larger than it can handle?
As John said, your array is too big because the data it describes can no longer be counted by an int variable (407 million doubles is roughly 3.3 billion bytes). When this is the case, you have a few options.
Use multiple MPI calls to send your data. For this option, you would just divide your data up into chunks smaller than 2^31 and send them individually until everything has been received (a sketch of both options is given after this list).
Use MPI datatypes. With this option, you need to create a datatype to describe some portion of your data, then send multiples of that datatype. For example, if you are just sending an array of 100 integers, you can create a datatype of 10 integers using MPI_TYPE_VECTOR, then send 10 of that new datatype. Datatypes can be a bit confusing when you're first taking a look at them, but they are very powerful for sending either large data or non-contiguous data.
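A minimal sketch of both options, assuming the data is a flat array of doubles that is already allocated on every rank; bcast_large and bcast_large_with_type are names invented for this sketch:

#include <mpi.h>
#include <cstddef>

// Option 1: broadcast a large buffer in chunks whose element counts fit in an int.
void bcast_large(double* data, std::size_t total, int root, MPI_Comm comm)
{
    const std::size_t max_chunk = std::size_t(1) << 30;   // stay well below 2^31 elements

    std::size_t offset = 0;
    while (offset < total) {
        const std::size_t remaining = total - offset;
        const int count = static_cast<int>(remaining < max_chunk ? remaining : max_chunk);
        MPI_Bcast(data + offset, count, MPI_DOUBLE, root, comm);
        offset += static_cast<std::size_t>(count);
    }
}

// Option 2: describe a block of elements with a derived datatype so the count stays small.
// MPI_Type_contiguous is used here in the role the answer sketches for MPI_TYPE_VECTOR.
void bcast_large_with_type(double* data, int blocks, int block_len, int root, MPI_Comm comm)
{
    MPI_Datatype block_type;
    MPI_Type_contiguous(block_len, MPI_DOUBLE, &block_type);
    MPI_Type_commit(&block_type);
    MPI_Bcast(data, blocks, block_type, root, comm);   // blocks * block_len doubles in total
    MPI_Type_free(&block_type);
}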
Yes, there is a limit. It's usually 2^31, so about two billion elements. You say your array has 407 million elements, so it seems like it should work. However, if the limit is two billion bytes, then you are exceeding it: 407 million double-precision elements is about 3.3 billion bytes. Try cutting your array size in half and see if that works.
See: Maximum amount of data that can be sent using MPI::Send

In MPI, how do I communicate a part of a "shared" array to all other ranks?

I have an MPI program with an array of data. Every rank needs the whole array to do its work, but will only work on a patch of it. After a calculation step I need every rank to communicate its computed piece of the array to all other ranks.
How do I achieve this efficiently?
In pseudo code I would do something like this as a first approach:
if rank == 0:                               // only master rank
    initialise_data()
end if
MPI_Bcast(all_data, 0)                      // from master to every rank
compute which part of the data to work on
for ( several steps ):                      // on each rank
    execute_computation(part_of_data)
    for ( each rank ):
        MPI_Bcast(part_of_data, rank_number)   // from every rank to every rank
    end for
end for
The disadvantage is that there are as many broadcasts, i.e. synchronization points, as there are ranks. So how would I replace the MPI_Bcasts?
Edit: I think I might have found a hint... Is it MPI_Allgather I am looking for?
Yes, you are looking for MPI_Allgather (a sketch is given below). Note that recvcount is not the length of the whole receive buffer, but the amount of data that should be received from each process. Analogously, in MPI_Allgatherv, recvcount[i] is the amount of data you want to receive from the i-th process. Moreover, recvcount should be equal to (not less than) the corresponding sendcount. I tested this with my implementation (Open MPI), and if I tried to receive fewer elements than were sent, I got an MPI_ERR_TRUNCATE error.
In some rare cases I have also used MPI_Allreduce for this purpose. For example, if we have the following arrays:
process0: AA0000
process1: 0000BB
process2: 00CC00
then we can do an Allreduce with the MPI_SUM operation and get AACCBB in all processes. Obviously, the same trick can be done with ones instead of zeros and MPI_PROD instead of MPI_SUM.
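A minimal sketch of the MPI_Allgather replacement for the per-rank broadcast loop, written in C++ against the MPI C API; it assumes the full array length is divisible by the number of ranks and that each rank has already written its own patch into the right place of the full-size buffer:

#include <mpi.h>
#include <vector>

void exchange_patches(std::vector<double>& all_data, MPI_Comm comm)
{
    int size = 1;
    MPI_Comm_size(comm, &size);

    // recvcount is the amount received from EACH rank, not the total buffer length.
    const int patch_len = static_cast<int>(all_data.size()) / size;

    // With MPI_IN_PLACE every rank's contribution is taken from its own slot,
    // all_data[rank * patch_len .. (rank + 1) * patch_len - 1], and the send
    // count/type arguments are ignored.
    MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                  all_data.data(), patch_len, MPI_DOUBLE, comm);
}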
