Does Julia use SIMD instructions for broadcast operations? - julia

I see #simd can be used to vectorize for loops. How about for broadcast()?
Will broadcast(f,A) translate to SIMD instructions of f operating on the elements A?
Will multiple instances of f be sent to multiple threads?

As per this discussion:
Julia 0.5 will automatically use SIMD instructions if LLVM decides to auto-vectorize the underlying broadcast() loop. There is currently no way to explicitly demand this.
The underlying loop will use a single thread.

Related

The best way to write code in Julia working on GPU's via ArrayFire

In Julia, I saw principally that to acelerate and optimizing codes when I work on a matrix, es better e.g.
-work by columns instead of by rows, this is for the way which Julia store the matrix.
-On loops could use #inbounds and #simd macros
-any function, macros or methods you could recommend it's welcome :D
But it seems that the above examples do not work when I use the ArrayFire package with a matrix stored on the GPU, similar codes in the CPU and GPU do not seem to favor the GPU that runs much slower in some cases, I think it shouldn't be like that, I think the problem is in the way of writing the code. Any help will be welcome
GPU computing should be done on optimized GPU kernels as much as possible. Indexing a GPU array is a small kernel that copies one value back to the CPU. This is really really bad for performance, so you should almost never index a GPUArray unless you have to (this is true for any implementation! It's just a hardware problem!)
Thus, instead of writing looping code for GPUs, you should write broadcasting ("vectorized") code. With the v0.6 broadcast changes, broadcasted operations are nearly as efficient as loops anyways (unless you hit this bug), so there's no reason to avoid them in generic code. However, there are cases where broadcasting is faster than looping, and GPUs is one big case.
Let me explain a little bit why. When you do the code:
#. A = B*C + D*E
it lowers to
A .= B.*C .+ D.*E
which then lowers to:
broadcast!((b,c,d,e)->b*c + d*e,A,B,C,D,E)
Notice that in there you have a fused anonymous function for the entire broadcast. For GPUArrays, this is then overwritten so that way a single GPU kernel is automatically created that performs this fused operation element-wise. Thus only one GPU kernel is required to do this whole operation! Notice that this is even more efficient than the R/Python/MATLAB way of doing GPU computing since those vectorized forms have temporaries and would require 4 kernels here, but this has no temporary arrays and is a single kernel, which is pretty much exactly how you'd write it if you were writing the kernel yourself. So if you exploit broadcast, then your GPU calculations will be fast.

How to make the most of SIMD in OpenCL?

In the optimization guide of Beignet, an open source implementation of OpenCL targeting Intel GPUs
Work group Size should be larger than 16 and be multiple of 16.
As two possible SIMD lanes on Gen are 8 or 16. To not waste SIMD
lanes, we need to follow this rule.
Also mentioned in the Compute Architecture of Intel Processor Graphics Gen7.5:
For Gen7.5 based products, each EU has seven threads for a total of 28 Kbytes of general purpose register file (GRF).
...
On Gen7.5 compute architecture, most SPMD programming models employ
this style code generation and EU processor execution. Effectively,
each SPMD kernel instance appears to execute serially and independently within its own SIMD lane.
In actuality, each thread executes a SIMD-Width number of kernel instances >concurrently. Thus for a SIMD-16 compile of a compute
kernel, it is possible for SIMD-16 x 7 threads = 112 kernel instances
to be executing concurrently on a single EU. Similarly, for SIMD-32 x
7 threads = 224 kernel instances executing concurrently on a single
EU.
If I understand it correctly, using the SIMD-16 x 7 threads = 112 kernel instances as a example, in order to run 224 threads on one EU, the work group size need to be 16. Then the OpenCL compiler will fold 16 kernel instances into a 16 lane SIMD thread, and do this 7 times on 7 work groups, and run them on a single EU?
Question 1: am I correct until here?
However OpenCL spec also provide vector data types. So it's feasible to make full use of the SIMD-16 computing resources in a EU by conventional SIMD programming(as in NEON and SSE).
Question 2: If this is the case, using vector-16 data type already makes explicit use of the SIMD-16 resources, hence removes the at-least-16-item-per-work-group restrictions. Is this the case?
Question 3: If all above are true, then how does the two approach compare with each other: 1) 112 threads fold into 7 SIMD-16 threads by OpenCL compiler; 2) 7 native threads coded to explicitly use vector-16 data types and SIMD-16 operations?
Almost. You are making the assumptions that there is one thread per workgroup (N.B. thread in this context is what CUDA calls a "wave". In Intel GPU speak a work item is a SIMD channel of a GPU thread). Without subgroups, there is no way to force a workgroup size to be exactly a thread. For instance, if you choose a WG size of 16, the compiler is still free to compile SIMD8 and spread it amongst two SIMD8 threads. Keep in mind that the compiler chooses the SIMD width before the WG size is known to it (clCompileProgram precedes clEnqueueNDRange). The subgroups extension might allow you to force the SIMD width, but is definitely not implemented on GEN7.5.
OpenCL vector types are an optional explicit vectorization step on top of the implicit vectorization that already happens automatically. Were you to use float16 for example. Each of the work items would be processing 16 floats each, but the compiler would still compile at least SIMD8. Hence each GPU thread would be processing (8 * 16) floats (in parallel though). That might be a bit overkill. Ideally we don't want to have to explicitly vectorize our CL by using explicit OpenCL vector types. But it can be helpful sometimes if the kernel is not doing enough work (kernels that are too short can be bad). Somewhere it says float4 is a good rule of thumb.
I think you meant 112 work items? By native thread do you mean CPU threads or GPU threads?
If you meant CPU threads, the usual arguments about GPUs apply. GPUs are good when your program doesn't diverge much (all instances take similar paths) and you use the data enough times to mitigate the cost transferring it to and from the GPU (arithmetic density).
If you meant GPU threads (the GEN SIMD8 or SIMD16 critters). There is no (publicly visible) way to program the GPU threads explicitly at the moment (EDIT see the subgroups extension (not available on GEN7.5)). If you were able to, it'd be a similar trade off to assembly language. The job is harder, and the compiler sometimes just does a better job than we can, but when you are solving a specific problem and have better domain knowledge, you can generally do better with enough programming effort (until hardware changes and your clever program's assumptions becomes invalidated.)

SIMD-8,SIMD-16 or SIMD-32 in opencl on gpgpu

I read couple of questions on SO for this topic(SIMD Mode), but still slight clarification/confirmation of how things work is required.
Why use SIMD if we have GPGPU?
SIMD intrinsics - are they usable on gpus?
CPU SIMD vs GPU SIMD?
Are following points correct,if I compile the code in SIMD-8 mode ?
1) it means 8 instructions of different work items are getting executing in parallel.
2) Does it mean All work items are executing the same instruction only?
3) if each wrok item code contains vload16 load then float16 operations and then vstore16 operations only. SIMD-8 mode will still work. I mean to say is it true GPU is till executing the same instruction (either vload16/ float16 / vstore16) for all 8 work items?
How should I understand this concept?
In the past many OpenCL vendors required to use vector types to be able to use SIMD. Nowadays OpenCL vendors are packing work items into SIMD so there is no need to use vector types. Whether is preffered to use vector types can be checked by querying for: CL_DEVICE_PREFERRED_VECTOR_WIDTH_<CHAR, SHORT, INT, LONG, FLOAT, DOUBLE>.
On Intel if vector type is used the vectorizer first scalarize them and then re-vectorize to make use of the wide instruction set. This is probably going to be similar on the other platforms.

Is it possible, in R parallel::mcparallel, to limit the number of cores used at any one time?

In R, the mcparallel() function in the parallel package forks off a new task to a worker each time it is called. If my machine has N (physical) cores, and I fork off 2N tasks, for example, then each core starts off running two tasks, which is not desirable. I would rather like to be able to start running N tasks on N workers, and then, as each tasks finishes, submit the next task to the now-available core. Is there an easy way to do this?
My tasks take different amounts of time, so it is not an option to fork off the tasks serial in batches of N. There might be some workarounds, such as checking the number of active cores and then submitting new tasks when they become free, but does anyone know of a simple solution?
I have tried setting cl <- makeForkCluster(nnodes=N), which does indeed set N cores going, but these are not then used by mcparallel(). Indeed, there appears to be no way to feed cl into mcparallel(). The latter has an option mc.affinity, but it's unclear how to use this and it doesn't seem to do what I want anyway (and according to the documentation its functionality is machine dependent).
you have at least 2 possibilities:
As mentioned above you can use mcparallel's parameters "mc.cores" or "mc.affinity".
On AMD platforms "mc.affinity" is preferred since two cores share same clock.
For example an FX-8350 has 8 cores, but core 0 has same clock as core 1. If you start a task for 2 cores only it is better to assign it to cores 0 and 1 rather than 0 and 2. "mc.affinity" makes that. The price is loosing load balancing.
"mc.affinity" is present in recent versions of the package. See changelog to find when introduced.
Also you can use OS's tool for setting affinity, e.g. "taskset":
/usr/bin/taskset -c 0-1 /usr/bin/R ...
Here you make your script to run on cores 0 and 1 only.
Keep in mind Linux numbers its cores starting from "0". Package parallel conforms to R's indexing and first core is core number 1.
I'd suggest taking advantage of the higher level functions in parallel that include this functionality instead of trying to force low level functions to do what you want.
In this case, try writing your tasks as different arguments of a single function. Then you can use mclapply() with the mc.preschedule parameter set to TRUE and the mc.cores parameter set to the number of threads you want to use at a time. Each time a task finishes and a thread closes, a new thread will be created, operating on the next available task.
Even if each task uses a completely different bit of code, you can create a list of functions and pass that to a wrapper function. For example, the following code executes two functions at a time.
f1 <- function(x) {x^2}
f2 <- function(x) {2*x}
f3 <- function(x) {3*x}
f4 <- function(x) {x*3}
params <- list(f1,f2,f3,f4)
wrapper <- function(f,inx){f(inx)}
output <- mclapply(params,FUN=calling,mc.preschedule=TRUE,mc.cores=2,inx=5)
If need be you could make params a list of lists including various parameters to be passed to each function as well as the function definition. I've used this approach frequently with various tasks of different lengths and it works well.
Of course, it may be that your various tasks are just different calls to the same function, in which case you can use mclapply directly without having to write a wrapper function.

Is there a programming language where every function is essentially run as a separate actor?

Is there a programming language where you don't have to define actors yourself - every function is just ran as a separate actor (which can mean a separate thread if there are free cores available) by default?
For example it means that if I write something as simple as
v = fA(x) + fB(y)
then fA and fB could be calculated simultaneously before the sum of their results was assigned to v.
I don't think there is anything this extreme, since the context switching and comunication overhead would be too big.
The closest I can think of to what you are asking is data-parallel programing, where the program is mostly written in the same style as a sequential version but parts of it are ran in parallel where possible.
Examples are loop vectorization in Fortran and "par" magic in Haskell.
Haskell's par combinator lets you evaluate expressions concurrently (which can mean in separate threads if there are free cores available). All you have to do is:
x `par` y
Which will evaluate x and y concurrently, and return the value of y. Note that x and y can be programs of arbitrary complexity.
Joule is a pure asynchronous message passing language:
http://en.wikipedia.org/wiki/Joule_%28programming_language%29
http://www.erights.org/history/joule/MANUAL.BK5.pdf
ActorScript is a pure Actor message-passing language, but appears to only exist as a specification:
http://arxiv.org/abs/1008.2748

Resources