Clarification on the benefits of using threads vs. processes in Julia

I'm new to Julia; after some courses on numerical analysis, programming became a hobby of mine.
I ran some tests using all my cores with processes, and did the same with threads to compare. I noticed that heavier computation went better with the threaded loop than with the distributed one, but the two were about the same for simple addition (the operations were chosen more or less at random, just as examples).
After some research everything still seems kind of vague, and I would ultimately like some perspective from someone using the same language, if that matters at all.
Some technical info: 8 physical cores; Julia added a vector of 16 workers after addprocs(), and nthreads() is 16.
using Distributed
addprocs()
@everywhere using SharedArrays;
@everywhere using BenchmarkTools;

function test(lim)
    r = zeros(Int64(lim / 16), Threads.nthreads())
    Threads.@threads for i in eachindex(r)
        r[Threads.threadid()] = (BigInt(i)^7 + 5) % 7;
    end
    return sum(r)
end

@btime test(10^4) # 1.178 ms (240079 allocations: 3.98 MiB)

@everywhere function test2(lim)
    a = SharedArray{Int64}(lim);
    @sync @distributed for i = 1:lim
        a[i] = (BigInt(i)^7 + 5) % 7;
    end
    return sum(a)
end

@btime test2(10^4) # 3.796 ms (4413 allocations: 189.02 KiB)

Note that your loops do very different things:
- In the first loop, each thread keeps updating the same single cell of the array (r[Threads.threadid()]). Since each thread updates only a single memory cell, the processor's caching mechanism can most likely speed things up.
- In the second loop, on the other hand, each process updates several different memory cells, so such caching is not possible.
- The first array holds Float64 values, while the second holds Int64 values.
After correcting those things the difference gets smaller (this is on my laptop; I have only 8 threads):
julia> @btime test(10^4)
  2.781 ms (220037 allocations: 3.59 MiB)
29997

julia> @btime test2(10^4)
  4.867 ms (2145 allocations: 90.14 KiB)
29997
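For reference, here is one sketch of what the corrected versions might look like (an assumption on my part: the goal is the sum of (i^7 + 5) % 7 over i = 1:lim, and giving each loop index its own result slot is one way to remove the race on r[Threads.threadid()]):

```julia
using Distributed
addprocs()                      # one worker per core by default
@everywhere using SharedArrays

# Threaded version: one result slot per loop index, so no data race
function test(lim)
    r = zeros(Int, lim)
    Threads.@threads for i in 1:lim
        r[i] = (BigInt(i)^7 + 5) % 7
    end
    return sum(r)
end

# Distributed version: same element type (Int64) for a fair comparison
@everywhere function test2(lim)
    a = SharedArray{Int64}(lim)
    @sync @distributed for i in 1:lim
        a[i] = (BigInt(i)^7 + 5) % 7
    end
    return sum(a)
end
```

With this, both functions compute the same quantity and return the same value (29997 for lim = 10^4), so the timing comparison is apples to apples.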
Now the other issue is that when Distributed is used you are doing inter-process communication, which does not occur when using Threads. Basically, inter-process parallelism does not make sense for jobs lasting a few milliseconds; as you increase the processing volume, the difference should start to diminish.
So when to use what - it depends. General guidelines (somewhat subjective) are the following:
- Processes are more robust (threads are still experimental).
- Threads are easier to use, as long as you do not need locking or atomic values.
- Beyond a parallelism level of around 16, threads become inefficient and Distributed should be used (this is my personal observation).
- When writing utility packages for others, use threads - do not distribute code inside a package. Explanation: if you add multi-threading to a package, its behavior can be transparent to the user. Julia's multiprocessing (the Distributed package) abstraction, on the other hand, does not distinguish between parallel and distributed - your workers can be either local or remote. This makes a fundamental difference in how code is designed (e.g. SharedArrays vs DistributedArrays); moreover, the design might also depend on, e.g., the number of servers or the possibility of limiting inter-node communication. Hence Distributed-related logic should normally be kept separate from a standard utility package, while multi-threaded functionality can simply be made transparent to the package user. There are of course exceptions to this rule, such as tools for distributed data processing servers, but it is a good general rule of thumb.
For huge-scale computations I always use processes, because with them you can easily move onto a computer cluster and distribute the workload across hundreds of machines.
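As an illustration of that last point, per-item work can be handed to workers with pmap, and the same code scales from local processes to a cluster just by changing how the workers are added (the function name heavy below is hypothetical, standing in for any expensive computation):

```julia
using Distributed
addprocs(4)   # local workers; on a cluster, pass a ClusterManager or machine list instead

# Hypothetical expensive per-item function, defined on all workers
@everywhere heavy(i) = (BigInt(i)^7 + 5) % 7

# pmap dispatches items to the workers and collects the results in order
results = pmap(heavy, 1:100)
println(sum(results))
```

Nothing in the calling code changes when the workers move to remote machines; only the addprocs call does.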


Efficiently loop through structs in Julia

I have a simple question. I have defined a struct, and I need to instantiate a lot of them (on the order of millions) and loop over them.
I am initiating one at a time and going through the loop as follows:
using Distributions

mutable struct help_me{Z<:Bool}
    can_you_help_me::Z
    millions_of_thanks::Z
end

for i in 1:max_iter
    tmp_help = help_me(rand(Bernoulli(0.5),1)[1], rand(Bernoulli(0.99),1)[1])
    # many follow-up processes
end
The memory allocation scales up with max_iter. For my purposes, I do not need to save each struct. Is there a way to "re-use" the memory allocated for the struct?
Your main problem lies here:
rand(Bernoulli(0.5),1)[1], rand(Bernoulli(0.99),1)[1]
You are creating a length-1 array and then reading the first element from that array. This allocates unnecessary memory and takes time. Don't create an array here. Instead, write
rand(Bernoulli(0.5)), rand(Bernoulli(0.99))
This will just create random scalar numbers, no array.
Compare timings here:
julia> using BenchmarkTools

julia> @btime rand(Bernoulli(0.5),1)[1]
  36.290 ns (1 allocation: 96 bytes)
false

julia> @btime rand(Bernoulli(0.5))
  6.708 ns (0 allocations: 0 bytes)
false
6 times as fast, and no memory allocation.
This seems to be a general issue. Very often I see people writing rand(1)[1], when they should be using just rand().
Also, consider whether you actually need to make the struct mutable, as others have mentioned.
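Putting both suggestions together, a minimal sketch might look like this (assumptions: the struct is made immutable, and the Bernoulli draws are replaced with plain rand() comparisons so the example has no dependencies; swap rand(Bernoulli(p)) back in if you are using Distributions):

```julia
# Immutable struct: instances are plain bits values, so the compiler can
# often keep them on the stack instead of heap-allocating each one
struct HelpMe
    can_you_help_me::Bool
    millions_of_thanks::Bool
end

function count_helps(max_iter)
    n = 0
    for i in 1:max_iter
        # rand() < p is a Bernoulli(p) draw without the Distributions dependency
        tmp = HelpMe(rand() < 0.5, rand() < 0.99)
        n += tmp.can_you_help_me
    end
    return n
end
```

Run under @time, a loop like this should show essentially zero allocations per iteration, in contrast to the original version.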
If the structure is not needed anymore (i.e. not referenced anywhere outside the current loop iteration), the Garbage Collector will free up its memory automatically if required.
Otherwise, I agree with the suggestions of Oscar Smith: memory allocation and garbage collection take time, avoid it for performance reasons if possible.

How can I measure the RAM consumption and the time of computing in Julia?

I'm developing different discretization schemes, and to find out which is the most efficient one I would like to determine the maximum RAM consumption and the time it takes to do a specific task, such as solving a system of equations, overwriting a matrix, or writing the data to a file.
Is there any kind of code or tool for doing what I need?
I'm using Julia on Ubuntu, by the way, but I could do it on Windows as well.
Thanks a lot
I love using the built-in @time for this kind of thing. See "Measure performance with @time and pay attention to memory allocation". Example:
julia> @time myAwesomeFunction(tmp);
1.293542 seconds (22.08 M allocations: 893.866 MiB, 6.62% gc time)
This prints out time, the number of memory allocations, the size of memory allocations, and the percent time spent garbage collecting ("gc"). Always run this at least twice—the first run will be dominated by compile times!
Also consider BenchmarkTools.jl. This will run the code multiple times, with some cool variable interpolation tricks, and give you better runtime/memory estimates:
julia> using BenchmarkTools, Compat
julia> @btime myAwesomeFunction($tmp);
1.311 s (22080097 allocations: 893.87 MiB)
(My other favorite performance-related tools are the @code_* family of macros, like @code_warntype.)
I think that BenchmarkTools.jl measures total memory use, not peak. I haven't found pure Julia code to measure this, but perhaps this thread is relevant.
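A small end-to-end example of these tools (fill_and_sum is a hypothetical stand-in workload; note that @allocated reports cumulative bytes allocated by the call, not peak resident memory):

```julia
using BenchmarkTools

# Stand-in workload: allocate a matrix, overwrite it in place, then reduce it
function fill_and_sum(n)
    A = rand(n, n)
    A .= A .* 2
    return sum(A)
end

fill_and_sum(10)                            # warm-up: the first call includes compilation
@time fill_and_sum(1000)                    # prints time, allocations, GC percentage
total_bytes = @allocated fill_and_sum(1000) # total bytes allocated by the call
@btime fill_and_sum(1000)                   # repeated-run estimate from BenchmarkTools
```

Comparing total_bytes between two discretization schemes gives a rough proxy for their memory pressure, even though it is not a peak-RAM figure.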

How to make the most of SIMD in OpenCL?

In the optimization guide of Beignet, an open source implementation of OpenCL targeting Intel GPUs
Work group Size should be larger than 16 and be multiple of 16.
As two possible SIMD lanes on Gen are 8 or 16. To not waste SIMD
lanes, we need to follow this rule.
Also mentioned in the Compute Architecture of Intel Processor Graphics Gen7.5:
For Gen7.5 based products, each EU has seven threads for a total of 28 Kbytes of general purpose register file (GRF).
...
On Gen7.5 compute architecture, most SPMD programming models employ this style of code generation and EU processor execution. Effectively, each SPMD kernel instance appears to execute serially and independently within its own SIMD lane. In actuality, each thread executes a SIMD-Width number of kernel instances concurrently. Thus for a SIMD-16 compile of a compute kernel, it is possible for SIMD-16 x 7 threads = 112 kernel instances to be executing concurrently on a single EU. Similarly, for SIMD-32 x 7 threads = 224 kernel instances executing concurrently on a single EU.
If I understand it correctly, using SIMD-16 x 7 threads = 112 kernel instances as an example: in order to run 112 kernel instances on one EU, the work group size needs to be 16. The OpenCL compiler will then fold 16 kernel instances into a 16-lane SIMD thread, do this 7 times for 7 work groups, and run them on a single EU?
Question 1: am I correct until here?
However, the OpenCL spec also provides vector data types, so it is feasible to make full use of the SIMD-16 computing resources in an EU by conventional SIMD programming (as with NEON and SSE).
Question 2: If this is the case, using vector-16 data types already makes explicit use of the SIMD-16 resources, hence removing the at-least-16-work-items-per-work-group restriction. Is that right?
Question 3: If all of the above is true, how do the two approaches compare: 1) 112 threads folded into 7 SIMD-16 threads by the OpenCL compiler; 2) 7 native threads coded to explicitly use vector-16 data types and SIMD-16 operations?
Almost. You are making the assumption that there is one thread per workgroup (N.B. a thread in this context is what CUDA calls a "warp"; in Intel GPU speak, a work item is a SIMD channel of a GPU thread). Without subgroups, there is no way to force a workgroup to map to exactly one thread. For instance, if you choose a WG size of 16, the compiler is still free to compile SIMD8 and spread the workgroup across two SIMD8 threads. Keep in mind that the compiler chooses the SIMD width before the WG size is known to it (clCompileProgram precedes clEnqueueNDRangeKernel). The subgroups extension might allow you to force the SIMD width, but it is definitely not implemented on GEN7.5.
OpenCL vector types are an optional explicit vectorization step on top of the implicit vectorization that already happens automatically. Were you to use float16, for example, each work item would be processing 16 floats, but the compiler would still compile at least SIMD8; hence each GPU thread would be processing (8 * 16) floats (in parallel, though). That might be a bit overkill. Ideally we don't want to have to explicitly vectorize our CL by using explicit OpenCL vector types, but it can be helpful sometimes if the kernel is not doing enough work (kernels that are too short can be bad). Somewhere it says float4 is a good rule of thumb.
I think you meant 112 work items? By native thread do you mean CPU threads or GPU threads?
If you meant CPU threads, the usual arguments about GPUs apply: GPUs are good when your program doesn't diverge much (all instances take similar paths) and you use the data enough times to mitigate the cost of transferring it to and from the GPU (arithmetic density).
If you meant GPU threads (the GEN SIMD8 or SIMD16 critters), there is no (publicly visible) way to program the GPU threads explicitly at the moment (EDIT: see the subgroups extension, not available on GEN7.5). If you were able to, it would be a similar trade-off to assembly language: the job is harder, and the compiler sometimes just does a better job than we can, but when you are solving a specific problem and have better domain knowledge, you can generally do better with enough programming effort (until the hardware changes and your clever program's assumptions become invalidated).

parallel sum reduction implementation in opencl

I am going through the sample code NVIDIA provides at the link.
In the sample kernel code (file oclReduction_kernel.c), reduce4 uses the technique of:
1) unrolling and removing the synchronization barrier for thread id < 32;
2) apart from this, the code uses blockSize checks to sum the data in local memory. I think in OpenCL we have get_local_size(0/1) to get the work group size, so blockSize is confusing me.
I am not able to understand the two points mentioned above. Why and how do these things help with optimization? Any explanation of reduce5 and reduce6 would be helpful as well.
You have that pretty much explained in slides 21 and 22 of https://docs.nvidia.com/cuda/samples/6_Advanced/reduction/doc/reduction.pdf, which @Marco13 linked in the comments.
As reduction proceeds, the number of "active" threads decreases.
When s <= 32, we have only one warp left.
Instructions are SIMD synchronous within a warp.
That means when s <= 32:
- We don't need to __syncthreads()
- We don't need "if (tid < s)" because it doesn't save any work
Without unrolling, all warps execute every iteration of the for loop and the if statement.
And by https://www.pgroup.com/lit/articles/insider/v2n1a5.htm:
The code is actually executed in groups of 32 threads, what NVIDIA
calls a warp.
Each core can execute a sequential thread, but the cores execute in
what NVIDIA calls SIMT (Single Instruction, Multiple Thread) fashion;
all cores in the same group execute the same instruction at the same
time, much like classical SIMD processors.
Regarding 2): blockSize there appears to be the size of the work group.

OpenCL computation times much longer than CPU alternative

I'm taking my first steps in OpenCL (and CUDA) for my internship. All nice and well; I now have working OpenCL code, but the computation times are way too high, I think. My guess is that I'm doing too much I/O, but I don't know where that could be.
The code is for the main: http://pastebin.com/i4A6kPfn, and for the kernel: http://pastebin.com/Wefrqifh I'm starting to measure time after segmentPunten(segmentArray, begin, eind); has returned, and I end measuring time after the last clEnqueueReadBuffer.
Computation time on an NVIDIA GT440 is 38.6 seconds, on a GT555M 35.5 seconds, on an Athlon II X4 5.6 seconds, and on an Intel P8600 6 seconds.
Can someone explain this to me? Why are the computation times so high, and what solutions are there?
What is it supposed to do: (short version) calculate how much noise load is produced by an airplane passing by.
Long version: there are several Observer Points (OPs), which are the points at which sound from a passing airplane is measured. The flight path is segmented into 10,000 segments; this is done in the function segmentPunten. The double for loop in the main gives the OPs coordinates. There are two kernels. The first one calculates the distance from a single OP to a single segment; this is saved in the array "afstanden". The second kernel calculates the sound load in an OP from all the segments.
Just eyeballing your kernel, I see this:
kernel void SEL(global const float *afstanden, global double *totaalSEL,
                const int aantalSegmenten)
{
    // ...
    for (i = 0; i < aantalSegmenten; i++) {
        double distance = afstanden[threadID * aantalSegmenten + i];
        // ...
    }
    // ...
}
It looks like aantalSegmenten is being set to 1000. You have a loop in each kernel that accesses global memory 1000 times. Without crawling through the code, I'm guessing that many of these accesses overlap when considering your computation as a whole. Is that the case? Will two work items access the same global memory? If so, you will see a potentially huge win on the GPU from rewriting your algorithm to partition the work such that each specific global memory location is read only once and saved in local memory. After that, each work item in the work group that needs that location can read it quickly.
As an aside, the CL specification allows you to omit the leading __ from CL
keywords like global and kernel. I don't think many newcomers to CL realize
that.
Before optimizing further, you should first get an understanding of what is taking all that time. Is it the kernel compiles, data transfer, or actual kernel execution?
As mentioned above, you can get rid of the kernel compiles by caching the results. I believe some OpenCL implementations (the Apple one at least) already do this automatically. With others, you may need to do the caching manually. Here are instructions for the caching.
If the performance bottleneck is the kernel itself, you can probably get a major speed-up by organizing the 'afstanden' array lookups differently. Currently, when a block of threads performs a read from memory, the addresses are spread out through the memory, which is a real killer for GPU performance. You would ideally want to index the array with something like afstanden[ndx*NUM_THREADS + threadID], which makes a work group's accesses load a contiguous block of memory. This is much faster than the current, essentially random, memory lookups.
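To see why the second indexing scheme is friendlier, compare the addresses that adjacent work items touch on the same loop iteration (a small Julia sketch, used here only to print the two access patterns; NUM_THREADS and the index formulas mirror the answer above):

```julia
NUM_THREADS = 4
aantalSegmenten = 4

# Original layout: on iteration i = 0, adjacent work items (threadID 0..3)
# are aantalSegmenten elements apart - a strided, uncoalesced pattern
strided = [threadID * aantalSegmenten + 0 for threadID in 0:NUM_THREADS-1]

# Suggested layout: adjacent work items touch adjacent elements,
# so the work group loads one contiguous block per iteration
coalesced = [0 * NUM_THREADS + threadID for threadID in 0:NUM_THREADS-1]

println(strided)    # [0, 4, 8, 12] - stride of aantalSegmenten
println(coalesced)  # [0, 1, 2, 3]  - contiguous block
```

The data itself must of course be stored in the matching layout (segment-major rather than OP-major) for the second formula to read the right values.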
First of all, you are measuring not the computation time but the whole kernel read-in/compile/execute mumbo-jumbo. To do a fair comparison, measure the computation time from the first "non-static" part of your program (for example, from the first clSetKernelArg to the last clEnqueueReadBuffer).
If the execution time is still too high, you can use some kind of profiler (such as the Visual Profiler from NVIDIA) and read the OpenCL Best Practices guide, which is included in the CUDA Toolkit documentation.
Regarding the raw kernel execution time: consider (and measure) whether you really need double precision for your calculation, because double-precision calculations are artificially slowed down on consumer-grade NVIDIA cards.
