Erlang file IO and asynchronous threads

I was reading the Erlang documentation about file I/O and saw this:
On operating systems with thread support, it is possible to let file
operations be performed in threads of their own, allowing other Erlang
processes to continue executing in parallel with the file operations.
See the command line flag +A in erl(1).
So what I expected was that the time required for an IO operation would be reduced if I added asynchronous threads.
Instead, when I tried running erl +A1, erl +A6, or erl +A12 (on a 6-core machine), the time required to write to a file increased 5-10 times.
I used timer:tc/3 to measure the time, and I tried io:write/2, file:write/2 (after converting the term to a binary), and file:write/2 with the file opened with the raw flag. The term was ~170 KB in size and was written 1000 times. I used R14B04 (but got similar results with R15A too).
Am I doing something wrong, either in utilizing the asynchronous IO or in measuring its efficiency?
Could it be that the overhead introduced by passing the term (perhaps because it is relatively small) outweighs the speedup gained?
The (not so elegant :$) code:
-module(test).
-compile(export_all).
test()->
{ok,F}=file:open(foo,[raw,write]), % or just [write]
{T,ok}=timer:tc(test,t,[F,1000]),
file:close(F),
T.
t(_,0)->ok;
t(F,A)->
B=dsafasfagafssadagfsdsaasdfdsafasfagafssadagfsdsaasdfdsafasfagafssadagfsdsaasdfdsafasfagafssadagfsdsaasdfdsafasfagafssadagfsdsaasdfagafssadagfsdsaasdfdsafasfagafssadagfsdsaasdfdsafasfagafssadagfsdsaasdfdsafasfagafssadagfsdsaasdf,
file:write(F,
term_to_binary([B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B])),
%or io:write(F,[B,..])
t(F,A-1).
I am interested in minimizing IO overhead (basically just output) because I want to write some data to disk during profiling; that's why sending the data to some other process that does the writing is not helpful (unless I could somehow dedicate a core to that process). So far, the best method seems to be opening a raw file, accumulating the data, and then writing it in one go; any tips would be appreciated :)
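For reference, a rough sketch of what I mean by accumulating and then writing (just an illustration; it assumes the terms can be collected into an iolist first, and delayed_write is a related file:open/2 option that buffers small writes in the driver):

%% Sketch: collect everything as an iolist and flush it with a single write.
%% The delayed_write option additionally buffers small writes inside the driver.
write_batched(Filename, Terms) ->
    {ok, F} = file:open(Filename, [raw, write, delayed_write]),
    Data = [term_to_binary(T) || T <- Terms],   % build an iolist; no need to flatten
    ok = file:write(F, Data),                   % one big write instead of many small ones
    file:close(F).

Timed with timer:tc/3 in the same way as the test above.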

When I run the test I get about the same performance with +A1 and without. This is on OS X Snow Leopard with a dual core processor. This is also what I would expect using async threads.
Adding async threads only provides additional threads for doing IO, which increases the performance of parallel IO jobs. It also lets normal Erlang processes run at a faster rate, because the scheduler threads are not kept busy doing IO jobs.
If you run a test with many parallel jobs you should see a performance gain from using async threads.
Why you are seeing a performance decrease in your sequential tests is a mystery though.
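As a rough sketch of such a test (illustrative only; it reuses t/2 from the question and gives each writer its own file so the IO jobs are independent):

%% Sketch: N concurrent writers, each with its own raw file.
parallel_test(N) ->
    Parent = self(),
    Pids = [spawn(fun() ->
                {ok, F} = file:open("foo_" ++ integer_to_list(I), [raw, write]),
                ok = t(F, 1000),                 % reuse t/2 from the question
                file:close(F),
                Parent ! {done, self()}
            end) || I <- lists:seq(1, N)],
    [receive {done, P} -> ok end || P <- Pids],  % wait for all writers
    ok.

Timing that with timer:tc(test, parallel_test, [N]) for increasing N is where I would expect the +A setting to start showing a difference, since several writes can then be in flight at once.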

Related

Optimal parallelism for a given project with GNU make

I'd like to know the optimal number of cores needed to build a project with GNU make.
I can use --max-load to tune for an existing system, but I want to know if doubling or tripling the core count and memory would improve build wall clock times.
If I could collect statistics on how many recipes make holds waiting for a free core to execute and how long they occupy the core, this could be turned into a standard job scheduling problem.
I don't think there's any way to answer your question, really. Maybe you can be more specific about what you'd like to know.
Obviously, the more cores you have (assuming sufficient memory to support them), the more recipes make can invoke in parallel without crushing your system.
If you have 2 cores and you run make -j300 then make will dutifully invoke 300 jobs at once and your operating system will dutifully attempt to run all of them at the same time. Most likely, your system will be swapping and context switching so much that it will make very little progress and it would take less wall clock time to run make -j2 instead.
On the other hand, if you have 256 cores then make -j300 is probably quite reasonable... assuming you have enough memory that all those jobs aren't stuck waiting on swap.
And of course, at some point (but probably far away from any reasonable number of cores you have unless you have a lot of money to spend) you will run into disk IO issues with so many compiler processes running at the same time trying to read source from the disk to compile.
My go-to number is the number of CPUs + 1. This is based on a lot of informal benchmarks, and is usually very close to the optimal number: -j9 on a hyper-threaded four-core laptop, and -j49 on my usual production build server.
The + 1 means that make keeps all the CPUs occupied, even as jobs are being retired, and is usually a teensy-weensy bit faster than without the increment.
It also means that other users can use the same multiplier without melting the machine.
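On systems where the coreutils nproc command is available, the same multiplier can be computed on the command line rather than hard-coded (a sketch, assuming a POSIX-ish shell):

make -j"$(($(nproc) + 1))"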
Be aware though, that although -j49 ensures there are only 49 processes actually running, the parent make will potentially have many more child processes than that. For instance, a single compile may mean the shell is called, which calls a shell script, which calls the compiler driver, which calls the correct compiler stage. On some toolchains my -j49 builds have a peak of 245 child processes. A bit annoying when my ulimit max user processes is only 512.

Using Linux's time utility to measure performance of MPI program

I'm benchmarking an MPI program with different compiler setups.
Right now I use Linux's time to do so:
$> $(which time) mpirun -v [executable]
The values I get look ok in terms of what I expected.
Are there any reasons why I should not be using time for this?
Measuring the needed CPU time is of main interest here.
I'm aware that benchmarking on a single machine is not necessarily going to be consistent with what's happening on a cluster, but this is out of scope.
You should not use time for the purpose of getting the CPU time of an MPI program.
Firstly, it is not going to work in a distributed setup. Your question does not make clear whether you are targeting a single node or a cluster, but that does not even matter: an MPI implementation may use whatever mechanism it likes for launching processes, even on a single node, so time may or may not include the CPU time of the actual application processes.
But there are more conceptual issues. What does CPU time for an MPI program mean? It would be the sum of the CPU time of all processes, and that is a bad metric for benchmarking: it does not quantify improvement, and it does not correlate with overall runtime. For instance, a very imbalanced version of your code may use less CPU time but more wall time than a balanced one, and enabling busy waiting instead of blocking may improve overall runtime while increasing the CPU time used. To really understand what is happening, and which process uses what kind of resources, you should resort to a proper parallel performance analysis tool.
In HPC, you are not going to be budgeted by CPU time, but rather by reserved CPUs * walltime. So if you must use a one-dimensional metric, then walltime is the way to go. You can use time mpirun ... to get that, although accuracy won't be great for short-running applications.

Difference between processor and process in parallel computing?

Every time I come across something like "process 0 does x task", I am inclined to think they mean processor.
After reading a bit more about it, I find that there are two memory classifications, shared memory and distributed memory:
Shared memory executes something like a thread (implying the same data is available to all processors, hence it makes sense to call it a process). However, even for distributed memory it is called a process instead of a processor. For example: "Process 0 is computing the partial dot product".
Why is this so? Why is it called a process and not a processor?
PS. I hope this question is not trivial :)
These other answers are all pretty spot on. Processors are physical, processes are software. So a quad core CPU will have 4 processors, but can run many more processes.
Your confusion around distributed terminology is fair though. In distributed computing, you typically execute as many processes as there are hardware processors. In this scenario, each process gets an ID in software, often called a rank. Ranks are independent of processors, and different ranks will have different tasks. So when you report a status, the information is relative to the process rank, not the physical processor.
To rephrase, in distributed computing there will usually be one process running on each processor. The process has a unique ID that matters more in the software than the physical processor it is running on, so status information is given about the process. Since the number of processes and the number of processors are equal, the distinction can get a bit blurred.
The distinction is hardware vs software.
The process is the logical instance of your program. The processor is the hardware entity that runs the process. Most of the time, you don't care about the actual processor, only the process that's executing.
For instance, the OS may decide to temporarily put your processes to sleep in order to give other applications runtime, and later it may awaken them on different processors. As long as your processes produce the expected results, this should not be of any interest to you: all you care about is the computation, not where it's happening.
For me, processor refers to the hardware that is responsible for carrying out computing operations. A process is a single instance of some program. (I hope I understood what you meant.)
I would say they use the terms interchangeably because most of the time the context allows it, and the difference may be subtle to some extent. That is, since each (single-threaded) process executes on a processor, people typically do not want to make the distinction between the physical entity (processor) and the logical entity (process).
This assumption might be wrong when considering processors with multithreading capabilities (SMT, or Hyper-Threading on Intel processors) and/or when running multi-threaded applications, because processes may run on any available processor (or hardware thread). In those situations, people should be stricter when making such statements. Still, since it is possible to bind a process (and even a thread) to a processor (or processor thread) using affinity commands, the two terms can be used interchangeably under those circumstances.

OpenCL Execution model multiple queued kernels

I was curious as to how the GPU executes the same kernel multiple times.
I have a kernel which is being queued hundreds (possibly thousands) of times in a row, and using the AMD App Profiler I noticed that it would execute clusters of kernels extremely fast, then like clockwork every so often a kernel would "hang" (i.e. take orders of magnitude longer to execute). I think it's every 64th kernel that hangs.
This is odd because each time through the kernel performs the exact same operations with the same local and global sizes. I'm even re-using the same buffers.
Is there something about the execution model that I'm missing (perhaps other programs/the OS accessing the GPU, or the timing frequency of the GPU memory)? I'm testing this on an ATI HD5650 card under Windows 7 (64-bit), with AMD App SDK 2.5 and in-order queue execution.
As a side note, if I don't have any global memory accesses in my kernel (a rather impractical prospect), the profiler shows a gap between the quickly executing kernels, and where the slow executing kernels used to be there is now a large empty gap in which none of my kernels are being executed.
As a follow-up question, is there anything that can be done to fix this?
It's probable you're seeing the effects of your GPU's maximum number of concurrent tasks. Each task enqueued is assigned to one or more multiprocessors, which are frequently capable of running hundreds of workitems at a time - of the same kernel, enqueued in the same call. Perhaps what you're seeing is the OpenCL runtime waiting for one of the multiprocessors to free up. This relates most directly to the occupancy issue - if the work size can't keep the multiprocessor busy, through memory latencies and all, it has idle cycles. The limit here depends on how many registers (local or private memory) your kernel requires. In summary, you want to write your kernel to operate on multiple pieces of data more so than queueing it many times.
Did your measurement include reading back results from the apparently fast executions?

Multitasking on Linux with multiple CPUs

I feel my question is quite basic, but I couldn't find any related SO question.
I need to run a program a few thousand times (different input each time), and currently this is done by a shell script. The machine runs Ubuntu and has 8 CPUs (as revealed by cat /proc/cpuinfo). Using top, I see that only 1 CPU is utilized. In order to speed things up, I want to utilize all 8 CPUs. I know I can start the program in the background and then call it again (and indeed top reveals that 2 CPUs are utilized in that case), so I could change my shell script to call the program in groups of 8. My question is: is that a recommended way to utilize all CPUs, or is there another, somewhat 'cleaner' way?
You can use CPU affinity to be explicit about which processor the processes run on.
http://www.cyberciti.biz/tips/setting-processor-affinity-certain-task-or-process.html
However, if each process runs on a CPU (as it should; the kernel will make sure that things run as efficiently as possible), then just fire off n processes (8 in your case, or make your shell script figure out what n is so the script is a bit more robust, or make it a command-line option) and let the kernel do it for you. Each time a process ends, fire off another one until you are done.
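One common way to get that keep-n-jobs-going behaviour without hand-rolling the loop is xargs with its -P option (a sketch; ./myprogram and input_list.txt are placeholders for whatever your script currently runs, with one input per line):

# run up to 8 copies at once, passing one input argument per invocation
xargs -P 8 -n 1 ./myprogram < input_list.txt

GNU parallel offers similar functionality with more control over how the inputs are handed to each job.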
The question is overly vague.
That you want to use all the CPUs implies you want the end result as quickly as possible, but a major concern for the performance of multiple instances would be contention for resources (reducing performance) and caching (improving performance).
Splitting the job amongst multiple processes will usually yield results faster, and there are many, many ways of sharding the workload. But without knowing a lot more about what it is doing, it is difficult to recommend a particular approach.
Given that you have 8 CPUs, and assuming that the only constrained resource is the CPU, you don't want to have more than 8 threads running concurrently on the job. So the problem becomes how you schedule the work to ensure that you are using the 8 cores optimally. If you split the work into 8 scripts and run them concurrently, you will initially see all 8 scripts running at once, but it's very likely, depending on the nature of the work, that the scripts will finish at different times.
So if you really want to use the hardware optimally, that means running 8 processes as daemons, preferably with each process having a CPU affinity set, fed by a message queue. But is it really worthwhile coding all this if you're not going to be running it regularly? It may also be faster to run just 7 and keep a CPU free for handling the queue and other demands placed on the box.

Resources