What are the differences between MPI and OpenMP? [closed]

I would like to know (in a few words) what the main differences between OpenMP and MPI are.

OpenMP is a way to program on shared memory devices. This means that the parallelism occurs where every parallel thread has access to all of your data.
You can think of it as: parallelism can happen during execution of a specific for loop by splitting up the loop among the different threads.
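For example, a minimal sketch in C (the array and its size are invented for illustration): a single directive asks the compiler to split the loop iterations among threads, all of which see the same array.

#include <stdio.h>
#include <omp.h>

int main(void) {
    double a[1000];
    /* The iterations of this loop are divided among the threads;
       every thread writes to the same shared array. */
    #pragma omp parallel for
    for (int i = 0; i < 1000; i++)
        a[i] = 2.0 * i;
    printf("a[999] = %f\n", a[999]);
    return 0;
}

Built with an OpenMP-aware compiler (for example, gcc -fopenmp), the loop runs in parallel with no explicit thread management.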
MPI is a way to program on distributed memory devices. This means that the parallelism occurs where every parallel process is working in its own memory space in isolation from the others.
You can think of it as: every bit of code you've written is executed independently by every process. The parallelism occurs because you tell each process exactly which part of the global problem they should be working on based entirely on their process ID.
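A minimal MPI sketch of that idea (the global range N is invented for illustration): every process runs the same code, but each one uses its rank to pick its own slice of the work.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's ID */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes */

    /* Every process executes this same code, but each one picks its own
       slice of the global range [0, N) based on its rank. */
    const int N = 1000;
    int begin = rank * N / size;
    int end   = (rank + 1) * N / size;

    double local_sum = 0.0;
    for (int i = begin; i < end; i++)
        local_sum += 2.0 * i;

    printf("rank %d of %d handled [%d, %d)\n", rank, size, begin, end);

    MPI_Finalize();
    return 0;
}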
The way in which you write an OpenMP program and an MPI program is, of course, also very different.

MPI stands for Message Passing Interface. It is a set of API declarations for message passing (send, receive, broadcast, and so on), together with a specification of what behavior should be expected from implementations.
The idea of "message passing" is rather abstract. It could mean passing message between local processes or processes distributed across networked hosts, etc. Modern implementations try very hard to be versatile and abstract away the multiple underlying mechanisms (shared memory access, network IO, etc.).
OpenMP is an API which is all about making it (presumably) easier to write shared-memory multi-processing programs. There is no notion of passing messages around. Instead, with a set of standard functions and compiler directives, you write programs that execute local threads in parallel, and you control the behavior of those threads (what resource they should have access to, how they are synchronized, etc.). OpenMP requires the support of the compiler, so you can also look at it as an extension of the supported languages.
And it's not uncommon that an application can use both MPI and OpenMP.
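A common hybrid pattern, sketched here under the assumption of a thread-aware MPI build, is to use MPI between processes and OpenMP threads inside each process:

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    /* Ask for an MPI mode that tolerates OpenMP threads inside a rank. */
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* MPI distributes work across processes (possibly across machines);
       OpenMP parallelizes the loop across threads inside each process. */
    double local_sum = 0.0;
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < 1000; i++)
        local_sum += 1.0;

    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}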

Related

Writing Scheduler/RTOS in XC8

I have an interest in writing a scheduler/RTOS project in XC8 using an enhanced MCU with access to the hardware stack.
I am trying to figure out how to control the creation of the software stacks so each task's software stack will get a certain range in the general purpose ram.
Conceptually this is all easy to program in ASM but I want to be able to write C programs and have the software stacks for each task be put into the right address space.
There doesn't appear to be an option to create a separate software stack for a certain section of code or even create multiple software stacks - how do I do it?
Thanks
Stack switching is the responsibility of the scheduler, not the compiler - so you will not find a compiler option for that. You have to implement it in the scheduler you are intending to write - that is in fact most of what a scheduler does.
In an RTOS, switching context involves storing all the registers relating to one thread of execution and replacing them with those of another. This includes replacing the stack-pointer - that is how you switch stacks between threads. A context switch is completed when the program-counter register is loaded, effecting a jump to the new thread's last execution point (with all its registers, including the stack-pointer, restored).
The context switch itself necessarily involves at least a small amount of assembler code, but much of it may still be written in C, and tasks themselves may be written in C. A good description of a simple RTOS scheduler is provided in Jean Labrosse's book on μC/OS-II - freely available in PDF. A PIC18 port of μC/OS-II is described here with download.
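As a very rough sketch of the bookkeeping involved (all names and sizes below are invented; the actual register save/restore and stack-pointer swap would be the small assembler part mentioned above), the scheduler keeps one control block per task, holding that task's saved stack pointer and its reserved stack region in general-purpose RAM:

#include <stdint.h>

#define MAX_TASKS   4
#define STACK_WORDS 64   /* per-task software stack, sized for the example */

/* One task control block per task: its saved stack pointer plus its stack. */
typedef struct {
    uint8_t *sp;                    /* saved stack pointer for this task */
    uint8_t  stack[STACK_WORDS];    /* this task's region of RAM */
} tcb_t;

static tcb_t tasks[MAX_TASKS];
static int   current = 0;

/* Pick the next task round-robin; the real context switch would save the
   CPU registers and stack pointer into tasks[current], then restore them
   from tasks[next] in assembler before returning into that task. */
int schedule_next(void)
{
    current = (current + 1) % MAX_TASKS;
    return current;
}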

Spinlock vs Busy wait [closed]

Please explain why Busy Waiting is generally frowned upon whereas Spinning is often seen as okay. As far as I can tell, they both loop infinitely until some condition is met.
A spin-lock is usually used when there is low contention for the resource and the CPU will therefore only make a few iterations before it can move on to do productive work. However, library implementations of locking functionality often use a spin-lock followed by a regular lock; the regular lock is used if the resource cannot be acquired in a reasonable time-frame. This is done to reduce the overhead of context switches in settings where locks are usually obtained quickly.
The term busy-waiting tends to mean that you are willing to spin and wait for a change in a hardware register or a memory location. The term does not necessarily mean locking, but it does imply waiting in a tight loop, repeatedly probing for a change.
You may want to use busy-waiting in order to detect some kind of change in the environment that you want to respond to immediately. So a spin-lock is implemented using busy-waiting. Busy-waiting is useful in any situation where a very low latency response is more important than wasting CPU cycles (like in some types of embedded programming).
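As a minimal sketch (naive, not production quality), a spin-lock built directly on busy-waiting with C11 atomics looks like this:

#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;

void spin_lock(void)
{
    /* Busy-wait: keep probing until the flag was previously clear,
       i.e. until we are the thread that set it. */
    while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
        ;   /* spin */
}

void spin_unlock(void)
{
    atomic_flag_clear_explicit(&lock, memory_order_release);
}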
Related to this are the terms «lock-free» and «wait-free»:
So-called lock-free algorithms tend to use tight busy-waiting with a CAS instruction, but in ordinary situations the contention is so low that the CPU usually has to iterate only a few times.
So-called wait-free algorithms don't do any busy-waiting at all.
(Please note that «lock-free» and «wait-free» are used slightly differently in academic contexts; see Wikipedia's article on Non-blocking algorithms.)
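For illustration, a typical lock-free update loop built on CAS, sketched with C11 atomics; under the low contention described above it usually succeeds on the first try:

#include <stdatomic.h>

static _Atomic long counter = 0;

void lock_free_add(long delta)
{
    long old = atomic_load(&counter);
    /* Retry until no other thread changed the value between our read
       and our compare-and-swap; on failure, 'old' is refreshed with the
       value actually observed, and we try again. */
    while (!atomic_compare_exchange_weak(&counter, &old, old + delta))
        ;   /* loop and retry */
}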
Standard versus spin mutexes: When a thread calls the function that locks and acquires a mutex, the function does not return until the mutex is locked. This is typical for event synchronization: a thread waiting for an event, namely, the fact that it has acquired mutex ownership. There are two ways in which this can happen:
• An idle wait: the thread waiting to lock the mutex is blocked in a wait state as explained in Chapter 2. It releases the CPU, which can then be used to run another thread. When the mutex becomes available, the runtime system wakes up and reschedules the waiting thread, which can then lock the now available mutex.
• A busy wait, also called a spin wait, in which a thread waiting to lock the mutex does not release the CPU. It remains scheduled, executing some trivial do-nothing instruction until the mutex is released.
Standard mutexes normally subscribe to the first strategy, and perform an idle wait. But some libraries also provide mutexes that subscribe to the spin wait strategy. The best one depends on the application context. For very short waits spinning in user space is more efficient, because putting a thread to sleep in a blocked state takes cycles. But for long waits, a sleeping thread releases its CPU, making cycles available to other threads.
I found this quite relevant excerpt at this source: https://www.sciencedirect.com/topics/computer-science/waiting-thread
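A hedged sketch of the "spin briefly, then fall back to an idle wait" strategy from the excerpt, using POSIX threads (the spin count is an arbitrary number chosen for the example):

#include <pthread.h>

#define SPIN_TRIES 1000   /* arbitrary bound for the example */

void acquire(pthread_mutex_t *m)
{
    /* First, a short busy wait: cheap if the holder releases quickly. */
    for (int i = 0; i < SPIN_TRIES; i++) {
        if (pthread_mutex_trylock(m) == 0)
            return;             /* got the lock without sleeping */
    }
    /* Otherwise fall back to an idle wait: block and release the CPU. */
    pthread_mutex_lock(m);
}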
When you understand the precise reason for a rule and have detailed platform and application knowledge, you know when it's appropriate to violate that rule. Spinlocks are implemented by experts who have a complete understanding of the platform they're developing on and the intended applications for spinlocks.
The problems with busy waiting are numerous, but on most platforms, there are solutions for them. Issues include:
For CPUs with hyper-threading, a busy-waiting thread can starve another thread in the same physical core, even the very thread it's waiting for.
When you busy wait, when you finally do get the resource you're waiting for, you take the mother of all mispredicted branches.
Busy waiting interferes with CPU power management.
Busy waiting can saturate inter-core buses as the thing you keep checking causes cache synchronization traffic.
But the people who design spinlocks understand all these issues and know precisely how to mitigate them on the platform. They don't write naive spinning code, they write intelligent spinning code.
So yes, they both loop infinitely until some condition is met, but they loop differently.

MPI on a multicore machine

My situation is quite simple: I want to run MPI-enabled software on a single multiprocessor/multicore machine with, let's say, 8 cores.
My implementation of MPI is MPICH2.
As I understand I have a few options:
$ mpiexec -n 8 my_software
$ mpiexec -n 8 -hosts {localhost:8} my_software
or I could also tell Hydra to use "fork" rather than "ssh":
$ mpiexec -n 8 -launcher fork my_software
Could you tell me if there will be any differences, or if the behavior will be the same?
Of course, since all my nodes will be on the same machine, I don't want the "message passing" to go through the network (not even the loopback interface) but through shared memory. As I understand it, MPI will figure that out itself, and that will be the case for all three options.
Simple answer:
All methods should lead to the same performance. You'll have 8 processes running on the cores and using shared memory.
Technical answer:
"fork" has the advantage of compatibility, on systems where rsh/ssh process spawning would be a problem. But can, I guess, only start processes locally.
In the end (unless MPI is weirdly configured) all processes on the same CPU will end up using "shared memory", and neither the launcher nor the host specification method should matter for this. The communication method is handled by another parameter (-channel ?).
The specific syntax of the host specification method may let you bind processes to particular CPU cores, in which case you might get slightly better or worse performance depending on your application.
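One way to check where the processes actually land (a small sketch, not from the original question or answer) is to have every rank report the host it runs on; launched with any of the three lines from the question, all eight ranks should print the same machine name:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, len;
    char host[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(host, &len);   /* hostname this rank runs on */

    printf("rank %d is running on %s\n", rank, host);

    MPI_Finalize();
    return 0;
}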
If you've got everything set up correctly then I don't see that your program's behaviour will depend on how you launch it - unless, that is, it fails to launch under one or other of the options (which would mean that you didn't have everything set up correctly in the first place).
If memory serves me well, the way in which message passing is implemented depends on the MPI device(s) you use. It used to be that you would use the MPICH ch_shmem device. This managed the passing of messages between processes, but it did use buffer space, and messages were sent to and from this space. So message passing was done, but at memory-bus speed.
I write in the past tense because it's a while since I was that close to the hardware that I knew (or, frankly, cared) about low-level implementation details and more modern MPI installations might be a bit, or a lot, more sophisticated. I'll be surprised, and pleased, to learn that any modern MPI installation does, in fact, replace message-passing with shared memory read/write on a multicore/multiprocessor machine. I'll be surprised because it would require translating message-passing into shared memory access and I'm not sure that that is easy (or easy enough to be feasible) for the whole of MPI. I think it's far more likely that current implementations still rely on message-passing across the memory bus through some buffer area. But, as I state, that's only my best guess and I'm often wrong on these matters.

Is OCaml suitable to write networking servers?

I was wondering whether OCaml will do well in terms of performance and ease of implementation when dealing with typical client/server interactions over TCP in a multi-threaded environment. I mean something really typical, like having a thread per client that receives data, applies changes to the game state and sends it back to the clients.
This is because I need to write a server for a game, and I have always done these things in C; but since I now know OCaml, I was curious whether it would be OK or whether I would just find myself trying to solve a typical problem in a language that doesn't fit it well.
Performance: probably not. OCaml's threads do not provide parallel execution, they are only a way to structure your program. The OCaml runtime itself is not thread-safe, so the only code that could possibly execute in parallel with a single OCaml thread would be interfaced C code (without callbacks to OCaml!).
Implementation-wise, there is a mutex on the run-time, which is released when calling blocking C primitives, and could also be released when calling C functions that do significant work.
Ease of implementation: it wouldn't be world-changing. You would have the comfort of OCaml and a pthread-like library on the side. If you are looking for new things to discover while leveraging what you have learnt of OCaml, I recommend Jocaml. It goes in and out of sync with OCaml, but there was a (re-)re-implementation quite recently, and even when it is slightly out of sync, it is a lot of fun and a completely new perspective on concurrent programs.
Jocaml is implemented on top of OCaml. What with the run-time not being concurrent and all, I am almost sure it uses separate processes and message-passing. But for the application that you mentioned it should be able to do fine.
OCaml is quite suitable for writing network servers, although as Pascal observes, there are limitations on threading.
Fortunately, however, threading isn't the only way to organize such a program. The Lwt library (for Light Weight Threads) provides an abstraction of asynchronous I/O that is quite easy to use (particularly when combined with a bit of syntax support). Everything actually runs in one thread, but it's all driven by an asynchronous I/O loop (built on the Unix select call), and the programming style lets you write code that looks like direct code (avoiding much of the normal code overhead of doing asynchronous I/O in many other languages). For example:
lwt my_message = read_message socket in
let response = compute_response my_message in
send_response socket response
Both the read and the write happen back in the main event loop, but you avoid the normal "read, calling this function when you're done" manual overhead.
I'm so sorry this question has been sitting here for eight years with what I consider to be several quite bad answers because they all ignore the elephant in the room.
You say "really typical like having a thread per client" but having an OS thread per client is an extremely bad design. Threads are heavyweight, taking a long time to create and destroy and consuming ~1MB just for the thread stack. If you have one thread per connection then 1,000 simultaneous client connections (which is entirely feasible) will burn 1GB of RAM just for their stacks and the performance of your program (in any language) will be cripppled by the amount of context switching required to get any work done. You don't want to use that design in any language including both C and OCaml. Note that this problem is especially bad in the context of tracing garbage collected languages because the GC also traverses all of those thread stack in order to collate global roots before every GC cycle. I am the first to admit that this anti-pattern is ubiquitous in the real world but please don't copy it! I have seen "low latency" servers in the finance industry written in C++ using one thread per connection and they suffered latency stalls of up to six seconds just from the (Windows) OS servicing those threads.
See: http://people.eecs.berkeley.edu/~sangjin/2012/12/21/epoll-vs-kqueue.html
Let's consider an efficient design instead, like an epoll or kqueue interface to the OS kernel giving the server's code information about incoming and outgoing data buffers. Single threaded servers can attain excellent performance with this design. However, a typical server has serialization work to do per client and some core work that is often performed in serial across all client connections. Therefore, serialization and deserialization can be parallelized but the core server operation cannot. In this context, OCaml is great for everything except the serialization layer because it has poor support for parallelism.
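Since the question mentions having written such servers in C, here is a rough C sketch of that single-threaded epoll pattern (the port, buffer size, and echo behaviour are invented, and error handling is omitted):

#include <sys/epoll.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>

int main(void) {
    /* Listening socket on an arbitrary example port. */
    int listener = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port = htons(9000),
                                .sin_addr.s_addr = htonl(INADDR_ANY) };
    bind(listener, (struct sockaddr *)&addr, sizeof addr);
    listen(listener, 128);

    int ep = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = listener };
    epoll_ctl(ep, EPOLL_CTL_ADD, listener, &ev);

    struct epoll_event events[64];
    char buf[4096];

    for (;;) {
        /* One thread waits for readiness on every connection at once. */
        int n = epoll_wait(ep, events, 64, -1);
        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;
            if (fd == listener) {
                int client = accept(listener, NULL, NULL);
                struct epoll_event cev = { .events = EPOLLIN, .data.fd = client };
                epoll_ctl(ep, EPOLL_CTL_ADD, client, &cev);
            } else {
                ssize_t got = read(fd, buf, sizeof buf);
                if (got <= 0) { close(fd); continue; }   /* client gone */
                write(fd, buf, got);                      /* echo back */
            }
        }
    }
}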
I have personally implemented many servers for various industries with hugely varying performance requirements. In my experience, OCaml is one of the best tools for this because it offers excellent libraries (easy to use and reliable) and excellent serial performance. The only issue I have is around parallelizing the serialization layer but, in practice, I have found that OCaml runs circles around alternatives like Java and .NET even though they can parallelize this. I found typical latencies were ~100us for .NET and 10us for OCaml.
See also: http://prl.ccs.neu.edu/blog/2016/05/24/measuring-gc-latencies-in-haskell-ocaml-racket/
OCaml will work great for networking applications as long as you can live with a relatively small number of threads active at one time—say no more than 100. You could consider MLdonkey as an example, although in the client space, not in the server space.
Haskell would be a better choice if you want to use many preemptive threads. GHC can support huge numbers of threads and they run in parallel on multicore systems. OCaml prefers cooperative multithreading and multiple processes.

SIP test platform [closed]

I am searching for a tool that tests SIP calls. A platform that makes a call from SIP device A to SIP device B and reports results...
Any idea? A simulation platform would be ideal.
thnx,
cateof
There are many solutions. Some more broken than others. Here's a quick summary of what I've found while looking for a base for a proper automated testing solution.
SIPp
It's OK if you want only a single dialog at a time. What doesn't work here is complex scenarios where you need to synchronise 2 call legs, or do registration, call and presence in the same scenario. If you go this way, you'll end up running multiple SIPp scenarios, one for each conversation element separately. SIPp also doesn't scale at all for media transfers. Even though it's multithreaded, something stops it from running concurrently - if you look at htop, for example, you'll see that SIPp never crosses the 100% line. At around 50 media calls it starts to cut audio and take all of the machine's CPU.
It can sometimes lose track of what's happening, and packets that don't really even belong to the call can fail the test. It's got some silly bugs, like case-sensitive comparison of the headers.
SIPr/sipper
Ruby-based solution where you have to write your own scenarios in Ruby. It's got its own SIP stack and lots of tests. While it's generally good and handles a lot of complex scenarios nicely, its design is terrible. Bugs are hard to track down, and after a week I had >10 patches that I needed just to make it do basic stuff. Later I learned that some of the scenarios were just written in a different way, but the SIPr developers were not really responsive and it took a lot of time to find that out. Synchronising the actions of many agents is a hard problem, and since they'd rather use an event-based, but still single-threaded, design... it just makes you concentrate too much on "what order can this happen in, and do I handle it correctly", rather than on writing the actual test.
WinSIP
Commercial solution. Never tested it properly since the basic functionality is missing from the evaluation version and it's hard to spend that much money on something you're not sure works...
SipUnit
Java-based solution reusing the Jain-SIP stack. It can do almost any scenario and is fairly good. It tries to make everything non-blocking / action-based, leading to the same problems SIPr has, but in this case it's trivial to make it parallel / threaded. It has its own share of bugs, so not everything works well in the vanilla package, but most of the stuff is patchable. The developers seem to be busy with other projects, so it hasn't been updated for a long time. If you need transfers, presence, dialog-info, custom messages, RTP handling, etc., you'll have to write your own modifications to support them. It is not good for performance testing.
If you're a Java-hater like me, it can be used in a simple way from Jython, JRuby or any other JVM language.
In the end, I chose SIPunit as the least broken/evil/unusable solution. It is by no means perfect, but... it works in most cases. If I was doing the project once again with all this knowledge, I'd probably reuse SIPp configurations and try to write my own, sane solution that uses proper threading - but that's at least a ½ year project for one person, to make it good enough for production.
Check out SIPp at SourceForge. It has many different scenarios for testing; the UAS mode (server) would probably be the interesting one for you, and it seems to allow INVITE, BYE, etc.
Try SIPInspector. It is a Java-based utility to re-create different SIP signaling scenarios. It can play RTP and stress-test your system too. Since it is written in Java it is highly portable and works on different operating systems. It is way easier to use than SIPp.
What do you want to test apart from whether the call gets through? Can't you simply call device B from device A and see if you can talk through the connection? If you want to have a look at the packets being sent, you should look into Wireshark.
