I was wondering whether OCaml performs well, in terms of both performance and ease of implementation, when dealing with typical client/server interactions over TCP in a multi-threaded environment. I mean something really typical, like having a thread per client that receives data, applies changes to the game state and sends them back to the clients.
I ask because I need to write a server for a game. I have always done these things in C, but now that I know OCaml I am curious whether it would be a good fit, or whether I would just find myself trying to solve a typical problem in a language that doesn't suit it.

Performance: probably not. OCaml's threads do not provide parallel execution; they are only a way to structure your program. The OCaml runtime itself is not thread-safe, so the only code that could possibly execute in parallel with an OCaml thread would be interfaced C code (without callbacks to OCaml!).
Implementation-wise, there is a mutex on the run-time, which is released when calling blocking C primitives, and could also be released when calling C functions that do significant work.
Ease of implementation: it wouldn't be world-changing. You would have the comfort of OCaml and a pthread-like library on the side. If you are looking for new things to discover while leveraging what you have learnt of OCaml, I recommend JoCaml. It goes in and out of sync with OCaml, but there was a (re-)re-implementation quite recently, and even when it is slightly out of sync, it is a lot of fun, and a completely new perspective on concurrent programs.
JoCaml is implemented on top of OCaml. What with the run-time not being concurrent and all, I am almost sure it uses separate processes and message-passing. But for the application that you mentioned it should do fine.

OCaml is quite suitable for writing network servers, although as Pascal observes, there are limitations on threading.
Fortunately, however, threading isn't the only way to organize such a program. The Lwt library (for Light Weight Threads) provides an abstraction of asynchronous I/O that is quite easy to use (particularly when combined with a bit of syntax support). Everything actually runs in one thread, but it's all driven by an asynchronous I/O loop (built on the Unix select call), and the programming style lets you write code that looks like direct code (avoiding much of the normal code overhead of doing asynchronous I/O in many other languages). For example:
lwt my_message = read_message socket in
let response = compute_response my_message in
send_response socket response
Both the read and the write happen back in the main event loop, but you avoid the normal "read, calling this function when you're done" manual overhead.
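For readers without the Lwt syntax extension at hand, the same direct style can be sketched with Python's asyncio. This is an analogous illustration under stated assumptions, not Lwt itself: `compute_response` is a hypothetical stand-in for the game logic, and the `socketpair` in `demo_direct_style` stands in for a real accepted TCP connection.

```python
import asyncio
import socket


def compute_response(message: bytes) -> bytes:
    # Hypothetical stand-in for the real per-message game logic.
    return message.upper()


async def handle_client(reader: asyncio.StreamReader,
                        writer: asyncio.StreamWriter) -> None:
    # Reads like sequential code, but every await hands control back
    # to the single-threaded event loop until the I/O is ready.
    my_message = await reader.readline()
    response = compute_response(my_message)
    writer.write(response)
    await writer.drain()
    writer.close()
    await writer.wait_closed()


async def demo_direct_style() -> bytes:
    # A connected socket pair stands in for an accepted client connection.
    server_sock, client_sock = socket.socketpair()
    server_reader, server_writer = await asyncio.open_connection(sock=server_sock)
    client_reader, client_writer = await asyncio.open_connection(sock=client_sock)

    handler = asyncio.create_task(handle_client(server_reader, server_writer))
    client_writer.write(b"hello\n")
    await client_writer.drain()
    reply = await client_reader.readline()

    client_writer.close()
    await client_writer.wait_closed()
    await handler
    return reply
```

Calling `asyncio.run(demo_direct_style())` runs one request/response round trip on a single thread.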

I'm so sorry this question has been sitting here for eight years with what I consider to be several quite bad answers because they all ignore the elephant in the room.
You say "really typical like having a thread per client" but having an OS thread per client is an extremely bad design. Threads are heavyweight, taking a long time to create and destroy and consuming ~1MB just for the thread stack. If you have one thread per connection then 1,000 simultaneous client connections (which is entirely feasible) will burn 1GB of RAM just for their stacks and the performance of your program (in any language) will be crippled by the amount of context switching required to get any work done. You don't want to use that design in any language including both C and OCaml. Note that this problem is especially bad in the context of tracing garbage collected languages because the GC also traverses all of those thread stacks in order to collate global roots before every GC cycle. I am the first to admit that this anti-pattern is ubiquitous in the real world but please don't copy it! I have seen "low latency" servers in the finance industry written in C++ using one thread per connection and they suffered latency stalls of up to six seconds just from the (Windows) OS servicing those threads.
Let's consider an efficient design instead, like an epoll or kqueue interface to the OS kernel giving the server's code information about incoming and outgoing data buffers. Single threaded servers can attain excellent performance with this design. However, a typical server has serialization work to do per client and some core work that is often performed in serial across all client connections. Therefore, serialization and deserialization can be parallelized but the core server operation cannot. In this context, OCaml is great for everything except the serialization layer because it has poor support for parallelism.
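The readiness-based design can be sketched in a few lines of Python, whose `selectors.DefaultSelector` picks epoll or kqueue where the OS provides them. This is a minimal sketch, not a production server: `process` is a hypothetical stand-in for the serial core operation, the socket pairs stand in for accepted client connections, and replies are assumed small enough to send without write buffering.

```python
import selectors
import socket


def process(message: bytes) -> bytes:
    # Hypothetical core server operation, run serially for every client.
    return message[::-1]


def pump(selector: selectors.BaseSelector, timeout: float = 1.0) -> int:
    """Serve every socket that is readable right now; return how many were served."""
    served = 0
    for key, _events in selector.select(timeout):
        conn = key.fileobj
        data = conn.recv(4096)
        if data:
            conn.sendall(process(data))  # assumes a small reply
            served += 1
        else:
            # An empty read means the peer closed the connection.
            selector.unregister(conn)
            conn.close()
    return served


def demo_readiness_loop() -> list:
    selector = selectors.DefaultSelector()
    server_sides, clients = [], []
    for _ in range(3):
        server_side, client_side = socket.socketpair()
        server_side.setblocking(False)
        selector.register(server_side, selectors.EVENT_READ)
        server_sides.append(server_side)
        clients.append(client_side)

    for i, client in enumerate(clients):
        client.sendall(b"msg%d" % i)

    served = 0
    while served < len(clients):  # one thread serves all connections
        served += pump(selector)

    replies = [client.recv(4096) for client in clients]
    selector.close()
    for sock in server_sides + clients:
        sock.close()
    return replies
```

One thread handles all three connections here; the kernel reports which sockets have data, so no thread sits blocked per client.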
I have personally implemented many servers for various industries with hugely varying performance requirements. In my experience, OCaml is one of the best tools for this because it offers excellent libraries (easy to use and reliable) and excellent serial performance. The only issue I have is around parallelizing the serialization layer but, in practice, I have found that OCaml runs circles around alternatives like Java and .NET even though they can parallelize this. I found typical latencies were ~100us for .NET and 10us for OCaml.
OCaml will work great for networking applications as long as you can live with a relatively small number of threads active at one time—say no more than 100. You could consider MLdonkey as an example, although in the client space, not in the server space.

Haskell would be a better choice if you want to use many preemptive threads. GHC can support huge numbers of threads and they run in parallel on multicore systems. OCaml prefers cooperative multithreading and multiple processes.


MPI_Isend/MPI_Irecv: Is it possible to access unused memory locations of the send buffer in the meantime?

I would like to speed up my MPI program by using asynchronous communication, but the time taken remains the same. The workflow is as follows.
Blocking version:
1. MPI_Send/MPI_Recv halo (ca. 10 seconds)
2. process the whole array (ca. 12 seconds)
Nonblocking version:
1. MPI_Isend/MPI_Irecv halo (ca. 0.1 seconds)
2. process the array (without halo) (ca. 10 seconds)
3. MPI_Wait (ca. 10 seconds) (should be ca. 0 seconds)
4. process the halo only (ca. 2 seconds)
Measurements showed that the communication and processing the Array-core nearly take the same time for common workloads. So asynchronism should nearly hide the communication time.
But it doesn't.
One fact, and I think this could be the problem, is that the send buffer is also the array the calculations are made on. Is it possible that MPI serializes the memory access although communication ONLY accesses the halo (with a derived datatype) and the computation ONLY accesses the core (only reading) of the array?
Does anybody know if this is for sure the reason?
Is it maybe implementation-dependent (I'm using OpenMPI)?
Thanks in advance.
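To make the expectation concrete, the timings quoted in the question can be put into a small back-of-the-envelope model. The functions and the `overlap` parameter are hypothetical, introduced only for this illustration: `overlap=1.0` assumes the transfer progresses fully during the computation, `overlap=0.0` assumes nothing moves until MPI_Wait.

```python
def blocking_total(comm: float, compute_all: float) -> float:
    # Communicate first, then compute on the whole array.
    return comm + compute_all


def nonblocking_total(post: float, comm: float, compute_core: float,
                      compute_halo: float, overlap: float) -> float:
    # Part of the transfer hidden behind the core computation.
    hidden = min(comm, compute_core) * overlap
    wait = comm - hidden  # what is left for MPI_Wait
    return post + compute_core + wait + compute_halo


before = blocking_total(10.0, 12.0)                      # about 22 s
ideal = nonblocking_total(0.1, 10.0, 10.0, 2.0, 1.0)     # about 12.1 s
observed = nonblocking_total(0.1, 10.0, 10.0, 2.0, 0.0)  # about 22.1 s
```

With these numbers the measured behaviour matches the `overlap=0.0` case, i.e. the transfer appears to make no progress while the core is being processed.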
It isn't the case that MPI serializes the memory accesses in the user code (that's beyond the library's power to do, in general), and it is true that what exactly does happen is implementation specific.
But as a practical matter, MPI libraries don't do as much communication "in the background" as you might hope, and this is particularly true when using transports and networks like tcp + ethernet, where there's no meaningful way to hand off communication to another set of hardware.
You can only be sure that the MPI library is actually doing something when you're running MPI library code, e.g. in an MPI function call. Often, a call to any of a number of MPI calls will nudge an implementation's "progress engine" that keeps track of in-flight messages and ushers them along. So for instance one thing you can quickly do is to make calls to MPI_Test() on the requests within the compute loop to make sure things start happening well before the MPI_Wait(). There is of course overhead to this, but this is something that's easy to try to measure.
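The effect of nudging the progress engine can be illustrated with a toy model in plain Python. This is not MPI and makes no claim about how any MPI library is implemented: `ToyRequest` and `compute_with_polling` are hypothetical names, and the assumption is simply that a transfer advances one chunk per library call and never in between. In a real program the poll would be a call to MPI_Test() on the outstanding requests.

```python
class ToyRequest:
    """Toy stand-in for a nonblocking transfer that only advances when polled."""

    def __init__(self, chunks: int) -> None:
        self.remaining = chunks

    def test(self) -> bool:
        # Analogue of MPI_Test: advance a little, report completion.
        if self.remaining > 0:
            self.remaining -= 1
        return self.remaining == 0

    def wait(self) -> int:
        # Analogue of MPI_Wait: finish the transfer, return the chunks moved here.
        moved = self.remaining
        self.remaining = 0
        return moved


def compute_with_polling(compute_steps: int, chunks: int, poll_every: int) -> int:
    """Return how much transfer work is left for wait() after the compute loop."""
    request = ToyRequest(chunks)
    for step in range(1, compute_steps + 1):
        # ... one step of work on the array core would go here ...
        if poll_every and step % poll_every == 0:
            request.test()  # nudge the (toy) progress engine
    return request.wait()
```

In this model `compute_with_polling(100, 10, 0)` leaves all 10 chunks for the final wait, while polling every 10 steps leaves none.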
Of course you could imagine the MPI library would use some other mechanism to run things behind the scenes. Both MPICH2 and OpenMPI have played with separate "progress threads" which execute separately from the user code and do this ushering along in the background; but getting that to work well, and without tying up a processor while you're trying to run your computation, is a genuinely difficult problem. OpenMPI's progress threads implementation has long been experimental, and in fact is temporarily out of the current (1.6.x) release, although work continues. I'm not sure about MPICH2's support.
If you are using InfiniBand, where the network hardware has a lot of intelligence to it, then prospects brighten a bit. If you are willing to leave memory pinned (for the openfabrics), and/or you can use a vendor-specific module (mxm for Mellanox, psm for Qlogic), then things can progress somewhat more rapidly. If you're using shared memory, then the knem kernel module can also help with intranode transport.
One other implementation-specific approach you can take, if memory isn't a big issue, is to try to use eager protocols for sending the data directly, or send more data per chunk so fewer nudges of the progress engine are needed. What eager protocols means here is that data is automatically sent at send time, rather than just initiating a set of handshakes which will eventually lead to the message being sent. The bad news is that this generally requires extra buffer memory for the library, but if that's not a problem and you know the number of incoming messages is bounded (eg, by the number of halo neighbours you have), this can help a great deal. How to do this for (eg) shared memory transport for openmpi is described on the OpenMPI page for tuning for shared memory, but similar parameters exist for other transports and often for other implementations. One nice tool that IntelMPI has is an "mpitune" tool that automatically runs through a number of such parameters for best performance.
The MPI specification states:
A nonblocking send call indicates that the system may start copying
data out of the send buffer. The sender should not modify any part of the
send buffer after a nonblocking send operation is called, until the
send completes.
So yes, you should copy your data to a dedicated send buffer first.

What's so different about Node.js's event-driven model? Can't we do that with ASP.NET's IHttpAsyncHandler?

I'm not very experienced in web programming,
and I haven't actually coded anything in Node.js yet; I'm just curious about the event-driven approach. It does seem good.
The article explains some bad things that could happen when we use a thread-based approach to handle requests, and says we should opt for an event-driven approach instead.
In the thread-based approach, the cashier/thread is stuck with us until our food/resource is ready. In the event-driven approach, the cashier sends us somewhere out of the request queue so we don't block other requests while waiting for our food.
To scale the blocking thread-based, you need to increase the number of threads.
To me this seems like a bad excuse for not using threads/threadpools properly.
Couldn't that be properly handled using IHttpAsyncHandler?
ASP.Net receives a request, uses the ThreadPool and runs the handler (BeginProcessRequest), and then inside it we load the file/database with a callback. That Thread should then be free to handle other requests. Once the file-reading is done, the ThreadPool is called into action again and executes the remaining response.
Not so different for me, so why is that not as scalable?
One of the disadvantages of the thread-based that I do know is, using threads needs more memory. But only with these, you can enjoy the benefits of multiple cores. I doubt Node.js is not using any threads/cores at all.
So, based on just event-driven vs. thread-based (don't bring the "because it's Javascript and every browser..." argument), can someone point out to me what the actual benefit is of using Node.js instead of the existing technology?
That was a long question. Thanks :)
First of all, Node.js is not multi-threaded. This is important. You have to be a very talented programmer to design programs that work perfectly in a threaded environment. Threads are just hard.
You have to be a god to maintain a threaded project where it wasn't designed properly. There are just so many problems that can be hard to avoid in very large projects.
Secondly, the whole platform was designed to be run asynchronously. Have you seen any ASP.NET project where every single IO interaction was asynchronous? Simply put, ASP.NET was not designed to be event-driven.
Then, there's the memory footprint due to the fact that we have one thread per open-connection and the whole scaling issue. Correct me if I'm wrong but I don't know how you would avoid creating a new thread for each connection in ASP.NET.
Another issue is that a Node.js request is idle when it's not being used or when it's waiting for IO. On the other hand, a C# thread sleeps. Now, there is a limit to the number of these threads that can sleep. In Node.js, you can easily handle 10k clients at the same time in parallel on one development machine. You try handling 10k threads in parallel on one development machine.
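The idle-versus-sleeping distinction can be sketched with Python's asyncio standing in for Node's event loop (an analogy, not Node.js itself). The names `idle_client` and `serve_many` are hypothetical, and `asyncio.sleep` stands in for waiting on real I/O. Strictly speaking the clients are handled concurrently on one thread rather than in parallel.

```python
import asyncio


async def idle_client(client_id: int, wait_seconds: float) -> int:
    # While this client waits, it costs a small heap object,
    # not a sleeping OS thread with its own stack.
    await asyncio.sleep(wait_seconds)
    return client_id


async def serve_many(n_clients: int, wait_seconds: float = 0.05) -> int:
    # All clients wait at the same time on a single thread.
    results = await asyncio.gather(
        *(idle_client(i, wait_seconds) for i in range(n_clients))
    )
    return len(results)
```

`asyncio.run(serve_many(10_000))` keeps ten thousand waiting clients in flight at once without creating a thread for any of them.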
JavaScript itself as a language makes asynchronous coding easier. If you're still in C# 2.0, then the asynchronous syntax is a real pain. A lot of developers will simply get confused if you're defining Action<> and Function<> all over the place and using callbacks. An ASP.NET project written in an evented way is just not maintainable by an average ASP.NET developer.
As for threads and cores: Node.js is single-threaded and scales by creating multiple Node processes. If you have a 16-core machine then you run 16 instances of your node.js server and have a single Node.js load balancer in front of them. (Maybe an nginx load balancer if you want.)
This was all written into the platform at a very low-level right from the beginning. This was not some functionality bolted on later down the line.
Other advantages
Node.js has a lot more to it than the above. The above is only why Node.js' way of handling the event loop is better than doing it with asynchronous capabilities in ASP.NET.
Performance. It's fast. Real fast.
One big advantage of Node.js is its low-level API. You have a lot of control.
You have the entire HTTP server integrated directly into your code rather than outsourced to IIS.
You have the entire nginx vs Apache comparison.
The entire C10K challenge is handled well by node but not by IIS
AJAX and JSON communication feels natural and easy.
Real-time communication is one of the great things about Node.js. It was made for it.
Plays nicely with document-based nosql databases.
Can run a TCP server as well. Can do file-writing access, can run any unix console command on the server.
You query your database in javascript using, for example, CouchDB and map/reduce. You write your client in JavaScript. There are no context switches whilst developing on your web stack.
Rich set of community-driven open-source modules. Everything in node.js is open source.
Small footprint and almost no dependencies. You can build the node.js source yourself.
Disadvantages of Node.js
It's hard. It's young. As a skilled JavaScript developer, I face difficulty writing a website with Node.js just because of its low-level nature and the level of control I have. It feels just like C. A lot of flexibility and power either to be used for me or to hang me.
The API is not frozen. It's changing rapidly. I can imagine having to rewrite a large website completely in 5 years because of the amount Node.js will be changed by then. It is do-able, you just have to be aware that maintenance on node.js websites is not cheap.
further reading
There are a lot of misconceptions regarding node.js vs. ASP.NET and asynchronous programming. You can do non-blocking IO in ASP.NET. Most people don't know that the .NET framework uses Windows IO completion ports underneath when you do web service calls or other I/O-bound operations using the begin/end pattern in .NET 2.0 and above. IO completion ports are the way the Windows operating system supports non-blocking IO so that the app thread is freed while the IO operation completes. Interestingly, node.js uses a less optimal non-blocking IO implementation on Windows through Cygwin. A new Windows version is on the road map, which with Microsoft's guidance will use IO completion ports. At that point there will be no difference underneath.
It is also possible to do non-blocking database calls in ADO.NET but be aware of ORM tools such as NHibernate and Entity Framework. They are still very much synchronous.
Synchronous IO (blocking) makes the control flow much clearer and it has for this reason become popular. The reason why computer environments are multithreaded has only superficially to do with this. It is more generally related to time sharing and utilization of multiple CPUs.
Having only a single thread can cause starvation during lengthy operations, which can be related to both IO and complex computations. So, even though the rule of thumb is one thread per core when utilizing non-blocking IO, one should still consider a sufficient thread pool size so that simple requests don't get starved by more complex operations if such exist. Multiple threads also allow complex operations to be split easily among multiple CPUs. A single-threaded environment like node.js can only utilize multicore processors through more processes and message passing to coordinate action.
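The starvation effect, and the usual remedy of handing lengthy work to a thread pool, can be sketched with Python's asyncio. This is a minimal illustration with hypothetical names (`lengthy_operation`, `slow_request`, `simple_request`), and `time.sleep` stands in for a long blocking computation or blocking I/O call.

```python
import asyncio
import time


def lengthy_operation(seconds: float) -> None:
    # Stand-in for a long blocking computation or blocking I/O call.
    time.sleep(seconds)


async def slow_request(log: list, offload: bool) -> None:
    if offload:
        # Borrow a thread from the default pool; the event loop stays free.
        loop = asyncio.get_running_loop()
        await loop.run_in_executor(None, lengthy_operation, 0.2)
    else:
        # Runs on the event-loop thread: nothing else can be served meanwhile.
        lengthy_operation(0.2)
    log.append("slow done")


async def simple_request(log: list) -> None:
    log.append("simple done")


async def demo_starvation(offload: bool) -> list:
    log: list = []
    # The slow request is started first, the simple one right behind it.
    await asyncio.gather(slow_request(log, offload), simple_request(log))
    return log
```

Without offloading, the simple request only completes after the slow one has finished; with offloading, it completes first while the slow one runs on a pool thread.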
I have personally not yet seen any compelling argument to introduce an additional technology such as node.js. There may be good reasons, but in my opinion they have little to do with servicing a large number of connections through non-blocking IO, since this can also be done with ASP.NET.
BTW tamejs can help make your nodejs code more readable similar to the new upcoming .Net Async CTP.
It is easy to understate the cultural difference between the Node.js and ASP.NET communities. Sure, IHttpAsyncHandler exists and it's been around since .NET 1.0 so it might even be good, but all of the code and discussion around Node.js is about async I/O which is decidedly not the case when it comes to .NET. Want to use LINQ To SQL? You kind of can, kind of. Want to log stuff? Maybe "CSharp DotNet Logger" will work, maybe.
So yes, IHttpAsyncHandler is there and if you're really careful you might be able to write an event driven web-service without tripping over some blocking I/O somewhere, but I don't really get the impression a lot of people are using it (and it certainly isn't the prominent way for writing ASP.NET apps). In contrast, Node.js is all about evented I/O, all the code examples, all the libraries and it's the only way people are using it. So if you were going to bet on which one's evented I/O model actually worked all the way through, Node.js would probably be the one to pick.
Given how the technology has improved since, and after reading the links below, I would say that what matters is expertise and choosing the right mix for the particular scenario. Node.js is maturing, and on the ASP.NET side we have ASP.NET MVC, Web API, SignalR, etc. to make things better.
Node.js vs .Net performance

How does a (full featured) long polling server work abstractly

Since you're using an event loop as opposed to threads, how does the actual server look?
I know it uses an event loop, but how do you separate out the requests? And how do you prevent your server from running extremely slowly (since it, I assume, can only push one thing at a time since it's threadless?)
Some sort of pseudo-code would be great.
Forgive my ignorance; of course, if there's somewhere that explains it in a non-basic "this is good enough until you get 1000 visitors way", I'd be glad to know of it.
The implementation details of a long poll server would vary so much from platform to platform that your assumptions might not be correct.
I implemented a COMET server for our website using .NET. I leveraged HttpListener to do all the boring http stuff and Microsoft CCR to deal with all the async IO. It uses a pool of threads to service requests as and when they come in. It's not a thread per client, but it's not single threaded either, generally requiring a few tens of threads to stay fluid as user numbers rise. This approach means that we scale easily across multiple CPU cores. CCR's async enumerator pattern really helped keep the asynchronous logic nice and tidy, and I can read the code fairly easily a year later.
This approach has proved extremely scalable. I've tested up to 20000 clients, whereupon we became bound by network IO. It handles all our clients (who are "permanently" connected, reconnecting every 30s) ticking along at 1-2% server load. It's definitely worth reconsidering your assumption that you must choose either an event loop architecture or multiple threads. The middle ground works very nicely for me, and the .NET asynchronous programming model for dealing with IO bound tasks really takes you away from needing to micro-manage threads. Effectively, when there's IO data to process, a thread is borrowed from the pool to do that processing, and subsequently returned to the pool ready to service another request. All the complicated IOCP stuff is abstracted away.
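The answer above describes a .NET thread-pool design. For the abstract shape the question asks about, here is a minimal sketch in Python's asyncio with the HTTP layer left out entirely. `Channel`, `poll` and `publish` are hypothetical names: each long-poll request parks on a future until a message is published or the timeout expires, at which point the client is expected to reconnect.

```python
import asyncio


class Channel:
    """Holds the requests that are currently parked, waiting for a message."""

    def __init__(self) -> None:
        self._waiters: list = []

    async def poll(self, timeout: float):
        # One long-poll request: park until a message arrives or time runs out.
        future = asyncio.get_running_loop().create_future()
        self._waiters.append(future)
        try:
            return await asyncio.wait_for(future, timeout)
        except asyncio.TimeoutError:
            return None  # empty reply; the client reconnects
        finally:
            if future in self._waiters:
                self._waiters.remove(future)

    def publish(self, message) -> int:
        # Wake every parked request with the same message.
        waiters, self._waiters = self._waiters, []
        delivered = 0
        for future in waiters:
            if not future.done():
                future.set_result(message)
                delivered += 1
        return delivered


async def demo_long_poll():
    channel = Channel()
    parked = [asyncio.create_task(channel.poll(timeout=1.0)) for _ in range(3)]
    await asyncio.sleep(0.01)  # give the three requests time to park
    delivered = channel.publish("tick")
    replies = await asyncio.gather(*parked)
    late = await channel.poll(timeout=0.01)  # nothing published: times out
    return delivered, replies, late
```

A parked request costs only a future and a list entry, so pushing to many clients is a loop over that list rather than one blocked thread per client.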

Boost asio async vs blocking reads, udp speed/quality

I have a quick and dirty proof of concept app that I wrote in C# that reads high data rate multicast UDP packets from the network. For various reasons the full implementation will be written in C++ and I am considering using boost asio. The C# version used a thread to receive the data using blocking reads. I had some problems with dropped packets if the computer was heavily loaded (generally with processing those packets in another thread).
What I would like to know is if the async read operations in boost (which use overlapped io in windows) will help ensure that I receive the packets and/or reduce the cpu time needed to receive the packets. The single thread doing blocking reads is pretty straightforward, using the async reads seems like a step up in complexity, but I think it would be worth it if it provided higher performance or dropped fewer packets on a heavily loaded system. Currently the data rate should be no higher than 60Mb/s.
I've written some multicast handling code using boost::asio also. I would say that overall, in my experience there is a lot of added complexity to doing things in asio that may not make it easy for other people you work with to understand the code you end up writing.
That said, presumably the argument in favour of moving to asio instead of using lots of different threads to do the work is that you would have to do less context switching. This would clearly be true on a single-core box, but what about when you go multi-core? Are you planning to offload the work you receive to threads or just have a single thread doing the processing work? If you go for a single threaded approach you are going to end up in a situation where you could drop packets waiting for that thread to process the work.
In the end it's swings and roundabouts. I'd say you want to get some fairly solid figures backing up your arguments for going down this route if you are going to do so, just because of all the added complexity it entails (a whole new paradigm for some people I'm sure).

How is MPI I/O Implemented?

Long-Winded Background
I'm working on parallelising some code for cardiac electrophysiology simulations. Since users can specify their own simulations using an in-built scripting language, I have no way of knowing how to manage the trade-off of communication vs. computation. To combat this, I'm making a sort-of runtime profiler, which will decide how to handle the domain decomposition once it's seen the simulation to be run and the hardware environment that it has to work with.
My question is this:
How is MPI I/O implemented behind the scenes? Is each process actually writing to a single file on some other node, or is each process writing to some sparse file, which will get spliced back together when the file is closed?
Knowing this will help me decide whether to consider I/O operations as communication or computation, and adjust the balance accordingly…
Thanks in advance for any insight you can offer.
The mechanism for I/O is implementation dependent. In addition, there is not a single style of I/O. Some I/O is cached by the remote ranks and collected by the mpirun process at the end of the run. Some I/O is written to local scratch space as required. Some I/O is written to a NAS/SAN style high performance shared file system.
Some MPIs use 3rd party libraries to support I/O to parallel file systems, and those details may be proprietary. Some file systems are local discs, others are SAN over fiber or InfiniBand.
How are you planning to actually measure the time spent in I/O? Are you planning to use the PMPI interface to intercept all the calls into the library?
