Everything I've read and experienced (Tornado-based apps) leads me to believe that epoll is a natural replacement for select- and poll-based networking, especially with Twisted. That makes me paranoid: it's pretty rare for a better technique or methodology not to come with a price.
Reading a couple dozen comparisons between epoll and its alternatives shows that epoll is clearly the champion for speed and scalability, specifically that it scales linearly, which is fantastic. That said, what about processor and memory utilization: is epoll still the champ?

For very small numbers of sockets (varies depending on your hardware, of course, but we're talking about something on the order of 10 or fewer), select can beat epoll in memory usage and runtime speed. Of course, for such small numbers of sockets, both mechanisms are so fast that you don't really care about this difference in the vast majority of cases.
One clarification, though. Both select and epoll scale linearly. A big difference, though, is that the userspace-facing APIs have complexities that are based on different things. The cost of a select call goes roughly with the value of the highest numbered file descriptor you pass it. If you select on a single fd, 100, then that's roughly twice as expensive as selecting on a single fd, 50. Adding more fds below the highest isn't quite free, so it's a little more complicated than this in practice, but this is a good first approximation for most implementations.
The cost of epoll is closer to the number of file descriptors that actually have events on them. If you're monitoring 200 file descriptors, but only 100 of them have events on them, then you're (very roughly) only paying for those 100 active file descriptors. This is where epoll tends to offer one of its major advantages over select. If you have a thousand clients that are mostly idle, then when you use select you're still paying for all one thousand of them. However, with epoll, it's like you've only got a few - you're only paying for the ones that are active at any given time.
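To make the difference in API shape concrete, here is a minimal sketch using Python's standard select module. The helper names (ready_with_select, ready_with_epoll, demo_readiness) are made up for illustration, and the epoll half only runs where select.epoll exists (Linux). It shows the shape of the two calls, not their relative cost.

```python
import select
import socket


def ready_with_select(socks):
    """Return the sockets that select() reports as readable right now."""
    # The whole interest set is handed to the kernel on every single call.
    readable, _, _ = select.select(socks, [], [], 0)
    return readable


def ready_with_epoll(socks):
    """Same question via epoll; returns None where epoll is unavailable."""
    if not hasattr(select, "epoll"):
        return None
    ep = select.epoll()
    try:
        by_fd = {s.fileno(): s for s in socks}
        for fd in by_fd:
            # Registered once; the kernel keeps the interest set.
            ep.register(fd, select.EPOLLIN)
        # Only descriptors that actually have events come back.
        return [by_fd[fd] for fd, _events in ep.poll(0)]
    finally:
        ep.close()


def demo_readiness(n_connections=50, active=7):
    """Open many idle connections, make one active, ask both APIs."""
    pairs = [socket.socketpair() for _ in range(n_connections)]
    try:
        readers = [a for a, _ in pairs]
        pairs[active][1].send(b"x")  # exactly one connection has data
        sel = [readers.index(s) for s in ready_with_select(readers)]
        ep = ready_with_epoll(readers)
        if ep is not None:
            ep = [readers.index(s) for s in ep]
        return sel, ep
    finally:
        for a, b in pairs:
            a.close()
            b.close()
```

In a real reactor the epoll object would be created once and reused across calls; it is created inside the helper here only to keep the sketch self-contained.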
All this means that epoll will lead to less CPU usage for most workloads. As far as memory usage goes, it's a bit of a toss-up. select does manage to represent all the necessary information in a highly compact way (one bit per file descriptor). And the FD_SETSIZE (typically 1024) limitation on how many file descriptors you can use with select means that you'll never spend more than 128 bytes for each of the three fd sets you can use with select (read, write, exception). Compared to those 384 bytes max, epoll is sort of a pig. Each file descriptor is represented by a multi-byte structure. However, in absolute terms, it's still not going to use much memory. You can represent a huge number of file descriptors in a few dozen kilobytes (roughly 20k per 1000 file descriptors, I think). And you can also throw in the fact that you have to spend all 384 of those bytes with select if you only want to monitor one file descriptor but its value happens to be 1023, whereas with epoll you'd only spend 20 bytes. Still, all these numbers are pretty small, so it doesn't make much difference.
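The arithmetic behind those figures, as a small sketch. The 20-bytes-per-descriptor number is the rough guess quoted above, not a measured value, and FD_SETSIZE is platform-dependent; the function names are illustrative.

```python
FD_SETSIZE = 1024        # typical value; platform-dependent
EPOLL_BYTES_PER_FD = 20  # rough per-descriptor guess from the text, not measured


def select_bytes(n_sets=3, fd_setsize=FD_SETSIZE):
    """Bytes used by select's fd sets: one bit per possible descriptor."""
    return n_sets * fd_setsize // 8


def epoll_bytes(n_fds, per_fd=EPOLL_BYTES_PER_FD):
    """Approximate bytes used by epoll for n_fds registered descriptors."""
    return n_fds * per_fd


print(select_bytes(1))    # one fd set
print(select_bytes())     # read + write + exception sets
print(epoll_bytes(1000))  # 1000 monitored descriptors
```

Note that select's cost is fixed regardless of how many descriptors you monitor, while the epoll figure grows with the number registered.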
And there's also that other benefit of epoll, which perhaps you're already aware of, that it is not limited to FD_SETSIZE file descriptors. You can use it to monitor as many file descriptors as you have. And if you only have one file descriptor, but its value is greater than FD_SETSIZE, epoll works with that too, but select does not.
Randomly, I've also recently discovered one slight drawback to epoll as compared to select or poll. While none of these three APIs supports normal files (i.e., files on a file system), select and poll present this lack of support by reporting such descriptors as always readable and always writable. This makes them unsuitable for any meaningful kind of non-blocking filesystem I/O, but a program which uses select or poll and happens to encounter a file descriptor from the filesystem will at least continue to operate (or if it fails, it won't be because of select or poll), albeit perhaps not with the best performance.
On the other hand, epoll will fail fast with an error (EPERM, apparently) when asked to monitor such a file descriptor. Strictly speaking, this is hardly incorrect. It's merely signalling its lack of support in an explicit way. Normally I would applaud explicit failure conditions, but this one is undocumented (as far as I can tell) and results in a completely broken application, rather than one which merely operates with potentially degraded performance.
In practice, the only place I've seen this come up is when interacting with stdio. A user might redirect stdin or stdout from/to a normal file. Whereas previously stdin and stdout would have been a pipe -- supported by epoll just fine -- it then becomes a normal file and epoll fails loudly, breaking the application.
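If you want to see this behaviour for yourself, a small probe along these lines should do it. This is a sketch assuming a Unix-like system; probe_regular_file is a made-up name, and the epoll part is skipped where select.epoll does not exist.

```python
import errno
import select
import tempfile


def probe_regular_file():
    """Ask select and epoll about a descriptor backed by a normal file.

    Returns ((readable, writable) according to select,
             errno raised by epoll registration, or None).
    """
    with tempfile.TemporaryFile() as f:
        f.write(b"data")
        f.flush()
        f.seek(0)
        readable, writable, _ = select.select([f], [f], [], 0)
        select_says = (bool(readable), bool(writable))

        epoll_errno = None
        if hasattr(select, "epoll"):
            ep = select.epoll()
            try:
                ep.register(f.fileno(), select.EPOLLIN)
            except OSError as exc:
                epoll_errno = exc.errno  # expected to be EPERM on Linux
            finally:
                ep.close()
        return select_says, epoll_errno


says, err = probe_regular_file()
print(says, errno.errorcode.get(err))
```

A program that may be handed a redirected stdin or stdout could catch that OSError and fall back to plain blocking reads for that one descriptor.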

In tests at my company, one issue with epoll() came up, and with it one cost compared to select.
When attempting to read from the network with a timeout, creating an epoll fd (instead of an fd_set) and adding the fd to it is much more expensive than creating an fd_set (which is little more than a simple memory allocation).
As per the previous answer, as the number of FDs in the process becomes large, the cost of select() becomes higher, but in our testing, even with fd values in the tens of thousands, select was still the winner. These are cases where a thread is waiting on only one fd and is simply trying to work around the fact that network reads and writes don't time out under a blocking thread model. Of course, blocking thread models are low performance compared to non-blocking reactor systems, but there are occasions where, to integrate with a particular legacy code base, it is required.
This kind of use case is rare in high performance applications, because a reactor model doesn't need to create a new epoll_fd every time. For the model where an epoll_fd is long-lived --- which is clearly preferred for any high performance server design --- epoll is the clear winner in every way.
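For reference, the one-fd-with-a-timeout pattern described here looks roughly like this with select. The function names are illustrative only. Nothing persistent is created: each call just builds a throwaway interest set.

```python
import select
import socket


def recv_with_timeout(sock, nbytes, timeout):
    """Blocking-style read that gives up after `timeout` seconds."""
    # One throwaway interest set per call; no kernel object to set up.
    readable, _, _ = select.select([sock], [], [], timeout)
    if not readable:
        return None  # timed out, nothing arrived
    return sock.recv(nbytes)


def demo_timeout():
    a, b = socket.socketpair()
    try:
        first = recv_with_timeout(a, 16, 0.05)   # nothing has been sent yet
        b.send(b"ping")
        second = recv_with_timeout(a, 16, 0.05)  # data is waiting now
        return first, second
    finally:
        a.close()
        b.close()
```

Doing the same with epoll would mean creating an epoll object and registering the fd on every call, which is the setup cost being described.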


Is it a good practice to write a GB of bytes into a TCP socket in one go?

I am maintaining some mature production code which sends data over TCP sockets. It always breaks a large chunk of data into many packets of 1000 bytes each. I just wonder why it was done this way. Why can't I just write a GB worth of a byte array into the socket in one go? What are the cons of doing that?
There are many reasons not to throw a huge chunk in at once.
First of all: Even on very fast networks sending a GB of data will take a non-trivial amount of time. On a 10Gbps network it would take a little under 1 second, which is a long time in computer speak. And that assumes that this one operation has all the bandwidth of the network available to it and doesn't have to share with anything else.
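As a quick check of that figure, assuming 1 GB means 10^9 bytes and that the link is completely uncontended (the function name is just for illustration):

```python
def transfer_seconds(n_bytes, link_bits_per_second):
    """Ideal time to push n_bytes through a link, ignoring all overhead."""
    return n_bytes * 8 / link_bits_per_second


# 1 GB over 10 Gbps, then over 1 Gbps for comparison
print(transfer_seconds(10**9, 10 * 10**9))
print(transfer_seconds(10**9, 10**9))
```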
This means that if you successfully do a 1GB write call to a TCP socket, it will be some time until the later bits of data are actually sent.
And in that mean time you'll have to hold all that data in memory. That means that you'll need to allocate and hold on to 1GB of data for that whole transaction.
If instead you fill a small-ish buffer and read from your source (or generate, depending on where the data comes from) before each write, then you'll need only a little memory (the size of the buffer).
All of that might not sound like a big deal with today's machines, but consider that many servers will serve hundreds of clients/requests at once, and if each one requires a 1GB buffer, then that can grow out of hand quickly.
Is 1000 a good size for that buffer? I'm no networking expert, but I suspect that's a little low. Maybe something on the order of 64k would be appropriate, but others can give better details here. Finding a good buffer size can sometimes be a bit tricky.
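A rough sketch of the small-buffer approach, assuming the data comes from a file-like object. The 64 KiB chunk size is just the guess from above, not a tuned value, and send_stream and demo_chunked_send are illustrative names.

```python
import io
import socket
import threading

CHUNK_SIZE = 64 * 1024  # assumption: the "on the order of 64k" guess, not tuned


def send_stream(sock, source, chunk_size=CHUNK_SIZE):
    """Copy a file-like object to a socket through one small reusable buffer."""
    buf = bytearray(chunk_size)
    view = memoryview(buf)
    total = 0
    while True:
        n = source.readinto(buf)  # refill the buffer from the source
        if not n:
            return total
        sock.sendall(view[:n])    # sendall loops until the chunk is written
        total += n


def demo_chunked_send(payload_size=1 << 20):
    payload = bytes(range(256)) * (payload_size // 256)
    a, b = socket.socketpair()
    received = bytearray()

    def drain():
        # Reader side: collect everything until the sender closes.
        while True:
            data = b.recv(CHUNK_SIZE)
            if not data:
                break
            received.extend(data)

    reader = threading.Thread(target=drain)
    reader.start()
    try:
        sent = send_stream(a, io.BytesIO(payload))
    finally:
        a.close()  # signals end-of-stream to the reader
    reader.join()
    b.close()
    return sent, bytes(received) == payload
```

The sender never holds more than one chunk of its own at a time, no matter how large the source is. (The demo keeps the whole payload in memory only so it can verify the result.)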
Depending on the underlying implementation, attempting to send 1GB in one go may result in the 1GB being copied into some cache, and then held there for some time. So it can be a problem if not enough memory is available (and even if there is enough available memory, it may not be the most efficient way to make use of it).
Though "manually" splitting it into segments of 1000 bytes each does sound a bit like overkill to me.

How to write with a single node in MPI

I want to implement some file io with the routines provided by MPI (in particular Open MPI).
Due to possible limitations of the environment, I wondered whether it is possible to limit the nodes that are responsible for IO, so that all other nodes perform a hidden mpi_send to this group of processes, which then actually writes the data. This would be nice in cases where, e.g., the master node is placed on a node with a high-performance filesystem and the other nodes only have access to a low-performance filesystem where the binaries are stored.
Actually, I already found some information that might be helpful, but I couldn't find anything further on how to actually implement these things:
1: There is an info key MPI_IO belonging to the communicator, which tells which ranks provide standard-conforming IO routines. As this is listed as an environmental inquiry, I don't see where I could modify it.
2: There is an info key io_nodes_list which seems to belong to file-related info-objects. Unfortunately, the possible values for this key are not documented and Open MPI doesn't seem to implement them in any way. Actually, I can't even get the filename from the info-object which is returned by mpi_file_get_info...
As a workaround, I could imagine two things: on the one hand, I could perform the IO with standard Fortran routines; on the other hand, I could create a new communicator that is responsible for IO. But in both cases, the processes responsible for IO have to check for pending IO from the other processes and perform the communication and file interaction manually.
Is there a nice and automatic way to restrict the IO to certain nodes? If yes, how could I implement this?
You explicitly asked about OpenMPI, but there are two MPI-IO implementations in OpenMPI. The old workhorse is ROMIO, the MPI-IO implementation shared among just about every MPI implementation. OpenMPI also has OMPIO, but I don't know a whole lot about tuning that one.
Next, if you want things to happen automatically for you, you'll have to use collective I/O. The independent I/O routines cannot send a message to anyone else -- they are independent and there's no way to know if the other side will be listening.
With those preliminaries out of the way...
You are asking about "I/O aggregation". There is a bit of information here in the context of another optimization called "deferred open" (which OMPIO calls Lazy Open).
In short, you can definitely say "only these N processes should do I/O", and then the collective I/O library will exchange data and make sure that happens. The optimization was developed some 15-odd years ago for just the situation you proposed: some nodes being better connected to storage than others (as was the case on the old ASCI Red machine, to give you a sense for how old this optimization is...)
I don't know where you got io_nodes_list. You probably want to use the MPI-IO info keys cb_config_list and cb_nodes.
So, you've got a cluster with master1, master2, master3, and compute1, compute2, compute3 (or whatever the hostnames actually are). You can do something like this (in C, sorry; I'm not proficient in Fortran):
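MPI_Info info;
MPI_File fh;
MPI_Info_create(&info);  /* the info object must exist before hints are set */
MPI_Info_set(info, "cb_config_list", "master1:1,master2:1,master3:1");
MPI_File_open(MPI_COMM_WORLD, "output.dat", MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);  /* "output.dat" is a placeholder file name */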
With these hints, MPI_File_write_all will aggregate all the I/O through the MPI processes on master1, master2, and master3. ROMIO won't blow up your memory because it will chunk up the I/O into a smaller working set (specified with the "cb_buffer_size" hint: cranking this up, if you have the memory, is a good way to get better performance).
There is a ton of information about the hints you can set in the ROMIO users guide.

Optimization of parallel programming

I want to use MPI to make my program parallel, and I want to send something to other computers. I want to know which one is better: sending a huge buffer one time or sending two smaller messages 3 times at different times during the execution instead of all at once?
It's almost always going to be faster to send one big message than several smaller ones. Each time you do a Send/Receive pair, the two processes have to go through the entire process of sending a message to each other, including at least 6 roundtrip messages. If you are just sending one larger message, there is a minimum of 2 roundtrip messages. Each of those messages can be very expensive (compared to doing things locally like packing all of your data into one buffer).
I'd encourage you to try it out both ways though to be sure that this applies to your application. It could be different if you're doing something unexpected.
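One way to reason about this before measuring is the usual first-order cost model: every message pays a fixed latency, plus its size divided by the bandwidth. This model is my assumption, not something from the MPI standard, and the numbers below are made-up placeholders rather than measurements of any real interconnect.

```python
def transfer_time(total_bytes, n_messages, latency_s, bandwidth_bytes_per_s):
    """First-order model: fixed cost per message plus time on the wire."""
    return n_messages * latency_s + total_bytes / bandwidth_bytes_per_s


# Hypothetical figures: 50 microseconds per message, 1 GB/s, 1 MB of data.
LATENCY = 50e-6
BANDWIDTH = 1e9
TOTAL = 1_000_000

one_big = transfer_time(TOTAL, 1, LATENCY, BANDWIDTH)
many_small = transfer_time(TOTAL, 1000, LATENCY, BANDWIDTH)
print(one_big, many_small)
```

Under these assumptions the wire time is the same either way, and the whole difference comes from paying the per-message cost a thousand times instead of once. Real runs can deviate from this, which is why timing it is still worth doing.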
Depending on your problem, sending all the data at once may be more efficient, because the nodes have to be synced every time you send. That may cause a delay.
I always try to send as much data as I can in a single MPI call. In my experience, sending many small bits of data greatly increases the overhead and network traffic, and I have even run into problems where I overwhelmed the computers' ability to keep up with the number of requests, because I was sending a large member of a complicated class, one integer at a time, to many workers. Therefore, when possible, send the entire data at once, unless you have some reason to believe it is too large.
Further, I strive to use 100% of all the CPUs my program claims. When you are working on shared resources, if you claim a CPU, you need to actually use it. Otherwise, someone else who wants to use that core, or node, is blocked out while your program sits and does nothing. For example, on a Cray which I have used, even if you call for only two 'cores', the manager will reserve a full bank of 24 cores, essentially wasting 22. Or, perhaps one worker has nothing to do, while another chugs away -- again, wasting time. Hopefully, there is a way to balance the load, so to speak, to avoid unintentional waste of resources.
Back to the topic at hand. Demonstrate timing and efficiency of vector sending to yourself -- write a program which breaks up the vector into varying sizes of packets, and do the sends/receives. Test it with varying numbers of workers, and on several different configurations of computers, if you can. Before writing production code, do proof of concept, and what optimization you can. Test and time it!

Asynchronous programs showing locality of reference?

I was reading this excellent article which gives an introduction to Asynchronous programming here and I came across the following line which I find hard to understand.
Since there is no actual parallelism (in async), it appears from our diagrams that an asynchronous program will take just as long to execute as a synchronous one, perhaps longer as the asynchronous program might exhibit poorer locality of reference.
Could someone explain how locality of reference comes into picture here?
Locality of reference, as the Wikipedia article mentions, is the observation that when some data is accessed (on disk, in memory, whatever), other data near that location is often accessed as well. This observation makes sense since developers tend to group similar data together. Since the data are related, they're often processed together. Specifically, this is known as spatial locality.
For a weak example, imagine computing the sum of an array or doing a matrix multiplication. The data representing the array or matrix are typically stored in contiguous memory locations, and for this example, once you access one specific location in memory, you will be accessing others close to it as well.
Computer architecture takes locality of reference into account. Operating systems have the notion of "pages" which are (roughly) 4KB chunks of data that can be paged in and out individually (moved between physical memory and disk). When you touch some memory that's not resident (not physically in RAM), the OS will bring the entire page of data off disk and into memory. The reason for this is locality: you're likely to touch other data around what you just touched.
Additionally, CPUs have the concept of caches. For example, a CPU might have an L1 (level 1) cache, which is really just a big block of on-CPU data that the CPU can access faster than RAM. If a value is in the L1 cache, the CPU will use that instead of going out to RAM. Following the principle of locality of reference, when a CPU accesses some value in main memory, it will bring that value and all values near it into the L1 cache. This set of values is known as a cache line. Cache lines vary in size, but the point is that when you access the first value of an array, the CPU might have to get it from RAM, but subsequent accesses (close in proximity) will be faster since the CPU brought the whole bundle of values into the L1 cache on the first access.
So, to answer your question: if you imagine a synchronous process computing the sum of a very large array, it will touch memory locations in order one after the other. In this case, your locality is good. In the asynchronous case, however, you might have n threads each taking a slice of the array (of size 1/n) and computing the sub-sum. Each thread is touching a potentially very different location in memory (since the array is large) and since each thread can be switched in and out of execution, the actual pattern of data access from the point of view of the OS or CPU is poor. The L1 cache on a CPU is finite, so if Thread 1 brings in a cache line (due to an access), this might evict the cache line of Thread 2. Then, when Thread 2 goes to access its array value, it has to go to RAM, which will bring in its cache line again and potentially evict the cache line of Thread 1, and so on. Depending on the system resources and usage as a whole, this pattern could happen on the OS/page level as well.
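The eviction ping-pong can be illustrated with a toy model: a tiny LRU cache of whole lines, fed either a sequential access pattern or one that round-robins between several slices of the array. This is a deliberately simplified simulation (the line size, cache size and switch-on-every-access schedule are all made up), not a model of any real CPU.

```python
from collections import OrderedDict

LINE_SIZE = 8  # elements per cache line (illustrative)


def count_misses(accesses, cache_lines):
    """Count misses for a sequence of element indices in a tiny LRU cache."""
    cache = OrderedDict()
    misses = 0
    for index in accesses:
        line = index // LINE_SIZE
        if line in cache:
            cache.move_to_end(line)        # mark as most recently used
        else:
            misses += 1
            cache[line] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return misses


def sequential(n):
    """One worker walking the array front to back."""
    return list(range(n))


def interleaved(n, workers):
    """Round-robin between `workers` slices, switching after every access."""
    slice_len = n // workers
    return [w * slice_len + i for i in range(slice_len) for w in range(workers)]


print(count_misses(sequential(1024), cache_lines=2))
print(count_misses(interleaved(1024, 4), cache_lines=2))
print(count_misses(interleaved(1024, 4), cache_lines=4))
```

With a cache smaller than the number of competing workers, every access misses. Once the cache can hold one line per worker, the interleaved pattern is no worse than the sequential one, which is why the effect depends so much on the actual hardware and workload.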
The poorer locality of reference results in poorer cache usage -- each time you do a thread switch, you can expect that most of what's in the cache relates to that previous thread, not the current one, so most reads will get data from main memory instead of the cache.
He's ultimately wrong though, at least for quite a few programs. The reason is pretty simple: even though you gain nothing on CPU-bound code, when you can combine some CPU-bound code with some I/O bound code, you can expect an overall speed improvement. You can, for example, initiate a read or write, then switch to doing computation while the disk is busy, then switch back to the I/O bound thread when the disk finishes its work.
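A small sketch of that overlap using asyncio, with asyncio.sleep standing in for the disk or network wait. The helper names are illustrative and the delays are arbitrary; no real I/O or computation happens here.

```python
import asyncio
import time


async def fake_io(delay):
    """Stand-in for an I/O wait; control returns to the event loop meanwhile."""
    await asyncio.sleep(delay)
    return delay


async def one_after_another(delays):
    # Each wait finishes before the next one starts.
    return [await fake_io(d) for d in delays]


async def overlapped(delays):
    # All waits are in flight at the same time.
    return await asyncio.gather(*(fake_io(d) for d in delays))


def timed(coro_fn, delays):
    """Run coro_fn(delays) to completion and return elapsed wall-clock time."""
    start = time.perf_counter()
    asyncio.run(coro_fn(delays))
    return time.perf_counter() - start


sequential_time = timed(one_after_another, [0.1, 0.1, 0.1])
overlapped_time = timed(overlapped, [0.1, 0.1, 0.1])
print(sequential_time, overlapped_time)
```

The overlapped run should take roughly as long as the longest single wait rather than the sum of all of them, which is the kind of gain being described.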

Moving data from memory to memory in micro controllers

Why can't we move data directly from one memory location into another memory location?
Pardon me if I am asking a dumb question, but I think this is a true situation, at least for the ones I've encountered (8085, 8086 and 80386).
I am not really looking for a solution for moving the data (for example, using MOVS and so on), but for the actual reason behind this anomaly.
What about MOVS? It moves an 8/16/32-bit value addressed by esi to the location addressed by edi.
The basic reason is that most instruction sets allow one register operand, and one memory operand, and sticking to this format makes designing the instruction decoder easier. It also makes the execution engine inside the CPU easier, because the instruction can issue typically a memory operation to just one memory location, and at most one register block read or write.
To do a memory-to-memory instruction directly requires two memory locations to be designated. This is awkward given a register/memory instruction format. Given the performance of the machines, there is little justification for modifying the instruction format just for this.
A hack used by more modern CPUs is to provide some type of block-move instruction, in which the source and destination locations are held in registers (for the x86 these are ESI and EDI respectively). Then an instruction can just designate two registers (or, in the case of the x86, the instruction simply knows implicitly which registers to use). That solves the instruction decoding problem.
The instruction execution problem is a little harder but people have lots of transistors. Organizing a read indirect from one register, and write indirect through another, and increment both is awkward in silicon but that just chews up some transistors.
Now you can have an instruction that moves from memory to memory, just as you asked.
One of the other posters noted that for the x86 there are instructions (MOVSB, MOVSW, MOVSD, ...) that do exactly this, one memory byte/word/... at a time.
Moving a block of memory would be ideal because the CPU can generate high-bandwidth reads and writes. The x86 does this with a REP (repeat) prefix on MOV- to move a larger block.
But if a single instruction can do this, you have the problem that it might take a long time to execute (how long to move 1 GB? --> millions of clock cycles!) and that ruins the interrupt response rate of the CPU.
The x86 solves this by allowing REP MOV- to be interrupted, with the PC being set back to the beginning of the instruction. By updating the registers during the move appropriately, you can interrupt and restart the REP MOV- instruction having both a fast block move and high interrupt response rates. More transistors down the tube.
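The restart trick is easier to see in a toy model. Below, a Python function plays the role of the repeated move: all of its progress lives in the "registers", so stopping after any iteration and calling it again simply resumes where it left off. This only illustrates the register-state idea. It is not how the hardware is built, and it ignores details such as the direction flag and operand size.

```python
def rep_movs(memory, regs, max_steps):
    """Run up to max_steps iterations of a simulated byte-wise REP MOVS.

    regs holds 'esi' (source index), 'edi' (destination index) and
    'ecx' (bytes left). Returns True once the move has finished.
    """
    steps = 0
    while regs["ecx"] > 0 and steps < max_steps:
        memory[regs["edi"]] = memory[regs["esi"]]
        regs["esi"] += 1
        regs["edi"] += 1
        regs["ecx"] -= 1  # progress is recorded in the registers themselves
        steps += 1
    return regs["ecx"] == 0


memory = bytearray(b"hello world") + bytearray(11)
regs = {"esi": 0, "edi": 11, "ecx": 11}

done_first = rep_movs(memory, regs, max_steps=4)  # "interrupted" after 4 bytes
snapshot = dict(regs)
done_second = rep_movs(memory, regs, max_steps=100)  # same "instruction" re-run
print(done_first, snapshot, done_second, bytes(memory[11:]))
```

Because nothing is kept outside the registers, re-executing the same instruction after the interrupt handler returns is all that is needed to continue the copy.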
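The RISC guys figured out that all this complexity for a block move instruction was mostly not worth it. You can code a dumb loop (even on the x86):
copy: MOV EAX,[ESI+ECX*4-4]   ; load dword number ECX-1 from the source
      MOV [EDI+ECX*4-4],EAX   ; store it at the same offset in the destination
      DEC ECX                 ; ECX = dwords left to copy; sets ZF when it hits zero
      JNE copy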
which does the same basic thing as REP MOV- . Modern CPUs (x86, others) pretty much execute this so fast (superscalar, etc.) that the bus is just as well utilized as with the custom move instruction, but now you don't need all those wasted transistors (or corresponding heat).
Most CPU varieties don't allow memory-to-memory moves. Normally the CPU can access only one memory location at a time, which means you need a temporary spot to store the value when moving it (a general purpose register, usually). If you think about it, moving directly from one memory location to another would require that the CPU be able to access two different spots in RAM simultaneously - that means two full memory controllers at least, and even then, the chances they'd "play nice" enough to access the same RAM would be pretty bad. The chip designers might have been able to pull some tricks to allow direct copies from one RAM chip to another, but that would be a pretty special-application kind of feature that would just add cost and complexity to solve a very uncommon problem.
You might be able to use some special DMA hardware to make it look to your program like memory is being moved without that temporary storage, at least from the perspective of your CPU.
You have one set of address lines, one set of data lines, and a few control lines between the CPU and RAM. You can't physically move directly from memory to memory without a second set of address lines and a whole bunch of complicated logic inside the RAM. Therefore, we have to store it temporarily in a register.
You could make an instruction that does the load and store together and looks like one instruction to the programmer, but there are other considerations like instruction size, non-duplication of effective address calculation logic, pipelining, etc. that make it desirable to keep it more simple.
Memory-memory machines turn out to be slower in general than load-store machines. This was figured out by the RISC researchers around 1980 or so. So the older architectures (VAX, System/360) tend to be memory-memory architectures; newer machines are load-store.
Another interesting variant is stack machines; they seem to always be around as a minority.
