Optimization of parallel programming - mpi

I want to use MPI to make my program parallel, and I want to send something to other computers. I want to know which one is better: sending a huge buffer one time or sending two smaller messages 3 times atrent times during the execution instead of all at once?

It's almost always going to be faster to send the one big message than the smaller one. Each time you do a Send/Receive pair, the two processes have to go through the entire process of sending a message to each other, including at least 6 roundtrip messages. If you are just sending one larger message, there is a minimum of 2 roundtrip messages. Each of those messages can be very expensive (compared to doing things locally like packing all of your data into one buffer).
I'd encourage you to try it out both ways though to be sure that this applies to your application. It could be different if you're doing something unexpected.

Depending on your problem, sending all data may be more efficient, because the nodes have to be synced, every time. That may cause a delay.

I always try to send as much data as I can in a single MPI call. In my experience, sending many small bits of data greatly increases the overhead and network traffic, and I have even run into problems where I overwhelmed the computers' ability to keep up with the number of requests, because I was sending a large member of a complicated class, one integer at a time, to many workers. Therefore, when possible, send the entire data at once, unless you have some reason to believe it is too large.
Further, I strive to use 100% of all the CPU's my program claims. When you are working on shared resources, if you use a CPU, you need to actually use it. Otherwise, someone else who wants to use that core, or node, is blocked out while your program sits and does nothing. For example, on a Cray which I have used, even if you call for only two 'cores', the manager will reserve a full bank of 24 cores, essentially wasting 22. Or, perhaps one worker has nothing to do, while another chugs away -- again, wasting time. Hopefully, there is a way to balance the load, so to speak, to avoid unintentional waste of resources.
Back to the topic at hand. Demonstrate timing and efficiency of vector sending to yourself -- write a program which breaks up the vector into varying sizes of packets, and do the sends/receives. Test it with varying numbers of workers, and on several different configurations of computers, if you can. Before writing production code, do proof of concept, and what optimization you can. Test and time it!

Related

Am I misunderstanding how often IORING POLL is used?

I am watching a video that says after a million operations its faster to use io_uring_poll. I haven't used io_uring much so I may be remembering things wrong but io_uring poll means use io_uring_enter with the IORING_SETUP_SQPOLL flag? I understand sometimes traffic drops but wouldn't most of the time the code be using atomics to check if there's more items in the completion queue? Wouldnt enter/poll be needed the rare second when traffic drops? Or is it called constantly? I never had to do that many operations so I don't know

Why wouldn't a small Firebase Functions app just use a single Function to handle logic?

...aside from the benefit in separate performance monitoring and logging.
For logging, I am confident I can get granularity through manually adding the name of the "routine" to each call. This is how it is now with several discrete Functions for different parts of the system:
There are multiple automatic logs: start and finish of the routine, for example. It would be more challenging to find out how expensive certain routines are, but it would not be impossible.
The reason I want the entire logic of the application handled by a single handle function is because of reducing cold starts: one function means only one container that can be persistently kept alive when there are very few users of the app.
If a month is ~2.6m seconds and we assume the system uses 1 GB RAM and 1 GHz CPU frequency at all times, that's:
2600000 * 0.0000025 + 2600000 * 0.000001042 = USD$9.21 a month
...for one minimum instance.
I should also state that all of my functions have the bare minimum amount of global scope code; it just sets up Firebase assets (RTDB and Firestore).
From a billing, performance (based on user wait time), and user/developer experience perspective, is there any reason why it would be smart to keep all my functions discrete?
I'd also accept an answer saying "one single function for all logic is reasonable" as long as there's a reason for it.
Thanks!
If you have very small app with ~5 end points and very low traffic. Sure you could do something like this. But why not do it:
billing and performance
The important thing to realize is that with every request a new instance of your function is created. Which means there could be 10s of them running at the same time.
If you would like to have just 1 instance handling all the traffic you should explore GCP Cloud run, where you have 1 container handling multiple requests and scaling only when it's not sufficient.
Imagine you have several end-points and every one of them have different performance requirements.
1 can need only 128MB or RAM
1 can need 1GB RAM
(FYI: You can control the CPU MHz of the function via the RAM settings too - which can speed up execution in some cases)
If you had only 1 function with 1GB of ram. Every request would allocate such function and in some cases most of the memory could go to waste.
But if you split it into multiple, some requests will require much less resources and can save you $ when we talk about bigger amount of executions / month. (tens of thousands+).
Let's imagine function, 3 second execution, 10k executions/month:
128MB would cost you $0.0693
1024MB would cost you $0.495
As you can see, with small app the difference could be nothing. But if you scale it matters. (*The cost can vary based on datacenter)
As for the logging, I don't think it matters. Usually in bigger systems there could be messages traveling trough several functions so you have to deal with that anyway.
As for the cold start. You just need good UI to facilitate that. At first I was worry about it in our apps but later on, you just get used to it that some action can take ~2s to execute (cold start). And you should have the UI "loading" regardless, because you don't know if the function will take ~100ms or 3s due to bad connection.

Is it a good practice to write a GB of byes into a TCP socket in one go?

I am maintaining some matured production code which sends data over TCP sockets. It always breaks large chunk of data into many packets, each 1000 bytes. I just wonder why it was done this way. Why can't I just write a GB worth of a byte array into the socket in one go? What are the cons to do that?
There are many reasons not to throw a huge chunk in at once.
First of all: Even on very fast networks sending a GB of data will take a non-trivial amount of time. On a 10Gbps network it would take a little under 1 second, which is a long time in computer speak. And that assumes that this one operation has all the bandwidth of the network available to it and doesn't have to share with anything else.
This means that if you successfully do a 1GB write call to a TCP socket, it will be some time until the later bits of data are actually sent.
And in that mean time you'll have to hold all that data in memory. That means that you'll need to allocate and hold on to 1GB of data for that whole transaction.
If instead you fill a small-ish buffer and read from your source (or generate, depending on where the data comes from) before each write, then you'll need only a little memory (the size of the buffer).
All of that might not sound like a big deal with todays machines, but consider that many servers will serve hundreds of clients/requests at once and if each one requires a 1GB buffer, then that can grow out of hand quickly.
Is 1000 a good size for that buffer? I'm no networking expert, but I suspect that's a little low. Maybe something on the order of 64k would be appropriate, but others can give better details here. Finding a good buffer size can sometimes be a bit tricky.
Depending on the underlying implementation, attempting to send 1GB in one bulk may result in the 1GB being copied into some cache, and then held there for some time. So it can be a problem if not enough memory is available (and even if there is enough available memory - it may not be the most efficient way to make use of it).
Though "manually" splitting it into segments of 1000 bytes each does sound a bit like an overkill to me.

Asp.net guaranteed response time

Does anybody have any hints as to how to approach writing an ASP.net app that needs to have a guaranteed response time?
When under high load that would normally cause us to exceed our desired response time, we want to throw out an appropriate number of requests, so that the rest of the requests can return before the max response time. Throwing out requests based on exceeding a fixed req/s is not viable, as there are other external factors that will control response time that cause the max rps we can safely support to fiarly drastically drift and fluctuate over time.
Its ok if a few requests take a little too long, but we'd like the great majority of them to meet the required response time window. We want to "throw out" the minimal or near minimal number of requests so that we can process the rest of the requests in the allotted response time.
It should account for ASP.Net queuing time, ideally the network request time but that is less important.
We'd also love to do adaptive work, like make a db call if we have plenty of time, but do some computations if we're shorter on time.
Thanks!
SLAs with a guaranteed response time require a bit of work.
First off you need to spend a lot of time profiling your application. You want to understand exactly how it behaves under various load scenarios: light, medium, heavy, crushing.. When doing this profiling step it is going to be critical that it's done on the exact same hardware / software configuration that production uses. Results from one set of hardware have no bearing on results from an even slightly different set of hardware. This isn't just about the servers either; I'm talking routers, switches, cable lengths, hard drives (make/model), everything. Even BIOS revisions on the machines, RAID controllers and any other device in the loop.
While profiling make sure the types of work loads represent an actual slice of what you are going to see. Obviously there are certain load mixes which will execute faster than others.
I'm not entirely sure what you mean by "throw out an appropriate number of requests". That sounds like you want to drop those requests... which sounds wrong on a number of levels. Doing this usually kills an SLA as being an "outage".
Next, you are going to have to actively monitor your servers for load. If load levels get within a certain percentage of your max then you need to add more hardware to increase capacity.
Another thing, monitoring result times internally is only part of it. You'll need to monitor them from various external locations as well depending on where your clients are.
And that's just about your application. There are other forces at work such as your connection to the Internet. You will need multiple providers with active failover in case one goes down... Or, if possible, go with a solid cloud provider.
Yes, in the last mvcConf one of the speakers compares the performance of various view engines for ASP.NET MVC. I think it was Steven Smith's presentation that did the comparison, but I'm not 100% sure.
You have to keep in mind, however, that ASP.NET will really only play a very minor role in the performance of your app; DB is likely to be your biggest bottle neck.
Hope the video helps.

Caveats of select/poll vs. epoll reactors in Twisted

Everything I've read and experienced ( Tornado based apps ) leads me to believe that ePoll is a natural replacement for Select and Poll based networking, especially with Twisted. Which makes me paranoid, its pretty rare for a better technique or methodology not to come with a price.
Reading a couple dozen comparisons between epoll and alternatives shows that epoll is clearly the champion for speed and scalability, specifically that it scales in a linear fashion which is fantastic. That said, what about processor and memory utilization, is epoll still the champ?
For very small numbers of sockets (varies depending on your hardware, of course, but we're talking about something on the order of 10 or fewer), select can beat epoll in memory usage and runtime speed. Of course, for such small numbers of sockets, both mechanisms are so fast that you don't really care about this difference in the vast majority of cases.
One clarification, though. Both select and epoll scale linearly. A big difference, though, is that the userspace-facing APIs have complexities that are based on different things. The cost of a select call goes roughly with the value of the highest numbered file descriptor you pass it. If you select on a single fd, 100, then that's roughly twice as expensive as selecting on a single fd, 50. Adding more fds below the highest isn't quite free, so it's a little more complicated than this in practice, but this is a good first approximation for most implementations.
The cost of epoll is closer to the number of file descriptors that actually have events on them. If you're monitoring 200 file descriptors, but only 100 of them have events on them, then you're (very roughly) only paying for those 100 active file descriptors. This is where epoll tends to offer one of its major advantages over select. If you have a thousand clients that are mostly idle, then when you use select you're still paying for all one thousand of them. However, with epoll, it's like you've only got a few - you're only paying for the ones that are active at any given time.
All this means that epoll will lead to less CPU usage for most workloads. As far as memory usage goes, it's a bit of a toss up. select does manage to represent all the necessary information in a highly compact way (one bit per file descriptor). And the FD_SETSIZE (typically 1024) limitation on how many file descriptors you can use with select means that you'll never spend more than 128 bytes for each of the three fd sets you can use with select (read, write, exception). Compared to those 384 bytes max, epoll is sort of a pig. Each file descriptor is represented by a multi-byte structure. However, in absolute terms, it's still not going to use much memory. You can represent a huge number of file descriptors in a few dozen kilobytes (roughly 20k per 1000 file descriptors, I think). And you can also throw in the fact that you have to spend all 384 of those bytes with select if you only want to monitor one file descriptor but its value happens to be 1024, wheras with epoll you'd only spend 20 bytes. Still, all these numbers are pretty small, so it doesn't make much difference.
And there's also that other benefit of epoll, which perhaps you're already aware of, that it is not limited to FD_SETSIZE file descriptors. You can use it to monitor as many file descriptors as you have. And if you only have one file descriptor, but its value is greater than FD_SETSIZE, epoll works with that too, but select does not.
Randomly, I've also recently discovered one slight drawback to epoll as compared to select or poll. While none of these three APIs supports normal files (i.e., files on a file system), select and poll present this lack of support as reporting such descriptors as always readable and always writable. This makes them unsuitable for any meaningful kind of non-blocking filesystem I/O, a program which uses select or poll and happens to encounter a file descriptor from the filesystem will at least continue to operate (or if it fails, it won't be because of select or poll), albeit it perhaps not with the best performance.
On the other hand, epoll will fail fast with an error (EPERM, apparently) when asked to monitor such a file descriptor. Strictly speaking, this is hardly incorrect. It's merely signalling its lack of support in an explicit way. Normally I would applaud explicit failure conditions, but this one is undocumented (as far as I can tell) and results in a completely broken application, rather than one which merely operates with potentially degraded performance.
In practice, the only place I've seen this come up is when interacting with stdio. A user might redirect stdin or stdout from/to a normal file. Whereas previously stdin and stdout would have been a pipe -- supported by epoll just fine -- it then becomes a normal file and epoll fails loudly, breaking the application.
In tests at my company, one issue with epoll() came up, thus a single cost compared to select.
When attempting to read from the network with a timeout, creating an epoll_fd ( instead of a FD_SET ), and adding the fd to the epoll_fd, is much more expensive than creating a FD_SET (which is a simple malloc).
As per the previous answer, as the number of FDs in the process becomes large, the cost of select() becomes higher, but in our testing, even with fd values in the 10,000's, select was still a winner. These are cases where there is only one fd that a thread is waiting on, and simply trying to overcome the fact that network read, and network write, doesn't timeout when using a blocking thread model. Of course, blocking thread models are low performance compared to non-blocking reactor systems, but there are occasions where, to integrate with a particular legacy code base, it is required.
This kind of use case is rare in high performance applications, because a reactor model doesn't need to create a new epoll_fd every time. For the model where an epoll_fd is long-lived --- which is clearly preferred for any high performance server design --- epoll is the clear winner in every way.

Resources