In a Rebus service bus, there is a single message transport queue per endpoint. It is possible for an endpoint to handle more than one message type, and it is possible to have only a single endpoint in a system.
Other than the throughput of messages, what reasons are there to use more than a single endpoint in a Rebus service bus system?

Excellent question! :) There can be many reasons why you might want to have several Rebus endpoints active at the same time.
An obvious reason is that you might want to host the endpoints in separate processes so you can update them independently of each other. But since this reason is pretty obvious, I assume you are thinking about reasons one might want to host multiple Rebus endpoints in the same process.
Let me just mention a few(*):
Concurrency requirements
One endpoint might be hosting data that experiences contention and therefore does not benefit from being able to process messages concurrently - this endpoint will probably have only a few threads and low parallelism, possibly 1/1 (worker threads/parallelism).
Another endpoint might be doing stream-based data processing (e.g. loading blobs from one place into another, downloading data from web services, etc.), which can be done with very high throughput and low resource requirements with one single thread and a high level of parallelism - e.g. 1/20.
Yet another endpoint might be doing a lot of serialization/deserialization, which is usually CPU-bound, and therefore might benefit from running on a many-core box with many worker threads and matching parallelism - e.g. 10/10.
As you can see, the type of tasks performed by an endpoint can call for a configuration that matches the nature of the tasks.
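To make the "1/20" case concrete, here is a minimal sketch in Python rather than Rebus itself. It assumes the work is I/O-bound and uses a single event-loop thread with a semaphore as the parallelism cap; `run_endpoint` and `fake_download` are hypothetical names and not part of any Rebus API.

```python
import asyncio

async def run_endpoint(messages, handler, max_parallelism):
    """Handle messages on one event-loop thread, at most max_parallelism at a time."""
    semaphore = asyncio.Semaphore(max_parallelism)
    in_flight = 0
    peak = 0
    results = []

    async def process(message):
        nonlocal in_flight, peak
        async with semaphore:  # caps the number of concurrently handled messages
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                results.append(await handler(message))
            finally:
                in_flight -= 1

    await asyncio.gather(*(process(m) for m in messages))
    return results, peak

async def fake_download(message):
    # Stand-in for I/O-bound work such as copying a blob or calling a web service.
    await asyncio.sleep(0.01)
    return f"done:{message}"

# One thread, parallelism 20: the "1/20" shape described above.
results, peak = asyncio.run(run_endpoint(range(50), fake_download, max_parallelism=20))
print(len(results), peak)
```

Setting `max_parallelism=1` gives the "1/1" shape for the contended endpoint. The CPU-bound "10/10" case would need real worker threads or processes instead of one event loop.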
Prioritization
One endpoint might be designated for processing low-priority background stuff, e.g. moving data to cold storage, optimizing storage of historic data, etc.
Another endpoint might be processing messages where low latency is the most important quality attribute.
If these two were using the same queue, the low-priority background stuff could sometimes clog up the queue, hindering low-latency processing of the other messages.
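The clogging effect can be shown with a small model. This sketch assumes a single FIFO consumer per queue and simply counts how many messages are handled before an urgent one; the message names are made up for illustration.

```python
from collections import deque

def handled_before(queue, target):
    """Count the messages a single FIFO consumer handles before reaching target."""
    for position, message in enumerate(queue):
        if message == target:
            return position
    raise ValueError(f"{target!r} is not in the queue")

background = [f"archive-{i}" for i in range(1000)]

# One shared queue: the urgent message waits behind all queued background work.
shared = deque(background + ["urgent"])

# Two endpoints with one queue each: the urgent message is first in its own queue.
low_latency = deque(["urgent"])

print(handled_before(shared, "urgent"))
print(handled_before(low_latency, "urgent"))
```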
Logical separation
I have many times started out by hosting several Rebus endpoints in the same process because it was easy to deal with during development, while keeping the endpoints separate because they were implementing different business functions.
This way it is easy to physically break them apart some time later on, allowing for a higher degree of separation and independence.
(*) Udi Dahan works with the concepts "business components" and "autonomous components" where the first one is an implementation of a business capability and the second one is what business components are decomposed into, mostly for technical reasons.
I guess you could say that the first two reasons I mentioned are separate endpoints for "autonomous component" reasons, whereas the third is separation because things belong to different business components.
Udi keeps a pretty strict view of these concepts that is completely orthogonal to how the system is physically composed, but I almost always end up with pretty high convergence between logical separation and physical separation.


How to evenly balance processing many simultaneous tasks?

Our PROCESSING SERVICE is serving UI, API, and internal clients and listening for commands from Kafka.
A few API clients might create a lot of generation tasks (one task is N messages) in a short time. With Kafka, we can't control command distribution, because each command goes to a partition that is consumed by one processing instance (aka worker). Thus, UI requests could be waiting too long while API requests are being processed.
In an ideal implementation, we should handle all tasks evenly, regardless of their size. The capacity of the processing service is distributed among all active tasks. And even if the cluster is heavily loaded, we always know that a newly arrived task will be able to start processing almost immediately, at least before the processing of all other tasks ends.
Instead, we want an architecture that looks more like the following diagram, where we have separate queues per combination of customer and endpoint. This architecture gives us much better isolation, as well as the ability to dynamically adjust throughput on a per-customer basis.
On the side of the producer
the task comes from the client
immediately create a queue for this task
send all messages to this queue
On the side of the consumer
in one process, you constantly update the list of queues
in other processes, you follow this list and consume for example 1 message from each queue
scale consumers
Is there any common solution to such a problem, using RabbitMQ or any other tooling? Historically, we have used Kafka on the project, so if there is an approach that uses it, that would be amazing, but we can use any technology for the solution.
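For reference, the consumer side of the scheme listed above (take one message from each task queue in turn) can be sketched as follows. This is an in-memory model only; the task names are hypothetical and no broker is involved.

```python
from collections import deque

def round_robin(task_queues):
    """Yield (task_id, message), taking one message per non-empty queue per pass."""
    queues = {task_id: deque(msgs) for task_id, msgs in task_queues.items() if msgs}
    while queues:
        for task_id in list(queues):  # copy the keys so finished queues can be dropped
            queue = queues[task_id]
            yield task_id, queue.popleft()
            if not queue:
                del queues[task_id]

tasks = {
    "api-big": ["a1", "a2", "a3", "a4"],  # large task created by an API client
    "ui-small": ["u1"],                    # small task created from the UI
}
order = list(round_robin(tasks))
print(order)
```

The small UI task is served on the first pass instead of waiting for the whole API task to drain.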
Why not use Spark to execute the messages within the task? What I'm thinking is that each worker creates a Spark context that then parallelizes the messages. The function that is mapped can be based on which Kafka topic the user is consuming. I suspect, however, that your queues might have tasks that contain a mixture of messages, UI, API calls, etc. This will result in a more complex mapping function. If you're not using a standalone cluster and are using YARN or something similar, you can change the queueing method that the Spark master is using.
As I understand the problem, you want to isolate customers' requests using dynamically allocated queues, which would allow each customer's tasks to be executed independently. The problem looks similar to the head-of-line blocking issue in networking.
Dynamically allocating queues is difficult. It can also lead to an explosion in the number of queues, which can be a burden on the infrastructure. Also, some queues could be empty or carry very little load. RabbitMQ won't help here; it is a queue with a different protocol than Kafka.
One alternative is to use a custom partitioner in Kafka that looks at the partition load and balances the tasks based on it. This works if the tasks are independent in nature and no state store is maintained in the worker.
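The selection logic of such a load-aware partitioner could look roughly like this. It is a plain Python sketch, not the Kafka partitioner interface, and it assumes the pending load per partition is known from somewhere (here it is just tracked in a dict); the task names and sizes are made up.

```python
def pick_partition(pending_by_partition):
    """Return the partition with the fewest pending messages (ties: lowest id)."""
    return min(pending_by_partition, key=lambda p: (pending_by_partition[p], p))

def assign(task_sizes, partition_count):
    """Place each task on the currently least-loaded partition."""
    pending = {p: 0 for p in range(partition_count)}
    placement = {}
    for task_id, size in task_sizes.items():
        partition = pick_partition(pending)
        placement[task_id] = partition
        pending[partition] += size  # the whole task stays on one partition
    return placement, pending

placement, pending = assign({"api-1": 100, "ui-1": 1, "ui-2": 1, "api-2": 50}, 2)
print(placement, pending)
```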
The other alternative would be to load balance at the customer level. In this case you select a dedicated set of predefined queues for a set of customers. Customers with certain IDs will be served by a set of queues. The downside is that some queues can have less load than others. This solution is similar to virtual output queuing in networking.
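A minimal sketch of the customer-level mapping, assuming a fixed list of queue names (hypothetical here) and a stable hash of the customer ID, so that a given customer always lands on the same queue:

```python
import hashlib

QUEUES = ["tasks-0", "tasks-1", "tasks-2", "tasks-3"]  # hypothetical predefined queues

def queue_for(customer_id, queues=QUEUES):
    """Map a customer ID to one of the predefined queues, stably across processes."""
    digest = hashlib.sha256(str(customer_id).encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(queues)
    return queues[index]

print(queue_for("customer-42"))
```

A cryptographic hash is used instead of Python's built-in `hash()` because the latter is randomized per process for strings, which would break the stable mapping.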
My understanding is that partitioning the messages does not ensure an even load balance. I think you should avoid overengineering with custom logic on top of the Kafka partitioner, and instead think about a good partitioning key that allows you to use Kafka efficiently.

Rebus pub-sub system. Process each message by one and only one subscriber

With rebus, I'd like to be able to publish from one application, and subscribe from multiple applications, where each message only gets processed by one subscriber in a round robin, as described here:
Is this possible with rebus?
Yes, but the words "publish" and "subscribe" are confusing here. This is not "Publish/Subscribe" as described in the literature, because pub/sub pretty much means that you do not care how many listeners there are.
What you want, is to send a message, and then you want the sent messages distributed among multiple consumers.
With Rebus, that is definitely possible. However, the way you do it may depend slightly on which transport you are using(*).
Most transports have pretty good support for the competing consumers pattern, where you simply start multiple processes, probably running on multiple machines, that consume messages from the same queue.
This way, each message gets processed exactly once, and each consumer will receive messages at a rate that suits that particular consumer.
(*) MSMQ is not good at distributing load when multiple processes are taking messages off of the same queue, especially not when the processes are running on other machines than where the queue is.
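The competing consumers pattern itself can be illustrated without any particular transport. In this sketch an in-process `queue.Queue` stands in for the shared transport queue and threads stand in for the consuming processes; none of this is Rebus API.

```python
import queue
import threading

def run_competing_consumers(messages, consumer_count):
    """Let several consumers drain one shared queue; return what each one handled."""
    work = queue.Queue()
    for message in messages:
        work.put(message)

    handled = {f"consumer-{i}": [] for i in range(consumer_count)}

    def consume(name):
        while True:
            try:
                message = work.get_nowait()  # each message is handed to one consumer only
            except queue.Empty:
                return
            handled[name].append(message)

    threads = [threading.Thread(target=consume, args=(name,)) for name in handled]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return handled

handled = run_competing_consumers(range(100), consumer_count=4)
print({name: len(msgs) for name, msgs in handled.items()})
```

How the messages are split between consumers varies from run to run, which is the point: each consumer takes work at its own pace.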

Why are Distributed Systems considered complex?

I'm just getting into the concept of a Distributed System and its advantages and disadvantages. The book I'm reading discusses the complexity of Distributed Systems, says that they are inherently complex, and lists the following as potential reasons for complexity:
Heterogeneity
Asynchronous communication
Partial failures
What I am struggling to understand is what these concepts actually encompass (i.e. what is a partial failure and what are the causes of a partial failure?), and how they are dealt with in modern systems. Does middleware successfully solve all three of these complexity issues within a system?
This question can be answered in many words, but I'll try to boil it down to essentials:
Heterogeneity is one of the main problems integration tries to solve. It is an inherent characteristic of most distributed systems and it refers to the fact that more often than not, when you have to integrate multiple systems, they will:
Be on different platforms, in different networks;
Differ in their capabilities in terms of integration;
Have disparities in data, even data referring to the same business domain;
Use and support different (sometimes even forgotten or unsupported) technologies and standards;
Have different owners (are controlled by different departments, companies).
All of the above add more and more complexity.
Asynchronous communication solves some problems of stateless communication but introduces a whole other set of complexities that can easily lead to problems when the implementation is not proper. This is mainly due to the fact that you only have a guarantee that the message will be successfully received on the other end, but no guarantee whatsoever about when the operation will be processed, if ever. So it is much harder to carry out orchestration of interdependent asynchronous tasks, as opposed to synchronous tasks.
Partial failures - When you have processes that involve multiple interdependent write operations you need to ensure ACID transactions. Doing so when multiple systems are involved is even harder, because you cannot achieve a common transactional context as easily in a heterogeneous distributed environment as you would within the boundaries of a single system. Often you will need to implement opposite (compensating) operations in services, or worse, implement two-phase commit, just to be able to compensate all prior writes in the process in case something goes wrong with one of the tasks.
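The "opposite operations" idea can be sketched like this: each step is paired with an operation that undoes it, and when a later step fails, the completed steps are undone in reverse order. The step names are hypothetical, and a real system would also have to cope with a compensation that fails itself.

```python
def run_with_compensation(steps):
    """Run (action, compensation) pairs; on failure, undo completed steps in reverse."""
    completed = []
    try:
        for action, compensate in steps:
            action()
            completed.append(compensate)
    except Exception:
        for compensate in reversed(completed):
            compensate()
        return False
    return True

log = []

def fail_shipping():
    # Simulates one of the participating systems being down.
    raise RuntimeError("shipping service unavailable")

ok = run_with_compensation([
    (lambda: log.append("reserve stock"), lambda: log.append("release stock")),
    (lambda: log.append("charge card"), lambda: log.append("refund card")),
    (fail_shipping, lambda: log.append("cancel shipment")),
])
print(ok, log)
```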
Hope this clears things a bit!
The reason distributed systems are so complex is simple: time!
Perfect synchronization of state becomes impossible in distributed systems for the simple fact that some amount of time must pass between the point that a message leaves one server and the point it arrives at its intended destination. In addition to this, networks are a far more unreliable communication medium, meaning that a message may never make it at all.
The lack of perfect time synchronization means that it's impossible to make absolute assumptions about the order of events. For instance, in a highly available distributed database, if two requests writing to the same resource arrive on two different servers nearly simultaneously, there's no way to determine the absolute order of those events. So, distributed systems must use approximations of logical time and conflict resolution to resolve these types of event order issues.
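One well-known approximation of logical time is the Lamport clock. The answer above does not name a specific technique, so take this as one illustrative option: every node keeps a counter, increments it on local events, and on receiving a message jumps ahead of the timestamp carried by that message.

```python
class LamportClock:
    """Logical clock: orders events by causality, not by wall-clock time."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Record a local event."""
        self.time += 1
        return self.time

    def send(self):
        """Timestamp an outgoing message."""
        return self.tick()

    def receive(self, message_time):
        """Merge the sender's timestamp so the receive is ordered after the send."""
        self.time = max(self.time, message_time) + 1
        return self.time

server_a = LamportClock()
server_b = LamportClock()

server_a.tick()                       # local write on A
sent_at = server_a.send()             # A replicates the write to B
received_at = server_b.receive(sent_at)
print(sent_at, received_at)
```

Two events on different nodes that never exchange messages can still end up with the same counter value, which is why systems built on this usually add a tie-breaker (such as a node ID) or a conflict resolution rule.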
Partial failure - When transactions involve many clients (two or more), the scheduling technique in use may have to deal with conflicting operations (for example, a write and another write). Issuing locks introduces complexities of its own, for example deadlock. When the lock manager tries to detect, avoid, or prevent a deadlock, the system may partially fail, leading to a rollback of the whole process.

When should a queue be used?

Suppose we were to implement a network application, such as a chat with a central server and several clients: we assume that all communication must go through the central server, then it should pick up messages from some clients and forward them to target clients, and so on.
Regardless of the technology used (sockets, web services, etc..), it is possible to think that there are some producer threads (that generate messages) and some consumer threads (that read messages).
For example, you could use a single queue for incoming and outgoing messages, but using a single queue, you couldn't receive and send messages simultaneously, because only one thread at a time can access the queue.
Perhaps it would be more appropriate to use two queues: for example, this article explains a way in which you can manage a double queue so that producers and consumers can work almost simultaneously. This scenario may be fine if there is only one producer and one consumer, but if there are many clients:
How can the central server receive data simultaneously from multiple input streams?
How can the central server send data simultaneously to multiple output streams?
To resolve this problem, my idea is to use a double queue for each client: on the central server, each client connection may be associated with two queues, one for incoming messages from that client and one for outgoing messages addressed to that client. In this way the central server may send and receive data simultaneously on almost all the connections with the clients...
There are probably other ways to manage the queues... What are the parameters that determine how many queues are needed and how to organize them? Are there cases that do not need any queue?
To me, this idea of using a queue per client or multiple queues per client seems to miss the point. First of all, it is absolutely possible to build a queue which can be accessed simultaneously by 2 threads (one can be enqueueing an item while a different one is dequeueing another item). If you want to know how, post a specific question about that.
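As an illustration of the first point, one well-known design is the two-lock queue (described by Michael and Scott): enqueuers take only a tail lock and dequeuers take only a head lock, so one thread can enqueue while another dequeues. The answer does not say which design it has in mind, so this is just one possible sketch; in CPython the interpreter lock limits real parallelism anyway, but the structure is the same in other languages.

```python
import threading
import time

class _Node:
    __slots__ = ("value", "next")

    def __init__(self, value=None):
        self.value = value
        self.next = None

class TwoLockQueue:
    """FIFO queue with separate locks for the head and the tail."""

    def __init__(self):
        dummy = _Node()  # the head always points at a dummy node
        self._head = dummy
        self._tail = dummy
        self._head_lock = threading.Lock()
        self._tail_lock = threading.Lock()

    def enqueue(self, value):
        node = _Node(value)
        with self._tail_lock:  # producers only contend with other producers
            self._tail.next = node
            self._tail = node

    def dequeue(self):
        """Return (True, value), or (False, None) if the queue is empty."""
        with self._head_lock:  # consumers only contend with other consumers
            first = self._head.next
            if first is None:
                return False, None
            value = first.value
            first.value = None
            self._head = first  # the dequeued node becomes the new dummy
            return True, value

def transfer(count):
    """Move count items through the queue using one producer and one consumer thread."""
    q = TwoLockQueue()
    received = []

    def producer():
        for i in range(count):
            q.enqueue(i)

    def consumer():
        while len(received) < count:
            ok, value = q.dequeue()
            if ok:
                received.append(value)
            else:
                time.sleep(0)  # queue momentarily empty: yield to the producer

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return received

print(len(transfer(1000)))
```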
Second, even if we assume that only 1 thread at a time can access a single queue, and even if we assume that the server will be receiving or sending data to/from all the clients simultaneously, it still doesn't follow that you need a different queue for each client. To avoid limiting system performance, you just need to allow enough concurrency to utilize all the server's CPUs. Even with a single, system-wide queue, if dequeueing/enqueueing messages is fast enough compared to the other work the server is doing, it might not be a bottleneck. (And with an efficient implementation, simply inserting an item or removing an item from a queue should be very fast. It's a very simple operation.) For that message queue to become the bottleneck limiting performance, either you would need a LOT of CPUs, or everything else the server was doing would have to be very fast. In that case, you could work out some scheme with 2 or 4 system-wide queues, to allow 2x or 4x more concurrency.
The whole idea of using work queues in a multi-threaded system is that they 1) allow multiple consumers to all grab work from a single location, so producers can "dump" whatever work they need done at that single location without worrying about which consumer will do it, and 2) function as a load-balancing mechanism for the consumers. (Additionally, a work queue can act as a "buffer" if producers temporarily generate work too fast for the consumers.) If you have a dedicated pair of producer-consumer threads for each client, it calls into question why you need to use queues at all. Why not just do a synchronous "pass off" from dedicated producer to corresponding dedicated consumer? Or, why not use a single thread per client which acts as both producer and consumer? Using queues in the way which you are proposing doesn't seem to really gain anything.

Is Async Messaging (In particular pub/sub style messaging) viable as a domain service architecture or only in an SOA-focused environment?

I have been researching asynchronous messaging, and I like the way it elegantly deals with some problems within certain domains and how it makes domain concepts more explicit. But is it a viable pattern for general domain-driven development (at least in the service/application/controller layer), or is the design overhead such that it should be restricted to SOA-based scenarios, like remote services and distributed processing?
Great question :). The main problem with asynchronous messaging is that when folks use procedural or object-oriented languages, working in an asynchronous or event based manner is often quite tricky and complex and hard for the programmer to read & understand. Business logic is often way simpler if it's built in a kinda synchronous manner - invoking methods and getting results immediately etc :).
My rule of thumb is generally to try to use simpler synchronous programming models at the micro level for business logic; then use asynchrony and SEDA (staged event-driven architecture) at the macro level.
For example submitting a purchase order might just write a message to a message queue; but the processing of the purchase order might require 10 different steps all being asynchronous and parallel in a high performance distributed system with many concurrent processes & threads processing individual steps in parallel. So the macro level wiring is based on a SEDA kind of approach - but at the micro level the code for the individual 10 steps could be written mostly in a synchronous programming style.
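A rough sketch of that split, assuming an in-process pipeline with one thread per stage: the stages are wired together with queues (the macro level), while each step is an ordinary synchronous function (the micro level). The two steps and their field names are made up for illustration.

```python
import queue
import threading

STOP = object()  # sentinel that shuts the pipeline down stage by stage

def start_stage(handler, inbox, outbox):
    """Run a synchronous handler in its own thread, fed by inbox, feeding outbox."""
    def loop():
        while True:
            item = inbox.get()
            if item is STOP:
                outbox.put(STOP)
                return
            outbox.put(handler(item))
    thread = threading.Thread(target=loop)
    thread.start()
    return thread

# Micro level: plain synchronous business logic.
def validate(order):
    return {**order, "valid": order["quantity"] > 0}

def price(order):
    return {**order, "total": order["quantity"] * order["unit_price"]}

def run_pipeline(orders, handlers):
    # Macro level: stages connected by queues.
    queues = [queue.Queue() for _ in range(len(handlers) + 1)]
    threads = [start_stage(h, queues[i], queues[i + 1]) for i, h in enumerate(handlers)]
    for order in orders:
        queues[0].put(order)
    queues[0].put(STOP)
    results = []
    while True:
        item = queues[-1].get()
        if item is STOP:
            break
        results.append(item)
    for thread in threads:
        thread.join()
    return results

processed = run_pipeline([{"quantity": 2, "unit_price": 5}], [validate, price])
print(processed)
```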
Like so many architecture and design questions, the answer is "it depends".
In my experience, the strength of asynchronous messaging has been in the loose coupling it brings to a design. The coupling can be in:
Time - Requests can be handled asynchronously, helping overall scalability.
Space - As you point out, allowing for distributed processing in a more robust way than many synchronous designs.
Technology - Messages and queues are one way to bridge technology differences.
Remember that messages and queues are an abstraction that can have a variety of implementations. You don't necessarily need to use a JMS-compliant, transactional, high-performance messaging framework. Implemented correctly, a table in a relational database can act as a queue with the rows as messages. I've seen both approaches used to great effect.
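As a minimal sketch of the table-as-queue idea, here is a version using SQLite from the Python standard library. The table and column names are made up, and it assumes a single consumer; with several consumers you would need the database's locking features so that two of them cannot claim the same row.

```python
import sqlite3

def open_queue(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, body TEXT NOT NULL)"
    )
    conn.commit()
    return conn

def enqueue(conn, body):
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO messages (body) VALUES (?)", (body,))

def dequeue(conn):
    """Remove and return the oldest message body, or None if the queue is empty."""
    with conn:
        row = conn.execute(
            "SELECT id, body FROM messages ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute("DELETE FROM messages WHERE id = ?", (row[0],))
        return row[1]

conn = open_queue()
enqueue(conn, "order-created")
enqueue(conn, "order-paid")
first = dequeue(conn)
second = dequeue(conn)
third = dequeue(conn)
print(first, second, third)
```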
I agree with @BradS too BTW
BTW here's a way of hiding the middleware from your business logic while still getting the benefits of loose coupling & SEDA - while being able to easily switch between a variety of different middleware technology - from in memory SEDA to JMS to AMQP to JavaSpaces to database, files or FTP etc
