Does a leader have any other responsibility other than replication in Raft, like load balancing? - raft

We are trying to implement Raft in a chat application. As far as I read, Raft is for replication and thats it. So when a client wants to connect to the chat server for chatting with another client, does Raft require that all the clients connect to the leader only? If yes, if it connects to a follower, the follower can redirect it to the leader. But then what? Does the leader again assign it to a follower node i.e. essentially does the leader behave as a load balancer as well or does it do all the work itself with replication of users data on all other servers?

If the followers tell the clients the address of the current leader and the clients must re-connect to the leader, then the leader gets overloaded. The problem is that the leader is already doing the computation for the log replication and all the client processing also happens on the leader while the other nodes are practically idle. Furthermore, this severely limits the number of client connections you have.
If on the other hand, the followers accept the client connections and forward the requests to the leader (via a long-lived tcp connection), it forms a sort of connection fan-out that eases the load on the leader. The more processing you can do on the followers the better.
Extending upon the above, I have found success in making a few nodes invisible to the clients and these nodes have preference in being the leader. That way, they are never encumbered by client processing.

Related

Queue system recommendation approach

We have a bus reservation system running in GKE in which we are handling the creation of such reservations with different threads. Due to that, CRUD java methods can sometimes run simultaneously referring to the same bus, resulting in the save in our DB of the LAST simultaneous update only (so the other simultaneous updates are lost).
Even if the probabilities are low (the simultaneous updates need to be really close, 1-2 seconds), we need to avoid this. My question is about how to address the solution:
Lock the bus object and return error to the other simultaneous requests
In-memory map or Redis caché to track the bus requests
Use GCP Pub/Sub, Kafka or RabbitMQ as a queue system.
Try to focus the efforts on reducing the simultaneous time window (reduce from 1-2 seconds up to milliseconds)
Others?
Also, we are worried if in the future the GKE requests handling scalability may be an issue. If we manage a relatively higher number of buses, should we need to implement a queue system between the client and the server? Or GKE load balancer & ambassador will already manages it for us? In case we need a queue system in the future, could it be used also for the collision problem we are facing now?
Last, the reservation requests from the client often takes a while. Therefore, we are changing the requests to be handled asynchronously with a long polling approach from the client to know the task status. Could we link this solution to the current problem? For example, using the Redis caché or the queue system to know the task status? Or should we try to keep the requests synchronous and focus on reducing the processing time (it may be quite difficult).

Corda - FlowMonitor - Flow with id X has been waiting for Y seconds to receive messages from parties

In our Corda network we work with Accounts. We have a network with well-defined nodes.
To show the problem, imagine 3 nodes, PartyA, PartyB and Notary.
We created the accounts (AccountA for example) on PartyA. We have flows that can be executed at PartyB that has AccountA as a participant in the transaction.
Now imagine that PartyA is down for any reason, or communication between nodes is not available.
When I request a new AccountA key for PartyA, the flow gets stuck trying to communicate and does not return any exception. This happens in any situation that tries to communicate with another node, when running a CollectSignaturesFlow or ShareStateAndSyncAccounts to share account states for example.
The question is, is there any configuration or mechanism to return an exception in those cases where it is unable to communicate with another node?
Timeouts can be handled differently depending on where you need to manage it.
There is the flowTimeout:
When a flow implementing the TimedFlow interface and setting the
isTimeoutEnabled flag does not complete within a defined elapsed time,
it is restarted from the initial checkpoint.
Currently only used for notarisation requests with clustered notaries
Or otherwise it can be done programmatically in your flow. I suggest you to take a look to this part of the Corda documentation (Concurrency, Locking and Waiting) where there are many suggestions that you could try to implement.

Is it recommended to connect directly to a NATS server from a native front-end client?

As per the docs, the NATS server design is a "server first" approach, regarding protecting against "lazy clients". Lazy clients are just booted under poor performance scenarios.
Due to this, I have internalized an assumption that anything connecting from past the edge should not connect directly to the NATS server, but should instead access some middle layer service point that manages and maps the more poorly performing external connections through internally initiated connections to the NATS server.
For example, consider a client service node on a remote customer facility that accesses back end services that are serviced over NATS in some form.
Is my stated assumption true in a VERY STRICT sense, such that it is NEVER advisable for that remote node to connect to a NATS server directly, even given a low number of possible client connections AND a stable network service to those facilities?
Or, is it ok-ish to connect directly from that remote node IF, and ONLY IF, I have a known solid infrastructure (low latency, high bandwidth, dependable, etc).
Finally, how about if the path to that remote node is not a very solid network service? Specifically, is it better to A) as I described, use an intermediary service on the back-end for managing the end point connection requests and passing them along over internal connections, or B) is it preferred to just let it connect directly to the server and let NATS boot it as needed, then use a re-connect on drops approach to keep the connection up as best as it can?
A good example here is a mobile endpoint, which could be up and down regularly for various reasons and due to no fault of the device or infrastructure at all.
Currently, I am designing every NATS solution as a "backend connections only" design. If this is overcomplicating my designs needlessly, of course I want to stop forcing that design constraint. :)

Is it safe to communicate with WCF service from ASP.NET

There is a Windows service that I need to communicate with (in a duplex way) from ASP.NET. Is it safe to turn the Windows service into a WCF service and organize two-way communication?
I'm concerned about a scenario when the service is trying to communicate but ASP.NET process is getting reloaded and the message gets lost. Though it's unlikely during development, I guess it's quite likely in production with many clients.
I'm leaning towards a solution that involves some kind of persistence:
Both the Windows service and ASP.NET write data to SQL Server and get notified via SqlDependency
They exchange messages via RabbitMq
Here's a couple of ideas regarding the general case where two independent systems (processes, servers, etc.) need to communicate reliably:
Transaction model, where the transmitting party initiates communication and waits for acknowledgment from the recipient before marking the message as delivered. In case of transmission failure/timeout, it's the sender's responsibility to persist the message and retry later. For instance, Webhook architectures rely on this model.
Publish/Subscribe model, used by a lot of distributed systems, where both parties rely on a third-party message broker (message queue/service bus mechanism) such as RabbitMQ. In this architecture, sender is only responsible for making sure that the message has been successfully queued. The responsibility of making sure that the message is delivered to the recipient is on the message broker. In this case, you need to make sure that your message broker satisfies your reliability needs, for example: Is it in-memory only? Or does it also persist to disk and is able to recover from not just a process-recycle but also a power/system recycle.
And like you said, you can build your own messaging infrastructure too: sender writes to a local or cloud database or a cloud queue/service bus, and the receiver polls and consumes the messages.
So, a few guidelines:
If you ever need to scale out (have multiple servers) and they need to somehow collaborate on these messages, then make your initial investment on a database or cloud-queue solution (such as Azure SQL or Azure Queues).
Otherwise, if your services only need to communicate within one server, then you can use a database approach or use a queue service that satisfies your persistence/reliability requirements. RabbitMQ seems like a robust solution for this scenario.

When a queue should be used?

Suppose we were to implement a network application, such as a chat with a central server and several clients: we assume that all communication must go through the central server, then it should pick up messages from some clients and forward them to target clients, and so on.
Regardless of the technology used (sockets, web services, etc..), it is possible to think that there are some producer threads (that generate messages) and some consumer threads (that read messages).
For example, you could use a single queue for incoming and outgoing messages, but using a single queue, you couldn't receive and send messages simultaneously, because only one thread at a time can access the queue.
Perhaps it would be more appropriate to use two queues: for example, this article explains a way in which you can manage a double queue so that producers and consumers can work almost simultaneously. This scenario may be fine if there are only a producer and a consumer, but if there are many clients:
How to make so that the central server can receive data simultaneously from multiple input streams?
How to make so that the central server can send data simultaneously to multiple output streams?
To resolve this problem, my idea is to use a double queue for each client: on the central server, each client connection may be associated with two queues, one for incoming messages from that client and one for outgoing messages addressed to that client. In this way the central server may send and receive data simultaneously on almost all the connections with the clients...
There are probably other ways to manage the queues ... What are the parameters to determine how many queues are needed and how to organize them? There are cases that do not need any queue?
To me, this idea of using a queue per client or multiple queues per client seems to miss the point. First of all, it is absolutely possible to build a queue which can be accessed simultaneously by 2 threads (one can be enqueueing an item while a different one is dequeueing another item). If you want to know how, post a specific question about that.
Second, even if we assume that only 1 thread at a time can access a single queue, and even if we assume that the server will be receiving or sending data to/from all the clients simultaneously, it still doesn't follow that you need a different queue for each client. To avoid limiting system performance, you just need to allow enough concurrency to utilize all the server's CPUs. Even with a single, system-wide queue, if dequeueing/enqueueing messages is fast enough compared to the other work the server is doing, it might not be a bottleneck. (And with an efficient implementation, simply inserting an item or removing an item from a queue should be very fast. It's a very simple operation.) For that message queue to become the bottleneck limiting performance, either you would need a LOT of CPUs, or everything else the server was doing would have to be very fast. In that case, you could work out some scheme with 2 or 4 system-wide queues, to allow 2x or 4x more concurrency.
The whole idea of using work queues in a multi-threaded system is that they 1) allow multiple consumers to all grab work from a single location, so producers can "dump" whatever work they need done at that single location without worrying about which consumer will do it, and 2) function as a load-balancing mechanism for the consumers. (Additionally, a work queue can act as a "buffer" if producers temporarily generate work too fast for the consumers.) If you have a dedicated pair of producer-consumer threads for each client, it calls into question why you need to use queues at all. Why not just do a synchronous "pass off" from dedicated producer to corresponding dedicated consumer? Or, why not use a single thread per client which acts as both producer and consumer? Using queues in the way which you are proposing doesn't seem to really gain anything.

Resources