Which node should I push data to in a cluster? - nginx

I've set up a Kafka cluster with 3 nodes:
kafka01.example.com
kafka02.example.com
kafka03.example.com
Kafka does replication so that any node in the cluster can be removed without losing data.
Normally I would send all data to kafka01, however that will break the entire cluster if that one node goes down.
What is industry best practice when dealing with clusters? I'm evaluating setting up an NGINX reverse proxy with round robin load balancing. Then I can point all data producers at the proxy and it will divvy up between the nodes.
I need to ensure that no data is lost if one of the nodes becomes unavailable.
Is an nginx reverse proxy an appropriate tool for this use case?
Is my assumption correct that a round robin reverse proxy will distribute the data and increase reliability without data loss?
Is there a different approach that I haven't considered?

Normally your producer takes care of distributing the data across all (or a selected set of) nodes that are up and running, using a partitioning function that works either round robin or according to semantics of your choice. The producer publishes to a partition of a topic, and different nodes are leaders for different partitions of that topic. If a broker node becomes unavailable, it drops out of the set of in-sync replicas (ISR) and new leaders are elected for the partitions that node was leading. Through metadata requests and responses, your producer becomes aware of this and pushes messages to the nodes that are currently up.
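A minimal sketch of that behaviour, as a toy model rather than a real Kafka client: the `ToyCluster` class, the replica layout, and the use of `crc32` as the partitioning hash are all assumptions made for illustration. It only shows that each partition has its own leader and that leadership moves to a surviving replica when a broker disappears.

```python
from dataclasses import dataclass, field
from zlib import crc32


@dataclass
class ToyCluster:
    """Toy stand-in for cluster metadata (hypothetical, not a Kafka API)."""
    replicas: dict[int, list[str]]            # partition -> replica brokers
    down: set[str] = field(default_factory=set)

    def leader(self, partition: int) -> str:
        # The first replica that is still up acts as leader for the partition.
        for broker in self.replicas[partition]:
            if broker not in self.down:
                return broker
        raise RuntimeError(f"no live replica for partition {partition}")


def choose_partition(key: bytes, num_partitions: int) -> int:
    # Key-based partitioning: the same key always maps to the same partition.
    # crc32 is a stand-in hash chosen for this sketch.
    return crc32(key) % num_partitions


def send(cluster: ToyCluster, key: bytes) -> tuple[int, str]:
    # A producer picks the partition, then talks to that partition's leader.
    partition = choose_partition(key, len(cluster.replicas))
    return partition, cluster.leader(partition)


cluster = ToyCluster(replicas={
    0: ["kafka01", "kafka02", "kafka03"],
    1: ["kafka02", "kafka03", "kafka01"],
    2: ["kafka03", "kafka01", "kafka02"],
})

before = {p: cluster.leader(p) for p in cluster.replicas}
cluster.down.add("kafka01")                   # simulate losing one broker
after = {p: cluster.leader(p) for p in cluster.replicas}
print(before)
print(after)
```

In this model the producer never needs an external load balancer: it recomputes the leader from fresh metadata and keeps writing to whichever broker now leads the partition.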

Related

Using endpoints of AWS ElastiCache for Redis

I am using AWS ElastiCache for Redis as the caching solution for my Spring Boot application. I am using spring-boot-starter-data-redis and the Jedis client to connect to my cache.
Imagine that my cache is cluster-mode-enabled, with 3 shards and 2 nodes in each. I agree that the best way to connect is through the configuration endpoint. Alternatively, I can list the endpoints of all nodes and let the client do the job.
However, even if I use a single node's endpoint from one of the shards, my caching solution works. That doesn't look right to me. Even though it works, I feel it might cause problems in the cluster in the long run, since there are 6 nodes altogether, partitioned into 3 shards, and I am using only one node's endpoint. I have the following questions.
Does using one node's endpoint create an imbalance in the cluster?
or
Is that handled automatically by the AWS ElastiCache for Redis?
If I use only one node's endpoint, does that mean the other nodes will never be used?
Thank you!
To answer your questions:
Does using one node's endpoint create an imbalance in the cluster?
No.
Is that handled automatically by the AWS ElastiCache for Redis?
Somewhat.
If I use only one node's endpoint, does that mean the other nodes will never be used?
No. All nodes are being used.
This is how Cluster Mode Enabled works. In your case, you have 3 shards, meaning all your slots (where key-value data is stored) are divided among 3 sub-clusters, i.e. shards.
This was explained in this answer as well - https://stackoverflow.com/a/72058580/6024431
So, essentially, your nodes are smart enough to redirect your requests to the node that owns the key slot where your data needs to be stored. So there are no imbalances. Redis handles the redirection for you.
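To make the slot idea concrete, here is a small sketch of how a key maps to a hash slot. Redis Cluster hashes the key with CRC16 and takes the result modulo 16384; Python's `binascii.crc_hqx` with an initial value of 0 computes the same CRC16 variant. The `shard_for` helper is an assumption for illustration only: it supposes the slots are split into three equal contiguous ranges, which may not match how your cluster actually assigns them.

```python
import binascii

TOTAL_SLOTS = 16384  # fixed number of hash slots in Redis Cluster


def key_slot(key: str) -> int:
    """Return the Redis Cluster hash slot for a key (CRC16 mod 16384)."""
    data = key.encode()
    # Hash tags: if the key contains a non-empty {...} section,
    # only the text between the braces is hashed.
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        if end != -1 and end != start + 1:
            data = data[start + 1:end]
    return binascii.crc_hqx(data, 0) % TOTAL_SLOTS


def shard_for(slot: int, num_shards: int = 3) -> int:
    # Assumption: equal contiguous slot ranges, one range per shard.
    return slot * num_shards // TOTAL_SLOTS


for key in ["user:1", "user:2", "session:abc"]:
    slot = key_slot(key)
    print(key, "-> slot", slot, "-> shard", shard_for(slot))
```

Whichever node receives the command, the slot decides which shard stores the key, which is why connecting through a single node does not skew the data distribution.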
Now, if you use node endpoints, you're going to face other problems.
ElastiCache runs in the cloud (which is essentially AWS hardware), and all hardware fails sooner or later. You have 3 primaries (1p, 2p, 3p) and 3 replicas (1r, 2r, 3r).
So, if a primary (let's say 1p) goes down due to a hardware issue, its replica (1r) gets promoted to become the new primary for that shard.
Now the problem is that your application is connected directly to 1p, which has been demoted to a replica. So all the WRITE operations will fail.
And you will have to change the application code manually whenever this happens.
Alternatively, if you were using the configuration endpoint (or other cluster-level endpoints) instead of node endpoints, this issue would be only a blip for your application, perhaps 1-2 seconds at most.
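The difference can be sketched with a toy model of one shard. Everything here (`ToyShard`, `write_pinned`, `write_discovering`) is hypothetical and not part of Jedis or ElastiCache; it only contrasts a client pinned to one node with a client that looks up the current primary before writing.

```python
class ToyShard:
    """Toy model of one shard: a primary plus one replica."""

    def __init__(self, primary: str, replica: str):
        self.primary, self.replica = primary, replica

    def fail_over(self) -> None:
        # The replica is promoted; the old primary returns as a replica.
        self.primary, self.replica = self.replica, self.primary

    def write(self, node: str) -> str:
        if node != self.primary:
            raise ConnectionError(f"{node} is not the primary; write rejected")
        return f"written via {node}"


def write_pinned(shard: ToyShard, node: str) -> str:
    # Client hard-wired to a single node endpoint.
    return shard.write(node)


def write_discovering(shard: ToyShard) -> str:
    # Client that asks for the current primary first, which is roughly
    # what a cluster-level endpoint makes possible.
    return shard.write(shard.primary)


shard1 = ToyShard(primary="1p", replica="1r")
print(write_pinned(shard1, "1p"))      # works before the failover

shard1.fail_over()                     # 1r becomes the primary
try:
    write_pinned(shard1, "1p")
    pinned_ok = True
except ConnectionError:
    pinned_ok = False

print(pinned_ok, write_discovering(shard1))
```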
Cheers!

How to get high availability when downloading data on 2 servers?

I have data that needs to be downloaded to a local server every 24 hours. For high availability, we provisioned 2 servers to avoid failures and data loss.
My question is: What is the best approach to use the 2 servers?
My ideas are:
- Download on one server, and only if the download fails for any reason, continue it on the other server.
- Download on both servers at the same time every day.
Any advice?
In terms of your high-level approach, break it down into manageable chunks i.e. reliable data acquisition, and highly available data dissemination. I would start with the second part first, because that's the state you want to get to.
Highly available data dissemination
Working backwards (i.e. this is the second part of your problem), when offering highly-available data to consumers you have two options:
Active-Passive
Active-Active
Active-Active means you have at least two nodes servicing requests for the data, with some kind of Load Balancer (LB) in front, which allocates the requests. Depending on the technology you are using there may be existing components/solutions in that tech stack, or reference models describing potential solutions.
Active-Passive means you have one node taking all the traffic, and when that becomes unavailable requests are directed at the stand-by / passive node.
The passive node can be "hot" (ready to go) or "cold", meaning it's not fully operational but is relatively fast and easy to stand up and start taking traffic.
In both cases, and if you have only 2 nodes, you ideally want both the nodes to be capable of handling the entire load. That's obvious for Active-Passive, but it also applies to active-active, so that if one goes down the other will successfully handle all requests.
In both cases you need some kind of network component that routes the traffic. Ideally it will be able to operate autonomously (it will have to if you want active-active load sharing), but you could have a manual / alert based process for switching from active to passive. For one thing, it will depend on what your non-functional requirements are.
Reliable data acquisition
Having figured out how you will disseminate the data, you know where you need to get it to.
E.g. if active-active, you need to get it to both at the same time (I don't know what tolerances you can have), since you want them to serve the same consistent data. One option to get around that issue is this:
Have the LB route all traffic to node A.
Node B performs the download.
The LB is informed that Node B successfully got the new data and is ready to serve it. LB then switches the traffic flow to just Node B.
Node A gets the updated data (perhaps from Node B, so the data is guaranteed to be the same).
The LB is informed that Node A successfully got the new data and is ready to serve it. LB then allows the traffic flow to Nodes A & B.
This pattern would also work for active-passive:
Node A is the active node, B is the passive node.
Node B downloads the new data, and is ready to serve it.
Node A gets updated with the new data (probably from node B), to ensure consistency.
Node A serves the new data.
You get the data on the passive node first so that if node A went down, node B would already have the new data. Admittedly the time-window for that to happen should be quite small.
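The active-active sequence above can be sketched as a small simulation. `Node`, `LoadBalancer`, and `rolling_refresh` are hypothetical names, and the `download` callable stands in for whatever actually fetches the daily data; this is an outline of the ordering, not a real LB integration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    version: str                      # which day's data the node holds


@dataclass
class LoadBalancer:
    targets: list[str] = field(default_factory=list)

    def route_to(self, *names: str) -> None:
        self.targets = list(names)


def rolling_refresh(lb, a, b, download):
    """Refresh both nodes so that routed nodes always hold the same data."""
    log = []

    def snapshot():
        log.append((tuple(lb.targets), a.version, b.version))

    lb.route_to(a.name)               # 1. all traffic goes to node A
    snapshot()
    b.version = download()            # 2. node B performs the download
    lb.route_to(b.name)               # 3. B is ready: traffic goes to B only
    snapshot()
    a.version = b.version             # 4. A copies the data from B
    lb.route_to(a.name, b.name)       # 5. A is ready: traffic to A and B
    snapshot()
    return log


node_a, node_b = Node("A", "day-1"), Node("B", "day-1")
lb = LoadBalancer(["A", "B"])
history = rolling_refresh(lb, node_a, node_b, download=lambda: "day-2")
for step in history:
    print(step)
```

One property worth noting in this ordering: if the download raises, the load balancer is still pointing at node A with yesterday's data, so consumers keep getting a consistent answer while you retry.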

Load balancing on same server

I have been researching Kubernetes and saw that it does load balancing on a single node. If I'm not wrong, one node means one server machine, so what good is load balancing on the same machine? It will use the same CPU and RAM to handle the requests. I thought load balancing would be done across separate machines to share CPU and RAM resources. So I want to know the point of doing load balancing on the same server.
Just because you can do it on one node doesn't mean you should, especially in a production environment.
A production cluster will have at least 3 or 5 nodes.
Kubernetes will spread the replicas across the cluster nodes to balance the node workload, so pods end up on different nodes.
You can also configure which nodes your pods land on:
use advanced scheduling, pod affinity and anti-affinity
you can also plug in your own scheduler that will not allow placing replica pods of the same app on the same node
Then you define a Service to load-balance across pods on different nodes.
kube-proxy will do the rest.
Here is a useful read:
https://itnext.io/keep-you-kubernetes-cluster-balanced-the-secret-to-high-availability-17edf60d9cb7
So you generally need to choose a level of availability you are
comfortable with. For example, if you are running three nodes in three
separate availability zones, you may choose to be resilient to a
single node failure. Losing two nodes might bring your application
down but the odds of losing two data centres in separate availability
zones are low.
The bottom line is that there is no universal approach; only you can
know what works for your business and the level of risk you deem
acceptable.
I guess you mean how Services do automatic load-balancing. Imagine you have a Deployment with 2 replicas on your one node and a Service. Traffic to the Pods goes through the Service, so if it did no load-balancing, everything would go to just one Pod and the other Pod would get nothing. With load-balancing you can handle more load by spreading it evenly, and still be confident that traffic will be served if one Pod dies.
You can also load-balance traffic coming into the cluster from outside so that the entrypoint to the cluster isn't always the same node. But that is a different level of load-balancing. Even with one node you can still want load-balancing for the Services within the cluster. See Clarify Ingress load balancer on load-balancing of external entrypoint.
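A toy illustration of the single-node case described above: two Pods behind one Service, with requests spread between them and still served after one Pod dies. `ToyService` is a hypothetical class, and the strict rotation is a simplification; how a real Service picks a backend depends on the kube-proxy mode.

```python
from collections import Counter
from itertools import cycle


class ToyService:
    """Toy Service: hands each request to the next healthy Pod in turn."""

    def __init__(self, pods):
        self.healthy = list(pods)
        self._ring = cycle(self.healthy)

    def mark_dead(self, pod):
        self.healthy.remove(pod)
        # Rebuild the rotation so the dead Pod is no longer selected.
        self._ring = cycle(self.healthy)

    def route(self):
        if not self.healthy:
            raise RuntimeError("no healthy Pods behind the Service")
        return next(self._ring)


svc = ToyService(["pod-a", "pod-b"])          # both Pods on the same node
first = Counter(svc.route() for _ in range(10))

svc.mark_dead("pod-a")                        # one replica crashes
second = Counter(svc.route() for _ in range(10))

print(dict(first))
print(dict(second))
```

Even on one machine this buys two things: requests are not all queued behind a single process, and a crashed replica does not take the application down.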

DynamoDB DAX and High Availability

What's your preferred strategy for dealing with DAX's maintenance windows?
DynamoDB itself has no MWs and is very highly available. When DAX is introduced into the mix, if it's the sole access point of clients to DDB then it becomes a SPOF. How do you then handle degradation gracefully during DAX scheduled downtimes?
My thinking was to not use the DAX Client directly but introduce some abstraction layer that allows it to fall back to direct DDB access when DAX is down. Is that a good approach?
A DAX maintenance window doesn't take the cluster offline, unless it is a one-node cluster. DAX provides availability through multiple nodes in the cluster. For a multi-node cluster, the nodes go through maintenance in a specific order so that the cluster remains available. With retries configured on the DAX client, your workload shouldn't see an impact during maintenance windows.
Apart from maintenance windows, cluster nodes should be spread across multiple AZs for availability in case an AZ goes down.
An abstraction layer to fall back to DDB is not a bad idea. But you need to make sure the table has enough provisioned capacity to handle the load spike when reads bypass DAX.
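One possible shape for such an abstraction layer, as a sketch under stated assumptions: `KeyValueReader`, `FallbackReader`, `DownCache`, and `DictTable` are hypothetical names, not the DAX or DynamoDB client APIs, and the real clients raise their own exception types rather than `ConnectionError`.

```python
from typing import Any, Optional, Protocol


class KeyValueReader(Protocol):
    def get_item(self, key: str) -> Optional[dict[str, Any]]: ...


class FallbackReader:
    """Try the cache client first; if it stays unreachable, read the table."""

    def __init__(self, cache: KeyValueReader, table: KeyValueReader,
                 retries: int = 2):
        self.cache, self.table, self.retries = cache, table, retries
        self.fallbacks = 0            # count of direct reads, worth monitoring

    def get_item(self, key: str) -> Optional[dict[str, Any]]:
        for _ in range(self.retries):
            try:
                return self.cache.get_item(key)
            except ConnectionError:
                continue              # retry; another cache node may answer
        self.fallbacks += 1
        return self.table.get_item(key)


class DownCache:
    """Fake cache that is always unavailable (for the demo only)."""

    def get_item(self, key):
        raise ConnectionError("cache unavailable")


class DictTable:
    """Fake table backed by a dict (for the demo only)."""

    def __init__(self, rows):
        self.rows = rows

    def get_item(self, key):
        return self.rows.get(key)


reader = FallbackReader(DownCache(), DictTable({"user#1": {"name": "Ada"}}))
item = reader.get_item("user#1")
print(item, reader.fallbacks)
```

Tracking the fallback count is a deliberate choice here: it is the signal that tells you how much read traffic is landing on the table directly, which is exactly the load the provisioned capacity has to absorb.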

akka.net scaling in azure asp.net website

I have set up Akka.net actors running inside an ASP.net application to handle some asynchronous & lightweight procedures. I'm wondering how Akka scales when I scale out the website on Azure. Let's say that in code I have a single actor to process messages of type FooBar. When I have two instances of the website, is there still a single actor or are there now two actors?
By default, whenever you call the ActorOf method, you request the creation of a new actor instance. If you call it in two actor systems, you end up with two separate actors that have the same relative paths inside their systems but different global addresses.
There are several ways to share information about actors between actor systems:
When using Akka.Remote you can call actors living on another actor system given their addresses or IActorRefs. Requirements:
You must know the path to a given actor.
You must know the actual address (URL or IP) of an actor system, on which that actor lives.
Both actor systems must be able to communicate with each other via TCP (i.e. open ports on the firewall).
When using Akka.Cluster, actor systems (also known as nodes) can form a cluster. They exchange information about their location in the network, track incoming nodes and eventually detect dead or unreachable ones. On top of it, you can use higher-level components, e.g. cluster routers. Requirements:
Every node must be able to open a TCP channel to every other node (so again, firewalls etc.).
A new incoming node must know at least one node that is already part of the cluster. This is easily achievable as part of the pattern known as Lighthouse, or via plugins and 3rd-party services like Consul.
All nodes must use the same actor system name.
Finally, when using a cluster configuration you can make use of Akka.Cluster.Sharding, which is essentially a higher-level abstraction over an actor's location inside the cluster. When using it, you don't need to say explicitly where to find or when to create an actor. Instead, all you need is a unique actor identifier. When you send a message to such an actor, it is created ad hoc somewhere in the cluster if it didn't exist before, and actors are rebalanced to spread the workload evenly across the cluster. This plugin also handles all the logic associated with routing the message to that actor.
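The idea behind sharding can be sketched in a few lines. This is a Python toy, not Akka.NET code: `shard_id`, `allocate`, and `locate` are hypothetical helpers, the node names are made up, and dealing shards out in turn is an assumption; the real allocation strategy differs. It only shows why a unique identifier is enough to find exactly one actor, however many website instances are running.

```python
from collections import Counter
from zlib import crc32


def shard_id(entity_id: str, num_shards: int) -> int:
    # Stable hash, so every node computes the same shard for an entity.
    return crc32(entity_id.encode()) % num_shards


def allocate(num_shards: int, nodes: list[str]) -> dict[int, str]:
    # Assumption: shards are dealt out to the nodes in turn.
    return {s: nodes[s % len(nodes)] for s in range(num_shards)}


def locate(entity_id: str, allocation: dict[int, str]) -> str:
    # An entity lives on whichever node currently hosts its shard.
    return allocation[shard_id(entity_id, len(allocation))]


two_nodes = allocate(10, ["web-1", "web-2"])
three_nodes = allocate(10, ["web-1", "web-2", "web-3"])

print(Counter(two_nodes.values()))
print(locate("foobar-17", two_nodes), locate("foobar-17", three_nodes))
```

Scaling out from two to three instances changes where some shards live, but each identifier still resolves to a single location, so there is one actor per identifier rather than one per website instance.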
