ECS cluster died, no logs, no alarms - Kibana

We're running a platform made up of five ECS clusters. One of the clusters died. We're using Kibana because it's cheaper than CloudWatch (logs are shipped via a Fluent Bit log router). The 14-hour span during which the cluster was down shows zero logs in Kibana, and we have no idea what happened to it. A simple restart of the cluster fixed the issue. So, to make sure it doesn't die again while we're away, we need to set it up to restart automatically. Dev did not implement a cluster health check, and since we use Kibana I can't use CloudWatch to implement metrics, alarms, and actions. What do I do here? How do I make the cluster restart itself when Kibana detects no incoming logs from it? Thank you.
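One way this could be wired up, sketched below under several assumptions: a small watchdog that runs outside the cluster it watches (e.g. on a cron schedule or an EventBridge-scheduled Lambda) queries Elasticsearch directly for documents newer than a cutoff, and if nothing has arrived it forces a new deployment of each ECS service so that all tasks are replaced. The endpoint, index pattern, cluster and service names below are placeholders, and "restart" here means bouncing the services rather than the cluster object itself.

```python
# Hypothetical watchdog -- ES_URL, INDEX_PATTERN, CLUSTER and SERVICES are placeholders.
import boto3
import requests

ES_URL = "https://elasticsearch.internal:9200"  # assumption: Elasticsearch is reachable directly
INDEX_PATTERN = "fluentbit-*"                   # assumption: Fluent Bit index pattern
CLUSTER = "my-ecs-cluster"                      # placeholder ECS cluster name
SERVICES = ["api", "worker"]                    # placeholder service names


def logs_seen_recently(minutes=15):
    """Return True if any document newer than `minutes` exists in the index."""
    query = {"query": {"range": {"@timestamp": {"gte": f"now-{minutes}m"}}}}
    resp = requests.get(f"{ES_URL}/{INDEX_PATTERN}/_count", json=query, timeout=10)
    resp.raise_for_status()
    return resp.json()["count"] > 0


def bounce_services():
    """Force a new deployment of each service so ECS replaces all of its tasks."""
    ecs = boto3.client("ecs")
    for svc in SERVICES:
        ecs.update_service(cluster=CLUSTER, service=svc, forceNewDeployment=True)


if __name__ == "__main__":
    if not logs_seen_recently():
        bounce_services()
```

The important part is that the check runs somewhere other than the cluster it is supposed to revive.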

Related

KubernetesPodOperator - crashing pods when scaling down

I ran into this issue the other day and I'm not sure I've identified the correct cause. Essentially, I am spinning up 10 KubernetesPodOperator tasks in parallel in Airflow. When I request the 10 pods, the nodes autoscale to meet their resource requirements. However, once, say, 8 of the 10 pods have completed their task, the autoscaler scales the nodes back down, which seems to crash my 2 remaining running pods (I assume because they are being rescheduled onto a new node). When I turn autoscaling off in Kubernetes and predefine the correct number of nodes, my 10 pods run fine. Does this logic make sense? Has anyone faced a similar issue, and if so, is there a way around it? We are running Airflow on an Azure AKS instance.
Thanks,
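A possible mitigation, assuming the crashes really are caused by the autoscaler removing nodes that still host running task pods: annotate the pods as not safe to evict, so the cluster autoscaler skips their nodes during scale-down. A minimal sketch (image, namespace and ids are placeholders; the import path depends on your Airflow / cncf.kubernetes provider version):

```python
# Sketch: ask the cluster autoscaler not to evict these pods during scale-down.
# Image, namespace and ids are placeholders; the import path depends on your
# Airflow / cncf.kubernetes provider version.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

with DAG("parallel_pods", start_date=datetime(2023, 1, 1), schedule_interval=None, catchup=False) as dag:
    for i in range(10):
        KubernetesPodOperator(
            task_id=f"worker_{i}",
            name=f"worker-{i}",
            namespace="airflow",
            image="myregistry/worker:latest",  # placeholder image
            # The autoscaler skips nodes that host pods carrying this annotation.
            annotations={"cluster-autoscaler.kubernetes.io/safe-to-evict": "false"},
            get_logs=True,
        )
```

The trade-off is that scale-down then waits for those pods to finish, which is usually what you want for batch tasks.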

MariaDB Spider with Galera Clusters failover solutions

I am having trouble building a database solution for an experiment that needs both high availability and performance (sharding).
Right now I have a Spider node and two Galera clusters (3 nodes in each cluster), as shown in the figure below, and this configuration works well in general cases:
However, as far as I know, when the Spider engine performs sharding, it must be assigned a primary IP per shard in order to distribute SQL statements to the two nodes in the different Galera clusters.
So my first question here is:
Q1) When machine .12 goes down because of a failure, how can I make .13 or .14 (one of them) automatically take over for .12?
(Figure: the servers that the Spider engine knows about)
Q2) Are there any open-source tools (or technologies) that can help me deal with this situation? If so, please explain how they work. (Maybe MaxScale? But I don't really know what it is or what it can do.)
Q3) The motivation for this experiment is as follows: an automated factory has many machines, and each machine generates data that must be recorded during the production process (perhaps hundreds or thousands of records per second) so we can observe the machines' operation and maximize the quality of each batch of products.
So my question is: is this architecture (Figure 1) reasonable? Please provide your suggestions.
You could use MaxScale in front of each Galera cluster to make its individual nodes appear as a single combined cluster. This way Spider will be able to seamlessly access the shard even if one of the nodes fails. You can take a look at the MaxScale tutorial for instructions on how to configure it for a Galera cluster.
Something like this should work:
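A rough sketch for one of the Galera shards, using the galeramon monitor and a readconnroute router so that all traffic goes to a single synced node and fails over automatically when that node disappears (section names, addresses and credentials below are placeholders):

```ini
# Sketch of a MaxScale configuration for one Galera shard; names, addresses
# and credentials are placeholders (e.g. the .12/.13/.14 machines from the question).
[Galera-Monitor]
type=monitor
module=galeramon
servers=node1,node2,node3
user=maxscale
password=maxscale_pw

[Galera-Service]
type=service
router=readconnroute
router_options=master
servers=node1,node2,node3
user=maxscale
password=maxscale_pw

[Galera-Listener]
type=listener
service=Galera-Service
protocol=MariaDBClient
port=4006

[node1]
type=server
address=192.0.2.12
port=3306
protocol=MariaDBBackend

[node2]
type=server
address=192.0.2.13
port=3306
protocol=MariaDBBackend

[node3]
type=server
address=192.0.2.14
port=3306
protocol=MariaDBBackend
```

Spider would then point at the MaxScale listener (port 4006 here) instead of directly at .12, and the second shard would get an equivalent configuration for its own three nodes.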
This of course has the same limitation as a single database node: if the MaxScale server goes down, you'll have to switch to a different MaxScale for that cluster. The benefit of using MaxScale is that it is in some sense stateless, which means it can be started and stopped almost instantly. A network load balancer (e.g. ELB) in front of it can already provide some protection against this problem.

MariaDB Galera cluster does not come up after killing the mysql process

I have a MariaDB Galera cluster with 2 nodes, and it is up and running.
Before moving to production, I want to make sure that if a node crashes abruptly, it comes back up on its own.
I tried using systemd's restart option, but after killing the mysql process the mariadb service does not come back up. Is there any tool or method I can use to automate bringing the nodes back up after a crash?
Galera clusters need to have quorum (3 nodes).
In order to avoid a split-brain condition, the minimum recommended number of nodes in a cluster is 3. Blocking state transfer is yet another reason to require a minimum of 3 nodes in order to enjoy service availability in case one of the members fails and needs to be restarted. While two of the members will be engaged in state transfer, the remaining member(s) will be able to keep on serving client requests.
You can read more here.
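For the restart itself, one option is a systemd drop-in that restarts the service whenever the process dies; a sketch, assuming the unit is named mariadb.service (create the override with `systemctl edit mariadb`):

```ini
# /etc/systemd/system/mariadb.service.d/override.conf -- sketch
[Service]
Restart=on-failure
RestartSec=5s
```

Keep in mind that with only two nodes, killing one can leave the survivor without quorum, so an automatic restart alone may not restore service; the three-node recommendation above still applies.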

EC2 Instance Requires Daily Restart

I am running a WordPress blog on an AWS t2.micro EC2 instance running Amazon Linux. However, most days I wake to an email saying that my blog is offline. When this happens I cannot SSH into the EC2 instance, yet on the AWS dashboard it is shown as being online and none of the metrics look too suspicious.
The time I was notified about the blog being down was just after the start of the first plateau on the CPU Utilization graph, at 4:31 am.
A restart from the AWS control panel/app fixes things for a day or two; however, I would like a more permanent fix.
Can anyone suggest any changes I can make to my instance to get it running more reliably?
[Edit - February 2018]
This has started happening again, after being fine for a few months. Each morning this week I have woken up to an alert that my blog is offline; a reboot of the server brings it back online. This morning I was able to investigate and SSH in. Running top gave the following (I noticed the lack of http/mysqld):
My CloudWatch metrics for the last 72 hours are:
The bigger spikes are where I rebooted the instance. As you can see, although there are spikes in CPU utilization, they aren't huge, as the CPU Credit Balance metric barely dips.
As this question has had so many views, I thought I would post about the workaround I have used to overcome this issue.
I still do not know why my blog goes offline, but knowing that rebooting the EC2 instance recovered it, I decided to automate that reboot.
There are three parts to this solution:
Detect the "blog offline" email from Jetpack and get it into AWS. I created a rule in Gmail to handle this, forwarding the email to an address monitored by Amazon SES, whose receipt rule publishes a notification to an SNS topic.
The SNS notification triggers an AWS Lambda function.
The Lambda function reboots the EC2 instance (a sketch of the handler follows below).
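The handler itself can be very small; a sketch, assuming the instance ID is supplied through an environment variable (the variable name is a placeholder):

```python
# Sketch of the reboot handler; the INSTANCE_ID environment variable is a placeholder.
import os

import boto3

ec2 = boto3.client("ec2")


def lambda_handler(event, context):
    # The SNS payload itself is not inspected; any notification triggers a reboot.
    instance_id = os.environ["INSTANCE_ID"]
    ec2.reboot_instances(InstanceIds=[instance_id])
    return {"rebooted": instance_id}
```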
Now I usually get a "blog back online" email within a few minutes of the original "blog offline" email.

How to detect EC2 instance has been idle

I am looking for ideas on how to approach this problem, which is basically measuring EC2 instance activity.
Is there a way I can determine that via AWS? If not, how can I measure activity on, for example, an Ubuntu instance, taking into account that I will have some processes running and want to look into their activity?
I'm not sure how you are defining "instance activity", but you can monitor your Amazon EC2 instance with Amazon CloudWatch and then query the CPUUtilization metric to get information about how much CPU your instance is using (see Amazon Elastic Compute Cloud Dimensions and Metrics for details).
It's better to measure two metrics and compare them: network traffic and CPU utilization.
Ref: https://www.trendmicro.com/cloudoneconformity/knowledge-base/aws/EC2/idle-instance.html#:~:text=Using%20AWS%20Console-,01%20Sign%20in%20to%20the%20AWS%20Management%20Console.,to%20identify%20the%20right%20resource).
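A sketch of that kind of check with boto3, using rough, made-up thresholds (the instance ID is a placeholder):

```python
# Sketch: flag an instance as idle when average CPU and inbound network traffic
# over the last 7 days both stay below rough, made-up thresholds.
from datetime import datetime, timedelta

import boto3

cw = boto3.client("cloudwatch")
INSTANCE_ID = "i-0123456789abcdef0"  # placeholder


def weekly_average(metric):
    """Average of the daily averages for one EC2 metric over the last 7 days."""
    end = datetime.utcnow()
    stats = cw.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName=metric,
        Dimensions=[{"Name": "InstanceId", "Value": INSTANCE_ID}],
        StartTime=end - timedelta(days=7),
        EndTime=end,
        Period=86400,  # one datapoint per day
        Statistics=["Average"],
    )
    points = stats["Datapoints"]
    return sum(p["Average"] for p in points) / len(points) if points else 0.0


cpu = weekly_average("CPUUtilization")  # percent
net = weekly_average("NetworkIn")       # bytes per sampling period
print("idle" if cpu < 2.0 and net < 5_000_000 else "active")
```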
