Unable to create MariaDB Galera Cluster - mariadb

I have built an image based on mariadb:10.1 which basically adds a new cluster.conf but facing the following error on the second node after the first node started working successfully. Can somebody help me debug here?
Error log tail
2016-09-28 10:12:55 139799503415232 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
at gcomm/src/pc.cpp:connect():162
2016-09-28 10:12:55 139799503415232 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -110 (Connection timed out)
2016-09-28 10:12:55 139799503415232 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1380: Failed to open channel 'test_cluster' at 'gcomm://172.17.0.2,172.17.0.3,172.17.0.4': -110 (Connection timed out)
2016-09-28 10:12:55 139799503415232 [ERROR] WSREP: gcs connect failed: Connection timed out
2016-09-28 10:12:55 139799503415232 [ERROR] WSREP: wsrep::connect(gcomm://172.17.0.2,172.17.0.3,172.17.0.4) failed: 7
2016-09-28 10:12:55 139799503415232 [ERROR] Aborting
MySQL init process failed.
Debugging steps taken
NOTE: Container IP addresses were ensured to be the same as shown.
To ensure networking between containers is working, tried creating another container which could login to the first container's mysql instance.
This is definitely not related to MYSQL_HOST
To see if the container was running out of memory, I used docker stats and saw that the failed container was using only a meagre 142MB all through its lifecycle until it failed, which is way lesser than the total memory it was allowed (~4GB).
I am using Docker for Mac, but tried running the same on a CentOS VirtualBox and gives the same results. Doesn't look like Docker on Mac has a problem.
Config
[mysqld]
user=mysql
binlog_format=ROW
bind-address=0.0.0.0
default_storage_engine=innodb
innodb_autoinc_lock_mode=2
innodb_flush_log_at_trx_commit=0
innodb_buffer_pool_size=122M
innodb_file_per_table=1
innodb_doublewrite=1
query_cache_size=0
query_cache_type=0
wsrep_on=ON
wsrep_provider=/usr/lib/libgalera_smm.so
wsrep_sst_method=rsync
Steps to start containers
# bootstrap node
docker run --rm -e MYSQL_ROOT_PASSWORD=123 \
activatedgeek/mariadb:devel \
--wsrep-cluster-name=test_cluster \
--wsrep-cluster-address=gcomm://172.17.0.2,172.17.0.3,172.17.0.4 \
--wsrep-new-cluster
# add node into cluster
docker run --rm -e MYSQL_ROOT_PASSWORD=123 \
activatedgeek/mariadb:devel \
--wsrep-cluster-name=test_cluster \
--wsrep-cluster-address=gcomm://172.17.0.2,172.17.0.3,172.17.0.4
# add node into cluster
docker run --rm -e MYSQL_ROOT_PASSWORD=123 \
activatedgeek/mariadb:devel \
--wsrep-cluster-name=test_cluster \
--wsrep-cluster-address=gcomm://172.17.0.2,172.17.0.3,172.17.0.4

This problem is caused due to the hanging init process. The configurations and CLI arguments above are correct. The only thing to be done before the init process starts is to create and empty mysql directory in the data directory (/var/lib/mysql by default). The must only be created on all nodes except the bootstrap node.
mkdir -p /var/lib/mysql/mysql
See sample MariaDB Cluster for usage which uses a custom MariaDB image and is a proof of concept for creating clusters.

I guess your containers should either expose the required ports:
-p 3306:3306 -p 4444:4444 -p 4567:4567 -p 4568:4568
or should be --link (ed) together.

Related

Error on Starting MySQL Cluster 8.0 Data Node on Ubuntu 22.04 LTS

When I start the data nodeid 1 (10.1.1.103) of MySQL Cluster 8.0 on Ubuntu 22.04 LTS I am getting the following error:
# ndbd
Failed to open /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list: No such file or directory
2023-01-02 17:16:55 [ndbd] INFO -- Angel connected to '10.1.1.102:1186'
2023-01-02 17:16:55 [ndbd] INFO -- Angel allocated nodeid: 2
When I start data nodeid 2 (10.1.1.105) I get the following error:
# ndbd
Failed to open /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list: No such file or directory
2023-01-02 11:10:04 [ndbd] INFO -- Angel connected to '10.1.1.102:1186'
2023-01-02 11:10:04 [ndbd] ERROR -- Failed to allocate nodeid, error: 'Error: Could not alloc node id at 10.1.1.102:1186: Connection done from wrong host ip 10.1.1.105.'
The management node log file reports (on /var/lib/mysql-cluster/ndb_1_cluster.log):
2023-01-02 11:28:47 [MgmtSrvr] INFO -- Node 2: Initial start, waiting for 3 to connect, nodes [ all: 2 and 3 connected: 2 no-wait: ]
What is the relevance of failing to open: /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list: No such file or directory?
Why is data node on 10.1.1.105 unable to allocate a nodeid?
I initially installed a single Management Node on 10.1.1.102:
wget https://dev.mysql.com/get/Downloads/MySQL-Cluster-8.0/mysql-cluster_8.0.31-1ubuntu22.04_amd64.deb-bundle.tar
tar -xf mysql-cluster_8.0.31-1ubuntu22.04_amd64.deb-bundle.tar
dpkg -i mysql-cluster-community-management-server_8.0.31-1ubuntu22.04_amd64.deb
mkdir /var/lib/mysql-cluster
vi /var/lib/mysql-cluster/config.ini
The configuration set up on config.ini:
[ndbd default]
# Options affecting ndbd processes on all data nodes:
NoOfReplicas=2 # Number of replicas
[ndb_mgmd]
# Management process options:
hostname=10.1.1.102 # Hostname of the manager
datadir=/var/lib/mysql-cluster # Directory for the log files
[ndbd]
hostname=10.1.1.103 # Hostname/IP of the first data node
NodeId=2 # Node ID for this data node
datadir=/usr/local/mysql/data # Remote directory for the data files
[ndbd]
hostname=10.1.1.105 # Hostname/IP of the second data node
NodeId=3 # Node ID for this data node
datadir=/usr/local/mysql/data # Remote directory for the data files
[mysqld]
# SQL node options:
hostname=10.1.1.102 # In our case the MySQL server/client is on the same Droplet as the cluster manager
I then started and killed the running server and created a systemd file for Cluster manager:
ndb_mgmd -f /var/lib/mysql-cluster/config.ini
pkill -f ndb_mgmd
vi /etc/systemd/system/ndb_mgmd.service
Adding the following configuration:
[Unit]
Description=MySQL NDB Cluster Management Server
After=network.target auditd.service
[Service]
Type=forking
ExecStart=/usr/sbin/ndb_mgmd -f /var/lib/mysql-cluster/config.ini
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure
[Install]
WantedBy=multi-user.target
I then reloaded the systemd daemon to apply the changes, started and enabled the Cluster Manager and checked its active status:
systemctl daemon-reload
systemctl start ndb_mgmd
systemctl enable ndb_mgmd
Here is the status of the Cluster Manager:
# systemctl status ndb_mgmd
● ndb_mgmd.service - MySQL NDB Cluster Management Server
Loaded: loaded (/etc/systemd/system/ndb_mgmd.service; enabled; vendor preset: enabled)
Active: active (running) since Sun 2023-01-01 08:25:07 CST; 27min ago
Main PID: 320972 (ndb_mgmd)
Tasks: 12 (limit: 9273)
Memory: 2.5M
CPU: 35.467s
CGroup: /system.slice/ndb_mgmd.service
└─320972 /usr/sbin/ndb_mgmd -f /var/lib/mysql-cluster/config.ini
Jan 01 08:25:07 nuc systemd[1]: Starting MySQL NDB Cluster Management Server...
Jan 01 08:25:07 nuc ndb_mgmd[320971]: MySQL Cluster Management Server mysql-8.0.31 ndb-8.0.31
Jan 01 08:25:07 nuc systemd[1]: Started MySQL NDB Cluster Management Server.
I then set up a data node on 10.1.1.103, installing dependencies, downloading the data node and setting up its config:
apt update && apt -y install libclass-methodmaker-perl
wget https://dev.mysql.com/get/Downloads/MySQL-Cluster-8.0/mysql-cluster_8.0.31-1ubuntu22.04_amd64.deb-bundle.tar
tar -xf mysql-cluster_8.0.31-1ubuntu22.04_amd64.deb-bundle.tar
dpkg -i mysql-cluster-community-data-node_8.0.31-1ubuntu22.04_amd64.deb
vi /etc/my.cnf
I entered the address of the Cluster Management Node in the configuration:
[mysql_cluster]
# Options for NDB Cluster processes:
ndb-connectstring=10.1.1.102 # location of cluster manager
I then created a data directory and started the node:
mkdir -p /usr/local/mysql/data
ndbd
This is when I got the "Failed to open" error result on data nodeid 1 (102.1.1.103):
# ndbd
Failed to open /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list: No such file or directory
2023-01-02 17:16:55 [ndbd] INFO -- Angel connected to '10.1.1.102:1186'
2023-01-02 17:16:55 [ndbd] INFO -- Angel allocated nodeid: 2
UPDATED (2023-01-02)
Thank you #MauritzSundell. I corrected the (private) IP addresses above and no longer got:
# ndbd
Failed to open /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list: No such file or directory
ERROR: Unable to connect with connect string: nodeid=0,10.1.1.2:1186
Retrying every 5 seconds. Attempts left: 12 11 10 9 8 7 6 5 4 3 2 1, failed.
2023-01-01 14:41:57 [ndbd] ERROR -- Could not connect to management server, error: ''
Also #MauritzSundell, in order to use the ndbmtd process rather than the ndbd process, does any alteration need to be made to any of the configuration files (e.g. /etc/systemd/system/ndb_mgmd.service)?
What is the appropriate reference/tutorial documentation for MySQL Cluster 8.0? Is it MySQL Cluster "MySQL NDB Cluster 8.0" on:
https://downloads.mysql.com/docs/mysql-cluster-excerpt-8.0-en.pdf
Or is it "MySQL InnoDB Cluster" on:
https://dev.mysql.com/doc/refman/8.0/en/mysql-innodb-cluster-introduction.html
Not sure I understand the difference.

Podman build command unable to pull image

I have configured Subuid and Subgid after installing Podman in RHEL7
I have created a simple Dockerfile to print hello world and was trying to build the image.
My Dockerfile
FROM alpine
CMD ["echo", "Hello World"]
To test I am running below command
Podman build -t imagename .
I see the below error received.
STEP 1: FROM alpine
Error: error creating build container: The following failures happened while trying to pull image specified by "alpine" based on search registries in /etc/containers/registries.conf:
* "localhost/alpine": Error initializing source docker://localhost/alpine:latest: error pinging docker registry localhost: Get https://localhost/v2/: dial tcp [::1]:443: connect: connection refused
* "registry.access.redhat.com/alpine": Error initializing source docker://registry.access.redhat.com/alpine:latest: error pinging docker registry registry.access.redhat.com: Get https://registry.access.redhat.com/v2/: read tcp 10.70.85.174:17758->23.54.147.129:443: read: connection reset by peer
* "registry.redhat.io/alpine": Error initializing source docker://registry.redhat.io/alpine:latest: error pinging docker registry registry.redhat.io: Get https://registry.redhat.io/v2/: read tcp 10.70.85.174:36028->104.79.150.216:443: read: connection reset by peer
* "docker.io/library/alpine": Error initializing source docker://alpine:latest: error pinging docker registry registry-1.docker.io: Get https://registry-1.docker.io/v2/: read tcp 10.70.85.174:53352->18.213.137.78:443: read: connection reset by peer
Am I missing any configuration ?
Thanks
Have you still the docket Daemon running and/or docker installed?
First stop the docker Daemon
sudo systemctl stop docker
OR
sudo service docker stop
Then uninstall docker
Ubuntu here but what ever you need you can Google :D
sudo apt-get remove docker docker-engine docker.io containerd runc
Try again,
If other fail now try a refreshed install of podman
sudo --reinstall install podman
Sources
https://www.cyberciti.biz/faq/debian-ubuntu-linux-reinstall-a-package-using-apt-get-command/
https://askubuntu.com/questions/935569/how-to-completely-uninstall-docker
https://intellipaat.com/community/43965/how-to-stop-docker
https://podman.io/getting-started/installation
I suggest that you first search your image in registries
podman search alpine
you should get a list of images available. Choose the one you want - version, name, tag etc and put that in the dockerfile.
to be sure it is accessible, do the 'pull' manually
podman pull alpine<version,tag>

Airflow live executor logs with DaskExecutor

I have an Airflow installation (on Kubernetes). My setup uses DaskExecutor. I also configured remote logging to S3. However when the task is running I cannot see the log, and I get this error instead:
*** Log file does not exist: /airflow/logs/dbt/run_dbt/2018-11-01T06:00:00+00:00/3.log
*** Fetching from: http://airflow-worker-74d75ccd98-6g9h5:8793/log/dbt/run_dbt/2018-11-01T06:00:00+00:00/3.log
*** Failed to fetch log file from worker. HTTPConnectionPool(host='airflow-worker-74d75ccd98-6g9h5', port=8793): Max retries exceeded with url: /log/dbt/run_dbt/2018-11-01T06:00:00+00:00/3.log (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7d0668ae80>: Failed to establish a new connection: [Errno -2] Name or service not known',))
Once the task is done, the log is shown correctly.
I believe what Airflow is doing is:
for finished tasks read logs from s3
for running tasks, connect to executor's log server endpoint and show that.
Looks like Airflow is using celery.worker_log_server_port to connect to my dask executor to fetch logs from there.
How to configure DaskExecutor to expose log server endpoint?
my configuration:
core remote_logging True
core remote_base_log_folder s3://some-s3-path
core executor DaskExecutor
dask cluster_address 127.0.0.1:8786
celery worker_log_server_port 8793
what i verified:
- verified that the log file exists and is being written to on the executor while the task is running
- called netstat -tunlp on executor container, but did not find any extra port exposed, where logs could be served from.
UPDATE
have a look at serve_logs airflow cli command - I believe it does exactly the same.
We solved the problem by simply starting a python HTTP handler on a worker.
Dockerfile:
RUN mkdir -p $AIRFLOW_HOME/serve
RUN ln -s $AIRFLOW_HOME/logs $AIRFLOW_HOME/serve/log
worker.sh (run by Docker CMD):
#!/usr/bin/env bash
cd $AIRFLOW_HOME/serve
python3 -m http.server 8793 &
cd -
dask-worker $#

ICp 2.1.0.1: Installation failed with error TASK [master: Waiting for MariaDB service to start]

I am installing ICp 2.1.0.1 and I received an error at the TASK
[master: Waiting for MariaDB service to start] msg: The MariaDB
component failed to start.
After this msg the installation completed with failed status.
We are installing ICp with 3 Masters, 3 Proxies and 2 Workers. We have 1 IP for VIP master and 1 for VIP proxy.
I tried to install multiple times and all installations got the same error.
For prior issues with that error, the correct db admin password was not used. So check the db user and password to resolve issue.
Would you validate whether each master host was able to access port 3306 on the other hosts?
If you run with .. install -vv | tee -a install-log.txt, do you get additional details as well?
The error was solved by following the steps below.
Check whether kubelet is running:
Log in to your master node.
Run the following command to check kubelet status:
systemctl status kubelet
If kubelet is not running, run the following command to get the logs:
journalctl -u kubelet &> kubelet.log
We found the error in the kubelet.log log:
Error: failed to run Kubelet: Running with swap on is not supported, please disable swap! or set --fail-swap-on flag to false.
We found this troubleshoot in this link, and the solution at the ICP issue 4651.
https://www.ibm.com/support/knowledgecenter/en/SSBS6K_2.1.0/troubleshoot/etcd_fails.html
https://github.ibm.com/IBMPrivateCloud/roadmap/issues/4651

Riak 1.3.1 will not start on lucid, Ec2 instance

I have installed riak (apt-get) on an EC2 instance, lucid, amd64 with libssl.
When running riak start I get:
Attempting to restart script through sudo -H -u riak
Riak failed to start within 15 seconds,
see the output of 'riak console' for more information.
If you want to wait longer, set the environment variable
WAIT_FOR_ERLANG to the number of seconds to wait.
Running riak console:
Exec: /usr/lib/riak/erts-5.9.1/bin/erlexec -boot /usr/lib/riak/releases/1.3.1/riak
-embedded -config /etc/riak/app.config
-pa /usr/lib/riak/lib/basho-patches
-args_file /etc/riak/vm.args -- console
Root: /usr/lib/riak
Erlang R15B01 (erts-5.9.1) [source] [64-bit] [smp:2:2] [async-threads:64] [kernel-poll:true]
/usr/lib/riak/lib/os_mon-2.2.9/priv/bin/memsup: Erlang has closed.
Erlang has closed
{"Kernel pid terminated",application_controller,"{application_start_failure,riak_core, {shutdown,{riak_core_app,start,[normal,[]]}}}"}
Crash dump was written to: /var/log/riak/erl_crash.dump
Kernel pid terminated (application_controller) ({application_start_failure,riak_core, {shutdown,{riak_core_app,start,[normal,[]]}}})
The error logs:
2013-04-24 11:36:20.897 [error] <0.146.0> CRASH REPORT Process riak_core_handoff_listener with 1 neighbours exited with reason: bad return value: {error,eaddrinuse} in gen_server:init_it/6 line 332
2013-04-24 11:36:20.899 [error] <0.145.0> Supervisor riak_core_handoff_listener_sup had child riak_core_handoff_listener started with riak_core_handoff_listener:start_link() at undefined exit with reason bad return value: {error,eaddrinuse} in context start_error
2013-04-24 11:36:20.902 [error] <0.142.0> Supervisor riak_core_handoff_sup had child riak_core_handoff_listener_sup started with riak_core_handoff_listener_sup:start_link() at undefined exit with reason shutdown in context start_error
2013-04-24 11:36:20.903 [error] <0.130.0> Supervisor riak_core_sup had child riak_core_handoff_sup started with riak_core_handoff_sup:start_link() at undefined exit with reason shutdown in context start_error
I'm new to Riak and basically tried to run through the "Fast Track" docs.
None of the default core IP settings in the configs have been changed. They are still set to {http, [ {"127.0.0.1", 8098 } ]}, {handoff_port, 8099 }
Any help would be greatly appreciated.
I know this is old but there is some solid documentation about the errors in the crash.dump file on the Riak site.

Resources