Erlang error message: 'global' at node 'X' requested disconnect from node 'Y' in order to prevent overlapping partitions. Why?

I have 135 nodes distributed over 8 CPUs. After upgrading to OTP 25, the default of the kernel environment variable changed to {prevent_overlapping_partitions,true}, which is ok.
I start all the nodes properly, just as I did before the OTP 25 upgrade, but I get this error:
'global' at node 'X' requested disconnect from node 'Y' in order to prevent overlapping partitions
What can cause the nodes to disconnect now when they were not disconnecting before the update? What can I do to solve this?
Thank you.
To work around the problem I created a node-monitoring routine that registers the nodes and uses net_kernel to reconnect a node if it disconnects. I also made sure that all the nodes were restarted so that I had a "clean" starting point. At some point, "out of nowhere", the errors start popping up and the nodes start to disconnect, which did not happen before.
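For reference, a minimal sketch of the two knobs involved, with illustrative names: the kernel parameter can be switched back to its pre-OTP 25 value in sys.config (or with -kernel prevent_overlapping_partitions false on the command line) while investigating, keeping in mind that this reintroduces the overlapping-partition behaviour the option exists to prevent, and a reconnect loop in the spirit of the routine described above can be built on net_kernel:monitor_nodes/1.

    %% sys.config -- restore the pre-OTP 25 behaviour of global
    [
      {kernel, [{prevent_overlapping_partitions, false}]}
    ].

    %% node_watch.erl -- sketch of a reconnect loop (module name is illustrative)
    -module(node_watch).
    -export([start/1]).

    %% Nodes is the list of node names that should stay connected.
    start(Nodes) ->
        spawn(fun() ->
            ok = net_kernel:monitor_nodes(true),
            [net_kernel:connect_node(N) || N <- Nodes],
            loop()
        end).

    loop() ->
        receive
            {nodedown, Node} ->
                error_logger:warning_msg("~p went down, reconnecting~n", [Node]),
                net_kernel:connect_node(Node),
                loop();
            {nodeup, _Node} ->
                loop()
        end.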

Related

FusionAuth entering silent mode

We are running FusionAuth with MariaDB in a cluster of 3 nodes. FusionAuth starts on 2 of the nodes and enters silent mode on the third. Is there a way to increase the number of attempts, or a way to get it running on all three nodes without one of them entering silent mode?
FusionAuth will enter silent mode if the proper environment variables are defined, and if it needs to complete configuration.
If only one of the 3 nodes is entering silent mode, perhaps that node does not have the correct JDBC connection settings, or has a network issue of some kind that keeps it from reaching the search and database services on startup?
https://fusionauth.io/docs/v1/tech/installation-guide/docker
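For reference, a sketch of the per-node environment that drives silent configuration in the Docker setup from the link above. The variable names follow the docker-compose example in those docs, but they have changed between FusionAuth versions, so treat them as an assumption to verify against your release:

    fusionauth:
      image: fusionauth/fusionauth-app:latest
      environment:
        # The JDBC URL must be reachable from this node; a wrong host or
        # blocked port on one node would make only that node misbehave.
        DATABASE_URL: jdbc:mysql://db:3306/fusionauth
        DATABASE_ROOT_USERNAME: root
        DATABASE_ROOT_PASSWORD: example
        DATABASE_USERNAME: fusionauth
        DATABASE_PASSWORD: example
        # The search service must also resolve and accept connections
        # from every node at startup.
        SEARCH_SERVERS: http://search:9200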

MariaDB 10.1.33 keeps crashing

I have a standard master/multiple slave set up on CentOS 7 running on EC2.
I actually have three identical slaves (all spawned off the same AMI), but only one is crashing about once a day. I've posted the dump from error.log below, as well as the query log referencing the connection ID in the error dump.
I've looked at the MariaDB docs, but all they point to is resolve_stack_dump, with no real help on how to figure things out from there.
The slave that is crashing is running many batch-like queries, but according to the dump logs, the last connection id is never one of the connections running queries.
For this slave, I have the system turn off slave updates (SQL_THREAD), run queries for 15 minutes, stop the queries, start the slave until it has caught up, stop the slave updates again, and restart the queries. Repeat (the cycle is sketched in SQL below). This code had been working pretty much non-stop and crash-free for years on my colo setup before I moved to AWS.
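Roughly, that cycle in SQL (a sketch; the batch queries and the 15-minute scheduling are handled by external tooling):

    -- pause applying replicated events while the batch runs
    STOP SLAVE SQL_THREAD;

    -- ... run the batch queries for ~15 minutes, then stop them ...

    -- let replication catch up, then pause it again for the next round
    START SLAVE SQL_THREAD;
    SHOW SLAVE STATUS;   -- poll until Seconds_Behind_Master reaches 0
    STOP SLAVE SQL_THREAD;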
My other two cloned slaves only run the replication queries as a hot-spare of the master (which I've never needed to use). Those servers never crash.
thanks.
Error.log crash dump:
180618 13:12:46 [ERROR] mysqld got signal 11 ; This could be because
you hit a bug. It is also possible that this binary or one of the
libraries it was linked against is corrupt, improperly built, or
misconfigured. This error can also be caused by malfunctioning
hardware.
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, something is
definitely wrong and this may fail.
Server version: 10.1.33-MariaDB
key_buffer_size=268431360
read_buffer_size=268431360
max_used_connections=30
max_threads=42
thread_count=11
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 22282919 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
Thread pointer: 0x7f4209f1c008
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f4348db90b0 thread_stack 0x48400
/usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x55c19a7be10e]
/usr/sbin/mysqld(handle_fatal_signal+0x305)[0x55c19a2e1295]
sigaction.c:0(__restore_rt)[0x7f4348a835e0]
sql/sql_class.h:3406(sql_set_variables(THD*, List, bool))[0x55c19a0d2ecd]
sql/sql_list.h:179(base_list::empty())[0x55c19a14bcb8]
sql/sql_parse.cc:2007(dispatch_command(enum_server_command, THD, char*, unsigned int))[0x55c19a15e85a]
sql/sql_parse.cc:1122(do_command(THD*))[0x55c19a160f37]
sql/sql_connect.cc:1330(do_handle_one_connection(THD*))[0x55c19a22d6da]
sql/sql_connect.cc:1244(handle_one_connection)[0x55c19a22d880]
pthread_create.c:0(start_thread)[0x7f4348a7be25]
/lib64/libc.so.6(clone+0x6d)[0x7f4346e1f34d]
Trying to get some variables. Some pointers may be invalid and cause
the dump to abort.
Query (0x0):
Connection ID (thread ID): 15894
Status: NOT_KILLED
Optimizer switch:
index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=off,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=off
Query Log for connection ID:
180618 13:11:01
15894 Connect ****#piper**** as anonymous on
15894 Query show status
15894 Prepare show full processlist /* m6clone1 */
15894 Execute show full processlist /* m6clone1 */
15894 Close stmt
15894 Query show slave status
15894 Query show variables
15894 Quit

Starting all nodes in galera at once

I have a Galera cluster of three nodes. If I shut down the three virtual machines and start them all at once, systemd will automatically start MariaDB on each of the virtual machines.
Sometimes all of the MariaDB instances start at once, and this results in a broken cluster,
which I then have to reinitialize using galera_new_cluster.
The question is: why does starting all the MariaDB instances at once break the cluster?
Thank you
Whenever you start a node, it either starts as the first node in the cluster (initiates a new cluster), or it attempts to connect to existing nodes using wsrep_cluster_address. The behavior depends on the node's options.
So, every time you shut down or lose all nodes and start them again, there is nothing to connect to, and you need to start a new cluster. galera_new_cluster does that by starting a node with the --wsrep-new-cluster option, which overrides the current value of wsrep_cluster_address.
If it sometimes works for you automatically, it most likely means that one of your nodes is permanently configured as the "first node", either via wsrep_cluster_address=gcomm:// or via wsrep-new-cluster. That is a wrong setup in itself: if you lose or shut down only this node and have to restart it, it won't join the remaining nodes in the cluster, it will create a new one.
When you start all nodes at once, you create a race condition. If your "first node" comes up first and initializes quickly enough, it will create a new cluster, and other nodes will join it. If another node comes up first, it won't be able to join anything, thus you get a "broken cluster".
You can find more information on restarting the whole cluster here:
http://galeracluster.com/documentation-webpages/restartingcluster.html
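For illustration, the kind of setting being discussed, as it would appear in my.cnf or a Galera config include (the addresses and provider path are placeholders; paths vary by distribution):

    [galera]
    wsrep_on = ON
    wsrep_provider = /usr/lib/galera/libgalera_smm.so
    # List every node here. With a non-empty address list, a normally
    # started node only tries to join an existing cluster; it never
    # bootstraps a new one unless told to (e.g. via galera_new_cluster).
    wsrep_cluster_address = gcomm://10.0.0.1,10.0.0.2,10.0.0.3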
The recommended way of bootstrapping the cluster is to start the most advanced node first, then the second and the third. To find it, check the sequence number (seqno) of each node in its grastate.dat file and compare the values.
If you follow these steps, your cluster will not break.
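A sketch of that full-cluster restart sequence, using the default data directory path (adjust for your installation):

    # On each node, check how advanced it is before bootstrapping.
    cat /var/lib/mysql/grastate.dat    # compare seqno (and safe_to_bootstrap, if present)

    # On the node with the highest seqno (or safe_to_bootstrap: 1):
    galera_new_cluster

    # Then, one at a time, on the remaining nodes:
    systemctl start mariadb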

Very slow Riak writes and this error: {shutdown,max_concurrency}

On a 5-node Riak cluster, we have observed very slow writes - about 2 docs per second. Upon investigation, I noticed that some of the nodes were low on disk space. After making more space available and restarting the nodes, we are seeing this error (or something similar) on all of the nodes in console.log:
2015-02-20 16:16:29.694 [info] <0.161.0>#riak_core_handoff_manager:handle_info:282 An outbound handoff of partition riak_kv_vnode 182687704666362864775460604089535377456991567872 was terminated for reason: {shutdown,max_concurrency}
Currently, the cluster is not being written to or read from.
I would appreciate any help in getting the cluster back to good health.
I will add that we are writing documents to an index that is tied to a Solr index.
This is not critical production data, and I could technically wipe everything and start fresh, but I would like to properly diagnose and fix the issue so that I am prepared to handle it if it should happen in a production environment in the future.
Thanks!

OpenMPI fault tolerance

I have an assignment to implement simple fault tolerance in an OpenMPI application. The problem we are having is that, despite setting the MPI error handler to MPI_ERRORS_RETURN, when one of our nodes is unplugged from the cluster we get the following error on the next MPI call, after a lengthy hang:
[btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect] connect() failed: Connection timed out (110)
My take from this is that it is not possible to continue processing on all other nodes when one node drops from the network with OpenMPI. Can anyone confirm this for me, or point me in a direction for preventing the btl_tcp_endpoint error?
We are using OpenMPI version 1.6.5.
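For context, a minimal sketch of the error-handler setup the question describes; the communicator and the particular collective call are only illustrative:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        /* Ask MPI to return error codes instead of aborting the whole job. */
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int value = rank;
        int rc = MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (rc != MPI_SUCCESS) {
            char msg[MPI_MAX_ERROR_STRING];
            int len;
            MPI_Error_string(rc, msg, &len);
            fprintf(stderr, "rank %d: MPI_Bcast failed: %s\n", rank, msg);
            /* Whether the communicator is still usable at this point is
               implementation-dependent; see the answer below. */
        }

        MPI_Finalize();
        return 0;
    }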
The MPI_ERRORS_RETURN code paths are not well tested (and probably not well implemented) in Open MPI. They simply haven't been a priority, so we've never really done much work in this area.
Sorry.
