Riak is Not Starting After Reboot - riak

I'm seeing an IO error on the Riak console. I'm not sure what the cause is, since the directory is owned by the riak user. Here's what the error looks like:
2018-01-25 23:18:06.922 [info] <0.2301.0>#riak_kv_vnode:maybe_create_hashtrees:234 riak_kv/730750818665451459101842416358141509827966271488: unable to start index_hashtree: {error,{{badmatch,{error,{db_open,"IO error: lock /var/lib/riak/anti_entropy/v0/730750818665451459101842416358141509827966271488/LOCK: already held by process"}}},[{hashtree,new_segment_store,2,[{file,"src/hashtree.erl"},{line,725}]},{hashtree,new,2,[{file,"src/hashtree.erl"},{line,246}]},{riak_kv_index_hashtree,do_new_tree,3,[{file,"src/riak_kv_index_hashtree.erl"},{line,712}]},{lists,foldl,3,[{file,"lists.erl"},{line,1248}]},{riak_kv_index_hashtree,init_trees,3,[{file,"src/riak_kv_index_hashtree.erl"},{line,565}]},{riak_kv_index_hashtree,init,1,[{file,"src/riak_kv_index_hashtree.erl"},{line,308}]},{gen_server,init_it,6,[{file,"gen_server.erl"},{line,304}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}}
2018-01-25 23:18:06.927 [info] <0.2315.0>#riak_kv_vnode:maybe_create_hashtrees:234 riak_kv/890602560248518965780370444936484965102833893376: unable to start index_hashtree: {error,{{badmatch,{error,{db_open,"IO error: lock /var/lib/riak/anti_entropy/v0/890602560248518965780370444936484965102833893376/LOCK: already held by process"}}},[{hashtree,new_segment_store,2,[{file,"src/hashtree.erl"},{line,725}]},{hashtree,new,2,[{file,"src/hashtree.erl"},{line,246}]},{riak_kv_index_hashtree,do_new_tree,3,[{file,"src/riak_kv_index_hashtree.erl"},{line,712}]},{lists,foldl,3,[{file,"lists.erl"},{line,1248}]},{riak_kv_index_hashtree,init_trees,3,[{file,"src/riak_kv_index_hashtree.erl"},{line,565}]},{riak_kv_index_hashtree,init,1,[{file,"src/riak_kv_index_hashtree.erl"},{line,308}]},{gen_server,init_it,6,[{file,"gen_server.erl"},{line,304}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}}
2018-01-25 23:18:06.928 [error] <0.27284.0> CRASH REPORT Process <0.27284.0> with 0 neighbours exited with reason: no match of right hand value {error,{db_open,"IO error: lock /var/lib/riak/anti_entropy/v0/890602560248518965780370444936484965102833893376/LOCK: already held by process"}} in hashtree:new_segment_store/2 line 725 in gen_server:init_it/6 line 328
Any ideas on what the problem could be?
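The "LOCK: already held by process" message usually means some other OS process still has the LevelDB lock file open, most often an Erlang VM from a Riak node that was already running (or never fully stopped) before this start attempt. A minimal diagnostic sketch, assuming the standard riak scripts and the anti_entropy paths from the log above:
# is another Riak node already up?
riak ping
ps aux | grep '[b]eam.smp'
# which process holds one of the anti-entropy lock files?
lsof /var/lib/riak/anti_entropy/v0/*/LOCK
# if a stale node is found, stop it cleanly and start Riak again
riak stop
riak start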

Related

Mariadb Galera 10.5.13-16 Node Crash

I have a cluster with 2 Galera nodes and 1 arbitrator.
Node 1 crashed and I don't understand why.
Here is the log from node 1.
It seems to be a problem with the pthread library.
All requests are also proxied through 2 HAProxy instances.
2023-01-03 12:08:55 0 [Warning] WSREP: Handshake failed: peer did not return a certificate
2023-01-03 12:08:55 0 [Warning] WSREP: Handshake failed: peer did not return a certificate
2023-01-03 12:08:56 0 [Warning] WSREP: Handshake failed: http request
terminate called after throwing an instance of 'boost::wrapexcept<std::system_error>'
what(): remote_endpoint: Transport endpoint is not connected
230103 12:08:56 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Server version: 10.5.13-MariaDB-1:10.5.13+maria~focal
key_buffer_size=134217728
read_buffer_size=2097152
max_used_connections=101
max_threads=102
thread_count=106
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 760333 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x49000
mariadbd(my_print_stacktrace+0x32)[0x55b1b67f7e42]
Printing to addr2line failed
mariadbd(handle_fatal_signal+0x485)[0x55b1b62479a5]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7ff88ea983c0]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7ff88e59e18b]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7ff88e57d859]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e911)[0x7ff88e939911]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa38c)[0x7ff88e94538c]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3f7)[0x7ff88e9453f7]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa6a9)[0x7ff88e9456a9]
/usr/lib/galera/libgalera_smm.so(+0x448ad)[0x7ff884b5e8ad]
/usr/lib/galera/libgalera_smm.so(+0x1fc315)[0x7ff884d16315]
/usr/lib/galera/libgalera_smm.so(+0x1ff7eb)[0x7ff884d197eb]
/usr/lib/galera/libgalera_smm.so(+0x1ffc28)[0x7ff884d19c28]
/usr/lib/galera/libgalera_smm.so(+0x2065b6)[0x7ff884d205b6]
/usr/lib/galera/libgalera_smm.so(+0x1f81f3)[0x7ff884d121f3]
/usr/lib/galera/libgalera_smm.so(+0x1e6f04)[0x7ff884d00f04]
/usr/lib/galera/libgalera_smm.so(+0x103438)[0x7ff884c1d438]
/usr/lib/galera/libgalera_smm.so(+0xe8eea)[0x7ff884c02eea]
/usr/lib/galera/libgalera_smm.so(+0xe9a8d)[0x7ff884c03a8d]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x9609)[0x7ff88ea8c609]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7ff88e67a293]
The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
information that should help you find out what is causing the crash.
Writing a core file...
Working directory at /var/lib/mysql
Resource Limits:
Fatal signal 11 while backtracing
PS: if you want more data, ask me :)
OK, it seems that 2 simultaneous OpenVAS scans crash the node.
I tried versions 10.5.13 and 10.5.16 -> crash.
Solution: upgrade to at least 10.5.17.
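For a cluster like this a rolling upgrade is the usual route; here is a minimal sketch for one node, assuming Debian/Ubuntu packages from the MariaDB repository and that the second node plus the arbitrator keep the cluster available while node 1 is down (package and service names are assumptions, adjust to your setup):
# on node 1 only, while node 2 and garbd stay up
apt-get update
apt-cache madison mariadb-server      # confirm a 10.5.17+ build is offered by the repo
systemctl stop mariadb
apt-get install mariadb-server        # upgrades to the newest 10.5.x in the repo
systemctl start mariadb
# make sure the node rejoined and is Synced before touching the other node
mysql -e "SHOW STATUS LIKE 'wsrep_local_state_comment';"
Repeat on the second node once the first one reports Synced.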

Data unpack would read past end of buffer in file util/show_help.c at line 501

I submitted a job via Slurm. The job ran for 12 hours and was working as expected. Then I got "Data unpack would read past end of buffer in file util/show_help.c at line 501". I often get errors like "ORTE has lost communication with a remote daemon", but usually at the beginning of the job. That is annoying, but it does not waste as much time as an error after 12 hours. Is there a quick fix for this? The Open MPI version is 4.0.1.
--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default. The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.
Local host: barbun40
Local adapter: mlx5_0
Local port: 1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: barbun40
Local device: mlx5_0
--------------------------------------------------------------------------
[barbun21.yonetim:48390] [[15284,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in
file util/show_help.c at line 501
[barbun21.yonetim:48390] 127 more processes have sent help message help-mpi-btl-openib.txt / ib port
not selected
[barbun21.yonetim:48390] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error
messages
[barbun21.yonetim:48390] 126 more processes have sent help message help-mpi-btl-openib.txt / error in
device init
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
An MPI communication peer process has unexpectedly disconnected. This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).
Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate. For
example, there may be a core file that you can examine. More
generally: such peer hangups are frequently caused by application bugs
or other external events.
Local host: barbun64
Local PID: 252415
Peer host: barbun39
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[15284,1],35]
Exit code: 9
--------------------------------------------------------------------------
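The error itself is raised in ORTE's help-message handling (util/show_help.c), and the output above is dominated by openib help messages, so one thing worth trying is to stop the openib BTL from generating them, either by switching to UCX as the first warning suggests or by explicitly allowing InfiniBand ports. A sketch of how those MCA parameters could be passed (./my_app is a placeholder for the actual executable):
# option 1: use UCX and disable the openib BTL entirely
mpirun --mca pml ucx --mca btl ^openib ./my_app
# option 2: keep openib but allow it on InfiniBand ports, as the warning suggests
mpirun --mca btl_openib_allow_ib true ./my_app
# to see every help/error message instead of the aggregated summary
mpirun --mca orte_base_help_aggregate 0 --mca pml ucx --mca btl ^openib ./my_app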

Nexus3 is active(exited) and not accessible

My Nexus3 was stuck due to an out-of-space issue. I cleaned some (non-Nexus) directories and started it again; it shows the status below:
# service nexus status;
● nexus.service - LSB: nexus
Loaded: loaded (/etc/init.d/nexus; generated)
Active: active (exited)
In the logs I can see the following:
2019-02-06 18:59:08,550+0100 ERROR [FelixStartLevel] *SYSTEM org.sonatype.nexus.extender.NexusContextListener - Failed to start nexus
com.orientechnologies.orient.core.exception.OStorageException: Cannot open local storage '/opt/nexus/sonatype-work/nexus3/db/component' with mode=rw
DB name="component"
at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.open(OAbstractPaginatedStorage.java:323)
at com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx.open(ODatabaseDocumentTx.java:259)
at org.sonatype.nexus.orient.DatabaseManagerSupport.connect(DatabaseManagerSupport.java:174)
at org.sonatype.nexus.orient.DatabaseInstanceImpl.doStart(DatabaseInstanceImpl.java:56)
at org.sonatype.goodies.lifecycle.LifecycleSupport.start(LifecycleSupport.java:104)
at org.sonatype.goodies.lifecycle.Lifecycles.start(Lifecycles.java:44)
at org.sonatype.nexus.orient.DatabaseManagerSupport.createInstance(DatabaseManagerSupport.java:306)
at java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1688)
at org.sonatype.nexus.orient.DatabaseManagerSupport.instance(DatabaseManagerSupport.java:285)
at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:291)
at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
Caused by: java.io.FileNotFoundException: /opt/nexus/sonatype-work/nexus3/db/component/dirty.fl (Permission denied)
at java.io.RandomAccessFile.open0(Native Method)
at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
But when I do ls, the file is there:
root@XXX:/opt/nexus/sonatype-work/nexus3/db/component# ls -ltrh dirty.fl
-rw-r--r-- 1 root root 2 Feb 6 19:05 dirty.fl
Any clue about what went wrong?
Cannot open local storage '/opt/nexus/sonatype-work/nexus3/db/component' with mode=rw
The file is present, but NXRM can't open it in read-write mode. Since you have already run out of space on your disk, please ensure the disk isn't mounted in read-only mode.
If you're still out of space, move the sonatype-work/nexus3/db/component directory to another location and create a symlink pointing to the new component directory. Keep performance in mind when choosing the new location.
To prevent this from happening in the future, try using Cleanup Policies and periodically running the Compact blob store task.
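A minimal sketch of the move-and-symlink approach, assuming /data has free space and Nexus runs as a nexus service user (both are assumptions, adjust to your layout):
service nexus stop
mv /opt/nexus/sonatype-work/nexus3/db/component /data/nexus3-component
ln -s /data/nexus3-component /opt/nexus/sonatype-work/nexus3/db/component
# the database files must be writable by the user Nexus runs as (assumed here: nexus)
chown -R nexus:nexus /data/nexus3-component
service nexus start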

ArangoDB start failing in centos6

ArangoDB is failing to start with the following error on CentOS 6. I'm using the latest ArangoDB version, arangodb3-3.3.16-1.x86_64.rpm.
[root@vm1 RPM]# service arangodb3 start
Starting /usr/sbin/arangod: : arena 0 background thread creation failed (13)
/etc/init.d/arangodb3: line 43: 3576 Segmentation fault $ARANGO_BIN --uid arangodb --gid arangodb --server.rest-server false --log.foreground-tty false --database.check-version
FATAL ERROR: EXIT_CODE_RESOLVING_FAILED for code 139 - could not resolve exit code 139
[root@vm1 RPM]#
Any help would be really appreciated.
Exit code 139 = 128 + 11 means that the arangod process was killed by signal 11, i.e. it crashed with a segmentation violation (SIGSEGV).
Can you try the following:
Check your server for a faulty memory bank
Run a memory test with memtester
This should fix the problem
There was a similar issue that has been raised on GitHub: https://github.com/arangodb/arangodb/issues/2329
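A minimal memtester invocation might look like this; the amount of memory to lock (1024M) and the number of passes (3) are arbitrary examples, so pick values that fit the machine:
# needs root to be able to lock that much memory
memtester 1024M 3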

Riak 1.3.1 will not start on lucid, Ec2 instance

I have installed Riak (via apt-get) on an EC2 instance running Lucid, amd64, with libssl.
When running riak start I get:
Attempting to restart script through sudo -H -u riak
Riak failed to start within 15 seconds,
see the output of 'riak console' for more information.
If you want to wait longer, set the environment variable
WAIT_FOR_ERLANG to the number of seconds to wait.
Running riak console:
Exec: /usr/lib/riak/erts-5.9.1/bin/erlexec -boot /usr/lib/riak/releases/1.3.1/riak
-embedded -config /etc/riak/app.config
-pa /usr/lib/riak/lib/basho-patches
-args_file /etc/riak/vm.args -- console
Root: /usr/lib/riak
Erlang R15B01 (erts-5.9.1) [source] [64-bit] [smp:2:2] [async-threads:64] [kernel-poll:true]
/usr/lib/riak/lib/os_mon-2.2.9/priv/bin/memsup: Erlang has closed.
Erlang has closed
{"Kernel pid terminated",application_controller,"{application_start_failure,riak_core, {shutdown,{riak_core_app,start,[normal,[]]}}}"}
Crash dump was written to: /var/log/riak/erl_crash.dump
Kernel pid terminated (application_controller) ({application_start_failure,riak_core, {shutdown,{riak_core_app,start,[normal,[]]}}})
The error logs:
2013-04-24 11:36:20.897 [error] <0.146.0> CRASH REPORT Process riak_core_handoff_listener with 1 neighbours exited with reason: bad return value: {error,eaddrinuse} in gen_server:init_it/6 line 332
2013-04-24 11:36:20.899 [error] <0.145.0> Supervisor riak_core_handoff_listener_sup had child riak_core_handoff_listener started with riak_core_handoff_listener:start_link() at undefined exit with reason bad return value: {error,eaddrinuse} in context start_error
2013-04-24 11:36:20.902 [error] <0.142.0> Supervisor riak_core_handoff_sup had child riak_core_handoff_listener_sup started with riak_core_handoff_listener_sup:start_link() at undefined exit with reason shutdown in context start_error
2013-04-24 11:36:20.903 [error] <0.130.0> Supervisor riak_core_sup had child riak_core_handoff_sup started with riak_core_handoff_sup:start_link() at undefined exit with reason shutdown in context start_error
I'm new to Riak and basically tried to run through the "Fast Track" docs.
None of the default core IP settings in the configs have been changed; they are still set to {http, [ {"127.0.0.1", 8098 } ]} and {handoff_port, 8099 }.
Any help would be greatly appreciated.
I know this is old, but there is some solid documentation on the Riak site about the errors in the crash.dump file.
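Also, the crash report shows {error,eaddrinuse} for riak_core_handoff_listener, which means something is already bound to the handoff port (8099 in the config above), often a Riak node that never fully stopped. A small diagnostic sketch, assuming standard tools are available on the instance:
# what is already listening on the handoff port?
netstat -lnp | grep 8099      # or: lsof -i :8099
# is a leftover Erlang VM still running?
ps aux | grep '[b]eam.smp'
# stop any old node before starting again
riak stop
riak start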

Resources