Riak is Not Starting After Reboot
I'm seeing an IO error on the Riak console. I'm not sure what the cause is, since the owner of the directory is the riak user. Here's what the error looks like:
2018-01-25 23:18:06.922 [info] <0.2301.0>#riak_kv_vnode:maybe_create_hashtrees:234 riak_kv/730750818665451459101842416358141509827966271488: unable to start index_hashtree: {error,{{badmatch,{error,{db_open,"IO error: lock /var/lib/riak/anti_entropy/v0/730750818665451459101842416358141509827966271488/LOCK: already held by process"}}},[{hashtree,new_segment_store,2,[{file,"src/hashtree.erl"},{line,725}]},{hashtree,new,2,[{file,"src/hashtree.erl"},{line,246}]},{riak_kv_index_hashtree,do_new_tree,3,[{file,"src/riak_kv_index_hashtree.erl"},{line,712}]},{lists,foldl,3,[{file,"lists.erl"},{line,1248}]},{riak_kv_index_hashtree,init_trees,3,[{file,"src/riak_kv_index_hashtree.erl"},{line,565}]},{riak_kv_index_hashtree,init,1,[{file,"src/riak_kv_index_hashtree.erl"},{line,308}]},{gen_server,init_it,6,[{file,"gen_server.erl"},{line,304}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}}
2018-01-25 23:18:06.927 [info] <0.2315.0>#riak_kv_vnode:maybe_create_hashtrees:234 riak_kv/890602560248518965780370444936484965102833893376: unable to start index_hashtree: {error,{{badmatch,{error,{db_open,"IO error: lock /var/lib/riak/anti_entropy/v0/890602560248518965780370444936484965102833893376/LOCK: already held by process"}}},[{hashtree,new_segment_store,2,[{file,"src/hashtree.erl"},{line,725}]},{hashtree,new,2,[{file,"src/hashtree.erl"},{line,246}]},{riak_kv_index_hashtree,do_new_tree,3,[{file,"src/riak_kv_index_hashtree.erl"},{line,712}]},{lists,foldl,3,[{file,"lists.erl"},{line,1248}]},{riak_kv_index_hashtree,init_trees,3,[{file,"src/riak_kv_index_hashtree.erl"},{line,565}]},{riak_kv_index_hashtree,init,1,[{file,"src/riak_kv_index_hashtree.erl"},{line,308}]},{gen_server,init_it,6,[{file,"gen_server.erl"},{line,304}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}}
2018-01-25 23:18:06.928 [error] <0.27284.0> CRASH REPORT Process <0.27284.0> with 0 neighbours exited with reason: no match of right hand value {error,{db_open,"IO error: lock /var/lib/riak/anti_entropy/v0/890602560248518965780370444936484965102833893376/LOCK: already held by process"}} in hashtree:new_segment_store/2 line 725 in gen_server:init_it/6 line 328
Any ideas on what the problem could be?
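For reference, a quick way to see whether anything is still holding that LevelDB LOCK file after the reboot would look roughly like this (assuming lsof is installed; the path is copied from the first log line above):

# Check whether any process still holds the LOCK file from the log above
sudo lsof /var/lib/riak/anti_entropy/v0/730750818665451459101842416358141509827966271488/LOCK
# Check for leftover Erlang/Riak processes that survived the reboot
ps aux | grep -E 'beam\.smp|riak' | grep -v grep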
Related
MariaDB Galera 10.5.13-16 Node Crash
I have a cluster with 2 Galera nodes and 1 arbitrator. My node 1 crashed and I don't understand why; it looks like a problem with the pthread library. All requests are proxied through 2 HAProxy instances. Here is the log of node 1:

2023-01-03 12:08:55 0 [Warning] WSREP: Handshake failed: peer did not return a certificate
2023-01-03 12:08:55 0 [Warning] WSREP: Handshake failed: peer did not return a certificate
2023-01-03 12:08:56 0 [Warning] WSREP: Handshake failed: http request
terminate called after throwing an instance of 'boost::wrapexcept<std::system_error>'
  what():  remote_endpoint: Transport endpoint is not connected
230103 12:08:56 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Server version: 10.5.13-MariaDB-1:10.5.13+maria~focal
key_buffer_size=134217728
read_buffer_size=2097152
max_used_connections=101
max_threads=102
thread_count=106
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 760333 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x49000
mariadbd(my_print_stacktrace+0x32)[0x55b1b67f7e42]
Printing to addr2line failed
mariadbd(handle_fatal_signal+0x485)[0x55b1b62479a5]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7ff88ea983c0]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7ff88e59e18b]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7ff88e57d859]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e911)[0x7ff88e939911]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa38c)[0x7ff88e94538c]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3f7)[0x7ff88e9453f7]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa6a9)[0x7ff88e9456a9]
/usr/lib/galera/libgalera_smm.so(+0x448ad)[0x7ff884b5e8ad]
/usr/lib/galera/libgalera_smm.so(+0x1fc315)[0x7ff884d16315]
/usr/lib/galera/libgalera_smm.so(+0x1ff7eb)[0x7ff884d197eb]
/usr/lib/galera/libgalera_smm.so(+0x1ffc28)[0x7ff884d19c28]
/usr/lib/galera/libgalera_smm.so(+0x2065b6)[0x7ff884d205b6]
/usr/lib/galera/libgalera_smm.so(+0x1f81f3)[0x7ff884d121f3]
/usr/lib/galera/libgalera_smm.so(+0x1e6f04)[0x7ff884d00f04]
/usr/lib/galera/libgalera_smm.so(+0x103438)[0x7ff884c1d438]
/usr/lib/galera/libgalera_smm.so(+0xe8eea)[0x7ff884c02eea]
/usr/lib/galera/libgalera_smm.so(+0xe9a8d)[0x7ff884c03a8d]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x9609)[0x7ff88ea8c609]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7ff88e67a293]
The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/
contains information that should help you find out what is causing the crash.
Writing a core file...
Working directory at /var/lib/mysql
Resource Limits:
Fatal signal 11 while backtracing

PS: if you want more data, ask me :)
OK, it seems that two simultaneous OpenVAS scans crash the node. I tried versions 10.5.13 and 10.5.16, and both crashed. Solution: upgrade to at least 10.5.17.
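A rough sketch of that upgrade on Ubuntu focal (the crash report shows 10.5.13+maria~focal), assuming the MariaDB 10.5 apt repository is already configured; package names may differ on other distributions:

# Upgrade MariaDB and Galera to the latest 10.5.x packages in the configured repo
sudo apt-get update
sudo apt-get install --only-upgrade mariadb-server mariadb-client galera-4
# Restart the node and let it rejoin the cluster
sudo systemctl restart mariadb
# Confirm the server is now at 10.5.17 or later
mysql -e "SELECT VERSION();"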
Data unpack would read past end of buffer in file util/show_help.c at line 501
I submitted a job via Slurm. The job ran for 12 hours and was working as expected. Then I got "Data unpack would read past end of buffer in file util/show_help.c at line 501". It is usual for me to get errors like "ORTE has lost communication with a remote daemon", but I usually get those at the beginning of the job. It is annoying, but it does not cause as much time loss as getting an error after 12 hours. Is there a quick fix for this? The Open MPI version is 4.0.1.

--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device are not
used by default. The intent is to use UCX for these devices. You can override
this policy by setting the btl_openib_allow_ib MCA parameter to true.
  Local host:    barbun40
  Local adapter: mlx5_0
  Local port:    1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
  Local host:   barbun40
  Local device: mlx5_0
--------------------------------------------------------------------------
[barbun21.yonetim:48390] [[15284,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file util/show_help.c at line 501
[barbun21.yonetim:48390] 127 more processes have sent help message help-mpi-btl-openib.txt / ib port not selected
[barbun21.yonetim:48390] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[barbun21.yonetim:48390] 126 more processes have sent help message help-mpi-btl-openib.txt / error in device init
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned a non-zero exit code.
Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
An MPI communication peer process has unexpectedly disconnected. This usually
indicates a failure in the peer process (e.g., a crash or otherwise exiting
without calling MPI_FINALIZE first).
Although this local MPI process will likely now behave unpredictably (it may
even hang or crash), the root cause of this problem is the failure of the
peer -- that is what you need to investigate. For example, there may be a
core file that you can examine. More generally: such peer hangups are
frequently caused by application bugs or other external events.
  Local host: barbun64
  Local PID:  252415
  Peer host:  barbun39
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus
causing the job to be terminated. The first process to do so was:
  Process name: [[15284,1],35]
  Exit code:    9
--------------------------------------------------------------------------
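For context, the first two blocks above are about the openib BTL being initialized even though UCX is the intended transport on Open MPI 4.0+. A hedged example of explicitly selecting UCX and excluding openib, which at least removes those warnings (assuming this Open MPI 4.0.1 build has UCX support; the process count and ./my_app are placeholders):

# Force the UCX PML and keep the openib BTL out of the picture
mpirun --mca pml ucx --mca btl ^openib -np 128 ./my_app
# To see every help/error message instead of the aggregated summary
mpirun --mca orte_base_help_aggregate 0 -np 128 ./my_app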
Nexus3 is active (exited) and not accessible
My Nexus3 instance got stuck due to an out-of-space issue. I cleaned up some (non-Nexus) directories and started it again, and the status now shows:

# service nexus status
● nexus.service - LSB: nexus
   Loaded: loaded (/etc/init.d/nexus; generated)
   Active: active (exited)

In the logs I can see the following:

2019-02-06 18:59:08,550+0100 ERROR [FelixStartLevel] *SYSTEM org.sonatype.nexus.extender.NexusContextListener - Failed to start nexus
com.orientechnologies.orient.core.exception.OStorageException: Cannot open local storage '/opt/nexus/sonatype-work/nexus3/db/component' with mode=rw
    DB name="component"
    at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.open(OAbstractPaginatedStorage.java:323)
    at com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx.open(ODatabaseDocumentTx.java:259)
    at org.sonatype.nexus.orient.DatabaseManagerSupport.connect(DatabaseManagerSupport.java:174)
    at org.sonatype.nexus.orient.DatabaseInstanceImpl.doStart(DatabaseInstanceImpl.java:56)
    at org.sonatype.goodies.lifecycle.LifecycleSupport.start(LifecycleSupport.java:104)
    at org.sonatype.goodies.lifecycle.Lifecycles.start(Lifecycles.java:44)
    at org.sonatype.nexus.orient.DatabaseManagerSupport.createInstance(DatabaseManagerSupport.java:306)
    at java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1688)
    at org.sonatype.nexus.orient.DatabaseManagerSupport.instance(DatabaseManagerSupport.java:285)
    at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
    at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
    at java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:291)
    at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731)
    at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
    at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
    at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
    at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
Caused by: java.io.FileNotFoundException: /opt/nexus/sonatype-work/nexus3/db/component/dirty.fl (Permission denied)
    at java.io.RandomAccessFile.open0(Native Method)
    at java.io.RandomAccessFile.open(RandomAccessFile.java:316)

But when I do ls, it shows the file is there:

root@XXX:/opt/nexus/sonatype-work/nexus3/db/component# ls -ltrh dirty.fl
-rw-r--r-- 1 root root 2 Feb  6 19:05 dirty.fl

Any clue what went wrong?
Cannot open local storage '/opt/nexus/sonatype-work/nexus3/db/component' with mode=rw

The file is present, but NXRM can't open it in read-write mode. Since you have already run out of space on your disk, please ensure the disk isn't mounted read-only. If you're still out of space, move the sonatype-work/nexus3/db/component directory to another location and create a symlink pointing to the new component directory. Keep performance in mind when choosing the new location. To prevent this from happening in the future, try using Cleanup Policies and periodically running the Compact blob store task.
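A rough sketch of the move-and-symlink approach, assuming Nexus is stopped first, that /data is an example mount point with free space, and that the service runs as a user named nexus (adjust paths and the user to your installation):

# Stop Nexus before touching the database directory
sudo service nexus stop
# Move the component database to a volume with free space (example path)
sudo mv /opt/nexus/sonatype-work/nexus3/db/component /data/nexus3-component
# Point the old location at the new one
sudo ln -s /data/nexus3-component /opt/nexus/sonatype-work/nexus3/db/component
# The ls output in the question shows dirty.fl owned by root, so also make sure
# the user Nexus runs as owns the moved files
sudo chown -R nexus:nexus /data/nexus3-component
sudo service nexus start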
ArangoDB start failing on CentOS 6
ArangoDB is failing to start on CentOS 6 with the following error. I'm using the latest ArangoDB version, arangodb3-3.3.16-1.x86_64.rpm.

[root@vm1 RPM]# service arangodb3 start
Starting /usr/sbin/arangod: : arena 0 background thread creation failed (13)
/etc/init.d/arangodb3: line 43:  3576 Segmentation fault      $ARANGO_BIN --uid arangodb --gid arangodb --server.rest-server false --log.foreground-tty false --database.check-version
FATAL ERROR: EXIT_CODE_RESOLVING_FAILED for code 139 - could not resolve exit code 139
[root@vm1 RPM]#

Any help will be really appreciated.
The exit code 139 = 128 + 11 means that the arangod process crashed with a segmentation violation (signal 11). Can you try the following: check your server for a faulty memory bank, and run a memory test with memtester. This should fix the problem. A similar issue was raised on GitHub: https://github.com/arangodb/arangodb/issues/2329
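A minimal memtester run, assuming the package is installed (for example from EPEL on CentOS 6) and that roughly 1 GB of RAM is free to exercise:

# Test 1 GB of memory for 3 passes; any reported failure points to bad RAM
sudo memtester 1024M 3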
Riak 1.3.1 will not start on Lucid EC2 instance
I have installed Riak (via apt-get) on an EC2 instance, Lucid, amd64, with libssl. When running riak start I get:

Attempting to restart script through sudo -H -u riak
Riak failed to start within 15 seconds,
see the output of 'riak console' for more information.
If you want to wait longer, set the environment variable
WAIT_FOR_ERLANG to the number of seconds to wait.

Running riak console:

Exec: /usr/lib/riak/erts-5.9.1/bin/erlexec -boot /usr/lib/riak/releases/1.3.1/riak -embedded -config /etc/riak/app.config -pa /usr/lib/riak/lib/basho-patches -args_file /etc/riak/vm.args -- console
Root: /usr/lib/riak
Erlang R15B01 (erts-5.9.1) [source] [64-bit] [smp:2:2] [async-threads:64] [kernel-poll:true]
/usr/lib/riak/lib/os_mon-2.2.9/priv/bin/memsup: Erlang has closed.
Erlang has closed
{"Kernel pid terminated",application_controller,"{application_start_failure,riak_core,{shutdown,{riak_core_app,start,[normal,[]]}}}"}
Crash dump was written to: /var/log/riak/erl_crash.dump
Kernel pid terminated (application_controller) ({application_start_failure,riak_core,{shutdown,{riak_core_app,start,[normal,[]]}}})

The error logs:

2013-04-24 11:36:20.897 [error] <0.146.0> CRASH REPORT Process riak_core_handoff_listener with 1 neighbours exited with reason: bad return value: {error,eaddrinuse} in gen_server:init_it/6 line 332
2013-04-24 11:36:20.899 [error] <0.145.0> Supervisor riak_core_handoff_listener_sup had child riak_core_handoff_listener started with riak_core_handoff_listener:start_link() at undefined exit with reason bad return value: {error,eaddrinuse} in context start_error
2013-04-24 11:36:20.902 [error] <0.142.0> Supervisor riak_core_handoff_sup had child riak_core_handoff_listener_sup started with riak_core_handoff_listener_sup:start_link() at undefined exit with reason shutdown in context start_error
2013-04-24 11:36:20.903 [error] <0.130.0> Supervisor riak_core_sup had child riak_core_handoff_sup started with riak_core_handoff_sup:start_link() at undefined exit with reason shutdown in context start_error

I'm new to Riak and basically tried to run through the "Fast Track" docs. None of the default core IP settings in the configs have been changed. They are still set to:

{http, [ {"127.0.0.1", 8098 } ]},
{handoff_port, 8099 }

Any help would be greatly appreciated.
I know this is old, but there is some solid documentation about the errors in the crash.dump file on the Riak site.
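Since the crash report shows {error,eaddrinuse} on riak_core_handoff_listener, it is also worth checking whether something, for example a leftover Erlang VM from a previous start, is already bound to the handoff port (8099 in the config quoted above). A rough check, assuming netstat is available:

# See what is already listening on the Riak HTTP and handoff ports
sudo netstat -tlnp | grep -E ':(8098|8099)'
# Look for a stray Riak/Erlang VM still running
ps aux | grep beam.smp | grep -v grep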