Riak 1.3.1 will not start on Lucid EC2 instance

I have installed Riak (via apt-get) on an EC2 instance (Lucid, amd64) with libssl.
When running riak start I get:
Attempting to restart script through sudo -H -u riak
Riak failed to start within 15 seconds,
see the output of 'riak console' for more information.
If you want to wait longer, set the environment variable
WAIT_FOR_ERLANG to the number of seconds to wait.
Running riak console:
Exec: /usr/lib/riak/erts-5.9.1/bin/erlexec -boot /usr/lib/riak/releases/1.3.1/riak
-embedded -config /etc/riak/app.config
-pa /usr/lib/riak/lib/basho-patches
-args_file /etc/riak/vm.args -- console
Root: /usr/lib/riak
Erlang R15B01 (erts-5.9.1) [source] [64-bit] [smp:2:2] [async-threads:64] [kernel-poll:true]
/usr/lib/riak/lib/os_mon-2.2.9/priv/bin/memsup: Erlang has closed.
Erlang has closed
{"Kernel pid terminated",application_controller,"{application_start_failure,riak_core, {shutdown,{riak_core_app,start,[normal,[]]}}}"}
Crash dump was written to: /var/log/riak/erl_crash.dump
Kernel pid terminated (application_controller) ({application_start_failure,riak_core, {shutdown,{riak_core_app,start,[normal,[]]}}})
The error logs:
2013-04-24 11:36:20.897 [error] <0.146.0> CRASH REPORT Process riak_core_handoff_listener with 1 neighbours exited with reason: bad return value: {error,eaddrinuse} in gen_server:init_it/6 line 332
2013-04-24 11:36:20.899 [error] <0.145.0> Supervisor riak_core_handoff_listener_sup had child riak_core_handoff_listener started with riak_core_handoff_listener:start_link() at undefined exit with reason bad return value: {error,eaddrinuse} in context start_error
2013-04-24 11:36:20.902 [error] <0.142.0> Supervisor riak_core_handoff_sup had child riak_core_handoff_listener_sup started with riak_core_handoff_listener_sup:start_link() at undefined exit with reason shutdown in context start_error
2013-04-24 11:36:20.903 [error] <0.130.0> Supervisor riak_core_sup had child riak_core_handoff_sup started with riak_core_handoff_sup:start_link() at undefined exit with reason shutdown in context start_error
I'm new to Riak and basically tried to run through the "Fast Track" docs.
None of the default core IP settings in the configs have been changed. They are still set to {http, [ {"127.0.0.1", 8098 } ]}, {handoff_port, 8099 }
Any help would be greatly appreciated.

I know this is old but there is some solid documentation about the errors in the crash.dump file on the Riak site.
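Since the crash report shows {error,eaddrinuse} for riak_core_handoff_listener, something is most likely already bound to the handoff port 8099 - often a previous Riak instance that never fully stopped. A quick check, assuming netstat or lsof is available on the instance:
sudo netstat -tlnp | grep 8099   # show whatever process is listening on the handoff port
sudo lsof -i :8099               # alternative view of the same information
If an old Riak/beam.smp process turns up, stop it (riak stop, or kill the PID) and start Riak again.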

Related

Setting repmgr witness node on Debian

I am trying to set up repmgr version 5 on Debian with PostgreSQL 11.
The documentation seems more oriented towards CentOS/RHEL.
When I try to set up the witness node and start the repmgr daemon, I get an error, with no idea where to look to find out what is causing it.
This is my repmgr.conf file:
node_id=3
node_name='PG-Node-Witness'
conninfo='host=10.97.7.140 user=repmgr dbname=repmgr connect_timeout=2'
data_directory='/var/lib/postgresql/11/main'
failover='automatic'
promote_command='/usr/bin/repmgr standby promote -f /etc/repmgr.conf --log-to-file'
follow_command='/usr/bin/repmgr standby follow -f /etc/repmgr.conf --log-to-file --upstream-node-id=%n'
priority=60
monitor_interval_secs=2
connection_check_type='ping'
reconnect_attempts=6
reconnect_interval=8
primary_visibility_consensus=true
standby_disconnect_on_failover=true
repmgrd_service_start_command='sudo /etc/init.d/repmgrd start' #??????
repmgrd_service_stop_command='sudo //etc/init.d/repmgrd stop'#??????
service_start_command='sudo /usr/bin/systemctl start postgresql@11-main.service'
service_stop_command='sudo /usr/bin/systemctl stop postgresql@11-main.service'
service_restart_command='sudo /usr/bin/systemctl restart postgresql@11-main.service'
service_reload_command='sudo /usr/bin/systemctl relaod postgresql@11-main.service'
monitoring_history=yes
log_status_interval=60
register is OK:
repmgr -f /etc/repmgr.conf witness register -h 10.97.7.97
INFO: connecting to witness node "PG-Node-Witness" (ID: 3)
INFO: connecting to primary node
NOTICE: attempting to install extension "repmgr"
NOTICE: "repmgr" extension successfully installed
INFO: witness registration complete
NOTICE: witness node "PG-Node-Witness" (ID: 3) successfully registered
repmgr daemon dry-run OK too:
$ repmgr -f /etc/repmgr.conf daemon start --dry-run
INFO: prerequisites for starting repmgrd met
DETAIL: following command would be executed:
sudo /usr/bin/systemctl start postg...@11-main.service
I set up /etc/default/repmgrd with:
REPMGRD_ENABLED=yes
and
REPMGRD_CONF="/etc/repmgr.conf"
But I still get an error when trying to run daemon start:
$ repmgr -f /etc/repmgr.conf daemon start
I get:
NOTICE: executing: "sudo /etc/init.d/repmgrd start"
ERROR: repmgrd does not appear to have started after 15 seconds
HINT: use "repmgr service status" to confirm that repmgrd was successfully started
It is recommended to run repmgrd as a systemd service.
According to the docs (for Debian), you may first need to configure /etc/default/repmgrd.
My configuration looks like this:
# default settings for repmgrd. This file is sourced by /bin/sh from
# /etc/init.d/repmgrd
# disable repmgrd by default so it won't get started upon installation
# valid values: yes/no
REPMGRD_ENABLED=yes
# configuration file (required)
REPMGRD_CONF="/etc/repmgr/12/repmgr.conf"
# additional options
REPMGRD_OPTS="--daemonize=false"
# user to run repmgrd as
REPMGRD_USER=postgres
# repmgrd binary
REPMGRD_BIN=/bin/repmgrd
# pid file
REPMGRD_PIDFILE=/var/run/repmgrd.pid
Secondly, I would revisit sudoers (visudo) in order to check whether the non-root user can execute sudo /etc/init.d/repmgrd start.
Further, depending on your configuration, the user who runs the repmgr commands must be able to write to the log files.
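A minimal sketch of such a sudoers entry (an assumption about your setup: repmgrd and repmgr run as the postgres user; add it with visudo, e.g. in a file under /etc/sudoers.d/):
# allow the postgres user to run the commands referenced in repmgr.conf without a password
postgres ALL=(root) NOPASSWD: /etc/init.d/repmgrd start, /etc/init.d/repmgrd stop, \
    /usr/bin/systemctl start postgresql@11-main.service, \
    /usr/bin/systemctl stop postgresql@11-main.service, \
    /usr/bin/systemctl restart postgresql@11-main.service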
Apparently the correct command to start the repmgr daemon is:
repmgrd -f /etc/repmgr.conf
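If it still does not come up, running it in the foreground (using the same --daemonize=false option shown in /etc/default/repmgrd above) and then checking the status usually reveals the underlying problem; a sketch, run as the postgres user:
repmgrd -f /etc/repmgr.conf --daemonize=false    # run in the foreground and watch the log output
repmgr -f /etc/repmgr.conf service status        # confirm whether repmgrd is seen as running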

Airflow live executor logs with DaskExecutor

I have an Airflow installation (on Kubernetes). My setup uses DaskExecutor. I also configured remote logging to S3. However, while a task is running I cannot see its log; instead I get this error:
*** Log file does not exist: /airflow/logs/dbt/run_dbt/2018-11-01T06:00:00+00:00/3.log
*** Fetching from: http://airflow-worker-74d75ccd98-6g9h5:8793/log/dbt/run_dbt/2018-11-01T06:00:00+00:00/3.log
*** Failed to fetch log file from worker. HTTPConnectionPool(host='airflow-worker-74d75ccd98-6g9h5', port=8793): Max retries exceeded with url: /log/dbt/run_dbt/2018-11-01T06:00:00+00:00/3.log (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7d0668ae80>: Failed to establish a new connection: [Errno -2] Name or service not known',))
Once the task is done, the log is shown correctly.
I believe what Airflow is doing is:
- for finished tasks, read the logs from S3
- for running tasks, connect to the executor's log-server endpoint and show that
It looks like Airflow is using celery.worker_log_server_port to connect to my Dask executor to fetch logs from there.
How do I configure DaskExecutor to expose a log-server endpoint?
My configuration:
core remote_logging True
core remote_base_log_folder s3://some-s3-path
core executor DaskExecutor
dask cluster_address 127.0.0.1:8786
celery worker_log_server_port 8793
What I verified:
- the log file exists and is being written to on the executor while the task is running
- netstat -tunlp on the executor container shows no extra port where logs could be served from
UPDATE
Have a look at the serve_logs Airflow CLI command - I believe it does exactly the same thing.
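A rough sketch of that idea (assuming Airflow 1.10, where serve_logs exists as a CLI subcommand and listens on celery.worker_log_server_port): run it alongside the Dask worker in the container entrypoint, e.g.:
#!/usr/bin/env bash
# hypothetical worker entrypoint: serve this worker's local task logs while it runs
airflow serve_logs &
dask-worker $@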
We solved the problem by simply starting a Python HTTP server on the worker.
Dockerfile:
RUN mkdir -p $AIRFLOW_HOME/serve
RUN ln -s $AIRFLOW_HOME/logs $AIRFLOW_HOME/serve/log
worker.sh (run by Docker CMD):
#!/usr/bin/env bash
# serve the symlinked log directory over HTTP on the port Airflow expects (8793)
cd $AIRFLOW_HOME/serve
python3 -m http.server 8793 &
cd -
# start the Dask worker, forwarding any arguments passed to the container
dask-worker $@

Starting OpenShift cluster never ends when starting minishift or takes too much memory

Whenever I run the command to start Minishift with the VirtualBox driver on the host OS, it takes a very long time and never finishes. Sometimes I even get an error message about the storage limit being reached.
I wonder if it is an error in the persistent storage volume configuration and usage described here
mike@mike-thinks:~$ minishift start --vm-driver=virtualbox
-- Starting profile 'minishift'
-- Check if deprecated options are used ... OK
-- Checking if https://github.com is reachable ... OK
-- Checking if requested OpenShift version 'v3.9.0' is valid ... OK
-- Checking if requested OpenShift version 'v3.9.0' is supported ... OK
-- Checking if requested hypervisor 'virtualbox' is supported on this platform ... OK
-- Checking if VirtualBox is installed ... OK
-- Checking the ISO URL ... OK
-- Checking if provided oc flags are supported ... OK
-- Starting local OpenShift cluster using 'virtualbox' hypervisor ...
-- Starting Minishift VM ........................ OK
-- Checking for IP address ... OK
-- Checking for nameservers ... OK
-- Checking if external host is reachable from the Minishift VM ...
Pinging 8.8.8.8 ... OK
-- Checking HTTP connectivity from the VM ...
Retrieving http://minishift.io/index.html ... OK
-- Checking if persistent storage volume is mounted ... OK
-- Checking available disk space ... 8% used OK
-- OpenShift cluster will be configured with ...
Version: v3.9.0
-- Copying oc binary from the OpenShift container image to VM .... OK
-- Starting OpenShift cluster ..............................................
What can I do? I'm following this tutorial and I just want to get to the stage that allows me to add oc to my PATH.
Update: new error during OpenShift cluster start
-- Starting OpenShift cluster ...........Error during 'cluster up' execution: Error starting the cluster. ssh command error:
command : /var/lib/minishift/bin/oc cluster up --use-existing-config --host-volumes-dir /var/lib/minishift/openshift.local.volumes --host-pv-dir /var/lib/minishift/openshift.local.pv --host-config-dir /var/lib/minishift/openshift.local.config --host-data-dir /var/lib/minishift/hostdata --public-hostname 192.168.99.100 --routing-suffix 192.168.99.100.nip.io
err : exit status 1
output : Deleted existing OpenShift container
Using nsenter mounter for OpenShift volumes
Using public hostname IP 192.168.99.100 as the host IP
Using 192.168.99.100 as the server IP
Starting OpenShift using openshift/origin:v3.9.0 ...
-- Starting OpenShift container ...
Starting OpenShift using container 'origin'
Waiting for API server to start listening
FAIL
Error: cannot access master readiness URL https://192.168.99.100:8443/healthz/ready
Details:
Last 10 lines of "origin" container log:
E0625 14:47:40.905680 2341 proxier.go:252] Error removing userspace rule: error checking rule: fork/exec /usr/sbin/iptables: exec format error:
E0625 14:47:40.908353 2341 proxier.go:259] Error removing userspace rule: error checking rule: fork/exec /usr/sbin/iptables: exec format error:
E0625 14:47:40.910681 2341 proxier.go:274] Error flushing userspace chain: error flushing chain "KUBE-PORTALS-CONTAINER": fork/exec /usr/sbin/iptables: exec format error:
E0625 14:47:40.913452 2341 proxier.go:274] Error flushing userspace chain: error flushing chain "KUBE-PORTALS-HOST": fork/exec /usr/sbin/iptables: exec format error:
E0625 14:47:40.919209 2341 proxier.go:274] Error flushing userspace chain: error flushing chain "KUBE-NODEPORT-HOST": fork/exec /usr/sbin/iptables: exec format error:
W0625 14:47:40.931698 2341 cni.go:171] Unable to update cni config: No networks found in /etc/cni/net.d
E0625 14:47:40.932412 2341 proxier.go:274] Error flushing userspace chain: error flushing chain "KUBE-NODEPORT-CONTAINER": fork/exec /usr/sbin/iptables: exec format error:
E0625 14:47:40.938345 2341 proxier.go:274] Error flushing userspace chain: error flushing chain "KUBE-NODEPORT-NON-LOCAL": fork/exec /usr/sbin/iptables: exec format error:
W0625 14:47:40.941639 2341 iptables.go:151] Error checking iptables version, assuming version at least 1.4.11: fork/exec /usr/sbin/iptables: exec format error
F0625 14:47:40.949329 2341 network.go:177] error: Could not initialize Kubernetes Proxy. You must run this process as root (and if containerized, in the host network namespace as privileged) to use the service proxy: failed to initialize iptables: error creating chain "KUBE-PORTALS-CONTAINER": fork/exec /usr/sbin/iptables: exec format error:
Caused By:
Error: Get https://192.168.99.100:8443/healthz/ready: dial tcp 192.168.99.100:8443: getsockopt: connection refused
In case this helps at all: the first time, it did not finish for me either, although the image was in the "running" state in VirtualBox.
The second time I ran it from an elevated command prompt, and it completed successfully. Though perhaps it was not the elevated user that helped, but simply the fact that I ran it a second time.
If the issue is that you're stuck on Starting Minishift VM (not the case for the OP), then the issue may be that you're on a VPN. Try disconnecting the VPN and see if that fixes your issues.
I had the same problem. I noticed some network traffic during the first (failing) startup on a rather slow network connection. I waited until the network traffic was low, tried again, and then it worked. So some Docker image downloads are probably being done during startup.

ICp 2.1.0.1: Installation failed with error TASK [master: Waiting for MariaDB service to start]

I am installing ICp 2.1.0.1 and I received an error at the task [master: Waiting for MariaDB service to start], with the message: The MariaDB component failed to start.
After this message the installation completed with a failed status.
We are installing ICp with 3 masters, 3 proxies, and 2 workers. We have one IP for the master VIP and one for the proxy VIP.
I have tried to install multiple times and every installation hit the same error.
In prior cases of this error, the correct DB admin password was not being used, so check the DB user and password to resolve the issue.
Could you validate whether each master host is able to access port 3306 on the other hosts?
If you run with .. install -vv | tee -a install-log.txt, do you get additional details as well?
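One quick way to check that from each master node (the host name below is just a placeholder; nc from the netcat package is assumed to be available):
nc -vz other-master.example.com 3306   # should report the connection to port 3306 as open/succeeded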
The error was solved by following the steps below.
Check whether kubelet is running:
Log in to your master node.
Run the following command to check kubelet status:
systemctl status kubelet
If kubelet is not running, run the following command to get the logs:
journalctl -u kubelet &> kubelet.log
We found this error in kubelet.log:
Error: failed to run Kubelet: Running with swap on is not supported, please disable swap! or set --fail-swap-on flag to false.
We found these troubleshooting steps at the first link below, and the solution in ICP issue 4651:
https://www.ibm.com/support/knowledgecenter/en/SSBS6K_2.1.0/troubleshoot/etcd_fails.html
https://github.ibm.com/IBMPrivateCloud/roadmap/issues/4651
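The fix referenced there amounts to disabling swap on the node before retrying the installation; a minimal sketch (assuming swap is listed in /etc/fstab and kubelet is managed by systemd):
sudo swapoff -a                               # turn swap off immediately
sudo sed -i '/\sswap\s/ s/^/#/' /etc/fstab    # comment out swap entries so it stays off after a reboot
sudo systemctl restart kubelet                # restart kubelet so it no longer fails the swap check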

Installation of Riak under Ubuntu 14.04 LTS

I can't get Riak to work on Ubuntu 14.04 LTS using the bash instructions under
http://docs.basho.com/riak/latest/ops/building/installing/debian-ubuntu/.
When running riak start I get:
riak failed to start within 15 seconds,
see the output of 'riak console' for more information.
If you want to wait longer, set the environment variable
WAIT_FOR_ERLANG to the number of seconds to wait.
When running riak console afterwards:
Exec: /usr/lib/riak/erts-5.10.3/bin/erlexec -boot /usr/lib/riak/releases/2.1.3/riak -config /var/lib/riak/generated.configs/app.2016.02.28.21.43.04.config -args_file /var/lib/riak/generated.configs/vm.2016.02.28.21.43.04.args -vm_args /var/lib/riak/generated.configs/vm.2016.02.28.21.43.04.args -pa /usr/lib/riak/lib/basho-patches -- console -x
Root: /usr/lib/riak
Erlang R16B02_basho8 (erts-5.10.3) [source] [64-bit] [smp:2:2] [async-threads:64] [kernel-poll:true] [frame-pointer]
[os_mon] memory supervisor port (memsup): Erlang has closed
[os_mon] cpu supervisor port (cpu_sup): Erlang has closed
{"Kernel pid terminated",application_controller,"{application_start_failure,riak_core,{bad_return,{{riak_core_app,start,[normal,[]]},{'EXIT',{{function_clause,[{orddict,fetch,['riak#127.0.0.1',[{'riak#54.194.69.48',[{{riak_core,bucket_types},[true,false]},{{riak_core,fold_req_version},[v2,v1]},{{riak_core,net_ticktime},[true,false]},{{riak_core,resizable_ring},[true,false]},{{riak_core,security},[true,false]},{{riak_core,staged_joins},[true,false]},{{riak_core,vnode_routing},[proxy,legacy]},{{riak_pipe,trace_format},[ordsets,sets]}]}]],[{file,\"orddict.erl\"},{line,72}]},{riak_core_capability,renegotiate_capabilities,1,[{file,\"src/riak_core_capability.erl\"},{line,441}]},{riak_core_capability,handle_call,3,[{file,\"src/riak_core_capability.erl\"},{line,213}]},{gen_server,handle_msg,5,[{file,\"gen_server.erl\"},{line,585}]},{proc_lib,init_p_do_apply,3,[{file,\"proc_lib.erl\"},{line,239}]}]},{gen_server,call,[riak_core_capability,{register,{riak_core,vnode_routing},{capability,[proxy,legacy],legacy,{riak_core,legacy_vnode_routing,[{true,legacy},{false,proxy}]}}},infinity]}}}}}}"}
Any idea how to fix this? Installation has been done via apt-get. Default riak.conf. Riak version is 2.1.3.
This is a Riak error, not at all related to Ubuntu.
The error message indicates that the current name of the node does not match the name of any node in the ring file. This can happen if you start the node with a default configuration before configuring the node's name. See Note on changing the name value at http://docs.basho.com/riak/latest/ops/building/basic-cluster-setup/
If this is a singleton node, the simplest solution will be to delete the files in /var/lib/riak/ring (make a backup first). A new one will be created when you start the node.
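A sketch of that cleanup, using the paths from above (back up first, and only do this on a singleton node):
riak stop
mkdir -p /var/lib/riak/ring_backup
cp /var/lib/riak/ring/* /var/lib/riak/ring_backup/   # keep a copy of the old ring files
rm /var/lib/riak/ring/*                              # remove the stale ring; a new one is created on start
riak start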
