I got this error while installing Cloudera on a single node:
Installation failed. Failed to receive heartbeat from agent.
This is what is in my /etc/hosts file:
127.0.0.1 localhost
192.168.2.131 ubuntu
This is what is in my /etc/hostname file:
ubuntu
And this is the error in my /var/log/cloudera-scm-agent/cloudera-scm-agent.log file:
[13/Jun/2014 12:31:58 +0000] 15366 MainThread agent INFO To override these variables, use /etc/cloudera-scm-agent/config.ini. Environment variables for CDH locations are not used when CDH is installed from parcels.
[13/Jun/2014 12:31:58 +0000] 15366 MainThread agent INFO Re-using pre-existing directory: /run/cloudera-scm-agent/process
[13/Jun/2014 12:31:58 +0000] 15366 MainThread agent INFO Re-using pre-existing directory: /run/cloudera-scm-agent/supervisor
[13/Jun/2014 12:31:58 +0000] 15366 MainThread agent INFO Re-using pre-existing directory: /run/cloudera-scm-agent/supervisor/include
[13/Jun/2014 12:31:58 +0000] 15366 MainThread agent ERROR Failed to connect to previous supervisor.
Traceback (most recent call last):
File "/usr/lib/cmf/agent/src/cmf/agent.py", line 1236, in find_or_start_supervisor
self.get_supervisor_process_info()
File "/usr/lib/cmf/agent/src/cmf/agent.py", line 1423, in get_supervisor_process_info
self.identifier = self.supervisor_client.supervisor.getIdentification()
File "/usr/lib/python2.7/xmlrpclib.py", line 1224, in __call__
return self.__send(self.__name, args)
File "/usr/lib/python2.7/xmlrpclib.py", line 1578, in __request
verbose=self.__verbose
File "/usr/lib/cmf/agent/build/env/lib/python2.7/site-packages/supervisor-3.0-py2.7.egg/supervisor/xmlrpc.py", line 460, in request
self.connection.request('POST', handler, request_body, self.headers)
File "/usr/lib/python2.7/httplib.py", line 958, in request
self._send_request(method, url, body, headers)
File "/usr/lib/python2.7/httplib.py", line 992, in _send_request
self.endheaders(body)
File "/usr/lib/python2.7/httplib.py", line 954, in endheaders
self._send_output(message_body)
File "/usr/lib/python2.7/httplib.py", line 814, in _send_output
self.send(msg)
File "/usr/lib/python2.7/httplib.py", line 776, in send
self.connect()
File "/usr/lib/python2.7/httplib.py", line 757, in connect
self.timeout, self.source_address)
File "/usr/lib/python2.7/socket.py", line 571, in create_connection
raise err
error: [Errno 111] Connection refused
[13/Jun/2014 12:31:58 +0000] 15366 MainThread tmpfs INFO Reusing mounted tmpfs at /run/cloudera-scm-agent/process
[13/Jun/2014 12:31:59 +0000] 15366 MainThread agent INFO Trying to connect to newly launched supervisor (Attempt 1)
[13/Jun/2014 12:31:59 +0000] 15366 MainThread agent INFO Successfully connected to supervisor
[13/Jun/2014 12:31:59 +0000] 15366 MainThread _cplogging INFO [13/Jun/2014:12:31:59] ENGINE Bus STARTING
[13/Jun/2014 12:31:59 +0000] 15366 MainThread _cplogging INFO [13/Jun/2014:12:31:59] ENGINE Started monitor thread '_TimeoutMonitor'.
[13/Jun/2014 12:31:59 +0000] 15366 MainThread _cplogging INFO [13/Jun/2014:12:31:59] ENGINE Serving on ubuntu:9000
[13/Jun/2014 12:31:59 +0000] 15366 MainThread _cplogging INFO [13/Jun/2014:12:31:59] ENGINE Bus STARTED
[13/Jun/2014 12:31:59 +0000] 15366 MainThread __init__ INFO New monitor: (<cmf.monitor.host.HostMonitor object at 0x305b990>,)
[13/Jun/2014 12:31:59 +0000] 15366 MainThread agent WARNING Setting default socket timeout to 30!
[13/Jun/2014 12:31:59 +0000] 15366 MonitorDaemon-Scheduler __init__ INFO Monitor ready to report: ('HostMonitor',)
[13/Jun/2014 12:31:59 +0000] 15366 MainThread agent INFO Using parcels directory from server provided value: /opt/cloudera/parcels
[13/Jun/2014 12:31:59 +0000] 15366 MainThread parcel INFO Agent does create users/groups and apply file permissions
[13/Jun/2014 12:31:59 +0000] 15366 MainThread downloader INFO Downloader path: /opt/cloudera/parcel-cache
[13/Jun/2014 12:31:59 +0000] 15366 MainThread parcel_cache INFO Using /opt/cloudera/parcel-cache for parcel cache
[13/Jun/2014 12:31:59 +0000] 15366 MainThread agent INFO Active parcel list updated; recalculating component info.
[13/Jun/2014 12:32:04 +0000] 15366 Monitor-HostMonitor throttling_logger INFO Using java location: '/usr/lib/jvm/java-7-oracle-cloudera/bin/java'.
[13/Jun/2014 12:32:04 +0000] 15366 Monitor-HostMonitor throttling_logger ERROR Failed to collect NTP metrics
Traceback (most recent call last):
File "/usr/lib/cmf/agent/src/cmf/monitor/host/ntp_monitor.py", line 39, in collect
result, stdout, stderr = self._subprocess_with_timeout(args, self._timeout)
File "/usr/lib/cmf/agent/src/cmf/monitor/host/ntp_monitor.py", line 32, in _subprocess_with_timeout
return subprocess_with_timeout(args, timeout)
File "/usr/lib/cmf/agent/src/cmf/monitor/host/subprocess_timeout.py", line 40, in subprocess_with_timeout
close_fds=True)
File "/usr/lib/python2.7/subprocess.py", line 679, in __init__
errread, errwrite)
File "/usr/lib/python2.7/subprocess.py", line 1249, in _execute_child
raise child_exception
OSError: [Errno 2] No such file or directory
[13/Jun/2014 12:32:12 +0000] 15366 Monitor-HostMonitor throttling_logger ERROR Timeout with args ['/usr/lib/jvm/java-7-oracle-cloudera/bin/java', '-classpath', '/usr/share/cmf/lib/agent-5.0.2.jar', 'com.cloudera.cmon.agent.DnsTest']
None
[13/Jun/2014 12:32:12 +0000] 15366 Monitor-HostMonitor throttling_logger ERROR Failed to collect java-based DNS names
Traceback (most recent call last):
File "/usr/lib/cmf/agent/src/cmf/monitor/host/dns_names.py", line 67, in collect
result, stdout, stderr = self._subprocess_with_timeout(args, self._poll_timeout)
File "/usr/lib/cmf/agent/src/cmf/monitor/host/dns_names.py", line 49, in _subprocess_with_timeout
return subprocess_with_timeout(args, timeout)
File "/usr/lib/cmf/agent/src/cmf/monitor/host/subprocess_timeout.py", line 81, in subprocess_with_timeout
raise Exception("timeout with args %s" % args)
Exception: timeout with args ['/usr/lib/jvm/java-7-oracle-cloudera/bin/java', '-classpath', '/usr/share/cmf/lib/agent-5.0.2.jar', 'com.cloudera.cmon.agent.DnsTest']
I was facing similar problems, and I found how to solve this one:
ERROR Failed to collect NTP metrics
It happens because the NTP service is not installed or not started.
Try:
sudo apt-get update && sudo apt-get install ntp
sudo service ntp start
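To confirm the daemon is up and actually syncing, a quick check with the standard NTP tooling (nothing Cloudera-specific) is:
service ntp status
ntpq -p
ntpq -p should list one or more peers; once an upstream server has been selected, it is marked with a * in the first column.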
I got the same error. Please ensure that your hostname can be resolved to your IP address.
Run ifconfig -a and note the IP address on eth0, then run dig or host with your FQDN and verify that the address returned is the same one ifconfig shows (see the sketch below).
Follow this tutorial from Cloudera: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Installation-Guide/cdh4ig_topic_11_1.html
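For example, with the hostname ubuntu and the address 192.168.2.131 from the question, the check could look like:
ifconfig -a                 # note the inet addr on eth0, e.g. 192.168.2.131
hostname -f                 # should print the FQDN the agent reports
host $(hostname -f)         # should resolve to the same address ifconfig showed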
This error also occurs when installing Cloudera 5.2 on AWS. It's a known issue, and Cloudera published the workaround on their website (copied here):
Installing on AWS, you must use private EC2 hostnames.
When installing on an AWS instance, and adding hosts using their public names, the installation will fail when the hosts fail to heartbeat.
Workaround:
Use the Back button in the wizard to return to the original screen, where it prompts for a license.
Rerun the wizard, but choose "Use existing hosts" instead of searching for hosts. Now those hosts show up with their internal EC2 names.
Continue through the wizard and the installation should succeed.
Ensure that the host's hostname is configured properly.
Ensure that port 7182 is accessible on the Cloudera Manager server (check firewall rules).
Ensure that ports 9000 and 9001 are free on the host being added.
Check agent logs in /var/log/cloudera-scm-agent/ on the host being added (some of the logs can be found in the installation details).
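For the port checks above, something like this run from the host being added gives a quick answer (cm-server is a placeholder for your Cloudera Manager hostname):
nc -zv cm-server 7182                # should report the port as open
netstat -tlnp | grep -E ':900[01]'   # should print nothing if 9000 and 9001 are free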
Related
Upon running:
airflow scheduler
I get the following error:
[2022-08-10 13:26:53,501] {scheduler_job.py:708} INFO - Starting the scheduler
[2022-08-10 13:26:53,502] {scheduler_job.py:713} INFO - Processing each file at most -1 times
[2022-08-10 13:26:53,509] {executor_loader.py:105} INFO - Loaded executor: SequentialExecutor
[2022-08-10 13:26:53 -0400] [1388] [INFO] Starting gunicorn 20.1.0
[2022-08-10 13:26:53,540] {manager.py:160} INFO - Launched DagFileProcessorManager with pid: 1389
[2022-08-10 13:26:53,545] {scheduler_job.py:1233} INFO - Resetting orphaned tasks for active dag runs
.
.
.
[2022-08-10 13:26:53 -0400] [1391] [INFO] Booting worker with pid: 1391
Process DagFileProcessor10-Process:
Traceback (most recent call last):
File "/home/dromo/anaconda3/envs/airflow_env_2/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 998, in _commit_impl
self.engine.dialect.do_commit(self.connection)
File "/home/dromo/anaconda3/envs/airflow_env_2/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 672, in do_commit
dbapi_connection.commit()
sqlite3.OperationalError: disk I/O error
I get the same 'disk I/O error' when I run the airflow webserver --port 8080 command:
Workers: 4 sync
Host: 0.0.0.0:8080
Timeout: 120
Logfiles: - -
Access Logformat:
=================================================================
[2022-08-10 14:42:28 -0400] [2759] [INFO] Starting gunicorn 20.1.0
[2022-08-10 14:42:29 -0400] [2759] [INFO] Listening at: http://0.0.0.0:8080 (2759)
[2022-08-10 14:42:29 -0400] [2759] [INFO] Using worker: sync
.
.
.
[2022-08-10 14:42:55,149] {app.py:1455} ERROR - Exception on /static/appbuilder/datepicker/bootstrap-datepicker.css [GET]
Traceback (most recent call last):
File "/home/dromo/anaconda3/envs/airflow_env_2/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 998, in _commit_impl
self.engine.dialect.do_commit(self.connection)
File "/home/dromo/anaconda3/envs/airflow_env_2/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 672, in do_commit
dbapi_connection.commit()
sqlite3.OperationalError: disk I/O error
Any ideas as to what might be causing this and possible fixes?
It seems like Airflow can't find its database on disk; try initializing it:
airflow db init
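If that doesn't help, it may be worth checking which database file Airflow is actually pointing at and whether the connection works. Assuming Airflow 2.x, something like:
airflow config get-value database sql_alchemy_conn   # on older 2.x versions the option lives in [core]
airflow db check
The default is a SQLite file under $AIRFLOW_HOME, so a read-only filesystem or a full disk at that path would also surface as sqlite3.OperationalError: disk I/O error.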
I launch an airflow webserver command on my local machine to start an Airflow instance on port 8081. The server starts, but the prompt constantly shows warning messages in a loop. No error message appears, yet the server doesn't work. These are the messages:
/usr/local/lib/python3.8/dist-packages/airflow/configuration.py:361 DeprecationWarning: The default_queue option in [celery] has been moved to the default_queue option in [operators] - the old setting has been used, but please update your config.
/usr/local/lib/python3.8/dist-packages/airflow/configuration.py:361 DeprecationWarning: The dag_concurrency option in [core] has been renamed to max_active_tasks_per_dag - the old setting has been used, but please update your config.
/usr/local/lib/python3.8/dist-packages/airflow/configuration.py:361 DeprecationWarning: The processor_poll_interval option in [scheduler] has been renamed to scheduler_idle_sleep_time - the old setting has been used, but please update your config.
[2022-06-13 15:11:57,355] {manager.py:779} WARNING - No user yet created, use flask fab command to do it.
[2022-06-13 15:12:01,925] {manager.py:512} WARNING - Refused to delete permission view, assoc with role exists DAG Runs.can_create User
[2022-06-13 15:12:19 +0000] [1117638] [INFO] Handling signal: ttou
[2022-06-13 15:12:19 +0000] [1120256] [INFO] Worker exiting (pid: 1120256)
[2022-06-13 15:12:19 +0000] [1117638] [WARNING] Worker with pid 1120256 was terminated due to signal 15
[2022-06-13 15:12:22 +0000] [1117638] [INFO] Handling signal: ttin
[2022-06-13 15:12:22 +0000] [1121568] [INFO] Booting worker with pid: 1121568
Do you know what could be happening?
Thanks in advance!
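Those DeprecationWarning lines are advisory rather than fatal: each one names its replacement, so updating airflow.cfg accordingly should silence them. A sketch, with sections and option names taken from the warnings themselves and the values being the usual defaults:
[operators]
default_queue = default
[core]
max_active_tasks_per_dag = 16
[scheduler]
scheduler_idle_sleep_time = 1
The ttou/ttin handling and the worker exiting/booting lines look like gunicorn's periodic worker refresh, which the Airflow webserver performs by design, so by themselves they don't explain why the server isn't responding.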
When I issue an AWS CLI command ("aws s3 ls") from the Windows command line, it completes successfully. When I try shell(paste0("aws ", "s3 ", "ls")) in R with Eclipse, it also completes successfully.
However, on another machine using the same credentials, "aws s3 ls" completes successfully from the Windows command line, but from RStudio shell(paste0("aws ", "s3 ", "ls")) gives me:
"fatal error: Unable to locate credentials"
If I run shell(paste0("aws ", "s3 ", "ls ", "--debug")) from RStudio I get:
2017-02-27 15:07:10,516 - MainThread - botocore.credentials - DEBUG - Looking for credentials via: env
2017-02-27 15:07:10,516 - MainThread - botocore.credentials - DEBUG - Looking for credentials via: assume-role
2017-02-27 15:07:10,516 - MainThread - botocore.credentials - DEBUG - Looking for credentials via: shared-credentials-file
2017-02-27 15:07:10,516 - MainThread - botocore.credentials - DEBUG - Looking for credentials via: config-file
2017-02-27 15:07:10,516 - MainThread - botocore.credentials - DEBUG - Looking for credentials via: ec2-credentials-file
2017-02-27 15:07:10,516 - MainThread - botocore.credentials - DEBUG - Looking for credentials via: boto-config
2017-02-27 15:07:10,516 - MainThread - botocore.credentials - DEBUG - Looking for credentials via: container-role
2017-02-27 15:07:10,516 - MainThread - botocore.credentials - DEBUG - Looking for credentials via: iam-role
2017-02-27 15:07:10,520 - MainThread - botocore.vendored.requests.packages.urllib3.connectionpool - INFO - Starting new HTTP connection (1): xxx.xxx.xxx.xxx
2017-02-27 15:07:10,523 - MainThread - botocore.utils - DEBUG - Caught exception while trying to retrieve credentials: ('Connection aborted.', error(10051, 'A socket operation was attempted to an unreachable network'))
Traceback (most recent call last):
  File "botocore\utils.pyc", line 159, in _get_request
  File "botocore\vendored\requests\api.pyc", line 69, in get
  File "botocore\vendored\requests\api.pyc", line 50, in request
  File "botocore\vendored\requests\sessions.pyc", line 465, in request
  File "botocore\vendored\requests\sessions.pyc", line 573, in send
  File "botocore\vendored\requests\adapters.pyc", line 415, in send
For the successful case from the Windows command line, it finds the credentials, as shown below:
2017-02-27 15:07:03,267 - MainThread - botocore.credentials - DEBUG - Looking for credentials via: env
2017-02-27 15:07:03,267 - MainThread - botocore.credentials - DEBUG - Looking for credentials via: assume-role
2017-02-27 15:07:03,267 - MainThread - botocore.credentials - DEBUG - Looking for credentials via: shared-credentials-file
2017-02-27 15:07:03,267 - MainThread - botocore.credentials - INFO - Found credentials in shared credentials file: ~/.aws/credentials
So, what I am trying to figure out is why the AWS CLI cannot find the credentials using RStudio, but it works fine from the Windows CLI.
UPDATE: I have installed Eclipse on both machines and it works fine. However, it is still not working with RStudio, so it must be an IDE-related issue. Any ideas?
I was able to solve it by defining AWS credentials in RStudio:
Sys.setenv("AWS_ACCESS_KEY_ID" = "YourKEY_ID",
           "AWS_SECRET_ACCESS_KEY" = "YOURSECRETACCESSKEY",
           "AWS_DEFAULT_REGION" = "eu-west-1")
I'm trying to set up multiple servers that look like:
Client Request ----> Nginx (Reverse-Proxy / Load-Balancer)
|
/|\
| | `-> App. Server I. 10.128.xxx.yy1:8080 # Our example
| `--> App. Server II. 10.128.xxx.yy2:8080
`----> ..
I understand that I need to put the App servers (Gunicorn in this case) behind an Nginx Proxy, but how do I set up the App servers by themselves?
I'm trying to set up the App server with systemd, and my configuration looks like:
[Unit]
Description=gunicorn daemon
After=network.target
[Service]
User=kyle
Group=www-data
WorkingDirectory=/home/kyle/do_app_one
ExecStart=/home/kyle/do_app_one/venv/bin/gunicorn --workers 3 --bind unix:/home/kyle/do_app_one/do_app_one.sock do_app_one.wsgi:application
[Install]
WantedBy=multi-user.target
I know the socket is being created because I can see it, but I can't access the Gunicorn server by itself when I hit the IP address, with or without the :8000 port attached. Without the systemd configuration, I can access the site if I run:
gunicorn --bind 0.0.0.0:8000 myproject.wsgi:application
but I want to do this the right way with an init system like systemd, and I don't think I'm supposed to be binding it directly to a port because I've read it's less efficient/secure than using a socket. Unless binding to a port is the only way, then I guess that's what I have to do.
Every tutorial I see says I need an Nginx server in front of my Gunicorn server, but I already have an Nginx server in front of them. Do I need another one in front of each server such that it looks like:
Client Request ----> Nginx (Reverse-Proxy / Load-Balancer)
|
/|\
| | `-> Nginx + App. Server I. 10.128.xxx.yy1:8080 # Our example
| `--> Nginx + App. Server II. 10.128.xxx.yy2:8080
`----> ..
If Nginx is an HTTP server, and Gunicorn is an HTTP server, why would I need another Nginx server in front of each App Server? It seems redundant.
And if I don't need another Nginx server in front of each Gunicorn server, how do I set up the Gunicorn server with systemd such that it can stand alone?
Edit:
I was curious why binding to a physical port was working but the socket wasn't, so I ran gunicorn status and got errors:
kyle@ubuntu-512mb-tor1-01-app:~/do_app_one$ . venv/bin/activate
(venv) kyle@ubuntu-512mb-tor1-01-app:~/do_app_one$ gunicorn status
[2016-12-03 20:19:49 +0000] [11050] [INFO] Starting gunicorn 19.6.0
[2016-12-03 20:19:49 +0000] [11050] [INFO] Listening at: http://127.0.0.1:8000 (11050)
[2016-12-03 20:19:49 +0000] [11050] [INFO] Using worker: sync
[2016-12-03 20:19:49 +0000] [11053] [INFO] Booting worker with pid: 11053
[2016-12-03 20:19:49 +0000] [11053] [ERROR] Exception in worker process
Traceback (most recent call last):
File "/home/kyle/do_app_one/venv/lib/python3.5/site-packages/gunicorn/arbiter.py", line 557, in spawn_worker
worker.init_process()
File "/home/kyle/do_app_one/venv/lib/python3.5/site-packages/gunicorn/workers/base.py", line 126, in init_process
self.load_wsgi()
File "/home/kyle/do_app_one/venv/lib/python3.5/site-packages/gunicorn/workers/base.py", line 136, in load_wsgi
self.wsgi = self.app.wsgi()
File "/home/kyle/do_app_one/venv/lib/python3.5/site-packages/gunicorn/app/base.py", line 67, in wsgi
self.callable = self.load()
File "/home/kyle/do_app_one/venv/lib/python3.5/site-packages/gunicorn/app/wsgiapp.py", line 65, in load
return self.load_wsgiapp()
File "/home/kyle/do_app_one/venv/lib/python3.5/site-packages/gunicorn/app/wsgiapp.py", line 52, in load_wsgiapp
return util.import_app(self.app_uri)
File "/home/kyle/do_app_one/venv/lib/python3.5/site-packages/gunicorn/util.py", line 357, in import_app
__import__(module)
ImportError: No module named 'status'
[2016-12-03 20:19:49 +0000] [11053] [INFO] Worker exiting (pid: 11053)
[2016-12-03 20:19:49 +0000] [11050] [INFO] Shutting down: Master
[2016-12-03 20:19:49 +0000] [11050] [INFO] Reason: Worker failed to boot.
Still not sure how to fix the problem, though. (In hindsight, the ImportError just means gunicorn treated status as the name of a WSGI application module to import; gunicorn has no status subcommand.)
The right answer is to just bind Gunicorn to a port instead of a Unix socket. I'm not too sure about the details, but a Unix socket can only be used by processes on the same machine, so a load balancer on a different host can't reach it; see:
https://unix.stackexchange.com/questions/91774/performance-of-unix-sockets-vs-tcp-ports
So when I changed the gunicorn.service file ExecStart line to:
ExecStart=/home/kyle/do_app_one/venv/bin/gunicorn --workers 3 --bind 0.0.0.0:8000 do_app_one.wsgi:application
I was able to access the server by itself, and connect it to my Nginx server that was on a different IP.
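For completeness, the load-balancer side of that setup is just an upstream block pointing at the port-bound Gunicorns. A minimal sketch, with hypothetical private IPs standing in for the masked ones above:
upstream app_servers {
    server 10.128.0.1:8000;   # App. Server I  (hypothetical IP)
    server 10.128.0.2:8000;   # App. Server II (hypothetical IP)
}
server {
    listen 80;
    location / {
        proxy_pass http://app_servers;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}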
We're running Red Hat 6.4 on two of our nodes.
We've installed the new Cloudera Manager 5.5.0 and have been trying to create a cluster and add a first node to it (the node is initially clean of any Cloudera component). Unfortunately, during the cluster installation, Cloudera Manager gets stuck every time at:
Installation failed. Failed to receive heartbeat from agent.
Ensure that the host's hostname is configured properly.
Ensure that port 7182 is accessible on the Cloudera Manager Server (check firewall rules).
Ensure that ports 9000 and 9001 are not in use on the host being added.
Check agent logs in /var/log/cloudera-scm-agent/ on the host being added. (Some of the logs can be found in the installation details).
If Use TLS Encryption for Agents is enabled in Cloudera Manager (Administration -> Settings -> Security), ensure that /etc/cloudera-scm-agent/config.ini has use_tls=1 on the host being added. Restart the corresponding agent and click the Retry link here.
We looked around and saw that this is usually caused by a misconfigured /etc/hosts file. So we edited ours on both the Cloudera Manager host and the new node, and did a service network restart as well as a service cloudera-scm-server restart, but it didn't work either.
Here's what the /etc/hosts file looks like:
127.0.0.1 localhost
10.186.80.86 domain.node2.fr.net host
10.186.80.105 domain.node1.fr.net mgrnode
We also tried some cleanup before relaunching the cluster creation by deleting scm_prepare_node.* and .scm_prepare_node.lock.
We also checked service cloudera-scm-agent status on the new node after each failed installation, and noticed that the service isn't running (even when we restart it, the result is the same):
service cloudera-scm-agent start
Starting cloudera-scm-agent: [ OK ]
service cloudera-scm-agent status
cloudera-scm-agent dead but pid file exists
Here's the agent logs on the new node side :
tail -f /var/log/cloudera-scm-agent/cloudera-scm-agent.log
[30/Nov/2015 15:07:27 +0000] 24529 MainThread agent INFO Agent Logging Level: INFO
[30/Nov/2015 15:07:27 +0000] 24529 MainThread agent INFO No command line vars
[30/Nov/2015 15:07:27 +0000] 24529 MainThread agent INFO Missing database jar: /usr/share/java/mysql-connector-java.jar (normal, if you're not using this database type)
[30/Nov/2015 15:07:27 +0000] 24529 MainThread agent INFO Missing database jar: /usr/share/java/oracle-connector-java.jar (normal, if you're not using this database type)
[30/Nov/2015 15:07:27 +0000] 24529 MainThread agent INFO Found database jar: /usr/share/cmf/lib/postgresql-9.0-801.jdbc4.jar
[30/Nov/2015 15:07:27 +0000] 24529 MainThread agent INFO Agent starting as pid 24529 user cloudera-scm(420) group cloudera-scm(207).
[30/Nov/2015 15:07:27 +0000] 24529 MainThread agent INFO Because agent not running as root, all processes will run with current user.
[30/Nov/2015 15:07:27 +0000] 24529 MainThread agent WARNING Expected mode 0751 for /var/run/cloudera-scm-agent but was 0755
[30/Nov/2015 15:07:27 +0000] 24529 MainThread agent INFO Re-using pre-existing directory: /var/run/cloudera-scm-agent
[30/Nov/2015 15:07:29 +0000] 24529 MainThread agent INFO Re-using pre-existing directory: /var/run/cloudera-scm-agent/cgroups
Is there anything we're doing wrong?
Thanks in advance for your help!
We ended up solving it. This time we simply created the cluster as the root user (we didn't check the single user mode option).
Besides, our host had no internet access, and since we had created our own repository, we needed one last step before launching the cluster creation: importing the GPG key on the host with this command:
sudo rpm --import
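(The key argument was stripped above; with a custom repository it is typically the key file or URL published alongside the repo, e.g. sudo rpm --import http://repo.example.com/cloudera/RPM-GPG-KEY-cloudera, where that URL is only a hypothetical placeholder.)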
If anybody finds themselves facing the same problem, hope this helps!