Unable to submit a job through Livy in cluster mode

I have the following setup, but I am unable to submit a job through Livy in cluster mode. Here are my settings:
spark-defaults.conf
spark.master yarn
livy.conf
livy.spark.master=yarn
livy.spark.deploy-mode = cluster
livy.server.recovery.mode = recovery
livy.server.recovery.state-store = zookeeper
livy.server.recovery.state-store.url = localhost:2181
Is anything wrong with this configuration?
I am using Apache Livy 0.4 and Spark 2.3.0.
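To rule out the REST layer, it can help to submit a test batch directly against Livy's POST /batches endpoint. Below is a minimal sketch with Python's standard library; the Livy URL, jar path, and class name are placeholders for your environment, and the actual send is left commented out:

```python
import json
from urllib.request import Request

LIVY_URL = "http://localhost:8998"  # assumption: Livy's default port

# Payload for POST /batches; the jar path and class name are placeholders.
payload = {
    "file": "hdfs:///jobs/my-spark-job.jar",
    "className": "com.example.MySparkJob",
    "conf": {
        "spark.master": "yarn",
        "spark.submit.deployMode": "cluster",
    },
}

body = json.dumps(payload).encode("utf-8")
req = Request(LIVY_URL + "/batches", data=body,
              headers={"Content-Type": "application/json"})
print(body.decode())  # inspect the request before sending
# from urllib.request import urlopen; urlopen(req)  # run against a live server
```

If this request succeeds while livy.conf-driven submission does not, the problem is in the server-side defaults rather than the job itself.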

Related

Airflow: logs not showing in the UI while tasks are running

Airflow Version: 2.2.4
Airflow running in EKS
Issue: logs not showing in the UI while tasks are running
The issue is that Airflow only writes the logs to the log file rather than to standard out as well. This is what prevents us from seeing the logs in the web UI while the task is running.
When I exec into the pod, I do see the log file inside it.
Is there a setting or configuration to make Airflow write to both?
I fetch the logs as below:
kubectl logs detaskdate0.3d55e5ba89ca4ad491bb3e1cadfdaaec -n airflow
Added new context arn:aws:eks:us-west-2:XXXXXXXX:cluster/us-west-2-airflow-cluster to /home/airflow/.kube/config
[2022-05-20 19:56:43,529] {dagbag.py:500} INFO - Filling up the DagBag from /opt/airflow/dags/tss/dq_tss_mod_date_dag.py
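One way to get records into both places is to attach an extra StreamHandler alongside the file handler (Airflow supports this kind of change through a custom `logging_config_class`; the snippet below only illustrates the general idea with plain stdlib logging, not Airflow's actual config dict):

```python
import logging
import sys
import tempfile

# A logger that, like Airflow's task logger, writes to a log file.
log_path = tempfile.NamedTemporaryFile(suffix=".log", delete=False).name
logger = logging.getLogger("task")
logger.setLevel(logging.INFO)
logger.addHandler(logging.FileHandler(log_path))

# Add a StreamHandler so the same records also reach stdout,
# which is what `kubectl logs` (and a live-log view) can read.
logger.addHandler(logging.StreamHandler(sys.stdout))

logger.info("task started")  # goes to both destinations

with open(log_path) as f:
    print("file saw:", f.read().strip())
```

The same record reaches the file and stdout because handlers are independent sinks on one logger; in Airflow the equivalent is adding a console handler to the `airflow.task` logger in a customized logging config.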

Error: No response from Gunicorn master within 120 seconds. Shutting down webserver. Airflow webserver service won't start

I want to monitor my Airflow worker logs with the help of Prometheus.
I looked it up on the internet and found that statsd-exporter can help me.
But when I added the required statsd-exporter configuration to airflow.cfg, the service won't start.
airflow.cfg file:
[scheduler]
statsd_on = True
statsd_host = xxx.xxx.x.xxx
statsd_port = 9125
statsd_prefix = airflow
When I restart the service after adding these lines to airflow.cfg, it doesn't start and throws the above error in the logs.
When I remove these lines, it starts perfectly.
I installed statsd and statsd-exporter via pip.
Also, can anyone suggest a better way to do this?
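With `statsd_on = True`, Airflow emits plain StatsD datagrams over UDP, so the path to statsd-exporter can be verified independently of Airflow with a throwaway sender/receiver pair (the port and metric name below are illustrative; statsd-exporter normally listens on 9125):

```python
import socket

# Stand-in receiver for statsd-exporter's UDP listener.
recv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
recv.bind(("127.0.0.1", 0))  # ephemeral port for the demo
recv.settimeout(2)
host, port = recv.getsockname()

# What Airflow sends: "<prefix>.<metric>:<value>|<type>" datagrams.
send = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
send.sendto(b"airflow.scheduler_heartbeat:1|c", (host, port))

datagram = recv.recv(1024)
print(datagram.decode())  # -> airflow.scheduler_heartbeat:1|c
```

Pointing the sender at your real exporter host and port is a quick way to confirm the exporter is reachable before blaming the Airflow config.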

Apache Airflow 1.10.10, Remote Worker and S3 logs

I have recently upgraded Airflow from 1.10.0 to 1.10.10. The current setup has web, worker, scheduler, and flower on the same machine. When a DAG runs, the first step spins up a new EMR cluster for the DAG, along with a worker instance where only the worker process runs. We are using the Celery executor. This worker node sends tasks to run on the EMR cluster. Once the tasks have run, the next steps terminate the EMR cluster and then the worker instance. Every task's log is present on this worker node. As long as the tasks are running or the worker node is up, I can see the logs in the web UI. But as soon as the worker is terminated, I can no longer see them. The config is set up to upload logs to S3. I do see the logs of the startEMR and startWorker steps on S3, since those run on the main Airflow instance (where all four processes are running).
Here is the config snippet of airflow.cfg
base_log_folder = /home/deploy/airflow/logs
remote_logging = True
remote_base_log_folder = s3://airflow-log-bucket/airflow/logs/
remote_log_conn_id = aws_default
encrypt_s3_logs = False
s3_log_folder = '/airflow/logs/'
executor = CeleryExecutor
The same config file is set up when the worker instance is initialized for the DAG, and only the worker process is started on that node.
Here is the log from a task after the worker node was terminated:
*** Log file does not exist: /home/deploy/airflow/logs/XXXX/XXXXXX/2020-07-07T23:30:05+00:00/1.log
*** Fetching from: http://ip-10-164-62-253.ap-southeast-2.compute.internal:8799/log/XXXX/XXXXXX/2020-07-07T23:30:05+00:00/1.log
*** Failed to fetch log file from worker. HTTPConnectionPool(host='ip-10-164-62-253.ap-southeast-2.compute.internal', port=8799): Max retries exceeded with url: /log/xxxx/XXXXXX/2020-07-07T23:30:05+00:00/1.log (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f6750ac52b0>: Failed to establish a new connection: [Errno 113] No route to host',))
So basically:
- This was working in Airflow 1.10.0 (I did not need to add remote_logging=True).
- The logs for the EMR start and worker node start steps are copied to S3 and shown in the web UI.
- Only the logs of tasks running on the remote worker node are not copied to S3.
Can someone please let me know what I am missing in the configuration, as the same config used to work on Airflow 1.10.0?
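Since remote S3 logging quietly falls back to local logs when the S3 handler's dependencies are missing on the worker, a quick check that the worker's interpreter can actually import them is worthwhile. The module names below are the usual suspects and are an assumption; adjust them for your install:

```python
import importlib.util
import sys

# Modules the S3 task handler typically needs; adjust for your setup.
candidates = ["boto3", "airflow"]

print("interpreter:", sys.executable)
results = {}
for mod in candidates:
    results[mod] = importlib.util.find_spec(mod) is not None
    print(mod, "found" if results[mod] else "MISSING")
```

Running this with the exact interpreter the worker uses (e.g. `python3` vs `python`) exposes the pip-vs-pip3 mismatch described in the answer below.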
I found the mistake I was making. The S3 module on the new worker node was being installed via pip rather than pip3, while the Airflow server had it installed from pip3.
Another config change I had to make was in the [webserver] section of the airflow.cfg file:
worker_class = sync
This was previously gevent.

sparklyr - Connect remote hadoop cluster

Is it possible to connect sparklyr to a remote Hadoop cluster, or can it only be used locally?
And if it is possible, how? :)
In my opinion, the connection from R to Hadoop via Spark is very important!
Do you mean Hadoop or Spark cluster? If Spark, you can try to connect through Livy, details here:
https://github.com/rstudio/sparklyr#connecting-through-livy
Note: Connecting to Spark clusters through Livy is under experimental development in sparklyr
You could use Livy, which is a REST API service for the Spark cluster.
Once you have set up your HDInsight cluster on Azure, check for the Livy service using curl:
#curl test
curl -k --user "admin:mypassword1!" -v -X GET https://<yourclustername>.azurehdinsight.net/livy/
#r-studio code
sc <- spark_connect(
  master = "https://<yourclustername>.azurehdinsight.net/livy/",
  method = "livy",
  config = livy_config(
    username = "admin",
    password = rstudioapi::askForPassword("Livy password:")
  )
)
A useful URL:
https://learn.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-livy-rest-interface

Kaa node service fails to start MongoDB and ZooKeeper

We are trying to setup a Single Node Kaa server(version 0.10.0) in an Ubuntu 16.04 machine.
Followed the documentation given here
We were unable to connect to the admin UI after starting the kaa-node service.
On investigating further, we could see that the MongoDB and ZooKeeper services were not started, so we started those services manually. After that, we were able to connect to the Kaa admin UI. Do we need any additional steps to get these services running on kaa-node start?
I set up kaaproject with the guide for my Ubuntu 16.04.1 LTS VM, and ZooKeeper was not running by default on my server either, so I had to install the daemon (which also starts ZooKeeper on startup):
sudo apt-get install zookeeperd
Check if zookeeper is running:
netstat -ntlp | grep 2181
This should show a process listening on port 2181.
With MongoDB I had the problem that there was not enough space available for the journal files. I fixed this by increasing the available disk space and setting smallfiles=true in /etc/mongod.conf.
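For reference, the smallfiles tweak mentioned above would look like this in /etc/mongod.conf (this is the old-style key=value config format; newer YAML-style configs express the same thing under the storage section):

```
# /etc/mongod.conf
# Reduce journal file preallocation so mongod can start on small disks.
smallfiles=true
```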
Probably you have some trouble with the service configurations. Check whether auto-startup is enabled for MongoDB / ZooKeeper with the following command:
$ systemctl is-enabled ${service-name}
If the output is:
disabled
then auto-startup is disabled for the specified service, and you should run the following to enable it:
$ systemctl enable ${service-name}
