Oozie stuck in RUNNING status - oozie

I use Oozie to run a workflow, but the simple official shell example (echo hello oozie) gets stuck in the RUNNING state and never ends. The workflow can be submitted, but it stays in RUNNING, and there are no errors in the job log in the Oozie UI.
When I submit a shell action that runs spark-submit, the job is never submitted and cannot be seen in the Spark UI. I suspect the shell script didn't run at all.
What could the problem be?

A Quick Checklist
For those who have the same problem, here is a checklist for checking your system. Hope it helps!
Check jobTracker in your Oozie configuration. Note: if a job has run successfully before, jobTracker is probably not the problem. Related discussion can be found here.
Check your disk usage. If disk usage is greater than 90%, remove some files to bring it below 90%. (That was my case!)
Check the Console URL of the stuck action. It can be found in Job - Job Info tab - Actions - Action - Action Info tab. The job state there may help you find the problem.
Check the Oozie log, typically in /usr/local/oozie/logs. Look through oozie.log* for exceptions. (A shell sketch for the disk and log checks follows this list.)
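For the disk and log checks, a couple of shell one-liners can save time. A minimal sketch, assuming default paths (adjust them to your install):
# check disk usage per partition; anything at 90% or above is suspect
df -h
# scan the Oozie server logs for exceptions
grep -i exception /usr/local/oozie/logs/oozie.log*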
Details
Disk usage
If your action state is
YarnApplicationState: ACCEPTED: waiting for AM container to be allocated, launched and register with RM.
That may be a disk problem. Related discussion can be found in MapReduce job hangs, waiting for AM container to be allocated. Solutions can be found in Why does Hadoop report "Unhealthy Node local-dirs and log-dirs are bad"?.
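If you suspect unhealthy NodeManagers, a quick way to confirm is the YARN CLI; this is a sketch using standard commands, and the 90% figure is the stock yarn-site.xml default:
# list all nodes, including unhealthy ones, with their health report
yarn node -list -all
# the threshold that marks local-dirs/log-dirs as bad defaults to 90%
# and is controlled by this yarn-site.xml property:
#   yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage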

Related

Google Cloud Composer (Apache Airflow) cannot access log files

I'm running a DAG in Google Cloud Composer (hosted Airflow) which runs fine in Airflow locally. All it does is print "Hello World". However, when I run it through Cloud Composer I receive the error:
*** Log file does not exist: /home/airflow/gcs/logs/matts_custom_dag/main_test/2020-04-20T23:46:53.652833+00:00/2.log
*** Fetching from: http://airflow-worker-d775d7cdd-tmzj9:8793/log/matts_custom_dag/main_test/2020-04-20T23:46:53.652833+00:00/2.log
*** Failed to fetch log file from worker. HTTPConnectionPool(host='airflow-worker-d775d7cdd-tmzj9', port=8793): Max retries exceeded with url: /log/matts_custom_dag/main_test/2020-04-20T23:46:53.652833+00:00/2.log (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f8825920160>: Failed to establish a new connection: [Errno -2] Name or service not known',))
I've also tried making the DAG add data into a database, and it actually succeeds 50% of the time. However, it always returns this error message (with no other print statements or logs). Any help on why this might be happening would be much appreciated.
We also faced the same issue, then raised a support ticket to GCP and got the following reply:
The message is related to the latency of syncing logs from Airflow workers to the webserver; it takes at least a few minutes (depending on the number of objects and their size).
The total log size is not large, but it's enough to noticeably slow down synchronization, so we recommend cleaning up/archiving the logs.
Basically we recommend relying on Stackdriver logs instead, because of the latency due to the design of this sync.
I hope this will help you solve the problem.
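If you follow the Stackdriver recommendation, here is a hedged sketch of pulling worker logs from the command line (the environment name is a placeholder; the filter labels are the usual ones for Composer, but verify them in your project):
# read recent Airflow worker logs from Stackdriver / Cloud Logging
gcloud logging read 'resource.type="cloud_composer_environment" AND resource.labels.environment_name="my-environment" AND log_name:"airflow-worker"' --limit=50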
I had the same problem after upgrading Google Composer from 1.10.3 to 1.10.6.
I can see in my logs that Airflow is trying to fetch the logs from a bucket whose name ends with -tenant, while the bucket in my account ends with -bucket.
In the configuration, I can see something weird too.
## airflow.cfg
[core]
remote_base_log_folder = gs://us-east1-dada-airflow-xxxxx-bucket/logs
## but the running configuration shows
core remote_base_log_folder gs://us-east1-dada-airflow-xxxxx-tenant/logs env var
I wrote to google support and they said the team is working on a fix.
EDIT:
I've been accessing my logs with gsutil, replacing the bucket name suffix with -bucket:
gsutil cat gs://us-east1-dada-airflow-xxxxx-bucket/logs/...../5.logs
I faced the same situation on multiple occasions.
As soon as a job finished, looking at its log in the Airflow web UI would give me the same error; but when I checked the same logs in the UI after a minute or two, I could see them properly.
As per the above answers, it's a sync issue between the webserver and the worker node.
In general, the issue described here is more of a sporadic one.
In certain situations, what can help is setting default_task_retries to a value that allows a task to be retried at least once (see the sketch below).
This issue has been resolved at least since Airflow version 1.10.10+composer.
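For the retry mitigation mentioned above, one hedged way to set it on Composer (assuming Airflow 1.10.6+, where [core] default_task_retries exists; the environment name and location are placeholders) is an Airflow configuration override:
# give every task one retry by default
gcloud composer environments update my-environment \
  --location us-east1 \
  --update-airflow-configs=core-default_task_retries=1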

How to keep indexed a Maniphest task after editing its title

After a new Maniphest task has been created, chances are you may need to change its title to one with different keywords. However, upon editing the title, the task cannot be found by its new keywords, only by the old ones.
After manually reindexing the database, the edited tasks can be found again, but further title changes will fail again until a new reindexing is issued.
I suppose the normal behavior is that tasks should be findable at any time by searching their title, without reindexing the database. Should I expect a different behavior from Maniphest?
Phabricator Version:
phabricator cb033673b6eb3dc8330d2ddea0fd358eae3b939a (Nov 16 2018)
The usual culprit is that your Phabricator daemons (background workers) aren't running.
From the phabricator directory:
# Check the status of daemons:
./bin/phd status
# (re)start the daemons:
./bin/phd restart
See Managing Daemons with PHD. You can also try looking at the daemon console, which should be reachable at https://your.phabricator.url/daemon/; it shows the queue of jobs, so you can see whether jobs are failing for some reason.
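If the index is already stale, you can also rebuild it manually once the daemons are healthy; this uses the standard bin/search tool from the phabricator directory (the exact --type spelling can vary between versions):
# reindex all Maniphest tasks
./bin/search index --type TASK
# or reindex everything (slow on large installs)
./bin/search index --all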

Symfony - background task from form setup

Do you know how to run a background task in Symfony 4, triggered by the submission of a form? This would avoid the user having to stay on the form until the task is finished.
The idea is that when the form is validated, it starts an independent background task. The user can then continue browsing and come back once the task is finished to get the results.
Thanks for your help,
You need to use the Message Bus pattern. Symfony has had its own implementation of this pattern since version 4.1, which introduced the Messenger component.
You can see the documentation here: https://symfony.com/doc/current/components/messenger.html
To make it work you need an external program that implements the AMQP protocol; the most popular in the PHP world, IMHO, is RabbitMQ.
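A minimal sketch of the moving parts from the command line, assuming the Messenger component with a transport named async in your messenger.yaml (the transport name is an assumption):
# install the component
composer require symfony/messenger
# run a worker that consumes queued messages in the background
# (note: in Symfony 4.1/4.2 the command was messenger:consume-messages)
php bin/console messenger:consume async -vv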
A very simple alternative could be the following procedure:
The form is validated.
A temporary file is created.
A cronjob executes every five minutes and starts a Symfony command.
The command checks whether the file exists and whether it is empty.
If so, the command works off the background task. Before doing so, it writes its process ID into the file to prevent it from being executed a second time.
The file is removed when the command has finished.
As long as the file exists, you can show the user a hint that the task is running. A sketch of the cron side follows below.
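A hedged sketch of the cron side of this pattern; the paths and the app:process-task command name are made up for illustration:
#!/bin/sh
# runs from cron every five minutes; proceeds only if the form created
# the marker file and no earlier run has claimed it yet
LOCK=/var/www/app/var/task.lock
if [ -f "$LOCK" ] && [ ! -s "$LOCK" ]; then
    echo $$ > "$LOCK"                        # claim the file with our PID
    /var/www/app/bin/console app:process-task
    rm -f "$LOCK"                            # release once finished
fi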

Is there a way to revive a coordinator?

An Oozie coordinator we own was killed for operational reasons about a week ago. The cluster is now back up, running, and ready for business. Can we revive the coordinator somehow so it keeps its run history and backfills all missing runs, or do we have to schedule a brand-new one?
oozie job -resume xxxxxxx-xxxxxxxxxxxxxxx-oozie-oozi-C doesn't error out, but it also doesn't change the status of the coordinator back to RUNNING.
Have you tried out the killed -> ignored -> running transition? Based on the docs it should be possible.
It's a two-step process: the first step is based on -ignore, the second on -change.
I've never tried to do this though :)
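For reference, the two steps would look roughly like this (a sketch only: it assumes an Oozie version with -ignore support, reuses the placeholder job ID from the question, and I have not verified the resulting status transition):
# step 1: move the KILLED coordinator to IGNORED
oozie job -ignore xxxxxxx-xxxxxxxxxxxxxxx-oozie-oozi-C
# step 2: issue a -change (e.g. extend the end time) to get it materializing again
oozie job -change xxxxxxx-xxxxxxxxxxxxxxx-oozie-oozi-C -value endtime=2021-12-31T00:00Z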

How to reschedule a coordinator job in OOZIE without restarting the job?

When I changed the start time of a coordinator job in job.properties in Oozie, the job did not pick up the changed time; instead it keeps running at the old scheduled time.
Old job.properties:
startMinute=08
startTime=${startDate}T${startHour}:${startMinute}Z
New job.properties:
startMinute=07
startTime=${startDate}T${startHour}:${startMinute}Z
The job is not running at the changed time (the 7th minute); it still runs at the 8th minute of every hour.
Can you please let me know how I can make the job pick up the updated properties (the changed timing) without restarting or killing it?
You can't really change the timing of the coordinator via any method provided by Oozie (v3.3.2). When you submit a job, its properties are stored in the database, whereas the actual workflow lives in HDFS.
Every time the coordinator executes, the workflow must be present at the path specified in the properties at submission time, but the properties file itself is not needed. In other words, the properties file does not come into the picture after the job is submitted.
One hack is to update the time directly in the database with a SQL query, but I am not sure about the implications; the property might become inconsistent across the database.
You have to kill the job and resubmit a new one.
Note: Oozie does provide a way to change the concurrency, end time, and pause time, as specified in the official docs; an example follows below.
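For completeness, the documented -change usage looks like this (the job ID is a placeholder and the values are examples; note that start time is not among the changeable properties):
# change end time, concurrency and pause time of a running coordinator
oozie job -change <coord-job-id> -value endtime=2014-12-01T05:00Z\;concurrency=100\;pausetime=2014-12-01T06:00Z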
