What does the START_MANUAL status code mean in Oozie? - oozie

I've got a java action that's been suspended for 5 days with this status, and I don't know what oozie wants me to do. Any ideas?

It means that there was an error, possibly followed by multiple retries if you have retries configured. When retries are configured, the job goes to START_RETRY after the first error and Oozie retries automatically.
After the maximum number of retries is reached, the action moves to the START_MANUAL state. In that state, Oozie assumes you will intervene and either fix the error and resume the job, or kill the job.
There is a little more info in the documentation: http://oozie.apache.org/docs/3.3.2/WorkflowFunctionalSpec.html#a18_User-Retry_for_Workflow_Actions_since_Oozie_3.1
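Once the underlying problem is fixed, the stuck workflow is cleared by resuming it (or killed if it cannot be recovered), e.g. via `oozie job -resume <job-id>` or the web-service API. The sketch below shows the REST variant, assuming Oozie's v2 web-service API; the host and job id are placeholders:

```python
# Sketch (assumptions: Oozie REST API v2; host and job id are placeholders).
# A START_MANUAL action is cleared by resuming the workflow after the fix,
# or the whole job can be killed.
from urllib import request, parse

def oozie_action_url(base_url, job_id, action):
    """Build the Oozie web-service URL for a job management action."""
    if action not in ("resume", "kill", "suspend", "rerun"):
        raise ValueError("unsupported action: %s" % action)
    return "%s/v2/job/%s?%s" % (base_url.rstrip("/"), job_id,
                                parse.urlencode({"action": action}))

def manage_job(base_url, job_id, action):
    """Issue the PUT request Oozie expects for job management calls."""
    req = request.Request(oozie_action_url(base_url, job_id, action),
                          method="PUT")
    return request.urlopen(req)  # raises HTTPError on failure

# Usage (hypothetical host and job id):
# manage_job("http://oozie-host:11000/oozie", "0000001-170825-oozie-W", "resume")
```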

Related

Autosys retry job if main job fails

I have my Spring Boot application hosted on PCF, and I have two Autosys jobs defined:
Main job: hits the POST request. There is a database condition check; if it fails, the POST request returns HTTP status 425. If I get anything other than HTTP status 200, the retry job should run.
Retry job: also hits the POST request, but only if the main job fails (here failing means not getting an HTTP 200 status response).
However, even when the HTTP status from the main job is not 200, the Autosys job run still shows SUCCESS. How can I get a FAILURE status for the main job for anything other than HTTP status 200? Please help; if more details are required, let me know.
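One common approach (a sketch, not a definitive fix): Autosys marks a command job FAILURE when the command exits non-zero, so the main job can invoke a small wrapper that translates the HTTP status into an exit code instead of calling the endpoint directly. The URL is a placeholder:

```python
# Sketch: wrap the POST call so the process exit code reflects the HTTP
# status; Autosys then reports FAILURE for any non-zero exit.
import sys
from urllib import request
from urllib.error import HTTPError

def exit_code_for_status(status):
    """0 (SUCCESS) only for HTTP 200; anything else is a FAILURE."""
    return 0 if status == 200 else 1

def call_endpoint(url):
    """POST to the service and return the HTTP status code."""
    req = request.Request(url, data=b"", method="POST")
    try:
        with request.urlopen(req) as resp:
            return resp.status
    except HTTPError as err:  # 4xx/5xx responses arrive as exceptions
        return err.code

if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(exit_code_for_status(call_endpoint(sys.argv[1])))
```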

dp:url-open's timeout value is getting ignored in datapower

I am providing a timeout of one second; however, when the URL is down it is taking 120+ seconds for the response to come. Is there some variable or setting that overrides the timeout in dp:url-open?
Update: I was calling dp:url-open in the request transformation as well as in the response transformation. So the overriding timeout is 60 seconds per call; adding both sides, it became 120 seconds.
Here's how I am calling this (I am storing the time before and after dp:url-open calls, and then returning them in the response):
Case 1: When the url is reachable I am getting a result like:
Case 2: When url is not reachable:
Update: FIXED: It seems the port I was using was timing out in the firewall first, which is where the first minute was spent. I had been trying to hit an application running on port 8077; after I changed that to 8088, I started seeing the same timeout that I was passing.
The dp:url-open() timeout only affects the operation done in the script, not the service itself. It depends on how you have built the solution, but the timeout from dp:url-open() should be honored.
You can check this by setting logs to debug and adding an <xsl:message>Before url-open</xsl:message> and one after, to see in the log whether it is your url-open call or the service that waits 120+ seconds.
If it is the url-open, you most likely have an error in the script; if it is the service that holds up the response, you need to return from the script (or throw an error, depending on your needs) to stop the service from waiting.
You can set the timeout for the service itself, or set a timeout in the User Agent for the specific URL you are calling.
Please note that a service-level timeout will terminate the service after that time, so 1 second would not be recommended there!

Is there any way to pass the error text of a failed Airflow task into another task?

I have a DAG defined that contains a number of tasks, the last of which is only run if any of the previous tasks fail. This task simply posts to a Slack channel that the DAG run experienced errors.
What I would really like is if the message sent to the Slack channel contained the actual error that is logged in the task logs, to provide immediate context to the error and perhaps save Ops from having to dig through the logs.
Is this at all possible?
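Yes, this is possible via a failure callback. A minimal sketch, assuming Airflow's `on_failure_callback` hook, whose context includes the raised exception and the task instance; the actual Slack posting is left as a stub since it depends on your setup (webhook, SlackWebhookOperator, etc.):

```python
# Sketch: build a Slack message from an Airflow failure-callback context.
# The context keys used here ("task_instance", "exception") are what Airflow
# passes to on_failure_callback; the posting call is a placeholder.
def format_failure_message(context):
    """Turn an Airflow failure-callback context into a Slack message."""
    ti = context["task_instance"]
    return ":x: DAG {dag} task {task} failed: {err}".format(
        dag=ti.dag_id, task=ti.task_id, err=context.get("exception"))

def notify_slack(context):
    """Attach as on_failure_callback=notify_slack in default_args."""
    message = format_failure_message(context)
    # Replace with a real Slack post (webhook, SlackWebhookOperator, ...).
    return message
```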

Task with no status leads to DAG failure

I have a DAG that fetches data from Elasticsearch and ingests it into the data lake. The first task, BeginIngestion, fans out into several tasks (one for each resource), and those tasks fan out into more tasks (one for each shard). After the shards are fetched, the data is uploaded to S3 and the flow converges into a task EndIngestion, followed by a task AuditIngestion.
It was executing correctly, but now all tasks execute successfully while the "closing task" EndIngestion remains with no status. When I refresh the webserver's page, the DAG is marked as Failed.
This image shows the successful upstream tasks, with the task end_ingestion having no status and the DAG marked as Failed.
I also dug into the task instance details and found
Dagrun Running: Task instance's dagrun was not in the 'running' state but in the state 'failed'.
Trigger Rule: Task's trigger rule 'all_success' requires all upstream tasks to have succeeded, but found 1 non-success(es). upstream_tasks_state={'failed': 0, 'upstream_failed': 0, 'skipped': 0, 'done': 49, 'successes': 49}, upstream_task_ids=['s3_finish_upload_ingestion_raichucrud_complain', 's3_finish_upload_ingestion_raichucrud_interaction', 's3_finish_upload_ingestion_raichucrud_company', 's3_finish_upload_ingestion_raichucrud_user', 's3_finish_upload_ingestion_raichucrud_privatecontactinteraction', 's3_finish_upload_ingestion_raichucrud_location', 's3_finish_upload_ingestion_raichucrud_companytoken', 's3_finish_upload_ingestion_raichucrud_indexevolution', 's3_finish_upload_ingestion_raichucrud_companyindex', 's3_finish_upload_ingestion_raichucrud_producttype', 's3_finish_upload_ingestion_raichucrud_categorycomplainsto', 's3_finish_upload_ingestion_raichucrud_companyresponsible', 's3_finish_upload_ingestion_raichucrud_category', 's3_finish_upload_ingestion_raichucrud_additionalfieldoption', 's3_finish_upload_ingestion_raichucrud_privatecontactconfiguration', 's3_finish_upload_ingestion_raichucrud_phone', 's3_finish_upload_ingestion_raichucrud_presence', 's3_finish_upload_ingestion_raichucrud_responsible', 's3_finish_upload_ingestion_raichucrud_store', 's3_finish_upload_ingestion_raichucrud_socialprofile', 's3_finish_upload_ingestion_raichucrud_product', 's3_finish_upload_ingestion_raichucrud_macrorankingpresenceto', 's3_finish_upload_ingestion_raichucrud_macroinfoto', 's3_finish_upload_ingestion_raichucrud_raphoneproblem', 's3_finish_upload_ingestion_raichucrud_macrocomplainsto', 's3_finish_upload_ingestion_raichucrud_testimony', 's3_finish_upload_ingestion_raichucrud_additionalfield', 's3_finish_upload_ingestion_raichucrud_companypageblockitem', 's3_finish_upload_ingestion_raichucrud_rachatconfiguration', 's3_finish_upload_ingestion_raichucrud_macrorankingitemto', 's3_finish_upload_ingestion_raichucrud_purchaseproduct', 
's3_finish_upload_ingestion_raichucrud_rachatproblem', 's3_finish_upload_ingestion_raichucrud_role', 's3_finish_upload_ingestion_raichucrud_requestmoderation', 's3_finish_upload_ingestion_raichucrud_categoryproblemto', 's3_finish_upload_ingestion_raichucrud_companypageblock', 's3_finish_upload_ingestion_raichucrud_problemtype', 's3_finish_upload_ingestion_raichucrud_key', 's3_finish_upload_ingestion_raichucrud_macro', 's3_finish_upload_ingestion_raichucrud_url', 's3_finish_upload_ingestion_raichucrud_document', 's3_finish_upload_ingestion_raichucrud_transactionkey', 's3_finish_upload_ingestion_raichucrud_catprobitemcompany', 's3_finish_upload_ingestion_raichucrud_privatecontactinteraction', 's3_finish_upload_ingestion_raichucrud_categoryinfoto', 's3_finish_upload_ingestion_raichucrud_marketplace', 's3_finish_upload_ingestion_raichucrud_macroproblemto', 's3_finish_upload_ingestion_raichucrud_categoryrankingto', 's3_finish_upload_ingestion_raichucrud_macrorankingto', 's3_finish_upload_ingestion_raichucrud_categorypageto']
As you can see, the "Trigger Rule" field says that one of the tasks is in a non-successful state, but at the same time the stats show that all upstream tasks are marked as successful.
If I reset the database, it doesn't happen, but I can't reset it for every execution (hourly). I also don't want to reset it.
Can anyone shed some light on this?
PS: I am running in an EC2 instance (c4.xlarge) with LocalExecutor.
[EDIT]
I found in the scheduler log that the DAG is in deadlock:
[2017-08-25 19:25:25,821] {models.py:4076} DagFileProcessor157 INFO - Deadlock; marking run failed
I guess this may be due to some exception treatment.
I have had this exact issue before; in my case, my code was generating duplicate task IDs. It looks like there is a duplicate ID in your case as well:
s3_finish_upload_ingestion_raichucrud_privatecontactinteraction
This is probably a year late for you, but hopefully this will save others lots of debugging time :)
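When task IDs are generated in a loop, a quick pre-flight check can catch this kind of bug before the DAG is built (newer Airflow versions also reject duplicate task IDs at DAG construction time). A small sketch, with illustrative names taken from the list above:

```python
# Sketch: detect duplicate task ids in a generated list, the kind of bug
# that produced the deadlock above. The ids below are illustrative.
from collections import Counter

def find_duplicate_task_ids(task_ids):
    """Return the task ids that occur more than once, sorted."""
    return sorted(tid for tid, n in Counter(task_ids).items() if n > 1)

upstream = [
    "s3_finish_upload_ingestion_raichucrud_complain",
    "s3_finish_upload_ingestion_raichucrud_privatecontactinteraction",
    "s3_finish_upload_ingestion_raichucrud_user",
    "s3_finish_upload_ingestion_raichucrud_privatecontactinteraction",  # dup
]
print(find_duplicate_task_ids(upstream))
# -> ['s3_finish_upload_ingestion_raichucrud_privatecontactinteraction']
```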

Async Handler for Update Contract States Operation encountered some errors

The job called "Contract States job" fails every time it runs with the following message.
"Async Handler for Update Contract States Operation encountered some
errors, see log for more detail.Detail: "
I've even tried setting trace to verbose, but I can't see the log or any errors explaining why this job fails every time. This job is crucial to the contract entity, as contracts need to renew (via a plugin) promptly in order to generate contract-detail lines in time for other processes to follow.
Could someone please point me to this log, or to how I can trap the error that is causing this job to fail?
Trace displayed errors on some contracts that could not update their state because the TotalPrice was not equal to the sum of their contract lines. Further diffing showed that another "unsupported script" would change contract states, thus causing this operation to fail on those contracts.
The fix was to look at that "unsupported script" instead.