Oozie JA009 error on sub-action

I am experiencing the following situation: I have created and launched an Oozie workflow consisting of two Hive actions. Moments after the workflow starts, the first action gets a JA009 error and the workflow is marked as SUSPENDED. Now the interesting part: the first action actually continues running and succeeds, despite being marked with the above error; at this point the workflow is stuck and never moves on to the second action.
Any ideas on how to debug this?
Err msg:
JA009: Cannot initialize Cluster. Please check your configuration for
mapreduce.framework.name and the correspond server addresses.
Env info:
Oozie 4.2.0.2.5.3.0-37
Hadoop 2.7.3.2.5.3.0-37
Hive 1.2.1000.2.5.3.0-37

Related

Airflow Dataflow job status error even though the DataflowTemplate run is successful

I am orchestrating a Dataflow Template job via Composer, using DataflowTemplatedJobStartOperator and DataflowJobStatusSensor to run the job. I am getting the following error from the sensor operator.
Failure log of DataflowJobStatusSensor
job_status = job["currentState"]
KeyError: 'currentState'
Error
The Dataflow Template job runs successfully, but DataflowJobStatusSensor always fails with the error above. I have attached a screenshot of the whole orchestration.
[2022-02-11 04:18:11,057] {dataflow.py:100} INFO - Waiting for job to be in one of the states: JOB_STATE_DONE.
[2022-02-11 04:18:11,109] {credentials_provider.py:300} INFO - Getting connection using `google.auth.default()` since no key file is defined for hook.
[2022-02-11 04:18:11,776] {taskinstance.py:1152} ERROR - 'currentState'
Traceback (most recent call last):
Code
wait_for_job = DataflowJobStatusSensor(
    task_id="wait_for_job",
    job_id="{{task_instance.xcom_pull('start_x_job')['dataflow_job_id']}}",
    expected_statuses={DataflowJobStatus.JOB_STATE_DONE},
    location=gce_region
)
XCom value:
return_value
{"id": "2022-02-12_02_35_39-14489165686319399318", "projectId": "xx38", "name": "start-x-job-0b4921", "type": "JOB_TYPE_BATCH", "currentStateTime": "1970-01-01T00:00:00Z", "createTime": "2022-02-12T10:35:40.423475Z", "location": "us-xxx", "startTime": "2022-02-12T10:35:40.423475Z"}
Any clue why I am getting the 'currentState' error?
Thanks
After checking the documentation for version 1.10.15: it gives you the option to run Airflow providers (from version 2.0.*) on Airflow 1.10. So you shouldn't have issues; as described in my comments, you should be able to run example_dataflow, although you might need to update the code to reflect your version.
From what I see in your error message, have you also checked your credentials as described on the Google Cloud Connection page? Use the example, or a small DAG run using the operators, to test your connection. You can find video guides like this video. Remember that the credentials must be within reach of your Airflow application.
Also, if you are using google-dataflow-composer, you should be able to set up your credentials as shown in the DataflowTemplateOperator configuration.
As a final note, if you find it messy to move forward with the Airflow migration and latest updates, your best approach is to use the KubernetesPodOperator. In the short term, this lets you build an image with the latest updates; you only have to pass the credential info to the image, and you can keep updating your Docker image to the latest version and it will keep working regardless of the Airflow version you are using. It's a short-term solution, and you should still consider migrating to 2.0.*.
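To test the connection with a small DAG as suggested above, a minimal sketch along these lines could work, assuming Airflow 2.x with the Google provider installed; the project id, bucket, template path and region below are placeholders, not values from the question.

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.hooks.dataflow import DataflowJobStatus
from airflow.providers.google.cloud.operators.dataflow import DataflowTemplatedJobStartOperator
from airflow.providers.google.cloud.sensors.dataflow import DataflowJobStatusSensor

with DAG(
    dag_id="dataflow_connection_test",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    start_job = DataflowTemplatedJobStartOperator(
        task_id="start_x_job",
        template="gs://dataflow-templates/latest/Word_Count",   # placeholder template
        parameters={"inputFile": "gs://my-bucket/in.txt", "output": "gs://my-bucket/out"},
        project_id="my-gcp-project",          # placeholder
        location="us-central1",               # placeholder
    )

    # Check the operator's XCom in the UI first: the key you pull here must match
    # what the operator actually returns in your provider version.
    wait_for_job = DataflowJobStatusSensor(
        task_id="wait_for_job",
        job_id="{{ task_instance.xcom_pull('start_x_job')['dataflow_job_id'] }}",
        expected_statuses={DataflowJobStatus.JOB_STATE_DONE},
        project_id="my-gcp-project",          # placeholder
        location="us-central1",
    )

    start_job >> wait_for_job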

Is there any way to pass the error text of a failed Airflow task into another task?

I have a DAG defined that contains a number of tasks, the last of which is only run if any of the previous tasks fail. This task simply posts to a Slack channel that the DAG run experienced errors.
What I would really like is if the message sent to the Slack channel contained the actual error that is logged in the task logs, to provide immediate context to the error and perhaps save Ops from having to dig through the logs.
Is this at all possible?
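For reference, one common pattern is sketched below, assuming Airflow 2.x and that the Slack task is a PythonOperator with trigger_rule ONE_FAILED: each watched task records its exception via an on_failure_callback, and the notification task pulls those values by task id. The task ids are hypothetical and the actual Slack call is left as a placeholder.

from airflow.operators.python import PythonOperator
from airflow.utils.trigger_rule import TriggerRule

WATCHED_TASKS = ["extract", "transform", "load"]   # hypothetical upstream task ids

def record_error(context):
    # on_failure_callback: Airflow passes the raised exception in the context.
    context["task_instance"].xcom_push(key="error", value=str(context.get("exception")))

def notify_slack(**context):
    ti = context["task_instance"]
    errors = {t: ti.xcom_pull(task_ids=t, key="error") for t in WATCHED_TASKS}
    failed = {t: e for t, e in errors.items() if e}
    # Post `failed` to Slack here (e.g. via a Slack webhook); printed for the sketch.
    print(failed)

# Inside the DAG: set on_failure_callback=record_error on the watched tasks, then:
notify = PythonOperator(
    task_id="notify_slack",
    python_callable=notify_slack,
    trigger_rule=TriggerRule.ONE_FAILED,
)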

Control-M Prerequisites - Make job dependent on server availability

I want to know if I can add pre-requisite conditions for a job based on server availability. Suppose Job J runs from job server JS, and it interacts with two other servers SERVER1 and SERVER2.
I want to configure job J such that it runs only when SERVER1 and SERVER2 are available. In case any of the two servers is down, the job should wait for servers to come back online.
I don't know if this should be a comment or an answer, but what you are looking for is not natively available within Control-M.
The simplest solution I can think of for you is to configure a sleep job to run on SERVER1 and SERVER2 and have them as predecessors to job J. These sleep jobs will only run when the agents on SERVER1/2 are available, thereby confirming server availability prior to execution of job J.
Alternatively, you could write a script that loops until SERVER1/2 respond to pings and then completes, and configure this job as a predecessor to job J.
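A minimal sketch of such a wait-for-ping script follows (written in Python here for illustration; a plain shell script works just as well). The host names, timeout and polling interval are placeholders.

import subprocess
import sys
import time

HOSTS = ["SERVER1", "SERVER2"]
TIMEOUT_SECONDS = 3600           # give up eventually so job J fails visibly
POLL_INTERVAL_SECONDS = 30

def is_up(host):
    # One ICMP echo request; returns True when the host answers.
    return subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    ).returncode == 0

deadline = time.time() + TIMEOUT_SECONDS
while not all(is_up(h) for h in HOSTS):
    if time.time() > deadline:
        sys.exit(1)              # non-zero exit keeps job J from starting
    time.sleep(POLL_INTERVAL_SECONDS)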
I'm still a newbie in Control-M, but we have implemented a solution with similar goals: a job hook that probes the nodes.
Assume your target server (node) is called JS and it interacts with SERVER1 (let's call it node01). Any number of servers/nodes can be added later; let's look at just one node.
Overview of the components:
Jobs created to monitor changes and check the status in both the OK and NOT OK states
A quantitative resource created for each node, for example node01_run (or stacked, as you wish)
Jobs reference the quantitative resource "node01_run" and require at least 1 free unit
If everything is OK, the jobs run as expected
If downtime is detected, the quantitative resource (QR) is set to 0, so the affected jobs do not run
If the node is up again, the quantitative resource is set back to its original value (10, 100, 1000, ...) and the jobs run again as usual
Job name: node01_check_resource
Job Goal ---> Check whether the quantitative resource already exists
Job OS Command ---> ecaqrtab LIST node01_run
Result yes ---> do nothing,
Result no ---> Job node01_create_resource, command: ecaqrtab ADD node01_run 100 (or as many as you wish)
Job name: node01_check (cyclic)
Job Goal ---> Check whether the node is up
Job OS Command ---> However you define that your node is up: check a web service, check uptime, a WMI result, ping, ...
Result up ---> rerun the job in x minutes
Result down ---> go to the next job
Job name: node01_up-down
Job Goal ---> Handle the switch from status up to status down
Job OS Command ---> ecaqrtab UPDATE node01_run 0
On-Do action ---> when the job has ended, remove the condition so that node01_check cannot start again (as it is defined as a cyclic job)
Job name: node01_check_down (cyclic)
Job Goal ---> Check the status while the known status is down
Job OS Command ---> As defined in node01_check
Result down ---> Do nothing, as the job is defined as cyclic
Result up ---> Remove some conditions
Job name: node01_down-up
Job Goal ---> Switch everything back to normal operation
Job OS Command ---> ecaqrtab UPDATE node01_run 100
Action ---> Add the condition so that node01_check can run again
You can define such job hooks for multiple nodes, and in each job you can define which nodes must be up and running (meaning the quantitative resource is higher than 0). Multiple hosts can be monitored while still setting the same resource, as you wish.
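For illustration only, the logic of the cyclic check job can be condensed into a small script like the sketch below (Python assumed here; on the agent this would more typically be a shell script). The node name, QR values and the ping-based up-check are placeholder assumptions; only the ecaqrtab UPDATE call comes from the steps above.

import subprocess

NODE = "node01"                  # placeholder node name
QR_NAME = NODE + "_run"
QR_FULL_VALUE = "100"            # the "normal operation" value

def node_is_up():
    # Replace with your own check: web service probe, uptime, WMI query, ...
    return subprocess.run(
        ["ping", "-c", "1", NODE],
        stdout=subprocess.DEVNULL,
    ).returncode == 0

def set_qr(value):
    # ecaqrtab UPDATE <resource> <value> adjusts the quantitative resource.
    subprocess.run(["ecaqrtab", "UPDATE", QR_NAME, value], check=True)

if node_is_up():
    set_qr(QR_FULL_VALUE)        # node is back: jobs holding node01_run may run
else:
    set_qr("0")                  # node is down: block jobs that need node01_run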
I hope this helps further, unless you have found a suitable solution already.

BizTalk Batch Configuration Filter Not Saving

BizTalk Version 2010
I am trying to configure an EDI send batch, but my Filter is not saving (or at least not displaying) after I start the batch.
My filter is:
BTS.ReceivePortName == EDI210GencoExport_ReceivePort
Before Starting Batch
After Starting Batch
When I try to receive messages, I get this error:
{ABF67403-4F99-4DED-BF15-30B0C9EE4666}
{AC708D34-DCF8-4DA9-BE95-7DCE3A507F0D}
FILE
D:\BizTalkFiles\JohnDeere\EDI210GencoExport\ReadyForBiztalk*.xml
Receive Location_EDI210GencoExport_ReceivePort
The output message of the receive pipeline "EDI210GencoExport.ReceiveBatchPipeline, EDI210GencoExport, Version=1.0.0.0, Culture=neutral, PublicKeyToken=ffc03ec86640e930" failed routing because there is no subscribing orchestration or send port. The sequence number of the suspended message is 1.
What am I missing?
This is from the production version.
You can see the filter after the batch is started
What you are seeing is the expected behavior. The Filter is blank for a running batch. UI bug? Maybe; either way, it is still read-only.
However, as I pointed out elsewhere, I don't think BTS.ReceivePortName is available for the Batch Marker component to evaluate.
More importantly, you do need to use the Batch Marker Pipeline Component to resolve the Filter to the Batch.

Task with no status leads to DAG failure

I have a DAG that fetches data from Elasticsearch and ingests it into the data lake. The first task, BeginIngestion, fans out into several tasks (one for each resource), and these tasks fan out into more tasks (one for each shard). After the shards are fetched, the data is uploaded to S3 and the branches converge into a task EndIngestion, followed by a task AuditIngestion.
It used to execute correctly, but now all tasks complete successfully while the "closing" task EndIngestion remains with no status. When I refresh the webserver's page, the DAG is marked as Failed.
This image shows successful upstream tasks, with the task end_ingestion with no status and the DAG marked as Failed.
I also dug into the task instance details and found
Dagrun Running: Task instance's dagrun was not in the 'running' state but in the state 'failed'.
Trigger Rule: Task's trigger rule 'all_success' requires all upstream tasks to have succeeded, but found 1 non-success(es). upstream_tasks_state={'failed': 0, 'upstream_failed': 0, 'skipped': 0, 'done': 49, 'successes': 49}, upstream_task_ids=['s3_finish_upload_ingestion_raichucrud_complain', 's3_finish_upload_ingestion_raichucrud_interaction', 's3_finish_upload_ingestion_raichucrud_company', 's3_finish_upload_ingestion_raichucrud_user', 's3_finish_upload_ingestion_raichucrud_privatecontactinteraction', 's3_finish_upload_ingestion_raichucrud_location', 's3_finish_upload_ingestion_raichucrud_companytoken', 's3_finish_upload_ingestion_raichucrud_indexevolution', 's3_finish_upload_ingestion_raichucrud_companyindex', 's3_finish_upload_ingestion_raichucrud_producttype', 's3_finish_upload_ingestion_raichucrud_categorycomplainsto', 's3_finish_upload_ingestion_raichucrud_companyresponsible', 's3_finish_upload_ingestion_raichucrud_category', 's3_finish_upload_ingestion_raichucrud_additionalfieldoption', 's3_finish_upload_ingestion_raichucrud_privatecontactconfiguration', 's3_finish_upload_ingestion_raichucrud_phone', 's3_finish_upload_ingestion_raichucrud_presence', 's3_finish_upload_ingestion_raichucrud_responsible', 's3_finish_upload_ingestion_raichucrud_store', 's3_finish_upload_ingestion_raichucrud_socialprofile', 's3_finish_upload_ingestion_raichucrud_product', 's3_finish_upload_ingestion_raichucrud_macrorankingpresenceto', 's3_finish_upload_ingestion_raichucrud_macroinfoto', 's3_finish_upload_ingestion_raichucrud_raphoneproblem', 's3_finish_upload_ingestion_raichucrud_macrocomplainsto', 's3_finish_upload_ingestion_raichucrud_testimony', 's3_finish_upload_ingestion_raichucrud_additionalfield', 's3_finish_upload_ingestion_raichucrud_companypageblockitem', 's3_finish_upload_ingestion_raichucrud_rachatconfiguration', 's3_finish_upload_ingestion_raichucrud_macrorankingitemto', 's3_finish_upload_ingestion_raichucrud_purchaseproduct', 's3_finish_upload_ingestion_raichucrud_rachatproblem', 's3_finish_upload_ingestion_raichucrud_role', 's3_finish_upload_ingestion_raichucrud_requestmoderation', 's3_finish_upload_ingestion_raichucrud_categoryproblemto', 's3_finish_upload_ingestion_raichucrud_companypageblock', 's3_finish_upload_ingestion_raichucrud_problemtype', 's3_finish_upload_ingestion_raichucrud_key', 's3_finish_upload_ingestion_raichucrud_macro', 's3_finish_upload_ingestion_raichucrud_url', 's3_finish_upload_ingestion_raichucrud_document', 's3_finish_upload_ingestion_raichucrud_transactionkey', 's3_finish_upload_ingestion_raichucrud_catprobitemcompany', 's3_finish_upload_ingestion_raichucrud_privatecontactinteraction', 's3_finish_upload_ingestion_raichucrud_categoryinfoto', 's3_finish_upload_ingestion_raichucrud_marketplace', 's3_finish_upload_ingestion_raichucrud_macroproblemto', 's3_finish_upload_ingestion_raichucrud_categoryrankingto', 's3_finish_upload_ingestion_raichucrud_macrorankingto', 's3_finish_upload_ingestion_raichucrud_categorypageto']
As you can see, the "Trigger Rule" field says that one of the tasks is in a non-successful state, but at the same time the stats show that all upstream tasks are marked as successful.
If I reset the database, it doesn't happen, but I can't reset it for every execution (hourly). I also don't want to reset it.
Can anyone shed some light on this?
PS: I am running in an EC2 instance (c4.xlarge) with LocalExecutor.
[EDIT]
I found in the scheduler log that the DAG is in deadlock:
[2017-08-25 19:25:25,821] {models.py:4076} DagFileProcessor157 INFO - Deadlock; marking run failed
I guess this may be due to some exception handling.
I have had this exact issue before; in my case, my code was generating duplicate task ids. And it looks like in your case there is also a duplicate id:
s3_finish_upload_ingestion_raichucrud_privatecontactinteraction
This is probably a year late for you, but hopefully this will save others a lot of debugging time :)
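A sketch of how such a duplicate can slip in when the tasks are generated from a list, and one way to guard against it. The resource list, the lambda callable and the import path (Airflow 2.x) are purely illustrative; only the task-id prefix comes from the error message above. Recent Airflow versions raise DuplicateTaskIdFound for this, whereas the version used here evidently accepted the duplicate and deadlocked instead.

from airflow.operators.python import PythonOperator

resources = ["complain", "interaction", "privatecontactinteraction",
             "privatecontactinteraction"]          # duplicate entry -> duplicate task_id

seen = set()
upload_tasks = []
for resource in resources:
    task_id = "s3_finish_upload_ingestion_raichucrud_" + resource
    if task_id in seen:
        # Fail loudly instead of silently creating two tasks with the same id.
        raise ValueError("duplicate task_id: " + task_id)
    seen.add(task_id)
    upload_tasks.append(
        # `dag` is assumed to be your existing DAG object.
        PythonOperator(task_id=task_id, python_callable=lambda: None, dag=dag)
    )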
