Airflow Dataflow job status error even though the Dataflow template run is successful - airflow

I am orchestrating a Dataflow template job via Composer, using DataflowTemplatedJobStartOperator and DataflowJobStatusSensor to run the job. I am getting the following error from the sensor operator.
The Dataflow template job runs successfully, but the DataflowJobStatusSensor always fails with the error below. I have attached a screenshot of the whole orchestration.
Failure log of DataflowJobStatusSensor
[2022-02-11 04:18:11,057] {dataflow.py:100} INFO - Waiting for job to be in one of the states: JOB_STATE_DONE.
[2022-02-11 04:18:11,109] {credentials_provider.py:300} INFO - Getting connection using `google.auth.default()` since no key file is defined for hook.
[2022-02-11 04:18:11,776] {taskinstance.py:1152} ERROR - 'currentState'
Traceback (most recent call last):
    job_status = job["currentState"]
KeyError: 'currentState'
Code
wait_for_job = DataflowJobStatusSensor(
    task_id="wait_for_job",
    job_id="{{task_instance.xcom_pull('start_x_job')['dataflow_job_id']}}",
    expected_statuses={DataflowJobStatus.JOB_STATE_DONE},
    location=gce_region,
)
XCom value (return_value):
{"id": "2022-02-12_02_35_39-14489165686319399318", "projectId": "xx38", "name": "start-x-job-0b4921", "type": "JOB_TYPE_BATCH", "currentStateTime": "1970-01-01T00:00:00Z", "createTime": "2022-02-12T10:35:40.423475Z", "location": "us-xxx", "startTime": "2022-02-12T10:35:40.423475Z"}
Any clue why I am getting the 'currentState' error?
Thanks

After checking the documentation for version 1.10.15, it gives you the option to run Airflow providers (from version 2.0.*) on Airflow 1.10. So you shouldn't have issues; as described in my comments, you should be able to run example_dataflow, although you might need to update the code to reflect your version.
From what I see in your error message, have you also checked your credentials as described on the Google Cloud Connection page? Use the example or a small DAG run using the operators to test your connection. You can find video guides like this video. Remember that the credentials must be within reach of your Airflow application.
Also, if you are using google-dataflow-composer, you should be able to set up your credentials as shown in the DataflowTemplateOperator configuration.
As a final note, if you find it messy to move forward with the Airflow migration and the latest updates, your best approach is to use the KubernetesPodOperator. In the short term, this lets you build an image with the latest updates; you only need to pass the credential info to the image, and you can keep updating your Docker image to the latest version and it will keep working regardless of the Airflow version you are using. It's a short-term solution; you should still consider migrating to 2.0.*.
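For reference, here is a minimal sketch of how the start operator and the sensor are typically wired together, assuming Airflow 2.x with the apache-airflow-providers-google package installed; the project, bucket, template path and region values are placeholders, not taken from your environment:

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.hooks.dataflow import DataflowJobStatus
from airflow.providers.google.cloud.operators.dataflow import DataflowTemplatedJobStartOperator
from airflow.providers.google.cloud.sensors.dataflow import DataflowJobStatusSensor

with DAG(
    dag_id="dataflow_template_example",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    start_x_job = DataflowTemplatedJobStartOperator(
        task_id="start_x_job",
        template="gs://my-bucket/templates/my-template",  # placeholder
        project_id="my-project",                          # placeholder
        location="us-central1",                           # placeholder
        # In recent provider versions this keeps the operator from blocking,
        # leaving the waiting to the sensor below.
        wait_until_finished=False,
    )

    # The operator's return value is pushed to XCom. Depending on the provider
    # version, the job id may be exposed as 'dataflow_job_id' or as 'id' (the
    # XCom dump in the question shows 'id'), so adjust the key accordingly.
    wait_for_job = DataflowJobStatusSensor(
        task_id="wait_for_job",
        job_id="{{ task_instance.xcom_pull('start_x_job')['dataflow_job_id'] }}",
        expected_statuses={DataflowJobStatus.JOB_STATE_DONE},
        project_id="my-project",                          # placeholder
        location="us-central1",                           # placeholder
    )

    start_x_job >> wait_for_job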

Related

Airflow ECS-Operator not fetching CloudWatch Logs

I'm using Airflow's EcsOperator, with ECS tasks writing to CloudWatch.
Sometimes Airflow's log fetcher collects the logs from CloudWatch and sometimes it does not.
On the CloudWatch console, I always see the logs.
On tasks that take a long time, I usually see the log, or at least part of it.
Has anyone had the same issue with ECSOperator?
First, ECSOperator is deprecated and removed in provider version 5.0.0.
You should switch to EcsRunTaskOperator.
In EcsRunTaskOperator there is awslogs_fetch_interval, which controls the interval at which logs are fetched from ECS. The default is 30 seconds. If you want more frequent polls, set the parameter accordingly.
You didn't mention which provider version you are on, but this part of the code was refactored in version 5.0.0 (PR), so upgrading the Amazon provider might also resolve your issue.
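For illustration, a minimal sketch of the log-related parameters, assuming a recent Amazon provider; the cluster, task definition, container and log group names are placeholders:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.amazon.aws.operators.ecs import EcsRunTaskOperator

with DAG(
    dag_id="ecs_logs_example",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    run_task = EcsRunTaskOperator(
        task_id="run_ecs_task",
        cluster="my-cluster",                      # placeholder
        task_definition="my-task-def",             # placeholder
        launch_type="FARGATE",
        overrides={"containerOverrides": []},
        # These must match the awslogs configuration of the container
        # definition, otherwise there is nothing to fetch.
        awslogs_group="/ecs/my-task-def",          # placeholder
        awslogs_stream_prefix="ecs/my-container",  # placeholder
        awslogs_region="us-east-1",                # placeholder
        # Poll CloudWatch more often than the 30-second default.
        awslogs_fetch_interval=timedelta(seconds=5),
    )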

Airflow dag cannot find connection-id

I am managing a Google Cloud Composer environment which runs Airflow for a data engineering team. I have recently been asked to troubleshoot one of the DAGs they run, which is failing with this error: [12:41:18,119] {credentials_utils.py:23} WARNING - [redacted-name] connection ID not available, falling back to Google default credentials
The job is basically a data pipeline which reads from various sources and stores data into GBQ. The odd part is that they have a strictly similar DAG running for a different project and it works perfectly.
I have recreated the .json credentials for the service account behind the connection, as well as the connection itself in Airflow. I have sanitized the code to check for any hidden spaces and the like.
My knowledge of Airflow is limited and I have not been able to find any similar issue in my research. Has anyone encountered this before?
So the DE team came back to me saying it was actually a deployment issue: an internal module involved in service account authentication was being used inside another DAG running in the staging environment, which made it impossible to fetch credentials from the connection ID.
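As a quick sanity check when you hit that warning, you can confirm which environment actually has the connection defined. A minimal sketch, assuming Airflow 2.x, where "my_gcp_conn" stands in for the redacted connection ID:

from airflow.hooks.base import BaseHook

def check_connection(conn_id: str = "my_gcp_conn"):
    # Raises AirflowNotFoundException if the connection is not defined in
    # the environment this code actually runs in (worker vs. local, etc.).
    conn = BaseHook.get_connection(conn_id)
    print(conn.conn_type, list(conn.extra_dejson.keys()))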

Airflow - backend database to get reports

I am trying to get some useful information from the Airflow backend. I need the following details:
How many times a particular job failed.
Which task has failed over and over.
The problem is that all our tasks depend on their upstream tasks, so when one fails, we fix the issue and mark it as success. This changes the status in the database as well. Is there a place where I can get historical records?
The following query shows which task failed. However, if I mark it as success from the UI, the status is updated in the database as well, and I have no way to tell that it had failed.
select * from dag_run where dag_id='test_spark'
Currently, there is no native way, but you can check the log table -- it adds a record whenever there is an action via the CLI or UI.
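A sketch of reading that table through Airflow's ORM models, assuming Airflow 2.x and access to the metadata database; it lists the audit events recorded for a DAG, including UI actions such as marking a task as success:

from airflow.models.log import Log
from airflow.utils.session import create_session

def audit_events(dag_id: str):
    # Each row is an audit record (timestamp, task, event name, user) that
    # survives later state changes in the dag_run/task_instance tables.
    with create_session() as session:
        return (
            session.query(Log.dttm, Log.task_id, Log.event, Log.owner)
            .filter(Log.dag_id == dag_id)
            .order_by(Log.dttm.desc())
            .all()
        )

for dttm, task_id, event, owner in audit_events("test_spark"):
    print(dttm, task_id, event, owner)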

Google Cloud Composer (Apache Airflow) cannot access log files

I'm running a DAG in Google Cloud Composer (hosted Airflow) which runs fine in Airflow locally. All it does is print "Hello World". However, when I run it through Cloud Composer I receive the error:
*** Log file does not exist: /home/airflow/gcs/logs/matts_custom_dag/main_test/2020-04-20T23:46:53.652833+00:00/2.log
*** Fetching from: http://airflow-worker-d775d7cdd-tmzj9:8793/log/matts_custom_dag/main_test/2020-04-20T23:46:53.652833+00:00/2.log
*** Failed to fetch log file from worker. HTTPConnectionPool(host='airflow-worker-d775d7cdd-tmzj9', port=8793): Max retries exceeded with url: /log/matts_custom_dag/main_test/2020-04-20T23:46:53.652833+00:00/2.log (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f8825920160>: Failed to establish a new connection: [Errno -2] Name or service not known',))
I've also tried making the DAG add data into a database, and it actually succeeds 50% of the time. However, it always returns this error message (and no other print statements or logs). Any help on why this might be happening would be much appreciated.
We also faced the same issue, then raised a support ticket with GCP and got the following reply:
The message is related to the latency of syncing logs from Airflow workers to the webserver; it takes at least a few minutes (depending on the number of objects and their size).
The total log size does not seem large, but it is enough to noticeably slow down synchronization, hence we recommend cleaning up/archiving the logs.
Basically, we recommend relying on Stackdriver logs instead, because of the latency due to the design of this sync.
I hope this will help you solve the problem.
I had the same problem after upgrading Google Composer from 1.10.3 to 1.10.6.
I can see in my logs that Airflow is trying to get the logs from a bucket whose name ends with -tenant, while the bucket in my account ends with -bucket.
In the configuration, I can see something weird too.
## airflow.cfg
[core]
remote_base_log_folder = gs://us-east1-dada-airflow-xxxxx-bucket/logs
## also in the running configuration says
core remote_base_log_folder gs://us-east1-dada-airflow-xxxxx-tenant/logs env var
I wrote to google support and they said the team is working on a fix.
EDIT:
I've been accessing my logs with gsutil, replacing the bucket name suffix with -bucket:
gsutil cat gs://us-east1-dada-airflow-xxxxx-bucket/logs/...../5.logs
I have faced the same situation on multiple occasions.
As soon as the job finished, looking at the log in the Airflow web UI would give me the same error, but when I checked the same logs in the UI a minute or two later, I could see them properly.
As per the above answers, it's a sync issue between the webserver and the worker node.
In general, the issue described here should be more of a sporadic issue.
In certain situations, what could help is setting default-task-retries to a value that allows for retrying a task at least once.
This issue is resolved at least since Airflow version: 1.10.10+composer.
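A minimal sketch of that retry suggestion, assuming Airflow 2.x import paths; the DAG and task names are placeholders. The same effect can be applied environment-wide via the [core] default_task_retries setting (exposed in Composer as an Airflow configuration override):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "retries": 1,                         # retry each task at least once
    "retry_delay": timedelta(minutes=2),
}

with DAG(
    dag_id="matts_custom_dag",            # placeholder
    start_date=datetime(2020, 4, 1),
    schedule_interval=None,
    catchup=False,
    default_args=default_args,
) as dag:
    BashOperator(task_id="main_test", bash_command="echo 'Hello World'")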

Could not create internal topics - Stream-thread exception

I am trying to execute a simple WordCount streams application, but I get the error "Could not create internal topics - Stream-thread exception".
I have seen a similar thread, but that seems to be more of a network issue.
There is no security enabled on the Kafka broker.
Only one broker is configured, and I still hit this issue.
Can someone let me know how to fix this?
Clean up your temporary Kafka topics.
Run the --list command on Kafka to see all the topics starting with your application name and ending with -changelog and -repartition, and manually delete them.
This one worked for me.
Also, check your delete.topic.enable setting so that deletion actually happens. It was not enabled by default until 1.0.0 - see https://issues.apache.org/jira/browse/KAFKA-5384
I connected to Kafka using Kafka Tool and deleted them manually.
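If you would rather script the cleanup than use the CLI or a GUI tool, here is a hedged sketch using the confluent_kafka AdminClient (my own assumption, not what the answers above used); the broker address and the Streams application.id prefix are placeholders, and you should review the matched list before deleting anything:

from confluent_kafka.admin import AdminClient

BOOTSTRAP = "localhost:9092"        # placeholder
APP_ID_PREFIX = "wordcount-app"     # placeholder: your Streams application.id

admin = AdminClient({"bootstrap.servers": BOOTSTRAP})

# Streams internal topics are named <application.id>-...-changelog/-repartition.
metadata = admin.list_topics(timeout=10)
internal_topics = [
    name
    for name in metadata.topics
    if name.startswith(APP_ID_PREFIX)
    and name.endswith(("-changelog", "-repartition"))
]

print("About to delete:", internal_topics)
if internal_topics:
    # Remember that delete.topic.enable must be true on the broker (see above).
    futures = admin.delete_topics(internal_topics, operation_timeout=30)
    for topic, future in futures.items():
        future.result()  # raises if the broker rejected the deletion
        print("Deleted", topic)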
