I'm trying to use Cloud Composer to run my workflow. I wanted to use the "GoogleCloudStorageToGoogleCloudStorageOperator" operator, which is available in Apache Airflow v1.10 but is not yet supported in Cloud Composer (it only supports Apache Airflow v1.9 as of 2019/01/16).
Following the guidance of Google's blog post, I added the operator to a Cloud Composer environment myself, and it worked well until a few days ago.
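Roughly, the plugin registration I added looks like this (a simplified sketch; gcs_to_gcs.py is the operator file copied from Airflow 1.10 into the plugins folder, and the names are from my own setup):

from airflow.plugins_manager import AirflowPlugin
from gcs_to_gcs import GoogleCloudStorageToGoogleCloudStorageOperator  # file copied from Airflow 1.10

class GcsToGcsPlugin(AirflowPlugin):
    name = "gcs_to_gcs_plugin"
    operators = [GoogleCloudStorageToGoogleCloudStorageOperator]

The DAG then imports the operator from airflow.operators.gcs_to_gcs_plugin.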
However, now, when I create a new Cloud Composer environment and deploy the same DAG that previously worked, I get the following error message on the Airflow web UI, and the DAG fails.
Broken DAG: [/home/airflow/gcs/dags/xxx.py] Relationships can only be set between Operators; received GoogleCloudStorageToGoogleCloudStorageOperator
I don't understand why this error occurs even though I used the same code and followed the same procedure to deploy my DAG to Cloud Composer.
I would appreciate any advice on how to solve this problem.
This was due to a bug in Composer 1.4.2 which has already been fixed. See:
Airflow error importing DAG using plugin - Relationships can only be set between Operators
Try out the DAG on Astronomer Cloud (http://astronomer.io/cloud), free 30 day trial.
Disclosure: I work at Astronomer.
Related
I have an Airflow DAG running in a VM, but to facilitate event-driven triggering I'm trying to set up Cloud Composer in GCP. However, I only see an option in Cloud Composer to install PyPI packages.
I need the rosbag package in order to run my bash script. Is there any way to do that in Cloud Composer, or is it better to run Airflow in a VM or in a container on Kubernetes?
You can add your own requirements in Cloud Composer
https://cloud.google.com/composer/docs/how-to/using/installing-python-dependencies
However, knowing rosbag pretty well (I've been a robotics engineer using ROS for quite some time), it might not be easy to work out the right set of dependencies. Airflow has more than 500 dependencies overall, and it is highly likely that some of them will clash with your particular version of ROS.
Also, ROS has its own specific way of initializing: setting up all the environment variables and sourcing certain scripts. You would have to emulate that yourself, modify PYTHONPATH, and possibly do some initialization of your own.
I'd say your best bet is to use DockerOperator and run ROS from a Docker image. This can be done even with GPU support if needed (been there, done that), and it provides the right level of isolation: both Airflow and ROS lean heavily on Python and its dependencies, and this might be the simplest way to keep them apart.
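A minimal sketch of what that could look like, assuming an Airflow deployment whose workers can reach a Docker daemon (the image name, bag path and host mount are placeholders):

from datetime import datetime

from airflow import DAG
from airflow.operators.docker_operator import DockerOperator

with DAG(
    dag_id="rosbag_in_docker",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
) as dag:
    # The official ROS images source the ROS setup script in their entrypoint,
    # so rosbag and the ROS environment are available inside the container.
    rosbag_info = DockerOperator(
        task_id="rosbag_info",
        image="ros:noetic-ros-base",           # placeholder ROS image
        command="rosbag info /data/my.bag",    # placeholder bag path
        volumes=["/path/on/host/bags:/data"],  # placeholder host mount
        auto_remove=True,
    )

On Cloud Composer itself, where you don't control the workers' Docker daemon, running the same ROS image through KubernetesPodOperator is the closer equivalent.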
I am running Airflow 2.0, setting up an Airflow DAG for the first time, and following the quick start tutorials.
After creating and running the .py file I don't see the DAG; it does not show up in the list.
setting:
airflow.cfg:dags_folder = /Users/vik/src/airflow/dags
My Python file is in this folder, and there are no errors here.
I am able to see the example DAGs in example-dags.
I ran:
airflow db init
airflow webserver
airflow scheduler
Then I tried to list the DAGs.
I think I am missing something
I don't know exactly how you installed everything, but I highly recommend the Astronomer CLI for simplicity and quick setup. With it you'll be able to set up a first DAG pretty quickly. There is also a video tutorial that helps you understand how to install and set up everything.
A few things to try:
Make sure the scheduler is running (run airflow scheduler), or try restarting it.
Using the Airflow CLI, run airflow config list and make sure that the loaded config is in fact what you are expecting; in particular, check the value of dags_folder.
Try running airflow dags list from the CLI, and check the filepath if your DAG shows up in the results.
If there was an error parsing your DAG, and it therefore could not be loaded by the scheduler, you can find the logs in ${AIRFLOW_HOME}/logs/scheduler/${date}/your_dag_id.log
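If none of that reveals the problem, it can also help to drop a minimal DAG that you know should be picked up into your dags_folder, something like this sketch (names are arbitrary):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A DAG object has to exist at module level for the scheduler to find it.
with DAG(
    dag_id="hello_dag",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
) as dag:
    BashOperator(task_id="say_hello", bash_command="echo hello")

If this one shows up in airflow dags list but yours doesn't, the problem is in your DAG file rather than in the installation.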
I am using BigQueryCheckOperator in Airflow to check whether data exists in a BigQuery table, but the DAG is failing with this error: <HttpError 404 when requesting https://bigquery.googleapis.com/bigquery/v2/projects/
Here are the logs of the DAG.
Can someone tell me how to fix this issue?
This is a known Airflow issue with querying BigQuery datasets that reside in non-multi-regional locations (i.e., other than US and EU) from some of the BigQuery operator submodules; pull request #8273 has already been raised for it.
You can also check out this Stack Overflow thread for a more detailed explanation of the problem.
For now, the fix has been announced for the Airflow 2.0 release; however, the community has also pushed a backport package to help users on older Airflow 1.10.* versions, and it will be considered in further builds of the Airflow images for GCP Composer.
As a workaround, you can try using a BashOperator that invokes the bq command-line tool to perform the desired action against the BigQuery dataset inside your DAG file.
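A rough sketch of that workaround, with placeholder project, dataset, table and location values (adjust them to your environment):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

with DAG(
    dag_id="bq_check_workaround",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
) as dag:
    # bq takes the dataset location explicitly, which avoids the 404 seen
    # with the affected operator submodules.
    check_table_has_rows = BashOperator(
        task_id="check_table_has_rows",
        bash_command=(
            "bq --location=asia-northeast1 query --use_legacy_sql=false "
            "'SELECT COUNT(*) FROM `my-project.my_dataset.my_table`'"
        ),
    )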
In the Airflow scheduler, there are settings like the heartbeat intervals and max_threads.
See How to reduce airflow dag scheduling latency in production?.
If I am using Google Cloud Composer, do I have to worry about or set these values?
If not, what are the values that Google Cloud Composer uses?
You can see the Airflow config in the Composer instance bucket: gs://composer_instance_bucket/airflow.cfg. You can tune this configuration as you wish, keeping in mind that Cloud Composer has some configurations blocked.
Also, if you go to the Airflow UI -> Admin -> Configuration, you can see the full configuration.
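If you do decide to override one of the non-blocked values, something along the lines of the following should work (environment name, location and value are placeholders; check the Composer documentation for the list of blocked configurations):
gcloud composer environments update ENVIRONMENT_NAME --location=LOCATION --update-airflow-configs=scheduler-max_threads=2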
If you'd like more control/visibility over these variables, consider hosted Airflow at Astronomer (https://www.astronomer.io/cloud/), as it runs vanilla Airflow.
I'm developing a DAG on Cloud Composer; my code is separated into a main Python file and one package with subfolders, and it looks like this:
my_dag1.py
package1/__init__.py
package1/functions.py
package1/package2/__init__.py
package1/package2/more_functions.py
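For reference, my_dag1.py uses the package roughly like this (names simplified; my_function and new_argument are the ones that appear in the errors below):

from package1.functions import my_function

my_function(new_argument="some value")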
I updated one of the functions in package1/functions.py to take an additional argument (and updated the reference in my_dag1.py). The code runs correctly in my local environment, and I was not getting any errors when running
gcloud beta composer environments run my-airflow-environment list_dags --location europe-west1
But the Web UI raised a python error
TypeError: my_function() got an unexpected keyword argument
'new_argument'
I tried renaming the function, and the error changed to
NameError: name 'my_function' is not defined
I tried changing the name of the DAG and uploading the files to the DAG folder both zipped and unzipped, but nothing worked.
The error disappeared only after I renamed the package folder.
I suspect the issue is related to the scheduler picking up my_dag1.py but not package1/functions.py. The error appeared out of nowhere, as I had made similar updates in previous weeks.
Any idea on how to fix this issue without refactoring the whole code structure?
EDIT-1
Here's the link to related discussion on Google Groups
I've run into a similar issue: the "Broken DAG" error won't dismiss in the web UI. I guess this is a cache bug in the Airflow web server.
Background.
I created a custom operator using Airflow's plugin features.
After I import the custom operator, the Airflow web UI keeps showing the Broken DAG error, saying that it can't find the custom operator.
Why do I think it's a bug in the Airflow web server?
I can manually run the DAG with the airflow test command, so the import should be correct.
Even if I remove the related DAG file from the Airflow /dags/ folder, the error is still there.
Here is what I did to resolve this issue:
Restart the Airflow web service (sometimes this alone resolves the issue).
Make sure no DAG is running, then restart the Airflow scheduler service.
Make sure no DAG is running, then restart the Airflow workers.
Hopefully this helps someone who has the same issue.
Try restarting the webserver with:
gcloud beta composer environments restart-web-server ENVIRONMENT_NAME --location=LOCATION