I couldn't find a tutorial for Cloud ML Engine + Airflow. Can someone please help me deploy a Cloud ML Engine model and orchestrate it with Airflow to run training with new data every hour?
You can schedule ML Engine jobs using the Composer quick start available here and the Airflow ML Engine operators' documentation here. The model package gets created on GCS when you train a job on ML Engine, or it can even be created manually. If you intend to do hyperparameter optimization, then your package will need to contain a setup.py, as mentioned here.
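In case it helps, a minimal setup.py for such a trainer package might look like the sketch below; the package name, version, and dependency list are assumptions chosen to match the example DAG that follows.

# setup.py -- minimal packaging for the trainer code uploaded to GCS
from setuptools import find_packages, setup

setup(
    name='iris_sklearn_trainer',        # assumed name, matching the package URI in the DAG below
    version='0.1',
    packages=find_packages(),
    install_requires=['scikit-learn'],  # assumption: the trainer uses scikit-learn
)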
Below is an example DAG for the iris scikit-learn model (reference here):
import datetime
import uuid

from airflow import models
from airflow.contrib.operators import mlengine_operator

default_dag_args = {
    # Start date in the past so the scheduler picks the DAG up immediately
    'start_date': datetime.datetime(2018, 1, 1),
}

with models.DAG(
        'composer_sample_ml',
        # Run the DAG once per day; for the hourly retraining asked about
        # in the question, use datetime.timedelta(hours=1) instead
        schedule_interval=datetime.timedelta(days=1),
        default_args=default_dag_args) as dag:
    train_model = mlengine_operator.MLEngineTrainingOperator(
        task_id='train_model',
        project_id='PROJECT_ID',
        job_id='{}_{}'.format('iris_train_job', str(uuid.uuid4())),
        # package_uris expects a list of GCS paths
        package_uris=['gs://BUCKET_ID/scikit_learn_job_dir/packages/PACKAGE_ID/iris_sklearn_trainer-0.1.tar.gz'],
        training_python_module='iris_sklearn_trainer.iris',
        training_args=["--jobDir='gs://BUCKET_ID/scikit_learn_job_dir'"],
        region='us-central1',
        scale_tier='BASIC',
        # these keyword arguments are snake_case, not runtimeVersion/pythonVersion
        runtime_version='1.8',
        python_version='2.7',
    )
    train_model  # single task, so there are no dependencies to set
Here's a tutorial that includes ML Engine and Composer: https://cloud.google.com/solutions/machine-learning/recommendation-system-tensorflow-deploy
As an update for anyone who may be looking post-June 2020: in addition to the resources above, these two PRs include updates to the example DAG and a guide for these operators. I don't think they're reflected in the docs yet, but once they are I'll try to update this answer.
https://github.com/apache/airflow/pull/9727
https://github.com/apache/airflow/pull/9798/files?short_path=b8207cb#diff-b8207cb1601efda85d3cbcf0087a6347
At the moment we schedule our Databricks notebooks using Airflow. Due to dependencies between projects, there are dependencies between DAGs: some DAGs wait until a task in a previous DAG is finished before starting (by using sensors).
We are now looking to use Databricks DBX. It is still new for us, but it seems that DBX's main added value is when you use Databricks workflows. It would be possible to run a Python wheel in a job that was created by DBX. My question now is: is it possible to add dependencies between Databricks jobs? Can we create two different jobs using DBX and make the second job wait until the first one is completed?
I am aware that I can have dependencies between tasks in one job, but in our case it is not possible to have only one job with all the tasks.
I was thinking about adding a notebook/Python script before the wheel with the ETL logic. This notebook would then check whether the previous job is finished; once it is, the task with the wheel would be executed. Does this make sense, or are there better ways? Is something like the ExternalTaskSensor in Airflow available within Databricks workflows?
Or is there a good way to use DBX without DB workflows?
Author of dbx here.
TL;DR - dbx is not opinionated in terms of the orchestrator choice.
It is still new for us, but it seems that DBX's main added value is when you use Databricks workflows. It would be possible to run a Python wheel in a job that was created by DBX.
The short answer is yes, but it's done at the task level (read more here on the difference between a workflow and a task).
Another approach would be the following: if you still need (or want) to use Airflow, you can do it this way:
1. Deploy and update your jobs from your CI/CD pipeline with dbx deploy commands.
2. In Airflow, use the Databricks Operator to launch the job (either by name or by id), as sketched below.
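As a rough sketch of that second step, assuming a recent apache-airflow-providers-databricks version (one that supports job_name) and hypothetical job names created by dbx deploy, the cross-job dependency then lives in Airflow:

# Sketch only: chaining two dbx-deployed Databricks jobs from Airflow.
# Assumes an Airflow connection 'databricks_default' and that the two
# job names below were created by `dbx deploy` (both are hypothetical).
import pendulum
from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator

with DAG(
    dag_id='chain_dbx_jobs',
    start_date=pendulum.datetime(2022, 1, 1, tz='UTC'),
    schedule_interval='@daily',
    catchup=False,
) as dag:
    first_job = DatabricksRunNowOperator(
        task_id='first_job',
        databricks_conn_id='databricks_default',
        job_name='my-first-dbx-job',   # hypothetical name
    )
    second_job = DatabricksRunNowOperator(
        task_id='second_job',
        databricks_conn_id='databricks_default',
        job_name='my-second-dbx-job',  # hypothetical name
    )
    # second_job waits for first_job, replacing the notebook-polling idea
    first_job >> second_job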
I am running Airflow 2.0, setting up an Airflow DAG for the first time, and following the quick start tutorials.
After creating and running the .py file, I don't see the DAG created; it does not show up in the list.
Setting:
airflow.cfg: dags_folder = /Users/vik/src/airflow/dags
My Python file is in this folder, and there are no errors.
I am able to see the example DAGs from example_dags.
I ran:
airflow db init
airflow webserver
airflow scheduler
and then tried to list the DAGs. I think I am missing something.
I don't know exactly how you installed everything, but I highly recommend the Astronomer CLI for simplicity and quick setup. With that you'll be able to set up a first DAG pretty quickly. Here is also a video tutorial that helps you understand how to install and set everything up.
A few things to try:
Make sure the scheduler is running (run airflow scheduler), or try restarting it.
Using the Airflow CLI, run airflow config list and make sure that the loaded config is in fact what you are expecting; check the value of dags_folder.
Try running airflow dags list from the CLI, and check the filepath if your DAG is shown in the results.
If there was an error parsing your DAG, and it therefore could not be loaded by the scheduler, you can find the logs in ${AIRFLOW_HOME}/logs/scheduler/${date}/your_dag_id.log.
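As a quick sanity check, a minimal DAG like the sketch below (the dag_id and filename are arbitrary) can be dropped into the configured dags_folder; if it doesn't appear in airflow dags list after the scheduler has had a chance to parse it, the problem is almost certainly the folder path or a parsing error.

# minimal_check_dag.py -- bare-bones Airflow 2.0 DAG to verify dags_folder pickup
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator

with DAG(
    dag_id='minimal_check_dag',   # arbitrary id for this sanity check
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,       # manual triggering only
    catchup=False,
) as dag:
    DummyOperator(task_id='noop')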
We currently have Airflow deployed via Helm on AWS EKS and want to trigger dbt models from it.
A few questions:
1. What would be the ideal way to deploy dbt? I am thinking about either deploying a separate container for dbt only, or installing dbt via pip or brew in the same container that runs Airflow.
2. If the ideal way to run dbt is in its own container, how do I connect Airflow to dbt?
Please feel free to add any relevant information!
I think you should consider switching to the official chart that the Apache Airflow community released recently: https://airflow.apache.org/docs/helm-chart/stable/index.html - it is prepared and maintained by the same community that builds Airflow.
I think one of the best descriptions of how to integrate dbt can be found in this Astronomer blog post: https://www.astronomer.io/blog/airflow-dbt-1
Summarising it: if you do not want to use dbt Cloud, you can install dbt as a pip package and either run it via a Bash script or use the dedicated dbt operators. If you already run Airflow from an image, connecting a separate dbt image so that dbt is invoked in another container, while technically possible, is a bit challenging and likely not worth the hassle.
You should simply extend the Airflow image and add dbt as a pip package; a sketch of the Bash-script route follows below. You can learn how to extend or customize the Airflow image here: https://airflow.apache.org/docs/docker-stack/build.html
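Once dbt is installed in the extended image, a minimal sketch of that Bash-script route might look like this; the project path and the run/test split are assumptions, not a prescribed layout:

# Sketch: running dbt from Airflow via BashOperator, assuming dbt was
# added to the Airflow image as a pip package and the dbt project is
# baked into the image at /opt/dbt_project (hypothetical path)
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

DBT_DIR = '/opt/dbt_project'  # hypothetical location of the dbt project

with DAG(
    dag_id='dbt_example',
    start_date=datetime(2021, 1, 1),
    schedule_interval='@daily',
    catchup=False,
) as dag:
    dbt_run = BashOperator(
        task_id='dbt_run',
        bash_command=f'dbt run --project-dir {DBT_DIR} --profiles-dir {DBT_DIR}',
    )
    dbt_test = BashOperator(
        task_id='dbt_test',
        bash_command=f'dbt test --project-dir {DBT_DIR} --profiles-dir {DBT_DIR}',
    )
    dbt_run >> dbt_test  # only test the models after they are built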
A small follow-up: not sure if you have seen the talk from last week's Airflow Summit, but I highly recommend it: https://airflowsummit.org/sessions/2021/building-a-robust-data-pipeline-with-the-dag-stack/ - it might give you a few more answers :)
I am using BigQueryCheckOperator in Airflow to check whether data exists in a BigQuery table, but the DAG is failing with this error: <HttpError 404 when requesting https://bigquery.googleapis.com/bigquery/v2/projects/
Here are the logs of the DAG.
Can someone tell me how to fix this issue?
This is a known Airflow issue when querying BigQuery datasets residing in non-multi-regional locations (i.e., outside US and EU) within some of the BigQuery operator submodules; pull request #8273 has already been raised.
You can also check out this Stack thread for the most accurate explanation of the problem.
The fix was announced for the Airflow 2.0 release; in the meantime, the community has published a backport package to help users on older Airflow 1.10.* versions, and it will be considered in further builds of the Airflow images for GCP Composer.
As a workaround, you can try a BashOperator invoking the bq command-line tool to perform the same check against the BigQuery dataset inside the particular DAG file.
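A sketch of that workaround, assuming the dataset lives in a single-region location such as europe-west3 (the project, dataset, and table names are placeholders):

# Workaround sketch for Airflow 1.10.*: use the bq CLI so the dataset
# location can be passed explicitly, sidestepping the operator bug.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

with DAG(
    dag_id='bq_location_workaround',
    start_date=datetime(2020, 1, 1),
    schedule_interval='@daily',
) as dag:
    check_table = BashOperator(
        task_id='check_table',
        # --location must match the dataset's region; the task fails only on a
        # non-zero exit code, so add shell logic if an empty result should fail too
        bash_command=(
            "bq --location=europe-west3 query --use_legacy_sql=false "
            "'SELECT COUNT(*) FROM `my_project.my_dataset.my_table`'"
        ),
    )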
I'm trying to use Cloud Composer to run my workflow. I wanted to use the GoogleCloudStorageToGoogleCloudStorageOperator, which is available from Apache Airflow v1.10 but is not supported in the current Cloud Composer (it supports only Apache Airflow v1.9 for now (2019/01/16)).
Following the guidance of Google's blog post, I added the operator to a Cloud Composer environment myself, and it worked well until a few days ago.
However, when I now try to create a new Cloud Composer environment and deploy the same DAG that worked well previously, I get the following error message in the Airflow web UI, and the DAG fails.
Broken DAG: [/home/airflow/gcs/dags/xxx.py] Relationships can only be set between Operators; received GoogleCloudStorageToGoogleCloudStorageOperator
I couldn't understand why this error occurred even though I used the same code and followed the same procedure to deploy my DAG to Cloud Composer.
I would appreciate any advice on how to solve this problem.
This was due to a bug in Composer 1.4.2 which has already been fixed. See: Airflow error importing DAG using plugin - Relationships can only be set between Operators
Try out the DAG on Astronomer Cloud (http://astronomer.io/cloud), free 30-day trial.
Disclosure: I work at Astronomer.