I try to run a Apache Beam pipeline (Python) within Google Cloud Dataflow, triggered by a DAG in Google Cloud Coomposer.
The structure of my dags folder in the respective GCS bucket is as follows:
/dags/
dataflow.py <- DAG
dataflow/
pipeline.py <- pipeline
setup.py
my_modules/
__init__.py
commons.py <- the module I want to import in the pipeline
The setup.py is very basic, but according to the Apache Beam docs and answers on SO:
import setuptools
setuptools.setup(setuptools.find_packages())
In the DAG file (dataflow.py) I set the setup_file option and pass it to Dataflow:
default_dag_args = {
... ,
'dataflow_default_options': {
... ,
'runner': 'DataflowRunner',
'setup_file': os.path.join(configuration.get('core', 'dags_folder'), 'dataflow', 'setup.py')
}
}
Within the pipeline file (pipeline.py) I try to use
from my_modules import commons
but this fails. The log in Google Cloud Composer (Apache Airflow) says:
gcp_dataflow_hook.py:132} WARNING - b' File "/home/airflow/gcs/dags/dataflow/dataflow.py", line 11\n from my_modules import commons\n ^\nSyntaxError: invalid syntax'
The basic idea behind the setup.py file is documented here
Also, there are similar questions on SO which helped me:
Google Dataflow - Failed to import custom python modules
Dataflow/apache beam: manage custom module dependencies
I'm actually wondering why my pipelines fails with a Syntax Error and not a module not found kind of error...
I tried to reproduce your issue and then try to solve it, so I created the same folder structure you already have:
/dags/
dataflow.py
dataflow/
pipeline.py -> pipeline
setup.py
my_modules/
__init__.py
common.py
Therefore, to make it work, the change I made is to copy these folders to a place where the instance is running the code is able to find it, for example in the /tmp/ folder of the instance.
So, my DAG would be something like this:
1 - Fist of all I declare my arguments:
default_args = {
'start_date': datetime(xxxx, x, x),
'retries': 1,
'retry_delay': timedelta(minutes=5),
'dataflow_default_options': {
'project': '<project>',
'region': '<region>',
'stagingLocation': 'gs://<bucket>/stage',
'tempLocation': 'gs://<bucket>/temp',
'setup_file': <setup.py>,
'runner': 'DataflowRunner'
}
}
2- After this, I created the DAG and before running the Dataflow task, I copied the whole folder directory, above created, into the /tmp/ folder of the instance Task t1, and after this, I run the pipeline from the /tmp/ directory Task t2:
with DAG(
'composer_df',
default_args=default_args,
description='datflow dag',
schedule_interval="xxxx") as dag:
def copy_dependencies():
process = subprocess.Popen(['gsutil','cp', '-r' ,'gs://<bucket>/dags/*',
'/tmp/'])
process.communicate()
t1 = python_operator.PythonOperator(
task_id='copy_dependencies',
python_callable=copy_dependencies,
provide_context=False
)
t2 = DataFlowPythonOperator(task_id="composer_dataflow",
py_file='/tmp/dataflow/pipeline.py', job_name='job_composer')
t1 >> t2
That's how I created the DAG file dataflow.py, and then, in the pipeline.py the package to import would be like:
from my_modules import commons
It should work fine, since the folder directory is understandable for the VM.
Related
I'm trying to implement custom XCOM backend.
Those are the steps I did:
Created "include" directory at the main Airflow dir (AIRFLOW_HOME).
Created these "custom_xcom_backend.py" file inside:
from typing import Any
from airflow.models.xcom import BaseXCom
import pandas as pd
class CustomXComBackend(BaseXCom):
#staticmethod
def serialize_value(value: Any):
if isinstance(value, pd.DataFrame):
value = value.to_json(orient='records')
return BaseXCom.serialize_value(value)
#staticmethod
def deserialize_value(result) -> Any:
result = BaseXCom.deserialize_value(result)
result = df = pd.read_json(result)
return result
Set at config file:
xcom_backend = include.custom_xcom_backend.CustomXComBackend
When I restarted webserver I got:
airflow.exceptions.AirflowConfigException: The object could not be loaded. Please check "xcom_backend" key in "core" section. Current value: "include.cust...
My guess is that it not recognizing the "include" folder
But how can I fix it?
*Note: There is no docker. It is installed on a Ubuntu machine.
Thanks!
So I solved it:
Put custom_xcom_backend.py into the plugins directory
set at config file:
xcom_backend = custom_xcom_backend.CustomXComBackend
Restart all airflow related services
*Note: Do not store DataFrames that way (bad practice).
Sources I used:
https://www.youtube.com/watch?v=iI0ymwOij88
I am trying to write some DAG integrity tests in airflow. The issue I am coming across is the DAG that I am testing, I have references to variables in some of the tasks within that DAG.
eg: Variable.get("AIRFLOW_VAR_BLOB_CONTAINER")
I seem to be getting the error:
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) no such table: variable
from this because when testing via pytest, those variables (and the variables table) don't exist. Does anyone know any workarounds or suggested methods to handle Variables/Connection references when running DAG Integrity tests?
Thanks,
You can create a local metastore for testing. Running airflow db init without any other settings will create a SQLite metastore in your home directory which you can use during testing. My default additional settings for a local metastore for testing are:
AIRFLOW__CORE__LOAD_DEFAULT_CONNECTIONS=False (to ensure there are no defaults to make things magically work)
AIRFLOW__CORE__LOAD_EXAMPLES=False (to ensure there are no defaults to make things magically work)
AIRFLOW__CORE__UNIT_TEST_MODE=True (Set default test settings, skip certain actions, etc.)
AIRFLOW_HOME=[project root dir] (To avoid Airflow files in your home dir)
Running airflow db init with these settings results in three files in your project root dir:
unittests.db
unittests.cfg
webserver_config.py
It's probably a good idea to add those to your .gitignore. With this set up you can safely test against the local metastore unittests.db during your tests (ensure that when running pytest, the same env vars are set).
Alternatively, if you don't want a local metastore for reasons, you will have to resort to mocking to substitute the call Airflow makes to the metastore. This requires knowledge of the internals of Airflow. An example:
import datetime
from unittest import mock
from airflow.models import DAG
from airflow.operators.bash import BashOperator
def test_bash_operator(tmp_path):
with DAG(dag_id="test_dag", start_date=datetime.datetime(2021, 1, 1), schedule_interval="#daily") as dag:
with mock.patch("airflow.models.variable.Variable.get") as variable_get_mock:
employees = ["Alice", "Bob", "Charlie"]
variable_get_mock.return_value = employees
output_file = tmp_path / "output.txt"
test = BashOperator(task_id="test", bash_command="echo {{ var.json.employees }} > " + str(output_file))
dag.clear()
test.run(
start_date=dag.start_date,
end_date=dag.start_date,
ignore_first_depends_on_past=True,
ignore_ti_state=True,
)
variable_get_mock.assert_called_once()
assert output_file.read_text() == f"[{', '.join(employees)}]\n"
These lines:
with mock.patch("airflow.models.variable.Variable.get") as variable_get_mock:
employees = ["Alice", "Bob", "Charlie"]
variable_get_mock.return_value = employees
Determine that the function airflow.models.variable.Variable.get isn't actually called but instead this list is returned: ["Alice", "Bob", "Charlie"]. Since task.run() doesn't return anything, I made the bash_command write to a tmp_path, and read the file to assert if the content is what I expected.
This avoids the need for a metastore entirely, but mocking can be a lot of work and complex once your tests grow beyond basic examples like these.
i am a newbie to Airflow. i have some .jar jobs generated with Talend Open Studio for Big Data, and i want to schedule and manage those with Airflow my question is , does Airflow support .jar file or generated by TOS as DAG ?
and if it does how ?
or is there any alternative to run .jar on Airlow ?
im using Airflow v1.10.3
the jobs are mainly to extract and process data from a mongodb database then update the database with the new processed data.
Thanks !
Airflow does support running jar files. You do this through the BashOperator.
Quick example:
from airflow import DAG
from airflow.operators import BashOperator
from datetime import datetime
import os
import sys
args = {
'owner': 'you',
'start_date': datetime(2019, 4, 24),
'provide_context': True
}
dag = DAG(
task_id = 'runjar',
schedule_interval = None, #manually triggered
default_args = args)
run_jar_task= BashOperator(
task_id = 'runjar',
dag = dag,
bash_command = 'java -cp /path/to/your/jar.jar param1 param2'
)
Airflow will happily run .jar files. There is a few examples kicking about for you to have a look at.
Running a standard .jar file: run_jar.py
Running a "built" Talend jobl loan_application_data.py
Obviously with both these examples the .jar or Talend file(s) will need to be on the server Airflow is executing on (as well as Java).
Airflow packaged DAGs seem like a great building block for a sane production airflow deployment.
I have a DAG with dynamic subDAGs, driven by a config file, something like:
config.yaml:
imports:
- project_foo
- project_bar`
which yields subdag tasks like imports.project_{foo|bar}.step{1|2|3}.
I've normally read in the config file using python's open function, a la config = open(os.path.join(os.path.split(__file__)[0], 'config.yaml')
Unfortunately, when using packaged DAGs, this results in an error:
Broken DAG: [/home/airflow/dags/workflows.zip] [Errno 20] Not a directory: '/home/airflow/dags/workflows.zip/config.yaml'
Any thoughts / best practices to recommend here?
It's a bit of a kludge, but I eventually just fell back on reading zip file contents via ZipFile.
import yaml
from zipfile import ZipFile
import logging
import re
def get_config(yaml_filename):
"""Parses and returns the given YAML config file.
For packaged DAGs, gracefully handles unzipping.
"""
zip, post_zip = re.search(r'(.*\.zip)?(.*)', yaml_filename).groups()
if zip:
contents = ZipFile(zip).read(post_zip.lstrip('/'))
else:
contents = open(post_zip).read()
result = yaml.safe_load(contents)
logging.info('Parsed config: %s', result)
return result
which works as you'd expect from the main dag.py:
get_config(os.path.join(path.split(__file__)[0], 'config.yaml'))
I have created a python_scripts/ folder under my dags/ folder.
I have 2 different dags running the same python_operator - calling to 2 different python scripts located in the python_scripts/ folder.
They both write output files BUT:
one of them creates the file under the dags/ folder, and one of them creates it in the plugins/ folder.
How does Airflow determine the working path?
How can I get Airflow to write all outputs to the same folder?
One thing you could try, that I use in my dags, would be to set you working path by adding os.chdir('some/path') in your DAG.
This only works if you do not put it into an operator, as those are run in subprocesses and therefore do not change the working path of the parent process.
The other solution I could think of would be using absolute paths when specifying your output.
For the approach with os.chdir try the following and you should see both files get created in the folder defined with path='/home/chr/test/':
from datetime import datetime
import os
import logging
from airflow import DAG
from airflow.exceptions import AirflowException
from airflow.operators.python_operator import PythonOperator
log = logging.getLogger(__name__)
default_args = {
'owner': 'admin',
'depends_on_past': False,
'retries': 0
}
dag = DAG('test_dag',
description='Test DAG',
catchup=False,
schedule_interval='0 0 * * *',
default_args=default_args,
start_date=datetime(2018, 8, 8))
path = '/home/chr/test'
if os.path.isdir(path):
os.chdir(path)
else:
os.mkdir(path)
os.chdir(path)
def write_some_file():
try:
with open("/home/chr/test/absolute_testfile.txt", "wt") as fout:
fout.write('test1\n')
with open("relative_testfile.txt", "wt") as fout:
fout.write('test2\n')
except Exception as e:
log.error(e)
raise AirflowException(e)
write_file_task = PythonOperator(
task_id='write_some_file',
python_callable=write_some_file,
dag=dag
)
Also, please try to provide code next time you ask a question, as it is almost impossible to find out what the problem is, just by reading your question.