airflow "python operator" writes files to different locations - airflow

I have created a python_scripts/ folder under my dags/ folder.
I have 2 different DAGs running the same python_operator, calling 2 different Python scripts located in the python_scripts/ folder.
They both write output files BUT:
one of them creates the file under the dags/ folder, and one of them creates it in the plugins/ folder.
How does Airflow determine the working path?
How can I get Airflow to write all outputs to the same folder?

One thing you could try, which I use in my DAGs, is to set your working path by adding os.chdir('some/path') in your DAG.
This only works if you do not put it into an operator, as those are run in subprocesses and therefore do not change the working path of the parent process.
The other solution I could think of would be using absolute paths when specifying your output (a short sketch of that follows the example below).
For the approach with os.chdir, try the following and you should see both files get created in the folder defined with path = '/home/chr/test':
from datetime import datetime
import os
import logging

from airflow import DAG
from airflow.exceptions import AirflowException
from airflow.operators.python_operator import PythonOperator

log = logging.getLogger(__name__)

default_args = {
    'owner': 'admin',
    'depends_on_past': False,
    'retries': 0
}

dag = DAG('test_dag',
          description='Test DAG',
          catchup=False,
          schedule_interval='0 0 * * *',
          default_args=default_args,
          start_date=datetime(2018, 8, 8))

# change the working directory at parse time
path = '/home/chr/test'
if os.path.isdir(path):
    os.chdir(path)
else:
    os.mkdir(path)
    os.chdir(path)


def write_some_file():
    try:
        # absolute path - always lands in the same place
        with open("/home/chr/test/absolute_testfile.txt", "wt") as fout:
            fout.write('test1\n')
        # relative path - resolved against the current working directory
        with open("relative_testfile.txt", "wt") as fout:
            fout.write('test2\n')
    except Exception as e:
        log.error(e)
        raise AirflowException(e)


write_file_task = PythonOperator(
    task_id='write_some_file',
    python_callable=write_some_file,
    dag=dag
)
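For the absolute-path approach, a minimal sketch (my own illustration, not part of the original answer; the 'output' folder name is made up) would be to anchor the output location to the DAG file itself, so it no longer depends on whatever working directory the task happens to run in:
import os

# build an output directory next to this DAG file (hypothetical folder name)
OUTPUT_DIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'output')
os.makedirs(OUTPUT_DIR, exist_ok=True)

def write_some_file_abs():
    # every write targets the same folder, regardless of the current working directory
    with open(os.path.join(OUTPUT_DIR, 'testfile.txt'), 'wt') as fout:
        fout.write('test\n')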
Also, please try to provide code next time you ask a question, as it is almost impossible to find out what the problem is just by reading your question.

Related

Apache Airflow doesn't display the DAG

I'm facing some issues trying to set up a basic DAG file inside Airflow (I also have two other files).
I'm using the LocalExecutor on Ubuntu and saved my files at "C:\Users\tdamasce\Documents\workspace", with the dag and log files inside it.
My script is:
# step 1 - libraries
from email.policy import default

from airflow import DAG
from datetime import datetime, timedelta
from airflow.operators.dummy_operator import DummyOperator

# step 2
default_args = {
    'ownwer': 'airflow',
    'depends_on_past': False,
    'start_date': days_ago(2),
    'retries': 0
}

# step 3
dag = DAG(
    dag_id='DAG-1',
    default_args=default_args,
    catchup=False,
    schedule_interval=timedelta(minutes=5)
)

# step 4
start = DummyOperator(
    task_id='start',
    dag=dag
)

end = DummyOperator(
    task_id='end',
    dag=dag
)
My DAG stays like that (screenshot not included here).
Please let me know if any additional info is needed.
As per your updated question, I can see that you placed the DAGs under the directory
"C:\Users\tdamasce\Documents\workspace" with the dag and log file
inside it.
You need to put your DAGs in the dags_folder (specified in airflow.cfg; by default it is the $AIRFLOW_HOME/dags subfolder). Check your AIRFLOW_HOME variable and you should find a dags folder there.
You can also run airflow list_dags (airflow dags list in Airflow 2) - this will list all the DAGs.
If you are still not able to see the DAG in the UI, restart the webserver and scheduler.
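To double-check which directory the scheduler actually scans, here is a minimal sketch (my own addition, not from the original answer) that prints the configured dags_folder from any environment where Airflow is installed:
from airflow.configuration import conf

# prints the folder the scheduler scans for DAG files,
# by default $AIRFLOW_HOME/dags
print(conf.get('core', 'dags_folder'))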

Read CLI input without calling a Python operator

We want to read the CLI input passed to the DAG from the UI during a DAG trigger.
I tried the code below but it's not working. Here I am passing the input as {"kpi":"ID123"},
and I want to print this value in my function get_data_from_bq.
from airflow import DAG
from airflow.utils.dates import days_ago
from airflow.operators.python_operator import PythonOperator
from airflow import models
from airflow.models import Variable
from google.cloud import bigquery
from airflow.configuration import conf

LOCATION = Variable.get("HDM_PROJECT_LOCATION")
PROJECT_ID = Variable.get("HDM_PROJECT_ID")

client = bigquery.Client()

kpi = '{{ kpi}}'

# default arguments
default_dag_args = {
    'start_date': days_ago(0),
    'retries': 0,
    'project_id': PROJECT_ID
}

# Setting airflow environment variable, getting hdm_batch_details data and updating it
def get_data_from_bq(**kwargs):
    print("op is:")
    print(kpi)

# DAG definition
with models.DAG(
        '00_test_sql1',
        schedule_interval=None,
        default_args=default_dag_args) as dag:

    v_run_sql_01 = PythonOperator(
        task_id='Run_SQL',
        provide_context=True,
        python_callable=get_data_from_bq,
        location=LOCATION,
        use_legacy_sql=False)

    v_run_sql_01
Note: I don't want to use any operator to read the data passed from the CLI.
This is impossible. A DAG Run is only created when there are tasks to run.
You should understand that:
A DAG and its top-level code build the DAG structure, consisting of Tasks.
A DAG Run is a single instance of a DAG run, which contains the Task Instances to be executed; a DAG Run simply consists of the task instances that belong to it.
The configuration that you pass is "dag_run.conf", not "dag.conf" - which means that it is specified only for the DagRun and is valid only for the Task Instances that belong to it.
Only Task Instances have access to dag_run.conf, so the value has to be read inside a task - for example from the context passed to a PythonOperator callable, as in the sketch below.
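A minimal sketch of that (my own illustration, reusing the question's function and task names) - the value is read from dag_run.conf inside the callable at task run time, instead of at the top level of the file:
def get_data_from_bq(**kwargs):
    # dag_run.conf holds whatever was passed when triggering the DAG,
    # e.g. {"kpi": "ID123"}; it can be empty if no conf was supplied
    conf = kwargs['dag_run'].conf or {}
    kpi = conf.get('kpi')
    print("op is:")
    print(kpi)

with models.DAG(
        '00_test_sql1',
        schedule_interval=None,
        default_args=default_dag_args) as dag:

    v_run_sql_01 = PythonOperator(
        task_id='Run_SQL',
        provide_context=True,  # needed on Airflow 1.10; the context is passed automatically in Airflow 2
        python_callable=get_data_from_bq)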

Airflow: Create DAG from a separate file

In Airflow, I'm trying to make a function, in a separate file, that is dedicated to generating DAGs:
dynamic_dags.py:
def generate_dag(name):
    with DAG(
        dag_id=f'dag_{name}',
        default_args=args,
        start_date=days_ago(2),
        schedule_interval='5 5 * * *',
        tags=['Test'],
        catchup=False
    ) as dag:
        dummy_task = DummyOperator(
            task_id="dynamic_dummy_task",
            dag=dag
        )
    return dag
Then, in another file, I'm trying to import the DAGs:
load_dags.py:
from dynamic_dag import generate_dag
globals()["Dynamic_DAG_A"] = generate_dag('A')
However, the DAGs do not show up in the web UI.
But if I put everything in a single file, as in the code below, it works:
def generate_dag(name):
    with DAG(
        dag_id=f'dag_{name}',
        default_args=args,
        start_date=days_ago(2),
        schedule_interval='5 5 * * *',
        tags=['Test'],
        catchup=False
    ) as dag:
        dummy_task = DummyOperator(
            task_id="dynamic_dummy_task",
            dag=dag
        )
    return dag

globals()["Dynamic_DAG_A"] = generate_dag('A')
I'm wondering why doing it in two separate files doesn't work.
I think if you are using Airflow 1.10, then the DAG files should contain the strings "DAG" and "airflow":
https://airflow.apache.org/docs/apache-airflow/1.10.15/concepts.html?highlight=airflowignore#dags
When searching for DAGs, Airflow only considers python files that contain the strings “airflow” and “DAG” by default. To consider all python files instead, disable the DAG_DISCOVERY_SAFE_MODE configuration flag.
In Airflow 2 it's been changed (slightly - dag is case-insensitive):
https://airflow.apache.org/docs/apache-airflow/2.2.2/concepts/dags.html
When searching for DAGs inside the DAG_FOLDER, Airflow only considers Python files that contain the strings airflow and dag (case-insensitively) as an optimization.
To consider all Python files instead, disable the DAG_DISCOVERY_SAFE_MODE configuration flag.
I think you are simply missing 'airflow' in your load_dags.py. You can add it anywhere - including in a comment, as in the sketch below.
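A minimal sketch of what that could look like (keeping the import exactly as written in the question):
# load_dags.py
# The word "airflow" in this comment is enough for safe-mode DAG discovery
# to pick this file up.
from dynamic_dag import generate_dag

globals()["Dynamic_DAG_A"] = generate_dag('A')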

Google Dataflow: Import custom Python module

I am trying to run an Apache Beam pipeline (Python) on Google Cloud Dataflow, triggered by a DAG in Google Cloud Composer.
The structure of my dags folder in the respective GCS bucket is as follows:
/dags/
    dataflow.py          <- DAG
    dataflow/
        pipeline.py      <- pipeline
        setup.py
        my_modules/
            __init__.py
            commons.py   <- the module I want to import in the pipeline
The setup.py is kept very basic, following the Apache Beam docs and answers on SO:
import setuptools
setuptools.setup(setuptools.find_packages())
In the DAG file (dataflow.py) I set the setup_file option and pass it to Dataflow:
default_dag_args = {
    ... ,
    'dataflow_default_options': {
        ... ,
        'runner': 'DataflowRunner',
        'setup_file': os.path.join(configuration.get('core', 'dags_folder'), 'dataflow', 'setup.py')
    }
}
Within the pipeline file (pipeline.py) I try to use
from my_modules import commons
but this fails. The log in Google Cloud Composer (Apache Airflow) says:
gcp_dataflow_hook.py:132} WARNING - b' File "/home/airflow/gcs/dags/dataflow/dataflow.py", line 11\n from my_modules import commons\n ^\nSyntaxError: invalid syntax'
The basic idea behind the setup.py file is documented here
Also, there are similar questions on SO which helped me:
Google Dataflow - Failed to import custom python modules
Dataflow/apache beam: manage custom module dependencies
I'm actually wondering why my pipeline fails with a SyntaxError and not a module-not-found kind of error...
I tried to reproduce your issue and then solve it, so I created the same folder structure you already have:
/dags/
    dataflow.py
    dataflow/
        pipeline.py <- pipeline
        setup.py
        my_modules/
            __init__.py
            commons.py
Therefore, to make it work, the change I made was to copy these folders to a place where the instance running the code is able to find them, for example the /tmp/ folder of the instance.
So, my DAG would be something like this:
1 - First of all, I declare my arguments:
default_args = {
    'start_date': datetime(xxxx, x, x),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'dataflow_default_options': {
        'project': '<project>',
        'region': '<region>',
        'stagingLocation': 'gs://<bucket>/stage',
        'tempLocation': 'gs://<bucket>/temp',
        'setup_file': <setup.py>,
        'runner': 'DataflowRunner'
    }
}
2 - After this, I created the DAG. Before running the Dataflow task (task t2), I copy the whole folder created above into the /tmp/ folder of the instance (task t1) and then run the pipeline from the /tmp/ directory:
# assumed imports for this snippet (not shown in the original answer)
import subprocess

from airflow import DAG
from airflow.operators import python_operator
from airflow.contrib.operators.dataflow_operator import DataFlowPythonOperator

with DAG(
        'composer_df',
        default_args=default_args,
        description='dataflow dag',
        schedule_interval="xxxx") as dag:

    def copy_dependencies():
        # copy the DAG folder (setup.py and my_modules/ included) into /tmp/
        process = subprocess.Popen(['gsutil', 'cp', '-r', 'gs://<bucket>/dags/*',
                                    '/tmp/'])
        process.communicate()

    t1 = python_operator.PythonOperator(
        task_id='copy_dependencies',
        python_callable=copy_dependencies,
        provide_context=False
    )

    t2 = DataFlowPythonOperator(
        task_id="composer_dataflow",
        py_file='/tmp/dataflow/pipeline.py',
        job_name='job_composer')

    t1 >> t2
That's how I created the DAG file dataflow.py; then, in pipeline.py, the import stays as:
from my_modules import commons
It should work fine, since the folder structure is now in a location the VM can read.

Does Airflow support jar file?

I am a newbie to Airflow. I have some .jar jobs generated with Talend Open Studio for Big Data, and I want to schedule and manage them with Airflow. My question is: does Airflow support .jar files generated by TOS as DAGs?
And if it does, how?
Or is there any alternative way to run .jar files on Airflow?
I'm using Airflow v1.10.3.
The jobs are mainly to extract and process data from a MongoDB database and then update the database with the newly processed data.
Thanks!
Airflow does support running jar files. You do this through the BashOperator.
Quick example:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime

args = {
    'owner': 'you',
    'start_date': datetime(2019, 4, 24),
    'provide_context': True
}

dag = DAG(
    dag_id='runjar',          # DAG takes a dag_id, not a task_id
    schedule_interval=None,   # manually triggered
    default_args=args)

run_jar_task = BashOperator(
    task_id='runjar',
    dag=dag,
    bash_command='java -cp /path/to/your/jar.jar param1 param2'
)
Airflow will happily run .jar files. There are a few examples kicking about for you to have a look at.
Running a standard .jar file: run_jar.py
Running a "built" Talend jobl loan_application_data.py
Obviously with both these examples the .jar or Talend file(s) will need to be on the server Airflow is executing on (as well as Java).
