I am new to Airflow. I have some .jar jobs generated with Talend Open Studio for Big Data (TOS), and I want to schedule and manage them with Airflow. My question is: does Airflow support running .jar files, or jobs generated by TOS, as a DAG?
If it does, how?
Or is there an alternative way to run a .jar on Airflow?
I'm using Airflow v1.10.3.
The jobs mainly extract and process data from a MongoDB database, then update the database with the newly processed data.
Thanks!
Airflow does support running jar files. You do this through the BashOperator.
Quick example:
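from datetime import datetime

from airflow import DAG
# Airflow 1.10.x import path; in Airflow 2.x use airflow.operators.bash
from airflow.operators.bash_operator import BashOperator

args = {
    'owner': 'you',
    'start_date': datetime(2019, 4, 24),
    'provide_context': True
}

dag = DAG(
    dag_id='runjar',  # a DAG takes dag_id, not task_id
    schedule_interval=None,  # manually triggered
    default_args=args)

run_jar_task = BashOperator(
    task_id='runjar',
    dag=dag,
    # -jar runs the Main-Class named in the jar's manifest; if there is none,
    # use 'java -cp /path/to/your/jar.jar your.MainClass param1 param2'
    bash_command='java -jar /path/to/your/jar.jar param1 param2'
)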
Airflow will happily run .jar files. There are a few examples around for you to have a look at.
Running a standard .jar file: run_jar.py
Running a "built" Talend job: loan_application_data.py
With both of these examples, the .jar or Talend file(s) need to be on the server Airflow is executing on, and Java must be installed there as well.
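As a rough sketch of how the command string passed to bash_command could be assembled, the helper below quotes each part so that paths or parameters containing spaces survive the shell. The function name build_java_command, the jar path, and the --context_param argument are illustrative assumptions, not something taken from the examples above.

```python
import shlex


def build_java_command(jar_path, args=(), java_bin="java"):
    """Return a shell-safe 'java -jar ...' string for use as bash_command."""
    parts = [java_bin, "-jar", jar_path, *args]
    # shlex.quote protects paths and parameters that contain spaces
    return " ".join(shlex.quote(str(part)) for part in parts)


# Hypothetical jar path and job parameter, for illustration only
command = build_java_command(
    "/path/to/your/jar.jar",
    ["--context_param", "db_host=localhost"],
)
print(command)
```

The resulting string can then be passed as bash_command to a BashOperator like the one shown earlier.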
I'm new to Airflow. I'm following the official tutorial to set up my first DAG and task:
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    'owner': 'admin',
    'retries': 3,
    'retry_delay': timedelta(minutes=1)
}

with DAG(
    dag_id="hello_world_dag",
    description="Hello world DAG",
    start_date=datetime(2023, 1, 16),
    schedule_interval='@daily',  # schedule presets start with '@', not '#'
    default_args=default_args
) as dag:
    task1 = BashOperator(
        task_id="hello_task",
        bash_command="echo hello world!"
    )

    task1
When I tried to run this manually, it always failed. I've checked the web server logs and the scheduler logs, and they don't show any obvious errors. I also checked the task run log, but it's empty.
The setup is pretty simple: SequentialExecutor with SQLite. My question is: where can I see the worker logs, or any other place where a useful message might be logged?
OK, I finally figured this out.
Firstly, let me correct my question: there is actually an error raised in the scheduler log saying that "BashTaskRunner" cannot be loaded. So I searched Airflow's source code and found that it was renamed to StandardTaskRunner about 3 years ago (link).
This is the only occurrence of the word BashTaskRunner in the whole repo. So I'm curious how AIRFLOW_HOME/airflow.cfg was generated with this as the default task_runner value.
I'm having some issues trying to set up a basic DAG file in Airflow (I also have two other files).
I'm using the LocalExecutor on Ubuntu and saved my files at "C:\Users\tdamasce\Documents\workspace", with the DAG and log files inside it.
My script is:
# step 1 - libraries
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.utils.dates import days_ago  # needed for days_ago() below

# step 2
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': days_ago(2),
    'retries': 0
}

# step 3
dag = DAG(
    dag_id='DAG-1',
    default_args=default_args,
    catchup=False,
    schedule_interval=timedelta(minutes=5)
)

# step 4
start = DummyOperator(
    task_id='start',
    dag=dag
)

end = DummyOperator(
    task_id='end',
    dag=dag
)
My DAG stays like this:
Please let me know if any additional info is needed.
From your updated question, I can see that you placed the DAGs in the directory
"C:\Users\tdamasce\Documents\workspace", with the DAG and log files
inside it.
You need to put your DAGs in the dags_folder (specified in airflow.cfg; by default it is the $AIRFLOW_HOME/dags subfolder). Check your AIRFLOW_HOME variable and you should find a dags folder there.
You can also run airflow list_dags (airflow dags list in Airflow 2.x), which lists all the DAGs Airflow has found.
If you still cannot see your DAG in the UI, restart the servers.
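To check quickly which folder that is and what it contains, something like the following sketch may help. It only covers the default location ($AIRFLOW_HOME/dags, with AIRFLOW_HOME falling back to ~/airflow); if dags_folder is overridden in airflow.cfg, that value applies instead. The function names are made up for this example.

```python
import os
from pathlib import Path


def default_dags_folder(env=None):
    """Return the default dags folder: $AIRFLOW_HOME/dags, or ~/airflow/dags."""
    env = os.environ if env is None else env
    airflow_home = env.get("AIRFLOW_HOME") or "~/airflow"
    return Path(airflow_home).expanduser() / "dags"


def list_dag_files(dags_folder):
    """Return the sorted .py files found anywhere under dags_folder."""
    folder = Path(dags_folder)
    if not folder.is_dir():
        return []
    return sorted(folder.rglob("*.py"))


if __name__ == "__main__":
    folder = default_dags_folder()
    print("Looking for DAG files in:", folder)
    for dag_file in list_dag_files(folder):
        print("  ", dag_file)
```

If your DAG file is not in the printed list, Airflow's default configuration will not see it either.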
I am using Airflow v2.0 on Windows 10 WSL (Ubuntu 20.04).
The warning message is:
/home/jainri/.local/lib/python3.8/site-packages/airflow/models/dag.py:1342: PendingDeprecationWarning: The requested task could not be added to the DAG because a task with task_id create_tag_template_field_result is already in the DAG. Starting in Airflow 2.0, trying to overwrite a task will raise an exception.
warnings.warn(
Done.
Besides this warning, the web UI also lists the example DAGs that are included with Apache Airflow. I have set AIRFLOW_HOME, and Airflow does pick up my DAGs from there, but the example DAGs are displayed as well. I have also posted a screenshot of the web UI.
WebUI
This is the dag below that I am trying to run:
import datetime
import logging

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

#
# TODO: Define a function for the python operator to call
#
def greet():
    logging.info("Hello Rishabh!!")

dag = DAG(
    'lesson1.demo1',
    start_date=datetime.datetime.now()
)

#
# TODO: Define the task below using PythonOperator
#
greet_task = PythonOperator(
    task_id='greet_task',
    python_callable=greet,
    dag=dag
)
The main issue is that the web UI shows a huge list of example DAGs along with my own, which makes it cumbersome to find my own DAGs.
I found the issue: the warning you are seeing comes from airflow/example_dags/example_complex.py, one of the example DAGs shipped with Airflow.
To disable loading of the example DAGs, set the environment variable AIRFLOW__CORE__LOAD_EXAMPLES=False, or set load_examples = False in the [core] section of airflow.cfg (docs).
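If you are unsure which of the two settings is in effect, the following sketch mimics the precedence Airflow applies (the environment variable wins over airflow.cfg). It is a simplified stand-in and not Airflow's own config code; the function name and the boolean parsing are assumptions made for this example.

```python
import configparser
import os


def load_examples_enabled(cfg_text, env=None):
    """Return the effective [core] load_examples value.

    The AIRFLOW__CORE__LOAD_EXAMPLES environment variable wins over airflow.cfg.
    """
    env = os.environ if env is None else env
    raw = env.get("AIRFLOW__CORE__LOAD_EXAMPLES")
    if raw is None:
        # interpolation=None so that '%' characters in other options are harmless
        parser = configparser.ConfigParser(interpolation=None)
        parser.read_string(cfg_text)
        # Fallback mirrors the stock default, which loads the examples
        raw = parser.get("core", "load_examples", fallback="True")
    # Simplified boolean parsing, assumed for this sketch
    return raw.strip().lower() in ("true", "t", "1")


# Minimal, made-up config text for illustration
sample_cfg = "[core]\nload_examples = False\n"
print(load_examples_enabled(sample_cfg, env={}))
```

After changing the setting, restart the webserver and scheduler; example DAGs that were already registered may still be listed until the metadata database is cleaned up or reset.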
I have created a new DAG using the following code. It is calling a python script.
Code:
from __future__ import print_function
from builtins import range

import airflow
from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator

args = {
    'owner': 'admin',
    # tasks need a start_date, otherwise adding them to the DAG fails
    'start_date': airflow.utils.dates.days_ago(2)
}

dag = DAG(
    dag_id='workflow_file_upload', default_args=args,
    schedule_interval=None)

t1 = BashOperator(
    task_id='testairflow',
    bash_command='python /root/DataLake_Scripts/File_Upload_GCP.py',
    dag=dag)
I have placed it in the $AIRFLOW_HOME/dags folder.
After that I ran:
airflow scheduler
I am trying to see the DAG in the web UI, but it is not visible there, and no error is shown.
I ran into the same issue.
I figured out that the problem is in the initial SQLite database. I suppose it is a quirk of Airflow 1.10.3.
Anyway, I solved the problem by switching to a PostgreSQL backend.
These links will help you:
link
link
link
All the instructions work with Python 3.
You'll see your DAG after running the 'airflow webserver' and 'airflow scheduler' commands.
Also note that you should run 'sudo service postgresql restart' right before the 'airflow initdb' command.
I have created a python_scripts/ folder under my dags/ folder.
I have 2 different DAGs that both use a PythonOperator, each calling a different Python script located in the python_scripts/ folder.
They both write output files, BUT:
one of them creates its file under the dags/ folder, and the other creates it in the plugins/ folder.
How does Airflow determine the working path?
How can I get Airflow to write all outputs to the same folder?
One thing you could try, which I use in my DAGs, is to set your working path by adding os.chdir('some/path') in your DAG file.
This only works if you do not put it into an operator, as those are run in subprocesses and therefore do not change the working path of the parent process.
The other solution I could think of would be using absolute paths when specifying your output.
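A minimal sketch of the absolute-path approach is shown here. OUTPUT_DIR and the helper output_path are hypothetical names chosen for this example, and the temp-directory location is only a placeholder for whatever folder the Airflow user can write to.

```python
import tempfile
from pathlib import Path

# Hypothetical output directory; replace with a folder the Airflow user can write to
OUTPUT_DIR = Path(tempfile.gettempdir()) / "airflow_outputs"


def output_path(filename, base_dir=OUTPUT_DIR):
    """Return an absolute path for filename under base_dir, creating base_dir."""
    base = Path(base_dir).resolve()
    base.mkdir(parents=True, exist_ok=True)
    return base / filename


def write_some_file():
    """Example python_callable: the file lands in OUTPUT_DIR whatever the cwd is."""
    target = output_path("testfile.txt")
    target.write_text("test\n")
    return target
```

Because every script builds its paths from the same base directory, the outputs end up in one place regardless of which working directory the task process happens to start in.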
For the approach with os.chdir try the following and you should see both files get created in the folder defined with path='/home/chr/test/':
from datetime import datetime
import os
import logging

from airflow import DAG
from airflow.exceptions import AirflowException
from airflow.operators.python_operator import PythonOperator

log = logging.getLogger(__name__)

default_args = {
    'owner': 'admin',
    'depends_on_past': False,
    'retries': 0
}

dag = DAG('test_dag',
          description='Test DAG',
          catchup=False,
          schedule_interval='0 0 * * *',
          default_args=default_args,
          start_date=datetime(2018, 8, 8))

# Set the working directory at DAG level, creating it if needed
path = '/home/chr/test'
if os.path.isdir(path):
    os.chdir(path)
else:
    os.mkdir(path)
    os.chdir(path)


def write_some_file():
    try:
        # Absolute path: independent of the working directory
        with open("/home/chr/test/absolute_testfile.txt", "wt") as fout:
            fout.write('test1\n')
        # Relative path: resolved against the working directory set above
        with open("relative_testfile.txt", "wt") as fout:
            fout.write('test2\n')
    except Exception as e:
        log.error(e)
        raise AirflowException(e)


write_file_task = PythonOperator(
    task_id='write_some_file',
    python_callable=write_some_file,
    dag=dag
)
Also, please try to provide code next time you ask a question, as it is almost impossible to find out what the problem is, just by reading your question.