I want to have a single DAG to download data from an FTP, I don't need all the data in the FTP just certain files. The files get uploaded daily at certain times throughout the day and I want to retrieve these files shortly after they are available on the FTP site.
Example FTP schedule:
/Data/US/XASE/yyyymmdd.csv #uploaded daily at 9:00 PM UTC
/Data/EU/TRWB/yyyymmdd.csv #uploaded daily at 1:00 PM UTC
...
/Data/EU/XEUR/yyyymmdd.csv #uploaded daily at 11:00 AM UTC
How can I set the schedule in the DAG so that I copy the data from the FTP site as the files are uploaded, without having a separate DAG for each upload time?
I think you have three options for scheduling here.
Option 1
You run exactly at 11 AM, 1 PM, and 9 PM UTC with the schedule 0 11,13,21 * * *. Or maybe 5 minutes after the full hour to add some buffer (5 11,13,21 * * *).
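As a rough sketch of what that looks like in a DAG definition (the dag_id, start_date and download callable below are made up for illustration, not from the question):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def download_files(**context):
    ...  # placeholder: fetch whichever files are due at this run's hour

with DAG(
    dag_id="ftp_download",                 # hypothetical name
    start_date=datetime(2021, 1, 1),       # hypothetical start date
    schedule_interval="5 11,13,21 * * *",  # 5 minutes after 11:00, 13:00 and 21:00 UTC
    catchup=False,
) as dag:
    download = PythonOperator(task_id="download", python_callable=download_files)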
Option 2
You run the DAG more frequently, check within the task whether the files are available yet, and download them if they are. This makes sense if there is a higher chance that the file upload is delayed.
For example, */10 10-22 * * * would run every 10 minutes between 10:00 and 22:00.
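A sketch of the check-and-download part of such a task, using plain ftplib (host, credentials and paths are invented; AirflowSkipException simply marks the task as skipped when the file isn't there yet):

from ftplib import FTP

from airflow.exceptions import AirflowSkipException

def fetch_if_available(remote_path, local_path, **context):
    """Download remote_path if it is already on the FTP server, otherwise skip this run."""
    ftp = FTP("ftp.example.com")       # hypothetical host
    ftp.login("user", "password")      # hypothetical credentials
    directory, filename = remote_path.rsplit("/", 1)
    ftp.cwd(directory)
    if filename not in ftp.nlst():
        ftp.quit()
        raise AirflowSkipException(f"{remote_path} not uploaded yet")
    with open(local_path, "wb") as fh:
        ftp.retrbinary(f"RETR {filename}", fh.write)
    ftp.quit()

You would wire this up as the python_callable of one PythonOperator per file inside the frequently scheduled DAG.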
Option 3
You schedule the DAG once per day (@daily) and then work with TimeDeltaSensor. I think this option is the least preferable, as you have a lot of tasks just "waiting", which can block the execution of other Airflow tasks.
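A sketch of that, assuming Airflow 2 and a daily schedule (names are made up; each delta corresponds to one of the upload times above):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.sensors.time_delta import TimeDeltaSensor

with DAG(
    dag_id="ftp_download_waiting",     # hypothetical name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Each sensor waits until N hours after the end of the daily data interval,
    # i.e. roughly until that file's upload time on the day the run executes.
    wait_xeur = TimeDeltaSensor(task_id="wait_xeur", delta=timedelta(hours=11))
    wait_trwb = TimeDeltaSensor(task_id="wait_trwb", delta=timedelta(hours=13))
    wait_xase = TimeDeltaSensor(task_id="wait_xase", delta=timedelta(hours=21))
    # downstream: one download task per file, e.g. wait_xeur >> download_xeur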
Besides that, it also depends heavily on how you want to handle the download from the FTP itself.
I guess you could create a task for every file to download per day and put a task based on BranchPythonOperator in front to avoid trying to download the same file multiple times.
Or you put the whole logic into a PythonOperator, including logic that downloads only certain files based on the execution_date.
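To make the BranchPythonOperator idea a bit more concrete, here is a hedged sketch (the hour-to-file mapping, task names and download body are invented; data_interval_end requires Airflow 2.2+): the branch only returns the download task whose file is uploaded at the hour of the current run, so earlier files are not downloaded again.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import BranchPythonOperator, PythonOperator

# Hypothetical mapping of download task_ids to the UTC hour their file is uploaded.
UPLOAD_HOURS = {"download_xeur": 11, "download_trwb": 13, "download_xase": 21}

def pick_due_download(**context):
    """Return only the download task(s) whose file is uploaded at this run's hour."""
    run_hour = context["data_interval_end"].hour
    return [task_id for task_id, hour in UPLOAD_HOURS.items() if hour == run_hour]

def download(file_key, **context):
    ...  # placeholder: map file_key to its FTP path and download it

with DAG(
    dag_id="ftp_download_branching",   # hypothetical name
    start_date=datetime(2021, 1, 1),
    schedule_interval="5 11,13,21 * * *",
    catchup=False,
) as dag:
    branch = BranchPythonOperator(task_id="branch", python_callable=pick_due_download)
    downloads = [
        PythonOperator(task_id=task_id, python_callable=download, op_args=[task_id])
        for task_id in UPLOAD_HOURS
    ]
    branch >> downloads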
I'm trying to make my DAGs run every Monday at 08:00 AM. For this purpose, I have defined the corresponding schedule interval schedule_interval='0 8 * * 1'.
However, two problems arise - which are likely due to the same issue:
My DAGs never seem to trigger
When I force the DAGs to run, they always run for the previous Monday, e.g. if I force the start today (21-10-2021) it will actually trigger a run for the previous week's Monday, 11-09-2021.
Why does this occur and how can I fix it?
It's not delayed.
Airflow schedules runs at the END of the interval. You can check this answer for more details.
This behavior makes sense in the ETL domain, as you normally run ETL at the end of a specific time interval. To give an example: today you are parsing yesterday's data.
That said, on Airflow >= 2.2.0 a new concept of Timetables has been introduced with the completion of AIP-39 (Richer scheduler_interval), see the release notes. In simple words, Airflow decoupled the "when to run" (Timetable) from the "what interval of time to process" (Data Interval), thus resolving the issue you experience at the root. You can read the documentation about it here.
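If it helps to see those interval boundaries, here is a small hypothetical DAG (Airflow 2.2+) that just logs what the scheduler hands it: for the '0 8 * * 1' schedule, the run that starts on a given Monday 08:00 carries the data interval of the week before it.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def show_interval(**context):
    # With a weekly schedule, data_interval_start is the previous Monday 08:00
    # and data_interval_end is the Monday 08:00 at which the run actually starts.
    print("interval start:", context["data_interval_start"])
    print("interval end:  ", context["data_interval_end"])

with DAG(
    dag_id="weekly_interval_demo",     # hypothetical name
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 8 * * 1",
    catchup=False,
) as dag:
    PythonOperator(task_id="show_interval", python_callable=show_interval)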
As stated in the Apache Airflow documentation, I can control how often a DAG is updated by setting the configuration variable min_file_process_interval in the airflow.cfg file:
min_file_process_interval
Number of seconds after which a DAG file is parsed. The DAG file is parsed every min_file_process_interval number of seconds. Updates to DAGs are reflected after this interval. Keeping this number low will increase CPU usage.
However, I didn't find any clue or best practice about which value I should set for min_file_process_interval.
Example
My DAG changes once a day. By default, min_file_process_interval is set to 30 seconds. This means that most of the time updating the DAG is useless: as long as the DAG doesn't change, the updated DAG and the previous DAG are the same, yet it consumes resources and generates logs. But if I update the DAG only once a day, do I risk running the wrong DAG if the DAG changes after the daily DAG update, or is the DAG also updated just before each run?
What value should I set for min_file_process_interval in this case?
EDIT: As stated in Elad's answer responding to a previous version of this question, dynamic DAGs should be avoided. However, if I do have dynamic DAGs, how should I choose min_file_process_interval?
You are mixing two different things.
min_file_process_interval controls how often Airflow scans the .py files and updates the DAGs within Airflow. Consider that when you deploy a new .py file, Airflow needs to read it and register it in the metastore database - so the setting is about how often that happens.
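For reference, the setting lives in the [scheduler] section of airflow.cfg (the 300 below is just an arbitrary example value, not a recommendation):

[scheduler]
# Re-parse DAG files every 5 minutes instead of the default 30 seconds.
min_file_process_interval = 300

# Equivalent environment variable:
# AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL=300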
For your use case the DAG code should not be updated every day - in fact, it should not be updated at all. It should just run every day. Your DAG only needs to be able to handle the correct file per date. Your code can be something like:
from datetime import datetime

from airflow import DAG
from airflow.providers.ftp.sensors.ftp import FTPSensor

default_args = {'start_date': datetime(2021, 1, 1)}

with DAG(dag_id='stackoverflow',
         default_args=default_args,
         schedule_interval="@daily",
         catchup=False
         ) as dag:

    # Waits for a file or directory to be present on the FTP server.
    sensor_op = FTPSensor(
        task_id='sensor_task',
        path='/my_folder/{{ ds }}/file.csv',  # path to your file on the server
        fail_on_transient_errors=False,
        ftp_conn_id='ftp_default'
    )

    # Operator to process the file (SomeOperator is a placeholder)
    operator_op = SomeOperator()

    sensor_op >> operator_op
In that DAG, a run will start every day. The first operator is a sensor, so if the file for that day isn't present yet, the workflow will wait until it appears; only once it appears will the workflow continue to the 2nd operator, which should process the file.
Note that the path parameter of FTPSensor is templated. This means you can use macros like {{ ds }}, which will give you a path that contains each day's date, like:
/my_folder/2021-05-01/file.csv
/my_folder/2021-05-02/file.csv
/my_folder/2021-05-03/file.csv
You can also do path='/my_folder/{{ ds }}.csv' which will give:
/my_folder/2021-05-01.csv
/my_folder/2021-05-02.csv
/my_folder/2021-05-03.csv
While assigning the same tasks to two different DAGs, it's showing "Tried to set relationships between tasks in more than one DAG".
A task cannot be assigned to more than one DAG.
You can, however, write code that generates the tasks from a configuration file, so the same configuration file is used in both the daily DAG and the weekly DAG.
Another option is to use a single daily DAG that also uses DayOfWeekSensor.
That way you can execute your daily tasks as usual and have additional tasks that run only once a week.
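A rough sketch of that second option (task names and callables are made up; DayOfWeekSensor lives in airflow.sensors.weekday in Airflow 2): the weekly task sits behind a sensor that succeeds immediately on Mondays and, thanks to soft_fail and a short timeout, is skipped together with its downstream task on every other day.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.weekday import DayOfWeekSensor

def daily_work(**context):
    ...  # placeholder for the work that runs every day

def weekly_work(**context):
    ...  # placeholder for the work that should run once a week

with DAG(
    dag_id="daily_and_weekly",         # hypothetical name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    daily = PythonOperator(task_id="daily", python_callable=daily_work)

    # Succeeds right away on Mondays; on other days it times out after about a
    # minute and soft_fail turns that into a skip, which also skips the weekly task.
    is_monday = DayOfWeekSensor(task_id="is_monday", week_day="Monday",
                                timeout=60, soft_fail=True)

    weekly = PythonOperator(task_id="weekly", python_callable=weekly_work)

    is_monday >> weekly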
Is there a best practice to schedule a task for a specified date and hour in Meteor (e.g. what Buffer does with tweets)? I thought of a collection (sendlater) that contains all the info needed (date, time, code to run), and server code that checks (maybe every minute) whether there is code to run, and runs it.
I want to make a report of the start and end times of an Autosys job over the last three months.
How can I get them? Do I need to check archived history or logs?
If yes, please let me know the details.
TIA
Autosys internally uses an Oracle or Sybase database. As long as the data is available in the DB, you can fetch it using the autorep command. To get past run times, use the -r option.
For example: autorep -J JobA -r -30
The above will give you the run time of the job's 30th-last run.
However, due to performance bottlenecks that can arise from historical data in the DBs, the DBAs generally purge the data after a while. I have seen retention periods of 1 to 7 days, depending on the number of jobs and the power of the database instance.
Another, approximate, way would be to use the log files created by Autosys if the std_out option is specified with unique filenames.
For example: you can have the attribute as std_out: $JOB_NAME.out.`date +%m.%s`
In this case the log file will be created as soon as the job starts, so you can get the start time from the filename using text functions on Unix, etc.
For the end time, you can use the last-modified time - this is where the approximation comes in, as it depends on whether your job writes to the log file right before it finishes. It can be close or far off depending on what the script does.
This method will not tell you the times for box jobs, as they never have a log attribute; for those you can rely on the first job in the box.
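If you go the log-file route, here is a small hypothetical Python sketch of that approximation (the path pattern and job name are invented): the date stamp in the filename stands in for the start, the last-modified time for the end.

import glob
import os
from datetime import datetime

# Hypothetical location of the per-run log files written via the std_out attribute.
for log_file in sorted(glob.glob("/var/log/autosys/JobA.out.*")):
    # The date stamp embedded in the filename marks roughly when the job started...
    stamp = os.path.basename(log_file).split(".out.")[-1]
    # ...and the last write to the file approximates when it finished.
    end = datetime.fromtimestamp(os.path.getmtime(log_file))
    print(f"{os.path.basename(log_file)}: started around {stamp}, ended around {end}")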