I'm trying to use a sensor on a directory whose name has the format YYYYMMDD, which means the directory changes every day.
For example: today the sensor checks for a directory named 20211129; if it hasn't been created, the files haven't arrived yet. Tomorrow, suppose the files arrive (and the directory 20211130 is created): Airflow has to be able to switch to checking for 20211130.
I'm trying to do that with macros, but it doesn't work properly.
Just for testing purposes, I set the schedule to every minute to see whether the templated value changes.
dir_main_name = f"archive/files"
templated_log_dir = f"{dir_main_name}/" + '{{(execution_date).strftime("%Y-%m-%d-%M")}}'

args = {
    'start_date': datetime(2021, 11, 29),
}

with DAG(dag_id='GCS_check', catchup=False, schedule_interval="*/1 * * * * *", default_args=args) as dag:

    sensor_check = GoogleCloudStorageObjectSensor(
        task_id='gcs_polling',
        bucket=BUCKET_NAME,
        object=f"{dir_main_name}/" + '{{ds}}'
    )

    sensor_check
I've tested with "ds" and with "execution_date", but neither works.
As the log shows, the time of the sensor checks differs from the time used in the directory's name (just as an example); the directory's name doesn't change:
[2021-11-29, 16:48:52 UTC] {gcs.py:84} INFO - Sensor checks existence of : dev-datalake-archive-myralis-com, archive/files/2021-11-29T16:48:47+00:00
[2021-11-29, 16:48:52 UTC] {credentials_provider.py:295} INFO - Getting connection using `google.auth.default()` since no key file is defined for hook.
[2021-11-29, 16:49:52 UTC] {gcs.py:84} INFO - Sensor checks existence of : dev-datalake-archive-myralis-com, archive/files/2021-11-29T16:48:47+00:00
[2021-11-29, 16:49:52 UTC] {credentials_provider.py:295} INFO - Getting connection using `google.auth.default()` since no key file is defined for hook.
[2021-11-29, 16:50:53 UTC] {gcs.py:84} INFO - Sensor checks existence of : dev-datalake-archive-myralis-com, archive/files/2021-11-29T16:48:47+00:00
The path stays at the same timestamp (2021-11-29T16:48:47+00:00).
Is there a way for the directory value to match the date (and hour) of the sensor check?
For instance, in the last line the check runs at 2021-11-29, 16:50:53, so the directory should be 2021-11-29T16:50:53+00:00.
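For clarity, this is roughly the kind of templated path I'm after, assuming the daily directories follow the YYYYMMDD pattern; it's only a fragment of the DAG above (BUCKET_NAME and the DAG definition stay as they are), using the built-in ds_nodash macro:

dir_main_name = "archive/files"

sensor_check = GoogleCloudStorageObjectSensor(
    task_id='gcs_polling',
    bucket=BUCKET_NAME,
    # ds_nodash renders as YYYYMMDD for the run's logical date; templated fields
    # are rendered once per task instance, not again on every poke
    object=f"{dir_main_name}/" + '{{ ds_nodash }}'
)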
I am relatively new to Airflow and am currently running 2.2.3. When trying to set up a custom timetable, I was getting the error:
airflow ValueError: Not a valid timetable
Following the discussion here: https://github.com/apache/airflow/issues/19578
I changed a line in my local copy of airflow/models/dag.py that you get when installing Airflow.
I changed the infer_automated_data_interval function as follows:
def infer_automated_data_interval(self, logical_date: datetime) -> DataInterval:
    """Infer a data interval for a run against this DAG.

    This method is used to bridge runs created prior to AIP-39
    implementation, which do not have an explicit data interval. Therefore,
    this method only considers ``schedule_interval`` values valid prior to
    Airflow 2.2.

    DO NOT use this method if there is a known data interval.
    """
    timetable_type = type(self.timetable)
    if issubclass(timetable_type, (NullTimetable, OnceTimetable)):
        return DataInterval.exact(timezone.coerce_datetime(logical_date))
    start = timezone.coerce_datetime(logical_date)
    if issubclass(timetable_type, CronDataIntervalTimetable):
        end = cast(CronDataIntervalTimetable, self.timetable)._get_next(start)
    elif issubclass(timetable_type, DeltaDataIntervalTimetable):
        end = cast(DeltaDataIntervalTimetable, self.timetable)._get_next(start)
    else:
        end = logical_date  # I added this to avoid raising the ValueError
        # raise ValueError(f"Not a valid timetable: {self.timetable!r}")  # I commented this out
    return DataInterval(start, end)
As far as I can see the jobs are now running on the expected schedule - but this feels like a very hacky and dangerous solution. Has anyone else had this issue?
What is this function supposed to be checking, and why is it failing?
Thanks!
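For context, here is my rough (unverified) sketch of how a custom timetable is normally expected to supply its data intervals itself, via infer_manual_data_interval and next_dagrun_info; the class name is made up:

from airflow.timetables.base import DataInterval, Timetable

class MyCustomTimetable(Timetable):  # hypothetical custom timetable
    def infer_manual_data_interval(self, *, run_after):
        # for manually triggered runs, treat the trigger time as an exact interval
        return DataInterval.exact(run_after)

    def next_dagrun_info(self, *, last_automated_data_interval, restriction):
        # the scheduling logic for automated runs would go here
        ...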
If I run this locally in the CLI, it runs successfully and copies the files from the other bucket/key to mine, into the correct location.
aws s3 sync s3://client_export/ref/commissions/snapshot_date=2022-01-01/ s3://bi-dev/KSM/refinery29/commissions/snapshot_date=2022-01-01/
When I try with the S3CopyObjectOperator I see the NoSuchKey error:
copy_commissions_data = S3CopyObjectOperator(
    task_id='copy_commissions_data',
    aws_conn_id='aws_default',
    source_bucket_name='client_export',
    dest_bucket_name='bi-dev',
    source_bucket_key='ref/commissions/snapshot_date=2022-01-01,
    dest_bucket_key='KSM/refix/commissions/snapshot_date=2022-01-01',
    dag=dag
)
I've also tried adding a / before the key names, after them, and both, but I get the same error.
You are missing a quote at the end of line 6; it should be:
source_bucket_key='ref/commissions/snapshot_date=2022-01-01',
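With that quote fixed, the operator from the question would look like this (everything else unchanged); note that, unlike aws s3 sync, S3CopyObjectOperator copies a single object, so the key has to point at an actual object rather than a prefix:

copy_commissions_data = S3CopyObjectOperator(
    task_id='copy_commissions_data',
    aws_conn_id='aws_default',
    source_bucket_name='client_export',
    dest_bucket_name='bi-dev',
    source_bucket_key='ref/commissions/snapshot_date=2022-01-01',  # quote added here
    dest_bucket_key='KSM/refix/commissions/snapshot_date=2022-01-01',
    dag=dag
)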
(1) The test suite with installation & running instructions:
https://github.com/TeddyTeddy/robot-fw-rest-instance-library-tests-v2
(2) The test suite is testing a locally running JSON RESTFUL API server:
https://github.com/typicode/json-server
The DB file for the server (i.e. db.json) is at the root of (1). The server reads the file and creates the API endpoints based on it.
An example run of the server:
(base) ~/Python/Robot/robot-fw-rest-instance-library-tests-v2$ json-server --watch db.json
\{^_^}/ hi!
Loading db.json
Done
Resources
http://localhost:3000/posts
http://localhost:3000/comments
http://localhost:3000/albums
http://localhost:3000/photos
http://localhost:3000/users
http://localhost:3000/todos
(3) With the given db.json, you can make the following request to the server:
GET /posts?_start=<start_index>&_end=<end_index>
where _start is inclusive and _end is exclusive. Note that start_index starts from 0, just like in the Array.slice method.
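For instance (a Python sketch that is not part of the test suite, just illustrating the semantics), the call behaves like a list slice:

import requests

# GET /posts?_start=20&_end=30 behaves like posts[20:30]:
# index 20 is included, index 30 is excluded
response = requests.get("http://localhost:3000/posts", params={"_start": 20, "_end": 30})
posts = response.json()
print(len(posts))  # at most 10 items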
(4) To be able to comprehensively test (3), I wrote the following Robot test case, which is provided in (1):
Slicing Posts With All Possible Start And End Combinations
    [Documentation]    Referring to the API documentation:
    ...                GET /posts?_start=20&_end=30
    ...                where _start is inclusive and _end is exclusive
    ...                This test case makes the above API call with all possible combinations of _start and _end values.
    ...                For each call, the test case fetches expected_posts from the database for the same _start and _end.
    ...                It then compares the expected_posts with observed_posts. It also calculates the expected length
    ...                of observed_posts and compares that with the observed length of observed_posts
    [Tags]    read-tested    slicing    run-me-only
    FOR    ${start_index}    IN RANGE    ${0}    ${NUMBER_OF_POSTS+10}
        FOR    ${end_index}    IN RANGE    ${start_index+1}    ${NUMBER_OF_POSTS + 10 + 1}
            Log To Console    start_index:${start_index}
            Log To Console    end_index:${end_index}
            # note that start_index starts from zero when posts are fetched from database
            ${expected_posts} =    Fetch Posts From Database    ${start_index}    ${end_index}
            # note that start_index starts from 0 too when posts are fetched via API call
            # test call
            ${observed_posts} =    Get Sliced Posts    ${start_index}    ${end_index}
            Should Be Equal    ${expected_posts}    ${observed_posts}
            # note that start_index is between [0, NUMBER_OF_POSTS-1]
            # and end_index is between [start_index+1, start_index+NUMBER_OF_POSTS]
            # we expect observed_posts to be a non-empty list at least containing 1 item
            ${observed_length} =    Get Length    ${observed_posts}
            # calculate expected_length of the observed_posts list
            IF    ${end_index} < ${NUMBER_OF_POSTS}
                ${expected_length} =    Evaluate    $end_index-$start_index
            ELSE IF    ${end_index} >= ${NUMBER_OF_POSTS} and ${start_index} < ${NUMBER_OF_POSTS}
                ${expected_length} =    Evaluate    $NUMBER_OF_POSTS-$start_index
            ELSE
                ${expected_length} =    Set Variable    ${0}
            END
            Should Be Equal    ${expected_length}    ${observed_length}
            Free Memory    ${expected_posts}    # (*)
            Free Memory    ${observed_posts}    # (*)
        END
        Reload Library    REST    # (**)
    END
Note that when you follow the instructions to run the test suite via ./run, you will only execute this test case (because of the --include run-me-only tag in the run command).
The problem
As the test case runs, the amount of memory Robot & RESTInstance use grows to gigabyte levels in a few minutes.
The Question
How can I prevent this from happening?
How can I free the memory used in inner loop's iteration?
My failed attempts to fix the problem
I added the lines marked with (*) to the test case, backed by the following custom keyword:
# keyword
def free_memory(reference):
    del reference
Note also that I use the RESTinstance library to make the GET call:
Get Sliced Posts
    [Documentation]    start_index starts from 1 as we fetch from the API now
    [Arguments]    ${start_index}    ${end_index}
    GET    /posts?_start=${start_index}&_end=${end_index}
    ${posts} =    Output    response body
    [Return]    ${posts}
AFAIK, the RESTinstance library keeps a list of instance objects:
https://asyrjasalo.github.io/RESTinstance/#Rest%20Instances
So this list grows by one instance object per API call. So I tried:
Reload Library    REST    # (**)
in the test case, once each iteration of the outermost FOR loop ended. I thought the list would be destroyed and re-created when the library is reloaded, but memory consumption kept rising.
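For reference, a plain del inside a keyword only removes the keyword's local name. A variant that also asks Python's garbage collector to run would look like the sketch below; this is only an experiment, and it cannot help if RESTinstance itself still holds references to the instance objects:

import gc

# hypothetical variant of the Free Memory keyword
def free_memory(reference):
    del reference  # drops only this local name
    gc.collect()   # ask the collector to reclaim unreachable objects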
In my Airflow DAG I have a task that needs to know whether it's the first time it has run or whether it's a retry. I need to adjust the logic in the task if it's a retry attempt.
I have a few ideas on how I could store the number of retries for the task, but I'm not sure whether any of them are legitimate or whether there's an easier, built-in way to get this information within the task.
I'm wondering if I can just have an integer variable inside the DAG that I increment every time the task runs. Then, if the task is rerun, I could check the value of the variable and see that it's greater than 1, which would mean it's a retry run. But I'm not sure mutable global variables work that way in Airflow, since there can be multiple workers for different tasks (I'm not sure about this, though).
Write it in an XCOM variable?
The retry number is available from the task instance, which is available via the macro {{ task_instance }}. https://airflow.apache.org/code.html#default-variables
If you are using the PythonOperator, simply add provide_context=True to your operator kwargs, and then in the callable use kwargs['task_instance'].try_number.
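For example, a minimal sketch of that approach (the task id and callable name here are made up, and it assumes an existing dag object as in the example below):

from airflow.operators.python_operator import PythonOperator

def log_try_number(**kwargs):
    # try_number of the currently running attempt of this task instance
    print("try number:", kwargs['task_instance'].try_number)

t = PythonOperator(
    task_id='try_number_python_test',
    python_callable=log_try_number,
    provide_context=True,
    dag=dag)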
Otherwise you can do something like:
t = BashOperator(
    task_id='try_number_test',
    bash_command='echo "{{ task_instance.try_number }}"',
    dag=dag)
Edit:
When the task instance is cleared, it will set the max_tries number to be the current try_number + retry value. So you could do something like:
ti = ...  # whatever method you use to get the task_instance object
is_first = ti.max_tries - ti.task.retries + 1 == ti.try_number
Airflow increments the try_number by 1 when running, so I imagine you'd need the + 1 when subtracting the configured retries value from max_tries. But I didn't test that to confirm.
@cwurtz's answer was spot on. I was able to use it like this:
def _get_actual_try_number(self, context):
    '''
    Returns the real try_number that you also see in task details or logs.
    '''
    return context['task_instance'].try_number

def _get_relative_try_number(self, context):
    '''
    When a task is cleared, the try_numbers continue to increment.
    This returns the try number relative to the last clearing.
    '''
    ti = context['task_instance']
    actual_try_number = self._get_actual_try_number(context)
    # When the task instance is cleared, it will set the max_retry
    # number to be the current try_number + retry value.
    # From https://stackoverflow.com/a/51757521
    relative_first_try = ti.max_tries - ti.task.retries + 1
    return actual_try_number - relative_first_try + 1
I have a main DAG which retrieves a file and splits the data in this file into separate CSV files.
I have another set of tasks that must be done for each of these CSV files, e.g. uploading to GCS and inserting into BigQuery.
How can I generate a SubDAG for each file dynamically, based on the number of files? The SubDAG would define tasks like uploading to GCS, inserting into BigQuery, and deleting the CSV file.
So right now, this is what it looks like:
main_dag = DAG(....)

download_operator = SFTPOperator(dag=main_dag, ...)  # downloads file
transform_operator = PythonOperator(dag=main_dag, ...)  # splits data and writes csv files

def subdag_factory():  # will return a subdag with tasks for uploading to GCS, inserting to BigQuery
    ...
    ...
How can I call the subdag_factory for each file generated in transform_operator?
I tried creating subdags dynamically as follows
# create and return a DAG
def create_subdag(dag_parent, dag_id_child_prefix, db_name):
    # dag params
    dag_id_child = '%s.%s' % (dag_parent.dag_id, dag_id_child_prefix + db_name)
    default_args_copy = default_args.copy()

    # dag
    dag = DAG(dag_id=dag_id_child,
              default_args=default_args_copy,
              schedule_interval='@once')

    # operators
    tid_check = 'check2_db_' + db_name
    py_op_check = PythonOperator(task_id=tid_check, dag=dag,
                                 python_callable=check_sync_enabled,
                                 op_args=[db_name])

    tid_spark = 'spark2_submit_' + db_name
    py_op_spark = PythonOperator(task_id=tid_spark, dag=dag,
                                 python_callable=spark_submit,
                                 op_args=[db_name])

    py_op_check >> py_op_spark
    return dag

# wrap DAG into SubDagOperator
def create_subdag_operator(dag_parent, db_name):
    tid_subdag = 'subdag_' + db_name
    subdag = create_subdag(dag_parent, tid_prefix_subdag, db_name)
    sd_op = SubDagOperator(task_id=tid_subdag, dag=dag_parent, subdag=subdag)
    return sd_op

# create a SubDagOperator for each db in db_names
def create_all_subdag_operators(dag_parent, db_names):
    subdags = [create_subdag_operator(dag_parent, db_name) for db_name in db_names]
    # chain subdag-operators together
    airflow.utils.helpers.chain(*subdags)
    return subdags

# (top-level) DAG & operators
dag = DAG(dag_id=dag_id_parent,
          default_args=default_args,
          schedule_interval=None)

subdag_ops = create_all_subdag_operators(dag, db_names)
Note that the list of inputs for which subdags are created, here db_names, can either be declared statically in the Python file or be read from an external source.
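As one illustration (not part of the original setup), the list could be pulled from an Airflow Variable instead of being hard-coded; the variable name db_names used here is made up:

from airflow.models import Variable

# hypothetical: a comma-separated Variable such as "db1,db2,db3"
db_names = [name for name in Variable.get("db_names", default_var="").split(",") if name]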
The resulting DAG (screenshot omitted) is the chain of SubDagOperators, one per database name.
Diving into one of the SubDAGs (screenshot omitted) shows the check task followed by the spark-submit task.
There are two different ways Airflow deals with dynamic DAGs.
One way is to define your dynamic DAGs in one Python file and put it into the dags_folder; it then generates the DAGs based on an external source (config files in another directory, SQL, NoSQL, etc.). The fewer changes to the structure of the DAGs, the better (that is actually true in all situations). For instance, our DAG file generates a DAG for every record (or file), and it generates the dag_id as well. On every heartbeat of the Airflow scheduler, this code goes through the list and generates the corresponding DAGs. Pros: not too many, really just one code file to change. Cons: plenty, and they come down to the way Airflow works. For every new DAG (dag_id) Airflow writes the steps into the database, so when the number of steps changes, or a step's name changes, it might break the web server. When you delete a DAG from your list, it becomes a kind of orphan: you can't access it from the web interface and you have no control over it; you can't see its steps, you can't restart it, and so on. If you have a static list of DAGs whose IDs are not going to change but whose steps occasionally do, this method is acceptable.
So at some point I came up with another solution. You have static DAGs (they are still dynamic in the sense that a script generates them, but their structure and IDs do not change). Instead of one script that walks through a list or a directory and generates DAGs, you define two static DAGs: one monitors the directory periodically (*/10 * * * *), and the other one is triggered by the first. So when a new file (or files) appears, the first DAG triggers the second one with a conf argument. The following code has to be executed for every file in the directory:
import logging
from datetime import datetime

from airflow import settings
from airflow.models import DagRun

session = settings.Session()
dr = DagRun(
    dag_id=dag_to_be_triggered,
    run_id=uuid_run_id,
    conf={'file_path': path_to_the_file},
    execution_date=datetime.now(),
    start_date=datetime.now(),
    external_trigger=True)

logging.info("Creating DagRun {}".format(dr))
session.add(dr)
session.commit()
session.close()
The triggered DAG can receive the conf arg and finish all the required tasks for the particular file. To access the conf param use this:
def work_with_the_file(**context):
    path_to_file = context['dag_run'].conf['file_path'] \
        if 'file_path' in context['dag_run'].conf else None

    if not path_to_file:
        raise Exception('path_to_file must be provided')
Pros: all the flexibility and functionality of Airflow.
Cons: the monitor DAG can be spammy.