I am writing a sensor which scans S3 files for a fixed period of time and adds the list of new files that arrived during that period to XCom for the next task. For that, I am trying to access the list of files pushed to XCom by the previous run. I can do that using the snippet below.
context['task_instance'].get_previous_ti(state=State.SUCCESS).xcom_pull(key='new_files',task_ids=self.task_id,dag_id=self.dag_id)
However, the context object is passed to the poke method, and I want to access it in __init__. Is there another way to do this without using context?
Note: I do not want to directly access the underlying database for XCom.
Thanks
I found a solution which (kinda) uses the underlying database, but you don't have to create a SQLAlchemy connection directly to use it.
The trick is using the airflow.models.DagRun object, and specifically its find() function, which allows you to grab all DAG runs by id between two dates; from there you can pull out the task instances and access their XComs.
import logging
from datetime import datetime, timedelta, timezone
from random import randint

from airflow import models
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago

default_args = {
    "start_date": days_ago(0),
    "retries": 0,
    "max_active_runs": 1,
}

with models.DAG(
    "prev_xcom_tester",
    catchup=False,
    default_args=default_args,
    schedule_interval="@hourly",
    tags=["testing"],
) as dag:

    def get_new_value(**context):
        num = randint(1, 100)
        logging.info(f"building new value: {num}")
        return num

    def get_prev_xcom(**context):
        try:
            dag_runs = models.DagRun.find(
                dag_id="prev_xcom_tester",
                execution_start_date=(datetime.now(timezone.utc) - timedelta(days=1)),
                execution_end_date=datetime.now(timezone.utc),
            )
            this_val = context["ti"].xcom_pull(task_ids="get_new_value")
            for dr in dag_runs[:-1]:
                prev_val = dr.get_task_instance("get_new_value").xcom_pull(
                    "get_new_value"
                )
                logging.info(f"Checking dag run: {dr}, xcom was: {prev_val}")
                if this_val == prev_val:
                    logging.info(f"we already processed {this_val} in {dr}")
            return (
                dag_runs[-2]
                .get_task_instance("get_new_value")
                .xcom_pull("get_new_value")
            )
        except Exception as e:
            logging.info(e)
            return 0

    def check_vals_match(**context):
        ti = context["ti"]
        prev_run_val = ti.xcom_pull(task_ids="get_prev_xcoms")
        current_run_val = ti.xcom_pull(task_ids="get_new_value")
        logging.info(
            f"Prev Run Val: {prev_run_val}\nCurrent Run Val: {current_run_val}"
        )
        return prev_run_val == current_run_val

    xcom_setter = PythonOperator(task_id="get_new_value", python_callable=get_new_value)
    xcom_getter = PythonOperator(
        task_id="get_prev_xcoms",
        python_callable=get_prev_xcom,
    )
    xcom_checker = PythonOperator(
        task_id="check_xcoms_match", python_callable=check_vals_match
    )

    xcom_setter >> xcom_getter >> xcom_checker
This DAG demonstrates how to:
Set a random int between 1 and 100 and pass it through XCom
Find all DAG runs by dag_id and time span, then check whether we have processed this value in the past
Return True if the current value matches the value from the previous run.
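Applied back to the original sensor use case, here is a minimal sketch of the same DagRun.find() lookup from inside a custom sensor's poke, with no need for context in __init__ (the class name, the "new_files" XCom key, and the one-day lookback window are assumptions for illustration, not from the original code):
from datetime import datetime, timedelta, timezone

from airflow.models import DagRun
from airflow.sensors.base import BaseSensorOperator
from airflow.utils.state import State


class S3NewFileSensor(BaseSensorOperator):
    def _files_from_previous_run(self):
        # Look up successful runs of this DAG in the last day (oldest first).
        dag_runs = DagRun.find(
            dag_id=self.dag_id,
            state=State.SUCCESS,
            execution_start_date=datetime.now(timezone.utc) - timedelta(days=1),
            execution_end_date=datetime.now(timezone.utc),
        )
        if not dag_runs:
            return []
        ti = dag_runs[-1].get_task_instance(self.task_id)
        return ti.xcom_pull(task_ids=self.task_id, key="new_files") if ti else []

    def poke(self, context):
        already_seen = self._files_from_previous_run() or []
        # ... compare the current S3 listing against already_seen ...
        return True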
Hope this helps!
I have a DAG that fetches a list of items from a source, in batches of 10 at a time, and then does dynamic task mapping on each batch. Here is the code:
import pendulum
from airflow.decorators import dag, task


@dag(dag_id="tutorial_taskflow_api", start_date=pendulum.datetime(2023, 1, 1), schedule=None)
def tutorial_taskflow_api():
    @task(multiple_outputs=True)
    def get_items(limit, cur):
        # actual logic is to fetch items and cursor from external API call
        if cur == None:
            cursor = limit + 1
            items = range(0, limit)
        else:
            cursor = cur + limit + 1
            items = range(cur, cur + limit)
        return {'cursor': cursor, 'items': items}

    @task
    def process_item(item):
        print(f"Processing item {item}")

    @task
    def get_cursor_from_response(response):
        return response['cursor']

    @task
    def get_items_from_response(response):
        return response['items']

    cursor = None
    limit = 10
    while True:
        response = get_items(limit, cursor)
        items = get_items_from_response(response)
        cursor = get_cursor_from_response(response)
        if cursor:
            process_item.expand(item=items)
        if cursor == None:
            break


tutorial_taskflow_api()
As you can see, I attempt to get a list of items from the source in batches of 10 and then do dynamic task mapping on each batch.
However, when I import this DAG, I get a DAG import timeout error:
Broken DAG: [/opt/airflow/dags/Test.py] Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/decorators/base.py", line 144, in _find_id_suffixes
for task_id in dag.task_ids:
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/utils/timeout.py", line 69, in handle_timeout
raise AirflowTaskTimeout(self.error_message)
airflow.exceptions.AirflowTaskTimeout: DagBag import timeout for /opt/airflow/dags/Test.py after 30.0s.
Please take a look at these docs to improve your DAG import time:
* https://airflow.apache.org/docs/apache-airflow/2.5.1/best-practices.html#top-level-python-code
* https://airflow.apache.org/docs/apache-airflow/2.5.1/best-practices.html#reducing-dag-complexity, PID: 23822
How can I solve this?
I went through the documentation and found that the while-loop logic shouldn't really sit at the top level of the DAG file but in some other task. But if I put it in another task, how can I perform dynamic task mapping from inside that task?
This code:
while True:
    response = get_items(limit, cursor)
    items = get_items_from_response(response)
    cursor = get_cursor_from_response(response)
    if cursor:
        process_item.expand(item=items)
    if cursor == None:
        break
runs in the DagFileProcessor before any DAG run is created, and it executes every min_file_process_interval as well as each time Airflow retries a task in this DAG. Airflow has some timeouts, such as dagbag_import_timeout, which is the maximum duration the DagFileProcessor may spend parsing a DAG file before a timeout exception is raised; if you have a big batch, or the API has some latency, you can easily exceed this duration.
Also, you are treating cursor = get_cursor_from_response(response) as a normal Python variable, but it is not one: its value is not available until a DAG run is created and the task has actually executed.
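To make that concrete, here is roughly what happens at parse time with the names from the question (a sketch; the behavior is described in the comments):
from airflow.models.xcom_arg import XComArg

# Inside the DAG file, before any DAG run exists:
response = get_items(limit, cursor)          # only registers a task; returns an XComArg reference
cursor = get_cursor_from_response(response)  # also just an XComArg, not the real cursor value
assert isinstance(cursor, XComArg)           # holds at parse time
# The `cursor == None` check never becomes True for this reference, so the
# `while True` loop never breaks: every pass just registers more copies of the
# tasks (hence `_find_id_suffixes` in the traceback) until the 30-second
# DagBag import timeout fires.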
Solution and best practices:
Dynamic Task Mapping is designed to solve exactly this problem, and it is flexible, so you can use it in different ways:
import pendulum
from airflow.decorators import dag, task


@dag(dag_id="tutorial_taskflow_api", start_date=pendulum.datetime(2023, 1, 1), schedule=None)
def tutorial_taskflow_api():
    @task
    def get_items(limit):
        data = []
        start_ind = 0
        while True:
            end_ind = min(start_ind + limit, 95)  # 95 records in the API
            items = range(start_ind, end_ind) if start_ind <= 90 else None  # a fake end of data
            if items is None:
                break
            data.extend(items)
            start_ind = end_ind
        return data

    @task
    def process_item(item):
        print(f"Processing item {item}")

    process_item.expand(item=get_items(limit=10))


tutorial_taskflow_api()
But if you want to process the data in batches, the best option is mapped task groups; unfortunately, nested mapped tasks are not supported yet, so you need to process the items of each batch in a loop:
import pendulum
from airflow.decorators import dag, task, task_group


@dag(dag_id="tutorial_taskflow_api", start_date=pendulum.datetime(2023, 1, 1), schedule=None)
def tutorial_taskflow_api():
    @task
    def get_pages(limit):
        start_ind = 0
        pages = []
        while True:
            end_ind = min(start_ind + limit, 95)  # 95 records in the API
            page = dict(start=start_ind, end=end_ind) if start_ind <= 90 else None  # a fake end of data
            if page is None:
                break
            pages.append(page)
            start_ind = end_ind
        return pages

    @task_group()
    def process_batch(start, end):
        @task
        def get_items(start, end):
            return list(range(start, end))

        @task
        def process_items(items):
            for item in items:
                print(f"Processing item {item}")

        process_items(get_items(start=start, end=end))

    process_batch.expand_kwargs(get_pages(10))


tutorial_taskflow_api()
Update:
There is the config option max_map_length, which is the maximum number of mapped tasks/task groups you can have. If your API sometimes returns spikes of data, you can increase this limit (not recommended) or calculate the limit (batch size) dynamically:
import pendulum
from airflow.decorators import dag, task, task_group


@dag(dag_id="tutorial_taskflow_api", start_date=pendulum.datetime(2023, 1, 1), schedule=None)
def tutorial_taskflow_api():
    @task
    def get_limit():
        import math
        max_map_length = 1024
        elements_count = 9999  # get from the API
        preferred_batch_size = 10
        return max(preferred_batch_size, math.ceil(elements_count / max_map_length))

    @task
    def get_pages(limit):
        start_ind = 0
        pages = []
        while True:
            end_ind = min(start_ind + limit, 95)  # 95 records in the API
            page = dict(start=start_ind, end=end_ind) if start_ind <= 90 else None  # a fake end of data
            if page is None:
                break
            pages.append(page)
            start_ind = end_ind
        return pages

    @task_group()
    def process_batch(start, end):
        @task
        def get_items(start, end):
            return list(range(start, end))

        @task
        def process_items(items):
            for item in items:
                print(f"Processing item {item}")

        process_items(get_items(start=start, end=end))

    process_batch.expand_kwargs(get_pages(get_limit()))


tutorial_taskflow_api()
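For reference, if you do decide to raise the cap instead of computing the batch size, max_map_length lives in the [core] section of airflow.cfg (or the matching environment variable); its default is 1024, and the value below is only an example:
[core]
max_map_length = 2048

# or, as an environment variable:
# AIRFLOW__CORE__MAX_MAP_LENGTH=2048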
I have posted a discussion question about this here as well https://github.com/apache/airflow/discussions/19868
Is it possible to pass arguments to a custom XCom backend? If I could force a task to return data (a pyarrow table/dataset or a pandas DataFrame) that gets saved as a file in the correct container with a "predictable file location" path, that would be amazing. A lot of my custom operator code deals with creating the blob_path, saving the blob, and pushing a list of the blob_paths to XCom.
Since I work with many clients, I would prefer to have the data for Client A inside the client-a container, which uses a different SAS.
When I save a file I consider that a "stage" of the data, so I would prefer to keep it; ideally I could provide a blob_path which matches the folder structure I generally use.
import io
from typing import Any, Optional

import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds

from airflow.models.xcom import BaseXCom

# AzureBlobHook and guid() come from my own helper code.


class WasbXComBackend(BaseXCom):
    def __init__(
        self,
        container: str = "airflow-xcom-backend",
        path: str = guid(),
        partition_columns: Optional[list[str]] = None,
        existing_data_behavior: Optional[str] = None,
    ) -> None:
        super().__init__()
        self.container = container
        self.path = path
        self.partition_columns = partition_columns
        self.existing_data_behavior = existing_data_behavior

    @staticmethod
    def serialize_value(self, value: Any):
        if isinstance(value, pd.DataFrame):
            hook = AzureBlobHook(wasb_conn_id="azure_blob")
            with io.StringIO() as buf:
                value.to_csv(path_or_buf=buf, index=False)
                hook.load_string(
                    container_name=self.container,
                    blob_name=f"{self.path}.csv",
                    string_data=buf.getvalue(),
                )
            value = f"{self.container}/{self.path}.csv"
        elif isinstance(value, pa.Table):
            hook = AzureBlobHook(wasb_conn_id="azure_blob")
            write_options = ds.ParquetFileFormat().make_write_options(
                version="2.6", use_dictionary=True, compression="snappy"
            )
            written_files = []
            ds.write_dataset(
                data=value,
                schema=value.schema,
                base_dir=f"{self.container}/{self.path}",
                format="parquet",
                partitioning=self.partition_columns,
                partitioning_flavor="hive",
                existing_data_behavior=self.existing_data_behavior,
                basename_template=f"{self.task_id}-{self.ts_nodash}-{{i}}.parquet",
                filesystem=hook.create_filesystem(),
                file_options=write_options,
                file_visitor=lambda x: written_files.append(x.path),
                use_threads=True,
                max_partitions=2_000,
            )
            value = written_files
        return BaseXCom.serialize_value(value)

    @staticmethod
    def deserialize_value(self, result) -> Any:
        result = BaseXCom.deserialize_value(result)
        if isinstance(result, str) and result.endswith(".csv"):
            hook = AzureBlobHook(wasb_conn_id="azure_blob")
            with io.BytesIO() as input_io:
                hook.get_stream(
                    container_name=self.container,
                    blob_name=str(self.path),
                    input_stream=input_io,
                )
                input_io.seek(0)
                return pd.read_csv(input_io)
        elif isinstance(result, list) and ".parquet" in result:
            hook = AzureBlobHook(wasb_conn_id="azure_blob")
            return ds.dataset(
                source=result, partitioning="hive", filesystem=hook.create_filesystem()
            )
        return result
It's not clear exactly what information you want to be able to retrieve to use as part of your "predictable file location", but there is a PR to pass basic things like dag_id, task_id, etc. on to serialize_value so that you can use them when naming your stored objects.
Until that is merged, you'll have to override BaseXCom.set.
You need to override BaseXCom.set. Here is a working, in-production example:
from airflow.models.xcom import BaseXCom
from airflow.utils.session import provide_session


class MyXComBackend(BaseXCom):
    @classmethod
    @provide_session
    def set(cls, key, value, execution_date, task_id, dag_id, session=None):
        session.expunge_all()
        # logic to use this custom_xcom_backend only with the necessary dag and task
        if cls.is_task_to_custom_xcom(dag_id, task_id):
            value = cls.custom_backend_saving_fn(value, dag_id, execution_date, task_id)
        else:
            value = BaseXCom.serialize_value(value)
        # remove any duplicate XComs
        session.query(cls).filter(
            cls.key == key,
            cls.execution_date == execution_date,
            cls.task_id == task_id,
            cls.dag_id == dag_id,
        ).delete()
        session.commit()
        # insert new XCom
        from airflow.models.xcom import XCom  # noqa

        session.add(XCom(key=key, value=value, execution_date=execution_date, task_id=task_id, dag_id=dag_id))
        session.commit()

    @staticmethod
    def is_task_to_custom_xcom(dag_id: str, task_id: str) -> bool:
        return True  # custom your logic here if necessary
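To make Airflow actually pick up a backend like this, point the xcom_backend option at the class; the module path below is only an assumption about where the file lives (it just has to be importable by Airflow):
[core]
xcom_backend = plugins.custom_xcom_backend.MyXComBackend

# or equivalently:
# AIRFLOW__CORE__XCOM_BACKEND=plugins.custom_xcom_backend.MyXComBackend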
def set_timer(update: Update, context: CallbackContext) -> None:
    """Add a job to the queue."""
    chat_id = update.message.chat_id
    try:
        # args[0] should contain the time for the timer in seconds
        due = int(context.args[0])
        if due < 0:
            update.message.reply_text('Sorry we can not go back to future!')
            return

        job_removed = remove_job_if_exists(str(chat_id), context)
        context.job_queue.run_once(alarm, due, context=chat_id, name=str(chat_id))

        text = 'Timer successfully set!'
        if job_removed:
            text += ' Old one was removed.'
        update.message.reply_text(text)

    except (IndexError, ValueError):
        update.message.reply_text('Usage: /set <seconds>')
How do I change this so that the job queue runs the job repeatedly (run_repeating) instead of just once?
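A minimal sketch of what that could look like with python-telegram-bot v13's JobQueue.run_repeating, reusing the alarm callback and remove_job_if_exists from the snippet above (the /set_repeating command name and interval handling are assumptions):
def set_repeating_timer(update: Update, context: CallbackContext) -> None:
    """Schedule `alarm` to run repeatedly instead of once."""
    chat_id = update.message.chat_id
    try:
        # args[0] should contain the repeat interval in seconds
        interval = int(context.args[0])
        if interval <= 0:
            update.message.reply_text('Please give a positive interval in seconds.')
            return

        job_removed = remove_job_if_exists(str(chat_id), context)
        # run_repeating fires `alarm` every `interval` seconds until the job is removed
        context.job_queue.run_repeating(
            alarm, interval=interval, first=interval, context=chat_id, name=str(chat_id)
        )

        text = 'Repeating timer successfully set!'
        if job_removed:
            text += ' Old one was removed.'
        update.message.reply_text(text)

    except (IndexError, ValueError):
        update.message.reply_text('Usage: /set_repeating <seconds>')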
In Airflow's UI, if I hover over any of my task IDs, it'll show me the "Run", "Started", and "Ended" dates, all in a very verbose format, e.g. 2021-02-12T18:57:45.314249+00:00.
How do I change the default preferences in Airflow's UI so that it simply shows 2/12/21 6:57:45pm? (i.e. without the fractions of a second)
Additionally, how do I ensure that this time is showing in America/Chicago time as opposed to UTC? I've tried editing the "default_timezone" and the "default_ui_timezone" arguments in my airflow.cfg file to America/Chicago, but the changes don't seem to be reflected on the UI even after rebooting the webserver.
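For reference, the two settings mentioned live in different sections of airflow.cfg (default_timezone under [core], default_ui_timezone under [webserver]); the attempted change looks like this:
[core]
default_timezone = America/Chicago

[webserver]
default_ui_timezone = America/Chicago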
I have managed to achieve the result you wanted. You'll need to edit a JavaScript file in the Airflow source code to do that.
Firstly, locate your module location by launching:
python3 -m pip show apache-airflow
And look for the "Location" attribute, which is the path to where the module is contained. Open that folder, then navigate as follows:
airflow -> www -> static -> dist
Here you need to look for a file named taskInstances.somehash.js
Open it with your IDE and locate the following lines:
const defaultFormat = 'YYYY-MM-DD, HH:mm:ss';
const defaultFormatWithTZ = 'YYYY-MM-DD, HH:mm:ss z';
const defaultTZFormat = 'z (Z)';
const dateTimeAttrFormat = 'YYYY-MM-DDThh:mm:ssTZD';
You can hereby change the format as you please, such as:
const defaultFormat = 'DD/MM/YY hh:mm:ss';
const defaultFormatWithTZ = 'DD/MM/YY hh:mm:ss z';
const defaultTZFormat = 'z (Z)';
const dateTimeAttrFormat = 'DD/MM/YY hh:mm:ss';
Now jump to the makeDateTimeHTML function and modify as follows:
function makeDateTimeHTML(start, end) {
  // check task ended or not
  const isEnded = end && end instanceof moment && end.isValid();
  return `Started: ${start.format('DD/MM/YY hh:mm:ss')}<br>Ended: ${isEnded ? end.format('DD/MM/YY hh:mm:ss') : 'Not ended yet'}<br>`;
}
Lastly, locate this statement:
if (ti.start_date instanceof moment) {
  tt += `Started: ${Object(_main__WEBPACK_IMPORTED_MODULE_0__["escapeHtml"])(ti.start_date.toISOString())}<br>`;
} else {
  tt += `Started: ${Object(_main__WEBPACK_IMPORTED_MODULE_0__["escapeHtml"])(ti.start_date)}<br>`;
}
// Calculate duration on the fly if task instance is still running
And change to:
if (ti.start_date instanceof moment) {
  tt += `Started: ${Object(_main__WEBPACK_IMPORTED_MODULE_0__["escapeHtml"])(ti.start_date)}<br>`;
} else {
  tt += `Started: ${Object(_main__WEBPACK_IMPORTED_MODULE_0__["escapeHtml"])(ti.start_date)}<br>`;
}
// Calculate duration on the fly if task instance is still running
It took me a while to figure out, so hopefully this is what you were looking for.
I have created a database in SQLite and improved the little program to handle it (list, add, and remove records). At this point I am trying to list the contents of the database using the prepared statement's step() function. However, I can't iterate over the rows and columns of the database.
I suspect that the reason is that I am not handling the statement appropriately in this line:
stmt:Sqlite.Statement = null
If that is the case, how do I pass the statement from the main (init) function to the child functions?
This is the entire code so far:
// Trying to do a cookbook program
// raw_input for Genie included, compile with valac --pkg sqlite3 cookbook.gs
[indent=4]

uses Sqlite

def raw_input (query:string = ""):string
    stdout.printf ("%s", query)
    return stdin.read_line ()

init
    db : Sqlite.Database? = null
    if (Sqlite.Database.open ("cookbook.db3", out db) != Sqlite.OK)
        stderr.printf ("Error: %d: %s \n", db.errcode (), db.errmsg ())
        Process.exit (-1)

    loop:bool = true
    while loop = true
        print "==================================================="
        print "                 RECIPE DATABASE"
        print " 1 - Show All Recipes"
        print " 2 - Search for a recipe"
        print " 3 - Show a Recipe"
        print " 4 - Delete a recipe"
        print " 5 - Add a recipe"
        print " 6 - Print a recipe"
        print " 0 - Exit"
        print "==================================================="

        response:string = raw_input("Enter a selection -> ")

        if response == "1" // Show All Recipes
            PrintAllRecipes()
        else if response is "2" // Search for a recipe
            pass
        else if response is "3" //Show a Recipe
            pass
        else if response is "4" //Delete a recipe
            pass
        else if response is "5" //Add a recipe
            pass
        else if response is "6" //Print a recipe
            pass
        else if response is "0" //Exit
            print "Goodbye"
            Process.exit (-1)
        else
            print "Unrecognized command. Try again."

def PrintAllRecipes ()
    print "%-5s%-30s%-20s%-30s", "Item", "Name", "Serves", "Source"
    print "--------------------------------------------------------------------------------------"
    stmt:Sqlite.Statement = null
    param_position:int = stmt.bind_parameter_index ("$UID")
    //assert (param_position > 0)
    stmt.bind_int (param_position, 1)
    cols:int = stmt.column_count ()
    while stmt.step () == Sqlite.ROW
        for i:int = 0 to cols
            i++
            col_name:string = stmt.column_name (i)
            val:string = stmt.column_text (i)
            type_id:int = stmt.column_type (i)
            stdout.printf ("column: %s\n", col_name)
            stdout.printf ("value: %s\n", val)
            stdout.printf ("type: %d\n", type_id)
    /* while stmt.step () == Sqlite.ROW
        col_item:string = stmt.column_name (1)
        col_name:string = stmt.column_name (2)
        col_serves:string = stmt.column_name (3)
        col_source:string = stmt.column_name (4)
        print "%-5s%-30s%-20s%-30s", col_item, col_name, col_serves, col_source */
Extra questions:
Should function definitions come before or after init? I noticed that they wouldn't get called when I left all of them after init, but after moving raw_input to the beginning the error disappeared.
I was also trying to define PrintAllRecipes() within a class, for didactic reasons, but I ended up making it "invisible" to the main routine.
Many thanks,
Yes, you need to assign a prepared statement, not null, to stmt. For example:
// Trying to do a cookbook program
// raw_input for Genie included, compile with
// valac --pkg sqlite3 --pkg gee-0.8 cookbook.gs
[indent=4]

uses Sqlite

init
    db:Database
    if (Database.open ("cookbook.db3", out db) != OK)
        stderr.printf ("Error: %d: %s \n", db.errcode (), db.errmsg ())
        Process.exit (-1)

    while true
        response:string = UserInterface.get_input_from_menu()
        if response is "1" // Show All Recipes
            PrintAllRecipes( db )
        else if response is "2" // Search for a recipe
            pass
        else if response is "3" //Show a Recipe
            pass
        else if response is "4" //Delete a recipe
            pass
        else if response is "5" //Add a recipe
            pass
        else if response is "6" //Print a recipe
            pass
        else if response is "0" //Exit
            print "Goodbye"
            break
        else
            print "Unrecognized command. Try again."

namespace UserInterface

    def get_input_from_menu():string
        show_menu()
        return raw_input("Enter a selection -> ")

    def raw_input (query:string = ""):string
        stdout.printf ("%s", query)
        return stdin.read_line ()

    def show_menu()
        print """===================================================
                 RECIPE DATABASE
 1 - Show All Recipes
 2 - Search for a recipe
 3 - Show a Recipe
 4 - Delete a recipe
 5 - Add a recipe
 6 - Print a recipe
 0 - Exit
==================================================="""

namespace PreparedStatements

    def select_all( db:Database ):Statement
        statement:Statement
        db.prepare_v2( """
            select name, servings as serves, source from Recipes
            """, -1, out statement )
        return statement

def PrintAllRecipes ( db:Database )
    print "%-5s%-30s%-20s%-30s", "Item", "Name", "Serves", "Source"
    print "--------------------------------------------------------------------------------------"
    stmt:Statement = PreparedStatements.select_all( db )
    cols:int = stmt.column_count ()
    var row = new dict of string, string
    item:int = 1
    while stmt.step() == ROW
        for i:int = 0 to (cols - 1)
            row[ stmt.column_name( i ) ] = stmt.column_text( i )
        stdout.printf( "%-5s", item.to_string( "%03i" ))
        stdout.printf( "%-30s", row[ "name" ])
        stdout.printf( "%-20s", row[ "serves" ])
        stdout.printf( "%-30s\n", row[ "source" ])
        item++
A few pointers
Generally you want to avoid assigning null. null is the absence of a value. For example, a boolean can only be true or false and nothing else, but a variable that can also hold no value at all makes things more complicated:
a:bool? = null
if a == null
    print "I'm a boolean variable, but I am neither true nor false???"
If you want to declare a variable in Genie before assigning a value, for example when calling a function with an out parameter, don't assign anything. I have changed db:Database to show this.
Process.exit( -1 ) should probably be used sparingly, and really only for error conditions that you want to signal to a calling command-line script. I don't think a user-selected exit from the program is such an error condition, so I have changed Process.exit( -1 ) to break there.
It doesn't matter whether functions are defined before or after init. I prefer to put them after, so that the first code to be called, i.e. init, is at the top and easy to read.
A class is a data type and yes, it can have functions, but usually you define some data in the class and write the functions to act on that data. A function in a class is often called a 'method', and in the past, with object-oriented programming, classes were sometimes defined just to group methods together; methods with no data to act on are defined as 'static' methods. The modern practice is to use static methods mainly for more complex object constructors (look up 'factory' methods and creational design patterns). To group functions, and other syntax, we use namespaces instead, and I have used a couple of namespaces in the example. Usually a namespace is given its own file or files; if you are thinking of splitting your Genie project into more source files, take a look at https://wiki.gnome.org/Projects/Genie#A_Simple_Build_Script
A primary key should be internal to the database and would not be presented to a user; only a database administrator is interested in such things. So I have changed 'item' in the output to be a count of the number of entries displayed.
Genie and Vala bind the SQLite C interface. If you need more details on a particular function, take a look at the C-Language Interface Specification for SQLite.