I have a sensor that waits for a file to appear in an external file system.
The sensor uses mode="reschedule".
I would like to trigger a specific behavior after X failed attempts.
Is there any straightforward way to know how many times the sensor has already attempted to run the poke method?
My quick fix so far has been to push an XCom with the attempt number, and increase it every time the poke method returns False. Is there any built-in mechanism for this?
Thank you
I had a similar problem with sensor mode="reschedule": I was trying to poke a different file path based on the current time, without directly referencing pendulum.now or datetime.now.
I used task_reschedules (as the base sensor operator does to get the try number in reschedule mode; see https://airflow.apache.org/docs/apache-airflow/stable/_modules/airflow/sensors/base.html#BaseSensorOperator.execute):
from airflow.models import TaskReschedule

def execute(self, context):
    # count previous reschedules to derive the current poke number
    task_reschedules = TaskReschedule.find_for_task_instance(context['ti'])
    self.poke_number = len(task_reschedules) + 1
    super().execute(context)
then self.poke_number can be used within poke(), and current time is approximately execution_date + (poke_number * poke_interval).
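For example, a minimal poke() using it might look like this (the attempt threshold and the file check are hypothetical placeholders):

def poke(self, context):
    # self.poke_number was set in execute() above
    if self.poke_number > 5:  # hypothetical threshold for "X failed attempts"
        self.log.info("Poke attempt %s: running the special behavior", self.poke_number)
    return self._file_exists(context)  # hypothetical existence check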
Apparently, the XCom approach doesn't work, because XComs pushed from a poke don't seem to be available on subsequent pokes; they always come back empty.
try_number on the task instance doesn't help either, since pokes don't increment the try number.
I ended up computing the attempt number by hand:
attempt_no = math.ceil((pendulum.now(tz='utc') - kwargs['ti'].start_date).total_seconds() / kwargs['task'].poke_interval)
The code will work fine as long as individual executions of the poke method don't last longer than the poke interval (which they shouldn't).
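Inside a custom sensor's poke(), the same computation would look roughly like this (a sketch; the threshold and the actual existence check are hypothetical):

import math
import pendulum

def poke(self, context):
    # approximate attempt counter; assumes each poke finishes within poke_interval
    elapsed = (pendulum.now(tz='utc') - context['ti'].start_date).total_seconds()
    attempt_no = math.ceil(elapsed / context['task'].poke_interval)
    if attempt_no >= 5:  # hypothetical threshold
        ...  # trigger the special behavior here
    return self._file_exists(context)  # hypothetical existence check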
Best
This is my code:
write_api = client.write_api(write_options=ASYNCHRONOUS)
write_api.write(bucket, org, data, write_precision=WritePrecision.US)
1 - How can I detect write errors?
2 - Should I initialize write_api each time I want to write, or can I initialize it once and reuse the same object?
callback = write_api.write(bucket, org, data, write_precision=WritePrecision.US)
callback.wait()
callback.get()
This is the only way I found. Unfortunately, the wait basically makes the write synchronous, which decreases performance.
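For error detection specifically, one option is to catch the exception raised when the asynchronous result is resolved. A rough sketch, assuming the influxdb-client Python library (exception details may vary by version):

from influxdb_client.rest import ApiException

result = write_api.write(bucket, org, data, write_precision=WritePrecision.US)
try:
    result.get()  # resolves the asynchronous write; raises if the write failed
except ApiException as exc:
    print("Write failed: HTTP %s - %s" % (exc.status, exc.reason))

This still blocks on get(), so it only helps if you resolve the results in a batch (e.g. at the end of a run) rather than right after every write.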
I have a use case with two BigQueryOperator tasks that write to the same destination table, but I need one to run daily and the second to run manually, only when I need it.
Below is an illustration of the Tree View:
| task_3rd_adhoc
| task_3rd
|---- task_2nd
|---- task_1st_a
|---- task_1st_b
In the example above, the DAG runs daily, and I want the tasks to behave as follows:
task_1st_a and task_1st_b run first. Target tables are:
project.dataset.table_1st_a with _PARTITIONTIME = execution date, and
project.dataset.table_1st_b with _PARTITIONTIME = execution date.
then task_2nd runs after task_1st_a and task_1st_b finish. The BigQueryOperator uses TriggerRule.ALL_SUCCESS. Target table is:
project.dataset.table_2nd with _PARTITIONTIME = execution date.
then task_3rd runs after task_2nd succeeds. The BigQueryOperator uses TriggerRule.ALL_SUCCESS. Target table is:
project.dataset.table_3rd with _PARTITIONTIME = execution date minus 2 days (D-2).
task_3rd_adhoc does not run in the daily job. I only need it when I want to backfill table project.dataset.table_3rd. Target table:
project.dataset.table_3rd with _PARTITIONTIME = execution_date
But I still can't find the correct TriggerRule for step #4 above. I tried TriggerRule.DUMMY because I thought it could be used to set no trigger, but task_3rd_adhoc also ran in the daily job when I created the DAG above.
(based on this doc: "dependencies are just for show, trigger at will")
First of all, you've misunderstood TriggerRule.DUMMY.
Usually, when you wire tasks together with task_a >> task_b, B runs only after A is complete (success / failure, depending on B's trigger_rule).
TriggerRule.DUMMY means that even after wiring tasks A and B together as before, B will run independently of A ("trigger at will"). It doesn't mean run at your will; rather, it runs at Airflow's will (it will be triggered whenever the scheduler feels like it). So tasks with the dummy trigger rule will pretty much ALWAYS run, albeit at an unpredictable time.
What you need here (to keep a particular task in the DAG at all times but run it only when manually requested) is a combination of
- AirflowSkipException
- Variable
Here's roughly how you can do it:
A Variable should hold the switch for this task (whether or not it should run). You can, of course, edit this Variable anytime from the UI (thereby controlling whether or not that task runs in the next DagRun).
In the operator's code (the execute() method for a custom operator, or the python_callable for a PythonOperator), check the value of the Variable to see whether the task is supposed to run.
Based on the Variable value, if the task is NOT supposed to run, raise an AirflowSkipException so that the task is marked as skipped; otherwise it just runs as usual.
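A rough sketch of that check (the Variable name and the callable are hypothetical; provide_context is needed on Airflow 1.x):

from airflow.exceptions import AirflowSkipException
from airflow.models import Variable
from airflow.operators.python_operator import PythonOperator

def run_adhoc_if_requested(**context):
    # flip this Variable to "true" in the UI when task_3rd_adhoc should run
    if Variable.get("run_task_3rd_adhoc", default_var="false").lower() != "true":
        raise AirflowSkipException("Ad-hoc run not requested; skipping.")
    # ... the actual work (e.g. submitting the BigQuery job) goes here ...

task_3rd_adhoc = PythonOperator(
    task_id="task_3rd_adhoc",
    python_callable=run_adhoc_if_requested,
    provide_context=True,
    dag=dag)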
I have an Airflow DAG that works perfectly when files are present, but errors out and fails when the source files are not there.
Randomly, I receive files from a given source, which my DAG picks up and processes. While I need to run the DAG daily, files are not necessarily there daily. It could be Monday, Wednesday, or even Sunday evening.
I'm not worried about days with no new files; I worry about days when new files come and it breaks.
How do I tell the DAG to exit gracefully with success when no files exist?
My DAG is below (please ignore the schedule setting; I'm still in development mode):
import airflow
from airflow import models
from airflow.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator
from airflow.operators.gcs_to_gcs import GoogleCloudStorageToGoogleCloudStorageOperator
args = {
    'owner': 'Airflow',
    'start_date': airflow.utils.dates.days_ago(2),
    'email': ['email#gmail.com'],
    'email_on_failure': True,
    'schedule_interval': 'None',
}

dag = models.DAG(
    dag_id='Source1_Ingestion',
    default_args=args
)
# [START load ATTOM File to STAGING]
load_File_to_Source1_RAW = GoogleCloudStorageToBigQueryOperator(
    task_id='Source1_GCS_to_GBQ_Raw',
    bucket='Source1_files',
    source_objects=['To_Process/*.txt'],
    destination_project_dataset_table='Source1.Source1_RAW',
    schema_fields=[
        {'name': 'datarow', 'type': 'STRING', 'mode': 'NULLABLE'},
    ],
    field_delimiter='§',
    write_disposition='WRITE_TRUNCATE',
    google_cloud_storage_conn_id='GCP_EDW_Staging',
    bigquery_conn_id='GCP_EDW_Staging',
    dag=dag)
# [END howto_operator_gcs_to_bq]
# [START move files to Archive]
archive_Source1_files = GoogleCloudStorageToGoogleCloudStorageOperator(
    task_id='Archive_Source1_Files',
    source_bucket='Source1_files',
    source_object='To_Process/*.txt',
    destination_bucket='Source1_files',
    destination_object='Archive/',
    move_object=True,
    google_cloud_storage_conn_id='GCP_EDW_Staging',
    dag=dag
)
# [END move files to archive]
load_File_to_Source1_RAW.set_downstream(archive_Source1_files)
One way to approach this would be to add a Sensor Operator to the workflow.
Nehil Jain describes sensors nicely:
Sensors are a special kind of airflow operator that will keep running until a certain criterion is met. For example, you know a file will arrive at your S3 bucket during certain time period, but the exact time when the file arrives is inconsistent.
For your use case, it looks like there's a Google Cloud Sensor, which "checks for the existence of a file in Google Cloud Storage." The reason you'd incorporate a sensor is that you're decoupling the operation "determine if a file exists" from the operation "get the file (and do something with it)".
By default, sensors have two methods (source):
poke: the code to run at poke_interval times, which tests to see if the condition is true
execute: use the poke method to test for a condition on a schedule defined by the poke_interval; fails out when the timeout argument is reached
In a common file-detection sensor, the operator receives instructions to check a source for a file on a schedule (e.g. check every 5 minutes for up to 3 hours to see if the file exists). If the sensor succeeds in meeting its test condition, it succeeds and allows the DAG to continue downstream to the next operator(s). If it fails to find the file, it times out and the sensor operator is marked failed.
With just a sensor operator, you've already succeeded in separating the error cases - the DAG fails at the GoogleCloudStorageObjectSensor instead of the GoogleCloudStorageToBigQueryOperator when the file doesn't exist, and fails at the GoogleCloudStorageToBigQueryOperator when something is wrong with the transfer logic. Importantly for your use case, Airflow supports a soft_fail argument, which "mark[s] the task as SKIPPED on failure".
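For example, a wait-then-load chain could look roughly like this (a sketch; the object name is a placeholder, and parameter names may differ between Airflow versions):

from airflow.contrib.sensors.gcs_sensor import GoogleCloudStorageObjectSensor

wait_for_source1_file = GoogleCloudStorageObjectSensor(
    task_id='Wait_For_Source1_File',
    bucket='Source1_files',
    object='To_Process/expected_file.txt',   # exact object name; no wildcard here
    google_cloud_conn_id='GCP_EDW_Staging',  # connection parameter name may vary by version
    poke_interval=300,                       # check every 5 minutes
    timeout=60 * 60 * 3,                     # give up after 3 hours
    soft_fail=True,                          # mark as SKIPPED instead of FAILED on timeout
    dag=dag)

wait_for_source1_file.set_downstream(load_File_to_Source1_RAW)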
I'll caveat this next part by explicitly stating that I'm not intimately familiar with the GoogleCloudStorage operators. If the operator doesn't allow wildcarding in the sensor, you may need to rewire the sensor's poke method to allow for more complex, pattern-based file detection. This is where Airflow's plug-in architecture can really shine, allowing you to modify and extend existing operators to meet your exact needs.
The example I'll give you here is that the SFTPSensor only supports poking for a specific file out of the box. I needed wildcard based poking, so I wrote a plugin that modifies the SFTPSensor to support regular expressions in file identification. In my case, it was just modifying the poke to switch from polling for the existence of a single file to polling a list of files and then passing it through a regular expression to filter the list.
At a cursory glance, it looks like the way that the GoogleCloudStorageSensor pokes for an object is with the hook.exists method. I can't speak to whether a wildcard would work there, but if it doesn't, it looks like there's a hook.list method which would allow you to implement a similar workflow to what I did for the SFTPRegexSensor.
I've included some of the source code for the SFTPRegexSensor Plugin's poke method, modified for how I think it'd work with GCS in case it's helpful:
def poke(self, context):
    # create a hook (removed some of the SSH/SFTP intricacies for simplicity)
    # get the list of file(s) matching the regex; you need to define operator
    # parameters for the values that are dynamic in the poke (e.g. which bucket,
    # what the file prefix is) - the GCS arguments are swapped in here
    files = hook.list(self.bucket, prefix=self.prefix)
    regex = re.compile(self.remote_filename)
    files = list(filter(regex.search, files))
    if not files:
        return False
    return True
I have a DAG without a schedule (it is run manually as needed). It has many tasks. Sometimes I want to 'skip' some initial tasks by changing their state to SUCCESS manually. Changing the task state of a manually executed DAG fails, seemingly because of a bug in parsing the execution_date.
Is there another way to individually set task states for a manually executed DAG?
Example run below. The execution date of the task is 01-13T17:27:13.130427, and I believe the fractional seconds are not being parsed correctly.
Traceback
Traceback (most recent call last):
File "/opt/conda/envs/jumpman_prod/lib/python3.6/site-packages/airflow/www/views.py", line 2372, in set_task_instance_state
execution_date = datetime.strptime(execution_date, '%Y-%m-%d %H:%M:%S')
File "/opt/conda/envs/jumpman_prod/lib/python3.6/_strptime.py", line 565, in _strptime_datetime
tt, fraction = _strptime(data_string, format)
File "/opt/conda/envs/jumpman_prod/lib/python3.6/_strptime.py", line 365, in _strptime
data_string[found.end():])
ValueError: unconverted data remains: ..130427
It doesn't work from the Task Instances page, but you can do it from another page:
- open the DAG graph view
- select the needed Run (screen 1) and click Go
- select the needed task
- in the popup window click Mark success (screen 2)
- then confirm.
PS: this relates to Airflow version 1.9.
Screen 1
Screen 2
What you may want to do to accomplish this is use branching, which, as the name suggests, allows you to follow different execution paths according to some condition, just like an if in any programming language.
You can use the BranchPythonOperator (documented here) to attain this goal: the idea is that this operator is configured by a python_callable, a function that outputs the task_id to execute next (which should, of course, be a task which is directly downstream from the BranchPythonOperator itself).
Using branching will set the skipped tasks to the proper state automatically, as mentioned in the documentation:
All other “branches” or directly downstream tasks are marked with a state of skipped so that these paths can’t move forward. The skipped states are propagated downstream to allow for the DAG state to fill up and the DAG run’s state to be inferred.
The resulting DAG would look something like the branching example figure in the Airflow documentation (source: apache.org).
Branching is documented here, on the official Apache Airflow documentation.
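A minimal sketch of the idea (the task ids and the skip condition are hypothetical; provide_context is needed on Airflow 1.x):

from airflow.operators.python_operator import BranchPythonOperator
from airflow.operators.dummy_operator import DummyOperator

def choose_path(**context):
    # hypothetical condition: a flag passed via `airflow trigger_dag -c '{"skip_initial": true}'`
    conf = context['dag_run'].conf or {}
    return 'later_task' if conf.get('skip_initial') else 'initial_task'

branch = BranchPythonOperator(
    task_id='branch',
    python_callable=choose_path,
    provide_context=True,
    dag=dag)

initial_task = DummyOperator(task_id='initial_task', dag=dag)
later_task = DummyOperator(task_id='later_task', dag=dag)

branch >> [initial_task, later_task]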
Please give me some advice on this:
I want to get the time when a signal is sent from a mote (I was thinking of generating an interrupt when the SFD pin goes from 1 to 0). I didn't find a solution for that, but I found this component:
Component: tos.chips.cc2420_tkn154.CC2420TransmitP
which provides cc2420Tx, which seems to give me the time I need. But I can't manage to use it, as by default the build uses the component from the cc2420 folder and not the one from the cc2420_tkn154 folder.
The main idea is that I'd like to measure the time from sending the signal to receiving the ACK. I need microsecond precision. All of this would help me get the distance between two motes.
Any idea would be helpful. I've searched all over: forums, TinyOS documentation, examples...
Thank you :)
I do not know how low-level you want to get, but if you have a timer, in nesC you can get the local time every time the timer fires:
uint32_t timestamp;

event void myTimer.fired() {
    // read the mote's local time when the timer fires
    timestamp = call myTimer.getNow();
    printf("Timestamp: %lu\n", timestamp);
}
If you do not have a timer, you can use the component LocalTimeMilliC.
Add this to your configuration file:
components LocalTimeMilliC;
TestC.LocalTime -> LocalTimeMilliC;
...and in the module section of the implementation:
uses interface LocalTime<TMilli>;
...and in the code:
timestamp = call LocalTime.get();
However, the local time of each mote restarts when you reset the mote, so you would have to synchronize the different clocks. If you want to calculate the distance between motes, this may not be the best way. To cite from the abstract of this paper:
Location of the deployed sensor nodes can be found either by TOA, TDOA or Received Signal Strength (RSS) measurements.
For RSSI, there is a demo in the folder tinyos-2.1.1/apps/tutorials.