I have been evaluating Airflow. I have a use case with a workflow that runs every hour to compute hourly aggregates of the data, and another that runs every day to compute daily aggregates of the same data. Is it possible to create a combined workflow where the daily aggregate runs only if all the hourly aggregates in the past day have succeeded? I have seen that you can create sub-DAGs, but can the two DAGs run at different frequencies? If yes, how?
There isn't a straightforward way of doing this, but there are a few ways you could use Airflow's extensive suite of operators to build such a DAG.
For example, you could make the hourly DAG depend_on_past and then use a branch operator so that the last hourly run of the day triggers the daily aggregation task/DAG. Check out the BranchPythonOperator and the TriggerDagRunOperator.
You could also create your own sensor for the daily aggregator to make sure that all hourly dags for that day have succeeded. Check out ExternalTaskSensor for reference.
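For the sensor route, the core check is easy to state without Airflow at all. Below is a rough Python sketch of what a custom sensor's poke() might verify; `hourly_runs_for_day` and `get_state` are hypothetical stand-ins for a lookup against Airflow's metadata database, not real Airflow APIs:

```python
from datetime import datetime, timedelta

def hourly_runs_for_day(day):
    """The 24 hourly execution dates that cover one calendar day."""
    start = datetime(day.year, day.month, day.day)
    return [start + timedelta(hours=h) for h in range(24)]

def all_hourly_succeeded(day, get_state):
    """What a custom sensor's poke() would check: every hourly run of
    `day` reported 'success'. `get_state` stands in for a metadata lookup."""
    return all(get_state(dt) == "success" for dt in hourly_runs_for_day(day))
```

A real sensor would return this value from poke() and let Airflow reschedule the check until it turns true.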
It might be ugly, but using the PythonOperator there is a pretty straightforward way of doing it "behind the scenes":
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG('hourly_daily_update_v0',
          schedule_interval='@hourly')

hourly_update = PythonOperator(task_id='update_hourly_v0',
                               python_callable=update_hourly,
                               provide_context=True,
                               dag=dag)

daily_update = PythonOperator(task_id='update_daily_v0',
                              python_callable=update_daily,
                              provide_context=True,
                              dag=dag)
So you schedule both the hourly and the daily work the Airflow way. However, inside update_daily() you can check the hour:
def update_daily(**context):
    if context['execution_date'].hour == 0:
        pass  # Do all the things!
    else:
        pass  # Do none of the things!
Airflow will successfully run update_daily() 24 times a day, but in reality it only does the work once, at hour 0. You can extend this however you please. The only problem is the slight step outside Airflow's assumed pattern: the task instances for hours 1 through 23 will be reported as successes even though they did nothing.
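Stripped of the Airflow context, the gating logic is just a check on the schedule hour. A minimal runnable sketch (the return values are illustrative placeholders, not part of the original callable):

```python
from datetime import datetime

def update_daily(execution_date):
    """Do the daily aggregation only for the hour-0 run; no-op otherwise."""
    if execution_date.hour == 0:
        return "aggregated"  # placeholder for the real daily work
    return "skipped"

print(update_daily(datetime(2016, 1, 1, 0)))   # the hour-0 run does the work
print(update_daily(datetime(2016, 1, 1, 13)))  # the other 23 runs no-op
```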
Related
I'm downloading some data from a web server that limits the number of queries to 100 per hour.
Do you know an effective way to insert a time lag between lines of an R script, so that the script runs automatically and gathers the data after (approx.) 10 hours?
Many thanks in advance!
I need to create an empty loop that runs for a given time, for example 2 hours. The loop does nothing useful; what matters is that it keeps R busy for exactly 2 hours.
For example, take a script like:
model=lm(Sepal.Length~Sepal.Width,data=iris)
After this line there should be an empty loop that runs for exactly 2 hours:
for i....
After the empty loop has finished its 2 hours, execution continues with the subsequent lines:
summary(model)
predict(model,iris)
(The exact position does not matter; what is important is that at a certain place in the code the loop blocks for 2 hours.)
How it can be done?
Thanks for your help.
There is no need to do this using a loop.
You can simply suspend all execution for n seconds using Sys.sleep(n). So to suspend for 2 hours you can use Sys.sleep(2*60*60).
I have a "succeeded" metric that is just the timestamp. I want to see the time between successive successes (this is how long the data is stale for). I have
derivative(Success)
but I also want to know how long it has been between the last success time and the current time. Since derivative transforms xs[n] to xs[n+1] - xs[n], the "last" delta doesn't exist. How can I do this? Something like:
derivative(append(Success, now()))
I don't see any graphite functions for appending series, and I don't see any user-defined graphite functions.
The general problem is to be alerted when the data is stale, via graphite monitoring. There may be a better solution than the one I'm thinking about.
identity is a function whose value at any given time is the timestamp of that time.
keepLastValue is a function that takes a series and replicates data points forward over gaps in the data.
So then diffSeries(identity("now"), keepLastValue(Success)) will be a "sawtooth" series that climbs steadily while Success isn't updated, and jumps down to zero (or close to it; there might be some time skew) every time Success has a data point. If you use graphite monitoring to get the current value of that expression and compare it to some threshold, it will probably do what you want.
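A toy Python model of that expression may make the sawtooth concrete; `keep_last_value` and `staleness` are illustrative names mimicking the Graphite functions, not Graphite APIs:

```python
def keep_last_value(series):
    """Like Graphite's keepLastValue: fill None gaps with the last value."""
    last, out = None, []
    for v in series:
        if v is not None:
            last = v
        out.append(last)
    return out

def staleness(timestamps, success_series):
    """diffSeries(identity(...), keepLastValue(...)): seconds since success."""
    filled = keep_last_value(success_series)
    return [t - s for t, s in zip(timestamps, filled)]

timestamps = [0, 60, 120, 180, 240]
success = [0, None, None, 180, None]  # Success reports its own timestamp
print(staleness(timestamps, success))  # [0, 60, 120, 0, 60]: the sawtooth
```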
I want an R function that makes my loop run every 5 minutes.
I have a loop that downloads market data from Google Finance. I want this loop to run at 30-minute intervals.
Is it possible?
An alternative to making your script loop: use an external job-scheduling tool to call your script at the desired interval. If you are on Linux, I recommend checking out cron. Here's an SO response describing how to set up a cron job to kick off an R script: https://stackoverflow.com/a/10116439/819544
You can use Sys.sleep(100) to stop execution for 100 seconds. It's a little inefficient vs. running some other process in the same instance and setting up a proper timer. But it's pretty easy.
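The looping idea is the same in any language. Here is a hedged Python sketch that also compensates for the download's own runtime, so iterations stay on a fixed grid rather than drifting; `run_every` and `fetch` are hypothetical names, and the 30-minute interval is just the questioner's value:

```python
import time

def run_every(interval_s, fetch, iterations):
    """Call fetch() repeatedly, sleeping so that calls stay interval_s
    apart even when fetch() itself takes time to run."""
    results = []
    next_run = time.monotonic()
    for _ in range(iterations):
        results.append(fetch())
        next_run += interval_s
        delay = next_run - time.monotonic()
        if delay > 0:          # skip sleeping if fetch() overran its slot
            time.sleep(delay)
    return results

# e.g. run_every(30 * 60, download_market_data, 48) for a day of 30-min pulls
```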
I need to test my C++ program, which uses system time.
The program is large and it uses third party libraries which possibly also use system time.
I want to see how my program behaves for different dates / times.
Is it possible to change the system time only for one running process in UNIX?
Many thanks...
Well, I found the answer by myself.
In a Unix shell there is an environment variable TZ, which holds the timezone and is used by all C/C++ time functions.
This variable can be manipulated to shift the time in the current Unix shell by an arbitrary offset (with one-second resolution) and even change the date.
Examples:
export TZ=A-02:10:20
(shifts the clock forward by 2 hours, 10 minutes and 20 seconds).
export TZ=A+02:10:20
(the same shift backwards)
If you want to change the date, you can use a large number of hours, for example:
export TZ=A-72:10:20
Unfortunately it does not let you change the date very far; in my experiments it worked up to several days backward/forward, so it does not work for changing the month or year.
(Use the 'date' command to check the current date/time after setting the TZ variable.)
To cancel all changes use
export TZ=
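The effect is easy to verify from a script as well. A small Python sketch, assuming a Unix system (time.tzset() is Unix-only; the zone name 'XXX' is arbitrary, since POSIX expects at least three characters for it):

```python
import os
import time

os.environ['TZ'] = 'UTC0'          # baseline: epoch 0 is 1970-01-01 00:00:00
time.tzset()
print(time.localtime(0).tm_hour)   # 0

os.environ['TZ'] = 'XXX-02:10:20'  # POSIX: local clock = UTC + 2:10:20
time.tzset()
shifted = time.localtime(0)
print(shifted.tm_hour, shifted.tm_min, shifted.tm_sec)  # 2 10 20
```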
I guess you can use libfaketime.