I want to handle failovers.
But, sensor failed -> retries only sensor self.
But, I want to trigger before operator.
This is flow chart.
a -> a_sensor (failed) -> a (retry) -> a_sensor -> (done)
Can I do this?
I recommend waiting the EMR job in the operator itself, even if this keeps the task running and occupying the worker slot, but it doesn't consume much resources, and you can simply manage the timeout, cleanup and the retry strategy:
class EmrOperator(BaseOperator):
...
def execute():
run_job():
wait_job()
def wait_job():
while not (is_finished()):
sleep(10s)
def on_kill():
cleanup()
And you can use the official operator EmrAddStepsOperator which already supports this.
And if you want to implement what you have mentioned in the question, Airflow doesn't support the retry for a group of tasks yet, but you can achieve this using callbacks:
a = EmrOperator(..., retries=0)
a_sensor = EmrSensor(, retries=0, on_failure_callback=emr_a_callback)
def emr_a_callback(ti, dag_run,):
max_retries = 3
retry_num = ti.xcom_pull(dag_ids=ti.task_id, key="retry_num")
if retry > max_retries:
retrun # do nothing
task_a = dag_run.get_task_instance("<task a id>")
task_a.state = None # pass a state to None
ti.state = None # pass the sensor state to None
I'm trying to get my task dependencies correct. So far I have 8 tasks that are dependent on the previous task which is great, but then I want to have a task run before these tasks first, but I can't seem to get it correct. In the screenshot you'll see 8 tasks itemized_costs_to_s3_0... 1...2........7. but before that I want to have itemized_cost latest only like I have for the traffic line items (see screenshot). This is my code for those 2 tasks:
how I would like it to look:
l = []
for endpoint in ENDPOINTS:
latest_only = LatestOnlyOperator(
task_id=f'{endpoint.name}_latest_only',
)
if endpoint.name is 'itemized_costs':
a = []
# Load each end points data to S3
for i in range(0, 8):
s3 = PToS3Operator(
task_id=f'{endpoint.name}_to_S3_{i}',
task_no=f'{int(i)}',
pool_slots=5,
endpoint=endpoint
)
a.append(s3)
if i not in [0]:
a[i-1] >> a[i]
else:
s3 = PToS3Operator(
task_id=f'{endpoint.name}_to_S3',
pool_slots=5,
endpoint=endpoint
)
latest_only >> s3
i´m trying to figure out an error message from robotframework (Jenkins 2.289.2). It is the error:
[ ERROR ] Execution stopped by user.
I try set set up some robot test cases with python (3.92) on Linux (OpenSuse). I have a simple testprogramm in python which is called by a shell (which in turn is configured in the robot pipeline):
#!/bin/bash
python -m robot /home/john/robotest/fuenf.robot
in "fuenf.robot" i make a simple call to a member function of a class which i instantiated in the Library section
Socket Call
${ausgabe} = test Prog
The python code sniplet:
def test_Prog(self):
print("Test .... test .... test ..... test ... ")
return
which is working when called from python interpreter
The robot pipeline is:
pipeline {
agent any
stages {
stage('Stage 1') {
steps {
sh "/bin/bash /home/john/robotest/calltest05.bash"
}
}
}
}
When i look into the jenkins pipeline step output:
hudson.AbortException: script returned exit code 253 at
org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.handleExit(DurableTaskStep.java:659)
at
org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:605)
at
org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:549)
at
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at
java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
at
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker
.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
The console output gives:
Running in Durability level: MAX_SURVIVABILITY
[Pipeline] Start of Pipeline
[Pipeline] node
Running on Jenkins in /var/lib/jenkins/workspace/TestPipelineParallel
[Pipeline] {
[Pipeline] stage
[Pipeline] { (Stage 1)
[Pipeline] sh
+ /bin/bash /home/john/robotest/calltest05.bash
[ ERROR ] Execution stopped by user.
[Pipeline] }
[Pipeline] // stage
[Pipeline] }
[Pipeline] // node
[Pipeline] End of Pipeline
ERROR: script returned exit code 253
Finished: FAILURE
I can try what i like, in every combination, i get the error:
[ ERROR ] Execution stopped by user.
I have no idea why this is happening or where to look at.
This
/bin/bash /home/john/robotest/calltest05.bash
returned 253 and Jenkins is expecting 0.
If that shell script calls some process kill it may cause the Execution stopped by user. error.
I have a customized sensor that looked like below. The idea is one dag can have different tasks that can start from different time, and take advantage of the built-in airflow reschedule system.
class MySensor(BaseSensorOperator):
def __init__(self, *, start_time, tz, ...)
super().__init__(**kwargs)
self._start_time = start_time
self._tz = tz
#provide_session
def execute(self, context, session: Session=None):
dt_start = datetime.combine(context['next_execution_date'].date(), self._start_time)
dt_start = dt_start.replace(tzinfo=self._tz)
if datetime.now().timestamp() < dt_start.timestamp():
dt_reschedule = datetime.utcnow().replace(tzinfo=UTC)
dt_reschedule += timedelta(seconds=dt_start.timestamp()-datetime.now().timestamp())
raise AirflowRescheduleException(dt_reschedule)
return super().execute(context)
In the dag, I have something as below. However, I notice when the mode is 'poke', which is default, the sensor will not work properly.
with DAG( schedule='0 10 * * 1-5', ... ) as dag:
task1 = MySensor(start_time=time(14,0), mode='poke')
task2 = MySensor(start_time=time(16,0), mode='reschedule')
... ...
From the log, i can see the following:
{taskinstance.py:1141} INFO - Rescheduling task, mark task as UP_FOR_RESCHEDULE
[5s later]
{local_task_job.py:102} INFO - Task exited with return code 0
[14s later]
{taskinstance.py:687} DEBUG - <TaskInstance: mydag.mytask execution_date [failed]> dependency 'Task Instance State' PASSED: False, Task in in the 'failed' state which is not a valid state for execution. The task must be cleared in order to be run.
{taskinstance.py:664} INFO - Dependencies not met for <TaskInstance ... [failed]> ...
Why rescheduling not working with mode='poke'? And when did the scheduler(?) flip the state of the taskinstance from "up_for_reschedule" to "failed"? Any better way to start the each task/sensor at different time? The sensor is an improved version of FileSensor, and checks a bunch of files or files with patterns. My current option is to force every task with mode='reschedule'
Airflow version 1.10.12
I'm playing with different asynchronous HTTP servers to see how they can handle multiple simultaneous connections. To force a time-consuming I/O operation I use the pg_sleep PostgreSQL function to emulate a time-consuming database query. Here is for instance what I did with Node.js:
var http = require('http');
var pg = require('pg');
var conString = "postgres://al:al#localhost/al";
/* SQL query that takes a long time to complete */
var slowQuery = 'SELECT 42 as number, pg_sleep(0.300);';
var server = http.createServer(function(req, res) {
pg.connect(conString, function(err, client, done) {
client.query(slowQuery, [], function(err, result) {
done();
res.writeHead(200, {'content-type': 'text/plain'});
res.end("Result: " + result.rows[0].number);
});
});
})
console.log("Serve http://127.0.0.1:3001/")
server.listen(3001)
So this a very simple request handler that does an SQL query taking 300ms and returns a response. When I try benchmarking it I get the following results:
$ ab -n 20 -c 10 http://127.0.0.1:3001/
Time taken for tests: 0.678 seconds
Complete requests: 20
Requests per second: 29.49 [#/sec] (mean)
Time per request: 339.116 [ms] (mean)
This shows clearly that requests are executed in parallel. Each request takes 300ms to complete and because we have 2 batches of 10 requests executed in parallel, it takes 600ms overall.
Now I'm trying to do the same with Elixir, since I heard it does asynchronous I/O transparently. Here is my naive approach:
defmodule Toto do
import Plug.Conn
def init(options) do
{:ok, pid} = Postgrex.Connection.start_link(
username: "al", password: "al", database: "al")
options ++ [pid: pid]
end
def call(conn, opts) do
sql = "SELECT 42, pg_sleep(0.300);"
result = Postgrex.Connection.query!(opts[:pid], sql, [])
[{value, _}] = result.rows
conn
|> put_resp_content_type("text/plain")
|> send_resp(200, "Result: #{value}")
end
end
In case that might relevant, here is my supervisor:
defmodule Toto.Supervisor do
use Application
def start(type, args) do
import Supervisor.Spec, warn: false
children = [
worker(Plug.Adapters.Cowboy, [Toto, []], function: :http),
]
opts = [strategy: :one_for_one, name: Toto.Supervisor]
Supervisor.start_link(children, opts)
end
end
As you might expect, this doesn't give me the expected result:
$ ab -n 20 -c 10 http://127.0.0.1:4000/
Time taken for tests: 6.056 seconds
Requests per second: 3.30 [#/sec] (mean)
Time per request: 3028.038 [ms] (mean)
It looks like there's no parallelism, requests are handled one after the other. What am I doing wrong?
Elixir should be completely fine with this setup. The difference is that your node.js code is creating a connection to the database for every request. However, in your Elixir code, init is called once (and not per request!) so you end-up with a single process that sends queries to Postgres for all requests, which then becomes your bottleneck.
The easiest solution would be to move the connection to Postgres out of init and into call. However, I would advise you to use Ecto which will set up a connection pool to the database too. You can also play with the pool configuration for optimal results.
UPDATE This was just test code, if you want to do something like this see #AlexMarandon's Ecto pool answer instead.
I've just been playing with moving the connection setup as José suggested:
defmodule Toto do
import Plug.Conn
def init(options) do
options
end
def call(conn, opts) do
{ :ok, pid } = Postgrex.Connection.start_link(username: "chris", password: "", database: "ecto_test")
sql = "SELECT 42, pg_sleep(0.300);"
result = Postgrex.Connection.query!(pid, sql, [])
[{value, _}] = result.rows
conn
|> put_resp_content_type("text/plain")
|> send_resp(200, "Result: #{value}")
end
end
With results:
% ab -n 20 -c 10 http://127.0.0.1:4000/
Time taken for tests: 0.832 seconds
Requests per second: 24.05 [#/sec] (mean)
Time per request: 415.818 [ms] (mean)
Here is the code I came up with following José's answer:
defmodule Toto do
import Plug.Conn
def init(options) do
options
end
def call(conn, _opts) do
sql = "SELECT 42, pg_sleep(0.300);"
result = Ecto.Adapters.SQL.query(Repo, sql, [])
[{value, _}] = result.rows
conn
|> put_resp_content_type("text/plain")
|> send_resp(200, "Result: #{value}")
end
end
For this to work we need to declare a repo module:
defmodule Repo do
use Ecto.Repo, otp_app: :toto
end
And start that repo in the supervisor:
defmodule Toto.Supervisor do
use Application
def start(type, args) do
import Supervisor.Spec, warn: false
children = [
worker(Plug.Adapters.Cowboy, [Toto, []], function: :http),
worker(Repo, [])
]
opts = [strategy: :one_for_one, name: Toto.Supervisor]
Supervisor.start_link(children, opts)
end
end
As José mentioned, I got the best performance by tweaking the configuration a bit:
config :toto, Repo,
adapter: Ecto.Adapters.Postgres,
database: "al",
username: "al",
password: "al",
size: 10,
lazy: false
Here is the result of my benchmark (after a few runs so that the pool has the time to "warm up") with default configuration:
$ ab -n 20 -c 10 http://127.0.0.1:4000/
Time taken for tests: 0.874 seconds
Requests per second: 22.89 [#/sec] (mean)
Time per request: 436.890 [ms] (mean)
And here is the result with size: 10 and lazy: false:
$ ab -n 20 -c 10 http://127.0.0.1:4000/
Time taken for tests: 0.619 seconds
Requests per second: 32.30 [#/sec] (mean)
Time per request: 309.564 [ms] (mean)