How to retry failed steps in run-workers?

I want to combine the advantages of run-workers (multiple agents to save time) and run-rerun (to fight flaky tests), but I don't know how to do that.
Currently, I run npx codeceptjs run-workers --suites 5 --steps, but I have some flaky tests. I want the same feature as run-rerun, retrying failed tests (minimum success, maximum reruns). Is it possible?
What is the solution? Any help would be appreciated.

What about using the retryFailedStep plugin?
You could also try the other retry options:
Auto retry
Retry Step
Retry Scenario
Retry Feature
Once you have configured that, try running your tests with npx codeceptjs run-workers 5.


Custom Operator States (queued, success, etc.) in Apache Airflow?

In Apache Airflow (2.x), each Operator Instance has a state as defined here (airflow source repo).
I have two use cases that don't seem to clearly fall into the pre-defined states:
1. Warn, but don't fail - This seems like it should be a very standard use case and I am surprised to not see it in the out-of-the-box airflow source code. Basically, I'd like to color-code a node with something eye-catching - say orange - corresponding to a non-fatal warning, but continue execution as normal otherwise. Obviously you can print warnings to the log, but finding them takes more work than just looking at the colorful circles on the DAGs page.
2. "Sensor N/A" or "Data not ready" - This would be a status that gets assigned when a sensor notices that data in the source system is not yet ready, and that downstream operators can be skipped until the next execution of the DAG, but that nothing in the data pipeline is broken. Basically an expected end-of-branch.
Is there a good way of achieving either of these use cases with the out-of-the-box Airflow node states? If not, is there a way to define custom operator states? Since I am running Airflow on a managed service (MWAA), I don't think changing the source code of our deployment is an option.
The task states are tightly integrated with Airflow, so you cannot define custom ones. For #1, there's no way to configure which logging levels lead to which state. I'd say the easiest way is to grep the log files for "WARNING" or to set up a log aggregation service, e.g. Elasticsearch, to make the log files searchable.
For #2, a sensor has no knowledge of why it timed out. Once timeout or execution_timeout is reached, it simply raises an exception. You can deal with exceptions using trigger_rules, but these still don't take the reason for an exception into account.
If you want more control over this, I would implement your own Sensor which takes an extra argument, e.g. data_not_ready_timeout (which is smaller than timeout and execution_timeout). In the poke() method, check whether data_not_ready_timeout has been reached, and raise an AirflowSkipException if so. This will skip downstream tasks. Once timeout or execution_timeout is reached, the task fails. Look at BaseSensorOperator.execute() for some inspiration on how to get the initial starting date of a sensor.
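The check described above could look roughly like the sketch below. It is deliberately free of Airflow imports so it runs on its own: SkipDownstream and check_data_not_ready are hypothetical names, and in a real sensor's poke() the stand-in exception would be replaced by airflow.exceptions.AirflowSkipException. Where started_at comes from depends on how your sensor records its first poke.

```python
from datetime import datetime, timedelta, timezone


class SkipDownstream(Exception):
    """Stand-in for airflow.exceptions.AirflowSkipException (assumption for this sketch)."""


def check_data_not_ready(started_at, now, data_not_ready_timeout):
    """Return how long the sensor has waited, or raise once the soft limit is hit.

    started_at and now are timezone-aware datetimes; data_not_ready_timeout is a
    timedelta that should be smaller than the sensor's hard timeout.
    """
    waited = now - started_at
    if waited >= data_not_ready_timeout:
        # In a real poke() this line would raise AirflowSkipException instead.
        raise SkipDownstream(f"data not ready after {waited}")
    return waited


# Usage: a sensor that started 10 minutes ago with a 30-minute soft limit keeps poking.
start = datetime(2024, 1, 1, 6, 0, tzinfo=timezone.utc)
waited = check_data_not_ready(start, start + timedelta(minutes=10), timedelta(minutes=30))
```

Keeping the soft limit separate from timeout means a genuinely stuck sensor still fails, while "data not ready yet" ends the branch quietly.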

how to speed up initialization of run in firebase test lab

When running:
gcloud firebase test android run --type=instrumentation --app=app.apk --test=test_app.apk
The firebase command line is stuck many minutes in "Creating individual test executions".
Debugging further, it seems that the command line periodically polls a backend "" until it gets an OK.
Is there a way to speed this up? This step can take 5 minutes, which wastes CI time unnecessarily.
The command line was missing the part: --device model=NexusLowRes,version=29 --verbosity=debug
I analyzed the issue further. It takes about 100 s to upload both the app and the test app, and another 150 s to create the test execution. So I think this is a limitation of the system and nothing can be done here. Maybe the size of the APK is the limiting factor: it is about 200 MB, and it takes a long time to scan.
Please see my comment on your question asking for additional details that could affect the answer.
One option is to add --async to your command. This will only poll the matrix status until it verifies that the matrix is created successfully, then exit without waiting for the test to actually run.
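If the command is launched from a CI script, the suggestion above amounts to appending one flag. A minimal sketch, assuming a Python wrapper (build_ftl_command is a hypothetical helper; the file names and device string are the ones from the question):

```python
import shlex


def build_ftl_command(app, test, device, run_async=True):
    """Build the gcloud argument list; --async returns once the matrix is created."""
    cmd = [
        "gcloud", "firebase", "test", "android", "run",
        "--type=instrumentation",
        f"--app={app}",
        f"--test={test}",
        "--device", device,
    ]
    if run_async:
        cmd.append("--async")
    return cmd


cmd = build_ftl_command("app.apk", "test_app.apk", "model=NexusLowRes,version=29")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Note that this does not shorten the upload or the matrix creation itself; it only stops the CI job from waiting for the tests to finish.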

Google Cloud Composer (Apache Airflow) cannot access log files

I'm running a DAG in Google Cloud Composer (hosted Airflow) which runs fine in Airflow locally. All it does is print "Hello World". However, when I run it through Cloud Composer I receive the error:
*** Log file does not exist: /home/airflow/gcs/logs/matts_custom_dag/main_test/2020-04-20T23:46:53.652833+00:00/2.log
*** Fetching from: http://airflow-worker-d775d7cdd-tmzj9:8793/log/matts_custom_dag/main_test/2020-04-20T23:46:53.652833+00:00/2.log
*** Failed to fetch log file from worker. HTTPConnectionPool(host='airflow-worker-d775d7cdd-tmzj9', port=8793): Max retries exceeded with url: /log/matts_custom_dag/main_test/2020-04-20T23:46:53.652833+00:00/2.log (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f8825920160>: Failed to establish a new connection: [Errno -2] Name or service not known',))
I've also tried making the DAG add data into a database and it actually succeeds 50% of the time. However, it always returns this error message (and no other print statements or logs). Any help much appreciated on why this might be happening.
We faced the same issue, raised a support ticket with GCP, and got the following reply:
The message is related to the latency of syncing logs from Airflow workers to WebServer, it takes at least some minutes (depending on the number of objects and their size)
The total log size seems not large but it’s enough to noticeably slow down synchronization, hence, we recommend cleanup/archive the logs
Basically we recommend relying on Stackdriver logs instead, because of latency due to the design of this sync
I hope this will help you solve the problem.
I have the same problem after upgrading from 1.10.3 to 1.10.6 of Google Composer.
I can see in my logs that Airflow is trying to get the logs from a bucket whose name ends with -tenant, while the bucket in my account ends with -bucket.
In the configuration, I can see something weird too.
## airflow.cfg
remote_base_log_folder = gs://us-east1-dada-airflow-xxxxx-bucket/logs
## also in the running configuration says
core remote_base_log_folder gs://us-east1-dada-airflow-xxxxx-tenant/logs env var
I wrote to Google support and they said the team is working on a fix.
In the meantime, I've been accessing my logs with gsutil, replacing the bucket name suffix with -bucket:
gsutil cat gs://us-east1-dada-airflow-xxxxx-bucket/logs/...../5.logs
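The suffix swap in that workaround can be scripted. A small sketch, assuming the only difference between the two names is the -tenant / -bucket suffix described above (to_bucket_url is a hypothetical helper, not part of gsutil or Composer):

```python
def to_bucket_url(log_url):
    """Rewrite a gs:// URL whose bucket name ends in -tenant to end in -bucket."""
    prefix = "gs://"
    if not log_url.startswith(prefix):
        raise ValueError(f"not a gs:// URL: {log_url}")
    # Split off the bucket name so that only the bucket, not the object path, is changed.
    bucket, sep, path = log_url[len(prefix):].partition("/")
    if bucket.endswith("-tenant"):
        bucket = bucket[: -len("-tenant")] + "-bucket"
    return f"{prefix}{bucket}{sep}{path}"


fixed = to_bucket_url("gs://us-east1-dada-airflow-xxxxx-tenant/logs")
print(fixed)  # gs://us-east1-dada-airflow-xxxxx-bucket/logs
```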
I faced the same situation on multiple occasions.
When I looked at the log in the Airflow web UI right after the job finished, it gave me the same error. When I checked the same logs in the UI again a minute or two later, I could see them properly.
As the above answers say, it's a sync issue between the webserver and the worker node.
In general, the issue described here should be sporadic.
In certain situations, it can help to set default-task-retries to a value that allows a task to be retried at least once.
This issue has been resolved since at least Airflow version 1.10.10+composer.

Airflow Dependencies Blocking Task From Getting Scheduled

I have an Airflow instance that had been running with no problems for 2 months until Sunday. There was a blackout in a system on which my Airflow tasks depend, and some tasks were queued for 2 days. After that we decided it was better to mark all the tasks for that day as failed and just lose that data.
Nevertheless, all new tasks now get triggered at the proper time but are never set to any state (neither queued nor running). I checked the logs and I see this output:
Dependencies Blocking Task From Getting Scheduled
All dependencies are met but the task instance is not running. In most cases this just means that the task will probably be scheduled soon unless:
The scheduler is down or under heavy load
The following configuration values may be limiting the number of queueable processes: parallelism, dag_concurrency, max_active_dag_runs_per_dag, non_pooled_task_slot_count
This task instance already ran and had its state changed manually (e.g. cleared in the UI)
I get the impression that the third point is the reason why it is not working.
The scheduler and the webserver were working, however I restarted the scheduler and still I am having the same outcome. I also deleted the data in mysql database for one job and it is still not running.
I also saw a couple of posts saying that a task does not run when depends_on_past is set to true and the previous runs failed, because the next one will then never be executed. I checked this and it is not my case.
Any input or ideas would be really appreciated. Thanks.
While debugging a similar issue I found this setting: AIRFLOW__SCHEDULER__MAX_DAGRUNS_PER_LOOP_TO_SCHEDULE. Checking the Airflow code, it seems that the scheduler queries for dagruns to examine (i.e. to consider running TIs for), and this query is limited to that number of rows (20 by default). So if you have more than 20 dagruns that are blocked in some way (in our case because TIs were up-for-retry), it won't consider other dagruns even though these could run fine.
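The starvation effect described above can be illustrated with a toy model. This is not Airflow's scheduler code: the function names are hypothetical, and the sketch assumes the blocked dagruns happen to sort ahead of the runnable ones in the query result.

```python
def runs_examined(dag_runs, max_dagruns_per_loop):
    """Toy model: each scheduler loop only looks at the first N dag runs."""
    return dag_runs[:max_dagruns_per_loop]


def schedulable(dag_runs, max_dagruns_per_loop=20):
    """Return the ids of examined dag runs that are not blocked."""
    examined = runs_examined(dag_runs, max_dagruns_per_loop)
    return [run["id"] for run in examined if not run["blocked"]]


# 25 blocked runs (e.g. waiting on retries) ahead of 5 runs that could start now.
blocked = [{"id": f"old_{i}", "blocked": True} for i in range(25)]
fresh = [{"id": f"new_{i}", "blocked": False} for i in range(5)]
queue = blocked + fresh

print(schedulable(queue))      # limit 20: only blocked runs are examined
print(schedulable(queue, 50))  # higher limit: the fresh runs are reached
```

In this model, either raising the limit or clearing the blocked runs lets the fresh runs be considered again.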

Airflow Cluster Policy not taking effect

I'm attempting to use a Cluster Policy in Airflow 1.9. I followed the instructions in the official documentation, but it doesn't seem to be taking effect.
In my file at $AIRFLOW_HOME/config/, I've defined the method as the docs instructed and it has the following signature:
def policy(task_instance):
Additional concerns:
What Airflow component is actually running the policy code (is it the scheduler)?
Is there a recommended way to unit test cluster policy code? If not, what about local testing?
Can anyone help me understand why this Cluster Policy isn't taking effect?
I'm using Airflow 1.9.
So you seem to have the file in the right place according to the documentation, and your signature is right.
But you haven't shown what you did and how it "did not work".
I believe the def policy(task): signature is run on the scheduler after DAG parsing (as the docs seem to say), while the def task_instance_mutation_hook(ti): signature is run by the task executor on the worker. That's probably why you're not seeing some changes take effect.
E.g. a timeout or queue is something the scheduler enforces, but a connection ID is something the worker needs to know during execution.
So if what you wanted to work was a timeout policy, it should have, but if what you wanted to work was a connection ID enforcement, that wouldn't have.
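On the local-testing question, a policy is a plain function that mutates the object it receives, so it can be exercised with a stand-in object and no running Airflow. A minimal sketch, assuming a policy that reroutes sensors to a dedicated queue; "sensor_queue" and the FakeSensor and FakeOperator classes are hypothetical. In a deployment the function would live in the Airflow local settings module.

```python
def policy(task):
    """Cluster policy sketch: send every sensor-like task to a dedicated queue."""
    if type(task).__name__.endswith("Sensor"):
        task.queue = "sensor_queue"  # hypothetical queue name


class FakeSensor:
    """Stand-in for a real sensor operator, for local testing only."""

    def __init__(self):
        self.queue = "default"


class FakeOperator:
    """Stand-in for a non-sensor operator, which the policy should leave alone."""

    def __init__(self):
        self.queue = "default"


sensor, other = FakeSensor(), FakeOperator()
policy(sensor)
policy(other)
print(sensor.queue, other.queue)  # sensor_queue default
```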
