I have a Python project which uses the GitLab job retry API to retry a job of a pipeline.
But the retried job is failing with the error "This job depends on other jobs with expired / erased artifact". What could be the reason for this error?
stages:
  - build

build:
  tags: [kubernetes, linux, default]
  image: #image-url
  stage: build
  script:
    - python3 setup.py sdist bdist_wheel
  artifacts:
    paths:
      - $CI_PROJECT_DIR/dist
      - ${CI_PROJECT_DIR}/job
      - ${CI_PROJECT_DIR}/*.egg-info/PKG-INFO
    expire_in: 600 mins
Your artifacts expire after 600 minutes, so if you re-run a pipeline stage later than that, the artifacts are no longer present. If the stage you re-ran depends on a previous stage's artifacts, the error you are seeing occurs.
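If you are driving the retry from Python, a minimal sketch of the documented job retry endpoint (POST /projects/:id/jobs/:job_id/retry) looks like the following; the URL, token, project ID and job ID are placeholders, not values from the question. After the retry you can inspect the returned job to see whether it hit the same artifact problem:

# Minimal sketch (not the asker's code): retry a single job via the GitLab REST API.
# GITLAB_URL, PRIVATE_TOKEN, PROJECT_ID and JOB_ID are placeholders.
import requests

GITLAB_URL = "https://gitlab.example.com"
PRIVATE_TOKEN = "<your-access-token>"
PROJECT_ID = 123
JOB_ID = 456

resp = requests.post(
    f"{GITLAB_URL}/api/v4/projects/{PROJECT_ID}/jobs/{JOB_ID}/retry",
    headers={"PRIVATE-TOKEN": PRIVATE_TOKEN},
)
resp.raise_for_status()
print(resp.json())  # the retried job; check its status and failure reason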
Airflow was working fine for several weeks and then suddenly started throwing errors for a few days.
DAGs fail randomly with this error:
Log file does not exist: airflow_path/1.log
Fetching from: http://:8793/airflow_path/1.log
*** Failed to fetch log file from worker. The request to ':///' is missing either an 'http://
I had a similar issue, and I figured out that in my case the worker node (I was using the Celery executor) was exhausted and therefore unavailable to execute any DAGs on it. Can you check the CPU and memory used by the worker node (or its equivalent if you are not using the Celery executor)?
You can try increasing the CPU and memory for that node and try again.
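Not from the original answer, but a quick way to sample CPU and memory on the worker host is a couple of lines of Python with the psutil package (assuming it is installed there):

# Quick resource check on the worker host; requires the psutil package.
import psutil

cpu = psutil.cpu_percent(interval=1)   # CPU usage over a 1-second sample
mem = psutil.virtual_memory()          # system-wide memory statistics

print(f"CPU: {cpu}%")
print(f"Memory: {mem.percent}% used ({mem.used / 1e9:.1f} GB of {mem.total / 1e9:.1f} GB)")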
Happened to me as well using LocalExecutor and an Airflow setup on Docker Compose. Eventually, I figured that the webserver would fail to fetch old logs whenever I recreated my Docker containers. Digging deeper, I realized that the webserver was failing to fetch the logs because it didn't have access to the filesystem of the scheduler (where the logs live).
The fix was to ensure that both the scheduler and the webserver services in docker-compose.yml share a volume with the logs, i.e.:
# docker-compose.yml
version: "3.9"
services:
  scheduler:
    image: ...
    volumes:
      - airflow_logs:/airflow/logs
    ...
  webserver:
    image: ...
    volumes:
      - airflow_logs:/airflow/logs
    ...
volumes:
  airflow_logs:
Apache Airflow version: 1.10.10
Kubernetes version (if applicable): Not using Kubernetes or Docker
Environment: CentOS Linux release 7.7.1908 (Core), Linux 3.10.0-1062.el7.x86_64
Python version: 3.7.6
Executor: LocalExecutor
What happened:
I wrote a simple DAG to clean up Airflow logs. Everything is OK when I use the 'airflow test' command to test it; it is also OK when I trigger it manually in the web UI, which uses the 'airflow run' command to start the task.
But after I rebooted my server and restarted my webserver & scheduler services (in daemon mode), every time I trigger the exact same DAG it still gets scheduled as usual, but it exits with code 1 immediately after starting a new process to run the task.
I used the 'airflow test' command again to check whether there is now something wrong with my code, but everything seems OK with 'airflow test' while it exits silently with 'airflow run', which is really weird.
Here's the task log when it is manually triggered in the web UI (I've changed the log level to DEBUG, but still can't find anything useful), or you can read the attached log file: task error log.txt
Reading local file: /root/airflow/logs/airflow_log_cleanup/log_cleanup_worker_num_1/2020-04-29T13:51:44.071744+00:00/1.log
[2020-04-29 21:51:53,744] {base_task_runner.py:61} DEBUG - Planning to run as the user
[2020-04-29 21:51:53,750] {taskinstance.py:686} DEBUG - dependency 'Previous Dagrun State' PASSED: True, The task did not have depends_on_past set.
[2020-04-29 21:51:53,754] {taskinstance.py:686} DEBUG - dependency 'Not In Retry Period' PASSED: True, The task instance was not marked for retrying.
[2020-04-29 21:51:53,754] {taskinstance.py:686} DEBUG - dependency 'Task Instance State' PASSED: True, Task state queued was valid.
[2020-04-29 21:51:53,754] {taskinstance.py:669} INFO - Dependencies all met for
[2020-04-29 21:51:53,757] {taskinstance.py:686} DEBUG - dependency 'Previous Dagrun State' PASSED: True, The task did not have depends_on_past set.
[2020-04-29 21:51:53,760] {taskinstance.py:686} DEBUG - dependency 'Pool Slots Available' PASSED: True, ('There are enough open slots in %s to execute the task', 'default_pool')
[2020-04-29 21:51:53,766] {taskinstance.py:686} DEBUG - dependency 'Not In Retry Period' PASSED: True, The task instance was not marked for retrying.
[2020-04-29 21:51:53,768] {taskinstance.py:686} DEBUG - dependency 'Task Concurrency' PASSED: True, Task concurrency is not set.
[2020-04-29 21:51:53,768] {taskinstance.py:669} INFO - Dependencies all met for
[2020-04-29 21:51:53,768] {taskinstance.py:879} INFO -
[2020-04-29 21:51:53,768] {taskinstance.py:880} INFO - Starting attempt 1 of 2
[2020-04-29 21:51:53,768] {taskinstance.py:881} INFO -
[2020-04-29 21:51:53,779] {taskinstance.py:900} INFO - Executing on 2020-04-29T13:51:44.071744+00:00
[2020-04-29 21:51:53,781] {standard_task_runner.py:53} INFO - Started process 29718 to run task
[2020-04-29 21:51:53,805] {logging_mixin.py:112} INFO - [2020-04-29 21:51:53,805] {cli_action_loggers.py:68} DEBUG - Calling callbacks: []
[2020-04-29 21:51:53,818] {logging_mixin.py:112} INFO - [2020-04-29 21:51:53,817] {cli_action_loggers.py:86} DEBUG - Calling callbacks: []
[2020-04-29 21:51:58,759] {logging_mixin.py:112} INFO - [2020-04-29 21:51:58,759] {base_job.py:200} DEBUG - [heartbeat]
[2020-04-29 21:51:58,759] {logging_mixin.py:112} INFO - [2020-04-29 21:51:58,759] {local_task_job.py:124} DEBUG - Time since last heartbeat(0.01 s) < heartrate(5.0 s), sleeping for 4.98824 s
[2020-04-29 21:52:03,753] {logging_mixin.py:112} INFO - [2020-04-29 21:52:03,753] {local_task_job.py:103} INFO - Task exited with return code 1
How to reproduce it:
I really don't know how to reproduce it, because it happens suddenly and seems to be permanent.
Anything else we need to know:
I tried to figure out the difference between 'airflow test' and 'airflow run'; I guess it might have something to do with process forking.
What I've tried to solve this problem, all of which failed:
- clear all DAG / DAG run / task instance info, remove all files under /root/airflow except for the config file, and restart my services
- reboot my server again
- uninstall Airflow and install it again
I finally figured out how to reproduce this bug.
When you configure email in airflow.cfg and your DAG contains an email operator or uses the SMTP service, and your SMTP password contains a character like "^", the first task of your DAG will always exit with return code 1 without any error information; in my case the first task is merely a Python operator.
Although I think it's my fault for messing up the SMTP service, there should be some reasonable hint. It actually took me a whole week to debug this: I had to reset everything in my Airflow environment and slowly change the configuration to see when the bug appears.
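For anyone hitting something similar, a quick sanity check (not part of the original report) is to try the configured SMTP credentials directly with Python's smtplib; if the login works outside Airflow, the problem is more likely in how the value in airflow.cfg is being parsed than in the SMTP service itself:

# Sanity-check the SMTP settings outside of Airflow; host, port, user and
# password below are placeholders for the values in airflow.cfg.
import smtplib

SMTP_HOST = "smtp.example.com"
SMTP_PORT = 587
SMTP_USER = "user@example.com"
SMTP_PASSWORD = "pass^word"  # the password exactly as configured

with smtplib.SMTP(SMTP_HOST, SMTP_PORT) as server:
    server.starttls()
    server.login(SMTP_USER, SMTP_PASSWORD)
    print("SMTP login OK")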
Hope this information is helpful
I ran into an issue where a task in one of my DAGs never got picked up by the workers for some reason.
When I look at the task details:
All dependencies are met but the task instance is not running. In most
cases this just means that the task will probably be scheduled soon
unless:
- The scheduler is down or under heavy load
If this task instance does not start soon please contact your Airflow
administrator for assistance.
I checked the scheduler: there are no errors in its log, and I also restarted it a few times.
I also checked the Airflow webserver log and only noticed this:
22/11/2018 12:10:39[2018-11-22 01:10:39,747] {{cli.py:644}} DEBUG - [5 / 5] killing 1 workers
22/11/2018 12:10:39[2018-11-22 01:10:39 +0000] [43] [INFO] Handling signal: ttou
22/11/2018 12:10:39[2018-11-22 01:10:39 +0000] [348] [INFO] Worker exiting (pid: 348)
Not sure what happened; it worked fine before.
Airflow version is 1.9.0 and I never changed it; I was only playing around with some of the config: min_file_process_interval and dag_dir_list_interval (but I put them back to the defaults when I encountered this issue).
I do notice that this happens when I play around with some of the Airflow config and rebuild our Docker Airflow image; when I revert back to the original image, which used to work, the problem is solved.
I also noticed one error that occurred (but was not always captured) in my Celery workers when I used the newly built image:
Unrecoverable error: AttributeError("'float' object has no attribute 'items'",)
So I found that it is related to the latest redis release (Celery uses redis); you can find more details.
I have a successful Bitbucket pipeline calling out to AWS CodeDeploy, but I'm wondering if I can add a step that will check and wait for CodeDeploy success, and otherwise fail the pipeline. Would this be possible with a script that loops over a CodeDeploy call to keep monitoring the status of the deployment? Any idea which CodeDeploy call that would be?
bitbucket-pipelines.yml
image: pitech/gradle-awscli

pipelines:
  branches:
    develop:
      - step:
          caches:
            - gradle
          script:
            - gradle build bootRepackage
            - mkdir tmp; cp appspec.yml tmp; cp build/libs/thejar*.jar tmp/the.jar; cp -r scripts/ ./tmp/
            - pip install awscli --upgrade --user
            - aws deploy push --s3-location s3://thebucket/the-deploy.zip --application-name my-staging-app --ignore-hidden-files --source tmp
            - aws deploy create-deployment --application-name server-staging --s3-location bucket=staging-codedeploy,key=the-deploy.zip,bundleType=zip --deployment-group-name the-staging --deployment-config-name CodeDeployDefault.AllAtOnce --file-exists-behavior=OVERWRITE
appspec.yml
version: 0.0
os: linux
files:
  - source: thejar.jar
    destination: /home/ec2-user/the-server/
permissions:
  - object: /
    pattern: "**"
    owner: ec2-user
    group: ec2-user
hooks:
  ApplicationStop:
    - location: scripts/server_stop.sh
      timeout: 60
      runas: ec2-user
  ApplicationStart:
    - location: scripts/server_start.sh
      timeout: 60
      runas: ec2-user
  ValidateService:
    - location: scripts/server_validate.sh
      timeout: 120
      runas: ec2-user
Unfortunately it doesn't seem like Bitbucket is waiting for the ValidateService to complete, so I'd need a way in Bitbucket to confirm before marking the build a success.
The AWS CLI already has a deployment-successful wait command, which checks the status of a deployment every 15 seconds. You just need to feed the output of create-deployment into it.
In your specific case, it should look like this:
image: pitech/gradle-awscli

pipelines:
  branches:
    develop:
      - step:
          caches:
            - gradle
          script:
            - gradle build bootRepackage
            - mkdir tmp; cp appspec.yml tmp; cp build/libs/thejar*.jar tmp/the.jar; cp -r scripts/ ./tmp/
            - pip install awscli --upgrade --user
            - aws deploy push --s3-location s3://thebucket/the-deploy.zip --application-name my-staging-app --ignore-hidden-files --source tmp
            - aws deploy create-deployment --application-name server-staging --s3-location bucket=staging-codedeploy,key=the-deploy.zip,bundleType=zip --deployment-group-name the-staging --deployment-config-name CodeDeployDefault.AllAtOnce --file-exists-behavior=OVERWRITE > deployment.json
            - aws deploy wait deployment-successful --cli-input-json file://deployment.json
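If you would rather do the waiting from Python instead of the AWS CLI, boto3 exposes the equivalent waiter. This is only a sketch, and the deployment ID is a placeholder for the deploymentId returned by create-deployment:

# Sketch: wait for a CodeDeploy deployment using boto3's built-in waiter.
import boto3

client = boto3.client("codedeploy")
waiter = client.get_waiter("deployment_successful")

# "d-EXAMPLE123" is a placeholder for the deploymentId from create-deployment.
waiter.wait(deploymentId="d-EXAMPLE123")  # raises WaiterError if the deployment fails
print("Deployment succeeded")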
aws deploy create-deployment is an asynchronous call, and BitBucket has no idea that it needs to know about the success of your deployment. Adding a script to your CodeDeploy application will have no effect on BitBucket knowing about your deployment.
You have one (maybe two) options to fix this issue.
#1 Include a script that waits for your deployment to finish
You need to add a script to your BitBucket pipeline that waits for your deployment to finish. You can either use SNS notifications or poll the CodeDeploy service directly.
The pseudocode would look something like this:
loop
    check_if_deployment_complete
    if false, wait and retry
    if true && deployment successful, return 0 (success)
    if true && deployment failed, return non-zero (failure)
You can use the AWS CLI or your favorite scripting language, as in the sketch below. Add it at the end of your bitbucket-pipelines.yml script. Make sure you wait between calls to CodeDeploy when checking the status.
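A rough sketch of that loop in Python with boto3 (the deployment ID is a placeholder, and the status values come from the CodeDeploy GetDeployment API) could look like this:

# Sketch of the polling loop described above, using boto3.
import sys
import time

import boto3

client = boto3.client("codedeploy")
deployment_id = "d-EXAMPLE123"  # placeholder: the deploymentId from create-deployment

while True:
    status = client.get_deployment(deploymentId=deployment_id)["deploymentInfo"]["status"]
    if status == "Succeeded":
        sys.exit(0)                  # success -> the pipeline step passes
    if status in ("Failed", "Stopped"):
        sys.exit(1)                  # failure -> the pipeline step fails
    time.sleep(15)                   # wait between calls to CodeDeploy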
#2 (the maybe) Use BitBucket AWS CodeDeploy integration directly
BitBucket integrates with AWS CodeDeploy directly, so you might be able to use that integration rather than your own script. I don't know if this is supported or not.
For my tests I need a dummy server that serves sample/test code. I use a Node HTTP server for that and start it before my scripts with node ./test/server.js.
I can start it, but the issue is that it takes over the instance, which therefore can't run the tests.
So the question is: how can I run the server in the background or in a new instance so it doesn't conflict with the tests? I do stop the server in the after_script so I don't have to terminate it.
This is my Travis config so far:
language: node_js
node_js:
  - "6.1"
cache:
  directories:
    - node_modules
install:
  - npm install
before_script:
  - sh -e node ./test/server.js
script:
  - ./node_modules/mocha-phantomjs/bin/mocha-phantomjs ./test/browser/jelly.html
  - ./node_modules/mocha-phantomjs/bin/mocha-phantomjs ./test/browser/form.html
  - ./node_modules/mocha/bin/mocha ./test/node/jelly.js
after_script:
  - curl http://localhost:5555/close
You can background the process by appending a &:
before_script:
  - node ./test/server.js &