Airflow: connection is not being created when using environment variables

Airflow: connection is not being created when using environment variables - airflow

I want to create a Mongo connection (other than default) without using the Airflow UI.
I read from the Airflow documentation:
Connections in Airflow pipelines can be created using environment
variables. The environment variable needs to have a prefix of
AIRFLOW_CONN_ for Airflow with the value in a URI format to use the
connection properly.
When referencing the connection in the Airflow pipeline, the conn_id
should be the name of the variable without the prefix. For example, if
the conn_id is named postgres_master the environment variable should
be named AIRFLOW_CONN_POSTGRES_MASTER (note that the environment
variable must be all uppercase).
I tried to apply this when using the Puckel docker image.
This is a docker compose using that image:
version: '2.1'
services:
postgres:
image: postgres:9.6
environment:
- POSTGRES_USER=airflow
- POSTGRES_PASSWORD=airflow
- POSTGRES_DB=airflow
webserver:
image: puckel/docker-airflow:1.10.6
restart: always
depends_on:
- postgres
environment:
- LOAD_EX=n
- EXECUTOR=Local
- AIRFLOW_CONN_MY_MONGO=mongodb://mongo:27017
volumes:
- ./src/:/usr/local/airflow/dags
- ./requirements.txt:/requirements.txt
ports:
- "8080:8080"
command: webserver
healthcheck:
test: ["CMD-SHELL", "[ -f /usr/local/airflow/airflow-webserver.pid ]"]
interval: 30s
timeout: 30s
retries: 3
Note the line AIRFLOW_CONN_MY_MONGO=mongodb://mongo:27017 where I'm passing the environment variable as the Airflow documentation suggests.
Problem here is that there is no my_mongo connection created when I'm listing the connections in the UI.
Any advice? Thanks!

The connection won't be listed in the UI when you create it with environment variable.
Reason:
Airflow supports the creation of connections via Environment variable for ad-hoc jobs in the DAGs
The connection in the UI are actually saved in the DB and retrieved from it. The ones created by Env vars are not stored in DB
How do I test my connection?
Create a sample DAG and use your connection to run a sample job. It should work fine.

I read a Puckel issue where they mention that the connection is created, but is not showed in the UI. I tested it and in fact the connection works when used in a DAG.

Related

DynamoDB local behaving erratically

This is a very strange situation that's driving me nuts, and I would really appreciate some help here.
I am using CDK to define the DynamoDB table and associated indices. To test them locally, I installed cdklocal and DynamoDB local using localstack. When the computer (Mac running Ventura 13.1) is restarted, everything works as expected. Here is the script I use to bootstrap and start the stack (this is in a file called startStack.sh):
docker-compose up -d
echo "Waiting for 5 seconds"
sleep 5
cd test-app
cdklocal bootstrap
echo "Waiting for 5 seconds"
sleep 5
cdklocal deploy TestAppStack
#cdklocal deploy TestAppStack/ops-table
DYNAMO_ENDPOINT="http://localhost:4566/" dynamodb-admin &
open http://0.0.0.0:8001
cd ..
The test-app directory contains a local copy of the DynamoDB (and associated indices) definition. I do not encounter any errors running the cdklocal (or cdk) deploy commands so I am assuming that the CDK definition is not an issue.
The docker-compose looks like this:
version: "3.8"
services:
localstack:
container_name: AWS-DEVELOPMENT-WITH-LOCALSTACK
image: localstack/localstack:latest
network_mode: bridge
ports:
- "127.0.0.1:53:53"
- "127.0.0.1:53:53/udp"
- "127.0.0.1:443:443"
- "127.0.0.1:4566:4566"
- "127.0.0.1:4571:4571"
- "127.0.0.1:${PORT_WEB_UI-8080}:${PORT_WEB_UI-8080}"
environment:
- DYNAMODB_SHARE_DB=1
- DISABLE_CORS_CHECKS=1
- SERVICES=s3,dynamodb,sns,sqs,firehose,kinesis,ses,sts,cloudformation
- DEBUG=1
- DATA_DIR=/tmp/localstack/data
- PORT_WEB_UI=8080
- LAMBDA_EXECUTOR=local
- KINESIS_ERROR_PROBABILITY=1.0
- DOCKER_HOST=unix:///var/run/docker.sock
- HOST_TMP_FOLDER=./.localstack
volumes:
- './.localstack:/var/lib/localstack'
- '/var/run/docker.sock:/var/run/docker.sock'
Everything works as expected when I first run the startStack.sh file - the dynamodb-admin window opens up correctly and other interfaces can interact with the local DynamoDB table. But after some time (and I have not been able to pinpoint the cause), all interactions with local DynamoDB start failing with the following error(s):
Bootstrapping environment aws://000000000000/us-west-2...
❌ Environment aws://000000000000/us-west-2 failed bootstrapping: UnknownEndpoint: Inaccessible host: `localhost' at port `4566'. This service may not be available in the `us-west-2' region.
at Request.ENOTFOUND_ERROR (/usr/local/lib/node_modules/aws-sdk/lib/event_listeners.js:611:46)
at Request.callListeners (/usr/local/lib/node_modules/aws-sdk/lib/sequential_executor.js:106:20)
at Request.emit (/usr/local/lib/node_modules/aws-sdk/lib/sequential_executor.js:78:10)
at Request.emit (/usr/local/lib/node_modules/aws-sdk/lib/request.js:686:14)
at error2 (/usr/local/lib/node_modules/aws-sdk/lib/event_listeners.js:443:22)
at ClientRequest.<anonymous> (/usr/local/lib/node_modules/aws-sdk/lib/http/node.js:99:9)
at ClientRequest.emit (node:events:513:28)
at ClientRequest.emit (node:domain:489:12)
at Socket.socketErrorListener (node:_http_client:494:9)
at Socket.emit (node:events:513:28) {
code: 'UnknownEndpoint',
region: 'us-west-2',
hostname: 'localhost',
retryable: true,
originalError: [Error],
time: 2023-01-15T06:46:40.614Z
}
Inaccessible host: `localhost' at port `4566'. This service may not be available in the `us-west-2' region.
The script hangs at the following message:
[16:52:01] Retrieved account ID 000000000000 from disk cache
[16:52:01] Assuming role 'arn:aws:iam::000000000000:role/cdk-hnb659fds-deploy-role-000000000000-us-west-2'.
[16:52:01] Assuming role failed: Inaccessible host: `localhost' at port `4566'. This service may not be available in the `us-west-2' region.
[16:52:01] Could not assume role in target account using current credentials Inaccessible host: `localhost' at port `4566'. This service may not be available in the `us-west-2' region. . Please make sure that this role exists in the account. If it doesn't exist, (re)-bootstrap the environment with the right '--trust', using the latest version of the CDK CLI.
current credentials could not be used to assume 'arn:aws:iam::000000000000:role/cdk-hnb659fds-deploy-role-000000000000-us-west-2', but are for the right account. Proceeding anyway.
[16:52:01] Waiting for stack CDKToolkit to finish creating or updating...
Restarting the computer fixes it, but it's not clear what causes the issue in the first place. Restarting Docker does not help either.
Any thoughts on what could be causing the problem and how I can avoid it?

I'm adding this as an answer, although I do not have an affirmative answer I thought I would try to help.
I believe your port is being occupied and thus the process you are running is unable to obtain it resulting in error. Before running the job, check if the port is occupied:
sudo lsof -i :4566

Airflow Log file does not exist:

Airflow was working fine for several weeks and suddenly started getting errors for a few days.
Dags fail randomly with this error.
Log file does not exist: airflow_path/1.log
Fetching from: http://:8793/airflow_path/1.log
*** Failed to fetch log file from worker. The request to ':///' is missing either an 'http://

I had a similar issue, and I figured that in my case the worker node (I was using Celery Executor) was exhausted and therefore unavailable to execute any dags on it, can you check the CPU and memory utilized by the worker node (or its alternative if you are not using celery executor).
You can try to increase the CPU and memory for that applicable node and try.

Happened to me as well using LocalExecutor and an Airflow setup on Docker Compose. Eventually, I figured that the webserver would fail to fetch old logs whenever I recreated my Docker containers. Digging deeper, I realized that the webserver was failing to fetch the logs because it didn't have access to the filesystem of the scheduler (where the logs live).
The fix was to ensure that both the scheduler and the webserver services in docker-compose.yml share a volume with the logs, i.e.:
# docker-compose.yml
version: "3.9"
services:
scheduler:
image: ...
volumes:
- airflow_logs:/airflow/logs
...
webserver:
image: ...
volumes:
- airflow_logs:/airflow/logs
...
volumes:
airflow_logs:

AWS credentials not found for celery-k8s deployment

I'm trying to run dagster using celery-k8s and using the examples/celery-k8s as a start. upon running the pipeline from playground I get
Initialization of resources [s3, io_manager] failed.
botocore.exceptions.NoCredentialsError: Unable to locate credentials
I have configured aws credentials in env variables as mentioned in the document
deployments:
- name: "user-code-deployment-test"
image:
repository: "somasays/dagster-usercode-example"
tag: "0.5"
pullPolicy: Always
dagsterApiGrpcArgs:
- "-f"
- "/workspace/repo.py"
port: 3030
env:
AWS_ACCESS_KEY_ID: AAAAAAAAAAAAAAAAAAAAAAAAA
AWS_SECRET_ACCESS_KEY: qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq
AWS_DEFAULT_REGION: eu-central-1
and I can also see these values are set in the env variables of the pod and can also access the s3 location after pip install awscli and aws s3 ls see the screenshot below the job pod however throws Unable to locate credentials
Please help

The deployment configuration applies to the user code servers. Meanwhile the celery executor runs your pipeline code in separate kubernetes jobs. To provide your secrets there, you will want to configure the env_secrets field of the celery-k8s executor in your pipeline run config.
See https://github.com/dagster-io/dagster/blob/master/python_modules/libraries/dagster-k8s/dagster_k8s/job.py#L321-L327 for details on the config.

Airflow (LocalExecutor) - Docker :: Job is failing with Log file does not exist

Airflow version: 1.10.9
Executor : LocalExecutor
Docket Setup
when job runs sometime we are getting following error. I have searched in web, many people faced this issue in celeryExecutor but we are using LocalExecutor(Docker setup). How can I resolve this problem?
*** Log file does not exist: /home/ubuntu/airflow/airflow/logs/es_update_relevance_score/es_update_relevance_score/2020-05-14T16:26:06.062416+00:00/1.log
*** Fetching from: http://:8793/log/es_update_relevance_score/es_update_relevance_score/2020-05-14T16:26:06.062416+00:00/1.log
*** Failed to fetch log file from worker. Invalid URL 'http://:8793/log/es_update_relevance_score/es_update_relevance_score/2020-05-14T16:26:06.062416+00:00/1.log': No host supplied

Here is one approach I've seen when running the scheduler and webserver in their own containers and using LocalExecutor:
Mount a host log directory as a volume into both the scheduler and webserver containers:
volumes:
- /location/on/host/airflow/logs:/opt/airflow/logs
Make sure the user within the airflow containers (usually airflow) has permissions to read and write that directory. If the permissions are wrong you will see an error like the one in your post.
This probably won't scale beyond LocalExecutor usage though.

How to handle multiple schemas in a single migrate with flyway

I'm totally new to Flyway but I'm trying to migrate a number of identical test databases using the docker-compose flyway+mysql arrangement described in https://github.com/flyway/flyway-docker
As far as I can tell, the migrate command can take multiple schemas in its -schemas argument but it only seems to apply the actual SQL migration to the first schema in the list.
For example, when I run the migrate with schemas=test_1,test_2,test_3, flyway creates all three databases but only creates the tables specified in the migration file on the first test_1 database.
Is there a way to apply the SQL migration file to all the schemas in the list?

I'm going to leave this question up in case someone can still answer how multiple schemas is useful if the migration file isn't applied to all databases in the list. But, I was able to handle multiple databases in a docker-compose by overriding the flyway entrypoint and command.
So now my docker-compose service looks like:
services:
flyway:
image: flyway/flyway:6.1.4
volumes:
- ./migrations:/flyway/sql
depends_on:
- db
entrypoint: ["bash"]
command: > -c "/flyway/flyway -url=jdbc:mysql://db -schemas=test1 migrate;
/flyway/flyway -url=jdbc:mysql://db -schemas=test2 migrate"

For me what worked was breaking up my migrations into separate executions in my docker-compose file along with docker-postgresql-multiple-databases as follows:
version: '3.8'
services:
postgres-db:
image: 'postgres:13.3'
environment:
POSTGRES_MULTIPLE_DATABASES: 'customers,addresses'
POSTGRES_USER: 'pocketlaundry'
POSTGRES_PASSWORD: 'iceprism'
volumes:
- ./docker-postgresql-multiple-databases:/docker-entrypoint-initdb.d
expose:
- '5432' # Publishes 5432 to other containers (addresses-flyway, customers-flyway) but NOT to host machine
ports:
- '5432:5432'
addresses-flyway:
image: flyway/flyway:7.12.0
command: -url=jdbc:postgresql://postgres-db:5432/addresses -schemas=public -user=pocketlaundry -password=iceprism -connectRetries=60 migrate
volumes:
- ./sports-ball-project/src/test/resources/db/addresses/migrations:/flyway/sql
depends_on:
- postgres-db
links:
- postgres-db
customers-flyway:
image: flyway/flyway:7.12.0
command: -url=jdbc:postgresql://postgres-db:5432/customers -schemas=public -user=pocketlaundry -password=iceprism -connectRetries=60 migrate
volumes:
- ./sports-ball-project/src/test/resources/db/customers/migrations:/flyway/sql
depends_on:
- postgres-db
links:
- postgres-db

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex