AWS credentials not found for celery-k8s deployment - dagster

I'm trying to run dagster using celery-k8s and using the examples/celery-k8s as a start. upon running the pipeline from playground I get
Initialization of resources [s3, io_manager] failed.
botocore.exceptions.NoCredentialsError: Unable to locate credentials
I have configured aws credentials in env variables as mentioned in the document
deployments:
- name: "user-code-deployment-test"
image:
repository: "somasays/dagster-usercode-example"
tag: "0.5"
pullPolicy: Always
dagsterApiGrpcArgs:
- "-f"
- "/workspace/repo.py"
port: 3030
env:
AWS_ACCESS_KEY_ID: AAAAAAAAAAAAAAAAAAAAAAAAA
AWS_SECRET_ACCESS_KEY: qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq
AWS_DEFAULT_REGION: eu-central-1
and I can also see these values are set in the env variables of the pod and can also access the s3 location after pip install awscli and aws s3 ls see the screenshot below the job pod however throws Unable to locate credentials
Please help

The deployment configuration applies to the user code servers. Meanwhile the celery executor runs your pipeline code in separate kubernetes jobs. To provide your secrets there, you will want to configure the env_secrets field of the celery-k8s executor in your pipeline run config.
See https://github.com/dagster-io/dagster/blob/master/python_modules/libraries/dagster-k8s/dagster_k8s/job.py#L321-L327 for details on the config.

Related

S2i build command pass user name password in Azure Devops pipeline

We are using S2i Build command in our Azure Devops pipeline and using the below command task.
`./s2i build http://azuredevopsrepos:8080/tfs/IT/_git/shoppingcart --ref=S2i registry.access.redhat.com/ubi8/dotnet-31 --copy shopping-service`
The above command asks for user name and password when the task is executed,
How could we provide the username and password of the git repository from the command we are trying to execute ?
Git credential information can be put in a file .gitconfig on your home directory.
As I looked at the document*2 for s2i cli, I couldn't find any information for secured git.
I realized that OpenShift BuildConfig uses .gitconfig file while building a container image.*3 So, It could work.
*1: https://git-scm.com/book/en/v2/Git-Tools-Credential-Storage
*2: https://github.com/openshift/source-to-image/blob/master/docs/cli.md#s2i-build
*3: https://docs.openshift.com/container-platform/4.11/cicd/builds/creating-build-inputs.html#builds-gitconfig-file-secured-git_creating-build-inputs
I must admit I am unfamiliar with Azure Devops pipelines, however if this is running a build on OpenShift you can create a secret with your credentials using oc.
oc create secret generic azure-git-credentials --from-literal=username=<your-username> --from-literal=password=<PAT> --type=kubernetes.io/basic-auth
Link the secret we created above to the builder service account, this account is the one OpenShift uses by default behind the scenes when running a new build.
oc secrets link builder azure-git-credentials
Lastly, you will want to link this source-secret to the build config.
oc set build-secret --source bc/<your-build-config> azure-git-credentials
Next time you run your build the credentials should be picked up from the source-secret in the build config.
You can also do this from the UI on OpenShift, steps below are a copy of what is done above, choose one but not both.
Create a secret from YAML, modify the below where indicated:
kind: Secret
apiVersion: v1
metadata:
name: azure-git-credentials
namespace: <your-namespace>
data:
password: <base64-encoded-password-or-PAT>
username: <base64-encoded-username>
type: kubernetes.io/basic-auth
Then under the ServiceAccounts section on OpenShift, find and edit the 'builder' service account.
kind: ServiceAccount
apiVersion: v1
metadata:
name: builder
namespace: xxxxxx
secrets:
- name: azure-git-credentials ### only add this line, do not edit anything else.
And finally, edit your build config for the build finding where the git entry is and adding the source-secret entry:
source:
git:
uri: "https://github.com/user/app.git"
### Add the entries below ###
sourceSecret:
name: "azure-git-credentials"

Airflow:Logs not showing in the UI while tasks running

Airflow Version: 2.2.4
Airflow running in EKS
Issue: Logs not showing in the UI while tasks running
The issue with the logs is that airflow is only writing the logs to the log file rather than standard out as well. This is what's preventing us from being able to see the logs in the web UI while the task is running.
When i get into the pod , i do see log inside the pod
Is there any solution to finding the setting or configuration needed to output to both?
I log as below
kubectl logs detaskdate0.3d55e5ba89ca4ad491bb3e1cadfdaaec -n airflow
Added new context arn:aws:eks:us-west-2:XXXXXXXX:cluster/us-west-2-airflow-cluster to /home/airflow/.kube/config
[2022-05-20 19:56:43,529] {dagbag.py:500} INFO - Filling up the DagBag from /opt/airflow/dags/tss/dq_tss_mod_date_dag.py

Airflow Log file does not exist:

Airflow was working fine for several weeks and suddenly started getting errors for a few days.
Dags fail randomly with this error.
Log file does not exist: airflow_path/1.log
Fetching from: http://:8793/airflow_path/1.log
*** Failed to fetch log file from worker. The request to ':///' is missing either an 'http://
I had a similar issue, and I figured that in my case the worker node (I was using Celery Executor) was exhausted and therefore unavailable to execute any dags on it, can you check the CPU and memory utilized by the worker node (or its alternative if you are not using celery executor).
You can try to increase the CPU and memory for that applicable node and try.
Happened to me as well using LocalExecutor and an Airflow setup on Docker Compose. Eventually, I figured that the webserver would fail to fetch old logs whenever I recreated my Docker containers. Digging deeper, I realized that the webserver was failing to fetch the logs because it didn't have access to the filesystem of the scheduler (where the logs live).
The fix was to ensure that both the scheduler and the webserver services in docker-compose.yml share a volume with the logs, i.e.:
# docker-compose.yml
version: "3.9"
services:
scheduler:
image: ...
volumes:
- airflow_logs:/airflow/logs
...
webserver:
image: ...
volumes:
- airflow_logs:/airflow/logs
...
volumes:
airflow_logs:

Euca 5.0 Ansible Console Task Failing

Background:
I am only able to get past the ansible console install/config tasks by adding --region localhost to anywhere in: /usr/share/eucalyptus-ansible/roles/cloud-post/tasks/console.yml wherever it calls tools that take that argument.
Otherwise each sub task fails like this: ["euca-describe-images: error: connection error (('Connection aborted.', gaierror(-2, 'Name or service not known')))"]
Running the commands from that playbook directly on the euca server being configured gives the same result unless I specify --region localhost
Problem:
I'm stuck here: [cloud-post : update console route53 system domain for eucalyptus-cloud authentication]
Error: "euform-update-stack: error (ValidationError): No updates are to be performed.", "stderr_lines": ["euform-update-stack: error (ValidationError): No updates are to be performed."]
All services are running except the ImagingBackend is Not Ready
No instances are running according to euca-describe-instances
Images are available:
IMAGE ami-5be483c81cf8bd65c eucalyptus-console-image-5-0-823/eucalyptus-console-image-5-0-823.raw.manifest.xml 000216594841 available private x86_64 machine instance-store hvm
TAG image ami-5be483c81cf8bd65c type eucalyptus-console-image
TAG image ami-5be483c81cf8bd65c version 5.0.823
IMAGE ami-f31092ddb73e29af9 eucalyptus-service-image-v5.0.100/eucalyptus-service-image.raw.manifest.xml 000216594841 available privatx86_64 machine instance-store hvm
TAG image ami-f31092ddb73e29af9 provides imaging,loadbalancing
TAG image ami-f31092ddb73e29af9 type eucalyptus-service-image
TAG image ami-f31092ddb73e29af9 version 5.0.100
---
all:
hosts:
exp-euca.lan.com:
exp-enc-[01:02].lan.com:
vars:
vpcmido_public_ip_range: "192.168.100.5-192.168.100.254"
vpcmido_public_ip_cidr: "192.168.100.1/24"
cloud_system_dns_dnsdomain: "cloud.lan.com"
cloud_public_port: 443
eucalyptus_console_cloud_deploy: yes
cloud_service_image_rpm: no
cloud_properties:
services.imaging.worker.ntp_server: "x.x.x.x"
services.loadbalancing.worker.ntp_server: "x.x.x.x"
children:
cloud:
hosts:
exp-euca.lan.com:
console:
hosts:
exp-euca.lan.com:
node:
hosts:
exp-enc-[01:02].lan.com:
EDIT:
Solved. Details are in the comments of the marked answer.
The name error most likely means that DNS for the domain cloud.lan.com is not being correctly delegated to your deployment. To test this, check if the nameserver is found:
dig +short NS cloud.lan.com
you should see "ns1.cloud.lan.com" and then should be able to use that nameserver to resolve services, e.g.
dig +short ec2.cloud.lan.com #ns1.cloud.lan.com
which should be the IP of the host for the compute service.
The second item is a bug in the ansible playbook that occurs when the stack is already present and up to date. To work around it, you can either update your playbook or delete the stack before running the playbook. Depending on how far the playbook progressed you may have a script to do this:
/usr/local/bin/console-manage-stack -a delete
the related playbook change is https://github.com/AppScale/ats-deploy/pull/36

Airflow (LocalExecutor) - Docker :: Job is failing with Log file does not exist

Airflow version: 1.10.9
Executor : LocalExecutor
Docket Setup
when job runs sometime we are getting following error. I have searched in web, many people faced this issue in celeryExecutor but we are using LocalExecutor(Docker setup). How can I resolve this problem?
*** Log file does not exist: /home/ubuntu/airflow/airflow/logs/es_update_relevance_score/es_update_relevance_score/2020-05-14T16:26:06.062416+00:00/1.log
*** Fetching from: http://:8793/log/es_update_relevance_score/es_update_relevance_score/2020-05-14T16:26:06.062416+00:00/1.log
*** Failed to fetch log file from worker. Invalid URL 'http://:8793/log/es_update_relevance_score/es_update_relevance_score/2020-05-14T16:26:06.062416+00:00/1.log': No host supplied
Here is one approach I've seen when running the scheduler and webserver in their own containers and using LocalExecutor:
Mount a host log directory as a volume into both the scheduler and webserver containers:
volumes:
- /location/on/host/airflow/logs:/opt/airflow/logs
Make sure the user within the airflow containers (usually airflow) has permissions to read and write that directory. If the permissions are wrong you will see an error like the one in your post.
This probably won't scale beyond LocalExecutor usage though.

Resources