Why doesn't "kubectl describe snapshot" show reason for failure - kubernetes-pvc

Using a CSI driver, I created volumes and took a snapshot, but snapshot creation failed due to some authentication issues. Retries were happening, but I was expecting the reason for failure in the kubectl describe snapshot output. Even after 17 minutes, the describe output was just printing: Waiting for a snapshot default/snapshot-1 to be created by the CSI driver.

Related

GCP's PubSub yields "no available instance" error with scheduled tasks

I would like to set up a scheduled task to run every hour and, say, send emails. In a Firebase project, I've set up a function with .schedule('every 60 minutes') and can see it in the GCP portal as well. Recently, this error started to show up in the logs: "The request was aborted because there was no available instance. Additional troubleshooting documentation can be found at: https://cloud.google.com/functions/docs/troubleshooting#scalability" tied to this scheduled function.
I saw plenty of conversations online tied to "no available instance", mainly due to traffic spikes, etc. But for a scheduled task, what can I do to avoid this? My main goal is not to have to worry whether I have 3 tasks to run in the next hour or 300. I want it to auto-scale and do the magic that it does. Am I missing something here?
Thanks

Airflow ECS-Operator not fetching CloudWatch Logs

I'm using Airflow's EcsOperator, with ECS tasks writing to CloudWatch.
Sometimes the Airflow log fetcher collects logs from CloudWatch and sometimes it does not.
On the CloudWatch console, I always see the logs.
On tasks that take a long time, I usually see the log, or at least part of it.
Has anyone had the same issue with EcsOperator?
First, ECSOperator is deprecated and was removed in provider version 5.0.0.
You should switch to EcsRunTaskOperator.
EcsRunTaskOperator has an awslogs_fetch_interval parameter, which controls the interval at which logs are fetched from ECS. The default is 30 seconds. If you wish for more frequent polls, set the parameter value accordingly.
You didn't mention which provider version you are on, but this part of the code was refactored in version 5.0.0 (PR), so upgrading the Amazon provider might also resolve your issue.
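For illustration, a minimal sketch of what that might look like, assuming a recent Airflow 2.x and the Amazon provider >= 5.0.0; the cluster, task definition, and log group names are placeholders, not taken from the question:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.amazon.aws.operators.ecs import EcsRunTaskOperator

with DAG(
    dag_id="ecs_logs_example",            # placeholder DAG id
    start_date=datetime(2023, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    run_task = EcsRunTaskOperator(
        task_id="run_ecs_task",
        cluster="my-cluster",                      # placeholder cluster name
        task_definition="my-task-def",             # placeholder task definition
        overrides={"containerOverrides": []},
        # CloudWatch settings the log fetcher uses to pull task logs:
        awslogs_group="/ecs/my-task-def",          # placeholder log group
        awslogs_region="us-east-1",
        awslogs_stream_prefix="ecs/my-container",  # prefix/container-name
        # Poll CloudWatch more often than the 30-second default:
        awslogs_fetch_interval=timedelta(seconds=10),
    )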

Google Cloud Composer (Apache Airflow) cannot access log files

I'm running a DAG in Google Cloud Composer (hosted Airflow) which runs fine in Airflow locally. All it does is print "Hello World". However, when I run it through Cloud Composer I receive the error:
*** Log file does not exist: /home/airflow/gcs/logs/matts_custom_dag/main_test/2020-04-20T23:46:53.652833+00:00/2.log
*** Fetching from: http://airflow-worker-d775d7cdd-tmzj9:8793/log/matts_custom_dag/main_test/2020-04-20T23:46:53.652833+00:00/2.log
*** Failed to fetch log file from worker. HTTPConnectionPool(host='airflow-worker-d775d7cdd-tmzj9', port=8793): Max retries exceeded with url: /log/matts_custom_dag/main_test/2020-04-20T23:46:53.652833+00:00/2.log (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f8825920160>: Failed to establish a new connection: [Errno -2] Name or service not known',))
I've also tried making the DAG add data into a database, and it actually succeeds 50% of the time. However, it always returns this error message (and no other print statements or logs). Any help on why this might be happening is much appreciated.
We also faced the same issue, then raised a support ticket to GCP and got the following reply:
The message is related to the latency of syncing logs from Airflow workers to the webserver; it takes at least a few minutes (depending on the number of objects and their size).
The total log size does not seem large, but it is enough to noticeably slow down synchronization, hence we recommend cleaning up/archiving the logs.
Basically, we recommend relying on Stackdriver logs instead, because of the latency due to the design of this sync.
I hope this will help you solve the problem.
I had the same problem after upgrading Google Composer from 1.10.3 to 1.10.6.
I can see in my logs that Airflow is trying to get the logs from a bucket with a name ending with -tenant, while the bucket in my account ends with -bucket.
In the configuration, I can see something weird too.
## airflow.cfg
[core]
remote_base_log_folder = gs://us-east1-dada-airflow-xxxxx-bucket/logs
## also in the running configuration says
core remote_base_log_folder gs://us-east1-dada-airflow-xxxxx-tenant/logs env var
I wrote to Google support and they said the team is working on a fix.
EDIT:
I've been accessing my logs with gsutil, replacing the bucket name suffix with -bucket:
gsutil cat gs://us-east1-dada-airflow-xxxxx-bucket/logs/...../5.logs
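If you prefer Python over gsutil, a minimal sketch using the google-cloud-storage client; the bucket and object names below are placeholders modelled on the paths above, not the real ones:

from google.cloud import storage

# Assumes application default credentials with read access to the bucket.
client = storage.Client()
bucket = client.bucket("us-east1-dada-airflow-xxxxx-bucket")  # placeholder: use the -bucket name
blob = bucket.blob("logs/my_dag/my_task/2020-04-20T00:00:00+00:00/1.log")  # placeholder object path
print(blob.download_as_text())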
I faced the same situation on multiple occasions.
Right after a job finished, looking at its log in the Airflow web UI used to give me the same error, although when I checked the same logs in the UI a minute or two later, I could see them properly.
As per the above answers, it's a sync issue between the webserver and the worker node.
In general, the issue described here is more of a sporadic issue.
In certain situations, what could help is setting default-task-retries to a value that allows a task to be retried at least once.
This issue is resolved at least since Airflow version: 1.10.10+composer.
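As a per-DAG illustration of the retry suggestion above, here is a minimal sketch (assuming a recent Airflow 2.x; the DAG id and command are placeholders) that sets retries through default_args so each task is retried at least once:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "retries": 1,                         # retry each task at least once
    "retry_delay": timedelta(minutes=5),  # wait before retrying
}

with DAG(
    dag_id="hello_world_with_retries",    # placeholder DAG id
    start_date=datetime(2023, 1, 1),
    schedule=None,
    catchup=False,
    default_args=default_args,
) as dag:
    BashOperator(task_id="say_hello", bash_command="echo 'Hello World'")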

Cannot start Teradata database using "tpa start"

When I try to log on with BTEQ, I get the following error:
*** Warning: RDBMS CRASHED OR SESSIONS RESET. RECOVERY IN PROGRESS.
and then I try to restart the database:
pdestate -a
status:
PDE state: DOWN/HARDSTOP
start:
/etc/init.d/tpa start
result:
Teradata Database Initiator service is starting...
Teradata Database Initiator service started successfully.
Then I check the status again:
PDE state: DOWN/HARDSTOP
I tried many times, but still cannot start the database, and I don't know the reason.
From the "Messages" manual:
Remove file
13821 Crash loop detected during TPA initialization. Explanation: In order to prevent PDE initialization from looping forever when an error prevents it from completing, it is only automatically restarted three times. After three unsuccessful initialization attempts, this message is issued and restart attempts are discontinued, leaving PDE down.
rm /var/opt/teradata/tdtemp/PanicLoopDetected
To start the Teradata DBMS, use
/etc/init.d/tpa start
To confirm that the database has come up, type
pdestate -a
After giving the command, wait a minute or so and give the same command again. Keep trying. After a few tries (minutes), the database will be started. It looks weird, but it works :- )

Task with no status leads to DAG failure

I have a DAG that fetches data from Elasticsearch and ingests it into the data lake. The first task, BeginIngestion, fans out into several tasks (one for each resource), and those tasks fan out into more tasks (one for each shard). After the shards are fetched, the data is uploaded to S3 and the flow then converges into a task EndIngestion, followed by a task AuditIngestion.
It was executing correctly, but now all tasks execute successfully while the "closing" task EndIngestion remains with no status. When I refresh the webserver's page, the DAG is marked as Failed.
This image shows successful upstream tasks, with the task end_ingestion with no status and the DAG marked as Failed.
I also dug into the task instance details and found
Dagrun Running: Task instance's dagrun was not in the 'running' state but in the state 'failed'.
Trigger Rule: Task's trigger rule 'all_success' requires all upstream tasks to have succeeded, but found 1 non-success(es). upstream_tasks_state={'failed': 0, 'upstream_failed': 0, 'skipped': 0, 'done': 49, 'successes': 49}, upstream_task_ids=['s3_finish_upload_ingestion_raichucrud_complain', 's3_finish_upload_ingestion_raichucrud_interaction', 's3_finish_upload_ingestion_raichucrud_company', 's3_finish_upload_ingestion_raichucrud_user', 's3_finish_upload_ingestion_raichucrud_privatecontactinteraction', 's3_finish_upload_ingestion_raichucrud_location', 's3_finish_upload_ingestion_raichucrud_companytoken', 's3_finish_upload_ingestion_raichucrud_indexevolution', 's3_finish_upload_ingestion_raichucrud_companyindex', 's3_finish_upload_ingestion_raichucrud_producttype', 's3_finish_upload_ingestion_raichucrud_categorycomplainsto', 's3_finish_upload_ingestion_raichucrud_companyresponsible', 's3_finish_upload_ingestion_raichucrud_category', 's3_finish_upload_ingestion_raichucrud_additionalfieldoption', 's3_finish_upload_ingestion_raichucrud_privatecontactconfiguration', 's3_finish_upload_ingestion_raichucrud_phone', 's3_finish_upload_ingestion_raichucrud_presence', 's3_finish_upload_ingestion_raichucrud_responsible', 's3_finish_upload_ingestion_raichucrud_store', 's3_finish_upload_ingestion_raichucrud_socialprofile', 's3_finish_upload_ingestion_raichucrud_product', 's3_finish_upload_ingestion_raichucrud_macrorankingpresenceto', 's3_finish_upload_ingestion_raichucrud_macroinfoto', 's3_finish_upload_ingestion_raichucrud_raphoneproblem', 's3_finish_upload_ingestion_raichucrud_macrocomplainsto', 's3_finish_upload_ingestion_raichucrud_testimony', 's3_finish_upload_ingestion_raichucrud_additionalfield', 's3_finish_upload_ingestion_raichucrud_companypageblockitem', 's3_finish_upload_ingestion_raichucrud_rachatconfiguration', 's3_finish_upload_ingestion_raichucrud_macrorankingitemto', 's3_finish_upload_ingestion_raichucrud_purchaseproduct', 's3_finish_upload_ingestion_raichucrud_rachatproblem', 's3_finish_upload_ingestion_raichucrud_role', 's3_finish_upload_ingestion_raichucrud_requestmoderation', 's3_finish_upload_ingestion_raichucrud_categoryproblemto', 's3_finish_upload_ingestion_raichucrud_companypageblock', 's3_finish_upload_ingestion_raichucrud_problemtype', 's3_finish_upload_ingestion_raichucrud_key', 's3_finish_upload_ingestion_raichucrud_macro', 's3_finish_upload_ingestion_raichucrud_url', 's3_finish_upload_ingestion_raichucrud_document', 's3_finish_upload_ingestion_raichucrud_transactionkey', 's3_finish_upload_ingestion_raichucrud_catprobitemcompany', 's3_finish_upload_ingestion_raichucrud_privatecontactinteraction', 's3_finish_upload_ingestion_raichucrud_categoryinfoto', 's3_finish_upload_ingestion_raichucrud_marketplace', 's3_finish_upload_ingestion_raichucrud_macroproblemto', 's3_finish_upload_ingestion_raichucrud_categoryrankingto', 's3_finish_upload_ingestion_raichucrud_macrorankingto', 's3_finish_upload_ingestion_raichucrud_categorypageto']
As you can see, the "Trigger Rule" field says that one of the tasks is in a non-successful state, but at the same time the stats show that all upstream tasks are marked as successful.
If I reset the database, it doesn't happen, but I can't reset it for every execution (hourly), and I don't want to.
Can anyone shed some light on this?
PS: I am running on an EC2 instance (c4.xlarge) with LocalExecutor.
[EDIT]
I found in the scheduler log that the DAG is in deadlock:
[2017-08-25 19:25:25,821] {models.py:4076} DagFileProcessor157 INFO - Deadlock; marking run failed
I guess this may be due to some exception handling.
I have had this exact issue before; in my case, my code was generating duplicate task IDs. It looks like in your case there is also a duplicate ID:
s3_finish_upload_ingestion_raichucrud_privatecontactinteraction
This is probably a year late for you, but hopefully it will save others lots of debugging time :)
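For illustration, a minimal sketch of how duplicate task IDs can appear when tasks are generated in a loop, and one way to guard against them; the resource list and DAG id below are placeholders, and the exact behaviour on duplicates differs between Airflow versions (newer ones raise an error, older ones could silently corrupt the dependency graph):

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

# Placeholder resource list; note the repeated entry, which yields a duplicate task id.
resources = ["complain", "privatecontactinteraction", "privatecontactinteraction"]

with DAG(dag_id="ingestion_example", start_date=datetime(2023, 1, 1),
         schedule=None, catchup=False) as dag:
    end = EmptyOperator(task_id="end_ingestion")
    seen = set()
    for resource in resources:
        task_id = f"s3_finish_upload_ingestion_raichucrud_{resource}"
        if task_id in seen:
            # Skip (or better, fix at the source) duplicates instead of
            # registering two tasks with the same id.
            continue
        seen.add(task_id)
        EmptyOperator(task_id=task_id) >> end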