Google Cloud Composer (Airflow) - Scalability issues

I'm facing issues on my Cloud Composer instance that result in failed tasks.
Details of the instance configuration:
Composer image: composer-2.0.29-airflow-2.3.3 / Airflow version: 2.3.3
Airflow.cfg:
parallelism = 32 / dag_concurrency = 100 / worker_concurrency = 24
In terms of resources:
I have 60 DAGs, which can contain up to 55 tasks that need to run in parallel.
They don't do any heavy compute, only some light PythonOperator/GCSOperator/BigQueryOperator tasks.
I often encounter this type of error:
*** Log file is not found: gs://xxx/xxx/attempt=2.log.
*** The task might not have been executed or worker executing it might have finished abnormally (e.g. was evicted).
*** Please, refer to https://cloud.google.com/composer/docs/how-to/using/troubleshooting-dags#common_issues hints to learn what might be possible reasons for a missing log.
All of my tasks have 3 retries, but when this happens, for some reason it stops at 2 retries and sends a failure email. I don't understand why. Example of the error in the email sent:
Try 2 out of 3
Exception:
Executor reports task instance finished (failed) although the task says its queued. (Info: None) Was the task killed externally?
I also get random zombie tasks ("Detected as zombie").
My metrics are the following:
When I clear the task, it succeeds as it should.
(I don't have access to GKE, but I can ask for access if it helps.)
Any advice on how to prevent these errors and understand what is happening?
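As a rough sanity check on the numbers above, the quoted limits can be compared with the peak demand of a single DAG. This is only a sketch: it assumes a single scheduler, that `parallelism` caps the task instances running at once, and that `worker_concurrency` applies per worker. The worker counts are hypothetical, since the question does not state them, and the helper names `running_cap` and `waves_needed` are made up for illustration. It does not explain the failures by itself; it only shows how many tasks have to wait in the queue.

```python
import math

# Values quoted in the question (airflow.cfg).
PARALLELISM = 32         # cap on concurrently running task instances
WORKER_CONCURRENCY = 24  # task slots per worker

# Workload quoted in the question.
PEAK_TASKS_PER_DAG = 55


def running_cap(parallelism: int, worker_concurrency: int, workers: int) -> int:
    """Upper bound on task instances that can run at the same time."""
    return min(parallelism, worker_concurrency * workers)


def waves_needed(tasks: int, cap: int) -> int:
    """Number of 'waves' needed to run `tasks` when only `cap` run at once."""
    return math.ceil(tasks / cap)


# Hypothetical worker counts; the real number is not given in the question.
for workers in (1, 2, 3):
    cap = running_cap(PARALLELISM, WORKER_CONCURRENCY, workers)
    waves = waves_needed(PEAK_TASKS_PER_DAG, cap)
    print(f"workers={workers}: at most {cap} running, "
          f"{max(0, PEAK_TASKS_PER_DAG - cap)} waiting, {waves} wave(s)")
```

Under these assumptions, adding workers beyond two does not raise the cap, because `parallelism = 32` becomes the limiting value.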

Related

Apache Airflow on Composer 2.0 Task Failed with "Not yet started" issue

My DAGs consist of a large number of tasks (more than 600), and a DAG run sometimes behaves inconsistently: a task fails and its status is shown as 'Not yet started'.
Please help me resolve this error.
Information:
Composer 2.0.10
Airflow 2.2.3
Executor : Celery
Workers : 2 to 6 Auto scale
Schedulers : 2

firebase-admin fails with error maxretry without much information about the underlying cause and stops the node script also

I have a mobile and web app that use the Firebase Realtime Database, and there are some long-running tasks that are served on servers with the help of firebase-queue and firebase-admin. The long-running task is to find out who else in a person's contact book is using the mobile app. So, if you install the app, the app will send a task to the server with your contact book data and ask it to find the people in the contact book who are also using the app. Every now and then I see two types of errors in the logs. The first error is below; it also causes the node process to stop.
/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:15373
return queue[i].onComplete(new Error(abortReason), false, null);
^
Error: maxretry
at /Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:15373:52
at exceptionGuard (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:4018:9)
at repoRerunTransactionQueue (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:15386:9)
at repoRerunTransactions (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:15279:5)
at /Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:15260:13
at /Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:7061:17
at PersistentConnection.onDataMessage_ (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:7088:17)
at Connection.onDataMessage_ (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:5882:14)
at Connection.onPrimaryMessageReceived_ (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:5876:18)
at WebSocketConnection.onMessage (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:5778:27)
at WebSocketConnection.appendFrame_ (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:4491:18)
at WebSocketConnection.handleIncomingFrame (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:4539:22)
at Client.mySock.onmessage (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:4438:19)
at Client.dispatchEvent (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:2883:30)
at Client._receiveMessage (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:3042:10)
at Client$2.<anonymous> (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:2924:49)
at Client$2.emit (node:events:539:35)
at Client$2.emit (node:domain:475:12)
at Client$2.<anonymous> (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:2186:14)
at pipe (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:1503:40)
at Pipeline$1._loop (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:1510:3)
at Pipeline$1.processIncomingMessage (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:1479:8)
at Extensions$1.processIncomingMessage (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:1645:20)
at Client$2._emitMessage (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:2177:22)
at Client$2._emitFrame (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:2137:19)
at Client$2.parse (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:1863:18)
at Client$2.parse (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:2369:60)
at IO.write (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:186:16)
at TLSSocket.ondata (node:internal/streams/readable:754:22)
at TLSSocket.emit (node:events:527:28)
at TLSSocket.emit (node:domain:475:12)
at addChunk (node:internal/streams/readable:315:12)
at readableAddChunk (node:internal/streams/readable:289:9)
at TLSSocket.Readable.push (node:internal/streams/readable:228:10)
at TLSWrap.onStreamRead (node:internal/stream_base_commons:190:23)
There isn't much information about what exactly caused the maxretry error. The error happens at random after a few days of running the script. It doesn't happen right away.
The second error I see, which isn't as disruptive as the one above, is:
[2022-06-01T10:30:49.722Z] @firebase/database: FIREBASE WARNING: transaction at /queue/tasks/-N3TDQCAdt4y-akb0_MK failed: disconnect
This doesn't stop the node process, and I can see that a transaction failed, but I'm not sure why it disconnected or how I can resolve the problem.
I am using firebase-admin 9.6.0.

How to fix Airflow logging?

I have S3 remote logging enabled, and Airflow is installed on an EC2 instance. My DAGs are running; however, they don't always create a log, and then they fail. The error is as follows:
*** Falling back to local log
*** Log file does not exist: /home/ec2-user/airflow/logs/REMOVED/REMOVED/2022-03-07T07:00:00+00:00/2.log
*** Fetching from: http://ip-10-105-32-92.eu-west-1.compute.internal:8793/log/REMOVED/REMOVED/2022-03-07T07:00:00+00:00/2.log
*** Failed to fetch log file from worker. Client error '404 NOT FOUND' for url 'http://ip-10-105-32-92.eu-west-1.compute.internal:8793/log/REMOVED/REMOVED/2022-03-07T07:00:00+00:00/2.log'
For more information check: https://httpstatuses.com/404
After a few attempts (3-5), it eventually does end up working.
I have even disabled the remote logging in an attempt to debug, and it still doesn't work. Any suggestions?
Apache Airflow version: 2.2.4
We use a describe stacks API call to get the latest ECS Task Definition, and I've noticed we have lots of these errors:
An error occurred (Throttling) when calling the DescribeStacks operation (reached max retries: 4): Rate exceeded
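Throttling errors of this kind are commonly handled by retrying with exponential backoff and jitter. The following is a minimal sketch of that idea in plain Python, not the retry logic used by the AWS SDK itself. `ThrottlingError`, `call_with_backoff` and `fake_describe_stacks` are hypothetical names introduced for illustration; in real code the wrapped function would be the actual API call and the caught exception would be whatever the client raises on "Rate exceeded".

```python
import random
import time


class ThrottlingError(Exception):
    """Hypothetical stand-in for a 'Rate exceeded' error from the API client."""


def call_with_backoff(func, max_attempts=8, base_delay=0.5, max_delay=20.0,
                      sleep=time.sleep):
    """Call func(), retrying on ThrottlingError with exponential backoff.

    The wait before each retry is drawn uniformly from
    [0, min(max_delay, base_delay * 2**attempt)] ("full jitter").
    The last failure is re-raised once max_attempts is reached.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except ThrottlingError:
            if attempt == max_attempts - 1:
                raise
            upper = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, upper))


# Usage with a fake call that is throttled three times before succeeding.
attempts = {"n": 0}


def fake_describe_stacks():
    attempts["n"] += 1
    if attempts["n"] <= 3:
        raise ThrottlingError("Rate exceeded")
    return "ok"


# sleep is replaced with a no-op so the demo does not actually wait.
result = call_with_backoff(fake_describe_stacks, sleep=lambda seconds: None)
print(result, attempts["n"])
```

The jitter spreads retries from many concurrent tasks over time, so they are less likely to hit the rate limit again all at once.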

Including Multiple Attachments In Transaction Kills Node

Setup
Corda 4.6
Working from Java template
I have been experimenting with adding up to 10 attachments of small (1 KB) zip files to a transaction.
Error when testing with StartedMockNodes:
io.github.classgraph.ClassGraphException: Uncaught exception during scan
at io.github.classgraph.ClassGraphException.newClassGraphException(ClassGraphException.java:89) ~[classgraph-4.8.90.jar:4.8.90]
at io.github.classgraph.ClassGraph.scan(ClassGraph.java:1555) ~[classgraph-4.8.90.jar:4.8.90]
...
Caused by: java.lang.OutOfMemoryError: Java heap space
at nonapi.io.github.classgraph.fastzipfilereader.NestedJarHandler.readAllBytesWithSpilloverToDisk(NestedJarHandler.java:815) ~[classgraph-4.8.90.jar:4.8.90]
at nonapi.io.github.classgraph.fastzipfilereader.PhysicalZipFile.<init>(PhysicalZipFile.java:161) ~[classgraph-4.8.90.jar:4.8.90]
at nonapi.io.github.classgraph.fastzipfilereader.NestedJarHandler.downloadJarFromURL(NestedJarHandler.java:576) ~[classgraph-4.8.90.jar:4.8.90]
...
Error when testing local nodes built with CordForm and connecting with RPC:
Node will stop suddenly. No errors in the log. In the directory of the failed node there will be two files:
hs_err_pid20400.log
java_pid20400.hprof
The log file has errors similar to the StartedMockNodes failures:
j nonapi.io.github.classgraph.fastzipfilereader.NestedJarHandler.readAllBytesWithSpilloverToDisk(Ljava/io/InputStream;Ljava/lang/String;JLnonapi/io/github/classgraph/utils/LogNode;)Lnonapi/io/github/classgraph/fileslice/Slice;+65
j nonapi.io.github.classgraph.fastzipfilereader.PhysicalZipFile.<init>(Ljava/io/InputStream;JLjava/lang/String;Lnonapi/io/github/classgraph/fastzipfilereader/NestedJarHandler;Lnonapi/io/github/classgraph/utils/LogNode;)V+25
j nonapi.io.github.classgraph.fastzipfilereader.NestedJarHandler.downloadJarFromURL(Ljava/lang/String;Lnonapi/io/github/classgraph/utils/LogNode;)Lnonapi/io/github/classgraph/fastzipfilereader/PhysicalZipFile;+428
j nonapi.io.github.classgraph.fastzipfilereader.NestedJarHandler.access$000(Lnonapi/io/github/classgraph/fastzipfilereader/NestedJarHandler;Ljava/lang/String;Lnonapi/io/github/classgraph/utils/LogNode;)Lnonapi/io/github/classgraph/fastzipfilereader/PhysicalZipFile;+3
j nonapi.io.github.classgraph.fastzipfilereader.NestedJarHandler$4.newInstance(Ljava/lang/String;Lnonapi/io/github/classgraph/utils/LogNode;)Ljava/util/Map$Entry;+124
Clarification #1: The error occurs during transaction execution. Not when originally uploading the files to the node using CordaRPCOps.uploadAttachmentWithMetadata (that works fine).
Clarification #2: The first node to fail is the one constructing the transaction. If you try restarting this node, it will fail on restart. It will take several restarts to get it up and running again. Then any node that was receiving the transaction will fail. Those nodes will also require several restarts to get up and running again. As a testament to Corda's Flow framework, after enough restarts the transaction will eventually succeed and the attachments will be transmitted.
Clarification #3: I can pre-upload the Attachments to all the nodes before executing the transaction and the failures still occur.
StartedMockNodes:
Found this
Added the following to my workFlows build.gradle file to stop the errors:
test {
    maxHeapSize = "4096m"
}
Local Nodes & RPC:
??? - haven't found a solution yet

Corda throws error trying to generate the basic nodes

I am trying to generate the basic nodes (PartyA, PartyB and Notary) on Ubuntu 14 by running ./gradlew deployNodes or even ./gradlew clean deployNodes. The error reads:
... still waiting. If this is taking longer than usual, check the node logs.
Error while generating node info file /cordapp-template-java/build/nodes/Notary/logs
Error while generating node info file /cordapp-template-java/build/nodes/PartyB/logs
Error while generating node info file /cordapp-template-java/build/nodes/PartyA/logs
Task :deployNodes FAILED
FAILURE: Build failed with an exception.
What went wrong:
Execution failed for task ':deployNodes'.
Error while generating node info file. Please check the logs in /cordapp-template-java/build/nodes/Notary/logs.
Error while generating node info file. Please check the logs in /cordapp-template-java/build/nodes/Notary/logs.
The error logs do not provide any indication of error.
I have personally run into this issue myself. From what I saw, it seemed to be a random incident on the Unix-based machine.
The issue was resolved after I moved the project to a different location. It is absurd, but I have never run into this issue again.
