I'm facing some issues on my Cloud Composer instance resulting in failed tasks.
Details of the instance configuration:
Composer image: composer-2.0.29-airflow-2.3.3 / Airflow version: 2.3.3
airflow.cfg: parallelism = 32 / dag_concurrency = 100 / worker_concurrency = 24
In terms of resources:
I have 60 DAGs, each of which can contain up to 55 tasks that need to run in parallel.
They don't do any heavy compute, only some light PythonOperator/GCSOperator/BigQueryOperator work.
I often encounter this type of error:
*** Log file is not found: gs://xxx/xxx/attempt=2.log.
*** The task might not have been executed or worker executing it might have finished abnormally (e.g. was evicted).
*** Please, refer to https://cloud.google.com/composer/docs/how-to/using/troubleshooting-dags#common_issues hints to learn what might be possible reasons for a missing log.
All of my tasks have 3 retries, but when this happens the task stops at 2 retries and sends a failure email. I don't understand why. Example of the error in the mail sent:
Try 2 out of 3
Exception:
Executor reports task instance finished (failed) although the task says its queued. (Info: None) Was the task killed externally?
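For reference, my retries are configured roughly like the minimal sketch below (the DAG id, task ids, and callables are illustrative, not my real code):

import time
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

default_args = {
    "retries": 3,                         # every task should retry 3 times
    "retry_delay": timedelta(minutes=2),  # pause between attempts
}

with DAG(
    dag_id="example_parallel_dag",        # illustrative name
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    # Up to 55 lightweight tasks fan out in parallel, as described above.
    for i in range(55):
        PythonOperator(
            task_id=f"light_task_{i}",
            python_callable=time.sleep,   # placeholder for the real light work
            op_args=[1],
        )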
I also receive random zombie tasks ("Detected as zombie").
When I clear the task, it succeeds as it should.
(I don't have access to GKE, but I can ask for access if that helps.)
Any advice on how to prevent these errors and understand what is happening?
Related
My DAGs consist of a large number of tasks (more than 600), and while running a DAG it sometimes behaves inconsistently, i.e. a task fails and its status shows as 'Not yet started'.
Please help me understand how to resolve this error.
Information:
Composer 2.0.10
Airflow 2.2.3
Executor: Celery
Workers: 2 to 6 (autoscaling)
Schedulers: 2
I have a mobile and web app that use Firebase Realtime Database, and there are some long-running tasks which are served on servers with the help of firebase-queue and firebase-admin. The long-running task is to find out who else in a person's contact book is using the mobile app. So, if you install the app, the app will send a task to the server with your contact book data and ask it to find the people in the contact book who are also using the app. Every now and then I see two types of errors in the logs. The first error is below; it also causes the node process to stop.
/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:15373
return queue[i].onComplete(new Error(abortReason), false, null);
^
Error: maxretry
at /Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:15373:52
at exceptionGuard (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:4018:9)
at repoRerunTransactionQueue (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:15386:9)
at repoRerunTransactions (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:15279:5)
at /Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:15260:13
at /Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:7061:17
at PersistentConnection.onDataMessage_ (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:7088:17)
at Connection.onDataMessage_ (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:5882:14)
at Connection.onPrimaryMessageReceived_ (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:5876:18)
at WebSocketConnection.onMessage (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:5778:27)
at WebSocketConnection.appendFrame_ (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:4491:18)
at WebSocketConnection.handleIncomingFrame (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:4539:22)
at Client.mySock.onmessage (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:4438:19)
at Client.dispatchEvent (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:2883:30)
at Client._receiveMessage (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:3042:10)
at Client$2.<anonymous> (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:2924:49)
at Client$2.emit (node:events:539:35)
at Client$2.emit (node:domain:475:12)
at Client$2.<anonymous> (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:2186:14)
at pipe (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:1503:40)
at Pipeline$1._loop (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:1510:3)
at Pipeline$1.processIncomingMessage (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:1479:8)
at Extensions$1.processIncomingMessage (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:1645:20)
at Client$2._emitMessage (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:2177:22)
at Client$2._emitFrame (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:2137:19)
at Client$2.parse (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:1863:18)
at Client$2.parse (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:2369:60)
at IO.write (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:186:16)
at TLSSocket.ondata (node:internal/streams/readable:754:22)
at TLSSocket.emit (node:events:527:28)
at TLSSocket.emit (node:domain:475:12)
at addChunk (node:internal/streams/readable:315:12)
at readableAddChunk (node:internal/streams/readable:289:9)
at TLSSocket.Readable.push (node:internal/streams/readable:228:10)
at TLSWrap.onStreamRead (node:internal/stream_base_commons:190:23)
There isn't much information about what exactly caused the maxretry error. The error happens at random after a few days of running the script. It doesn't happen right away.
The second error I see, which isn't as disruptive as the one above, is:
[2022-06-01T10:30:49.722Z] @firebase/database: FIREBASE WARNING: transaction at /queue/tasks/-N3TDQCAdt4y-akb0_MK failed: disconnect
This doesn't stop the node process, and I can see that a transaction failed, but I'm not sure why it disconnected or how I can resolve the problem.
I am using firebase-admin 9.6.0.
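To make the pattern concrete, the queue handling is conceptually the sketch below. My server uses the Node SDK; this is written with the Python Admin SDK purely as a compact illustration, and the credentials file, database URL, paths, and field names are all hypothetical:

import firebase_admin
from firebase_admin import credentials, db

# Hypothetical credentials file and database URL.
cred = credentials.Certificate("service-account.json")
firebase_admin.initialize_app(cred, {
    "databaseURL": "https://myapp-default-rtdb.firebaseio.com",
})

def claim_task(task):
    # Transaction update function: it runs against the current value and is
    # retried automatically whenever another writer changes the location
    # first. A transaction that keeps conflicting is eventually aborted -
    # in the Node SDK that surfaces as Error: maxretry.
    if task is not None and task.get("_state") == "available":  # illustrative field
        task["_state"] = "in_progress"
    return task

# Hypothetical task id under the firebase-queue path from the warning above.
ref = db.reference("queue/tasks/some-task-id")
try:
    ref.transaction(claim_task)
except db.TransactionAbortedError:
    # Assumption: the Python SDK's analogue of the Node maxretry failure,
    # raised once the internal retry limit is exhausted.
    print("transaction aborted after repeated conflicts")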
I have S3 remote logging enabled, and Airflow is installed on an EC2 instance. My DAGs are running; however, they don't always create a log, and then they fail. The error is as follows:
*** Falling back to local log
*** Log file does not exist: /home/ec2-user/airflow/logs/REMOVED/REMOVED/2022-03-07T07:00:00+00:00/2.log
*** Fetching from: http://ip-10-105-32-92.eu-west-1.compute.internal:8793/log/REMOVED/REMOVED/2022-03-07T07:00:00+00:00/2.log
*** Failed to fetch log file from worker. Client error '404 NOT FOUND' for url 'http://ip-10-105-32-92.eu-west-1.compute.internal:8793/log/REMOVED/REMOVED/2022-03-07T07:00:00+00:00/2.log'
For more information check: https://httpstatuses.com/404
After a few attempts (3-5), it eventually does end up working.
I have even disabled the remote logging in an attempt to debug, and it still doesn't work. Any suggestions?
Apache Airflow version: 2.2.4
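For reference, the remote logging setup in airflow.cfg looks roughly like this (the bucket path and connection id are illustrative, not my real values):

[logging]
remote_logging = True
remote_base_log_folder = s3://my-airflow-logs/logs
remote_log_conn_id = aws_default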
We use a DescribeStacks API call to get the latest ECS Task Definition, and I've noticed we get lots of these errors:
An error occurred (Throttling) when calling the DescribeStacks operation (reached max retries: 4): Rate exceeded
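A minimal sketch of the call with botocore's retry budget raised (assuming boto3; the stack name is illustrative). The "max retries: 4" in the error appears to match botocore's default legacy retry mode:

import boto3
from botocore.config import Config

# Raise the retry ceiling and enable client-side adaptive rate limiting so
# bursts of DescribeStacks calls back off instead of failing with Throttling.
config = Config(retries={"max_attempts": 10, "mode": "adaptive"})
cfn = boto3.client("cloudformation", config=config)

response = cfn.describe_stacks(StackName="my-ecs-service-stack")  # illustrative
latest_stack = response["Stacks"][0]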
Setup
Corda 4.6
Working from Java template
I have been experimenting with adding up to 10 attachments of small (1 KB) zip files to a transaction.
Error when testing with StartedMockNodes:
io.github.classgraph.ClassGraphException: Uncaught exception during scan
at io.github.classgraph.ClassGraphException.newClassGraphException(ClassGraphException.java:89) ~[classgraph-4.8.90.jar:4.8.90]
at io.github.classgraph.ClassGraph.scan(ClassGraph.java:1555) ~[classgraph-4.8.90.jar:4.8.90]
...
Caused by: java.lang.OutOfMemoryError: Java heap space
at nonapi.io.github.classgraph.fastzipfilereader.NestedJarHandler.readAllBytesWithSpilloverToDisk(NestedJarHandler.java:815) ~[classgraph-4.8.90.jar:4.8.90]
at nonapi.io.github.classgraph.fastzipfilereader.PhysicalZipFile.<init>(PhysicalZipFile.java:161) ~[classgraph-4.8.90.jar:4.8.90]
at nonapi.io.github.classgraph.fastzipfilereader.NestedJarHandler.downloadJarFromURL(NestedJarHandler.java:576) ~[classgraph-4.8.90.jar:4.8.90]
...
Error when testing local nodes built with CordForm and connecting with RPC:
The node stops suddenly, with no errors in the log. In the directory of the failed node there will be two files:
hs_err_pid20400.log
java_pid20400.hprof
The log file shows errors similar to the StartedMockNode failures:
j nonapi.io.github.classgraph.fastzipfilereader.NestedJarHandler.readAllBytesWithSpilloverToDisk(Ljava/io/InputStream;Ljava/lang/String;JLnonapi/io/github/classgraph/utils/LogNode;)Lnonapi/io/github/classgraph/fileslice/Slice;+65
j nonapi.io.github.classgraph.fastzipfilereader.PhysicalZipFile.<init>(Ljava/io/InputStream;JLjava/lang/String;Lnonapi/io/github/classgraph/fastzipfilereader/NestedJarHandler;Lnonapi/io/github/classgraph/utils/LogNode;)V+25
j nonapi.io.github.classgraph.fastzipfilereader.NestedJarHandler.downloadJarFromURL(Ljava/lang/String;Lnonapi/io/github/classgraph/utils/LogNode;)Lnonapi/io/github/classgraph/fastzipfilereader/PhysicalZipFile;+428
j nonapi.io.github.classgraph.fastzipfilereader.NestedJarHandler.access$000(Lnonapi/io/github/classgraph/fastzipfilereader/NestedJarHandler;Ljava/lang/String;Lnonapi/io/github/classgraph/utils/LogNode;)Lnonapi/io/github/classgraph/fastzipfilereader/PhysicalZipFile;+3
j nonapi.io.github.classgraph.fastzipfilereader.NestedJarHandler$4.newInstance(Ljava/lang/String;Lnonapi/io/github/classgraph/utils/LogNode;)Ljava/util/Map$Entry;+124
Clarification #1: The error occurs during transaction execution, not when originally uploading the files to the node using CordaRPCOps.uploadAttachmentWithMetadata (that works fine).
Clarification #2: The first node to fail is the one constructing the transaction. If you try restarting this node, it will fail on restart, and it will take several restarts to get up and running again. Then any node that was receiving the transaction will fail; they will also require several restarts to get up and running again. As a testament to Corda's flow framework, after enough restarts the transaction will eventually succeed and the attachments will be transmitted.
Clarification #3: I can pre-upload the attachments to all the nodes before executing the transaction, and the failures still occur.
StartedMockNodes:
I found a fix for this: adding the following to my workFlows build.gradle file stops the errors:
test {
maxHeapSize = "4096m"
}
Local Nodes & RPC:
??? - I haven't found a solution yet.
I am trying to generate the basic nodes (PartyA, PartyB, and Notary) on Ubuntu 14 by running ./gradlew deployNodes or even ./gradlew clean deployNodes. The error reads:
... still waiting. If this is taking longer than usual, check the node logs.
Error while generating node info file /cordapp-template-java/build/nodes/Notary/logs
Error while generating node info file /cordapp-template-java/build/nodes/PartyB/logs
Error while generating node info file /cordapp-template-java/build/nodes/PartyA/logs
Task :deployNodes FAILED
FAILURE: Build failed with an exception.
What went wrong:
Execution failed for task ':deployNodes'.
Error while generating node info file. Please check the logs in /cordapp-template-java/build/nodes/Notary/logs.
Error while generating node info file. Please check the logs in /cordapp-template-java/build/nodes/Notary/logs.
The error logs do not provide any indication of the cause.
I have personally run into this issue myself. From what I saw, it seemed to be a random incident on a Unix-based machine.
The issue was resolved after I moved the project to a different location. It is absurd, but I have never run into the issue again.