What can the following be due to, and how can I debug it? It happens when closing my MPI application:
[1612979755.727913] [compute-0-9:21112:0] tag_match.c:61 UCX WARN unexpected tag-receive descriptor 0x2b2bf64cdbc0 was not matched
Assuming the application exited normally, this probably means that some process sent a message (e.g. calling MPI_Send) to a destination process that did not post a matching receive before calling MPI_Finalize. See https://github.com/openucx/ucx/issues/6331#issuecomment-778428537
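For illustration, here is a minimal sketch of the pattern that produces the warning. It uses mpi4py (an assumption, purely for illustration); the same applies to MPI_Send/MPI_Recv in C.

# Minimal sketch (mpi4py assumed) of a send with no matching receive.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # Small messages are typically sent eagerly, so this completes locally
    # even though rank 1 never posts a matching receive.
    comm.send("payload", dest=1, tag=42)

# Rank 1 falls through to finalization without a matching
# comm.recv(source=0, tag=42), so the eagerly delivered message is left
# unmatched -- which is what the UCX warning reports.
MPI.Finalize()

The fix is to make sure every send has a matching receive posted before the processes finalize, e.g. `if rank == 1: msg = comm.recv(source=0, tag=42)`.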
Related
I'm facing some issues on my Cloud Composer instance resulting in failed tasks.
Details of the instance configuration:
Composer image: composer-2.0.29-airflow-2.3.3 / Airflow version: 2.3.3
airflow.cfg:
parallelism = 32 / dag_concurrency = 100 / worker_concurrency = 24
In terms of resources:
I have 60 DAGs, each of which can contain up to 55 tasks that need to run in parallel.
They don't do any heavy compute, only some light PythonOperator/GCSOperator/BigQueryOperator tasks.
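For concreteness, each DAG looks roughly like the sketch below (all names are hypothetical):

# Rough sketch of one of these DAGs (all names hypothetical).
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="example_parallel_dag",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 3, "retry_delay": timedelta(minutes=5)},
) as dag:
    for i in range(55):  # up to 55 light tasks running in parallel
        PythonOperator(
            task_id=f"light_task_{i}",
            python_callable=lambda: None,  # stands in for the light work
        )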
I often encounter this type of error:
*** Log file is not found: gs://xxx/xxx/attempt=2.log.
*** The task might not have been executed or worker executing it might have finished abnormally (e.g. was evicted).
*** Please, refer to https://cloud.google.com/composer/docs/how-to/using/troubleshooting-dags#common_issues hints to learn what might be possible reasons for a missing log.
All of my tasks have 3 retries, but when this happens it stops after 2 retries and sends a failure error. I don't understand why. Example of the error in the mail that is sent:
Try 2 out of 3
Exception:
Executor reports task instance finished (failed) although the task says its queued. (Info: None) Was the task killed externally?
I also get random zombie tasks (Detected as zombie).
My metrics are the following:
When I clear the task, it succeeds as it should.
(I don't have access to GKE but if it helps I can ask to have access)
Any advice to prevent these errors and understand what is happening?
I have a mobile and web app that use the Firebase Realtime Database, and there are some long-running tasks which are served on servers with the help of firebase-queue and firebase-admin. The long-running task is to find out who else in a person's contact book is using the mobile app. So, if you install the app, the app will send a task to the server with your contact book data and ask it to find the people in the contact book who are also using the app. Every now and then I see two types of errors in the logs. The first error is below; it also causes the node process to stop.
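(firebase-queue itself is a Node library; purely for illustration, the claim-and-process pattern the server uses looks roughly like this sketch written against the firebase-admin Python SDK, with all paths, URLs and field names hypothetical:)

# Illustrative sketch of the queue-worker pattern described above,
# using the firebase-admin Python SDK. Paths and fields are hypothetical.
import firebase_admin
from firebase_admin import credentials, db

cred = credentials.Certificate("service-account.json")  # hypothetical path
firebase_admin.initialize_app(cred, {
    "databaseURL": "https://myapp.firebaseio.com",  # hypothetical URL
})

tasks_ref = db.reference("queue/tasks")

def claim_task(task):
    """Transactionally mark an unclaimed task as in progress."""
    if task is None or task.get("_state") is not None:
        return task  # already claimed (or deleted); leave unchanged
    task["_state"] = "in_progress"
    return task

for key, task in (tasks_ref.get() or {}).items():
    tasks_ref.child(key).transaction(claim_task)
    # ... match task["contacts"] against registered users, then remove the task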
/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:15373
return queue[i].onComplete(new Error(abortReason), false, null);
^
Error: maxretry
at /Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:15373:52
at exceptionGuard (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:4018:9)
at repoRerunTransactionQueue (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:15386:9)
at repoRerunTransactions (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:15279:5)
at /Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:15260:13
at /Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:7061:17
at PersistentConnection.onDataMessage_ (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:7088:17)
at Connection.onDataMessage_ (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:5882:14)
at Connection.onPrimaryMessageReceived_ (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:5876:18)
at WebSocketConnection.onMessage (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:5778:27)
at WebSocketConnection.appendFrame_ (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:4491:18)
at WebSocketConnection.handleIncomingFrame (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:4539:22)
at Client.mySock.onmessage (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:4438:19)
at Client.dispatchEvent (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:2883:30)
at Client._receiveMessage (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:3042:10)
at Client$2.<anonymous> (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:2924:49)
at Client$2.emit (node:events:539:35)
at Client$2.emit (node:domain:475:12)
at Client$2.<anonymous> (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:2186:14)
at pipe (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:1503:40)
at Pipeline$1._loop (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:1510:3)
at Pipeline$1.processIncomingMessage (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:1479:8)
at Extensions$1.processIncomingMessage (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:1645:20)
at Client$2._emitMessage (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:2177:22)
at Client$2._emitFrame (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:2137:19)
at Client$2.parse (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:1863:18)
at Client$2.parse (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:2369:60)
at IO.write (/Users/varungupta/Projects/myapp-server/node_modules/@firebase/database-compat/dist/index.standalone.js:186:16)
at TLSSocket.ondata (node:internal/streams/readable:754:22)
at TLSSocket.emit (node:events:527:28)
at TLSSocket.emit (node:domain:475:12)
at addChunk (node:internal/streams/readable:315:12)
at readableAddChunk (node:internal/streams/readable:289:9)
at TLSSocket.Readable.push (node:internal/streams/readable:228:10)
at TLSWrap.onStreamRead (node:internal/stream_base_commons:190:23)
There isn't much information about what exactly caused the maxretry error. The error happens at random after a few days of running the script. It doesn't happen right away.
The second error that I see, which isn't as disruptive as the one above, is:
[2022-06-01T10:30:49.722Z] @firebase/database: FIREBASE WARNING: transaction at /queue/tasks/-N3TDQCAdt4y-akb0_MK failed: disconnect
This doesn't stop the node process, and I can see that a transaction failed, but I'm not sure why it disconnected or how I can resolve this problem.
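(For comparison only: in the firebase-admin Python SDK the analogous give-up-after-retries condition surfaces as a catchable exception rather than crashing the process; a sketch, with a hypothetical path:)

# Sketch (Python admin SDK, for illustration): a transaction that keeps
# colliding or loses its connection raises TransactionAbortedError,
# which can be caught instead of killing the worker.
from firebase_admin import db

def mark_finished(task):
    if task is None:
        return task
    task["_state"] = "finished"
    return task

try:
    db.reference("queue/tasks/some-task-id").transaction(mark_finished)
except db.TransactionAbortedError as exc:
    print(f"transaction aborted, will retry later: {exc}")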
I am using firebase-admin 9.6.0.
I have S3 remote logging enabled, and Airflow is installed on an EC2 instance. My DAGs are running; however, they don't always create a log, and the task then fails. The error is as follows:
*** Falling back to local log
*** Log file does not exist: /home/ec2-user/airflow/logs/REMOVED/REMOVED/2022-03-07T07:00:00+00:00/2.log
*** Fetching from: http://ip-10-105-32-92.eu-west-1.compute.internal:8793/log/REMOVED/REMOVED/2022-03-07T07:00:00+00:00/2.log
*** Failed to fetch log file from worker. Client error '404 NOT FOUND' for url 'http://ip-10-105-32-92.eu-west-1.compute.internal:8793/log/REMOVED/REMOVED/2022-03-07T07:00:00+00:00/2.log'
For more information check: https://httpstatuses.com/404
After a few attempts (3-5), it eventually does end up working.
I have even disabled the remote logging in an attempt to debug, and it still doesn't work. Any suggestions?
Apache Airflow version: 2.2.4
We use a DescribeStacks API call to get the latest ECS task definition, and I've noticed we get lots of these errors:
An error occurred (Throttling) when calling the DescribeStacks operation (reached max retries: 4): Rate exceeded
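The call in question is essentially the following; the "reached max retries: 4" in the message refers to the SDK's built-in retry handler, whose attempt count and backoff mode can be tuned via botocore's Config (a sketch; the stack name and settings are hypothetical, not our actual values):

# Sketch of the DescribeStacks call with a more forgiving retry policy.
import boto3
from botocore.config import Config

cfn = boto3.client(
    "cloudformation",
    config=Config(retries={"max_attempts": 10, "mode": "adaptive"}),
)

stacks = cfn.describe_stacks(StackName="my-ecs-stack")  # hypothetical name
latest_outputs = stacks["Stacks"][0].get("Outputs", [])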
Setup
Corda 4.6
Working from the Java template
I have been experimenting with adding up to 10 Attachments of small (1 KB) zip files to a transaction.
Error when testing with StartedMockNodes:
io.github.classgraph.ClassGraphException: Uncaught exception during scan
at io.github.classgraph.ClassGraphException.newClassGraphException(ClassGraphException.java:89) ~[classgraph-4.8.90.jar:4.8.90]
at io.github.classgraph.ClassGraph.scan(ClassGraph.java:1555) ~[classgraph-4.8.90.jar:4.8.90]
...
Caused by: java.lang.OutOfMemoryError: Java heap space
at nonapi.io.github.classgraph.fastzipfilereader.NestedJarHandler.readAllBytesWithSpilloverToDisk(NestedJarHandler.java:815) ~[classgraph-4.8.90.jar:4.8.90]
at nonapi.io.github.classgraph.fastzipfilereader.PhysicalZipFile.<init>(PhysicalZipFile.java:161) ~[classgraph-4.8.90.jar:4.8.90]
at nonapi.io.github.classgraph.fastzipfilereader.NestedJarHandler.downloadJarFromURL(NestedJarHandler.java:576) ~[classgraph-4.8.90.jar:4.8.90]
...
Error when testing local nodes built with Cordform and connecting over RPC:
The node will stop suddenly, with no errors in the log. In the directory of the failed node there will be two files:
hs_err_pid20400.log
java_pid20400.hprof
The log file has errors similar to the StartedMockNodes failures:
j nonapi.io.github.classgraph.fastzipfilereader.NestedJarHandler.readAllBytesWithSpilloverToDisk(Ljava/io/InputStream;Ljava/lang/String;JLnonapi/io/github/classgraph/utils/LogNode;)Lnonapi/io/github/classgraph/fileslice/Slice;+65
j nonapi.io.github.classgraph.fastzipfilereader.PhysicalZipFile.<init>(Ljava/io/InputStream;JLjava/lang/String;Lnonapi/io/github/classgraph/fastzipfilereader/NestedJarHandler;Lnonapi/io/github/classgraph/utils/LogNode;)V+25
j nonapi.io.github.classgraph.fastzipfilereader.NestedJarHandler.downloadJarFromURL(Ljava/lang/String;Lnonapi/io/github/classgraph/utils/LogNode;)Lnonapi/io/github/classgraph/fastzipfilereader/PhysicalZipFile;+428
j nonapi.io.github.classgraph.fastzipfilereader.NestedJarHandler.access$000(Lnonapi/io/github/classgraph/fastzipfilereader/NestedJarHandler;Ljava/lang/String;Lnonapi/io/github/classgraph/utils/LogNode;)Lnonapi/io/github/classgraph/fastzipfilereader/PhysicalZipFile;+3
j nonapi.io.github.classgraph.fastzipfilereader.NestedJarHandler$4.newInstance(Ljava/lang/String;Lnonapi/io/github/classgraph/utils/LogNode;)Ljava/util/Map$Entry;+124
Clarification #1: The error occurs during transaction execution, not when originally uploading the files to the node using CordaRPCOps.uploadAttachmentWithMetadata (that works fine).
Clarification #2: The first node to fail is the one constructing the transaction. If you try restarting this node, it will fail on restart; it takes several restarts to get it up and running again. Then any node that was receiving the transaction will fail, and those nodes will also require several restarts to get up and running again. As a testament to Corda's Flow framework, after enough restarts the transaction will eventually succeed and the Attachments will be transmitted.
Clarification #3: I can pre-upload the Attachments to all the nodes before executing the transaction and the failures still occur.
StartedMockNodes:
Found this
Added the following to my workflows build.gradle file to stop the errors:
test {
    // Give the test JVM more heap so classgraph can scan the attachment jars
    maxHeapSize = "4096m"
}
Local Nodes & RPC:
??? - haven't found a solution yet
I want to run an app on the iPad 2, but at link time I got this error:
collect2: ld terminated with signal 6 [Abort trap]
ld(69392) malloc: *** mmap(size=16777216) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
I don't know what the reason for this error is. It looks like it is allocating 16777216 bytes (16 MB), and the iPad 2 should be able to handle that!
Are you sure you got the error at linking, and that it didn't successfully link, install and begin to run, THEN get the error?
The error you have is because malloc can't allocate another 16M block, and THAT is almost certainly because you have either crazy memory fragmentation (possible, but not as common) or a memory leak (very common!)
It would be odd to see this coming from the linker/Xcode tools (unless you're running betaware, in which case who knows?!). It's more likely in your app.