Corda CENM network map server starts failing to connect to the database after a few weeks running - corda

We operate CENM (1.2, run on a Kubernetes cluster via a Helm template) to construct our own private network. After the CENM network map server has been running for a few weeks, launching new nodes starts failing.
With further investigation, it appears that a request timeout for http://nmap:10000/network-map causes the problem.
In the network map server's log, we found the following output when accessing the above URL with curl:
[NMServer] - Error while handling socket client message com.r3.enm.servicesapi.networkmap.handlers.LatestUnsignedNetworkParametersRetrievalMessage#760c53ea: HikariPool-1 - Connection is not available, request timed out after 30000ms.
netstat shows there are at least 3 established connections to the database from the container running the network map server, and I can also connect to the database directly using the CLI.
So I don't think it is either a saturated database or a network configuration problem.
Does anyone have an idea why this happens? I think a restart would probably solve the problem, but I want to know the root cause...
regards,

Please try the following options.
Since it is the HikariCP (connection pool) component that is throwing the error, it would be worth seeing if increasing the pool size in the network map configuration helps (see below).
Corda uses HikariCP to create the connection pool. To configure it, any custom properties can be set in the dataSourceProperties section:
dataSourceProperties = {
dataSourceClassName = "org.postgresql.ds.PGSimpleDataSource"
...
maximumPoolSize = 10        // maximum number of connections HikariCP keeps in the pool
connectionTimeout = 50000   // milliseconds to wait for a free connection before timing out
}
Has a health check been conducted to verify there are sufficient resources on that PostgreSQL database, i.e. basic diagnostic checks?
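As a quick diagnostic sketch (the config above points at PostgreSQL via PGSimpleDataSource; host, user, and database names below are placeholders), you can check for connections stuck "idle in transaction", which can exhaust a small pool:
psql -h <db-host> -U <user> -d <database> -c "SELECT pid, state, now() - xact_start AS xact_age FROM pg_stat_activity WHERE state = 'idle in transaction' ORDER BY xact_age DESC;"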
Another option to get more information logged from the network map service is to run it with TRACE logging:
From https://docs.corda.net/docs/cenm/1.2/troubleshooting-common-issues.html
Enabling debug/trace logging
Each service can be configured to run with a deeper log level via command line flags passed at startup:
java -DdefaultLogLevel=TRACE -DconsoleLogLevel=TRACE -jar <enm-service-jar>.jar --config-file <config file>

Related

Docker Redis server get/set command issue

I'm asking here because I need your help.
I have built a Redis server in a Docker container and am using it in cluster mode.
I can access the Redis server and ping it fine.
However, when executing a get/set command, the following error message is output and the connection to Redis is lost:
Could not connect to Redis at 172.19.0.4:6300: Connection failed due to no response from the connected member or connection was lost due to no response from the host.
I would really appreciate it if you could help me with the above problem.
Since the connection and ping worked, I guessed it was a security-related configuration problem.
So I tried turning off the security setting in the redis.conf file, but the problem was not solved:
'protected-mode no'
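For reference, a minimal redis.conf sketch of that change (protected-mode is a standard Redis setting; the bind line is an extra assumption about accepting non-local clients):
# redis.conf
protected-mode no   # disable the protected-mode guard
bind 0.0.0.0        # assumption: listen on all interfaces so other hosts/containers can connect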

AMQP handshake timeout error while deploying AWS Corda Enterprise Template

I am deploying the AWS Corda Enterprise Template. The Quick Start deployed the stack as per the defined CloudFormation template. I can see two AWS instances up and running as Corda nodes, in a hot-cold setup with a load balancer.
However, the log for the Corda node has the following ERROR related to AMQP communication:
[ERROR] 2018-10-18T05:47:55,743Z [Thread-3 (ActiveMQ-scheduled-threads)] core.server.lambda$channelActive$0 - AMQ224088: Timeout (10 seconds) while handshaking has occurred. {}
What can be the possible reason for this error? It keeps occurring at a regular interval, so it looks like some connectivity issue to me.
Note: the load balancer shows the status of these AWS Corda instances as healthy (In Service), so I believe the Corda node has booted up successfully.
The ERROR message isn't necessarily tied to AMQP. Perhaps you were confused by the "AMQ" in the error ID (AMQ224088)?
In any event, this error indicates that something on the network is connecting to the ActiveMQ Artemis broker, but it's not completing any protocol handshake. This is commonly seen with, for example, load balancers that do a health check by creating a socket connection without sending any real data just to see if the port is open on the target machine.
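To illustrate (host and port here are hypothetical), a health check of that kind boils down to opening and immediately closing a raw TCP socket, which is enough to trigger the handshake timeout entry on the broker:
# open a TCP connection to the node's messaging port and close it without sending any AMQP data
nc -z corda-node.internal 10002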

How to communicate with a Kafka server running inside a Docker container

I am using the Apache KafkaConsumer in my Scala app to talk to a Kafka server, where the Kafka and Zookeeper services are running in a Docker container on my VM (the Scala app also runs on this VM). I have set the KafkaConsumer's "bootstrap.servers" property to 127.0.0.1:9092.
The KafkaConsumer does log "Sending coordinator request for group queuemanager_testGroup to broker 127.0.0.1:9092". The problem appears to be that the Kafka client code sets the coordinator values based on the response it receives, which contains responseBody={error_code=0,coordinator={node_id=0,host=e7059f0f6580,port=9092}}; that is how it sets the host for future connections. Subsequently it complains that it is unable to resolve the address e7059f0f6580.
The address e7059f0f6580 is the container ID of that Docker container.
I have tested using telnet that my VM does not resolve this as a hostname.
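Concretely, that check amounts to the following (hostname taken from the log above), and it fails because the container ID is not resolvable from the VM:
telnet e7059f0f6580 9092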
What setting do I need to change so that the Kafka in my Docker container returns localhost/127.0.0.1 as the host in its response? Or is there something else that I am missing / doing incorrectly?
Update
advertised.host.name is deprecated, and --override should be avoided.
Add/edit advertised.listeners in the format
[PROTOCOL]://[EXTERNAL.HOST.NAME]:[PORT]
Also make sure that the same port is listed in the listeners property; see the sketch below.
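A minimal server.properties sketch, assuming the broker should advertise itself to clients on the VM as 127.0.0.1 (protocol, host, and port here are illustrative):
# server.properties
listeners=PLAINTEXT://0.0.0.0:9092                 # where the broker binds inside the container
advertised.listeners=PLAINTEXT://127.0.0.1:9092    # the address the broker hands back to clients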
After investigating this problem for hours on end, I found that there is a way to set the hostname while starting up the Kafka server, as follows:
kafka-server-start.sh --override advertised.host.name=xxx (in my case: localhost)

AWS CodeDeploy vs Windows 2016 in ASG

I use AWS CodeDeploy to deploy builds from GitHub to EC2 instances in an Auto Scaling group.
It works fine for Windows 2012 R2 with all deployment configurations.
But for Windows 2016 it totally fails on an "OneAtATime" deployment;
during an "AllAtOnce" deployment only one or two instances deploy successfully, and all the others fail.
In the log file on the agent, this suspicious message is present:
ERROR [codedeploy-agent(1104)]: CodeDeploy Instance Agent Service: CodeDeploy Instance Agent Service: error during start or run: Errno::ETIMEDOUT
- A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. - connect(2)
All policies, roles, software, builds, and other settings are the same; I even tested this on a brand new AWS account.
Has anybody faced such behaviour?
I ran into the same problem. During my investigation I found that the server's route table had a wrong route for the 169.254.169.254 (instance metadata) address: it specified the gateway from the network where my template was captured, so the instance couldn't read its metadata.
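As a quick check on an affected instance (standard Windows commands; 169.254.169.254 is EC2's fixed metadata endpoint), you can verify metadata reachability and inspect the route table:
# PowerShell: confirm the instance metadata service responds
Invoke-RestMethod http://169.254.169.254/latest/meta-data/instance-id
# inspect the route table for a stale 169.254.169.254 entry
route print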
From the above error it looks like the agent isn't able to talk to the CodeDeploy endpoint after the instance starts up. Please check that the routing tables and other proxy-related settings are set up correctly. Also, if you do not have it already, you can turn on the debug log by setting :verbose to true in the agent config and restarting the agent. This will help debug the issue better.
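For reference, a sketch of that agent setting (on Linux the file is codedeployagent.yml; the exact path of the Windows equivalent depends on the install, so treat it as an assumption):
:verbose: true   # enable debug-level agent logging; restart the CodeDeploy agent service afterwards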

MSDTC is only working in one direction

I'm trying to use TransactionScope for unit tests and I keep getting errors on our build server. After following several helpful SO answers and blogs, I installed DTCPing and ran it on both server1 and server2. When I run it in the reverse order, it seems to connect in one direction and fail in the other:
++++++++++++Validating Remote Computer Name++++++++++++
Please refer to following log file for details:
D:\KPAHQDEV043372.log
Invoking RPC method on teamcity
RPC test is successful
++++++++++++RPC test completed+++++++++++++++
++++++++++++Start DTC Binding Test +++++++++++++
Trying Bind to teamcity
Binding call to teamcity Failed
Session Down
But when I run it in the intended direction, it just fails the RPC test:
++++++++++++Validating Remote Computer Name++++++++++++
Please refer to following log file for details:
C:\TEAMCITY3024.log
Invoking RPC method on kpahqdev04
Problem:fail to invoke remote RPC method
Error(0x6BA) at dtcping.cpp #303
-->RPC pinging exception
-->1722(The RPC server is unavailable.)
RPC test failed
I found some helpful info here but now I'm just stuck. Any ideas?
You need to add the Distributed Transaction Coordinator service as an exception in the firewall.
Also make sure RPC (port 135) is enabled and added as an exception in the firewall.
You can also check whether the firewall is the culprit by turning off the firewall and re-running DTCPing.
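A minimal sketch of those exceptions using the built-in netsh tool (run from an elevated prompt on both machines; the custom rule name is illustrative):
:: enable the built-in Windows Firewall rule group for MSDTC
netsh advfirewall firewall set rule group="Distributed Transaction Coordinator" new enable=yes
:: allow the RPC endpoint mapper on TCP 135
netsh advfirewall firewall add rule name="RPC endpoint mapper (TCP 135)" dir=in action=allow protocol=TCP localport=135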
