We started a Redis chart (Bitnami's Redis Helm chart, cluster mode disabled) and connected our Java app (using Redisson) to the Service.
After 5 minutes the connection to Redis is closed ("Reading from client: Connection reset by peer" in the Redis debug log). The networking still seems fine and we can establish new connections, but the old ones are closed, even though the Redis conf timeout is 0.
When the Java app is configured to access the Redis pod directly (without the Kubernetes Service in the middle), the problem doesn't happen.
We didn't find any similar problem on the web, which is weird (pretty out-of-the-box settings, nothing special).
We have no idea what can cause such a problem - any ideas?
Kubernetes version 1.11.7, installed on AWS via kops. We tried a Kubenet cluster and a new Calico-Flannel (Canal) cluster (without policies), switching Redis versions (4 and 5), and accessing the Service by IP and by name, but it didn't help.
Redis timeout is 0 (disabled).
Bitnami's Helm chart: https://github.com/helm/charts/tree/master/stable/redis
Using usePassword: false and loglevel debug.
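One mitigation worth trying (not mentioned in the original question, so treat it as an assumption): idle TCP connections that pass through a Service/NAT layer can be dropped silently by connection tracking, and enabling Redis's own TCP keepalive makes the kernel probe idle client sockets so those entries stay fresh even when the Redis-level timeout is 0. In redis.conf (or the chart's configuration override):

```
# Send TCP keepalive probes on idle client connections every 60 seconds.
# Keeps NAT/conntrack entries alive even though the Redis "timeout" is 0.
tcp-keepalive 60
```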
Related
I'm currently working on a setup where I need to connect an on-prem application to an application running on AWS EKS via a Network Load Balancer. The flow is POD ---> Ingress Controller ---> NLB ---> On-Prem application.
I have a stream timeout of 10 minutes (600 seconds) set in the nginx.conf file. As expected, after 10 minutes of inactivity the POD gets disconnected; however, the on-prem application still shows its connections to the Network Load Balancer as active, as seen from the netstat command.
I'm currently analysing the issue using tcpdump. However, I wanted to know whether there is any configuration on the Ingress/NLB end that can help disconnect the on-prem application as soon as the POD gets disconnected.
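For context, the 10-minute stream timeout described above would look roughly like this in nginx.conf (listen port and upstream name are placeholders, not taken from the original setup):

```
# TCP (stream) proxying: proxy_timeout closes BOTH sides of the proxied
# session after the given period of inactivity on the nginx side only -
# it does not signal anything to hops further along the path.
stream {
    server {
        listen 9000;
        proxy_timeout 600s;        # the 10-minute inactivity timeout
        proxy_pass onprem_backend; # hypothetical upstream name
    }
}
```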
We operate CENM (1.2, using a Helm template to run on a k8s cluster) to construct our own private network. After keeping the CENM network map server running for a few weeks, launching new nodes started failing.
On further investigation, it appeared that a request timeout for http://nmap:10000/network-map causes the problem.
In the nmap server's log, we found the following output when accessing the above URL with curl:
[NMServer] - Error while handling socket client message com.r3.enm.servicesapi.networkmap.handlers.LatestUnsignedNetworkParametersRetrievalMessage#760c53ea: HikariPool-1 - Connection is not available, request timed out after 30000ms.
netstat shows there are at least 3 established connections to the database from the container the network map server runs in, and I can also connect to the database directly using the CLI.
So I don't think it is either database saturation or a network configuration problem.
Does anyone have an idea why this happens? I think a restart would probably solve the problem, but I want to know the root cause...
Regards,
Please test the following options.
Since it is the HikariCP (connection pool) component that is throwing the error, it would be worth seeing whether increasing the pool size in the network map configuration helps (see below).
Corda uses the Hikari pool for creating the connection pool. To configure the connection pool, any custom properties can be set in the dataSourceProperties section.
dataSourceProperties = {
    dataSourceClassName = "org.postgresql.ds.PGSimpleDataSource"
    ...
    maximumPoolSize = 10
    connectionTimeout = 50000
}
Has a health check been conducted to verify there are sufficient resources on that Postgres database, i.e. basic diagnostic checks?
Another option to get more information logged from the network map service is to run with TRACE logging:
From https://docs.corda.net/docs/cenm/1.2/troubleshooting-common-issues.html
Enabling debug/trace logging
Each service can be configured to run with a deeper log level via command line flags passed at startup:
java -DdefaultLogLevel=TRACE -DconsoleLogLevel=TRACE -jar <enm-service-jar>.jar --config-fi
I have an AKS cluster using Nginx ingress controllers, and I infrequently but reliably receive errors like this:
The connection was closed unexpectedly
An existing connection was forcibly closed by the remote host
What can I do to fix this?
I found the solution to these errors in the documentation at https://blogs.msdn.microsoft.com/jpsanders/2009/01/07/you-receive-one-or-more-error-messages-when-you-try-to-make-an-http-request-in-an-application-that-is-built-on-the-net-framework-2-0, which says:
Also check and ensure the Keep-Alive timeouts on the server, load
balancer and client (.NET) are set so that the client is set to less
than the load balancer, which in turn is set less than the server.
In my case I needed to increase the upstream-keepalive-timeout setting to something larger than the default idle timeout of the Azure load balancer (which is 4 minutes). I set the value to 300 seconds, and the errors went away.
Edit
I also had to set the worker-shutdown-timeout value, as described in https://github.com/kubernetes/minikube/issues/3039.
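For reference, both settings live in the ingress-nginx ConfigMap. A minimal sketch (the ConfigMap name and namespace depend on how the controller was installed):

```
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-configuration
  namespace: ingress-nginx
data:
  # must exceed the Azure load balancer's 4-minute idle timeout
  upstream-keepalive-timeout: "300"
  # give in-flight requests time to finish when workers are reloaded
  worker-shutdown-timeout: "240s"
```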
I am deploying the AWS Corda Enterprise Template. The Quick Start deployed the stack as per the defined CloudFormation template. I can see 2 AWS instances up and running as Corda nodes, in a hot-cold setup with a load balancer.
However, the log for the Corda node has the following ERROR related to AMQP communication.
[ERROR] 2018-10-18T05:47:55,743Z [Thread-3
(ActiveMQ-scheduled-threads)] core.server.lambda$channelActive$0 -
AMQ224088: Timeout (10 seconds) while handshaking has occurred. {}
What can be the possible reason for this error? It keeps occurring at a certain time interval, so it looks like a connectivity issue to me.
Note: The load balancer shows the status of these AWS Corda instances as healthy (In Service), so I believe the Corda nodes have booted up successfully.
The ERROR message isn't necessarily tied to AMQP. Perhaps you were confused by the "AMQ" in the error ID (AMQ224088)?
In any event, this error indicates that something on the network is connecting to the ActiveMQ Artemis broker, but it's not completing any protocol handshake. This is commonly seen with, for example, load balancers that do a health check by creating a socket connection without sending any real data just to see if the port is open on the target machine.
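To illustrate, a port-only health check is nothing more than a TCP connect followed by an immediate close; no protocol bytes are ever sent, which is exactly what makes the broker's handshake timer fire. A minimal sketch in Python (hypothetical, not any load balancer's actual implementation):

```python
import socket

def tcp_health_check(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a plain TCP connection to host:port succeeds.

    The socket is closed immediately without sending any data, so a
    protocol-aware server (such as an AMQP broker) sees a peer that
    connects and then stays silent until the handshake timeout fires.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True  # connected; the 'with' block closes it, no data sent
    except OSError:
        return False
```

A broker sitting behind such a check will log one handshake timeout per health-check interval, which matches the periodic recurrence of the error described above.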
I use AWS CodeDeploy to deploy builds from GitHub to EC2 instances in an Auto Scaling group.
It works fine on Windows 2012 R2 with all deployment configurations.
But on Windows 2016 it totally fails with the "OneAtATime" deployment configuration, and during an "AllAtOnce" deployment only one or two instances deploy successfully; all the others fail.
This suspicious message is present in the agent's log file:
ERROR [codedeploy-agent(1104)]: CodeDeploy Instance Agent Service: CodeDeploy Instance Agent Service: error during start or run: Errno::ETIMEDOUT
- A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. - connect(2)
All policies, roles, software, builds and other settings are the same; I even tested this on a brand-new AWS account.
Has anybody faced such behaviour?
I ran into the same problem, but during my investigation I found out that the server's route table had wrong routes for the 169.254.169.254 network (it pointed at the gateway from the network where my template was captured), so the instance couldn't read its metadata.
From the above error it looks like the agent isn't able to talk to the CodeDeploy endpoint after the instance starts up. Please check that the routing tables and other proxy-related settings are set up correctly. Also, if you don't have it already, you can turn on debug logging by setting :verbose to true in the agent config and restarting the agent. This will help debug the issue.
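For reference, the :verbose flag mentioned above lives in the agent's YAML config file. A sketch, assuming the default Linux path (on Windows the file sits under the agent's installation directory, e.g. C:\ProgramData\Amazon\CodeDeploy\conf.yml):

```
# /etc/codedeploy-agent/conf/codedeployagent.yml
:verbose: true    # restart the codedeploy-agent service after changing this
```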