How can I increase the healthcheck timeout when launching an instance with Boxfuse? - cloudcaptain

My instance fails to come up within 60 seconds. How can I increase the timeout?
The error I'm getting both locally and on AWS is:
ERROR: Time out: Payload of Instance vb-312f2f77 failed to come up within 60 seconds at http://127.0.0.1:8888/ !

There are two possible fixes, depending on the cause:
1. Increase the healthcheck timeout. For example, to raise it from the default of 60 seconds to 120 seconds, you could use
boxfuse fuse payload.war -healthcheck.timeout=120
(More info: https://cloudcaptain.sh/docs/commandline/fuse.html#healthcheck.timeout)
2. Analyse the instance logs to check whether it is a genuine timeout or an application startup issue. You can do this by issuing
boxfuse logs vb-312f2f77
(More info: https://cloudcaptain.sh/docs/commandline/logs.html)
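To make the timeout behaviour concrete, here is a minimal Python sketch of a poll-until-healthy loop. It is not Boxfuse's implementation, only an illustration under the assumption that a healthcheck is an HTTP GET repeated until it succeeds or a deadline passes. The names http_probe and wait_until_healthy are hypothetical.

```python
import time
import urllib.error
import urllib.request


def http_probe(url, request_timeout=5.0):
    """Return True if a GET on url answers with a 2xx status (assumed health criterion)."""
    try:
        with urllib.request.urlopen(url, timeout=request_timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts and HTTP errors all count as "not up yet".
        return False


def wait_until_healthy(probe, timeout=60.0, interval=1.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Call probe() until it returns True or `timeout` seconds have elapsed."""
    deadline = clock() + timeout
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

With a loop like this, an application that needs 90 seconds to start fails a 60-second check but passes a 120-second one, for example wait_until_healthy(lambda: http_probe("http://127.0.0.1:8888/"), timeout=120). An application that crashes during startup fails regardless of the timeout, which is why checking the logs is the other half of the answer.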

Related

Mariadb: MySQL server has gone away

In my application, I receive the "MySQL server has gone away" error during a fairly long-running transaction. I know this has already been asked a lot, but I tried my best to go through all possible causes.
The one thing that baffles me a lot, is this error message in the log of the MariaDB server:
[Warning] Aborted connection 6 to db: 'default' user: 'root' host: '10.0.0.18' (Got timeout reading communication packets)
This would explain why the client reports a broken connection, but this error occurs 10-15 minutes before the client reports the "MySQL server has gone away" error. In the meantime, the client is happily running insert statements without an issue. But as soon as the client runs a select statement, the statement fails practically immediately.
I have already checked for these possible causes:
The server was running all the time
wait_timeout is set to 8 hours, which is way longer than the time the transaction needs to fail
max_allowed_packet is set to 512M which should be more than enough since the query is a very short select statement
The server does not run out of memory
I'm pretty sure the issue must be related to the "Got timeout reading communication packets" error from the MariaDB log. But I cannot wrap my head around why the client still can write data. And why this timeout occurs in the first place, since the wait_timeout is super high.
Some system information:
I'm running on MariaDB 10.5.1
The client uses Python 3.6 and connects to the database with mysqlclient, which uses libmysql
I hope maybe some of you have an idea what I should look for, because this is really driving me nuts.

Single Node Artifactory - deploy using AWS ECS fails with current node still available

Maybe I'm just approaching this wrong.
Single Instance mode (non-HA)
AWS-RDS Postgres Database
Deploying via ECS
Currently have Artifactory-Pro building a docker container and deploying to ECS via CI/CD. The initial deploy goes fine. Everything stands up, database migrations occur, and the instance runs.
However, when doing an update to the task, a new task spins up. It then adds entries to the access_topology with the new container IP and unique node ID, but they stay unhealthy. The logs then bomb out with failure messages (below, due to the existing heartbeat of the other node).
If I first stop the running task and then start a new task, it spins up properly (probably due to heartbeat loss).
In the typical ECS world, the new task is spun up until it's deemed healthy, and then the older task is killed off.
Either scenario creates orphaned NODE records that stay healthy -- trying to also figure out how to garbage collect on those and purge.
Any thoughts on this?
Errors are below. It appears that it won't join properly because of an active heartbeat and not being HA. However, I want this node to stand up so I can topple the other. Thanks
Cluster join: Successfully joined jfmd#01es5dmfhar6gcy5abyj4rwpkc with node id ip-10-10-3-248.us-XXXX-1.compute.internal
Application could not be initialized: Current Artifactory node last heartbeat is: 1607609142483. Stopping Artifactory since the local server is running as PRO/OSS but found other servers in registry
Error occurred when refreshing domain cache all domain endpoint failed : Fetch domains from http://localhost:8046/distribution/api/v1/events/domains failed (returned 404), Fetch domains from http://localhost:8046/artifactory/api/events/domains failed (returned 404), [domain_client]"
Retry 20 Elapsed 16.84 secs failed: Couldn't access another access peer. [localhost:8046]. Status code: UNAVAILABLE. HTTP status code 503
Status code: UNAVAILABLE. HTTP status code 503
1607609184634,invalid content-type: text/plain; charset=utf-8
1607609184634,"headers: Metadata(:status=503,content-type=text/plain; charset=utf-8,content-length=19,date=Thu, 10 Dec 2020 14:06:24 GMT)"
1607609184634,DATA-----------------------------
1607609184634,Service Unavailable. Trying again
This is not possible without an HA configuration. Since this is not an HA configuration, the application will not start up if there is another application still "alive". In this case, "alive" is defined as having written the heartbeat within X amount of seconds (I believe this is 10 by default).
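The liveness rule described above can be sketched as follows. This is only an illustration of the rule as stated in this answer, not Artifactory's code. The function names are hypothetical, and the 10-second threshold is the value the answer believes to be the default.

```python
def is_alive(last_heartbeat_ms, now_ms, stale_after_s=10):
    """A node counts as alive if its heartbeat is at most stale_after_s seconds old."""
    return (now_ms - last_heartbeat_ms) <= stale_after_s * 1000


def may_start_standalone(other_heartbeats_ms, now_ms, stale_after_s=10):
    """A non-HA node may start only if no other registered node is still alive."""
    return not any(is_alive(hb, now_ms, stale_after_s) for hb in other_heartbeats_ms)


# Heartbeat value taken from the error message in the question.
old_node = 1607609142483
print(may_start_standalone([old_node], old_node + 5_000))   # old task still heartbeating
print(may_start_standalone([old_node], old_node + 11_000))  # old task stopped long enough ago
```

Under this rule, a rolling ECS update cannot succeed: the old task keeps refreshing its heartbeat until the new one is healthy, and the new one refuses to start while that heartbeat is fresh.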

How do retries work in a Datapower mpgw service using routing-url to set backside URL?

I have a DataPower mpgw service that takes in JSON POST and GET HTTPS requests. Persistent connections are enabled. It sets the backend URL using the dp routing-url variable. How do retries work for this? Is there some specific retry setting? Does it retry automatically up to a certain point? What if I don't want it to retry?
The backend app is taking about 1.5 minutes to return 500 when it can't connect, but I want it to return more quickly. I have the "backside timeout" set to 30 seconds. I'm wondering if it's because it's retrying a couple times but I can't find info on how retries are working or configured in this case.
I'm open to more answers, but what I found suggests that, with persistent connections enabled, DataPower will retry after each backside timeout until the persistent connection timeout has elapsed.
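As a rough check of that explanation against the numbers in the question, the arithmetic can be written out in Python. This assumes each attempt runs for the full backside timeout before the next one starts. The persistent connection timeout is not stated in the question, so the observed 1.5 minutes is used as the window. attempts_within is a hypothetical helper.

```python
def attempts_within(window_s, backside_timeout_s):
    """Number of full backside-timeout attempts that fit into a time window."""
    return int(window_s // backside_timeout_s)


observed_delay_s = 90      # "about 1.5 minutes" from the question
backside_timeout_s = 30    # configured backside timeout

print(attempts_within(observed_delay_s, backside_timeout_s))  # 3
```

Three attempts of 30 seconds each would account for the 1.5-minute delay, which is consistent with the retry explanation but does not prove it.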

How can I debug buffering with HTTP.sys?

I am running Windows 8.1 and I have an integration test suite that leverages HostableWebCore to spin up isolated ASP.NET web server processes. For performance reasons, I am launching 8 of these at a time and once they are started up I send a very simple web request to each, which is handled by an MVC application loaded into each. Every instance is listening on a different port.
The problem is that the requests are getting held up (I believe) in HTTP.sys (or whatever it is called these days). If I look at Fiddler, I can see all 8 requests immediately (within a couple of milliseconds) hit the ServerGotRequest state. However, the requests sit in this state for 20-100 seconds, depending on how many I run in parallel.
The reason I suspect this is an HTTP.sys problem is that the time I have to wait for any of them to respond increases with the number of hosting applications I spin up in parallel. If I only launch a single hosting application, it will start responding in ~20 seconds. If I spin up 2, they will both start responding in ~30 seconds. If I spin up 4, ~40 seconds. If I spin up 8, ~100 seconds (which is the default WebClient request timeout).
Because of this long delay, I have enough time to attach a debugger and put a breakpoint in my controller action and that breakpoint will be hit after the 20-100 second delay, suggesting that my process hasn't yet received the request. All of the hosts are sitting idle for those 20-100 seconds after ~5-10 seconds of cold start CPU churning. All of the hosts appear to receive the requests at the same time, as if something was blocking any request from going through and then all of a sudden let everything through.
My problem is, I have been unable to locate any information related to how one can debug HTTP.sys. How can I see what it is doing? What is causing the block? Why is it waiting to forward on the requests to the workers? Why do they all come through together?
Alternatively, if someone has any idea how I can work around this and get the requests to come through immediately (without the waiting) I would very much appreciate it.
Another note: I can see System (PID 4) immediately register to listen on the port I have specified as soon as the hosting applications launch.
Additional Information:
This is what one of my hosting apps looks like under netsh http show servicestate
Server session ID: FD0000012000004C
    Version: 2.0
    State: Active
    Properties:
        Max bandwidth: 4294967295
        Timeouts:
            Entity body timeout (secs): 120
            Drain entity body timeout (secs): 120
            Request queue timeout (secs): 120
            Idle connection timeout (secs): 120
            Header wait timeout (secs): 120
            Minimum send rate (bytes/sec): 150
    URL groups:
    URL group ID: FB00000140000018
        State: Active
        Request queue name: IntegrationTestAppPool10451{974E3BB1-7774-432B-98DB-99850825B023}
        Properties:
            Max bandwidth: inherited
            Max connections: inherited
            Timeouts:
                Timeout values inherited
            Logging information:
                Log directory: C:\inetpub\logs\LogFiles\W3SVC1
                Log format: 0
        Number of registered URLs: 2
        Registered URLs:
            HTTP://LOCALHOST:10451/
            HTTP://*:10451/
Request queue name: IntegrationTestAppPool10451{974E3BB1-7774-432B-98DB-99850825B023}
    Version: 2.0
    State: Active
    Request queue 503 verbosity level: Basic
    Max requests: 1000
    Number of active processes attached: 1
    Controller process ID: 12812
    Process IDs:
        12812
Answering this mainly for posterity. It turns out that my problem wasn't HTTP.sys but ASP.NET. ASP.NET takes a shared lock when it compiles files, and that lock is identified by System.Web.HttpRuntime.AppDomainAppId. I believe that since all of my apps are built dynamically from a common applicationHost.config file, they all have the same AppDomainAppId (/LM/W3SVC/1/ROOT). This means they all share one lock, so page compilation effectively happens sequentially across all of the apps. Because the apps keep taking turns on the lock, none of them gets through compilation quickly, and they all tend to finish at around the same time: once one of them makes it through, the others are close behind.
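The effect of a lock keyed by a shared identifier can be illustrated with a small Python analogy. This is not how ASP.NET implements its compilation lock (which spans processes). It is an in-process sketch with threads, and lock_for and run_compilations are hypothetical names.

```python
import threading
import time
from collections import defaultdict

_locks = defaultdict(threading.Lock)   # one lock per application id
_registry_guard = threading.Lock()     # protects the registry itself


def lock_for(app_id):
    """Return the lock registered for app_id, creating it on first use."""
    with _registry_guard:
        return _locks[app_id]


def run_compilations(app_ids):
    """Run one fake compilation per id; return the peak number running at once."""
    state = {"running": 0, "peak": 0}
    counter_guard = threading.Lock()

    def compile_app(app_id):
        with lock_for(app_id):
            with counter_guard:
                state["running"] += 1
                state["peak"] = max(state["peak"], state["running"])
            time.sleep(0.01)  # stand-in for compilation work
            with counter_guard:
                state["running"] -= 1

    threads = [threading.Thread(target=compile_app, args=(a,)) for a in app_ids]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["peak"]


# Eight hosts that all report the same id never compile concurrently.
print(run_compilations(["/LM/W3SVC/1/ROOT"] * 8))  # 1
```

If each host had a distinct id, lock_for would hand out distinct locks and the compilations could overlap, which suggests that giving each hosted app its own site ID or application path might avoid the serialization.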

JBoss 5.1 Servlet repeats request after 60 seconds

We have a servlet that accepts requests and performs certain actions on external systems. Many times these external systems respond slowly and the requests take longer than 60 seconds. In the log we notice that exactly after 60 seconds a new request is made to the servlet (with the same post parameters) as long as the client is still connected.
Googling showed that the same behaviour is reported on other app servers such as Glassfish. The reason seems to be that after a timeout of 60 seconds the servlet or the web container times out the call and repeats the request. Note that this seems to be a servlet- or container-initiated refresh and not really posted from the client. The way to avoid this is apparently to increase the timeout. (Read more here on a similar issue: Java - multiple requests from two WebContainer threads)
I increased the connectionTimeout in the deploy/jbossweb.sar/server.xml to 120000 (2 minutes) but the call repeats exactly after 60 seconds still.
Any idea how to increase the timeout or to avoid this behaviour in JBoss?
Thanks
Srini
Found the issue. The problem had nothing to do with JBoss at all. Our JBoss servers run on Amazon EC2 instances behind an ELB load balancer. The AWS ELB times out after 60 seconds of idle time and resubmits the request.
