WSO2 lags in processing fixed-time APIs - wso2-api-manager

We have a backend API that runs in almost constant time (it just "sleeps" for a given period). When we invoke a managed API that proxies to it over a long period, we see that from time to time its execution time increases to up to twice the average.
From analyzing the Amazon ALB data in production, it seems that the time the request spends inside Synapse remains the same, but the connection time (the time it takes for the request to enter the processing queue) is high.
In an isolated environment we noticed that those lags happen approximately every 10 minutes. In production, where we have multiple workers that receive requests all the time, the picture is less clear, as it happens more often (possibly the lags accumulate).
Is anyone aware of any periodic activity in the worker which results in delays entering the queue every few minutes? Is there any parameter that controls this? Otherwise, any idea how to figure out the cause?

This could be due to gateway token cache invalidation. The default timeout is 15 minutes.
Once key validation is done via the Key Manager, the key validation info is cached in the gateway. For subsequent API invocations, key validation is served from this cache, and during this time the execution time stays low.
After the cache entry is invalidated, token validation goes through the Key Manager again (hitting the database), which causes the increase in execution time.
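The cache expiry is configurable, so you can test this hypothesis. A minimal sketch, assuming a recent API Manager version configured via deployment.toml (the section name and default may differ between versions):

[apim.cache.gateway_token]
enable = true
expiry_time = "900s"  # default is 900 seconds = 15 minutes

If you shorten this value and the spike period follows it, the token cache is the likely cause.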

Investigating further, we found two causes for the spikes.
Our handler writes logs to a shared file system that was mounted for synchronous rather than asynchronous writes, which caused delays; switching it removed most of the spikes.
The remaining spikes seem to be related to registry updates. We did not investigate those, as they were more sporadic.

Related

Google Cloud Function with minimum instances: cold start after every deployment?

To minimize cold starts, I've set a minimum instance count for my Google Cloud Function. I do it with the Firebase Functions SDK like this:
functions.runWith({ minInstances: 1 })
...but I can also see it confirmed in the Google Cloud Console.
I'm noticing that after every deployment, I still encounter one cold start. I would have assumed that one instance would be primed and ready for the first request, but that doesn't seem to be the case.
For example, in the logs I can see that ~16 hours after deployment, the first request comes in. It's a cold start that takes 8139ms. The next request comes in about an hour later, but there's no cold start and the request takes 556ms, significantly faster than the first.
So is this the expected behaviour? Do we still encounter one cold start even if minimum instances is set? Should I then be priming the cloud function after every deployment with a dummy request to prevent my users from encountering this first cold start?
Tl;dr: The first execution of a function that has minimum instances set is not technically a cold start, but it will probably be slower than later executions on that instance.
Minimum instances of a function will immediately be "warmed up" on deploy and sit in a warm but idle state, ready to respond to a request. However, the functions we write often need to do extra setup work when they're actually triggered for the first time.
For example, we may use dynamic imports to pull in a library or need to set up a connection to a remote DB. Even though the function instance is warm, the extra work that has to be done on the first execution means that it will probably be slower than later executions.
The benefit of the minimum instances setting is that later executions benefit from all the setup work done by the first execution, and can be much faster than if they were scaled back to zero and had to set themselves up all over again on the next request.
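To make that concrete, here is a minimal sketch of the lazy-setup pattern in a Firebase HTTPS function; the fake DB client and its slow connect are stand-ins for whatever setup your function actually performs:

import * as functions from "firebase-functions";

// Illustrative stand-ins for a real client library, not a real API.
interface FakeDbClient { query(q: string): Promise<string>; }
async function connectToFakeDb(): Promise<FakeDbClient> {
  await new Promise((resolve) => setTimeout(resolve, 2000)); // simulate slow setup
  return { query: async (q) => `result for ${q}` };
}

// Created once per instance, on its first execution, then reused.
let db: FakeDbClient | null = null;
async function getDb(): Promise<FakeDbClient> {
  if (!db) {
    // This branch runs only on the instance's first execution -- the slow
    // first request, even though the instance itself was already warm.
    db = await connectToFakeDb();
  }
  return db;
}

export const api = functions
  .runWith({ minInstances: 1 }) // the instance is kept warm...
  .https.onRequest(async (req, res) => {
    const client = await getDb(); // ...but this setup still runs once
    res.send(await client.query("ping"));
  });

Later requests served by the same instance skip the connect entirely, which is exactly the benefit described above.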
Update: Occasionally, an idle instance may be killed by the Cloud Functions backend. If this happens, another instance will be spun up immediately to meet the required minimum instances setting, but that new instance will need to go through its extra setup work again the first time it is triggered. However, this really shouldn't happen often.
The documentation does not make a hard guarantee about the behavior:
To minimize the impact of cold starts, Cloud Functions attempts to keep function instances idle for an unspecified amount of time after handling a request.
So, there is an attempt (no guarantee), and it kicks in after handling a request (not after deployment), but you don't know how long it will last. As stated, it sounds like you might want to make a priming request, along with the expectation that it might still not always work exactly the way you want.

How to root cause observeChanges() latency spikes?

I have a client that observes document additions using the added callback mechanism. Most of the time the added callback fires ~2 seconds after document addition. However, sometimes I see spikes of as much as 20 seconds, i.e. a document is added and the callback executes 20 seconds later.
How can I root cause these spikes in callback execution latency? Perhaps this is a network issue; how could I prove or disprove that?
This is not a load issue as the number of clients and operations is constant.
I determine the time diff by placing a timestamp in the inserted document and then comparing it with the time when the added callback is executed.
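For reference, a minimal sketch of that measurement, assuming a Meteor collection named Docs (the name and fields are illustrative). Note that it compares timestamps from two different machines, so any clock skew between them shows up in the numbers:

import { Mongo } from "meteor/mongo";

const Docs = new Mongo.Collection("docs"); // hypothetical collection

// Producer side: stamp the document at insert time.
Docs.insert({ payload: "...", insertedAt: Date.now() });

// Observer side: measure delivery latency in the added callback.
Docs.find({}).observeChanges({
  added(_id: string, fields: { insertedAt?: number }) {
    if (fields.insertedAt !== undefined) {
      console.log(`added fired ${Date.now() - fields.insertedAt} ms after insert`);
    }
  },
});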
Not all clients experience these latency spikes.

setMaxInactiveInterval on OC4J isn't accurate

I have a servlet app deployed inside OC4J.
I am trying to invalidate the user session after 1 minute using:
session.setMaxInactiveInterval(1 * 60);
But what happens is that it takes over a minute (sometimes up to a minute and a half) before the session gets destroyed.
Is this an implementation issue, or what?
You seem to be checking the destroy by waiting until HttpSessionListener#sessionDestroyed() gets called, instead of actually sending an HTTP request to the server just after the 1-minute mark.
On most servers, session destruction is managed by a background job which runs at intervals, which can be every minute or more depending on server make/version, configuration and possibly also load. This job checks all open sessions for expiry and sweeps the expired ones. Thus, the session destroy is not necessarily called in the same second the session expires, as long as the client hasn't sent a request; running this background job every second would be too CPU intensive.
The session destroy will, however, be called immediately whenever the server receives a request with a session ID while the session is still present in the server's memory but has already expired.
So, you'd either have to accept it or to change your testing methodology.
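A minimal sketch of the second testing methodology, using Node 18+'s built-in fetch; the URL and the reliance on a JSESSIONID cookie are assumptions about your setup:

const base = "http://localhost:8888/app/session-test"; // hypothetical servlet URL

async function probeSessionExpiry(): Promise<void> {
  // First request establishes the session and returns its cookie.
  const first = await fetch(base);
  const cookie = first.headers.get("set-cookie") ?? ""; // e.g. "JSESSIONID=..."

  // Wait just past the configured 60-second expiry.
  await new Promise((resolve) => setTimeout(resolve, 61_000));

  // Replay the cookie: the server sees the expired session, destroys it
  // immediately, and hands out a fresh JSESSIONID in Set-Cookie.
  const second = await fetch(base, { headers: { cookie } });
  console.log("fresh session issued:", second.headers.get("set-cookie") !== null);
}

probeSessionExpiry();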

Does the server ASP.NET timeout setting affect the client timeout setting?

I'm working with ASP.Net web services and am having a problem with a long-running process that takes about 5 minutes to complete, and it's timing out. To fix this, I was able to set the executionTimeout on the server's web.config to 10 minutes, and then set the .Timeout property on the Web Service object to approximately 9 minutes. Now, I'm worried that this may possibly cause some other web service calls to sit there for 10 minutes before they time out rather than the previous 90-100 seconds. I know the default on the client side is 100 seconds, but wasn't sure if updating the server's timeout setting would affect this.
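For reference, the server-side setting described here is the httpRuntime executionTimeout value in web.config, given in seconds; a minimal sketch:

<system.web>
  <!-- 600 seconds = 10 minutes; note this value is only honored when
       compilation debug="false" -->
  <httpRuntime executionTimeout="600" />
</system.web>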
Bottom line is - Is it safe to update the server's timeout setting to a long amount like 10 minutes, and rely on the default timeout on the client, or could this end up causing some problems?
Thanks!
The web is not supposed to work like this. If you have a long-running process, you should run it on a new thread and deliver the answer after the page has finished loading on the client side (either with a callback or by querying the server every x minutes to check whether the process has finished). This way you avoid timeouts and the user gets their page (even if incomplete) in a user-friendly time. This is important because if the user does not get their page in a reasonable time, they will be unhappy and try to reload the page (and maybe restart your process...).

Browser timeouts while ASP.NET application keeps running

I'm encountering a situation where it takes ASP.NET a long time (more than 2 hours) to generate the reply with the web page. It is due to the code-behind running for a while (a very long, slow loop).
The browser (both IE and Firefox) stops waiting for the reply (after about an hour) and gives a generic "cannot display webpage" error (similar to what you would see if you tried to navigate to a non-existent server).
At the same time, the ASP.NET app keeps going (I can see it in the debugger) and eventually completes.
Why does this happen? Are there any settings in web.config to influence this? I'm hoping there's a timeout setting that I'm missing that's causing this.
Maybe a setting in IE or Firefox? But I thought they keep waiting as long as the server keeps the connection alive.
I'm experiencing this even when I launch the app in debug mode (with compilation debug="true") on my local machine from VS (so it's not running on IIS, but on the ASP.NET Dev Server).
I know it's bad that it takes so long to generate the page, but it doesn't matter at this stage. Speeding it up would take a lot of extra work and the delay doesn't really matter. This is used internally.
I realize I can redesign around this issue by moving the logic to a background process and getting notified when it's done through AJAX, or by pulling it into a desktop app or service or whatever. Something along those lines will be done eventually, but that's not what I'm asking about right now.
Sounds like you're using IE and it is timing out while waiting for a response from the server.
There is a Microsoft support (KB) article about adjusting this limit:
http://support.microsoft.com/kb/181050
CAUSE
By design, Internet Explorer imposes a time-out limit for the server to return data. The time-out limit is five minutes for versions 4.0 and 4.01 and is 60 minutes for versions 5.x, 6, and 7. As a result, Internet Explorer does not wait endlessly for the server to come back with data when the server has a problem.
RESOLUTION
In general, if a page does not return within a few minutes, many users perceive that a problem has occurred and stop the process. Therefore, design your server processes to return data within 5 minutes so that users do not have to wait for an extensive period of time.
The entire paradigm of the Web is of request/response. Not request, wait two hours, response!
If the work takes so long to do, then have the page request trigger the work, and then not wait for it. Put the long-running code into a Windows service, and have the service listen to an MSMQ queue (or use WCF with an MSMQ endpoint). Have the page send requests for work to this queue. The service will read a request, maybe start up a new thread to process it, then write a response to another queue, file, or whatever.
The same page, or a different, "progress" page can poll the response queue or file for responses, and update the user, assuming the user still cares after two hours.
For something that takes this long, I would figure out a way to kick it off via AJAX and then periodically check on its status. The background process should update some status variable on a regular basis and store its data in the cache or session when complete. When it completes and the browser detects this (via AJAX), have the browser do a real postback (or a GET by changing location.href), pick up the saved data, and generate the page.
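A minimal client-side sketch of that kick-off-and-poll pattern; the /start-job, /job-status and /result endpoints and the response shapes are hypothetical:

// Kick off the long-running job, then poll until the server reports it done.
async function runLongJob(): Promise<void> {
  const startRes = await fetch("/start-job", { method: "POST" });
  const { jobId } = await startRes.json(); // assumed response shape

  const timer = setInterval(async () => {
    const status = await (await fetch(`/job-status?id=${jobId}`)).json();
    if (status.done) {
      clearInterval(timer);
      // Navigate so the server can render the result it stored earlier.
      location.href = `/result?id=${jobId}`;
    }
  }, 5000); // poll every 5 seconds
}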
I have a process that can take a few minutes, so I spin off a separate thread and send the result via FTP. If an error occurs in the process, I send myself an error message including the stack trace. You may want to consider sending the results via email or some other place than the browser, and use a thread as well.
