A potential CEF cache corruption scenario - cefsharp

We have a .NET WPF container app in which we host several web apps using the CEFSharp.WinForms control. At times, for some users, some JavaScript resource requests fail with the ERR_CONTENT_DECODING_FAILED error. The issue goes away if we reload the app after either clearing the CEF cache or disabling the cache from the network tab in the developer toolbar window. Please note that this issue isn't confined to a specific subset of resource files: we have seen it happen sporadically for a variety of JavaScript resource files (some hosted on Apache, others on IIS servers).
While the usual cause of an ERR_CONTENT_DECODING_FAILED error is a server-side content-encoding issue, in this specific case we believe it could be related to CEF browser caching. Please see the analysis section below for our reasons.
Application Setup
When we initialize the CEF settings, we set MultiThreadedMessageLoop to true and set the CachePath property to a location under %localappdata% on a Windows 10 machine. When the container app starts, it creates three CEF web browser controls and launches web apps in them. All three apps load concurrently. After that, more CEF web browsers are created as the user visits more apps. The user also reloads some of these apps over time. All the web apps are internal apps sharing the same domain but physically hosted on different web servers. The JavaScript resource files in question usually have a caching policy that allows them to be cached for a week.
CEFSharp version - 79.1.360.0
CEF version - r79.1.36+g90301bd+chromium-79.0.3945.130
Chromium version - 79.0.3945.130
Our Analysis so far
We checked the web-server logs for the failing JavaScript resources. We observed that in most cases, the impacted user's server requests for those resource files had been made a few days earlier. The users are usually able to use the application without problems for some days before they sporadically start getting this error.
We checked the network logs (*.HAR file). For the failing JavaScript resource, _transferSize is 0, which seems to indicate that the response was served from the cache.
When the error occurs, it gets resolved when we reload the app after either clearing the cache or disabling the cache from the network tab.
We tried artificially simulating this error. We used Fiddler's autoresponder feature to deliberately respond with a bad server response (the content was gzip-encoded, but the Content-Encoding header indicated 'br'). We could reproduce the ERR_CONTENT_DECODING_FAILED error; however, the network logs showed a non-zero _transferSize. We also observed that Chrome did not cache the bad response. This test indicates that when the original JavaScript response was cached by the browser, it must have been a correctly encoded response, or else the browser would not have cached it.
All of the above points lead us to believe that the JavaScript resource files were downloaded (with correct encoding) and stored in the CEF cache. The user was also able to use the apps for some time. After that, however, in certain scenarios, some of these files potentially got corrupted in the CEF cache, leading to the content decoding error.
We tried using the CEF response filter mechanism, as explained here, to capture the bad response when the content decoding error occurs. Unfortunately, we observed that the dataIn stream passed to the filter function is null when the response fails with this error.
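If the raw bytes of a failing response can be captured some other way, a check along these lines would show whether the body matches its declared encoding. This is a minimal Python sketch and not part of CEF or CEFSharp; the function names are hypothetical, and it only recognises gzip by its two-byte magic number. Brotli has no magic number, so a 'br' body cannot be confirmed this way, only flagged when it looks like gzip.

```python
import gzip

# Every gzip stream starts with these two bytes (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def looks_like_gzip(body: bytes) -> bool:
    """True if the body starts with the gzip magic number."""
    return body[:2] == GZIP_MAGIC


def encoding_mismatch(declared: str, body: bytes) -> bool:
    """Heuristic: does the body contradict the declared Content-Encoding?

    Hypothetical helper. It can only prove a mismatch involving gzip;
    other encodings are flagged solely when the body looks like gzip.
    """
    declared = declared.strip().lower()
    if declared == "gzip":
        return not looks_like_gzip(body)
    return looks_like_gzip(body)


# Recreate the Fiddler experiment: a gzip body labelled as 'br'.
payload = gzip.compress(b"console.log('hello');")
print(encoding_mismatch("gzip", payload))  # False
print(encoding_mismatch("br", payload))    # True
```

A body that is declared as gzip but no longer starts with the magic number would be consistent with the corruption theory, although it would not say where the corruption happened.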
Summary and Questions
This is a sporadic issue which our users are facing. We haven't found a way to reproduce it deterministically. However, based on our analysis so far, we believe some JavaScript files may be getting corrupted in the CEF cache over time. We are not sure whether hosting several CEF web browsers and loading them concurrently plays some role in causing this issue.
Has anyone else observed/reported a similar issue? Do you have any idea if we are missing or overlooking something here or going in the wrong direction? Any pointers will be greatly appreciated.
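The HAR check from the analysis can be automated when logs arrive from several users. The following is a rough Python sketch that lists entries whose response reports a _transferSize of 0. The _transferSize field is the one mentioned above; the _error field name, the helper name, and the sample URLs are assumptions made for illustration.

```python
import json


def cached_entries(har):
    """Return HAR entries whose response reports _transferSize == 0."""
    hits = []
    for entry in har.get("log", {}).get("entries", []):
        response = entry.get("response", {})
        if response.get("_transferSize") == 0:
            hits.append({
                "url": entry.get("request", {}).get("url"),
                "status": response.get("status"),
                # "_error" is an assumed field name; .get() yields None if absent.
                "error": response.get("_error"),
            })
    return hits


# Hypothetical two-entry HAR document; a real one would be read with json.load().
sample_har = json.loads("""
{
  "log": {
    "entries": [
      {"request": {"url": "https://apps.example.internal/a.js"},
       "response": {"status": 200, "_transferSize": 0,
                    "_error": "net::ERR_CONTENT_DECODING_FAILED"}},
      {"request": {"url": "https://apps.example.internal/b.js"},
       "response": {"status": 200, "_transferSize": 5120}}
    ]
  }
}
""")

for hit in cached_entries(sample_har):
    print(hit["url"], hit["error"])
```

Comparing the flagged URLs across users and dates may show whether the same cached files fail repeatedly or whether the failures are spread randomly.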

Related

Chrome ERR_HTTP2_PROTOCOL_ERROR + Firefox Secure Connection Failed

I'm hosting a website that serves users in regions around the world, and recently a weird issue came up.
I have already checked other posts on the Internet, including the Stack Overflow question with a lot of discussion, Chrome net::ERR_HTTP2_PROTOCOL_ERROR 200 after a reconnect, but none of the answers helped.
The website is built as a legacy ASP.NET Web Forms "website" project (not a web application).
There is an important function that performs several processing steps once a user clicks a button on the website.
Let's say there are 100 lines of code in that function; I've added some flags to log which steps have been hit and processed.
Weird situation is:
Only users in China are facing the issue (the website is not hosted in China).
Some users are using Firefox, which returned the error below; in English it is "Secure Connection Failed".
According to several posts and the Firefox documentation, there should be an error code on screen, such as ssl_error_no_cypher_overlap, but there is none.
Firefox error
Other users are using Chrome-based browsers, which return:
Chrome error
In addition, I checked the process logs for the requests these users reported. Most of them do not run through all of the code; in other words, if there are 100 lines of code, some of them stopped around line 50.
The website has TLS 1.2 enabled, and the Chrome Network tab shows that HTTP/2 (h2) is in use.
I'm wondering whether, if the client browser shuts down the connection for some reason, that could produce the result I see (processing stopped in the middle of the code flow). In my opinion, once a request has been posted to the server, the process should finish the entire flow no matter what the client does.
Any ideas or thoughts will be appreciated!
I was just dealing with that exact situation.
From what I read in various posts on the HTTP2_PROTOCOL_ERROR, I think what happens is the response is started but code problem(s) prevent the server from completing the response. The incomplete response gives the protocol error in Chrome, and, because it's over TLS, Firefox sees it as a security error. (I'd share links, but I've already closed all those windows - sorry.)
Somehow my code was preventing the server from completing the response without causing an exception.
I was able to track down the offending code by commenting out the body of every code-behind procedure on the page and then bringing them back one at a time.
Good luck to you!
I can't give you a concrete example, but in my case, there was no problem on the application side.
Has your in-house infrastructure engineer recently added any settings?
For example, have you added WAF settings? You may want to check.
FYI

Debugging requests which are 'stuck' in an IIS worker process

In case of TL;DR - I basically need guidance regarding what tools are available to debug requests which are issued to IIS and which stall inside a module.
I have a problem with an old ASP 2.0 app at the moment whereby it will periodically become unavailable and recycling the app pool (horrible as that may be) doesn't bring it back up 100% of the time.
So first of all it presents itself as requests entering the app pool and being trapped in state 'BeginRequest' in RewriteModule.
It is not a specific request which is always the first to experience this issue. The issue cannot be easily recreated either.
Eventually requests join this backlog and when it becomes 70+ deep the app pool fails to respond to pings from WAS and it forcibly recycles. Predictably it doesn't stop on-time and the old app pool is forced to stop. When the new app pool comes up it either works just fine or it instantly experiences the same issue as the outgoing one and requests begin to queue.
In issues like this all the official guidance is understandably focussed around looking at why the RewriteModule may choke.
I have validated my redirections and though complex there are no obvious issues with syntax (XML validates).
Likewise in inetmgr loading up the URL Rewrite Module seems to parse the configs fine and show them visually.
Basic stuff like permissions is all fine.
When the app is working normally I also used Failed Request Tracing/Logging to look at the request pipeline for a sample URL which stalled and I can confirm that there is no circular logic or weird errors presenting - the request seems to be handled just fine. This also showed me how high up the rewritemodule is invoked and from this I really don't see how the issue could be app-related as .NET isn't invoked at this point.
Annoyingly, when an app pool is experiencing this issue and I can throw in requests which just stall, Failed Request Tracing is no good, because a request actually needs to get to the end of its journey and fail; otherwise it refuses to log anything.
I resorted to taking process dumps of affected w3wp.exe's and running them through DebugDiag. Unfortunately, the only thing I see is that threads are open accessing the rewritemodule, with precious little about what they are stuck on.
As anyone else would do, I've tried to track the start of the issue back to any recently installed patches or code changes, but nothing matches. Likewise, this is happening on 3 servers; otherwise I would try reinstalling the rewritemodule. Other sites on the same server which invoke the rewritemodule are unaffected.
Has anyone else experienced issues like this - the net seems to have relatively little info in this case. Perhaps you can recommend further debugging tools or approaches for IIS which I can adapt to this scenario? This is sort of a cry for help from someone more used to Apache/Nginx - sorry for the long post.

Channel.Ping.Failed error Detected duplicate HTTP-based FlexSessions What's the root cause?

Hi
I've downloaded the Cairngorm3 Simple Sample Application from here.
There's a few steps.
a) Download the server-side zip. It contains a PDF explaining how to start a HSQLDB database and get a Tomcat instance up and running (I used catalina.sh start).
b) Check out the source with Subversion, and load it up into Flashbuilder 4. (You need Flex 3.4 SDK)
When I run the app (an Outlook like app written in Flex), I have issues at the point I try and save a contact. I'm assuming it's on a remoteobject call.
But I get this:
Send failed
faultCode:Client.Error.MessageSend faultString:'Send failed' faultDetail:'Channel.Ping.Failed error Detected duplicate HTTP-based FlexSessions, generally due to the remote host disabling session cookies. Session cookies must be enabled to manage the client connection correctly. url: 'http://localhost:8400/messagebroker/amf;jsessionid=5765DDDB6E2D54BD03D3E636B0E8C03E'''
I'm wondering if this is something you need to tweak in services-config.xml?
It is located in the flex-frameworks/tomcat/webapps/ROOT/WEB-INF/flex folder (flex-frameworks comes from the server-side zip download).
Anyone got any ideas?
This is Christophe Coenraets' baby.
I also subsequently found a blog post by Alexander Glosband, but couldn't ascertain from it, what you need to do as a work around. i.e. Is this something that is configurable?
The way to reproduce the error consistently is to try and activate the web camera from the app. Then instead of clicking accept, reload the screen. Then when you try and take a photo after subsequently granting access to camera, you get the duplicate session error.
I think there is an issue with the code pertaining to the Camera, that's not cleaning up after itself correctly, the session is probably not being tidied up correctly.
You are right, problem comes from services-config.xml. Change your url from
http://localhost:8400/messagebroker/amf
to
/messagebroker/amf
I found the solution here: send failed error
The video "Compiler EMBEDS channels, endpoints and destinations into SWF" covers this.
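One possible reading of why the relative URL helps, not confirmed by the posts here: a relative endpoint is resolved against the address the application was loaded from, so requests and session cookies stay on one host, whereas a hardcoded http://localhost:8400 endpoint can point at a different host name. The Python sketch below only illustrates how the two forms resolve; the page URL and the helper name are hypothetical.

```python
from urllib.parse import urljoin


def resolve_endpoint(page_url: str, endpoint: str) -> str:
    """Resolve a channel endpoint the way a relative URL is resolved."""
    return urljoin(page_url, endpoint)


# Hypothetical address the application is loaded from (IP instead of 'localhost').
page = "http://127.0.0.1:8400/app/main.html"

# A relative endpoint follows the host of the page.
print(resolve_endpoint(page, "/messagebroker/amf"))
# http://127.0.0.1:8400/messagebroker/amf

# An absolute endpoint keeps its own host, which differs from the page's host.
print(resolve_endpoint(page, "http://localhost:8400/messagebroker/amf"))
# http://localhost:8400/messagebroker/amf
```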

ASP.NET script combiner returns blank response at times

I am using a script manager for ASP.NET MVC to combine and compact CSS files and JavaScript files for pages on a website. For the most part this works as expected, however there are times (couple of times per week) when the HTTP handler responsible for returning the content returns an empty response and so pages load without any CSS - the HTML returns and the images load as well. When this happens, refreshing the page does not resolve the problem, while resetting IIS always resolves the problem. Also, without resetting IIS, after some time the problem stops.
Normally the script handler logs errors, however there are no errors logged during the issue. It seems as if the handler is never invoked. There are no failed request logs generated by IIS.
I monitored resource usage while this was happening and did not notice anything out of the ordinary. The web server is running IIS 7 and has low CPU usage. I increased some parameters in the IIS settings for the number of requests it is allowed to process; the problem still exists, though it is perhaps less frequent. The website receives about 1.5 million pageviews monthly.
This was found to be an issue with HTTP cache headers.

Validation of viewstate MAC failed

Ran into this issue yesterday on one of our sites. First of all the site is hosted in a web farm environment and for the time being I have added a static machineKey to the web.config on both nodes (2 node environment). This has solved the issue and everything is running fine now.
This raised the following question:
Why is it that all our other sites that run in this environment do not require this (machineKey in the web.config)?
I checked event logs to make sure that we are not having the same issue on other sites and everything looks fine. I also confirmed that the app pool is not recycling too often and everything was normal with regards to app pool settings.
The only explanation I can come up with is that the site is rendered by one node and subsequent postbacks go to another node, which would lead me to believe that the problem lies with the load balancer. Our infrastructure guys tell me that everything is as it should be with regard to the load balancer and that the scenario I am proposing will not happen.
Am I missing the obvious here, or is there anything else that I can consider?
Thanks in advance
Basically, yes, you're right - you generally see this in a web farm environment when "Sticky Sessions" aren't properly configured in the load balancer, and the user's postback is sent to a different server.
To be fair to your network guys, it's possible that most requests are being sent to one server, but that this application is tipping the usage such that requests are often sent to another server - but you should be seeing that across all sites, unless the traffic patterns are completely different.
The other possible causes are that your page is taking too long to load and users are posting back before it has completely finished loading (I'd managed to get one of my sites doing that with a couple of remote advert calls buried halfway through the page load), or that users are waiting too long between page render and postback, so the session on the load balancer times out and it treats the postback as a new request.
If you are working in a web farm environment, machine key values, if specified in the web.config, need to be synced across nodes. In addition, you will want to make sure that the machine key values in the machine.config file are also synced between the two.
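A simplified illustration of why mismatched keys produce this error: a message authentication code computed with one node's key does not verify against another node's key. The Python sketch below uses a generic HMAC-SHA256 and is not ASP.NET's actual view state algorithm; the key values, payload, and function names are made up.

```python
import hashlib
import hmac


def sign(payload: bytes, key: bytes) -> bytes:
    """Compute a MAC over the payload with the given key."""
    return hmac.new(key, payload, hashlib.sha256).digest()


def verify(payload: bytes, mac: bytes, key: bytes) -> bool:
    """Check the MAC using a constant-time comparison."""
    return hmac.compare_digest(sign(payload, key), mac)


# Made-up keys standing in for each node's machine key.
node_a_key = b"key-generated-on-node-a"
node_b_key = b"key-generated-on-node-b"
shared_key = b"static-key-configured-on-both-nodes"

view_state = b"serialized page state"

# Page rendered on node A, postback handled by node B with a different key.
mac = sign(view_state, node_a_key)
print(verify(view_state, mac, node_b_key))  # False

# Same round trip with one static key configured on both nodes.
mac = sign(view_state, shared_key)
print(verify(view_state, mac, shared_key))  # True
```

With sticky sessions working, the first case rarely arises because the postback returns to the node that signed the page; a shared static key makes the check pass regardless of which node handles the postback.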

Resources