Buffered uploading via HTTP proxy

I am trying to solve an issue with uploads to our web infrastructure.
When a user uploads media to our site, it is proxied (via our Web Proxy tier) to a Java backend with a limited number of threads. When a user has a slow connection or a large upload, this holds one of the Java threads open for a long period of time, reducing overall capacity.
To mitigate this I'd like to implement an 'upload proxy' which will accept the entire HTTP POST data of the upload and only proxy that POST to the Java backend once it has received all of the data, pushing the problem of the upload thread being held open onto an HTTP proxy.
Initially I found Apache Traffic Server has a 'buffer_upload' plugin, but it seems a bit bleeding edge and has no support for regex in URLs, although it would solve most of my issues.
Does anyone know a proxy product that would be able to do what I am suggesting (aside from Apache Traffic Server)?
I see that Nginx has fairly detailed buffer settings for proxying, but it doesn't seem (from the docs/explanations) to wait for the whole POST before opening a backend connection/thread. Do I have this right?
Cheers,
Tim

Actually, nginx always buffers the request body in full before opening a connection to the backend. It is response buffering that can be turned off, either with the proxy_buffering directive or per response by having the backend send an X-Accel-Buffering header.
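For reference, a minimal sketch of the kind of nginx configuration this implies. The directive names are from the nginx docs; the upstream address, location and buffer sizes are hypothetical and would need tuning for your workload:

    http {
        upstream java_backend {
            server 10.0.0.10:8080;            # hypothetical backend address
        }

        server {
            listen 80;

            location /upload {
                # nginx reads the whole request body before contacting the upstream;
                # bodies larger than the buffer spill to a temporary file on disk
                client_max_body_size    512m;
                client_body_buffer_size 1m;

                # response buffering could be disabled here with "proxy_buffering off;"
                # or per response via an X-Accel-Buffering header sent by the backend

                proxy_pass http://java_backend;
            }
        }
    }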

Related

Nginx: capture POST requests when upstream is offline

I'm using Nginx as a reverse proxy for a Ruby on Rails application.
The application has 2 critical endpoints that are responsible for capturing data from customers who're registering their details with our service. These endpoints take POST data from a form that may or may not be hosted on our website.
When our application goes down for maintenance (rare, but we have a couple of SPOF services), I would like to ensure the POST data is captured so we don't lose it forever.
Nginx seems like a good place to do this given that it's already responsible for serving requests to the upstream Rails application, and has a custom vhost configuration in place that serves a static page for when we enable maintenance mode. I figured this might be a good place for additional logic to store these incoming POST requests.
The issue I'm having is that Nginx doesn't parse POST data unless you're pointing it at an upstream server. In the case of our maintenance configuration, we're not; we're just rendering a maintenance page. This means that $request_body¹ is empty. We could perhaps get around this by faking a proxy server, or maybe even pointing Nginx at itself and enabling the logger on a particular location. This seems hacky though.
Am I going about this the wrong way? I've done some research and haven't found a canonical way to solve this use-case. Should I be using a third-party tool and not Nginx?
1: from ngx_http_core_module: "The variable’s value is made available in locations processed by the proxy_pass, fastcgi_pass, uwsgi_pass, and scgi_pass directives when the request body was read to a memory buffer."
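For what it's worth, a rough sketch of the "point Nginx at itself" idea the question mentions (and, as the question says, it is hacky). The ports, paths and log format name are hypothetical; note that $request_body is only populated when the body was read into the memory buffer, so client_body_buffer_size has to be large enough:

    log_format postdata '$time_iso8601 $remote_addr $request_uri $request_body';

    server {
        listen 80;

        location /register {
            # proxying (even to a local sink) makes nginx read the body,
            # which populates $request_body for this access_log
            client_body_buffer_size 128k;
            access_log /var/log/nginx/captured_posts.log postdata;
            proxy_pass http://127.0.0.1:8999;
        }
    }

    # local sink that just answers with the maintenance status
    server {
        listen 127.0.0.1:8999;
        return 503;
    }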

When implementing a web proxy, how should the server report lower-level protocol errors?

I'm implementing an HTTP proxy. Sometimes when a browser makes a request via my proxy, I get an error such as ECONNRESET, "address not found", and the like. These indicate errors below the HTTP level. I'm not talking about bugs in my program, but about how other servers behave when I send them an HTTP request.
Some servers might simply not exist, others close the socket, and still others never answer at all.
What is the best way to report these errors to the caller? Is there a standard method such that, if I use it, browsers will convert my HTTP response into an appropriate error message (i.e. they get a reply from the proxy that tells them ECONNRESET, and they act as though they had received the ECONNRESET themselves)?
If not, how should it be handled?
Motivations
I really want my proxy to be totally transparent and for the browser or other client to work exactly as if it wasn't connected to it, so I want to replicate the organic behavior of errors such as ECONNRESET instead of sending an HTTP message with an error code, which would be totally different behavior.
I kind of thought that was the intention when writing an HTTP proxy.
There are several things to keep in mind.
Firstly, if the client is configured to use the proxy (which actually I'd recommend) then fundamentally it will behave differently than if it were directly connecting out over the Internet. This is mostly invisible to the user, but affects things like:
FTP URLs
some caching differences
authentication to the proxy if required
reporting of connection errors etc <= your question.
In the case of reporting errors, a browser will show its own connectivity error page if it can't connect to the proxy or open a tunnel via the proxy, but for upstream errors it is the proxy that provides the error page (depending on the error; e.g. if part of a response has already been sent, the proxy can't do much but close the connection). This page won't look anything like the browser's built-in error page.
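To make that concrete, here is a purely illustrative sketch (Python, not from any particular product) of the conventional behaviour: the proxy folds socket-level failures into the generic gateway statuses (502/504) and serves its own error page, since the original error can't be replayed to the browser:

    import errno
    import socket

    def status_for_upstream_error(exc):
        """Map a failure while talking to the origin server to an HTTP status."""
        if isinstance(exc, socket.timeout):
            return 504, "Gateway Timeout"
        if isinstance(exc, socket.gaierror):             # DNS / address not found
            return 502, "Bad Gateway"
        if isinstance(exc, OSError) and exc.errno in (errno.ECONNRESET,
                                                      errno.ECONNREFUSED,
                                                      errno.EHOSTUNREACH):
            return 502, "Bad Gateway"
        return 502, "Bad Gateway"                        # anything else below HTTP

    def proxy_error_response(exc):
        """Build the raw HTTP response the proxy sends in place of the origin's reply."""
        code, reason = status_for_upstream_error(exc)
        body = ("<html><body><h1>%d %s</h1>"
                "<p>The proxy could not reach the origin server.</p>"
                "</body></html>" % (code, reason)).encode()
        head = ("HTTP/1.1 %d %s\r\n"
                "Content-Type: text/html\r\n"
                "Content-Length: %d\r\n"
                "Connection: close\r\n\r\n" % (code, reason, len(body))).encode()
        return head + body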
If the browser is NOT configured to use a proxy, then you would need to divert or intercept the connection to the proxy. This can cause problems if you decide you want to authenticate your users against the proxy (to identify them / implement user-specific rules etc).
Secondly, HTTPS can be a real pain in the neck. This problem is growing as more and more sites move to HTTPS only. There are several issues:
for HTTPS URLs, browsers configured to use a proxy will first open a tunnel via the proxy using the CONNECT method. If your proxy wants to block this, any information it provides in the block response is ignored by the browser, and instead you get the generic browser connectivity error page.
if you want to provide any of the other benefits one normally wants from a proxy (e.g. caching / scanning etc.) you need to implement a MitM (man in the middle) and spoof server SSL certificates etc. In fact you need to do this even if you just want to send back a block page to deny things.
There is a way for a browser to behave a bit more as if it were directly connected while still going via a proxy, and that's SOCKS. SOCKS has a way to return an error code if there's an upstream connection error; it's not the actual socket error code, however.
These are all reasons why we wrote the WinGate Internet Client, an LSP-based component of our WinGate product. Client applications then learn the actual upstream error codes etc.
It's not a favoured approach nowadays though, as it requires installation of software on the client computer.
I wouldn't provide them too much info. Report what you need through internal logs in case you have to solve the problem, and return a 400, 403 or 418. Why? Perhaps they're just hacking.

HTTP vs HTTPS from a developer's view

I need to build a Web site which would have a secure connection (HTTPS) on some pages. Will there be a difference for me (as a developer) when I write the code? Do I have to treat some data differently, or what? What is the main difference from the back-end view?
From the backend point of view, there is no difference. The difference between the two is the connection between the server and the client: HTTPS traffic is encrypted and HTTP is not, of course, but it's all decrypted by the time it hits your code. The server will have some flags available so you can determine whether the connection is HTTP or HTTPS (names vary depending on the server), but unless you're using that information to change the behavior of the page, you don't need to worry about it.
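The question doesn't name a stack, but as a small illustration (Python/WSGI here; the flag names differ per server, and behind a TLS-terminating proxy you would usually look at an X-Forwarded-Proto header instead):

    def app(environ, start_response):
        # 'wsgi.url_scheme' is 'https' when the request arrived over TLS;
        # a front-end proxy often forwards the original scheme in X-Forwarded-Proto
        scheme = environ.get('HTTP_X_FORWARDED_PROTO', environ['wsgi.url_scheme'])
        body = ("This page was requested over %s\n" % scheme).encode()
        start_response('200 OK', [('Content-Type', 'text/plain'),
                                  ('Content-Length', str(len(body)))])
        return [body]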

CGI to Handle Multiple Requests on a Persistent HTTP Connection

CGI programs typically get a single HTTP request.
HTTP 1.1 supports persistent HTTP connections whereby multiple HTTP requests/responses are made w/o closing the connection.
Is there a way for a CGI program (or similar mechanism) to handle multiple HTTP requests/responses on the same connection?
I am using Apache httpd.
Keep-alives are one of the higher-level HTTP features that are wholly dealt with by the web server. They are out of scope for CGI applications themselves.
Accessing CGI scripts through Apache mod_cgi works with keep-alive for me. The browser re-uses the same TCP connection to fetch the page and then resources referred to by it, without the scripts in question having to do anything special.
If you mean you would like to have the same CGI process handle one request and then the next (instead of the process ending and a new one being spawned), then I'm afraid that's not possible. The web server will intercept keep-alives and make them look like single requests before your scripts can do anything about it. (If you want to do that to improve performance, consider a different gateway interface, such as FastCGI or language-specific options like WSGI.)
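As a tiny illustration of that alternative (using Python's built-in wsgiref server purely as a stand-in for a FastCGI/WSGI deployment): one long-lived process handles request after request, so state survives between them, which is exactly what plain CGI cannot give you:

    from wsgiref.simple_server import make_server

    requests_seen = 0    # survives between requests because the process does

    def app(environ, start_response):
        global requests_seen
        requests_seen += 1
        body = ("request #%d handled by the same process\n" % requests_seen).encode()
        start_response('200 OK', [('Content-Type', 'text/plain'),
                                  ('Content-Length', str(len(body)))])
        return [body]

    if __name__ == '__main__':
        make_server('127.0.0.1', 8000, app).serve_forever()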
SCGI sounds exactly like what you want. It is similar to FastCGI but a simpler solution to implement (the S stands for Simple :)).

HTTP Proxy/FastCGI/SCGI not closing connection when client disconnected - bug or feature?

I'm working on Comet support for the CppCMS framework via long XMLHttpRequest polls. In many cases, such a request is closed by the client before any response from the server has been given, for example when the page is closed, the user moves to another page, or the page is just refreshed.
On the server side I expect to receive a notification that the connection has been dropped. I tested the application via three connectors: FastCGI, SCGI and a simple HTTP proxy.
Of the three major UNIX web servers (Apache2, Lighttpd and Nginx), only the last one closed the connection as expected, allowing my application to remove the request from the wait queue; this worked for both the FastCGI and HTTP proxy connectors. (Nginx does not have an SCGI module by default.)
The others, Apache and Lighttpd, do not close the connection or inform the backend about disconnected clients; they proceed as if the client were still online. This happens for all three supported APIs: FastCGI, SCGI and HTTP proxy.
I have opened an issue for Lighttpd, but what concerns me more is the fact that Apache, a web server as mature and well supported as Lighttpd, also does not disclose to the backend that the client has gone.
Questions:
Is this a bug or is it a feature? Is there any reason not to close the connection between the web server and the application backend?
Are there real-life Comet applications working behind these servers via FastCGI/SCGI/HTTP-proxy backends?
If the above is true, how do they deal with this issue? I understand that I could time out all connections every 10 seconds, but I would like to keep them idle for as long as the client is listening, because this allows easier scale-up: each connection is very cheap, and the only cost is the open socket.
Thanks!
(1) Feature. Or, more specifically, fallout from an implementation detail.
A TCP/IP connection does not involve a constant flow of traffic back and forth. Thus, there is no way to know that a client is gone without (a) the client telling you it is closing the connection or (b) a timeout.
(2) I'm not specifically familiar with Comet or CppCMS. But, yes, there are all kinds of CMS servers running behind the mentioned web servers and they all have to deal with this issue (and, yes, it is a pain).
(3) Timeouts are the only way, but you can mitigate the pain, so to speak. Have the client ping the server across the connection every N seconds when there is otherwise no activity. The ping doesn't have to do anything, and you can tack stuff onto the reply: notifications of concurrent edits or whatever you need.
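A rough sketch of that bookkeeping (Python, not CppCMS; the registry, connection IDs and the 30-second limit are all hypothetical): every ping refreshes a last-seen timestamp, and a periodic sweep discards long-polls whose client has gone quiet:

    import time

    IDLE_LIMIT = 30      # seconds without a ping before a waiting poll is dropped
    waiting = {}         # conn_id -> last_seen timestamp

    def on_ping(conn_id):
        """Called whenever the client pings; the reply itself can carry anything."""
        waiting[conn_id] = time.time()

    def reap_idle_polls():
        """Run periodically: discard long-polls whose client has stopped pinging."""
        now = time.time()
        for conn_id, last_seen in list(waiting.items()):
            if now - last_seen > IDLE_LIMIT:
                del waiting[conn_id]     # close / discard the idle connection here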
You are correct in that it is surprising that mod_fastcgi doesn't support telling the backend that Apache has detected the disconnect or the connection timed out. And you aren't the first to be dismayed.
The second patch on this page should fix that particular issue:
http://osdir.com/ml/web.fastcgi.devel/2006-02/msg00015.html
http://ncannasse.fr/blog/tora_comet
I don't have any concrete information for you, but this article does mention that they can detect when the client has disconnected from Apache. See tora.Queue. And it sounds like the source is available in the neko CVS, so you might be able to find some clues there. Good luck.
