Using Varnish for HTTPS - or nginx?

Here's the situation. I have clients over a secured network (HTTPS) that talk to multiple backends. I want to set up a reverse proxy, mainly for load balancing (based on header data or cookies) and a little caching, so I thought Varnish could be of use.
But Varnish does not support SSL connections. As I've read in many places: "Varnish does not support SSL termination natively". However, I want every connection, i.e. client-to-Varnish and Varnish-to-backend, to be over HTTPS. I cannot have plaintext data anywhere on the network (there are restrictions), so nothing else can be used as an SSL terminator (or can it?).
So, here are the questions:
Firstly, what does it mean (in simple terms) that "Varnish does not support SSL termination natively"?
Secondly, is this scenario a good fit for Varnish?
And finally, if Varnish is not a good contender, should I switch to some other reverse proxy? If so, which would suit this scenario (HAProxy, nginx, etc.)?

What does it mean (in simple terms) that "Varnish does not support SSL termination natively"?
It means Varnish has no built-in support for SSL. It can't operate in a path with SSL unless the SSL is handled by separate software.
This is an architectural decision by the author of Varnish, who discussed his contemplation of integrating SSL into Varnish back in 2011.
He based this on a number of factors, not the least of which was wanting to do it right if at all, while observing that the de facto standard library for SSL is openssl, which is a labyrinthine collection of over 300,000 lines of code, and he was neither confident in that code base, nor in the likelihood of a favorable cost/benefit ratio.
His conclusion at the time was, in a word, "no."
That is not one of the things I dreamt about doing as a kid and if I dream about it now I call it a nightmare.
https://www.varnish-cache.org/docs/trunk/phk/ssl.html
He revisited the concept in 2015.
His conclusion, again, was "no."
Code is hard, crypto code is double-plus-hard, if not double-squared-hard, and the world really don't need another piece of code that does an half-assed job at cryptography.
...
When I look at something like Willy Tarreau's HAProxy I have a hard time to see any significant opportunity for improvement.
No, Varnish still won't add SSL/TLS support.
Instead in Varnish 4.1 we have added support for Willys PROXY protocol which makes it possible to communicate the extra details from a SSL-terminating proxy, such as HAProxy, to Varnish.
https://www.varnish-cache.org/docs/trunk/phk/ssl_again.html
This enhancement could simplify integrating Varnish into an environment with encryption requirements, because it provides another mechanism for preserving the original client's address in an offloaded SSL setup.
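As a minimal sketch of that arrangement (assuming HAProxy terminates TLS on the same machine and Varnish is given an extra, hypothetical loopback listener on port 6086 for PROXY-protocol traffic; certificate paths and names are placeholders):

    # haproxy.cfg (sketch): terminate TLS, forward to Varnish with the PROXY protocol
    frontend https_in
        bind *:443 ssl crt /etc/haproxy/certs/example.com.pem   # placeholder certificate
        mode http
        default_backend varnish

    backend varnish
        mode http
        # send-proxy-v2 prepends the PROXY header, so Varnish learns the real client address
        server varnish_local 127.0.0.1:6086 send-proxy-v2

    # varnishd (sketch): a second listen endpoint marked as speaking the PROXY protocol
    # varnishd -a 127.0.0.1:6081 -a 127.0.0.1:6086,PROXY -f /etc/varnish/default.vcl

With this in place, VCL and varnishlog see the original client IP rather than 127.0.0.1.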
is this scenario good to implement using varnish?
If you need Varnish, use it, being aware that SSL must be handled separately. Note, though, that this does not necessarily mean unencrypted traffic has to traverse your network... though it does make for a more complicated and CPU-hungry setup.
nothing else can be used as an SSL terminator (or can it?)
SSL can be offloaded on the front side of Varnish and re-established on the back side of Varnish, all on the same machine that runs Varnish, but by separate processes, using HAProxy or stunnel or nginx or other solutions, in front of and behind Varnish. Any traffic in the clear is confined to that one host, so it is arguably not a point of vulnerability if the host itself is secure, since it never leaves the machine.
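A rough sketch of that "sandwich", here using a single hypothetical HAProxy instance on the Varnish machine for both sides (host names, loopback ports and certificate paths are placeholders); the flow is client -> HAProxy (TLS) -> Varnish (plain HTTP on loopback) -> HAProxy -> backend (TLS again):

    # haproxy.cfg (sketch)
    # Front side: terminate client TLS, hand plain HTTP to Varnish over loopback.
    frontend client_tls_in
        bind *:443 ssl crt /etc/haproxy/certs/example.com.pem    # placeholder certificate
        mode http
        default_backend varnish

    backend varnish
        mode http
        server varnish_local 127.0.0.1:6081        # Varnish's default listen address

    # Back side: accept plain HTTP from Varnish on loopback, re-encrypt to the origin.
    frontend from_varnish
        bind 127.0.0.1:8088
        mode http
        default_backend origin_tls

    backend origin_tls
        mode http
        server origin1 backend.example.com:443 ssl verify required ca-file /etc/ssl/certs/ca-certificates.crt

    # Varnish's backend definition would then point at 127.0.0.1:8088 instead of the origin.

stunnel or nginx could play either role instead; the only unencrypted hops are over the loopback interface.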
if Varnish is not a good contender, should I switch to some other reverse proxy?
This is entirely dependent on what you want and need in your stack, its cost/benefit to you, your level of expertise, the availability of resources, and other factors. Each option has its own set of capabilities and limitations, and it's certainly not unheard-of to use more than one in the same stack.

Related

Advanced HTTP/2 proxy for load balancing of distributed scraping solution

I have built a distributed HTTP scraper solution that, by design, uses different "exit addresses" in order to balance the network load.
The solution supports IPv4, IPv6 and HTTP proxies for routing the traffic.
Each processor was responsible for defining the most efficient route to balance the traffic, and this was implemented manually as a temporary measure for prototyping. The solution is now growing, and as the number of processors increases, the load-balancing task gets more complex, which is why I need to create a dedicated component for it.
I did some rather extensive research, but seem to have failed to find a solution for load balancing traffic between IPv6 and IPv4 addresses (thousands of local addresses) and public HTTP proxies. The solution needs to support weights, app-level response checks and cool-down periods.
Does anyone know of a solution that already solves this problem, before I start developing a custom one?
Thanks for your help!
If you search for "load balancing proxy" you'll discover the Cache Array Routing Protocol (CARP). CARP might not be exactly what you're looking for, and there are also servers dedicated purely to proxy caching, which I never knew until now.
Nevertheless, those servers have their own load balancers too, and perhaps that's a detail worth researching further.
I also found a presentation that mentions CARP as an outstanding solution: https://cs.nyu.edu/artg/internet/Spring2004/lectures/lec_8b.pdf
Example: proxy arrays in the Netra Proxy Cache Server: https://docs.oracle.com/cd/E19957-01/805-3512-10/6j3bg665f/index.html
There are also several concepts for load balancing (https://link.springer.com/article/10.1023/A:1020943021842):
The three proposed methods can broadly be divided into centralized and decentralized approaches. The centralized history (CH) method makes use of the transfer rate of each request to decide which proxy can provide the fastest turnaround time for the next job. The route transfer pattern (RTP) method learns from the past history to build a virtual map of traffic flow conditions of the major routes on the Internet at different times of the day. The map information is then used to predict the best path for a request at a particular time of the day. The two methods require a central executive to collate information and route requests to proxies. Experimental results show that self-organization can be achieved (Tsui et al., 2001). The drawback of the centralized approach is that a bottleneck and a single point of failure is created by the central executive. The decentralized approach - the decentralized history (DH) method - attempts to overcome this problem by removing the central executive and putting a decision maker in every proxy (Kaiser et al., 2000b) regarding whether it should fetch a requested object or forward the request to another proxy.
Since you use public proxy servers, you probably won't use the decentralized history (DH) method, but rather centralized history (CH) or the route transfer pattern (RTP).
Perhaps it would even be useful to replace your own solution completely, e.g. with this: https://github.blog/2018-08-08-glb-director-open-source-load-balancer/. I have no particular reason for choosing this example; it's just one of the search results I found.
As I don't work with proxy servers, this post is just a collection of findings, but perhaps there is a useful detail in it for you. If not, never mind; you probably know most or all of this already. Note that I'm not recommending any one concrete solution.
Have you checked out this project? https://Traefik.io supports HTTP/2 and TCP load balancing. The project is open source, available on GitHub, and built using Go. I'm using it now as my reverse proxy with load balancing for almost everything.
I also wrote a small blog post on Docker and Go where I showcase the usage of Traefik, which might also help you in your search: https://marcofranssen.nl/docker-tips-and-tricks-for-your-go-projects/
In the Traefik code base you might find your answer, or you might decide to use Traefik to achieve your goal instead of a home-grown solution.
See here for a nice explanation of the soon-to-arrive Traefik 2.0 with TCP support:
https://blog.containo.us/back-to-traefik-2-0-2f9aa17be305

HTTP/2 domain sharding without hurting performance

Most articles treat domain sharding as hurting performance, but that's not entirely true. A single connection can be reused for different domains under certain conditions:
they resolve to the same IP
in the case of a secure connection, the same certificate must cover both domains
https://www.rfc-editor.org/rfc/rfc7540#section-9.1.1
Is that correct? Is anyone using it?
And what about CDNs? Can I have any guarantee that they direct a user to the same server (IP)?
Yup, that's one of the benefits of HTTP/2: in theory it allows you to keep sharding for HTTP/1.1 users and automatically unshard for HTTP/2 users.
The reality is a little more complicated, as always - due mostly to implementation issues and to servers resolving to different IP addresses, as you state. This blog post is a few years old now but describes some of the issues: https://daniel.haxx.se/blog/2016/08/18/http2-connection-coalescing/. Maybe it's improved since then, but I would imagine issues still exist. Also, new features like the ORIGIN frame should help, but they are not widely supported yet.
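If you want to check the two preconditions from the question yourself for a hypothetical pair of hosts a.example.com and b.example.com, something along these lines works (plain dig and openssl; exact output formatting varies by OpenSSL version):

    # Do both names resolve to the same IP address?
    dig +short a.example.com
    dig +short b.example.com

    # Does the certificate served for the first host also list the second one
    # in its Subject Alternative Names?
    echo | openssl s_client -connect a.example.com:443 -servername a.example.com 2>/dev/null \
      | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'

If both answers check out, an HTTP/2 browser is allowed to coalesce the two hosts onto one connection, though as noted above, implementations differ in whether they actually do.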
I think, however, it's worth revisiting the assumption that sharding is actually good for HTTP/1.1. The costs of setting up new connections (DNS lookup, TCP setup, TLS handshake and then actually sending the HTTP messages) are not immaterial, and studies have shown the six-connection browser limit is rarely even fully used, never mind the extra connections added by sharding. Concatenation, spriting and inlining are usually much better options, and these can still be used for HTTP/2. Trying it on your site and measuring is the best way to be sure!
Incidentally, it is for these reasons (and security) that I'm less keen on using common libraries (e.g. jQuery, Bootstrap, etc.) from their CDNs rather than hosting them locally. In my opinion, the performance benefit of a user already having the version your site uses cached is overstated.
With all these things, HTTP/1.1 will still work without sharded domains. It may (arguably) be slower, but it won't break. Most users are likely on HTTP/2, so is it really worth adding the complexity for the minority of users? Is this not a way of progressively enhancing your site for people on modern browsers (and encouraging those who aren't to upgrade)? For larger sites (e.g. Google, Facebook, etc.) the minority may still represent a large number of users, and the complexity is worth it (and they have the resources and expertise to deal with it); for the rest of us, my recommendation is not to shard, to upgrade to new protocols like HTTP/2 when they become common (like now!), and otherwise to keep complexity down.

How to add HTTP/2 in G-WAN

I would like to know if it's possible to make G-WAN 100% compatible with HTTP/2 by using, for example, nghttp2 (https://nghttp2.org).
Sorry for the late answer - for some reason Stack Overflow did not notify us about this question, and I found it only because a more recent one was notified.
I have not looked at this library, so I can't tell for sure whether it can be used without modifications, but it could certainly be used as the basis of an event-based G-WAN protocol handler.
But, from a security point of view, there are severe issues with HTTP/2, and this is why we have not implemented it in G-WAN: HTTP/2 lets different servers use the same TCP connection - even if they weren't listed in the original TLS certificate.
That may be handy for legitimate applications, but it's a problem for security: DoH (DNS over HTTP/2) prevents users from blocking (or even detecting) unwanted hosts at the traditional DNS-request level (the "hosts" file in various operating systems).
In fact, this new HTTP standard defeats the purpose of SSL certificates, and defeats domain-name monitoring and blacklisting.
Is it purely a theoretical threat?
Google ads have been used in the past to inject malware designed to attack both the client and server sides.

Why is HTTP far more used than HTTPS?

I hope every reason gets mentioned. I think performance is the main one, but I hope everyone will mention what they know about this.
Please explain things thoroughly; I'm still a beginner.
Thanks in advance :)
It makes pages load slower, at least historically. Nowadays this may not be so relevant.
It's more complex for the server admin to setup and maintain, and perhaps too difficult for the non-professional.
It's costly for small sites to get and regularly renew a valid SSL certificate from the SSL certificate authorities.
It's unnecessary for most of your web browsing.
It suppresses the HTTP_REFERER field (when navigating from an HTTPS page to an HTTP one), so sites can't tell where you've come from. Good for privacy, bad for web statistics analysis, advertisers and marketing.
Edit: I forgot that you also need a separate IP address for each domain using SSL. This is incompatible with name-based virtual hosting, which is widely used for cheap shared web hosting. This might become a non-issue if/when IPv6 takes off, but it makes it impossible for every domain to have SSL using IPv4. (TLS Server Name Indication has since removed much of this restriction, although very old clients don't support it.)
HTTPS is more expensive than plain HTTP:
Certificates issued by a trusted issuer are not free
TLS/SSL handshake costs time
TLS/SSL encryption and compression take time and additional resources (the same applies to decryption and decompression)
But I guess the first point is the main reason.
Essentially it's as Gumbo posts. But given the advances in the power of modern hardware, there's an argument that there's no longer any reason not to use HTTPS.
The biggest barrier is the trusted certificate. You can go self-signed, but then all visitors to your site get an "untrusted certificate" warning. The traffic will still be encrypted, and it is no less secure, but big certificate warnings can put potential visitors off.
I may be stating the obvious, but not all content needs transport-layer security.

HTTPS instead of HTTP?

I'm new to web security.
Why would I want to use HTTP and then switch to HTTPS for some connections?
Why not stick with HTTPS all the way?
There are interesting configuration improvements that can make SSL/TLS less expensive, as described in this document (apparently based on work from a team from Google: Adam Langley, Nagendra Modadugu and Wan-Teh Chang): http://www.imperialviolet.org/2010/06/25/overclocking-ssl.html
If there's one point that we want to communicate to the world, it's that SSL/TLS is not computationally expensive any more. Ten years ago it might have been true, but it's just not the case any more. You too can afford to enable HTTPS for your users.
In January this year (2010), Gmail switched to using HTTPS for everything by default. Previously it had been introduced as an option, but now all of our users use HTTPS to secure their email between their browsers and Google, all the time. In order to do this we had to deploy no additional machines and no special hardware. On our production frontend machines, SSL/TLS accounts for less than 1% of the CPU load, less than 10KB of memory per connection and less than 2% of network overhead. Many people believe that SSL takes a lot of CPU time and we hope the above numbers (public for the first time) will help to dispel that.
If you stop reading now you only need to remember one thing: SSL/TLS is not computationally expensive any more.
One false sense of security comes from using HTTPS only for login pages: it leaves the door open to session hijacking (admittedly, it's still better than sending the username/password in the clear); this has recently been made easier to do (or at least more popular) by tools such as Firesheep, although the problem itself has been around for much longer.
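To make that concrete: the session cookie issued after an HTTPS login will be sent along with every subsequent plain-HTTP request unless it is flagged Secure, and that is exactly what such tools sniff. A hypothetical response header with the relevant flags (Secure keeps the cookie off plain HTTP, HttpOnly keeps it away from scripts):

    Set-Cookie: session_id=<opaque-token>; Secure; HttpOnly; Path=/

Of course, once the cookie is Secure-only, the logged-in pages themselves have to be served over HTTPS, which is why serving the whole session over HTTPS ends up being the practical fix.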
Another problem that can slow down HTTPS is that some browsers might not cache the content they retrieve over HTTPS, so they would have to download it again (e.g. background images for the sites you visit frequently).
That being said, if you don't need the transport security (preventing attackers from seeing or altering the data that's exchanged, in either direction), plain HTTP is fine.
If you're not transmitting data that needs to be secure, the overhead of HTTPS isn't necessary.
Check this SO thread for a very detailed discussion of the differences.
HTTP vs HTTPS performance
Mostly performance reasons. SSL requires extra (server) CPU time.
Edit: However, this overhead is becoming less of a problem these days; some big sites have already switched to HTTPS by default (e.g. GMail - see Bruno's answer).
And one more important thing: the firewall. Don't forget that HTTPS is usually served on port 443.
In some organizations that port is not opened in the firewall or on transparent proxies.
HTTPS can be very slow, and unnecessary for things like images.
