Block bad server requests while using the lowest server overhead - WordPress

Our web server is occasionally getting slammed with loads of requests for an Exchange server file that doesn't exist on our Apache web server:
autodiscover/autodiscover.xml
How can I respond to those requests in a way that requires the least load on our server?
(When these requests happen, our VPS memory usage spikes and goes over our RAM allocation.)
I want the server to respond, with the lowest overhead, telling them basically to go away; that file doesn't exist here; stop; bad request (you get the idea.)
Right now, the server returns a 404 error page, which means that our WordPress installation is invoked. It returns our custom 404 WordPress error page. That involves a lot of overhead that I'd like to avoid.
I suspect that these requests come from some sort of hacking attempts, but I'm just guessing at that. At any rate, I just want to intercept them and block or stop them as quickly and efficiently as possible.
(I've put IP blocks on the IP addresses they come from but I think that is just playing whack-a-mole.)
I've put this in our htaccess file but it doesn't do what I want:
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteRule ^autodiscover/autodiscover.xml$ "-" [forbidden,last]
</IfModule>
Is there something wrong with this rule?
Can I or should I use our htaccess file in another way to do what I want? Is there a better way to do it than using htaccess? Could we be returning something other than a 404 response? Perhaps a 400 or 403 response? How would we do that? We are on a VPS server to which I do not have direct access.

"No direct access" means you can't install software?
I'd totally recommend setting up nginx as a reverse proxy, but if that's out of the question because you have no local admin access: use a service like CloudFlare and its page rules (which runs the reverse proxy for you, and more).
You don't want anything unnecessary to reach Apache if you run the prefork MPM (which you probably do because of mod_php), as it will spawn lots of processes to handle parallel clients, and it is relatively slow at starting them. Additionally, once the storm is over, they'll be killed again (depending on your setting for MaxSpareServers).
Regarding MaxSpareServers, and a way to at least contain the damage this does:
If you run Apache's mpm_prefork (you might not be; MPM = Multi-Processing Module, basically which "engine" Apache uses; most PHP setups with mod_php run mpm_prefork), Apache needs a child process to handle each parallel request.
Now, when 100 requests come in at the same time, Apache will spawn 100 child processes (or fewer, if MaxRequestWorkers is set below 100) to handle them. That's VERY costly, because each of those processes needs to start and takes up memory, and it can result in a denial of service for all intents and purposes as your VPS starts to swap and everything slows down.
Once this request storm is over, Apache is left with 100 child processes and will start to remove some of them until it reaches whatever has been set as MaxSpareServers. If there are no spare servers, new requests will have to wait in line for a child to handle them. To users, that's basically "the server doesn't respond".
If you can't put something in front of Apache, can you change (or have somebody change) the Apache config? You will have some success if you limit MaxRequestWorkers, i.e. the maximum number of children Apache can spawn (and therefore the maximum number of parallel requests that can be handled). If you do, the server will still appear unresponsive at high load, but it won't affect the system as much, as your processes won't start to use swap instead of RAM and become terribly slow.
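For reference, a rough sketch of what those limits might look like in the main Apache config (not .htaccess) for mpm_prefork; the numbers are only placeholders to be tuned to the VPS's RAM:

<IfModule mpm_prefork_module>
    StartServers             2
    MinSpareServers          2
    MaxSpareServers          5
    MaxRequestWorkers       25
    MaxConnectionsPerChild 1000
</IfModule>

With MaxRequestWorkers capped like this, a request storm queues up instead of pushing the VPS into swap.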

Any attempt to Deny access to the file, whether via Deny or an [F] forbidden flag, results in WordPress kicking in and responding with a 404 error page.
To avoid this, I have created a custom 421 error page (which seemed most appropriate), placed it outside of the WordPress directory, and used an htaccess Redirect from the bad-request file name to the custom error page.
Redirect /autodiscover/autodiscover.xml /path/to/error-421-autodiscover-xml.php
This returns the custom error page (only 1KB) without WordPress being invoked. Naming the error page after the bad-request file (and including the error code in the name) should also help identify these calls in our log files.
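Such a page can stay tiny. A sketch of what error-421-autodiscover-xml.php might contain (assuming PHP 5.4+ for http_response_code):

<?php
// Minimal standalone page: send 421 with a tiny body, no WordPress involved.
http_response_code(421);
header('Content-Type: text/plain');
echo 'Misdirected Request';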
If you run an Exchange server elsewhere for email, it might be appropriate to redirect from the web server to the Exchange server using htaccess on the web server. An appropriate htaccess rule might be:
RedirectMatch 302 (?i)^/autodiscover/(.*)$ https://mail.example.com/autodiscover/$1
See: https://glennmatthys.wordpress.com/2013/10/14/redirect-autodiscover-urls-using-htaccess/
#janh is certainly right that a mechanism to intercept the request before it even hits our Apache server would be ideal, but it is beyond what we can do right now.
(The /autodiscover/autodiscover.xml seems to be a file used by Exchange servers to route traffic to the mail server. According to our web hosting company, it is not unusual for Apache servers to be hit with requests for this file. Some of these requests may come from malicious bots searching for access to a mail server; some may come from misconfigured smart phone email clients or other misdirected requests intended for an Exchange server.)

Related

Monitoring website visit length with Nginx logs

I am trying to build a solution to monitor my website visits without relying on cookies and third parties. Currently, by monitoring the access logs I can get enough useful information, but I am missing the length of the visits (i.e. to check whether people actually read what I write).
What would be a good strategy to monitor visit length with access logs? (I am using Nginx, but presumably the same ideas will be valid for Apache)
If it's not already part of your build, install the Nchan websockets module for Nginx.
Configure a websocket subscriber location directive on your Nginx server and specify nchan_subscribe_request and nchan_unsubscribe_request directives within it.
Insert a line of code into your page to establish a client connection to your websocket location upon page load.
That's it, done.
Now when I visit your page, my browser will connect to your Nginx/Nchan websocket server. Nginx will make an internal request to whatever address you set as the nchan_subscribe_request URL; you can pass my IP in the headers of this request, or whatever else you need to identify me. Log this in your main log or a separate log, pass it to an upstream server (PHP, Node), make a database entry, save my IP + timestamp in memcached, whatever.
Then when I leave the site my websocket connection will disconnect and Nginx will do the same thing but to the nchan_unsubscribe_request URL instead. Depending upon what you did when I connected you can now do whatever you need to do in order to work out how long I spent on your site.
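As a rough sketch of the Nginx side (the Nchan directive names are real; the /visit-start and /visit-end locations, the X-Visitor-IP header and the upstream port are only examples):

location ~ ^/track/(\w+)$ {
    nchan_subscriber websocket;            # clients connect here from the page
    nchan_channel_id $1;                   # e.g. one channel per page or visitor id
    nchan_subscribe_request /visit-start;
    nchan_unsubscribe_request /visit-end;
}

location = /visit-start {
    internal;
    proxy_set_header X-Visitor-IP $remote_addr;   # identify the visitor however you like
    proxy_pass http://127.0.0.1:8081/start;       # log it, write to a DB, etc.
}

location = /visit-end {
    internal;
    proxy_set_header X-Visitor-IP $remote_addr;
    proxy_pass http://127.0.0.1:8081/end;         # compute the visit length here
}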
As you now have a persistent connection to your clients you could take it a step further and include some code to monitor certain client behaviours or watch for certain events.
You are trying to determine whether or not people are reading what you write, so you could use a few lines of javascript to monitor how far down the page visitors have scrolled. Each time they scroll to a new maximum scroll position, send that data over the websocket back to your server.
Due to the disconnected nature of HTTP, your access log would probably not give you what you need.
I'm not totally familiar with nginx or Apache logs, but I think most logs contain a timestamp, the HTTP request (the document requested, the status, etc.) and an IP address.
Potential issues
1) Without a session cookie, all requests from the same IP address (same household, same company, etc.) would be seen as the same session.
2) If someone goes to your site (1 HTTP request), consumes content, doesn't proceed to another page, and leaves, your log will only contain that request (which is essentially a bounce, and you won't be able to calculate a duration). Two partial workarounds: 1) if your application makes a lot of javascript calls, you might be able to log them from the server-side application; 2) if you use a tool like GA, you can still use timers and javascript events (though not perfect) to tell GA that the session is still active. I'm not sure that works for typical server logs.
It might not be as big an issue if a typical visit contains more than 1 request, given that there is no easy way to get the duration after the last server request.

Tomcat http is not redirecting to https

I have two instances of Tomcat set up on two different servers. I didn't explicitly select which versions to install; they were both automatically installed as part of IBM Rational Team Concert installations (v5.0.1 and v5.0.2, one on each server), but I can say they are both at least version 7.
On the first instance, when I go to http://myserver.domain.com:9443/ccm, I get automatically redirected to https://myserver.domain.com:9443/ccm.
On the second instance, when I go to http://otherserver.domain.com:9443/ccm, I do not get redirected to https, but rather either get a strange download or get a blank page with an unrecognizable Unicode character (depending on the browser).
I notice that the two server.xml files are different (I am not sure why RTC made them that different between minor releases), but it is not obvious from looking at them what I have to set in the second server.xml to achieve the behavior present in the first. Port 9443 is set up as an HTTPS port. What do I set in server.xml to make all HTTP requests to that port automatically redirect to HTTPS?
Tomcat can't do what you are asking. There is no mechanism to detect that http is being used on an https port and redirect the user accordingly. This might be something we add in Tomcat 9 but that is very much just an idea at this stage.
Something other than Tomcat is performing the redirect you observe. Take a look at the HTTP headers - they might provide some clue as to what is going on.

Should I always use a reverse proxy for a web app?

I'm writing a web app in Go. Currently I have a layout that looks like this:
[CloudFlare] --> [Nginx] --> [Program]
Nginx does the following:
Performs some redirects (i.e. www.domain.tld --> domain.tld)
Adds headers such as X-Frame-Options.
Handles static images.
Writes access.log.
In the past I would use Nginx as it performed SSL termination and some other tasks. Since that's now handled by CloudFlare, all it does, essentially, is serve static images. Given that Go has a built-in HTTP FileServer and CloudFlare could take over handling static images for me, I started to wonder why Nginx is in front in the first place.
Is it considered a bad idea to put nothing in front?
In your case, you can possibly get away with not running nginx, but I wouldn't recommend it.
However, as I touched on in this answer, there's still a lot it can do that you'll need to "reinvent" in Go.
Content-Security headers
SSL (is the connection between CloudFlare and you insecure if they are terminating SSL?)
SSL session caching & HSTS
Client body limits and header buffers
5xx error pages and maintenance pages when you're restarting your Go application
"Free" logging (unless you want to write all that in your Go app)
gzip (again, unless you want to implement that in your Go app)
Running Go standalone makes sense if you are running an internal web service or something lightweight, or genuinely don't need the extra features of nginx. If you're building web applications then nginx is going to help abstract "web server" tasks from the application itself.
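If you do go standalone, a rough sketch of what two of those tasks (security headers and access logging) might look like as net/http middleware in Go; the handler names, header values and paths are made up for the example:

package main

import (
	"log"
	"net/http"
	"time"
)

// addHeaders sets response headers that nginx would otherwise add for us.
func addHeaders(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("X-Frame-Options", "DENY")
		w.Header().Set("Content-Security-Policy", "default-src 'self'")
		next.ServeHTTP(w, r)
	})
}

// logRequests writes one minimal access-log line per request.
func logRequests(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next.ServeHTTP(w, r)
		log.Printf("%s %s %s %v", r.RemoteAddr, r.Method, r.URL.Path, time.Since(start))
	})
}

func main() {
	mux := http.NewServeMux()
	mux.Handle("/static/", http.StripPrefix("/static/", http.FileServer(http.Dir("./static"))))
	log.Fatal(http.ListenAndServe(":8080", logRequests(addHeaders(mux))))
}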
I wouldn't use nginx at all, to be honest. Some nice dude tested Go over FastCGI behind nginx versus the plain standalone Go library. The results he came up with were quite interesting: standalone hosting seemed to be much better at handling requests than running behind nginx, and the final recommendation was that if you don't need specific features of nginx, don't use it. full article
You could run it standalone, and if you're using partial/full SSL on your site, you could use another Go HTTP server to redirect to the safe HTTPS routes.
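A minimal sketch of what that extra redirect server might look like in Go (the ports and certificate paths are placeholders):

package main

import (
	"log"
	"net/http"
)

func main() {
	// Small helper listener on :80 that sends every request to the HTTPS version of the same URL.
	go func() {
		redirect := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			http.Redirect(w, r, "https://"+r.Host+r.RequestURI, http.StatusMovedPermanently)
		})
		log.Fatal(http.ListenAndServe(":80", redirect))
	}()

	// The real application server on :443 (cert.pem / key.pem are placeholders).
	log.Fatal(http.ListenAndServeTLS(":443", "cert.pem", "key.pem", http.FileServer(http.Dir("./public"))))
}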
Don't use nginx if you do not need it.
Go does SSL in fewer lines than you would have to write in an nginx config file.
The only remaining reason is the "free" logging, but I wonder how many lines of code logging takes in Go.
There is a nice article in Russian about writing a reverse proxy in Go in 200 lines of code.
If Go can do what you would use nginx for, then nginx is not required when you use Go.
You do need nginx if you want to run several Go processes, or Go and PHP, on the same site.
Or if you use Go and have some problem that adding nginx fixes.

Setup a maintenance URL in a generic way

Suppose I have a website that is normally accessed at address www.mywebsite.com.
Now let's say the website is down completely (think: the server has melted). I want users trying to reach www.mywebsite.com to end up on a maintenance URL on another server instead of getting a 404.
Is this easily possible without having to route all the traffic through a dispatcher/load balancer?
I could imagine something like :
When the default server is UP traffic is like :
[USER]<---->[www.mywebsite.com]<---->[DISPATCHER]<---->[DEFAULT SERVER]
When the default server is DOWN traffic is like :
[USER]<---->[www.mywebsite.com]<---->[DISPATCHER]<---->[MAINTENANCE SERVER]
Where [DISPATCHER] figures out where to route the traffic. Problem is that in this scenario all the traffic goes through [DISPATCHER]. Can I make it so that the first connection goes through dispatcher, and then, if the default server is up, the traffic goes directly from the user to the default server? (with a check every 10 - 15 minutes for example)
[USER]<---->[www.mywebsite.com]<-------->[DEFAULT SERVER] after the first successful connection
Thanks in advance!
Unfortunately, maybe the most practical solution is to give up. Until browsers finally add support for SRV records....
You can achieve what you want with dynamic DNS - set up a monitoring script on a "maintenance server" that checks whether your website is down and, if it is, updates the DNS for your site to point it at the maintenance server. This approach has its own problems, the biggest of which is that any monitoring may generate false positives, and thus your users will see the maintenance page while the site is actually up.
Another possible approach (even worse): make www.example.com point to your dispatcher server, and www2.example.com to your main server. Then the dispatcher would HTTP-redirect all incoming requests to www2.example.com.
But what will you do when your dispatcher melts? While trying to handle one point of failure, you have just added another one.
Maybe it's practical to handle all page links with some javascript that checks whether the server is up first, and only then follows the link. This approach requires some scripting, but at least it gives the best results when your server goes down while the user is already on your site. It helps nothing, though, for those who try to enter the site for the first time.
If only browsers would support SRV records....

Redirecting http traffic to another server temporarily

Assume you have one box (a dedicated server) that's on 24/7 and several other boxes that are user machines with unused bandwidth. Assume you want to host several web pages. How can the dedicated server redirect HTTP traffic to the user machines? It is desirable that the address field in the web browser still displays the right address, and not an IP. I.e. I don't want to redirect to another web page; I want to tell the web browser that it should request the same web page from a different server. I have been browsing through the 3xx codes, and I don't think they are made for anything like this.
It should work somewhat along these lines:
1. Dedicated server is online all the time.
2. User machine starts and tells the dedicated server that it's online.
(several other user machines can do similarly)
3. Web browser looks up domain name and finds out that it points to dedicated server.
4. Web browser requests page.
5. Dedicated server tells web browser to repeat request to user machine
Is it possible to use some kind of redirect, and preferably tell the browser to keep sending further requests to the user machine? The user machine can close down at almost any point in time, but it is assumed that the user machine will wait for ongoing transactions to finish; no closing the server program in the middle of a GET or something.
What you want is called a proxy server or load balancer that would sit in front of your web server.
The web browser would always talk to the load balancer, and the load balancer would forward the request to one of several back-end servers. No redirect is needed on the client side, as the client always thinks it is just talking to the load balancer.
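A minimal sketch of such a setup, assuming nginx as the load balancer; the backend addresses are placeholders:

upstream backends {
    server 192.0.2.11;
    server 192.0.2.12;
}

server {
    listen 80;
    server_name www.example.com;

    location / {
        proxy_pass http://backends;              # nginx picks a backend for each request
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}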
ETA:
Looking at your various comments and re-reading the question, I think I misunderstood what you wanted to do. I was thinking that all the machines serving content would be on the same network, but now I see that you are looking for something more like a p2p web server setup.
If that's the case, using DNS and HTTP 30x redirects would probably be what you need. It would probably look something like this:
Your "master" server would serve as an entry point for the app, and would have a well known host name, e.g. "www.myapp.com".
Whenever a new "user" machine came online, it would register itself with the master server, and the master server would create or update a DNS entry for that user machine, e.g. "user123.myapp.com".
If a request came to the master server for a given page, e.g. "www.myapp.com/index.htm", it would do a 302 redirect to one of the user machines based on whatever DNS entry it had created for that machine - e.g. redirect them to "user123.myapp.com/index.htm".
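A rough sketch of the master server's side of that in Go; the /register endpoint, the in-memory host list and the random pick are all invented for illustration:

package main

import (
	"log"
	"math/rand"
	"net/http"
	"sync"
)

// Registry of user machines that have announced themselves, e.g. "user123.myapp.com".
var (
	mu    sync.RWMutex
	hosts []string
)

// register is called by a user machine when it comes online: /register?host=user123.myapp.com
func register(w http.ResponseWriter, r *http.Request) {
	host := r.URL.Query().Get("host")
	if host == "" {
		http.Error(w, "missing host", http.StatusBadRequest)
		return
	}
	mu.Lock()
	hosts = append(hosts, host)
	mu.Unlock()
	w.WriteHeader(http.StatusNoContent)
}

// redirect sends the visitor to one of the registered machines with a 302.
func redirect(w http.ResponseWriter, r *http.Request) {
	mu.RLock()
	defer mu.RUnlock()
	if len(hosts) == 0 {
		http.Error(w, "no user machines online", http.StatusServiceUnavailable)
		return
	}
	target := hosts[rand.Intn(len(hosts))]
	http.Redirect(w, r, "http://"+target+r.RequestURI, http.StatusFound)
}

func main() {
	http.HandleFunc("/register", register)
	http.HandleFunc("/", redirect)
	log.Fatal(http.ListenAndServe(":80", nil))
}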
Some problems I see with this approach:
First, once a user gets redirected to a user machine, if that machine goes offline it will seem like the app is dead. You could avoid this by having all the links on every page specifically point to "www.myapp.com" instead of using relative links, but then every single request has to be routed through the "master server", which would be relatively inefficient.
You could potentially solve this by changing the DNS entry for a user machine when it goes offline to point back to the master server, but that wouldn't work without an extremely short TTL.
Another issue you'll have is tracking sessions. You probably wouldn't be able to use sessions very effectively with this setup without a shared session state server of some sort accessible by all the user machines. Although cookies should still work.
In networking, load balancing is a technique to distribute workload evenly across two or more computers, network links, CPUs, hard drives, or other resources, in order to get optimal resource utilization, maximize throughput, minimize response time, and avoid overload. Using multiple components with load balancing, instead of a single component, may increase reliability through redundancy. The load balancing service is usually provided by a dedicated program or hardware device (such as a multilayer switch or a DNS server).
and more interesting stuff in here
Apart from load balancing, you will need to set up a more or less similar environment on the user machines.
This sounds like 1 part proxy, 1 part load balancer, and about 100 parts disaster.
If I had to guess, I'd say you're trying to build some type of relatively anonymous torrent... But I may be wrong. If I'm right, HTTP is entirely the wrong protocol for something like this.
You could use DNS. Off the top of my head, you could set up a hostname for each machine that is going to serve users:
www    IN A xxx.xxx.xxx.xxx    ; IP address of machine 1
www    IN A xxx.xxx.xxx.xxx    ; IP address of machine 2
www    IN A xxx.xxx.xxx.xxx    ; IP address of machine 3
Then as other machines come online, you could add them to the DNS entries:
www    IN A xxx.xxx.xxx.xxx    ; IP address of machine 4
The only problem is that you'll have to lower the time-to-live (TTL) for each record (I think the default is 86400 seconds, i.e. 1 day).
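A sketch of what a lowered TTL might look like in the zone file (300 seconds is just an example value):

$TTL 300    ; zone-wide default lowered to 5 minutes
www    300    IN A xxx.xxx.xxx.xxx    ; or set a low TTL per record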
If a machine goes down, you'll have to remove its DNS entry, though I do think this is the least intensive way of adding capacity to any website. Jeff Atwood has more info here: is round robin dns good enough?