We have a problem with a number of our sites where Yahoo, Google, Yandex, Bing, Ahrefs and others all index the site at the same time, which kills the website.
I've configured fail2ban to block the source IPs, but these are forever changing, so it's not ideal. I have also tried using robots.txt, but this makes little difference.
We have tried putting the site behind Cloudflare, but again this makes little difference and all we can do is block the source IPs.
What else can I do?
Currently we're monitoring the site with Nagios, which restarts nginx when the site becomes unresponsive, but this seems far from ideal.
Ubuntu server running nginx
The robots.txt file is here:
User-agent: *
Disallow: /
Posting here in case there is anything that I can get our developers to try.
Thanks
An easy approach is to rate limit them based on the User-Agent header in their requests. Schematically this looks like the following.
At the http level in the Nginx configuration:
map $http_user_agent $bot_ua {
    default '';
    "~*Googlebot|Bing" Y;
}
limit_req_zone $bot_ua zone=bot:1m rate=1r/s;
This will make sure that all requests with Googlebot or Bing in the User-Agent will be rate limited to 1 request per second. Note that the rate limiting will be "global" (vs. per-IP), i.e. all of the bots will wait in a single queue to access the web site. The configuration can easily be modified to rate limit on a per-IP basis or to whitelist some user agents (a sketch of the per-IP variant follows below).
At server or location level:
limit_req zone=bot burst=5;
This allows a "burst" of up to 5 requests. You may drop this option if you want.
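As a rough sketch of the per-IP variant mentioned above (variable and zone names here are arbitrary): make the map produce an empty key for normal visitors (so they are never limited) and the client address for matching bots, so each crawling IP gets its own bucket:

map $http_user_agent $bot_key {
    default '';
    "~*Googlebot|Bing" $binary_remote_addr;   # per-IP key, only for matching bots
}
limit_req_zone $bot_key zone=bot_per_ip:10m rate=1r/s;

And at server or location level:

limit_req zone=bot_per_ip burst=5;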
Nginx rate limit documentation
Nginx map documentation
By default Nginx returns HTTP status code 503 when a request is rate limited; with the limit_req_status directive you can change this to the more descriptive 429 ("Too Many Requests"). "Sane" web crawlers detect this and slow down their scanning of the site.
Though I should say this whole problem is much more complicated. There are a lot of malicious requests pretending to be from Google, Twitter, FB, etc. coming from various scanners and crawlers (e.g. see this question), and they respect neither robots.txt nor 429. Sometimes they are quite smart and have User-Agents mimicking browsers. In this case the approach above will not help you.
Your robots.txt should have worked, but note that not all crawlers respect robots.txt.
robots.txt is case-sensitive, and it needs to be world-readable at www.yourdomain.com/robots.txt.
See what happens when you add Crawl-delay: 10.
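For example, the file above with that extra directive added (support varies: Bing and Yandex honour Crawl-delay, while Googlebot ignores it):

User-agent: *
Crawl-delay: 10
Disallow: /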
I'm using Nginx as a reverse proxy for a Ruby on Rails application.
The application has 2 critical endpoints that are responsible for capturing data from customers who are registering their details with our service. These endpoints take POST data from a form that may or may not be hosted on our website.
When our application goes down for maintenance (rare, but we have a couple of SPOF services), I would like to ensure the POST data is captured so we don't lose it forever.
Nginx seems like a good place to do this given that it's already responsible for serving requests to the upstream Rails application, and has a custom vhost configuration in place that serves a static page for when we enable maintenance mode. I figured this might be a good place for additional logic to store these incoming POST requests.
The issue I'm having is that Nginx doesn't parse POST data unless you're pointing it at an upstream server. In the case of our maintenance configuration, we're not; we're just rendering a maintenance page. This means that $request_body¹ is empty. We could perhaps get around this by faking a proxy server, or maybe even pointing Nginx at itself and enabling the logger on a particular location. This seems hacky though.
Am I going about this the wrong way? I've done some research and haven't found a canonical way to solve this use-case. Should I be using a third-party tool and not Nginx?
1: from ngx_http_core_module: "The variable’s value is made available in locations processed by the proxy_pass, fastcgi_pass, uwsgi_pass, and scgi_pass directives when the request body was read to a memory buffer."
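For what it's worth, the "point Nginx at itself" idea could be sketched roughly as follows; the port, paths and log format name are made up, and the body has to fit in client_body_buffer_size for $request_body to be populated. Proxying (even back to the same nginx) is what forces the request body to be read:

# at http level; escape=json needs nginx 1.11.8+
log_format postdata escape=json '$time_iso8601 $request_uri $request_body';

server {
    listen 80;
    server_name example.com;                          # placeholder

    location /signup {                                # the critical POST endpoint
        access_log /var/log/nginx/captured_posts.log postdata;
        # proxying makes $request_body available for the access_log above
        proxy_pass http://127.0.0.1:8081;
    }
}

server {
    # tiny internal vhost that only answers with the maintenance response
    # (in reality this would serve your existing static maintenance page)
    listen 127.0.0.1:8081;
    location / {
        return 503 "Down for maintenance\n";
    }
}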
As the title says,
I currently have a website with https on sensitive pages and cached http on the rest for speed.
I know HTTP/2 is faster than HTTP/1.1 over HTTPS. What I don't know is whether HTTP/2 is faster than regular unencrypted HTTP.
Would I see performance improvements if I encrypted everything with SSL and enabled HTTP2, compared to using HTTP without encryption but with caching?
It completely depends on your site.
That said, there is practically zero noticeable speed penalty for using HTTPS nowadays, unless you are on 20-year-old hardware or serving huge content (e.g. you are a streaming site like Netflix or YouTube). In fact even YouTube switched to HTTPS for practically all their users: https://youtube-eng.blogspot.ie/2016/08/youtubes-road-to-https.html?m=1
There is a small initial connection delay (typically around 0.1 seconds), but after that there is practically no delay, and if you are on HTTP/2 then the gains it will give to most sites will more than make up for the tiny, unnoticeable delay that HTTPS might add.
In fact, if some of your site is already on HTTPS, then either you are using HTTPS resources on your HTTP pages (e.g. common CSS that both HTTP and HTTPS pages use) and are already paying this connection cost, or you are serving those resources over both schemes and making your HTTPS users download them again when they switch.
You can test the differences with a couple of sample sites on my blog here to give you some indication of what to expect: https://www.tunetheweb.com/blog/http-versus-https-versus-http2/ - which is a response to the https://www.httpvshttps.com website that I feel doesn't explain this as well as it should.
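If you do move everything to HTTPS, enabling HTTP/2 in nginx is essentially a one-word change on the listen directive; a minimal sketch, with placeholder host name and certificate paths:

server {
    listen 443 ssl http2;                             # HTTP/2 is negotiated over TLS
    server_name www.example.com;                      # placeholder

    ssl_certificate     /etc/ssl/certs/example.crt;   # placeholder paths
    ssl_certificate_key /etc/ssl/private/example.key;

    root /var/www/html;
}

server {
    listen 80;
    server_name www.example.com;
    return 301 https://$host$request_uri;             # send plain-HTTP traffic to HTTPS
}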
I have an overbearing and unauthorized crawler trying to crawl my website at very high request rates. I denied the IP address in my nginx.conf when I saw this, but a few weeks later I saw the same crawler (it followed the same crawling pattern) coming from another IP address.
I would like to fight back not just by sending back an immediate 403 but also by slowing down the response time, or something equally inconvenient and frustrating for the programmer behind this crawler. Any suggestions?
I would like to manually target just this crawler. I've tried the nginx limit_req_zone, but it's too tricky to configure in a way that does not affect valid requests.
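One way to target only this crawler, assuming it can be recognised by something in its requests (the User-Agent substring "BadBot" below is purely hypothetical), is to map that fingerprint to a variable and throttle the response bandwidth just for matching requests:

map $http_user_agent $bad_crawler {
    default     0;
    "~*BadBot"  1;                       # hypothetical fingerprint; use whatever identifies the crawler
}

server {
    listen 80;
    server_name example.com;             # placeholder

    location / {
        if ($bad_crawler) {
            set $limit_rate 1024;        # trickle responses at ~1 KB/s for the crawler only
        }
        # ... normal proxy_pass / root configuration ...
    }
}

Normal visitors never match the map, so they are unaffected, and because $limit_rate only slows delivery rather than rejecting requests, the crawler wastes time on every response.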
I'm trying to use nginx behind a Compute Engine HTTP load balancer. I would like to allow health check requests to come through a port unauthorized and all other requests to be authorized with basic auth.
The health check requests come from the IP block 130.211.0.0/22. If I see requests coming from this IP block with no X-Forwarded-For header, then it is a health check from the load balancer.
I'm confused on how to set this up with nginx.
Have you tried using Nginx header modules? Googling around I found these:
HttpHeadersMoreModule
Headers
There's also a similar question here.
Alternative: in the past I worked with a piece of software (RT) that had thought of this possibility itself, providing a subdirectory for unauthorized access (/noauth/). Maybe your software has the same, and you could configure the GCE health check to point to something like /noauth/mycheck.html.
Please remember that headers can be easily forged, so an attacker who knows about this weakness could access your server without auth.
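Another option that stays entirely within stock nginx is the satisfy directive, which accepts a request if either the IP allow list or basic auth succeeds; a sketch, with a placeholder password file path:

location / {
    satisfy any;                     # pass if EITHER the allow rule OR basic auth succeeds

    allow 130.211.0.0/22;            # the GCE health-check range from the question
    deny  all;

    auth_basic           "Restricted";
    auth_basic_user_file /etc/nginx/.htpasswd;    # placeholder path

    # ... proxy_pass / root for the real application ...
}

This avoids relying on headers, which (as noted above) are easy to forge.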
As far as I can work out, http://www.example.com is technically a subdomain of http://example.com.
Is it better to have www.example.com as a separate subdomain (pointing at the same content), or is it better to do a permanent redirect of all traffic from http://example.com to www.example.com (or vice versa)?
Excuse my ignorance, but the reason I ask is that I'm worried that having two locations online (one with the www, one without) could cause problems with SEO, cookies, analytics, etc.
Thanks!
G
Yes, if it is the same content, then give it a single URI, and redirect example.com -> www.example.com rather than the other way around (www is the convention and is where systems will most likely look first). If you host on both URLs, spiders and other bots may be smart enough to realize that they are the same, but why rely on that when a simple redirect ensures they know?
This also means a slightly simpler web server setup (only hosting a single hostname) and will make things easier down the road if you do things like enable SSL or load balancing.
The only (arguable) downside to redirects is that they mean an additional HTTP request when the user gets the hostname wrong.
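If you happen to be serving this with nginx (as elsewhere in this thread), the redirect is a small catch-all server block; a sketch with placeholder host names:

server {
    listen 80;
    server_name example.com;                          # the bare domain
    return 301 http://www.example.com$request_uri;    # permanent redirect to the canonical host
}

server {
    listen 80;
    server_name www.example.com;                      # the canonical host
    root /var/www/example;
    # ... the real site configuration ...
}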