How to whitelist good bots in fail2ban - nginx

We have a server running nginx and WordPress. Access logs have been disabled to reduce disk I/O and improve performance. I have enabled custom logs for several specific URIs, which log host, date/time, method, and URI into dedicated files. For some URIs, like /phpmyadmin or /xmlrpc.php, I ban the IP the moment the server receives a request. Now I want to white-list known search engines such as AOL.com, Baidu, Bingbot/MSN, DuckDuckGo, Googlebot, Teoma, Yahoo! and Yandex. I know how to white-list an IP, but have no idea how to white-list a whole lot of spiders.
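One possible starting point, sketched below, is fail2ban's ignoreip setting, which exempts addresses from banning in every jail; for bots that can only be verified by reverse DNS there is also ignorecommand. The CIDR ranges and the script path here are placeholders, not verified crawler networks.

# /etc/fail2ban/jail.local - a minimal sketch, not a verified bot list
[DEFAULT]
# Anything listed in ignoreip is never banned by any jail.
# 66.249.64.0/19 stands in for a Googlebot range and 40.77.167.0/24 for
# Bingbot; substitute the networks each search engine actually publishes.
ignoreip = 127.0.0.1/8 ::1 66.249.64.0/19 40.77.167.0/24

# Alternative: run a script for every candidate IP; exit status 0 means "ignore".
# Such a script would typically reverse-resolve <ip> and accept host names
# ending in googlebot.com, search.msn.com, crawl.baidu.com, and so on.
#ignorecommand = /usr/local/bin/is-good-bot.sh <ip>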

Related

Which app is in charge of URL redirection in a WordPress site?

I just wonder which app is in charge of URL redirection in a WordPress site.
In my site, there is a real folder at /downloads/files/. Now I find there are some 404 errors when accessing https://www.datanumen.com/downloads/files/sitemap.xml, so I want to redirect the URL to https://www.datanumen.com/sitemap.xml
I tried several methods:
Add the following redirect in .htaccess in the root folder:
Redirect 301 /downloads/files/sitemap.xml https://www.datanumen.com/sitemap.xml
But that does not work.
Install the Redirection plugin and set up a redirection from
/downloads/files/sitemap.xml
to
/sitemap.xml
But it still does not work.
So I am curious: in a WordPress site, when I input a URL, is it processed by the WordPress system first (in which case method 2 would take effect), or processed by Apache first (in which case method 1 would take effect)? And why does neither method work?
Apache/.htaccess catches the request first. It is Apache that sends the request to WordPress/PHP.
However, looking at the HTTP response in the browser...
cf-cache-status: HIT
server: cloudflare
The 404 response you are seeing is coming from your CDN's cache. The request isn't even reaching your application server in order to process the redirect.
HOWEVER, are these requests to /downloads/files/sitemap.xml legitimate requests for your sitemap XML file? That seems unlikely, so I'd question whether they need to be redirected in the first place.
Note that /sitemap.xml is itself redirected to /wp-sitemap.xml, so /sitemap.xml would not seem to be the correct target anyway.
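One quick way to confirm where the 404 comes from is to request the file from the command line and look at the response headers (using the URL from the question):

curl -I https://www.datanumen.com/downloads/files/sitemap.xml

If the output contains server: cloudflare and cf-cache-status: HIT, the 404 is being served from the CDN edge cache and never reaches Apache or WordPress, which is why neither redirect method appears to take effect.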
The client sends an IP packet to a host and specifies that the data in the packet is for a particular port number. Apache listens on specific port numbers, as do most network applications; if the packet is for one of the ports Apache is listening on, the operating system routes it to Apache. Apache then sees a request for a file (HTML, PDF, JPEG, etc.) and retrieves it from the server's storage medium (SSD or HDD). If the file contains PHP code, Apache parses the PHP before serving the result to the client.
The layers involved are:
hardware of the server computer, i.e. CPU, GPU, RAM, storage
operating system, i.e. Linux
server applications that run on the operating system, i.e. Apache
PHP files, i.e. WordPress
So basically Apache acts first, i.e. the server configuration files apply, and then, when Apache parses the WordPress PHP files, the PHP script is executed second.
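For reference, the stock WordPress rewrite block (it can differ slightly between installs) is what decides whether a request ever reaches WordPress: paths that map to a real file or directory are served by Apache directly, and only the rest are handed to index.php, where WordPress and plugins such as Redirection run.

# BEGIN WordPress
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
RewriteRule ^index\.php$ - [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
</IfModule>
# END WordPress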

How to configure nginx to only allow requests from cloudfront client?

I have a server behind nginx, and I have a frontend distributed on AWS CloudFront using AWS Amplify. I'd like requests that don't come from my client to be denied at the reverse-proxy level. If others think I should do this at the app level, please let me know.
What I've tried so far is to allow all the IPs of AWS's CloudFront system (https://ip-ranges.amazonaws.com/ip-ranges.json) and deny everything else after that (a sketch of that configuration appears below). However, requests from the correct client get blocked.
My other alternative is to do a DNS lookup of the domain's IP for every request and check against that, but I'd rather not do a DNS lookup every time.
I can also include some kind of token with every request, but come on - there's gotta be some easier way to get this done.
Any ideas?
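For reference, the allow/deny approach described above looks roughly like this in nginx; the two ranges and the upstream address are placeholders, and the real, current list would have to be generated from ip-ranges.json. Note that this is the very setup reported to block legitimate requests, so it is shown only to illustrate the attempt.

location / {
    # placeholder CloudFront ranges; regenerate the full list from
    # https://ip-ranges.amazonaws.com/ip-ranges.json
    allow 13.32.0.0/15;
    allow 52.84.0.0/15;
    deny  all;
    proxy_pass http://127.0.0.1:8080;   # hypothetical upstream
}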

How can I not just deny, but slow down requests from a crawler from a certain IP [nginx]

I have an overbearing and unauthorized crawler trying to crawl my website at very high request rates. I denied the IP address in my nginx.conf when I saw this, but then a few weeks later I saw the same crawler (it followed the same crawling pattern) coming from another IP address.
I would like to fight back not just by sending back an immediate 403 but also by slowing down the response time, or something equally inconvenient and frustrating for the programmer behind this crawler. Any suggestions?
I would like to manually target just this crawler. I've tried the nginx limit_req_zone, but it's too tricky to configure in a way that does not affect valid requests.
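One way to aim limit_req at just this crawler, sketched below with placeholder addresses and rates, is to give only the offending source a non-empty rate-limit key; nginx does not count requests whose key is empty, so everyone else is unaffected. When the crawler moves to a new address, only the geo block needs updating.

# http {} context
geo $bad_crawler {
    default       0;
    203.0.113.7   1;    # placeholder for the crawler's current address
}
map $bad_crawler $crawler_limit_key {
    0  "";                    # normal visitors: empty key, never limited
    1  $binary_remote_addr;   # the crawler: limited per source address
}
limit_req_zone $crawler_limit_key zone=crawler:1m rate=2r/m;

# server/location being crawled
location / {
    limit_req zone=crawler burst=5;   # excess requests are delayed, then rejected
}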

Nginx - Allow requests from IP range with no header set

I'm trying to use nginx behind a Compute Engine HTTP load balancer. I would like to allow health check requests to come through on a port without authentication, and all other requests to require basic auth.
The health check requests come from the IP block 130.211.0.0/22. If I see requests coming from this block with no X-Forwarded-For header, then it is a health check from the load balancer.
I'm confused on how to set this up with nginx.
Have you tried using Nginx header modules? Googling around I found these:
HttpHeadersMoreModule
Headers
There's also a similar question here.
Alternative: in the past I worked with software (RT) that had anticipated this possibility, providing a subdirectory for unauthenticated access (/noauth/). Maybe your software has something similar, and you could configure the GCE health check to point to something like /noauth/mycheck.html.
Please remember that headers can be easily forged, so an attacker who knows your vulnerability could access your server without auth.
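Another pattern that is often used for this, shown only as a sketch with placeholder paths: nginx's satisfy any lets a request through if either the address-based allow rule or basic auth succeeds. Note that it checks only the source range, not the absence of the X-Forwarded-For header.

location / {
    satisfy any;                 # pass if EITHER the allow rule OR basic auth succeeds
    allow 130.211.0.0/22;        # health-check source block from the question
    deny  all;
    auth_basic           "Restricted";
    auth_basic_user_file /etc/nginx/.htpasswd;   # placeholder credential file
}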

Removing http301 redirect from client's cache

I have a server/client architecture where the client hits the ASP.NET server's service at a certain host name, IP address, and port. Without thinking, I logged on to the server and set up a permanent HTTP 301 redirection through IIS from that service to another URL that the machine handles via IIS (same IP and port), mistakenly thinking it was another site that is hosted there. When the client hit the server at the old host name, it cached the permanent redirect. Now, even though I have removed the redirection, the client no longer uses the old address. How can I clear the client's cache so that it no longer stores the redirect?
I have read about how permanent HTTP 301 can be, but in this case it should be possible to reset a single client's knowledge of the incorrectly-learned host name. Any ideas?
The HTTP status code 301 is unambiguously defined in RFC 2616 as "any future references to this resource SHOULD use one of the returned URIs",
which means you have to go ask all your clients to revalidate the resource. If you have a system where you can push updates to your clients, perhaps you can push an update to use the same URI again, but force a revalidation.
Nothing you do on the server side will help - in fact, by removing the permanent redirect in IIS you have already taken all measures you should.
