facebookexternalhit/1.1 bot Excessive Requests, Need to Slow Down - woocommerce

facebookexternalhit/1.1 is hitting my WooCommerce site hard, with so many requests per second that it causes 503 errors. I tried to slow it down using robots.txt and a Wordfence rate limit, but nothing works. Is there any way to slow the bot down without blocking it?
Here are a few examples from the raw access logs:
GET /item/31117/x HTTP/1.0" 301 - "-" "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
GET /?post_type=cms_block&p=311 HTTP/1.0" 503 607 "-" "facebookexternalhit/1.1
GET /item/31117/xiaomi-redmi-router HTTP/1.1" 200 48984 "-"

When someone shares a link to your site on Facebook, Facebook will on occasion reach out and fetch the rich embed data (Open Graph image, etc.) whenever someone views the link on Facebook; they don't even have to click it. This is a well-known problem, as a search around the net will show.
The solution is to rate limit any user agent containing this text:
facebookexternalhit
The crawler does not respect robots.txt, and it can be leveraged in DDoS attacks. There are many articles about it:
https://thehackernews.com/2014/04/vulnerability-allows-anyone-to-ddos.html
https://devcentral.f5.com/s/articles/facebook-exploit-is-not-unique
Because the crawler does not respect robots.txt, you have to rate limit it. I'm not familiar with WooCommerce, but you can search for "Apache rate limiting" or "nginx rate limiting", depending on which server you use, and find many good articles.
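To illustrate the logic behind rate limiting by user agent, here is a minimal token-bucket sketch in Python. In practice this would normally be configured in Apache or nginx rather than in application code. The limits (2 requests per second, bursts of 5), the `TokenBucket` class, and the `status_for` helper are all assumptions made for illustration, and whether the crawler actually backs off after a 429 is not something established here.

```python
import time


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Hypothetical limits: 2 requests/second on average, bursts of up to 5.
facebook_bucket = TokenBucket(rate=2, capacity=5)


def status_for(user_agent, bucket=facebook_bucket):
    """Return the HTTP status to send before doing any expensive work."""
    if "facebookexternalhit" in user_agent.lower() and not bucket.allow():
        return 429  # Too Many Requests
    return 200
```

Other user agents never touch the bucket, so ordinary visitors are unaffected; only requests whose user agent contains facebookexternalhit are throttled.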
I recently suffered a DDoS attack that was combined with this Facebook method: the Facebook ASN, AS32934, hit 1,060 URLs in 1 second.
I just banned the entire ASN, and the problem was solved.
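Banning an ASN comes down to rejecting every client address that falls inside one of the network prefixes announced by that ASN. A rough sketch of that check, assuming you have already obtained the prefix list for AS32934 from a routing registry or similar source; the prefixes below are documentation placeholders (RFC 5737), not Facebook's real ranges, and `is_blocked` is a hypothetical helper:

```python
import ipaddress

# Placeholder prefixes from the RFC 5737 documentation ranges. They are NOT
# Facebook's real prefixes; substitute the list you obtained for the ASN.
BLOCKED_PREFIXES = [
    ipaddress.ip_network(prefix)
    for prefix in ("192.0.2.0/24", "198.51.100.0/24")
]


def is_blocked(client_ip, prefixes=BLOCKED_PREFIXES):
    """True if the client address falls inside any blocked prefix."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in prefixes)
```

A firewall or web server rule would usually do this work; the sketch only shows the membership test involved.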

Related

Status 403 - Alternative Status Texts

In professional environments, antivirus software such as McAfee filters the client's traffic and blocks disallowed requests. One way it does this is by responding with a 403 instead of the real response.
The problem I have right now is that a frontend cannot tell the difference between a 403 coming from the real backend and a 403 coming from the antivirus software.
I was thinking about using their custom status texts, but I couldn't find much documentation listing the possible status texts. Is there a standard or a public list of possible status texts?
Examples:
URLBlocked
URLCategoryBlocked
...

What do these strange headers mean and what is this hacker trying to do?

I recently deployed a public website, and looking at the nginx access logs, I see hackers trying to access various PHP admin pages (which is fine, I don't use PHP), but I also see requests like this:
85.239.221.75 - - [27/Dec/2019:14:52:42 +0000] "k\xF7\xE9Y\xD3\x06)\xCF\xA92N\xC7&\xC4Oq\x93\xDF#\xBF\x88:\xA9\x97\xC0N\xAC\xFE>)9>\x0Cs\xC1\x96RB,\xE1\xE2\x16\xB9\xD1_Z-H\x16\x08\xC8\xAA\xAF?\xFB4\x91%\xD9\xDD\x15\x16\x8E\xAB\xF5\xA6'!\xF8\xBB\xFBBx\x85\xD9\x8E\xC9\x22\x176\xF0E\x8A\xCDO\xD1\x1EnW\xEB\xA3D|.\xAC\x1FB\xC9\xFD\x89a\x88\x93m\x11\xEB\xE7\xA9\xC0\xC3T\xC5\xAEF\xF7\x8F\x9E\xF7j\x03l\x96\x92t c\xE4\xB5\x10\x1EqV\x0C5\xF8=\xEE\xA2n\x98\xB4" 400 182 "-" "-"
What is this hacker sending and what are they trying to do? And what should I do to stay ahead of this type of attack?
The data you are seeing is hex-escaped binary. It most likely appears because a client made an HTTPS request to a plain HTTP endpoint. HTTP expects plain text, but the client sent encrypted HTTPS data instead, which is why the log shows a bunch of gibberish.
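One way to test this hypothesis on a given log line is to undo the `\xHH` escaping and look at the first bytes: a TLS handshake record normally starts with the byte 0x16 followed by 0x03. The sketch below assumes nginx's default log escaping, where a literal backslash is itself escaped, so every `\x` in the log is an escape sequence. Both function names are hypothetical. If the check fails, the bytes may come from some other binary protocol or probe rather than TLS.

```python
import re

_ESCAPE = re.compile(r"\\x([0-9A-Fa-f]{2})")


def unescape_request(logged):
    """Turn an nginx-style escaped request string back into raw bytes."""
    text = _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), logged)
    # After unescaping, every character is in the range 0-255.
    return text.encode("latin-1")


def looks_like_tls_handshake(raw):
    """True if `raw` starts like a TLS handshake record (type 0x16, major version 0x03)."""
    return len(raw) >= 3 and raw[0] == 0x16 and raw[1] == 0x03
```

To use it, pass the quoted request portion of the log line to `unescape_request` and feed the result to `looks_like_tls_handshake`.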

How can I prevent unwanted GET requests with URLs added to parameters

I have a small ecommerce site, (LEMP stack) and I had used a route like
my.domain.com/makecart?return_url=....
as a means of returning to a point in the previous page to assist selection for the cart.
Over a period of months I started getting thousands of GET requests with unwanted domain links appended to the ?return_url parameter.
I have now reprogrammed this route without the use of any parameters, but my site is still getting the unwanted hits.
e.g. 76.175.182.173 - - [14/Nov/2018:19:36:08 +0000] "GET /makecart?return_url=http://www.nailartdeltona.com/ HTTP/1.0" 302 364 "http://danielcraig.2bb.ru/click.php
I am redirecting such requests to an error page and have it 'under control' with fail2ban, but I am gradually filling up memory with ban information.
Is there a way to stop these hits before they have to be picked back out of the access log?
Furthermore, what are they trying to do anyway?

Bad requests for WordPress RSS and author URLs

On a popular WordPress site, I'm getting a constant stream of requests for these paths (where author-name is the first and last name of one of the WordPress users):
GET /author/index.php?author=author-name HTTP/1.1
GET /index.rdf HTTP/1.0
GET /rss HTTP/1.1
The first two URLs don't exist, so the server is constantly returning 404 pages. The third is a redirect to /feed.
I suspect the requests are coming from RSS readers or search engine crawlers, but I don't know why they keep using these specific, nonexistent URLs. I don't link to them anywhere, as far as I can tell.
Does anybody know (1) where this traffic is coming from and (2) how I can stop it?
Check Apache logs to get the "where" part.
Stopping random internet traffic is hard. You could try serving some other error code to see if the requests stop, but they probably won't.
Most of my sites get these requests, and most of the time I trace them to Asia or the Americas. Blocking the IP works, but if the requests are few and far between, that would just waste resources.

Should I use 404 Not found or 410 Gone for a bulletin board system, when a topic is deleted?

I'm creating a bulletin board system, and I'm now implementing a 'delete topic' feature for admins. If someone opens a deleted topic, the server cannot find it, so the response should be 404. On the other hand, the topic did exist at some point, so perhaps I should use 410. Implementing 410 would require a new table called deleted_topics, and so would require more space. However, I think 410 is better for search engines. What do you think? Should I use 404 or 410?
404 Not found
The server has not found anything matching the Request-URI. No indication is given of whether the condition is temporary or permanent. The 410 (Gone) status code SHOULD be used if the server knows, through some internally configurable mechanism, that an old resource is permanently unavailable and has no forwarding address. This status code is commonly used when the server does not wish to reveal exactly why the request has been refused, or when no other response is applicable.
410 Gone
The requested resource is no longer available at the server and no forwarding address is known. This condition is expected to be considered permanent. Clients with link editing capabilities SHOULD delete references to the Request-URI after user approval. If the server does not know, or has no facility to determine, whether or not the condition is permanent, the status code 404 (Not Found) SHOULD be used instead. This response is cacheable unless indicated otherwise.
The 410 response is primarily intended to assist the task of web maintenance by notifying the recipient that the resource is intentionally unavailable and that the server owners desire that remote links to that resource be removed. Such an event is common for limited-time, promotional services and for resources belonging to individuals no longer working at the server's site. It is not necessary to mark all permanently unavailable resources as "gone" or to keep the mark for any length of time -- that is left to the discretion of the server owner.
Thanks,
Showing a 410 requires a little more effort than a 404, because to know it's a 410 you need to maintain at least a "ghost" of the former page in your database. If this is not a problem for you, I'd consider the 410 "better" and "friendlier" because it presents more information. If you don't want to be hassled with maintaining a graveyard in your database, then 404 is acceptable too, of course.
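As a minimal sketch of the "ghost" idea, assuming an SQLite database with a hypothetical `topics` table alongside the `deleted_topics` table mentioned in the question (the function names are also made up for illustration):

```python
import sqlite3


def make_db():
    """Create an in-memory database with the two assumed tables."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE topics (id INTEGER PRIMARY KEY, title TEXT)")
    db.execute("CREATE TABLE deleted_topics (id INTEGER PRIMARY KEY)")
    return db


def delete_topic(db, topic_id):
    """Remove the topic but leave a 'ghost' row behind."""
    with db:  # commit both statements together
        cur = db.execute("DELETE FROM topics WHERE id = ?", (topic_id,))
        if cur.rowcount:  # only record a ghost if the topic really existed
            db.execute(
                "INSERT OR IGNORE INTO deleted_topics (id) VALUES (?)",
                (topic_id,),
            )


def topic_status(db, topic_id):
    """Return 200 for a live topic, 410 for a deleted one, 404 otherwise."""
    if db.execute("SELECT 1 FROM topics WHERE id = ?", (topic_id,)).fetchone():
        return 200
    if db.execute("SELECT 1 FROM deleted_topics WHERE id = ?", (topic_id,)).fetchone():
        return 410
    return 404
```

The ghost row stores only the id, so the extra space per deleted topic stays small.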
I don't like Alohci's approach of redirecting to a different page. The end result looks like the user ended up on the "input new topic" page (or whatever) by accident. This works, but I think a preferable solution would be to create a custom 410 page (or 404 page, if you don't want to support 410) with specific information for the case at hand. I.e. your 410 shouldn't just say "gone", it should say "this post has been deleted, here's a link to similar posts or a link to create a new post". Your "404" wouldn't have quite as much information available but it could still offer a subset of such information and links.
I guess the "custom 410 page" comes close in appearance to "redirecting with 301" but an important difference is that robotic users of your site (of which there are many!) will get the more accurate status, and know to purge the old link from their crawl index – this will ultimately save them and you some unnecessary traffic.
I think the correct way to do this is to send 410 Gone for some time and, after a few weeks or months, switch to 404 Not Found. Of course, it is for you to decide whether that is worth the time and effort.
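The time-limited approach could look roughly like this, assuming the deletion time is stored with each deleted topic. The 90-day window and the `status_for_deleted` name are arbitrary choices for illustration:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention period for answering 410 before falling back to 404.
GONE_WINDOW = timedelta(days=90)


def status_for_deleted(deleted_at, now=None, window=GONE_WINDOW):
    """Return 410 while the deletion is recent, 404 once the window has passed."""
    now = now or datetime.now(timezone.utc)
    return 410 if now - deleted_at < window else 404
```

Once the window has passed, the stored record of the deletion is no longer needed and can be removed.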
Neither. Since you tagged your question 'SEO', I'm assuming you want the best SEO answer. If there are any backlinks (coming from outside sites) to your deleted topic, all the 'link juice' will be lost with a 404 or 410 status.
Instead you should definitely create some 301 redirects which point to the root of the site, the root of the forum, or a related category. You will thus preserve the link juice and you get to decide which pages of your site will benefit most.

Resources