Squid: limit the number of requests per minute for certain domains

Is there a way of limiting my Squid users to accessing, let's say, Facebook no more than once per hour, for example? I know how to deny domains, but I want to limit the request rate to one per given time period if possible.

It's not possible to do that with Squid directly. You could try writing an external acl helper.
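For reference, the squid.conf wiring for such a helper could look roughly like this (the helper path and name are made up here; the script itself would have to keep per-client counters and answer OK or ERR on stdout):
# hypothetical helper: receives client IP and destination domain, answers OK while under the quota
external_acl_type rate_check ttl=60 %SRC %DST /usr/local/bin/rate_check
acl facebook dstdomain .facebook.com
acl under_limit external rate_check
# deny Facebook once the helper reports the client is over its hourly quota
http_access deny facebook !under_limit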

Related

How to configure nginx to only allow requests from cloudfront client?

I have a server behind nginx, and I have a frontend distributed on AWS CloudFront using AWS Amplify. I'd like requests not coming from my client to be denied at the reverse proxy level. If others think I should do this on the app level, please lmk.
What I've tried so far is to allow all the IPs of AWS's CloudFront system (https://ip-ranges.amazonaws.com/ip-ranges.json) and deny everything else after that. However, my requests from the correct client get blocked.
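Schematically, the allow/deny attempt looks like this (the ranges below are placeholders; the real ones come from ip-ranges.json):
location / {
    # placeholder prefixes - substitute the CloudFront ranges from ip-ranges.json
    allow 192.0.2.0/24;
    allow 198.51.100.0/24;
    deny all;
    proxy_pass http://127.0.0.1:3000;  # my backend
}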
My other alternative is to resolve the CloudFront domain for every request and check the client IP against the result, but I'd rather not do a DNS lookup every time.
I can also include some kind of token with every request, but come on - there's gotta be some easier way to get this done.
Any ideas?

Nginx shared memory with limit_req_zone

So actually I'm using nginx to reverse proxy (and load balance) some API backend servers, and I'm using the limit_req_zone directive to limit max requests per IP and URI. No problem with that.
Eventually we might need to scale out and add a couple more nginx instances. Every nginx instance uses a "shared memory zone" to temporarily save (in a cache, I guess) every request so it can properly check whether the request passes or not according to the limit_req_zone mentioned above. That being said, how does nginx handle it if multiple nginx instances are running at the same time?
For example:
limit_req_zone $binary_remote_addr zone=one:10m rate=1r/s;
This tells nginx to only allow 1 request per second coming from the same IP address, but what happens if a second request (within the same second) arrives at another nginx instance? As I understand it, it will pass, because the instances don't share the memory zone where that state is stored, I guess.
I've been trying to research this a bit but couldn't find anything. Any help would be appreciated.
If by multiple nginx you mean multiple master processes, I'm not completely sure what the result is. To have multiple master processes running, they would need to have different configs / different ports to bind to.
For worker processes with a single master instance, the shared memory is precisely that, shared, and all of the workers will limit the requests together. The code documentation says:
shared_memory — List of shared memory zones, each added by calling the ngx_shared_memory_add() function. Shared zones are mapped to the same address range in all nginx processes and are used to share common data, for example the HTTP cache in-memory tree.
http://nginx.org/en/docs/dev/development_guide.html#cycle
In addition, there's a blog entry about limit_req stating the following:
Zone – Defines the shared memory zone used to store the state of each IP address and how often it has accessed a request‑limited URL. Keeping the information in shared memory means it can be shared among the NGINX worker processes. The definition has two parts: the zone name identified by the zone= keyword, and the size following the colon. State information for about 16,000 IP addresses takes 1 megabyte, so our zone can store about 160,000 addresses.
Taken from https://www.nginx.com/blog/rate-limiting-nginx/
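To make that concrete, here is a minimal single-instance sketch (the backend address is just an example). The zone declared at http level lives in shared memory, so every worker process of that one master enforces the same counters; a second nginx instance with its own configuration gets its own, completely independent zone.
http {
    # one 10 MB zone, shared by all worker processes of this instance only
    limit_req_zone $binary_remote_addr zone=one:10m rate=1r/s;

    server {
        listen 80;
        location /api/ {
            limit_req zone=one burst=5 nodelay;  # all workers consult the same zone
            proxy_pass http://127.0.0.1:8080;    # example backend
        }
    }
}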

Web crawlers overloading site

We have a problem with a number of our sites where Yahoo, Google, Yandex, Bing, Ahrefs and others all index the site at the same time, which kills the website.
I've configured fail2ban to block the source IPs, but these are forever changing, so it's not ideal. I have also tried using robots.txt, but this makes little difference.
We have tried putting the site behind Cloudflare, but again this makes little difference and all we can do is block the source IPs.
What else can I do?
Currently we're monitoring the site with Nagios, which restarts nginx when the site becomes unresponsive, but this seems far from ideal.
Ubuntu server running nginx
The robots.txt file is here:
User-agent: *
Disallow: /
Posting here in case there is anything that I can get our developers to try.
Thanks
An easy approach is to rate limit them based on the User-Agent header in their requests. Schematically this looks like the following.
At http level in Nginx configuration:
map $http_user_agent $bot_ua {
    default '';             # everything else gets an empty key and is not rate limited
    "~*Googlebot|Bing" Y;
}
limit_req_zone $bot_ua zone=bot:1m rate=1r/s;
This will make sure that all requests with Googlebot or Bing in the User-Agent will be rate limited to 1 request per second. Note that rate limiting will be "global" (vs. per-IP), i.e. all of the bots will wait in a single queue to access the web site. The configuration can easily be modified to rate limit on a per-IP basis or to whitelist some user agents (see the sketch after the documentation links below).
At server or location level:
limit_req zone=bot burst=5;
This means a "burst" of 5 requests is possible. You may drop this option if you want.
Nginx rate limit documentation
Nginx map documentation
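For example, a per-IP variant with a whitelist might look like this (a sketch; the whitelisted agent is just an example):
map $http_user_agent $bot_ip {
    default '';                                            # non-bots: empty key, never limited
    "~*UptimeRobot" '';                                    # example whitelist entry
    "~*Googlebot|Bing|Yandex|Ahrefs" $binary_remote_addr;  # each bot IP gets its own counter
}
limit_req_zone $bot_ip zone=bot_per_ip:10m rate=1r/s;
Then apply limit_req zone=bot_per_ip burst=5; at server or location level as before.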
Nginx returns HTTP status code 503 by default when a request is rate limited; with the limit_req_status directive you can return 429 instead. "Sane" web crawlers detect these responses and slow down scanning the site.
Though I should say this whole problem is much more complicated. There are a lot of malicious requests pretending to be from Google, Twitter, FB, etc. coming from various scanners and crawlers (e.g. see this question), and they respect neither robots.txt nor 429. Sometimes they are quite smart and have User-Agents mimicking browsers. In this case the approach above will not help you.
Your robots.txt should have worked. And note that not all crawlers respect robots.txt.
robots.txt is case-sensitive, and it needs to be world-readable at www.yourdomain.com/robots.txt.
See what happens when you add Crawl-delay: 10.
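For instance, a less drastic robots.txt than the block-everything one above would be (note that not every crawler honors Crawl-delay; Googlebot in particular ignores it):
User-agent: *
Crawl-delay: 10
Disallow:
The empty Disallow leaves the site crawlable, just more slowly for the bots that respect the delay.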

Will a request to api.myapp.com be slower than a request to api-myapp.herokuapp.com when hosted on Heroku?

I'm trying to understand the best way to handle SOA on Heroku. I've got it into my head that making requests to custom domains will somehow be slower, or would all requests go "out" via the internet?
On previous projects which are SOA in nature we've had dedicated hosting, so we could make requests like http://blogs/ (obviously on the internal network). I'm wondering if Heroku treats *.herokuapp.com requests as "internal"... or is it clever enough to know that myapp.com is actually myapp.herokuapp.com and route locally? Or am I missing the point completely, and in fact all requests are "external"?
What you are asking about is general knowledge of how internet requests work.
Whenever you make a request from your application to, let's say, example.com, the domain name will first be translated into an IP address using so-called DNS servers.
So this is how it works: it does not matter whether you request myapp.com or myapp.herokuapp.com, you will always request information from a specific IP address, and the domain name you requested will be passed as part of the request headers.
The server which receives this request will look up this domain name in its internal records and handle the request accordingly.
So the conclusion is that it does not matter whether you use myapp.com or myapp.herokuapp.com, the speed of the request will always be the same.
PS: As Heroku will load balance your requests between the different running instances of myapp.com, the speed here will depend on several factors: how quickly your application responds, how many instances you have running and the load average per instance, and how loaded the load balancer is at the moment. But it will certainly not depend on which domain name you use.

Maintaining simultaneous connections in HTTP?

I need to maintain multiple active long-polling AJAX connections to the web server.
I know that most browsers don't allow more than 2 simultaneous connections to the same server. This is what the HTTP 1.1 protocol states:
Clients that use persistent connections SHOULD limit the number of simultaneous connections that they maintain to a given server. A single-user client SHOULD NOT maintain more than 2 connections with any server or proxy. A proxy SHOULD use up to 2*N connections to another server or proxy, where N is the number of simultaneously active users. These guidelines are intended to improve HTTP response times and avoid congestion.
Supposing that I have 2 sub-domains Server1.MyWebSite.Com and Server2.MyWebSite.Com sharing the same IP address, will I be able to make 2x2 simultaneous connections?
It does appear that different hostnames on the same IP can be useful. You may run into issues when making the AJAX connections due to the Same Origin Policy.
Edit: As per your document.domain question (from Google's Browser Security Handbook):
Checks for XMLHttpRequest targets do not take document.domain into account...
It will be 100% browser dependent. Some might base the 2-connection limit on the domain name, some might base it on the IP address.
Others will let you do as many as you like.
No browser bases its connection limit on IP address. All browsers base the limit on the specified FQDN.
Hence, yes, it would be entirely fine to have a DNS alias to your server, although the earlier answer is correct that XHR will require you to use the page's domain name; use the alias to download the static content (images, etc.) in the page.
Incidentally, modern browsers typically raise the connection limit to 6 or 8 connections per host.
