I've set up Edge Caching to cache HTML content. It works perfectly well when resources are hit either by a browser or via curl: in both cases, the first request warms the cache and the second request is served directly from Cloudflare.
Through my logs, however, I've noticed that crawlers such as Bing, Yahoo, and Google do not appear to warm the cache.
When I visit URLs previously hit by a crawler, in the browser or via curl, that subsequent request hits my origin server as well (according to my server logs).
Is this a matter of plan size (regular vs. Enterprise), bad configuration, or does Cloudflare special-case crawler user agents?
Cloudflare caches content independently at each of its data centers, so if your site isn't being visited from the locations Google ordinarily crawls from, the crawler's requests won't have warmed the cache at the edge location your own requests hit.
You might see some benefit from setting a longer Edge Cache TTL in Cloudflare, depending on how frequently search engines crawl your site; to do this you need to use Cloudflare's Page Rules.
If you want something more bespoke, it's probably best to reach out to Cloudflare's Enterprise Sales team.
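Either way, it can help to confirm from your own vantage point whether a given response is actually coming out of Cloudflare's cache: Cloudflare adds a CF-Cache-Status response header (typically MISS on the warming request, HIT afterwards) to responses it proxies. A minimal sketch using Python's requests library, with a placeholder URL:

    import requests

    URL = "https://example.com/some-cached-page.html"  # placeholder

    for attempt in (1, 2):
        resp = requests.get(URL)
        # Cloudflare adds CF-Cache-Status to responses it proxies:
        # typically MISS on the request that warms the cache, HIT once cached.
        status = resp.headers.get("CF-Cache-Status", "not present")
        print(f"request {attempt}: HTTP {resp.status_code}, CF-Cache-Status: {status}")

Running this twice from the same location (or comparing it with what your server logs show) makes it easy to see whether a particular request was served from the edge or went to the origin.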
If I set up a simple web server online (e.g. nginx), generate a very large random string (such that it is unguessable), and host that endpoint on my domain, e.g.
example.com/<very-large-random-string>
would it be safe to, say, host a web app at that endpoint with no authentication and use it to store my personal information (a scratch-pad or notes kind of thing)?
I know Google Docs does this; is there anything special one has to do (again, e.g. for nginx) to prevent someone from getting a list of all available pages?
I guess what I'm asking is: is there any way for a malicious actor to find out that such a page exists, preferably irrespective of which web server I use?
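For concreteness, I'd generate the random segment with something like the following (just a sketch; the length is arbitrary):

    import secrets

    # Roughly 64 bytes of randomness, URL-safe base64 encoded (~86 characters);
    # the exact length is arbitrary, it just needs to be far too long to guess.
    path_segment = secrets.token_urlsafe(64)
    print(f"https://example.com/{path_segment}")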
I'd be pretty alarmed if my online bank started using this system, but it should give you a basic level of security. Bear in mind that this is security through obscurity, which is rather frowned upon and will immediately turn into no security whatsoever the moment someone discovers the hidden URL.
To prevent this from happening, you will need to take a few precautions:
Install an SSL certificate on your server, and always access the url via https, never via http (otherwise the URL path will be sent in plain view and visible to everyone along the way).
Make sure your secure document contains no outgoing links. This includes not only hyperlinks (<a href="...">) but also embedded images, stylesheets, scripts, media files and so on. Otherwise the URL will be leaked to other domains via the Referer request headers.*1
(A bit of a no-brainer, but) make sure there are also no inbound links to this page. Although they aren't so common now, web hosts used to generate automatic "web stats" pages showing the traffic to each web domain. Some content management systems generate a site map automatically. This would be just as bad.
Disable directory browsing on your server. In other words, make sure that someone who visits the directory level above your hidden directory isn't presented with a list of subdirectories.
Bear in mind that the URL will always be visible in your address bar and browser history, and possibly in other places like your browser's cookie jar. Your browser will probably provide the rest of the URL by auto-complete when someone types the domain into your address bar.
*1: Actually, when you're on an https page, your browser will only send a Referer header when you follow links to other https pages, but still...
I inherited a complex AWS system from someone and I have practically no AWS experience. I'm reading documentation and doing training, but there's one thing I can't figure out: when someone hits a page served by CloudFront, are they able to make changes that affect the origin server?
I would have thought "no, they're just static pages," but I'm seeing evidence to the contrary. We have some WordPress installs, and I think the users are hitting CloudFront when they log in through the admin panel remotely, but they're still able to make changes and publish content. At one point I also cached admin-ajax.php without allowing OPTIONS, PUT, PATCH, POST, and DELETE requests, thinking it wouldn't matter because our front-end site doesn't use Ajax. This broke the admin panel, which requires Ajax, even when logged in directly to the origin server and bypassing CloudFront.
The cached page will render in the browser, but when the user "makes changes," the browser sends a separate HTTP GET or POST request from the one that requested the page. That "changes" request will not be served from CloudFront's cache; it will be forwarded to the origin server.
Note that you will need to configure CloudFront appropriately, so that it respects the server's cache headers, treats query parameters as part of the cache key, and so on, for this to work properly. You can also configure CloudFront cache behaviors for a specific path, such as /admin/*, to prevent caching of the pages under that path, among other things.
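For what it's worth, while you're getting familiar with the setup you inherited, here's a rough sketch (using boto3, with a placeholder distribution ID) of how you could list the distribution's current cache behaviors to see which path patterns, allowed methods and TTLs are configured; it's only meant as a starting point for inspecting, not changing, the configuration:

    import boto3

    DISTRIBUTION_ID = "E1234567890ABC"  # placeholder; substitute your distribution's ID

    cf = boto3.client("cloudfront")
    config = cf.get_distribution_config(Id=DISTRIBUTION_ID)["DistributionConfig"]

    # The default behavior plus any path-specific behaviors (e.g. one for /admin/*).
    behaviors = [config["DefaultCacheBehavior"]] + config["CacheBehaviors"].get("Items", [])

    for b in behaviors:
        print(
            b.get("PathPattern", "(default)"),
            b.get("AllowedMethods", {}).get("Items"),
            "MinTTL:", b.get("MinTTL"),
        )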
I have found a lot of information about serving http content into https websites and what to think of when doing / not doing that.
My problem is slightly different: I want to serve https content (both active and passive) from one domain into a website on another, http-only domain, but I can't find any information about browser support for that.
Example:
http://www.mydomain.com
loads scripts and images from
https://www.myotherdomain.com
I have tried this out in Chrome and Firefox and don't seem to get any warnings, but I wonder what the general browser support out there is. Can I expect this to work everywhere?
The reason for mixed content warnings is that when a user is browsing a page over https and it embeds content that is accessed over http, the user would believe they are on a secure connection while being unaware of the insecure content. This could be used to trick a user into believing they are secure when actually they are not.
In your case the user would of course only see http, and nothing would suggest the connection is secure, so this is not a security concern and browsers will allow it.
The bigger question is why you would want to do this: remember that intermediate caches between your server and the client cannot cache the https responses, which will increase the load on your https server. I'd be tempted to serve a copy of your files over http and use the ones served over https only for pages that are themselves served over https.
I use the Google Maps API on my otherwise SSL-secured site, and I therefore invariably get one of these terrible "mixed content" warnings from my web app. This is annoying. I understand that this issue can be fixed when, upon moving the app into production, I sign up for a premier account with Google. Hurrah. I am just perplexed: the threat from Google to the integrity of my site remains the same whether I pull down their content over HTTP or HTTPS. What's the point, in other words, of browsers putting up this warning?
Thanks.
The threat from Google may remain the same, but when you're loading the Google content over http, it's not just threats from Google you need to worry about; you also need to worry about man-in-the-middle attacks, in which someone pretends to be Google and injects malicious content into your page. With the number of people who use untrusted or insecure wireless networks, it's not too hard to launch a man-in-the-middle attack these days.
Also, https is supposed to protect information going in both directions. If there is content on the page not protected via https, but the user sees the https in the address and lock icon, they may believe that information they enter is secure from eavesdroppers, when in fact some of the information is transmitted in the clear.
the threat from Google to the integrity of my site remains the same whether I pull down their content over HTTP or HTTPS
I think you're using the wrong threat model here. The threat is not that Google might act maliciously and send the wrong data to your users. Indeed, SSL would not protect against that.
The actual threat is that a man in the middle (between your users and google) could eavesdrop on the unprotected data to determine what your users are up to, or even modify the unprotected content in order to trick them.
It's the duty of the browser to somehow inform the user that such attacks are possible. Otherwise the user will incorrectly think that everything is secure because he entered an "https" address.
The reason this message exists is that any HTTPS connection is served via SSL, so the browser knows that the data coming in over it is indeed the exact data sent from the server.
This is not the case for any components that have been delivered via HTTP: those can be tampered with in transit and can in turn modify the components that were delivered via SSL, so the guarantee that the HTTPS data is correct cannot be maintained.
That's why the warning comes up.
I have a directory with my media files and I don't want them displayed on other sites.
The server doesn't support .htaccess because it runs nginx.
How can I enable hotlink protection for my files?
Thank you.
The easiest way would be to check the Referer header in the HTTP request: if that header does not contain a URL from your site, the request could be hotlinking.
This has the following problems:
The Referer header can be forged -> hotlinking still works.
Not all user agents send the Referer header -> a legitimate user might not get the content.
You could also set a cookie when a user is browsing your site, and check for the existence of that cookie when the user accesses the streaming content.
If you decide to go the Referer route: the details may be dated, but Igor gives an example of referrer mapping for image hotlink protection that might be useful here: http://nginx.org/pipermail/nginx/2007-June/001082.html (a sketch of the same check follows below).
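To make the logic concrete independently of the web server, here is a minimal sketch of the Referer check in Python/Flask (the host names and media directory are placeholders); the same check can be expressed in nginx itself, e.g. via the mapping in the link above:

    from urllib.parse import urlparse

    from flask import Flask, abort, request, send_from_directory

    app = Flask(__name__)

    # Placeholder values -- substitute your own site and media location.
    ALLOWED_HOSTS = {"example.com", "www.example.com"}
    MEDIA_DIR = "/var/www/media"

    @app.route("/media/<path:filename>")
    def media(filename):
        referer = request.headers.get("Referer", "")
        # An empty Referer is allowed so that clients which strip the header
        # still work; a Referer pointing at some other host is rejected.
        if referer and urlparse(referer).hostname not in ALLOWED_HOSTS:
            abort(403)
        return send_from_directory(MEDIA_DIR, filename)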
If you are using memcached, you could also store client IP addresses for a time and only serve your streaming media if an unexpired client IP is found in the cache. The client IP gets cached during normal browsing, ensuring that whoever is viewing your streaming content has also recently been visiting your site.
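A rough sketch of that idea, assuming a local memcached instance and the pymemcache client (both are placeholders for whatever you actually run); you would call these two helpers from your normal page views and from your media handler respectively:

    from pymemcache.client.base import Client

    mc = Client(("127.0.0.1", 11211))  # placeholder memcached address

    def remember_visitor(ip):
        # Call this from normal page views: remember the visitor's IP for an hour.
        mc.set(f"visitor:{ip}", b"1", expire=3600)

    def recently_visited(ip):
        # Call this before serving streaming media: serve only if the IP was seen recently.
        return mc.get(f"visitor:{ip}") is not None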
On my HostGator site, they used nginx as a proxy in front of Apache (nginx + Apache); maybe that will help you. Also, if you have access to the logs and see a lot of that kind of traffic from a single IP, I would investigate, and if it points to another site, block that site's IP. PHP's file_get_contents doesn't get stopped by .htaccess or anything else I know of besides blocking the IP.