Force nginx cache - nginx

We are stuck with a non-trivial question with nginx.
Our web server (Nginx) has a cache enabled for all links.
When we do newsletter campaigns, each customer accesses our website using a unique URL and this means that the nginx cache is not used.
This results in a high load on the website every time we send a newsletter, so we need to deal with this in some way.
An example of link
https://www.example.com/deals/calendarraa?utm_content=calendars_link&xnpe_tifc=hFzDOFnDbuPsOfnDx9pZVjQsVuU_O.bpx.US4FzXxDxZ4.VdxDPuxuYj4nTT&utm_source=newsletter&utm_campaign=deals%2520offer%3A%2520calendar%2520and%2520greeting%2520cards&utm_medium=email
Is it possible to configure nginx so that in case we have params starting with utm_ &/or xnpe_tifc, e.g. we ignore these parameters and nginx serves website with URL without them?
That means: https://www.example.com/deals/calendarra? or even better just https://www.example.com/deals/calendarra

Related

On Demand TLS and Reverse Proxy Support for Custom Domains

I came into a situation today. Please share your expertise 🙏
I have a project (my-app.com) and one of the features is to generate a status page consisting of different endpoints.
Current Workflow
User login into the system
User creates a status page for one of his sites (e.g.google) and adds different endpoints and components to be included on that page.
System generates a link for a given status page.
For Example. my-app.com/status-page/google
But the user may want to see this page in his custom domain.
For Example. status.google.com
Since this is a custom domain, we need on-demand TLS functionality. For this feature, I used Caddy and is working fine. Caddy is running on our subdomain status.myserver.com and user's custom domain status.google.com has a CNAME to our subdomain status.myserver.com
Besides on-demand TLS, I am also required to do reverse proxy as
shown below.
For Example. status.google.com ->(CNAME)-> status.myserver.com ->(REVERSE_PROXY)-> my-app.com/status-page/google
But Caddy supports only protocol, host, and port format for reverse proxy like my-app.com but my requirement is to support reverse proxy for custom page my-app.com/status-page/google. How can I achieve this? Is there a better alternative to Caddy or a workaround with Caddy?
You're right, since you can't use a path in a reverse-proxy upstream URL, you'd have to do rewrite the request to include the path first, before initiating the reverse-proxy.
Additionally, upstream addresses cannot contain paths or query strings, as that would imply simultaneous rewriting the request while proxying, which behavior is not defined or supported. You may use the rewrite directive should you need this.
So you should be able to use an internal caddy rewrite to add the /status-page/google path to every request. Then you can simply use my-app.com as your Caddy reverse-proxy upstream. This could look like this:
https:// {
rewrite * /status-page/google{path}?{query}
reverse_proxy http://my-app.com
}
You can find out more about all possible Caddy reverse_proxy upstream addresses you can use here: https://caddyserver.com/docs/caddyfile/directives/reverse_proxy#upstream-addresses
However, since you probably can't hard-code the name of the status page (/status-page/google) in your Caddyfile, you could set up a script (e.g. at /status-page) which takes a look at the requested URL, looks up the domain (e.g. status.google.com) in your database, and automatically outputs the correct status-page.

What settings are required to put AWS CloudFront CDN in front of a squarespace website?

I had trouble getting AWS CloudFront to work with SquareSpace. Issues with forms not submitting and the site saying website expired. What are the settings that are needed to get CloudFront working with a Squarespace site?
This is definitely doable, considering I just set this up. Let me share the settings I used on Cloudfront, Squarespace, and Route53 to make it work. If you want to use a different DNS provide than AWS Route53, you should be able to adapt these settings. Keep in mind that this is not an e-commerce site, but a standard site with a blog, static pages, and forms. You can likely adapt these instructions for other issues as/if they come up.
Cloudfront (CDN)
To make this work, you need to create a Cloudfront Distribution for Web.
Origin Settings
Origin Domain Name should be set to ext-cust.squarespace.com. This is Squarespace's entry point for external domain names.
Origin Path can be left blank.
Origin ID is just the unique ID for this distribution and should auto-populate if you're on the distribution creation screen, or be fixed if you're editing Origin Settings later.
Origin Custom Headers do not need to be set.
Default Cache Behavior Settings / Behaviors
Path Patterns should be left at Default.
I have Viewer Protocol Policy set to Redirect HTTP to HTTPS. This dictates whether your site can use one or both of HTTP or HTTPS. I prefer to have all traffic routed securely, so I redirect all HTTP traffic to HTTPS. Note that you cannot do the reverse and redirect HTTPS to HTTP, as this will cause authentication issues (your browser doesn't want to expose what you thought was a secure connection).
Allowed HTTP Methods needs to be GET, HEAD, OPTIONS, PUT, POST, PATCH, DELETE. This is because forms (and other things such as comments, probably) use the POST HTTP method to work.
Cached HTTP Methods I left to just GET, HEAD. No need for anything else here.
Forward Headers needs to be set to All or Whitelist. Squarespace's entry point we mentioned earlier needs to know where what domain you're coming from to serve your site, so the Host header must be whitelisted, or allowed with everything else if set to All.
Object Caching, Minimum TTL, Maximum TTL, and Default TTL can all be left at their defaults.
Forward Cookies cookies is the missing component to get forms working. Either you can set this to All, or Whitelist. There are certain session variables that Squarespace uses for validation, security, and other utilities. I have added the following values to Whitelist Cookies: JSESSIONID, SS_MID, crumb, ss_cid, ss_cpvisit, ss_cvisit, test. Make sure to put each value on a separate line, without commas.
Forward Query Strings is set to True, as some Squarespace API calls use query strings so these must be passed along.
Smooth Streaming, Restrict Viewer Access, and Compress Objects Automatically can all be left at their default values, or chosen as required if you know you need them to be set differently.
Distribution Settings / General
Price Class and AWS WAF Web ACL can be left alone.
Alternate Domain Names should list your domain, and your domain with the www subdomain attached, e.g. example.com, www.example.com.
For SSL Certificate, please follow the tutorial here to upload your certificate to IAM if you haven't already, then refresh your certificates (there is a control next to the dropdown for this), select Custom SSL Certificate and select the one you've provisioned. This ensures that browsers recognize your SSL over HTTPS as valid. This is not necessary if you're not using HTTPS at all.
All following settings can be left at default, or chosen to meet your own specific requirements.
Route 53 (DNS)
You need to have a Hosted Zone set up for your domain (this is specific to Route 53 setup).
You need to set an A record to point to your Cloudfront distribution.
You should set a CNAME record for the www subdomain name pointing to your Cloudfront distribution, even if you don't plan on using it (later we'll go through setting Squarespace to only use the root domain by redirecting the www subdomain)
Squarespace
On your Squarespace site, you simply need to go to Settings->Domains->Connect a Third-Party Domain. Once there, enter your domain and continue. Under the domain's settings, you can uncheck Use WWW Prefix if you'd like people accessing your site from www.example.com to redirect to the root, example.com. I prefer this, but it's up to you. Under DNS Settings, the only value you need is CNAME that points to verify.squarespace.com. Add this CNAME record to your DNS settings on Route 53, or other DNS provider. It won't ever say that your connection has been fully completed since we're using a custom way of deploying, but that won't matter.
Your site should now be operating through Cloudfront pointing to your Squarespace deployment! Please note that DNS propogation takes time, so if you're unable to access the site, give it some time (up to several hours) to propogate.
Notes
I can't say exactly whether each and every one of the values set under Whitelist Cookies is necessary, but these are taken from using the Chrome Inspector to determine what cookies were present under the Cookie header in the request. Initially I tried to tell Cloudfront to whitelist the Cookie header itself, but it does not allow that (presumably because it wants you to use the cookie-specific whitelist). If your deployment is not working, see if there are more cookies being transmitted in your requests (under the Cookie header, the values you're looking for should look like my_cookie=somevalue;other_cookie=othervalue—my_cookie and other_cookie in my example are what you'd add to the whitelist).
The same procedure can be used to forward other headers entirely that may be needed via the Forward Headers whitelist. Simply inspect and see if there's something that looks like it might need to go through.
Remember, if you're not whitelisting a header or cookie, it's not getting to Squarespace. If you don't want to bother, or everything is effed (pardon my language), you can always set to allow all headers/cookies, although this adversely affects caching performance. So be conservative if you can.
Hope this helps!
Here are the settings to get CloudFront working with Squarespace!
Behaviours:
Allowed HTTP Methods Ensure that you select: GET, HEAD, OPTIONS, PUT, POST, PATCH, DELETE. Otherwise forms will not work:
Forward Headers: Select whitelist and choose 'Host'. Otherwise squarespace will not know which website they need to load up and you get the message 'Website has expired' or similar.
Origins:
Origin Domain Name set as: ext-cust.squarespace.com
Origin Protocol Policy Select HTTPS so that traffic between the CDN and the origin is secure too
General
Alternate Domain Names (CNAMEs) put both your www and none www addresses here and let Squarespace decide on if to direct www to root or vice-versa (.e.g example.com www.example.com)
You can now configure SSL on CloudFront
HTTPS You can now enforce HTTPS using a certificate for your site here rather than in Squarespace
Setting I'm unsure about still:
Forward Query Strings: recommended not for caching reasons but I think this could break things...
Route53
Create A records for www and root (e.g. example.com www.example.com) and set as an alias to your CloudFront distribution

Prevent bundle from responding if cache buster doesn't match content

I'm using Bundling and Minification across a server farm where there is a cross-over period of old and new servers.
The issue I have is that the old servers are caching the content of the new bundle cache buster URL.
For example the new HTML is cached with the new bundle URL:
<script src="/bundle.css?v=RBgbF6A6cUEuJSPaiaHhywGqT7eH1aP8JvAYFgKh"></script>
This then makes a request to an old server which hasn't yet been updated with the new CSS code and this then cached.
Any subsequent calls to the new bundle URL will then return the old code.
Therefore is there a way of checking that the content of the bundle matches the hashed cache buster? And if it doesn't throw a 404 for example.
Using my example above when the request goes back to the old server for the bundle it would check the contents of the bundle, generate a content hash and compare this to the querystring.
In this instance the cache-buster wouldn't match the actual content hash and a 404 would be returned.
Eventually a user would hit a new server with the bundle request and the correct content would be cached.
We're bound to encounter the same issue ourselves soon, but we've been sticking to only 2 update domains (split the servers in half so that there's not more than one version running at a time).
As far as I see it there are 4 possible alternatives:
Have your static content always point to an up to date server. This could be done (depending on your configuration) by IP address or by using a URL that you know has already been updated (if you have a server that gets updated first).
Configure your load balancer so that requests from the same IP address end up on the same system (if your static content is served from your app servers). If this can't be done at the load balancer level then you can do it further up by configuring different IP addresses for different environments and then swapping their DNS records around.
Implement a handler in ASP.NET that listens for CSS files, and checks whether the hash of the bundle is what it expected to be. You will probably need a singleton object to be storing these when they're generated when your app loads. This can then either return a 404, 301 (to get them to retry) or return the old version but instruct it to not cache the results. Note that with HttpPipelining you're not likely to hit a different server so a redirect might not help.
Have a config flag that is set while you're doing a deploy that changes all of your cache-busting URLs to be the current date. This will effectively disable all caching until your deploy is completed, meaning that any "wrong" assets delivered will not be kept.
Is this a problem you're actually seeing, or is it hypothetical? Unless your site is very high traffic and your deploys take a good few minutes it's not something you're likely to see. You'll want to be wary of returning 404s as sometimes the wrong stylesheet is better than no stylesheet.

Do any CDNs allow rewriting request URI's so that client-side routing plays nicely with browser refreshes?

I have an HTML5 app written in static html/js/css (it's actually written in Dart, but compiles down to javascript). I'm serving the application files via CDN, with the REST api hosted on a separate domain. The app uses client-side routing, so as the user goes about using the app, the url might change to be something like http://www.myapp.com/categories. The problem is, if the user refreshes the page, it results in a 404.
Are there any CDN's that would allow me to create a rule that, if the user requests a page that is a valid client-side route, it would just return the (in my case) client.html page?
More detailed explanation/example
The static files for my web app are stored on S3 and served via Amazon's CloudFront CDN. There is a single HTML file that bootstraps the application, client.html. This is the default file served when visiting the domain root, so if you go to www.mysite.com the browser is actually served www.mysite.com/client.html.
The web app uses client-side routing. Once the app loads and the user starts navigating, the URL is updated. These urls don't actually exist on the CDN. For example, if the user wanted to browse widgets, she would click a button, client-side routing would display the "widgets" view, and the browser's url would update to www.mysite.com/widgets/browse. On the CDN, /widgets/browse doesn't actually exist, so if the user hits the refresh button on the browser, they get a 404.
My question is whether or not any CDNs support looking at the request URI and rewriting it. So, I could see a request for /widgets/browse and rewrite it to /client.html. That way, the application would be served instead of returning a 404.
I realize there are other solutions to this problem, namely placing a server in front of the CDN, but it's less ideal.
I do this using CloudFront, but I use my own server running Apache to accomplish this. I realize you're using a server with Amazon, but since you didn't specify that you're restricted to that, I figured I'd answer with how to accomplish what you're looking to do anyway.
It's pretty simple. Any time you query something that isn't already in the cache on CloudFront, or exists in the Cache but is expired, CloudFront goes back to your web server asking it to serve up the content. At this point, you have total control over the request. I use the mod_rewrite in Apache to capture the request, then determine what content I'm going to serve depending on the request. In fact, there isn't a single file (spare one php script) on my server, yet cloudfront believes there are thousands. Pretty sure url rewriting is standard on most web servers, I can only confirm on lighttp and apache from my own experience though.
More Info
All you're doing here is just telling your server to rewrite incoming requests in order to satisfy them. This would not be considered a proxy or anything of the sort.
The flow of content between your app and your server, with cloudfront in between is like this:
appRequest->cloudFront
if cloudFront has file, return data to user without asking your server
for the file.
If cloudFront DOESN'T have the file (or it has expired), go back to
the origin server and ask it for a new copy to cache.
So basically, what is happening in your situation is this:
A)app->ask cloudfront for url cloud front doesn't have
B)cloudfront
then asks your source server for the file
C)file doesn't exist there,
so the server tells cloudFront to go fly a kite
D)cloudFront comes back empty handed and makes your app 404
E)app crashes and
burns, users run away and use something else.
So, all you're doing with mod_rewrite is telling your server how it can re-interpret certain formatted requests and act accordingly. You could point all .jpg requests to point to singleImage.jpg, then have your app ask for:
www.mydomain.com/image3.jpg
www.mydomain.com/naughtystuff.jpg
Neither of those images even have to exist on your server. Apache would just honor the request by sending back singleImage.jpg. But as far as cloudfront or your app is concerned, those are two different files residing at two different unique places on the server.
Hope this clears it up.
http://httpd.apache.org/docs/current/mod/mod_rewrite.html
I think you are using the URL structure in a wrong way. the path which is defined by forward slashes is supposed to bring you to a specific resource, in your example client.html. However, for routing beyond that point (within that resource) you should make use of the # - as is done in many javascript frameworks. This should tell your router what the state of the resource (your html page or app) is. if there are other resources referenced, e.g. images, then you should provide different paths for them which would go through the CDN.

http redirects to https

What would cause a site to try an go to an https url?
We have sitecore set up to redirect non www URLs to www pre-pended URLs. Example: joesrx.com resolves to www.joesrx.com through the Sitecore URLResolver.
What we are seeing is that if you type joesrx.com, it tries to go to https://joesrx.com before it hits the Sitecore server. Since there are no certificates on this server and https is not utilized we get a 404.
Is there something in IIS that is misconfigured? Proxy teams says it is not in their setting and network team says all of the DNS entries are correct.
As a general rule for debugging these sorts of problems, try to imagine all the elements between you and the application and then use a simple divide and conquer approach. You can also test behavior on individual levels of the path between you and the actual application.
In this case for example (from you to application code):
User
Browser
browser may do caching of redirects. Try a different browser / try incognito mode / clear cache
Browser Extensions/Settings
any extensions which make it so the browser always tries to access website(s) via https? Try with extension disabled / another browser
Proxies/Firewalls
any Proxies/Firewalls on your end which may modify requests? Can you try to access the site bypassing any proxies/firewalls, maybe from a different network connection?
Network
Web Server
Web Server Configuration / Pipelines / Resolvers / Filters / Etc.
.htaccess / IIS config / nginx config / servlet filters / (lots of options depending on your framework). Check the server
Actual application code
well.. check the code.
Example of divide and conquer, choosing the Network mid-point: Try accessing the URL with wget/curl from command-line, curl -i will also show you the headers received from the server. If you find a "Location: .." header it's clear that the server is sending a redirect. So now you only have to check Web Server / framework configuration and actual application code.
There are a few things I would check first:
Do you have rewrite rules in your web.config? They may be pattern-matching on www. and redirecting in order to enforce SSL
Do you have code in your pipelines that is attempting to enforce SSL for specific paths? The code here may not be checking the URL correctly.
In IIS, did you bind the 'www' host name to your IIS site? Or is it falling through to another site that has SSL enforced?
In case the other answers don't help, check for HTST headers such as "Strict-Transport-Security: max-age=31536000".
This HTTP header tells browsers to use only SSL for future requests (among other things).
For more info check out:
https://www.owasp.org/index.php/HTTP_Strict_Transport_Security

Resources