Suppose you have a request for the following json file.
e.g. https://example.com/aaa/bbb/ccc.json
I want to have Nginx handle 'aaa' as a fixed string, and 'bbb' and 'ccc' as changing strings.
In this case, I want the proxy to work ignoring the 'bbb' path in the URL.
How should I describe this in a location block in Nginx?
location /aaa {
# I want to ignore 'bbb' path and return ccc.json. 'ccc' varies.
alias /mydirectory/aaa;
try_files $url;
}
Locations in nginx can be defined with regular expressions (docs).
Depending on how specific you'd like to be, you might choose ~ for case-sensitive URI matching or ~* for case-insensitive URI matching. ^~ is also available as a prefix match, but note a match here will stop the engine from performing additional regex matching. Caveat emptor.
Something like this would be a relatively straightforward way to "absorb" the bbb (or any other combo of 1 or more letters, upper and lower case):
location ~ ^\/aaa\/[a-zA-Z]+ {
...
}
In order to capture the ccc.json, you'll want to add a capturing group as well:
location ~ ^\/aaa\/[a-zA-Z]+\/([\w\.]+) {
return 200 '$1';
}
Bonus: if you're using PCRE 7+ (docs), you can name your capture:
location ~ ^\/aaa\/[a-zA-Z]+\/(?<filename>[\w\.]+) {
return 200 '$filename';
}
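To sanity-check the pattern outside nginx, here is a minimal Python sketch. It assumes Python's re module behaves closely enough to PCRE for this particular pattern. Python spells named groups (?P<name>...), and the helper name extract_filename is made up for illustration.

```python
import re

# Same shape as the nginx location regex, minus the optional slash escapes.
# Python writes named groups as (?P<name>...) rather than (?<name>...).
LOCATION_RE = re.compile(r"^/aaa/[a-zA-Z]+/(?P<filename>[\w.]+)")


def extract_filename(uri):
    """Return the captured file name, or None if the location would not match."""
    match = LOCATION_RE.search(uri)
    return match.group("filename") if match else None


print(extract_filename("/aaa/bbb/ccc.json"))  # middle segment is letters only
print(extract_filename("/aaa/123/ccc.json"))  # digits are not in [a-zA-Z]
```

If the middle segment can contain digits or dashes, the character class [a-zA-Z] would need to be widened accordingly.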
Here is part of my nginx configuration:
location ~ \.php$ {
include snippets/fastcgi-php.conf;
fastcgi_pass unix:/run/php/php7.4-fpm.sock;
}
location /wp-content/uploads/ {
location ~ \.(aspx|php|jsp|cgi)$ { return 410; }
}
As I understand it, the order of priority in location blocks goes like this:
= (exact match) --> ^~ (preferential prefix) --> ~ (regex) or ~* (case-insensitive regex) --> (prefix - no special character)
I put a PHP file in /wp-content/uploads. I got a 410 response code (which is what I want). But I don't understand why the first location block didn't capture the request, since regex blocks take precedence over prefix.
Furthermore, when two regex locations match, the first declared regex location match takes the request. Yet the latter processes the request here.
Why am I getting the 410 response code for /wp-content/uploads/info.php?!?
I found the answer. There's a little-known secret of nginx location matching.
Here is the order of precedence, according to artfulrobot.uk:
1. Exact string matches location = /foo
2. The longest of any location ^~ ... matches
3. The first regex match that is nested within the single longest matching prefix match!
4. The first other regex match location ~ regex
5. The longest prefix match location /foo
Everybody knows about #1, #2, #4, and #5.
But #3 is the gotcha. That's what is happening here.
You'll want to read the article, as the author describes this seemingly undocumented and strange nginx behavior in great detail.
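The five-step order above can be modelled in a few lines of Python. This is only a simplified sketch of the list, not nginx's actual implementation: the Location class, the labels, and pick_location are invented for illustration, and step 2 is modelled as "the longest matching prefix carries ^~".

```python
import re
from dataclasses import dataclass, field


@dataclass
class Location:
    modifier: str  # "=", "^~", "~", "~*", or "" for a plain prefix
    pattern: str
    label: str
    nested: list = field(default_factory=list)


def regex_hit(loc, uri):
    flags = re.IGNORECASE if loc.modifier == "~*" else 0
    return re.search(loc.pattern, uri, flags) is not None


def pick_location(locations, uri):
    # 1. Exact string match.
    for loc in locations:
        if loc.modifier == "=" and loc.pattern == uri:
            return loc.label
    # Find the single longest matching prefix (plain or ^~).
    prefixes = [l for l in locations
                if l.modifier in ("", "^~") and uri.startswith(l.pattern)]
    longest = max(prefixes, key=lambda l: len(l.pattern), default=None)
    # 2. A ^~ prefix wins outright and skips all regex checks.
    if longest is not None and longest.modifier == "^~":
        return longest.label
    # 3. Regexes nested inside the longest prefix are tried first.
    if longest is not None:
        for loc in longest.nested:
            if loc.modifier in ("~", "~*") and regex_hit(loc, uri):
                return loc.label
    # 4. Then top-level regexes, in declaration order.
    for loc in locations:
        if loc.modifier in ("~", "~*") and regex_hit(loc, uri):
            return loc.label
    # 5. Fall back to the longest prefix, if any.
    return longest.label if longest is not None else None


# The configuration from the question, reduced to labels.
CONFIG = [
    Location("~", r"\.php$", "php-fpm"),
    Location("", "/wp-content/uploads/", "uploads", nested=[
        Location("~", r"\.(aspx|php|jsp|cgi)$", "410"),
    ]),
]

print(pick_location(CONFIG, "/wp-content/uploads/info.php"))
print(pick_location(CONFIG, "/index.php"))
```

In this model the request for /wp-content/uploads/info.php reaches the nested regex in step 3, before the top-level \.php$ regex is ever consulted in step 4.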
I have the following rule:
location /foo {
Which matches well for the following examples:
mydomain.com/foo
mydomain.com/foo/
mydomain.com/foo/bar?example=true
However it is also matching for
mydomain.com/foobar
I don't want it to match that last one (/foobar); it should only match if there is either nothing after foo, or a slash followed by zero or more characters. I've tried location /foo/ { but that does not produce the desired results either.
Can anyone shed some light on how to do this?
There are two ways to handle this: use a regular expression location block, or just handle /foo separately from /foo/.
Regular expression location blocks have a different evaluation order and are less efficient than prefix location blocks, so my preferred solution is the exact match location and prefix location.
Generally, /foo just redirects to /foo/, for example:
location = /foo {
return 302 /foo/;
}
location /foo/ {
...
}
See this document for more.
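To see which paths the exact-match plus prefix pair accepts, here is a small Python sketch. It only mimics the matching rule on a query-less path; the function name route and its return strings are invented for illustration.

```python
def route(path):
    """Mimic `location = /foo` plus `location /foo/` for a query-less path."""
    if path == "/foo":
        return "redirect to /foo/"   # the exact-match location
    if path.startswith("/foo/"):
        return "handled by /foo/"    # the prefix location
    return "no match"


for candidate in ["/foo", "/foo/", "/foo/bar", "/foobar"]:
    print(candidate, "->", route(candidate))
```

The query string (?example=true) is left out because nginx matches locations against the path only.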
You can create a separate location rule for /foobar if you need it as an exception:
location = /foobar {
....
}
Nginx checks exact-match locations (the = operator) first.
I am trying to set the $qualifies variable to 1 if both the path = "/" AND the query variable "band" != "". I've been able to figure out the separate sections below (I think) but wanted to know if there is an easier way. Seems like this could all be in 1 map:
map $request_uri $myvar {
~(?<captured_path>[^?]*) $captured_path;
}
map $arg_band $band{
"" 0;
default 1;
}
map "$myvar:$band" $qualifies{
default 0;
"/:1" 1;
}
I want to do this because the above is ugly, and I know there's probably a better way.
It seems like you weren't aware that a query-less URL is already available in $uri, so another potential solution is as follows:
map $uri:$arg_band $qualified {
default 0;
~^/:[^:]+$ 1;
}
Note both $uri and $arg_band can contain "weird" characters (e.g., both can contain ?, in case of $uri, through %3f), so, you gotta be sure in your regex to match your actual separator, and not a placeholder supplied by the user. This can either be done by making it random, long and secret, or by restricting the acceptable input from the user.
Note that without knowing what other logic is employed and how it makes use of the variables, most of the obvious and good-looking solutions would actually contain potential security vulnerabilities and be incorrect (e.g., the solution above may be incorrect if $arg_band does contain :).
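The separator caveat is easy to reproduce. The Python sketch below applies the same regex to a "$uri:$arg_band" string. It assumes Python's re agrees with PCRE on this pattern, and the function name qualified is made up for illustration.

```python
import re

# Same pattern as the map entry: the path must be "/" and band must be
# a non-empty string that contains no ":".
QUALIFIED_RE = re.compile(r"^/:[^:]+$")


def qualified(uri, band):
    """Return 1 when the joined string matches, else 0 (the map default)."""
    return 1 if QUALIFIED_RE.search(f"{uri}:{band}") else 0


print(qualified("/", "rock"))     # path "/" with a band value
print(qualified("/", ""))         # band missing
print(qualified("/foo", "rock"))  # wrong path
print(qualified("/", "a:b"))      # band contains the separator
```

The last call returns 0 even though band is non-empty, which is exactly the kind of incorrect result the note above warns about.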
Abstract
So, you're trying to set $qualifies to 1 if both $uri is / and $arg_band is not set to anything?
Basic Idea
The basic idea, compared to your own code, is that we rewrite the logical AND as a logical OR of negations (De Morgan's law): (a && b) is always the same as !(!a || !b). Once you know the theory, the rest is simply a bit of coding.
And, indeed, it's very simple to do with a single http://nginx.org/r/map:
map $request_uri $qualifies {
default 1;
~^[^?]+[?]band=[^&] 0; # match if $arg_band set to .+, case 1
~^[^?]+[?].*&band=[^&] 0; # match if $arg_band set to .+, case 2
~^[/]+[^/?]+ 0; # match if $uri is anything other than "/"
}
Discussion
If you don't like the double cases for handling $arg_band, you can use the lookbehind operator of pcre, however, I believe that the above two cases might actually be more efficient and correct than the single one below:
^[^?]+[?].*(?<=[?&])band=[^&] # incorrect! will match /??band=a
A follow-up question I personally had was whether the above combined regex is actually correct, and would match the way nginx does its own parsing for $arg_band. This can be tested by running various strings against an nginx.conf that simply does something like return 200 $args\t$arg_band\t$uri\t$request_uri\n;. What I found is that $uri is always cut at the first ?, whereas $args itself may contain a second ?. An individual argument name must start either right after the first ? of the request or after an & anywhere later; in other words, a question mark within $args is treated as a regular character. So the above lookbehind code with (?<=[?&]) is incorrect, because a string like /??band=t is matched differently by the regex than by nginx's own $arg_band.
So, if you still want to combine the two expressions, then perhaps the following should be the most correct one:
^[^?]+[?](?:.*&)?band=[^&]
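The difference between the two patterns can be checked outside nginx. This Python sketch assumes Python's re matches PCRE for these constructs (fixed-width lookbehind included). The names LOOKBEHIND_RE, COMBINED_RE and band_is_set are invented for illustration.

```python
import re

LOOKBEHIND_RE = re.compile(r"^[^?]+[?].*(?<=[?&])band=[^&]")
COMBINED_RE = re.compile(r"^[^?]+[?](?:.*&)?band=[^&]")


def band_is_set(pattern, request_uri):
    """True when the pattern thinks the request carries a non-empty band."""
    return pattern.search(request_uri) is not None


for uri in ["/?band=t", "/?a=1&band=t", "/??band=t"]:
    print(uri,
          band_is_set(LOOKBEHIND_RE, uri),
          band_is_set(COMBINED_RE, uri))
```

Only /??band=t separates the two: the lookbehind pattern accepts the second ? as a delimiter, while the combined pattern does not.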
Summary
Making the overall solution:
map $request_uri $qualifies {
default 1;
~^[^?]+[?](.*&)?band=[^&] 0;
~^[/]+[^/?]+ 0;
}
However, you also have to consider how absolutely correct your solution has to be, and if a very high degree of correctness is required, then it may not be appropriate to do your own parsing of $request_uri (just as an example, when $request_uri is /a/../, the $uri will be just / due to URL normalisation, and your original solution already suffered from this).
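For experimenting with edge cases, the two-rule map can be imitated in Python. This is a sketch under the assumption that a map with regex entries tries them in order and falls back to the default; RULES, DEFAULT and qualifies are names chosen for illustration. It follows this answer's reading, where 1 means the path is / and band is empty.

```python
import re

# (pattern, value) pairs in the same order as the map block.
RULES = [
    (re.compile(r"^[^?]+[?](.*&)?band=[^&]"), 0),  # band has a value
    (re.compile(r"^[/]+[^/?]+"), 0),               # path is not just "/"
]
DEFAULT = 1


def qualifies(request_uri):
    """Return the value of the first matching rule, else the default."""
    for pattern, value in RULES:
        if pattern.search(request_uri):
            return value
    return DEFAULT


for uri in ["/", "/?band=rock", "/?x=1&band=rock", "/foo", "/?band=", "/a/../"]:
    print(uri, qualifies(uri))
```

The last input illustrates the normalisation caveat: the raw string /a/../ is rejected by the second rule even though nginx would normalise $uri to /.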
Given url like this: media/images/293_84072edb91d2b62387f529e2c4456c85f4dadee5
I want to build the file path with the following rules:
if
location ~* /media/images/\d*_(?<hash>[a-z0-9]{40})
then
take the variable $hash (84072edb91d2b62387f529e2c4456c85f4dadee5) and get its first char ('8'); let's call it $hash[0]
then take the second and third chars together ('40'), called $hash[1:3]
and then the root where nginx looks for the image must look like this:
media/images/$hash[0]/$hash[1:3]/$hash
root -> media/images/8/40/84072edb91d2b62387f529e2c4456c85f4dadee5
How can I write this rule? Please help me understand.
location ~ ^/media/images/\d*_(?<hash0>[a-z0-9])(?<hash13>[a-z0-9]{2})(?<hash>[a-z0-9]{37})$ {
alias /server/path/$hash0/$hash13/$hash0$hash13$hash;
}
BTW, in pure nginx you can't manipulate strings the way you would in a programming language; nginx only has regexps.
But a regexp can capture any part of a string as a group. In your example, the 40 characters of the hash are captured as one named group. You could just as easily create 2 capture groups: one with 1 character and a second with the next 39 characters. Or you can create 3 capture groups: one with 1 character ((?<hash0>[a-z0-9]) in my example), one with the next 2 characters ((?<hash13>[a-z0-9]{2})), and one with the remaining 37 characters ((?<hash>[a-z0-9]{37})). Each group has its own name. Now we can build the path from these captures.
Btw, group names are not required; this example can be written as
location ~ ^/media/images/\d*_([a-z0-9])([a-z0-9]{2})([a-z0-9]{37})$ {
alias /server/path/$1/$2/$1$2$3;
}
Named groups are important if you have several different regexps (in location, in server name, in if, etc.), because numbered captures such as $1 are overwritten by each later regexp match.
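The three-group split can be tried out in Python before touching the nginx config. This is a minimal sketch: /server/path is the placeholder base from the example above, the third group is renamed rest for readability, and image_path is an invented helper.

```python
import re

# Python spells named groups (?P<name>...); otherwise the pattern is the same.
HASH_RE = re.compile(
    r"^/media/images/\d*_"
    r"(?P<hash0>[a-z0-9])(?P<hash13>[a-z0-9]{2})(?P<rest>[a-z0-9]{37})$"
)


def image_path(uri, base="/server/path"):
    """Build base/<1st char>/<2nd+3rd chars>/<full hash>, or None on no match."""
    match = HASH_RE.match(uri)
    if match is None:
        return None
    hash0, hash13, rest = match.group("hash0", "hash13", "rest")
    return f"{base}/{hash0}/{hash13}/{hash0}{hash13}{rest}"


print(image_path("/media/images/293_84072edb91d2b62387f529e2c4456c85f4dadee5"))
```

For the URL in the question this yields a path ending in /8/40/ followed by the full 40-character hash.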
Now, why alias and not root? root sets the server root for this location: with location /a/ and root /home/www/, the file path for /a/test.txt will be /home/www/a/test.txt. alias replaces the location part of the path, so with alias /home/www/ the file path for /a/test.txt will be /home/www/test.txt.
So use root if your location and file structure are the same, and alias if the location does not map directly to a file system path.
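The root versus alias mapping can be written down as two tiny Python functions. This sketch assumes a plain prefix location (not a regex one) and trims the trailing slash of root so the result matches the paths quoted above. Both function names are made up.

```python
def resolve_root(root, uri):
    """root: the full request URI is appended to the root directory."""
    return root.rstrip("/") + uri


def resolve_alias(location, alias, uri):
    """alias: the matched location prefix is replaced by the alias path."""
    return alias + uri[len(location):]


print(resolve_root("/home/www/", "/a/test.txt"))
print(resolve_alias("/a/", "/home/www/", "/a/test.txt"))
```

With a regex location, as in the answer above, alias has to spell out the whole file path using the captures, since there is no fixed prefix to replace.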
Given website addresses, e.g.
http://www.example.com/page1/#
https://subdomain.example2.co.uk/asdf?retrieve=2
How do I return the root domain in R, e.g.
example.com
example2.co.uk
For my purposes I would define the root domain to have structure
example_name.public_suffix
where example_name excludes "www" and public_suffix is on the list here:
https://publicsuffix.org/list/effective_tld_names.dat
Is this still the best regex based solution:
https://stackoverflow.com/a/8498629/2109289
What about something in R that parses root domain based off the public suffix list, something like:
http://simonecarletti.com/code/publicsuffix/
Edited: Adding extra info based on Richard's comment
Using XML::parseURI seems to return the stuff between the first "//" and "/". e.g.
> parseURI("http://www.blog.omegahat.org:8080/RCurl/index.html")$server
[1] "www.blog.omegahat.org"
Thus, the question reduces to having an R function that can return the public suffix from the URI, or implementing the following algorithm on the public suffix list:
Algorithm
1. Match domain against all rules and take note of the matching ones.
2. If no rules match, the prevailing rule is "*".
3. If more than one rule matches, the prevailing rule is the one which is an exception rule.
4. If there is no matching exception rule, the prevailing rule is the one with the most labels.
5. If the prevailing rule is an exception rule, modify it by removing the leftmost label.
6. The public suffix is the set of labels from the domain which directly match the labels of the prevailing rule (joined by dots).
7. The registered or registrable domain is the public suffix plus one additional label.
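The algorithm itself is language-independent, so here is a rough Python sketch of it that could be ported to R. It runs against a tiny hand-picked rule set rather than the real public suffix list, and SAMPLE_RULES, rule_matches and registrable_domain are all hypothetical names.

```python
# Tiny, hand-picked rule set for illustration only; the real list is far longer.
SAMPLE_RULES = ["com", "uk", "co.uk", "*.ck", "!www.ck"]


def rule_matches(rule_labels, host_labels):
    """Compare labels right to left; '*' matches any single label."""
    if len(rule_labels) > len(host_labels):
        return False
    pairs = zip(reversed(rule_labels), reversed(host_labels))
    return all(rule == "*" or rule == label for rule, label in pairs)


def registrable_domain(host, rules=SAMPLE_RULES):
    host_labels = host.lower().rstrip(".").split(".")
    matching = []
    for rule in rules:
        is_exception = rule.startswith("!")
        rule_labels = rule.lstrip("!").split(".")
        if rule_matches(rule_labels, host_labels):
            matching.append((is_exception, rule_labels))

    exceptions = [m for m in matching if m[0]]
    if not matching:
        is_exception, suffix_labels = False, ["*"]   # default rule
    elif exceptions:
        is_exception, suffix_labels = exceptions[0]  # exception rule prevails
    else:
        # Otherwise the rule with the most labels prevails.
        is_exception, suffix_labels = max(matching, key=lambda m: len(m[1]))

    if is_exception:
        suffix_labels = suffix_labels[1:]            # drop the leftmost label

    wanted = len(suffix_labels) + 1                  # suffix plus one label
    if len(host_labels) < wanted:
        return None                                  # host is a bare suffix
    return ".".join(host_labels[-wanted:])


print(registrable_domain("www.example.com"))
print(registrable_domain("subdomain.example2.co.uk"))
```

A real implementation would load the rules from effective_tld_names.dat and would also need to handle internationalised domain names, which this sketch ignores.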
There are two tasks here. The first is parsing the URL to get the host name, which can be done with the httr package's parse_url function:
library(httr)
host <- parse_url("https://subdomain.example2.co.uk/asdf?retrieve=2")$hostname
host
# [1] "subdomain.example2.co.uk"
The second is extracting the organizational domain (or root domain, top private domain--whatever you want to call it). This can be done using the tldextract package (which is inspired by the Python package of the same name and uses Mozilla's public suffix list):
library(tldextract)
domain.info <- tldextract(host)
domain.info
# host subdomain domain tld
# 1 subdomain.example2.co.uk subdomain example2 co.uk
tldextract returns a data frame, with a row for each domain you give it, but you can easily paste together the relevant parts:
paste(domain.info$domain, domain.info$tld, sep=".")
# [1] "example2.co.uk"
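For comparison, the first of the two tasks (getting the host name) looks like this in Python using only the standard library. This is a side-by-side sketch, not part of the R solution, and host_of is an invented helper name.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the lower-cased host name of a URL, without port or path."""
    return urlsplit(url).hostname


print(host_of("https://subdomain.example2.co.uk/asdf?retrieve=2"))
print(host_of("http://www.blog.omegahat.org:8080/RCurl/index.html"))
```

As in the R version, this step alone does not know about public suffixes, so the second task still needs a suffix-list lookup.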
Something like this should help:
> strsplit(gsub("http://|https://|www\\.", "", "http://www.example.com/page1/#"), "/")[[c(1, 1)]]
[1] "example.com"
> strsplit(gsub("http://|https://|www\\.", "", "https://subdomain.example2.co.uk/asdf?retrieve=2"), "/")[[c(1, 1)]]
[1] "subdomain.example2.co.uk"