nginx: get root based on URL regex with some calculation rules

Given a URL like this: media/images/293_84072edb91d2b62387f529e2c4456c85f4dadee5
I want to build the file path using the following rules:
if
location ~* /media/images/\d*_(?<hash>[a-z0-9]{40})
then
take the variable $hash (84072edb91d2b62387f529e2c4456c85f4dadee5) and get its first character ('8'); let's call it $hash[0]
then take the second and third characters together ('40'); call that $hash[1:3]
and then the root where nginx looks for the image must look like this:
media/images/$hash[0]/$hash[1:3]/$hash
root -> media/images/8/40/84072edb91d2b62387f529e2c4456c85f4dadee5
How can I write this rule? Please help me understand.

location ~ ^/media/images/\d*_(?<hash0>[a-z0-9])(?<hash13>[a-z0-9]{2})(?<hash>[a-z0-9]{37})$ {
alias /server/path/$hash0/$hash13/$hash0$hash13$hash;
}
BTW, pure nginx can't manipulate strings the way a programming language can. It only has regular expressions.
But a regex can capture any part of a string as a group. In your example, all 40 characters of the hash are captured as one named group. You can just as easily create two capture groups: one with the first character and one with the next 39. Or three: the first character ((?<hash0>[a-z0-9]) in my example), the next two ((?<hash13>[a-z0-9]{2})), and the remaining 37 ((?<hash>[a-z0-9]{37})). Each group has its own name, and the path can be built from these captures.
Btw, group names are not required; this example can also be written as
location ~ ^/media/images/\d*_([a-z0-9])([a-z0-9]{2})([a-z0-9]{37})$ {
alias /server/path/$1/$2/$1$2$3;
}
Named groups matter when several different regexes are in play (in location, in server_name, in if, etc.), because numbered captures such as $1 are overwritten by the most recent regex match.
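To see what the three capture groups produce, here is a minimal Python sketch of the same regex. It only mirrors the pattern from the location above; the function name image_path and the /server/path base are placeholders, and Python's (?P<name>...) syntax stands in for nginx's (?<name>...).

```python
import re

# Same three groups as the nginx location: 1 char, next 2 chars, remaining 37.
PATTERN = re.compile(
    r"^/media/images/\d*_"
    r"(?P<hash0>[a-z0-9])(?P<hash13>[a-z0-9]{2})(?P<hash>[a-z0-9]{37})$"
)

def image_path(uri, base="/server/path"):
    """Return the file path the alias would build, or None if the URI does not match."""
    m = PATTERN.match(uri)
    if m is None:
        return None
    # Equivalent of: alias /server/path/$hash0/$hash13/$hash0$hash13$hash;
    return f"{base}/{m['hash0']}/{m['hash13']}/{m['hash0']}{m['hash13']}{m['hash']}"

print(image_path("/media/images/293_84072edb91d2b62387f529e2c4456c85f4dadee5"))
# /server/path/8/40/84072edb91d2b62387f529e2c4456c85f4dadee5
```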
Now, why alias and not root? root sets the server root for the location: with location /a/ and root /home/www/, the file path for /a/test.txt is /home/www/a/test.txt. alias replaces the matched location: with alias /home/www/, the file path for /a/test.txt is /home/www/test.txt.
So use root if your location and file structure are the same, and alias if the location does not map directly to a file system path.
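The difference can be sketched in a few lines of Python. This is only a model of the path arithmetic for a prefix location, not of nginx itself; the function names with_root and with_alias are made up for the illustration.

```python
def with_root(root, uri):
    """root: the full request URI is appended to the root directory."""
    # The trailing slash is trimmed here only to avoid a doubled "/" in the result.
    return root.rstrip("/") + uri

def with_alias(location, alias, uri):
    """alias: the matched location prefix is replaced by the alias path."""
    return alias + uri[len(location):]

print(with_root("/home/www/", "/a/test.txt"))          # /home/www/a/test.txt
print(with_alias("/a/", "/home/www/", "/a/test.txt"))  # /home/www/test.txt
```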

Related

Nginx Nested Location Priority

Here is part of my nginx configuration:
location ~ \.php$ {
include snippets/fastcgi-php.conf;
fastcgi_pass unix:/run/php/php7.4-fpm.sock;
}
location /wp-content/uploads/ {
location ~ .(aspx|php|jsp|cgi)$ { return 410; }
}
As I understand it, the order of priority in location blocks goes like this:
= (exact match) --> ^~ (preferential prefix) --> ~ (regex) or ~* (case-insensitive regex) --> (prefix - no special character)
I put a PHP file in /wp-content/uploads. I got a 410 response code (which is what I want). But I don't understand why the first location block didn't capture the request, since regex blocks take precedence over prefix.
Furthermore, when two regex locations match, the first declared regex location match takes the request. Yet the latter processes the request here.
Why am I getting the 410 response code for /wp-content/uploads/info.php?!?
I found the answer. There's a little-known secret of nginx location matching.
Here is the order of precedence, according to artfulrobot.uk:
1. Exact string matches location = /foo
2. The longest of any location ^~ ... matches
3. The first regex match that is nested within the single longest matching prefix match!
4. The first other regex match location ~ regex
5. The longest prefix match location /foo
Everybody knows about #1, #2, #4, and #5.
But #3 is the gotcha. That's what is happening here.
You'll want to read the article as he very granularly describes this seemingly undocumented and strange behavior of nginx.
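As a rough illustration, the ordering above can be modelled in Python. This is a simplified sketch, not nginx's actual implementation: the tuple layout and the name choose_location are assumptions, nesting is only one level deep, and a ^~ location is chosen only when it is the longest matching prefix, which is how the nginx documentation describes it.

```python
import re

def _regex_matches(modifier, pattern, uri):
    if modifier == "~":
        return re.search(pattern, uri) is not None
    if modifier == "~*":
        return re.search(pattern, uri, re.IGNORECASE) is not None
    return False

def choose_location(uri, locations):
    """locations: (modifier, pattern, label, nested) tuples in config order;
    nested is a list of (modifier, pattern, label) tuples."""
    # 1. Exact match wins outright.
    for modifier, pattern, label, _ in locations:
        if modifier == "=" and uri == pattern:
            return label
    # Longest matching prefix, whether plain ("") or preferential ("^~").
    prefixes = [loc for loc in locations
                if loc[0] in ("", "^~") and uri.startswith(loc[1])]
    longest = max(prefixes, key=lambda loc: len(loc[1]), default=None)
    # 2. A preferential prefix skips all regex checks.
    if longest is not None and longest[0] == "^~":
        return longest[2]
    # 3. Regexes nested inside the longest matching prefix, in order.
    if longest is not None:
        for modifier, pattern, label in longest[3]:
            if _regex_matches(modifier, pattern, uri):
                return label
    # 4. Top-level regexes, in order.
    for modifier, pattern, label, _ in locations:
        if _regex_matches(modifier, pattern, uri):
            return label
    # 5. Fall back to the longest prefix, if any.
    return longest[2] if longest is not None else None

config = [
    ("~", r"\.php$", "php-fpm", []),
    ("", "/wp-content/uploads/", "uploads",
     [("~", r".(aspx|php|jsp|cgi)$", "return 410")]),
]
print(choose_location("/wp-content/uploads/info.php", config))  # return 410
print(choose_location("/index.php", config))                    # php-fpm
```

With the configuration from the question, the nested regex is reached at step 3, before the top-level \.php$ location is ever tested.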

How to set wildcard reverse proxy in Nginx

Suppose you have a request for the following json file.
e.g. https://example.com/aaa/bbb/ccc.json
I want to have Nginx handle 'aaa' as a fixed string, and 'bbb' and 'ccc' as changing strings.
In this case, I want the proxy to work ignoring the 'bbb' path in the URL.
How should I describe this in a location block in Nginx?
location /aaa {
# I want to ignore 'bbb' path and return ccc.json. 'ccc' varies.
alias /mydirectory/aaa;
try_files $url;
}
Locations in nginx can be defined with regular expressions (docs).
Depending on how specific you'd like to be, you might choose ~ for case-sensitive URI matching or ~* for case-insensitive URI matching. ^~ is also available as a prefix match, but note a match here will stop the engine from performing additional regex matching. Caveat emptor.
Something like this would be a relatively straightforward way to "absorb" the bbb (or any other combo of 1 or more letters, upper and lower case):
location ~ ^\/aaa\/[a-zA-Z]+ {
...
}
In order to capture the ccc.json, you'll want to add a capturing group as well:
location ~ ^\/aaa\/[a-zA-Z]+\/([\w\.]+) {
return 200 '$1';
}
Bonus: if you're using PCRE 7+ (docs), you can name your capture:
location ~ ^\/aaa\/[a-zA-Z]+\/(?<filename>[\w\.]+) {
return 200 '$filename';
}

Xampp Virtualhost

I am configuring a XAMPP Apache server to work with WordPress multisites and do not understand the following directive:
VirtualDocumentRoot "C:/xampp/www/%-2/sub/%-3"
What is the purpose of %-2 and %-3?
Forgive the basic nature of my question but I can't seem to understand the mechanics of these two terms. Can anyone point me to where this notation might be explained?
Thanks in advance for any help or direction
Found the answer,
this is known as "Directory Name Interpolation"
Apache explains this here: https://httpd.apache.org/docs/2.4/mod/mod_vhost_alias.html
I've pasted an excerpt:
Directory Name Interpolation
All the directives in this module interpolate a string into a
pathname. The interpolated string (henceforth called the "name") may
be either the server name (see the UseCanonicalName directive for
details on how this is determined) or the IP address of the virtual
host on the server in dotted-quad format. The interpolation is
controlled by specifiers inspired by printf which have a number of
formats:
%% insert a %
%p insert the port number of the virtual host
%N.M insert (part of) the name
N and M are used to specify substrings of the name. N selects from the
dot-separated components of the name, and M selects characters within
whatever N has selected. M is optional and defaults to zero if it
isn't present; the dot must be present if and only if M is present.
The interpretation is as follows:
0 the whole name
1 the first part
2 the second part
-1 the last part
-2 the penultimate part
2+ the second and all subsequent parts
-2+ the penultimate and all preceding parts
1+ and -1+ the same as 0
If N or M is greater than the number of parts available a single
underscore is interpolated.
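To make the %N rules concrete, here is a small Python sketch that applies them to a host name. It is only an illustration of the excerpt above, not Apache's code: it handles %N (including the N+ forms) but not %%, %p, or the .M character selector, and the host name blog.example.com and the function names are assumptions.

```python
import re

def select_part(name, spec):
    """Pick dot-separated parts of name according to an N or N+ specifier."""
    parts = name.split(".")
    plus = spec.endswith("+")
    n = int(spec.rstrip("+"))
    if n == 0:
        return name                      # 0: the whole name
    if abs(n) > len(parts):
        return "_"                       # out of range: a single underscore
    if n > 0:
        chosen = parts[n - 1:] if plus else [parts[n - 1]]
    else:
        idx = len(parts) + n             # e.g. -1 -> last index
        chosen = parts[:idx + 1] if plus else [parts[idx]]
    return ".".join(chosen)

def interpolate(template, name):
    """Replace every %N / %N+ specifier in template with the selected parts."""
    return re.sub(r"%(-?\d+\+?)", lambda m: select_part(name, m.group(1)), template)

print(interpolate("C:/xampp/www/%-2/sub/%-3", "blog.example.com"))
# C:/xampp/www/example/sub/blog
```

So for a request to blog.example.com, %-2 is the penultimate part (example) and %-3 is the part before it (blog).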

nginx location match rule, without matching trailing characters

I have the following rule:
location /foo {
Which matches well for the following examples:
mydomain.com/foo
mydomain.com/foo/
mydomain.com/foo/bar?example=true
However it is also matching for
mydomain.com/foobar
I don't want it to match to that last one (/foobar), it should only match if there is either nothing after the foo, or a slash and zero or more characters after it. I've tried location /foo/ { but that does not produce desired results either.
Can anyone shed some light on how to do this?
There are two ways to handle this, use a regular expression location block - or just handle /foo separately from /foo/.
Regular expression location blocks have a different evaluation order and are less efficient than prefix location blocks, so my preferred solution is the exact match location and prefix location.
Generally, /foo just redirects to /foo/, for example:
location = /foo {
return 302 /foo/;
}
location /foo/ {
...
}
See this document for more.
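A quick Python sketch shows why the exact-plus-prefix pair excludes /foobar while a bare /foo prefix does not. It models only the string comparison on the URI path (the query string is not part of location matching); the function names are made up for the example.

```python
def bare_prefix(path):
    """Model of: location /foo { ... }"""
    return path.startswith("/foo")

def exact_plus_prefix(path):
    """Model of: location = /foo { return 302 /foo/; } plus location /foo/ { ... }"""
    if path == "/foo":
        return "redirect to /foo/"
    if path.startswith("/foo/"):
        return "handled by /foo/"
    return None

for path in ("/foo", "/foo/", "/foo/bar", "/foobar"):
    print(path, bare_prefix(path), exact_plus_prefix(path))
# /foo True redirect to /foo/
# /foo/ True handled by /foo/
# /foo/bar True handled by /foo/
# /foobar True None
```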
You can create a separate location rule for /foobar if you need it as an exception:
location = /foobar {
....
}
Nginx checks exact-match locations (the = operator) first.

Return root domain from url in R

Given website addresses, e.g.
http://www.example.com/page1/#
https://subdomain.example2.co.uk/asdf?retrieve=2
How do I return the root domain in R, e.g.
example.com
example2.co.uk
For my purposes I would define the root domain to have structure
example_name.public_suffix
where example_name excludes "www" and public_suffix is on the list here:
https://publicsuffix.org/list/effective_tld_names.dat
Is this still the best regex based solution:
https://stackoverflow.com/a/8498629/2109289
What about something in R that parses root domain based off the public suffix list, something like:
http://simonecarletti.com/code/publicsuffix/
Edited: Adding extra info based on Richard's comment
Using XML::parseURI seems to return the stuff between the first "//" and "/". e.g.
> parseURI("http://www.blog.omegahat.org:8080/RCurl/index.html")$server
[1] "www.blog.omegahat.org"
Thus, the question reduces to having an R function that can return the public suffix from the URI, or implementing the following algorithm on the public suffix list:
Algorithm
Match domain against all rules and take note of the matching ones.
If no rules match, the prevailing rule is "*".
If more than one rule matches, the prevailing rule is the one which is an exception rule.
If there is no matching exception rule, the prevailing rule is the one with the most labels.
If the prevailing rule is an exception rule, modify it by removing the leftmost label.
The public suffix is the set of labels from the domain which directly match the labels of the prevailing rule (joined by dots).
The registered or registrable domain is the public suffix plus one additional label.
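The algorithm above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a replacement for a real public suffix library: the rule lists passed in are tiny hypothetical subsets of the full list, and the function names are made up for the example.

```python
def _rule_matches(labels, rule):
    """A rule matches when its labels equal the rightmost labels of the domain
    ('*' matches any single label; a leading '!' marks an exception rule)."""
    rule_labels = rule.lstrip("!").split(".")
    if len(rule_labels) > len(labels):
        return False
    return all(r == "*" or r == l
               for r, l in zip(reversed(rule_labels), reversed(labels)))

def registrable_domain(host, rules):
    labels = host.lower().split(".")
    matching = [r for r in rules if _rule_matches(labels, r)]
    exceptions = [r for r in matching if r.startswith("!")]
    if exceptions:
        # Exception rule prevails; drop its leftmost label.
        prevailing = exceptions[0][1:].split(".")[1:]
    elif matching:
        # Otherwise the rule with the most labels prevails.
        prevailing = max(matching, key=lambda r: r.count(".")).split(".")
    else:
        prevailing = ["*"]               # default rule
    n = len(prevailing)
    if len(labels) <= n:
        return None                      # the host is itself a public suffix
    # Public suffix plus one additional label.
    return ".".join(labels[-(n + 1):])

rules = ["com", "uk", "co.uk"]           # hypothetical subset of the list
print(registrable_domain("www.example.com", rules))           # example.com
print(registrable_domain("subdomain.example2.co.uk", rules))  # example2.co.uk
```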
There are two tasks here. The first is parsing the URL to get the host name, which can be done with the httr package's parse_url function:
library(httr)
host <- parse_url("https://subdomain.example2.co.uk/asdf?retrieve=2")$hostname
host
# [1] "subdomain.example2.co.uk"
The second is extracting the organizational domain (or root domain, top private domain--whatever you want to call it). This can be done using the tldextract package (which is inspired by the Python package of the same name and uses Mozilla's public suffix list):
library(tldextract)
domain.info <- tldextract(host)
domain.info
# host subdomain domain tld
# 1 subdomain.example2.co.uk subdomain example2 co.uk
tldextract returns a data frame, with a row for each domain you give it, but you can easily paste together the relevant parts:
paste(domain.info$domain, domain.info$tld, sep=".")
# [1] "example2.co.uk"
Something like this should help:
> strsplit(gsub("http://|https://|www\\.", "", "http://www.example.com/page1/#"), "/")[[c(1, 1)]]
[1] "example.com"
> strsplit(gsub("http://|https://|www\\.", "", "https://subdomain.example2.co.uk/asdf?retrieve=2"), "/")[[c(1, 1)]]
[1] "subdomain.example2.co.uk"

Resources