I've searched through all of the related topics here but none seems to answer my specific need. Here is the problem: Given a URL (sans protocol), I want to extract the subdomain portion, excluding www. The domain portion is always the same so I don't need to support all TLDs. Examples:
www.subdomain.domain.com should match subdomain
www.domain.com should match nothing
domain.com should match nothing
This is one of the many iterations I have tried:
[^(www\.)]\w+[^(\.domain\.com)]
Square brackets denote a character class. Inside one, order is ignored and most otherwise-special characters lose their meaning, so [^(www\.)] simply matches any single character other than (, ), w, or a dot.
You can try something like this instead:
((?:[^.](?<!www))+)\.domain\.com
regex101 demo
To make the whole match the subdomain, instead of retrieving it from a capture group, turn the trailing part into a lookahead:
((?:[^.](?<!www))+)(?=\.domain\.com)
regexp101 revised
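As a quick way to check the pattern against the three examples from the question, here is a minimal sketch using Python's re module. The helper name extract_subdomain is made up for illustration, and the sketch assumes the input is a bare host name with the fixed domain.com suffix.

```python
import re

# Lookahead variant from the answer: ".domain.com" must follow but is not consumed.
SUBDOMAIN = re.compile(r"((?:[^.](?<!www))+)(?=\.domain\.com)")

def extract_subdomain(host):
    """Return the subdomain label, or None when the host has none (hypothetical helper)."""
    match = SUBDOMAIN.search(host)
    return match.group(1) if match else None

for host in ("www.subdomain.domain.com", "www.domain.com", "domain.com"):
    print(host, "->", extract_subdomain(host))
```

Note that the lookbehind rejects any character that completes the sequence "www", so a label that merely starts with "www" would be cut short; that edge case is outside the examples given in the question.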
I've been reading about URLs: absolute, scheme-relative, root-relative, location-relative.
I still don't understand the difference between these two:
//domain.com/index.html - scheme relative
domain.com/index.html - ?
.
Question 1:
Correct me if I am wrong: //domain.com/index.html will resolve to an absolute URL like one of these:
http://domain.com/index.html
https://domain.com/index.html
ftp://domain.com/index.html
file://domain.com/index.html -- if in email
And browsers will act differently: IE6 doesn't support it, and IE7 and IE8 will fetch the data twice (http and https).
.
Question 2:
How will domain.com/index.html resolve? Same as scheme relative url in Q1? Or is it something else?
.
Question 3:
Is there any difference between these URLs? What is it, and why?
//www.domain.com/index.html
www.domain.com/index.html
.
Question 4:
How will //www.domain.com/index.html resolve?
.
Question 5:
How will www.domain.com/index.html resolve?
It's very easy, looking at URLs like these, to apply your human knowledge of what they probably mean, rather than the much simpler rules implemented by software like web browsers.
The simplest type of URL (or more accurately URI, since some schemes don't represent a Location, only an Identifier) is absolute; it starts with a scheme, then a colon, and no context is needed to resolve it. Examples:
http://example.com
https://www.example.com/foo/bar.baz
http://127.0.0.1:8001
mailto:someone@example.com
data:text/plain,test
urn:example
Then there are location-relative URLs; that is, anything without a scheme, and without a leading slash. These replace everything after the last slash in the current context, but leave the rest in place. If the current context is http://example.com/foo/bar.baz, you could have relative URLs like so:
bob.baz -> http://example.com/foo/bob.baz
thing/widget.gizmo -> http://example.com/foo/thing/widget.gizmo
example.com/page -> http://example.com/foo/example.com/page
Note that that last example looks like a domain name at first glance, but is actually exactly the same as all the other relative URLs.
Root-relative URLs, with a leading slash, are similar, but instead of replacing what follows the last slash, they replace the entire path, starting at the first slash after the host name. Given the same context, the previous examples become:
/bob.baz -> http://example.com/bob.baz
/thing/widget.gizmo -> http://example.com/thing/widget.gizmo
/example.com/page -> http://example.com/example.com/page
A root-relative URL could also contain a colon, because the leading slash cannot be part of a scheme prefix:
/foo:bar -> http://example.com/foo:bar
/urn:example -> http://example.com/urn:example
Finally, there are scheme-relative URLs, with two leading slashes. They replace everything after the original double-slash, so keep only the scheme:
if the context is http://example.com/foo/bar then //example.org/bob means http://example.org/bob
if the context is https://example.com/foo/bar then //example.org/bob means https://example.org/bob
if the context is http://example.com, then //foo.bar means http://foo.bar
Note that that last example doesn't look like a domain name to us, but it still follows the same rules. Whether a URL is actually useful is not taken into account when parsing any of the relative forms.
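These resolutions can be reproduced with the standard library. The following is a minimal sketch using Python's urllib.parse.urljoin, which is one implementation of this kind of reference resolution; the base URLs and references are the ones used in the examples above.

```python
from urllib.parse import urljoin

BASE = "http://example.com/foo/bar.baz"

# (base, reference) pairs taken from the examples in this answer.
CASES = [
    (BASE, "bob.baz"),                                     # location relative
    (BASE, "example.com/page"),                            # looks like a host, is a path
    (BASE, "/bob.baz"),                                    # root relative
    (BASE, "/urn:example"),                                # colon after a leading slash
    ("https://example.com/foo/bar", "//example.org/bob"),  # scheme relative
]

for base, ref in CASES:
    print(f"{ref:20} in {base} -> {urljoin(base, ref)}")
```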
Conventions like "begins with www." and "ends with .com" cannot be relied on, and are not used to determine if a URL is relative or not, so all you need do to answer all your questions is follow this simple set of rules:
If there are two leading slashes, it is scheme relative
If there is one leading slash, it is root relative
If there is no leading slash, but there is a colon, assume it is an absolute URI
If there is no leading slash, and no colon, it is location relative
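The four rules above can be written down almost literally. This is a rough sketch in Python, with a made-up function name classify; it only mirrors the rules as stated here and is not a full URI parser.

```python
def classify(ref):
    """Classify a URL reference using the four rules listed above (hypothetical helper)."""
    if ref.startswith("//"):
        return "scheme-relative"
    if ref.startswith("/"):
        return "root-relative"
    if ":" in ref:
        # No leading slash but a colon: assume the part before it is a scheme.
        return "absolute"
    return "location-relative"

for ref in ("//www.domain.com/index.html", "www.domain.com/index.html",
            "/urn:example", "mailto:someone@example.com"):
    print(ref, "->", classify(ref))
```

Applied to the question, //www.domain.com/index.html is scheme relative, while www.domain.com/index.html is location relative, regardless of the "www." and ".com" parts.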
They are very different. The second one is a relative reference to a path "domain.com/index.html".
WRT "domain.com" vs "www.domain.com": these are simply different host names (or different path names in the second variant).
This one allows everything with .html extension that contains no slashes:
rewrite ^/([^/]+).html$ ...
I need to add another catch to it: URL must contain at least one dash, then it can be rewritten.
How to do that?
Just use logic. A word with at least one dash can be expressed as two words with a dash between them, so the solution is simple:
rewrite ^/([^/]+-[^/]+)\.html$.
Also, you forgot to escape the dot (.), so your regexp also matches a URL such as /somesstrangehtml
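To see the effect of both changes, here is a small sketch that tests the two patterns with Python's re module. It assumes these patterns behave the same way in the rewrite engine as they do here, and the sample paths are invented for illustration.

```python
import re

# Pattern from the answer: at least one dash, dot escaped.
WITH_DASH = re.compile(r"^/([^/]+-[^/]+)\.html$")
# Pattern from the question: the unescaped dot matches any character.
UNESCAPED = re.compile(r"^/([^/]+).html$")

for path in ("/foo-bar.html", "/foobar.html", "/a/foo-bar.html", "/somesstrangehtml"):
    print(path,
          "| with dash:", bool(WITH_DASH.match(path)),
          "| original:", bool(UNESCAPED.match(path)))
```

Because the pattern requires at least one character on each side of the dash, a name whose only dash is the first or last character would not match.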
Hello, I need a way to find the host part of a URL. I've tried
Request.Url.Host.Split('.')
but it doesn't work with url like this:
sub.sub.domain.com
or
www.domain.co.uk
since you can have a variable number of dots before and after the domain.
I need to get only "domain".
Check out the second answer at Get just the domain name from a URL?
I checked the pastebin link; it's active. I didn't test the code myself, but if it outputs as he describes, you can .split() from there.
If you need to be totally flexible, you need to make a list of all possible top-level domains and try to remove each of them, together with its leading dot, from the end of your string, resulting in
www.domain
or
sub.sub.domain
Then take everything after the last dot.
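The approach described above could be sketched as follows. This is an illustration in Python rather than the asker's C#, and the suffix list is a deliberately tiny, hypothetical sample; a real list would need far more entries. The name domain_label is also made up.

```python
# Hypothetical sample of suffixes; not a complete list of top-level domains.
KNOWN_SUFFIXES = ("co.uk", "com", "org", "net")

def domain_label(host, suffixes=KNOWN_SUFFIXES):
    """Strip a known suffix, then return what follows the last remaining dot."""
    # Try longer suffixes first so a multi-part suffix such as "co.uk" is preferred.
    for suffix in sorted(suffixes, key=len, reverse=True):
        dotted = "." + suffix
        if host.endswith(dotted):
            remainder = host[: -len(dotted)]      # e.g. "sub.sub.domain"
            return remainder.rsplit(".", 1)[-1]   # text after the last dot
    return None  # no known suffix found

for host in ("sub.sub.domain.com", "www.domain.co.uk"):
    print(host, "->", domain_label(host))
```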
On my site I am tracking the URL /shop/ as goal by head match. As there are some URL parameters I cannot use exact match here.
Additionally, I am tracking a goal by exact match which is a URL to subfolder: /shop/process/paid.php
The problem is that GA tracks this subfolder with the head match as well, and thus saves the URL parameters that come along with paid.php, e.g. paid.php?email=customer@home.com
How can I prevent GA from tracking the URL parameters?
What would the setup look like?
Thanks!
That should work with a custom filter:
admin->profile->filters->custom filter->search and replace.
Search for
/shop/process/paid.php\?.*
(That's your URL followed by arbitrary query parameters. The "\" is an escape character, since "?" is also a special character in regular expressions. The dot means any character, and "*" means any number of the preceding character, which in this case is any character.) Replace it with the desired URL (/shop/process/paid.php).
There is probably a more elegant solution but like most people I'm not good at this regex stuff. This should work however.
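To illustrate what the search-and-replace does to a request URI, here is a minimal sketch with Python's re module. It only imitates the filter outside of GA; the function name strip_query is made up, and unlike the search string above it also escapes the dot in "paid.php" so that the dot matches literally.

```python
import re

# Same idea as the filter: the goal URL followed by "?" and anything after it.
QUERY_ON_PAID = re.compile(r"/shop/process/paid\.php\?.*")

def strip_query(uri):
    """Replace the goal URL plus its query string with the bare goal URL."""
    return QUERY_ON_PAID.sub("/shop/process/paid.php", uri)

print(strip_query("/shop/process/paid.php?email=customer@home.com"))
print(strip_query("/shop/?ref=1"))  # other URLs are left untouched
```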
Alternatives:
If those query parameters are not needed anywhere in the tracking data, you can exclude them completely in the profile settings.
You can create a profile for the subdirectory (include filter->request uri contains "/shop") and set only this profile to remove query parameters.
I am trying to rewrite any URL that matches this pattern:
~/Ahmed
~/Name
to this:
~/User/Ahmed/Ahmed.aspx
~/User/Name/Name.aspx
I can write them individually, but what I am trying to do is detect any URL that looks like "~/User/Ahmed/Ahmed" and automatically rewrite it to "Ahmed".
Thanks
Hopefully you're using the UrlRewritingNet library, not UrlRewriter? The former is suggested over the latter.
However, in either you can use a regex:
"~/User/([^/\\]+)/\1.aspx" -> "~/$1" //For ".aspx" in the URL
"~/([A-Za-z]+)" -> "~/User/$1/$1.aspx" //For /Name in the URL.
Note that ([^/\\]+) matches any run of characters containing no slashes or backslashes,
and "\1" is a backreference to the previous capture, which ensures the second name is an exact duplicate of the first. Note that you should enable "ignore case" if you want to support "/User/ahmed/Ahmed.aspx" and not just "/User/Ahmed/Ahmed.aspx".
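The backreference behaviour can be tried outside the rewriting library. Below is a rough sketch using Python's re module, not the UrlRewritingNet configuration itself; the function names shorten and expand are invented, the patterns are anchored, and the dot before "aspx" is escaped, which the rules above do not do.

```python
import re

# "\1" must repeat whatever the first group captured; IGNORECASE relaxes the comparison.
TO_SHORT = re.compile(r"^~/User/([^/\\]+)/\1\.aspx$", re.IGNORECASE)
TO_LONG = re.compile(r"^~/([A-Za-z]+)$")

def shorten(url):
    """Map ~/User/Name/Name.aspx to ~/Name; other URLs are returned unchanged."""
    return TO_SHORT.sub(r"~/\1", url)

def expand(url):
    """Map ~/Name to ~/User/Name/Name.aspx; other URLs are returned unchanged."""
    return TO_LONG.sub(r"~/User/\1/\1.aspx", url)

print(shorten("~/User/Ahmed/Ahmed.aspx"))
print(shorten("~/User/Ahmed/Name.aspx"))  # names differ, so nothing is rewritten
print(expand("~/Ahmed"))
```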