Return root domain from URL in R

Given website addresses, e.g.
http://www.example.com/page1/#
https://subdomain.example2.co.uk/asdf?retrieve=2
How do I return the root domain in R, e.g.
example.com
example2.co.uk
For my purposes I would define the root domain to have structure
example_name.public_suffix
where example_name excludes "www" and public_suffix is on the list here:
https://publicsuffix.org/list/effective_tld_names.dat
Is this still the best regex-based solution:
https://stackoverflow.com/a/8498629/2109289
What about something in R that parses the root domain based on the public suffix list, something like:
http://simonecarletti.com/code/publicsuffix/
Edited: Adding extra info based on Richard's comment
Using XML::parseURI seems to return everything between the first "//" and the next "/", e.g.
> parseURI("http://www.blog.omegahat.org:8080/RCurl/index.html")$server
[1] "www.blog.omegahat.org"
Thus, the question reduces to having an R function that can return the public suffix from the URI, or implementing the following algorithm on the public suffix list:
Algorithm
Match domain against all rules and take note of the matching ones.
If no rules match, the prevailing rule is "*".
If more than one rule matches, the prevailing rule is the one which is an exception rule.
If there is no matching exception rule, the prevailing rule is the one with the most labels.
If the prevailing rule is an exception rule, modify it by removing the leftmost label.
The public suffix is the set of labels from the domain which directly match the labels of the prevailing rule (joined by dots).
The registered or registrable domain is the public suffix plus one additional label.
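For illustration, here is a minimal R sketch of the matching algorithm above, run against a tiny hard-coded rule subset. This is only a sketch: a real implementation would load the full effective_tld_names.dat list and handle punycode, case normalization, etc.
# Illustrative subset of the public suffix list ("!" marks an exception rule)
rules <- c("com", "co.uk", "*.ck", "!www.ck")
registrable_domain <- function(host) {
  labels <- strsplit(host, ".", fixed = TRUE)[[1]]
  # Match host against all rules ("*" matches any single label)
  matches <- Filter(function(rule) {
    r <- strsplit(sub("^!", "", rule), ".", fixed = TRUE)[[1]]
    if (length(r) > length(labels)) return(FALSE)
    all(r == "*" | r == tail(labels, length(r)))
  }, rules)
  # If no rules match, the prevailing rule is "*"
  if (length(matches) == 0) matches <- "*"
  exception <- grepl("^!", matches)
  if (any(exception)) {
    # An exception rule prevails; modify it by removing the leftmost label
    prevailing <- sub("^[^.]+\\.", "", sub("^!", "", matches[exception][1]))
  } else {
    # Otherwise the matching rule with the most labels prevails
    prevailing <- matches[which.max(lengths(strsplit(matches, ".", fixed = TRUE)))]
  }
  # Registrable domain = public suffix plus one additional label
  n <- length(strsplit(prevailing, ".", fixed = TRUE)[[1]])
  paste(tail(labels, n + 1), collapse = ".")
}
registrable_domain("subdomain.example2.co.uk")  # "example2.co.uk"
registrable_domain("www.example.com")           # "example.com"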

There are two tasks here. The first is parsing the URL to get the host name, which can be done with the httr package's parse_url function:
library(httr)
host <- parse_url("https://subdomain.example2.co.uk/asdf?retrieve=2")$hostname
host
# [1] "subdomain.example2.co.uk"
The second is extracting the organizational domain (or root domain, top private domain--whatever you want to call it). This can be done using the tldextract package (which is inspired by the Python package of the same name and uses Mozilla's public suffix list):
library(tldextract)
domain.info <- tldextract(host)
domain.info
#                       host subdomain   domain   tld
# 1 subdomain.example2.co.uk subdomain example2 co.uk
tldextract returns a data frame, with a row for each domain you give it, but you can easily paste together the relevant parts:
paste(domain.info$domain, domain.info$tld, sep=".")
# [1] "example2.co.uk"

Something like this should help:
> strsplit(gsub("http://|https://|www\\.", "", "http://www.example.com/page1/#"), "/")[[c(1, 1)]]
[1] "example.com"
> strsplit(gsub("http://|https://|www\\.", "", "https://subdomain.example2.co.uk/asdf?retrieve=2"), "/")[[c(1, 1)]]
[1] "subdomain.example2.co.uk"

Related

Nginx suffix meaning

While reading the Nginx source code, I often came across variable names ending with _t (for example: ngx_http_request_t) and _s (ngx_http_request_s). Can anyone explain what t and s mean?
According to Nginx's official documentation (http://nginx.org/en/docs/dev/development_guide.html):
Type names end with the “_t” suffix.
_t means "type"
A structure that points to itself has a name ending with “_s”.
_s means "self"

Xampp Virtualhost

I am configuring a XAMPP Apache server to work with WordPress multisites and do not understand the following directive:
"VirtualDocumentRoot "C:/xampp/www/%-2/sub/%-3"
What is the purpose of %-2 and %-3?
Forgive the basic nature of my question but I can't seem to understand the mechanics of these two terms. Can anyone point me to where this notation might be explained?
Thanks in advance for any help or direction
Found the answer,
this is known as "Directory Name Interpolation"
Apache explains this here: https://httpd.apache.org/docs/2.4/mod/mod_vhost_alias.html
I've pasted an excerpt:
Directory Name Interpolation
All the directives in this module interpolate a string into a pathname. The interpolated string (henceforth called the "name") may be either the server name (see the UseCanonicalName directive for details on how this is determined) or the IP address of the virtual host on the server in dotted-quad format. The interpolation is controlled by specifiers inspired by printf which have a number of formats:
%%     insert a %
%p     insert the port number of the virtual host
%N.M   insert (part of) the name
N and M are used to specify substrings of the name. N selects from the
dot-separated components of the name, and M selects characters within
whatever N has selected. M is optional and defaults to zero if it
isn't present; the dot must be present if and only if M is present.
The interpretation is as follows:
0     the whole name
1     the first part
2     the second part
-1    the last part
-2    the penultimate part
2+    the second and all subsequent parts
-2+   the penultimate and all preceding parts
1+ and -1+   the same as 0
If N or M is greater than the number of parts available a single
underscore is interpolated.
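To make the notation concrete, here is a rough R simulation of how %N selects dot-separated parts of the server name (Apache does this internally; the R code and the interpolate_part name are purely illustrative):
interpolate_part <- function(name, n) {
  parts <- strsplit(name, ".", fixed = TRUE)[[1]]
  if (n < 0) n <- length(parts) + n + 1   # -1 is the last part, -2 the penultimate
  if (n < 1 || n > length(parts)) "_" else parts[n]  # "_" when out of range
}
host <- "www.example.com"
interpolate_part(host, -2)  # "example"
interpolate_part(host, -3)  # "www"
# So VirtualDocumentRoot "C:/xampp/www/%-2/sub/%-3" would resolve to
# C:/xampp/www/example/sub/www for this host.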

extracting main URL address

I have a list of URLs and I want to extract the main URL to see how many times each URL has been used. As you can imagine, there are many URLs with different notations. I wrote the following code to extract the main URL:
library(stringr)
library(rebus)
# Step 2: creating a pattern for URL extraction
pat<- "//" %R% capture(one_or_more(char_class(WRD,DOT)))
#step 3: Creating a new variable from URL column of df
#(it should be atomic vector)
URL_var<-df[["URLs"]]
#step 4: using rebus to extract main URL
URL_extract<-str_match(URL_var,pattern = pat)
#step 5: changing large vector to dataframe and changing column name:
URL_data<-data.frame(URL_extract[,2])
names(URL_data)[names(URL_data) == "URL_extract...2."] <- "Main_URL"
The result of this code is acceptable for most cases. For example, for //www.google.com it returns www.google.com, and for a website like http://image.google.com/steve it returns image.google.com. However, there are many cases where this code can't recognize the pattern and fails to find the URL. For example, for a URL such as http://my-listing.ca/CommercialDrive.html the code returns my, which is definitely not acceptable. And for a website like http://www.real-data.ca/clients/ur/ it only returns www.real. It seems my code has trouble handling the - character.
Do you have any suggestions on how to improve this code? Or are there any packages that can help me extract URLs faster and better?
Thanks
I think you can simply use
library(stringr)
URL_var <- df[["URLs"]]
URL_data <- data.frame(Main_URL = str_extract(URL_var, "(?<=//)[^\\s/:]+"))
Here, the stringr::str_extract method searches for the first match in the input and returns the substring found. Unlike stringr::str_match, it cannot return submatches, so a lookbehind, (?<=...), is used in the regex pattern:
(?<=//)[^\s/:]+
It means:
(?<=//) - matches a location in the string that is immediately preceded by //
[^\s/:]+ - one or more (+) occurrences of any char except whitespace, / and :. The colon makes sure the port number is not included in the match, / makes sure the match stops before the first /, and \s (whitespace) makes sure the match stops before the first whitespace.
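As a quick check, the pattern also handles the hyphenated hosts from the question (same stringr setup as above):
library(stringr)
urls <- c("//www.google.com", "http://image.google.com/steve",
          "http://my-listing.ca/CommercialDrive.html",
          "http://www.real-data.ca/clients/ur/")
str_extract(urls, "(?<=//)[^\\s/:]+")
# [1] "www.google.com"   "image.google.com" "my-listing.ca"    "www.real-data.ca"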

Does R have any package for parsing out the parts of a URL?

I have a list of urls that I would like to parse and normalize.
I'd like to be able to split each address into parts so that I can identify "www.google.com/test/index.asp" and "google.com/somethingelse" as being from the same website.
Since parse_url() uses regular expressions anyway, we may as well reinvent the wheel and create a single regular expression replacement in order to build a sweet and fancy gsub call.
Let's see. A URL consists of a protocol, a "netloc" which may include username, password, hostname and port components, and a remainder which we happily strip away. Let's assume first there's no username nor password nor port.
^(?:(?:[[:alpha:]+.-]+)://)? will match the protocol header (copied from parse_url()), we are stripping this away if we find it
Also, a potential www. prefix is stripped away, but not captured: (?:www\\.)?
Anything up to the subsequent slash will be our fully qualified host name, which we capture: ([^/]+)
The rest we ignore: .*$
Now we plug together the regexes above, and the extraction of the hostname becomes:
PROTOCOL_REGEX <- "^(?:(?:[[:alpha:]+.-]+)://)?"
PREFIX_REGEX <- "(?:www\\.)?"
HOSTNAME_REGEX <- "([^/]+)"
REST_REGEX <- ".*$"
URL_REGEX <- paste0(PROTOCOL_REGEX, PREFIX_REGEX, HOSTNAME_REGEX, REST_REGEX)
domain.name <- function(urls) gsub(URL_REGEX, "\\1", urls)
Change host name regex to include (but not capture) the port:
HOSTNAME_REGEX <- "([^:/]+)(?::[0-9]+)?"
And so forth and so on, until we finally arrive at an RFC-compliant regular expression for parsing URLs. However, for home use, the above should suffice:
> domain.name(c("test.server.com/test", "www.google.com/test/index.asp",
"http://test.com/?ex"))
[1] "test.server.com" "google.com" "test.com"
You can use the parse_url() function from the R package httr:
parse_url(url)
> parse_url("http://google.com/")
You can get more details here:
http://cran.r-project.org/web/packages/httr/httr.pdf
There's also the urltools package, now, which is infinitely faster:
urltools::url_parse(c("www.google.com/test/index.asp",
                      "google.com/somethingelse"))
##   scheme         domain port           path parameter fragment
## 1        www.google.com      test/index.asp
## 2            google.com       somethingelse
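Since the goal was to treat www.google.com/test/index.asp and google.com/somethingelse as the same website, one option (a sketch, assuming a plain www. prefix is the only normalization needed) is to strip it from the parsed domain:
library(urltools)
urls <- c("www.google.com/test/index.asp", "google.com/somethingelse")
# Drop a leading "www." so both rows collapse to the same site
sub("^www\\.", "", url_parse(urls)$domain)
# [1] "google.com" "google.com"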
I'd forgo a package and use regex for this.
EDIT reformulated after the robot attack from Dason...
x <- c("talkstats.com", "www.google.com/test/index.asp",
"google.com/somethingelse", "www.stackoverflow.com",
"http://www.bing.com/search?q=google.com&go=&qs=n&form=QBLH&pq=google.com&sc=8-1??0&sp=-1&sk=")
parser <- function(x) gsub("www\\.", "", sapply(strsplit(gsub("http://", "", x), "/"), "[[", 1))
parser(x)
lst <- lapply(unique(parser(x)), function(var) x[parser(x) %in% var])
names(lst) <- unique(parser(x))
lst
## $talkstats.com
## [1] "talkstats.com"
##
## $google.com
## [1] "www.google.com/test/index.asp" "google.com/somethingelse"
##
## $stackoverflow.com
## [1] "www.stackoverflow.com"
##
## $bing.com
## [1] "http://www.bing.com/search?q=google.com&go=&qs=n&form=QBLH&pq=google.com&sc=8-1??0&sp=-1&sk="
This may need to be extended depending on the structure of the data.
Building upon R_Newbie's answer, here's a function that will extract the server name from a (vector of) URLs, stripping away a www. prefix if it exists, and gracefully ignoring a missing protocol prefix.
domain.name <- function(urls) {
  require(httr)
  require(plyr)
  paths <- laply(urls, function(u) with(parse_url(u),
                                        paste0(hostname, "/", path)))
  gsub("^/?(?:www\\.)?([^/]+).*$", "\\1", paths)
}
The parse_url function is used to extract the path argument, which is further processed by gsub. The /? and (?:www\\.)? parts of the regular expression match an optional leading slash followed by an optional www., and the [^/]+ matches everything after that but before the first slash. This is captured and used as the replacement text in the gsub call.
> domain.name(c("test.server.com/test", "www.google.com/test/index.asp",
"http://test.com/?ex"))
[1] "test.server.com" "google.com" "test.com"
If you like tldextract, one option would be to use the version on App Engine:
require(RJSONIO)
test <- c("test.server.com/test", "www.google.com/test/index.asp", "http://test.com/?ex")
lapply(paste0("http://tldextract.appspot.com/api/extract?url=", test), fromJSON)
[[1]]
  domain subdomain   tld
"server"    "test" "com"

[[2]]
  domain subdomain   tld
"google"     "www" "com"

[[3]]
domain subdomain   tld
"test"        "" "com"

How to use a rewrite map to replace a single querystring value with IIS 7 URL Rewrite

We have incoming URL's with querystring's like this:
?query=word&param=value
where we would like to use a rewrite map to build a table of substitution values for "word".
For example:
word = newword
such that the new querystring would be:
?query=newword&param=value
"word" may be any urlencoded value, including %'s
I'm having trouble with the regex match and substitution - I seem to get it to match, but then the substituted value doesn't get passed through.
My current rule looks like:
Match URL: Matches the Pattern: .*
Conditions: match all:
Condition 1: {QUERY_STRING} matches "query=(.+)\&+?(.+)$"
Condition 2: {rewritemap:{C:1}} matches the pattern (.+)
track capture across groups.
Action: rewrite:
rewrite url: ?query={C:1}&param=value
(I've hard coded the &param=value because it isn't changing... having that just pass through from the input would be ideal, I was just being lazy)
So right now, using failed request tracing, I can see it match and seemingly replace with the mapped value, but the URL that is output still has the original value.
