I am configuring a XAMPP Apache server to work with WordPress multisite and do not understand the following directive:
VirtualDocumentRoot "C:/xampp/www/%-2/sub/%-3"
What is the purpose of %-2 and %-3?
Forgive the basic nature of my question but I can't seem to understand the mechanics of these two terms. Can anyone point me to where this notation might be explained?
Thanks in advance for any help or direction
Found the answer,
this is known as "Directory Name Interpolation"
Apache explains this here: https://httpd.apache.org/docs/2.4/mod/mod_vhost_alias.html
I've pasted an excerpt:
Directory Name Interpolation
All the directives in this module interpolate a string into a
pathname. The interpolated string (henceforth called the "name") may
be either the server name (see the UseCanonicalName directive for
details on how this is determined) or the IP address of the virtual
host on the server in dotted-quad format. The interpolation is
controlled by specifiers inspired by printf which have a number of
formats:
%% insert a %
%p insert the port number of the virtual host
%N.M insert (part of) the name
N and M are used to specify substrings of the name. N selects from the
dot-separated components of the name, and M selects characters within
whatever N has selected. M is optional and defaults to zero if it
isn't present; the dot must be present if and only if M is present.
The interpretation is as follows:
0 the whole name
1 the first part
2 the second part
-1 the last part
-2 the penultimate part
2+ the second and all subsequent parts
-2+ the penultimate and all preceding parts
1+ and -1+ the same as 0
If N or M is greater than the number of parts available a single
underscore is interpolated.
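Applied to the directive in the question, %-2 is the penultimate dot-separated part of the host name and %-3 is the part before that. The following is a simplified Python sketch of that substitution, not Apache's implementation: it handles %%, %p and %N (with an optional +) but ignores the .M character selector, and the host name blog.example.com is made up for illustration.

```python
import re


def interpolate(template: str, name: str, port: int = 80) -> str:
    """Expand %%, %p and %N / %N+ specifiers in template using name."""
    parts = name.split(".")

    def select(n: int, plus: bool) -> str:
        if n == 0:
            return name                      # 0 means the whole name
        if abs(n) > len(parts):
            return "_"                       # not enough parts available
        if n > 0:
            chosen = parts[n - 1:] if plus else parts[n - 1:n]
        else:
            idx = len(parts) + n             # zero-based index counted from the end
            chosen = parts[:idx + 1] if plus else parts[idx:idx + 1]
        return ".".join(chosen)

    def replace(m: re.Match) -> str:
        token = m.group(1)
        if token == "%":
            return "%"
        if token == "p":
            return str(port)
        return select(int(token.rstrip("+")), token.endswith("+"))

    return re.sub(r"%(%|p|-?\d+\+?)", replace, template)


print(interpolate("C:/xampp/www/%-2/sub/%-3", "blog.example.com"))
# C:/xampp/www/example/sub/blog
```

With a two-part name such as example.com there is no third-from-last part, so %-3 becomes an underscore in this sketch, matching the rule quoted above.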
I have a script that was very kindly provided for me a while ago which allowed me to generate input files by inserting coordinates from a series of .xyz files into a template file (Create new files by copying contents of coordinate files into template file).
I'm trying to adapt that script to do something very similar, but different in a very slight, but annoying way. In the script, the new directories created to house these new files are named like this:
# File name is in the form '....Hnnn.xyz';
# this will parse nnn from that name.
local inputNumber=$coordFile
# Remove '.xyz'.
inputNumber=${inputNumber%.xyz}
# Remove everything up to and including the 'H'.
inputNumber=${inputNumber##*H}
# Subdirectory name is based on the input number.
local outDir=$baseDir/D$inputNumber
# Create the directory if it doesn't exist.
if [[ ! -d $outDir ]]; then
mkdir $outDir
fi
This worked for my last problem, because the files were all named in the form xxxx_DH000.xyz. However, now the files I have are named using the form xxxx.000.xyz. While everything else in the script works, I cannot figure out how to name the new directories in the form 000.
The line in the script which I think needs to be edited slightly is where it says inputNumber=${inputNumber##*H}. What I cannot figure out is how to get the script to delete everything up to but not including a 0. I've searched online, but the only questions/answers I've found about renaming files by stripping part of the original names deal with deleting everything 'up to and including' a string.
I was able to generate directories named 1, 2, 3, etc. with inputNumber=${inputNumber##*0}, however I want all three digits present (i.e. I would like to create directories 001, 002, 003, etc.).
As an aside, I cannot use the . as the cutoff point, as there are multiple .s in each file name. An example of one of the file names is tma.h2s-2-pes-b97m-d4-tz.011.xyz.
Is there some way to get the script to simply name the files based on the full three digit number?
Although it's not needed in this case, zsh does support deleting text just before a matched pattern in a string. These parameter expansions will remove everything prior to the first 0 in the string, but keep the 0:
inputNumber='tma.h2s-2-pes-b97m-d4-tz.011.xyz'
inputNumber=${inputNumber:r} # remove '.xyz'
inputNumber=${(SM)inputNumber##0*}
print ${inputNumber}
# ==> 011
This includes a few zsh-isms:
${...:r} returns the 'root' of a filename, removing the extension.
(S) - parameter expansion flag to change the behavior of the ## expansion. It will now search for patterns in the middle of a string, not just at the beginning.
(M) - flag to include the pattern match (the 0*) in the result.
This depends on the number always starting with 0, which may not be a good choice - what file comes after 099?
This next version uses a zsh extended glob pattern to find a number between two periods, and returns that number - i.e. it will find the number in .11., .011., or .2345., but not in .x11.:
coordFile='tma.h2s-2-pes-b97m-d4-tz.022.xyz'
inputNumber=${(*)coordFile//(#b)*.(<->).*/${match}}
print ${inputNumber}
# ==> 022
Some of the pieces:
${...//.../...} - substitution expansion.
(*) - enables extendedglob for this expansion.
(#b) - globbing flag to enable 'backreferences', so that $match will work.
<-> - matches a number. This can be restricted to a range if needed, like <100-199>.
(<->) - puts the number into a match group.
*. and .* - everything before and after the number; these are not in the match group.
${match} - the matched string from the parenthesized part of the pattern. This is used as the replacement for the entire string, so we get just the number. If more than one part of the input string matches the pattern, this will be the last one. match is actually an array, but since there's only one match group in the pattern, it does not need to be indexed with ${match[1]}.
This variant uses a standard regular expression to find the number:
coordFile='tma.h2s-2-pes-b97m-d4-tz.033.xyz'
match=
[[ $coordFile =~ .*\\.([[:digit:]]+)\\..* ]]
inputNumber=${match[1]}
print ${inputNumber}
# ==> 033
After the [[ ]] test, the match array will contain matches from any parenthesized groups in the regular expression - here, that will be a set of one or more digits in between two periods / full stops.
But, as @choroba and @Fravadona have noted, since the number will always be at the end of the string, you can use the standard #/##/%/%% expansions to remove parts of the string based only on the .s. This is a common idiom that will be familiar to many shell programmers, and will also work in bash (note that other parts of your original script depend on zsh).
inputNumber='tma.h2s-2-pes-b97m-d4-tz.044.xyz'
inputNumber=${inputNumber%.xyz}
inputNumber=${inputNumber##*.}
print ${inputNumber}
# ==> 044
In zsh everything can be consolidated into a single nested substitution:
baseDir='files/are/here'
coordFile='tma.h2s-2-pes-b97m-d4-tz.055.xyz'
local outDir=$baseDir/D${${coordFile:r}##*.}
print $outDir
# ==> files/are/here/D055
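For comparison, the same "strip the extension, then keep what follows the last period" idea can be sketched outside the shell. This is only an illustration in Python, not part of the original zsh script; the function names input_number and out_dir are made up here.

```python
from pathlib import PurePosixPath


def input_number(coord_file: str) -> str:
    """Return the text between the last two periods, e.g. '055'."""
    stem = PurePosixPath(coord_file).stem   # drops the final '.xyz'
    return stem.rsplit(".", 1)[-1]          # keeps what follows the last remaining '.'


def out_dir(base_dir: str, coord_file: str) -> str:
    """Build the output directory name in the form baseDir/Dnnn."""
    return f"{base_dir}/D{input_number(coord_file)}"


print(out_dir("files/are/here", "tma.h2s-2-pes-b97m-d4-tz.055.xyz"))
# files/are/here/D055
```

Because the number is kept as a string throughout, leading zeros survive.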
I have a list of URLs and I want to extract the main URL to see how many times each URL has been used. As you can imagine, there are many URLs with different notations. I wrote the following code to extract the main URL:
library(stringr)
library(rebus)
# Step 2: creating a pattern for URL extraction
pat<- "//" %R% capture(one_or_more(char_class(WRD,DOT)))
#step 3: Creating a new variable from URL column of df
#(it should be atomic vector)
URL_var<-df[["URLs"]]
#step 4: using rebus to extract main URL
URL_extract<-str_match(URL_var,pattern = pat)
#step 5: changing large vector to dataframe and changing column name:
URL_data<-data.frame(URL_extract[,2])
names(URL_data)[names(URL_data) == "URL_extract...2."] <- "Main_URL"
The result of this code is acceptable for most cases. For example, for //www.google.com it returns www.google.com, and for a website like http://image.google.com/steve it returns image.google.com. However, there are many cases where this code fails to recognize the pattern. For example, for a URL such as http://my-listing.ca/CommercialDrive.html the code returns my, which is definitely not acceptable. Likewise, for a website like http://www.real-data.ca/clients/ur/ it only returns www.real. It seems that my code has difficulty handling the - character.
Do you have any suggestions on how to improve this code? Or are there any packages that would help me extract URLs faster and better?
Thanks
I think you can simply use
library(stringr)
URL_var<-df[["URLs"]]
URL_data<-data.frame(str_extract(URL_var, "(?<=//)[^\\s/:]+"))
names(URL_data) <- "Main_URL"
Here, stringr::str_extract method searches for the first match in the input, and fetches the substring found. Unlike stringr::str_match, it cannot return submatches, so a lookbehind is used in the regex pattern, (?<=...):
(?<=//)[^\s/:]+
It means:
(?<=//) - match a location in the string that is immediately preceded by the // string
[^\s/:]+ - one or more (+) occurrences of any char but whitespace, / and :. The colon makes sure the port number is not included in the match, / makes sure the match stops before the first /, and \s (whitespace) makes sure the match stops before the first whitespace. In an R string literal the backslash must be doubled, which is why the code above writes \\s.
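The same lookbehind pattern can be tried on the URLs from the question. This is a minimal sketch in Python rather than R, using only the standard re module; the function name main_url and the port example are made up for illustration.

```python
import re

# Same pattern as above: text after '//' up to the first whitespace, '/' or ':'.
HOST = re.compile(r"(?<=//)[^\s/:]+")


def main_url(url: str):
    """Return the host part of url, or None when there is no '//'."""
    m = HOST.search(url)
    return m.group(0) if m else None


for url in [
    "http://my-listing.ca/CommercialDrive.html",
    "http://www.real-data.ca/clients/ur/",
    "http://image.google.com:8080/steve",
]:
    print(main_url(url))
# my-listing.ca
# www.real-data.ca
# image.google.com
```

Because the character class is defined by what it excludes, hyphens in host names need no special handling.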
I'm trying to extract UK postcodes from address strings in R, using the regular expression provided by the UK government here.
Here is my function:
address_to_postcode <- function(addresses) {
# 1. Convert addresses to upper case
addresses = toupper(addresses)
# 2. Regular expression for UK postcodes:
pcd_regex = "([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) {0,1}[0-9][A-Za-z]{2})"
# 3. Check if a postcode is present in each address or not (return TRUE if present, else FALSE)
present <- grepl(pcd_regex, addresses)
# 4. Extract postcodes matching the regular expression for a valid UK postcode
postcodes <- regmatches(addresses, regexpr(pcd_regex, addresses))
# 5. Return NA where an address does not contain a (valid format) UK postcode
postcodes_out <- list()
postcodes_out[present] <- postcodes
postcodes_out[!present] <- NA
# 6. Return the results in a vector (should be same length as input vector)
return(do.call(c, postcodes_out))
}
According to the guidance document, the logic this regular expression looks for is as follows:
"GIR 0AA" OR One letter followed by either one or two numbers OR One letter followed by a second letter that must be one of
ABCDEFGHJKLMNOPQRSTUVWXY (i.e..not I) and then followed by either one
or two numbers OR One letter followed by one number and then another
letter OR A two part post code where the first part must be One letter
followed by a second letter that must be one of
ABCDEFGHJKLMNOPQRSTUVWXY (i.e..not I) and then followed by one number and
optionally a further letter after that AND The second part (separated
by a space from the first part) must be One number followed by two
letters. A combination of upper and lower case characters is allowed.
Note: the length is determined by the regular expression and is
between 2 and 8 characters.
My problem is that this logic is not completely preserved when using the regular expression without the ^ and $ anchors (as I have to do in this scenario because the postcode could be anywhere within the address strings); what I'm struggling with is how to preserve the order and number of characters for each segment in a partial (as opposed to complete) string match.
Consider the following example:
> address_to_postcode("1A noplace road, random city, NR1 2PK, UK")
[1] "NR1 2PK"
According to the logic in the guideline, the second letter in the postcode cannot be 'z' (and there are some other exclusions too); however look what happens when I add a 'z':
> address_to_postcode("1A noplace road, random city, NZ1 2PK, UK")
[1] "Z1 2PK"
... whereas in this case I would expect the output to be NA.
Adding the anchors (for a different use case) doesn't seem to help, as the 'z' is still accepted even though it is in the wrong place:
> grepl("^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) {0,1}[0-9][A-Za-z]{2})$", "NZ1 2PK")
[1] TRUE
Two questions:
Have I misunderstood the logic of the regular expression and
If not, how can I correct it (i.e. why aren't the specified letter
and character ranges exclusive to their position within the regular expression)?
Edit
Since posting this answer, I dug deeper into the UK government's regex and found even more problems. I posted another answer here that describes all the issues and provides alternatives to their poorly formatted regex.
Note
Please note that I'm posting the raw regex here. You'll need to escape certain characters (like backslashes \) when porting to R.
Issues
You have many issues here, all of which are caused by whoever created the document you're retrieving your regex from or the coder that created it.
1. The space character
My guess is that when you copied the regular expression from the link you provided it converted the space character into a newline character and you removed it (that's exactly what I did at first). You need to, instead, change it to a space character.
^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([AZa-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2})$
(the space belongs immediately before the final [0-9][A-Za-z]{2})
2. Boundaries
You need to remove the anchors ^ and $ as these indicate start and end of line. Instead, wrap your regex in (?:) and place a \b (word boundary) on either end as the following shows. In fact, the regex in the documentation is incorrect (see Side note for more information) as it will fail to anchor the pattern properly.
See regex in use here
\b(?:([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([AZa-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2}))\b
(added: \b(?: at the start and )\b at the end)
3. Character class oversight
There's a missing - in the character class as pointed out by @deadcrab in his answer here.
\b(?:([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2}))\b
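(the - is restored in the first character class of ([A-Za-z][0-9][A-Za-z]), which the documentation writes as [AZa-z])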
4. They made the wrong character class optional!
In the documentation it clearly states:
A two part post code where the first part must be:
One letter followed by a second letter that must be one of ABCDEFGHJKLMNOPQRSTUVWXY (i.e..not I) and then followed by one number and optionally a further letter after that
They made the wrong character class optional!
\b(?:([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2}))\b
(the class made optional is [0-9]? in the last alternative; it should be the [A-Za-z] that follows it)
5. The whole thing is just awful...
There are so many things wrong with this regex that I just decided to rewrite it. It can very easily be simplified to perform a fraction of the steps it currently takes to match text.
\b(?:[A-Za-z][A-HJ-Ya-hj-y]?[0-9][0-9A-Za-z]? [0-9][A-Za-z]{2}|[Gg][Ii][Rr] 0[Aa]{2})\b
Answer
As mentioned in the comments below my answer, some postcodes are missing the space character. For missing spaces in the postcodes (e.g. NR12PK), simply add a ? after the spaces as shown in the regex below:
\b(?:[A-Za-z][A-HJ-Ya-hj-y]?[0-9][0-9A-Za-z]? ?[0-9][A-Za-z]{2}|[Gg][Ii][Rr] ?0[Aa]{2})\b
(added: a ? after each of the two spaces)
You may also shorten the regex above to the following and use the case-insensitive flag (ignore.case(pattern) or ignore_case = TRUE in R, depending on the method used):
\b(?:[A-Z][A-HJ-Y]?[0-9][0-9A-Z]? ?[0-9][A-Z]{2}|GIR ?0A{2})\b
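To see how the word boundaries and the case-insensitive flag behave on the addresses from the question, here is a minimal sketch in Python rather than R. The function name extract_postcode is made up for illustration, and re.IGNORECASE stands in for R's case-insensitive option.

```python
import re

# Shortened pattern from above, compiled case-insensitively.
POSTCODE = re.compile(
    r"\b(?:[A-Z][A-HJ-Y]?[0-9][0-9A-Z]? ?[0-9][A-Z]{2}|GIR ?0A{2})\b",
    re.IGNORECASE,
)


def extract_postcode(address: str):
    """Return the first postcode-shaped substring, or None if there is none."""
    m = POSTCODE.search(address)
    return m.group(0) if m else None


print(extract_postcode("1A noplace road, random city, NR1 2PK, UK"))  # NR1 2PK
print(extract_postcode("1A noplace road, random city, NZ1 2PK, UK"))  # None
print(extract_postcode("1A noplace road, random city, NR12PK, UK"))   # NR12PK
```

The second call returns None because the leading \b prevents the match from starting at the Z in the middle of NZ1, which is what produced "Z1 2PK" in the question.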
Note
Please note that regular expressions only validate the possible format(s) of a string and cannot actually identify whether or not a postcode legitimately exists. For this, you should use an API. There are also some edge-cases where this regex will not properly match valid postcodes. For a list of these postcodes, please see this Wikipedia article.
The regex below additionally matches the following (make it case-insensitive to match lowercase variants as well):
British Overseas Territories
British Forces Post Office
Although they've recently changed it to align with the British postcode system to BF, followed by a number (starting with BF1), they're considered optional alternative postcodes
Special cases outlined in that article (as well as SAN TA1 - a valid postcode for Santa!)
See this regex in use here.
\b(?:(?:[A-Z][A-HJ-Y]?[0-9][0-9A-Z]?|ASCN|STHL|TDCU|BBND|[BFS]IQ{2}|GX11|PCRN|TKCA) ?[0-9][A-Z]{2}|GIR ?0A{2}|SAN ?TA1|AI-?[0-9]{4}|BFPO[ -]?[0-9]{2,3}|MSR[ -]?1(?:1[12]|[23][135])0|VG[ -]?11[1-6]0|[A-Z]{2} ? [0-9]{2}|KY[1-3][ -]?[0-2][0-9]{3})\b
I would also recommend anyone implementing this answer to read this StackOverflow question titled UK Postcode Regex (Comprehensive).
Side note
The documentation you linked to (Bulk Data Transfer: Additional Validation for CAS Upload - Section 3. UK Postcode Regular Expression) actually has an improperly written regular expression.
As mentioned in the Issues section, they should have:
Wrapped the entire expression in (?:) and placed the anchors around the non-capturing group. Their regular expression, as it stands, will fail in some cases as seen here.
The regular expression is also missing - in one of the character classes
It also made the wrong character class optional.
Here is my regular expression:
import re
txt = "0288, Bishopsgate, London Borough of Tower Hamlets, London, Greater London, England, EC2M 4QP, United Kingdom"
matches = re.findall(r'[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][ABD-HJLNP-UW-Z]{2}', txt)
print(matches)  # ['EC2M 4QP']
Given website addresses, e.g.
http://www.example.com/page1/#
https://subdomain.example2.co.uk/asdf?retrieve=2
How do I return the root domain in R, e.g.
example.com
example2.co.uk
For my purposes I would define the root domain to have structure
example_name.public_suffix
where example_name excludes "www" and public_suffix is on the list here:
https://publicsuffix.org/list/effective_tld_names.dat
Is this still the best regex based solution:
https://stackoverflow.com/a/8498629/2109289
What about something in R that parses root domain based off the public suffix list, something like:
http://simonecarletti.com/code/publicsuffix/
Edited: Adding extra info based on Richard's comment
Using XML::parseURI seems to return the stuff between the first "//" and "/". e.g.
> parseURI("http://www.blog.omegahat.org:8080/RCurl/index.html")$server
[1] "www.blog.omegahat.org"
Thus, the question reduces to having an R function that can return the public suffix from the URI, or implementing the following algorithm on the public suffix list:
Algorithm
1. Match domain against all rules and take note of the matching ones.
2. If no rules match, the prevailing rule is "*".
3. If more than one rule matches, the prevailing rule is the one which is an exception rule.
4. If there is no matching exception rule, the prevailing rule is the one with the most labels.
5. If the prevailing rule is an exception rule, modify it by removing the leftmost label.
6. The public suffix is the set of labels from the domain which directly match the labels of the prevailing rule (joined by dots).
7. The registered or registrable domain is the public suffix plus one additional label.
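The algorithm above can be sketched in a few lines. This is a hedged illustration in Python rather than R, and the RULES list is a tiny hand-picked sample, not the real public suffix list; the function names are made up for this example.

```python
# Tiny illustrative sample of rules; the real list is far longer.
RULES = ["com", "uk", "co.uk", "*.ck", "!www.ck"]


def rule_matches(rule_labels, domain_labels):
    """Compare labels right to left; '*' matches any single label."""
    if len(rule_labels) > len(domain_labels):
        return False
    pairs = zip(reversed(rule_labels), reversed(domain_labels))
    return all(r == "*" or r == d for r, d in pairs)


def registrable_domain(domain, rules=RULES):
    labels = domain.lower().split(".")
    matching = []
    for rule in rules:
        is_exception = rule.startswith("!")
        rule_labels = (rule[1:] if is_exception else rule).split(".")
        if rule_matches(rule_labels, labels):
            matching.append((is_exception, rule_labels))

    exceptions = [r for exc, r in matching if exc]
    if exceptions:
        prevailing = exceptions[0][1:]            # drop the leftmost label
    elif matching:
        prevailing = max((r for _, r in matching), key=len)
    else:
        prevailing = ["*"]                        # default rule

    n = len(prevailing)
    if len(labels) <= n:
        return None                               # domain is itself a public suffix
    return ".".join(labels[-(n + 1):])            # public suffix plus one label


print(registrable_domain("subdomain.example2.co.uk"))  # example2.co.uk
print(registrable_domain("www.example.com"))           # example.com
```

Stripping a leading "www" is unnecessary with this approach, since any labels to the left of the registrable domain are discarded.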
There are two tasks here. The first is parsing the URL to get the host name, which can be done with the httr package's parse_url function:
library(httr)
host <- parse_url("https://subdomain.example2.co.uk/asdf?retrieve=2")$hostname
host
# [1] "subdomain.example2.co.uk"
The second is extracting the organizational domain (or root domain, top private domain, whatever you want to call it). This can be done using the tldextract package (which is inspired by the Python package of the same name and uses Mozilla's public suffix list):
library(tldextract)
domain.info <- tldextract(host)
domain.info
# host subdomain domain tld
# 1 subdomain.example2.co.uk subdomain example2 co.uk
tldextract returns a data frame, with a row for each domain you give it, but you can easily paste together the relevant parts:
paste(domain.info$domain, domain.info$tld, sep=".")
# [1] "example2.co.uk"
Something like this should help:
> strsplit(gsub("http://|https://|www\\.", "", "http://www.example.com/page1/#"), "/")[[c(1, 1)]]
[1] "example.com"
> strsplit(gsub("http://|https://|www\\.", "", "https://subdomain.example2.co.uk/asdf?retrieve=2"), "/")[[c(1, 1)]]
[1] "subdomain.example2.co.uk"
I am currently investigating the most appropriate dictionary to use in an application I am building.
Inspecting the dictionaries which are bundled with Sublime Text 2, the file format is as you would expect: a list of alphabetically ordered words. However, a lot of those words have additional information appended to them. Take this snippet as an example:
abaft
abbreviation/M
abdicate/DNGSn
Abelard/M
abider/M
Abidjan
ablaze
abloom
aboveground
abrader/M
Abram/M
abreaction/MS
abrogator/MS
abscond/DRSG
absinthe/MS
absoluteness/S
absorbency/SM
abstract/ShTVDPiGY
absurdness/S
A fruitless Google search has not shed any light on what the letters after the slash (/) mean.
Maybe they hint at the sex of the word, but that is only a guess and I'd prefer to read a formal explanation of their meaning.
Has anybody come across these?
The letters following the slash are called affixes. These encodings can be prefixes or suffixes that may be applied to the root word.
See this blog post for a nice explanation and examples of what these affixes can be used for.
Another place to look is the aspell manual.
TLDR: each letter in the .dic file following the slash is a name of a rule in the .aff file.
https://superuser.com/a/633869/367530
Each rule is in the .aff file for that language. The rules come in two
flavors: SFX for suffixes, and PFX for prefixes. Each line begins with
PFX/SFX and then the rule letter identifier (the ones that follow the
word in the dictionary file):
PFX [rule_letter_identifier] [combineable_flag] [number_of_rule_lines_that_follow]
You can normally ignore the combinable flag, it is Y or N depending on
whether it can be combined with other rules. Then there are some
number of lines (indicated by [number_of_rule_lines_that_follow]) that
list different possibilities
for how this rule applies in different situations. It looks like this:
PFX [rule_letter_identifier] [number_of_letters_to_delete] [what_to_add] [when_to_add_it]
For example:
SFX B Y 3
SFX B 0 able [^aeiou]
SFX B 0 able ee
SFX B e able [^aeiou]e
If B is one of the letters following a word, i.e. someword/B, then this is one of the
rules that can apply. There are three possibilities that can happen
(because there are three lines). Only one will apply:
able is added to the end when the end of the word is not (indicated by ^) one of the letters in the set (indicated by [ ]) of letters a, e, i, o, and u. For example, question → questionable
able is added to the end when the end of the word is ee. For example, agree → agreeable.
able is added to the end when the end of the word is not a vowel ([^aeiou]) followed by an e. The letter e is stripped (the column before able). For example, excite → excitable.
PFX rules are the same, but apply at the beginning of the word instead
for prefixes.
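The three SFX B lines above can be applied mechanically. The following is a minimal Python sketch of that idea, not how any particular spell checker implements it: each rule is a (strip, add, condition) tuple copied from the example, and the function name apply_suffix is made up for illustration.

```python
import re

# (letters_to_strip, what_to_add, condition_on_word_ending), from the SFX B example.
RULES_B = [
    ("0", "able", "[^aeiou]"),
    ("0", "able", "ee"),
    ("e", "able", "[^aeiou]e"),
]


def apply_suffix(word, rules=RULES_B):
    """Apply the first rule whose condition matches the end of word."""
    for strip, add, condition in rules:
        if re.search(f"(?:{condition})$", word):
            # "0" means nothing is stripped before the suffix is added.
            stem = word if strip == "0" else word[: len(word) - len(strip)]
            return stem + add
    return None  # no rule applies to this word


for word in ["question", "agree", "excite"]:
    print(word, "->", apply_suffix(word))
# question -> questionable
# agree -> agreeable
# excite -> excitable
```

For these three conditions at most one can match a given word, so taking the first match is enough here; other affix files may need a different strategy.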