Curly braces in robots.txt file - web-scraping

I have been working on web scraping and encountered the following patterns in a robots.txt file.
Disallow: /*{{url}}*
Disallow: /*{{imageURL}}*
Do they mean that I am not allowed to scrape any URL?

This looks like the site author made an error, as {{url}} and {{imageURL}} were probably intended to be variables that should be replaced with the actual values.
When interpreting this record according to the original robots.txt specification, all characters have to be interpreted literally, so URLs like these would be disallowed:
https://example.com/*{{url}}*
https://example.com/*{{url}}*.bar
https://example.com/*{{url}}*/
https://example.com/*{{url}}*/foo
As { and } are not allowed to appear in a URL path (see the list of allowed characters), this record would effectively mean that all URLs are allowed to be crawled. However, if you prefer, you could assume that it applies to the percent-encoded forms of { and } (%7B and %7D), but that's not something the spec requires.
When interpreting this record based on popular extensions of the robots.txt spec (e.g., as used by Google Search), the * has a special meaning: each * in a Disallow value can be replaced with nothing or any sequence of characters. This would lead to many more disallowed URLs, but they would still have to contain literally {{url}} and {{imageURL}}.
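To see what the extended interpretation blocks in practice, here is a minimal sketch of Google-style * matching (a hypothetical helper, not Googlebot's actual code; it ignores the $ end-anchor that Google also supports):

import re

def is_disallowed(path, rule):
    # Anchor at the start of the path; each * matches any
    # (possibly empty) run of characters, everything else is literal.
    pattern = '.*'.join(re.escape(part) for part in rule.split('*'))
    return re.match(pattern, path) is not None

print(is_disallowed('/products/123', '/*{{url}}*'))  # False
print(is_disallowed('/x/{{url}}/y', '/*{{url}}*'))   # True

As the second call shows, only paths that literally contain {{url}} are disallowed under this reading.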

Related

Extract subdomain from url using regex

I've searched through all of the related topics here but none seems to answer my specific need. Here is the problem: Given a URL (sans protocol), I want to extract the subdomain portion, excluding www. The domain portion is always the same so I don't need to support all TLDs. Examples:
www.subdomain.domain.com should match subdomain
www.domain.com should match nothing
domain.com should match nothing
This is one of the many iterations I have tried:
[^(www\.)]\w+[^(\.domain\.com)]
Square brackets indicate a character class and strip most characters of their otherwise special meaning, so (www\.) inside them is not treated as a group.
You can try something like this instead:
((?:[^.](?<!www))+)\.domain\.com
regex101 demo
To return what you're looking for instead of retrieving it through submatches:
((?:[^.](?<!www))+)(?=\.domain\.com)
regex101 revised demo
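A quick check of the look-behind version against the three examples, using Python's re module just for illustration:

import re

pattern = re.compile(r'((?:[^.](?<!www))+)\.domain\.com')

for host in ['www.subdomain.domain.com', 'www.domain.com', 'domain.com']:
    m = pattern.search(host)
    print(host, '->', m.group(1) if m else 'no match')

# www.subdomain.domain.com -> subdomain
# www.domain.com -> no match
# domain.com -> no match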

Google Analytics Goals: Prevent tracking of URL parameters of subfolders

On my site I am tracking the URL /shop/ as a goal by head match. As there are some URL parameters, I cannot use exact match here.
Additionally, I am tracking a goal by exact match which is a URL to subfolder: /shop/process/paid.php
The problem is that GA tracks this subfolder with the head match as well, and thus saves the URL parameters that come along with paid.php, e.g. paid.php?email=customer#home.com
How can I prevent GA from tracking the URL parameters?
What would the setup look like?
Thanks!
That should work with a custom filter:
admin->profile->filters->custom filter->search and replace.
Search for
/shop/process/paid.php\?.*
(that's your URL with arbitrary query parameters; the "\" is an escape character, since "?" is also a control character in regular expressions, "." means any character, and "*" means any number of the preceding (in this case, any) character) and replace with the desired URL (/shop/process/paid.php).
There is probably a more elegant solution but like most people I'm not good at this regex stuff. This should work however.
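To see the effect of that search-and-replace, here is the same substitution simulated with Python's re module (note this sketch also escapes the dots in paid.php, which is slightly stricter than the filter as written):

import re

url = '/shop/process/paid.php?email=customer#home.com'
cleaned = re.sub(r'/shop/process/paid\.php\?.*', '/shop/process/paid.php', url)
print(cleaned)  # /shop/process/paid.php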
Alternatives:
If those query parameters are not needed anywhere in the tracking data, you can exclude them completely in the profile settings.
You can create a profile for the subdirectory based on the directory (include filter -> request URI contains "/shop") and set only this profile to remove query parameters.

Disallowing long url's in robots.txt using wildcards in between

I have a situation where I need to disallow crawling on specific pages that all have the same pattern such as:
/folder1/folder2/folder3/review
Whereas /folder1/folder2/folder3/ would be a listing, and adding /review would be what I want to disallow crawling of.
Would this line added to robots.txt be valid and block only the review page, and not the listing or anything else?
Disallow: /folder1/*/*/review
Thanks
The double * is redundant.
A simple
/folder1/*/review
or even
/*/review
will do.
If you are trying to state that there must be a path exactly three folders deep before the "review" URL, then I don't think you can do this in robots.txt. At least not with wildcards, because * can mean any string and any number of folders.
Try using RegEx in htaccess instead.
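To illustrate the difference, here is a Python sketch of the kind of regex an htaccess rewrite rule could use; [^/]+ matches exactly one path segment, which pins the depth in a way robots.txt's * cannot:

import re

# * in robots.txt can span any number of folders; [^/]+ matches
# exactly one segment, so this requires exactly two folders
# between /folder1/ and /review.
exact_depth = re.compile(r'^/folder1/[^/]+/[^/]+/review$')

print(bool(exact_depth.match('/folder1/folder2/folder3/review')))  # True
print(bool(exact_depth.match('/folder1/a/b/c/review')))            # False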

301 Redirect with Regular Expressions

Couldn't find an answer to this and thought it might be a quick answer.
My company, a local news site, is working on migrating to WordPress from a proprietary CMS. Part of the challenge is we are restructuring URLs. I will be utilizing 301 redirects but my issue is as follows:
Example Page name: Story Name: is "this"
Example Old CMS Page URL: /story-name--is--this-/
New CMS Page URL: /news/2012/09/12/story-name-is-this/
The old CMS turned special characters and spaces into hyphens. WordPress will be configured to instead ignore special characters and simply turn spaces into hyphens. Additionally, the old CMS did not include the date in the URL, and I'm not sure the best route to take regarding adding the date.
Thanks!
You're either going to have to write a script that takes each old link, looks it up in your database to find the new link, and redirects the browser to it. Or you'll have to enumerate the entire mapping of old links -> new links and create a 301 redirect for each of them (in either your vhost/server config or in an htaccess file):
Redirect 301 /story-name--is--this-/ /news/2012/09/12/story-name-is-this/
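If you go the enumeration route, a small one-off script can generate those directives from your old-to-new mapping; a sketch (the mapping itself would have to be exported from the old CMS's database):

# One-off sketch: emit one Redirect 301 line per old URL.
mapping = {
    '/story-name--is--this-/': '/news/2012/09/12/story-name-is-this/',
}

for old, new in mapping.items():
    print(f'Redirect 301 {old} {new}')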
It's not clear what your real question is. I am also not sure what regular expressions have to do with the problem.
There is no information about what your old CMS is capable of. But assuming that you can intercept the calls to old articles when they are accessed via the browser, you can, before the articles are rendered, build a redirect dynamically using whatever programming mechanisms are available in your proprietary CMS and send it back to the browser.
Again, assuming you have access to Java:
A. When generating the redirect URL, you can access the article's date and form the 2012/09/12 part from it; SimpleDateFormat can format a Date into a string using a pattern like yyyy/MM/dd.
B. You can use a similar approach for the titles and replace or remove the special characters in the title string. For example, the StringUtils class from Apache Commons Lang lets you specify a set of characters to look for and replace any that are found with a target character.
C. You concatenate the output of A and B to create the target redirect URL and send it back to the browser instead of the article itself.
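The answer above assumes Java; here are the same A/B/C steps sketched in Python, purely for illustration (the slug rules are a guess at WordPress's defaults):

import re
from datetime import date

def new_url(title, published):
    # B: drop special characters, collapse spaces into hyphens
    slug = re.sub(r'[^a-z0-9 ]', '', title.lower())
    slug = re.sub(r'\s+', '-', slug.strip())
    # A: format the article date as yyyy/mm/dd
    datepath = published.strftime('%Y/%m/%d')
    # C: concatenate into the new URL
    return f'/news/{datepath}/{slug}/'

print(new_url('Story Name: is "this"', date(2012, 9, 12)))
# /news/2012/09/12/story-name-is-this/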

URL without "http|https"

I just learned from a colleague that omitting the "http | https" part of a URL in a link will make that URL use whatever scheme the page it's on uses.
So for example, if my page is accessed at http://www.example.com and I have a link (notice the '//' at the front):
<a href="//www.google.com">Google</a>
That link will go to http://www.google.com.
But if I access the page at https://www.example.com with the same link, it will go to https://www.google.com
I wanted to look online for more information about this, but I'm having trouble thinking of a good search phrase. If I search for "URLs without HTTP" the pages returned are about urls with this form: "www.example.com", which is not what I'm looking for.
Would you call that a schemeless URL? A protocol-less URL?
Does this work in all browsers? I tested it in FF and IE 8 and it worked in both. Is this part of a standard, or should I test more browsers?
It's called a protocol-relative URL.
You may receive unusual security warnings in some browsers.
See also the Wikipedia article on protocol-relative URLs for a brief definition.
At one time, it was recommended; but going forward, it should be avoided.
See also the Stack Overflow question Why use protocol-relative URLs at all?.
It is called a network-path reference (the part that is missing is called the scheme, or protocol), defined in RFC 3986, Section 4.2:
4.2 Relative Reference
A relative reference takes advantage of the hierarchical syntax
(Section 1.2.3) to express a URI reference relative to the name space
of another hierarchical URI.
relative-ref  = relative-part [ "?" query ] [ "#" fragment ]
relative-part = "//" authority path-abempty
              / path-absolute
              / path-noscheme
              / path-empty
The URI referred to by a relative reference, also known as the target URI, is obtained by applying the reference resolution
algorithm of Section 5.
A relative reference that begins with two slash characters is
termed a network-path reference (emphasis mine); such references are rarely used.
A relative reference that begins with a single slash character is termed an absolute-path reference. A relative reference that does not begin with a slash character is termed a relative-path reference.
A path segment that contains a colon character (e.g., "this:that") cannot be used as the first segment of a relative-path reference, as it would be mistaken for a scheme name. Such a segment must be preceded by a dot-segment (e.g., "./this:that") to make a relative-path reference.
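The reference resolution algorithm from Section 5 is what browsers apply; Python's urllib implements it as well, which makes the scheme inheritance easy to demonstrate:

from urllib.parse import urljoin

# A network-path reference inherits the scheme of the base URL.
print(urljoin('http://www.example.com/page', '//www.google.com'))
# http://www.google.com
print(urljoin('https://www.example.com/page', '//www.google.com'))
# https://www.google.com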
