Disallowing long URLs in robots.txt using wildcards in between

I have a situation where I need to disallow crawling on specific pages that all have the same pattern such as:
/folder1/folder2/folder3/review
Here /folder1/folder2/folder3/ would be a listing, and the URL with /review appended is what I want to disallow crawling of.
Would this line added to robots.txt be valid, and would it block only the review page and not the listing or anything else?
Disallow: /folder1/*/*/review
Thanks

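The double * is largely redundant: each * can match any string, slashes included, so the second one only requires one extra / in the path.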
A simple
/folder1/*/review
or even
/*/review
will do.
If you are trying to state that there must be a path exactly three folders long before the "review" URL, then I don't think you can do this in robots.txt. At least not with wildcards, because * can mean any string and any number of folders.
Try using RegEx in htaccess instead.
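To make the "any number of folders" point concrete, here is a minimal Python sketch that translates a wildcard pattern into a regular expression, assuming the widely used extension where * matches any sequence of characters and a trailing $ anchors the end of the URL. The helper names robots_pattern_to_regex and is_blocked are made up for this illustration, and real crawlers may differ in details.

```python
import re

def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern (with * and a trailing $) into a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape every literal piece, then let each * stand for "any sequence of characters".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

def is_blocked(pattern: str, path: str) -> bool:
    # Rules match from the start of the path; without $ they behave as prefix matches.
    return robots_pattern_to_regex(pattern).match(path) is not None

paths = [
    "/folder1/a/b/review",      # three folders deep
    "/folder1/a/review",        # two folders deep
    "/folder1/a/b/c/d/review",  # five folders deep
    "/folder1/a/b/",            # the listing itself
]
for path in paths:
    print(path,
          is_blocked("/folder1/*/review", path),
          is_blocked("/folder1/*/*/review", path))
```

Under these assumptions neither pattern can pin the depth to exactly three folders. Because a rule without $ is a prefix match, /folder1/*/review would also cover URLs such as /folder1/a/b/reviews; ending the rule with $ avoids that for crawlers that support it.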

Related

Curly braces in robots txt file

I have been working on web scraping and encountered the below patterns in one robots.txt file.
Disallow: /*{{url}}*
Disallow: /*{{imageURL}}*
Do they mean that I am not allowed to scrape any URL?
This looks like the site author made an error, as {{url}} and {{imageURL}} were probably intended to be variables that should be replaced with the actual values.
When interpreting this record according to the original robots.txt specification, all characters have to be interpreted literally, so URLs like these would be disallowed:
https://example.com/*{{url}}*
https://example.com/*{{url}}*.bar
https://example.com/*{{url}}*/
https://example.com/*{{url}}*/foo
As { and } are not allowed to appear in a URL path (list of allowed characters), it would mean that all URLs are allowed to be crawled. However, if you prefer, you could assume that it applies to the percent-encoded forms of {/}, but that’s not something the spec requires.
When interpreting this record based on popular extensions of the robots.txt spec (e.g., as used by Google Search), the * has a special meaning: each * in a Disallow value can be replaced with nothing or any sequence of characters. This would lead to many more disallowed URLs, but they would still have to contain literally {{url}} and {{imageURL}}.
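The two readings can be compared with a small Python sketch. The functions blocked_literal and blocked_wildcard are hypothetical helpers written for this answer, not part of any robots.txt library, and the sample paths are invented.

```python
import re

RULE = "/*{{url}}*"

def blocked_literal(rule: str, path: str) -> bool:
    # Original spec: the value is a plain path prefix, so * { } are ordinary characters.
    return path.startswith(rule)

def blocked_wildcard(rule: str, path: str) -> bool:
    # Common extension: each * stands for any (possibly empty) sequence of characters.
    regex = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.match(regex, path) is not None

for path in ["/*{{url}}*/foo", "/products/{{url}}/x", "/products/list"]:
    print(path, blocked_literal(RULE, path), blocked_wildcard(RULE, path))
```

In both readings an ordinary path such as /products/list stays crawlable, because it never contains the literal text {{url}}.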

Google Analytics Goals: Prevent tracking of URL parameters of subfolders

On my site I am tracking the URL /shop/ as a goal by head match. As there are some URL parameters, I cannot use exact match here.
Additionally, I am tracking a goal by exact match which is a URL to subfolder: /shop/process/paid.php
The problem is that GA tracks this subfolder with the head match as well, and thus saves the URL parameters that come along with paid.php, e.g. paid.php?email=customer#home.com
How can I prevent GA from tracking the URL parameters?
What would the setup look like?
Thanks!
That should work with a custom filter:
admin->profile->filters->custom filter->search and replace.
Search for
/shop/process/paid.php\?.*
(that's your URL with arbitrary query parameters; the "\" is an escape sign, since "?" is also a control character in regular expressions. The dot means any character and "*" means any number of the preceding character, in this case any character at all) and replace with the desired URL (/shop/process/paid.php).
There is probably a more elegant solution but like most people I'm not good at this regex stuff. This should work however.
Alternatives:
If those query parameters are not needed anywhere in the tracking data, you can exclude them completely in the profile settings.
You can create a profile for the subdirectory (include filter->request uri contains "/shop") and set only this profile to remove query parameters.
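The effect of the search-and-replace expression can be checked outside GA with a few lines of Python. This only mimics what the filter would do to the request URI, on the assumption that GA applies the pattern much like a standard regex substitution; strip_paid_query is a name invented for this sketch, and the dot in "paid.php" is escaped here as a small tightening of the pattern above.

```python
import re

# Same idea as the filter's search field, with the dot in "paid.php" escaped too.
SEARCH = r"/shop/process/paid\.php\?.*"
REPLACE = "/shop/process/paid.php"

def strip_paid_query(request_uri: str) -> str:
    # Replace the URL plus any query string with the bare URL.
    return re.sub(SEARCH, REPLACE, request_uri)

print(strip_paid_query("/shop/process/paid.php?email=customer#home.com"))
print(strip_paid_query("/shop/?ref=1"))
```

URIs that do not match, such as /shop/?ref=1, pass through unchanged.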

301 Redirect with Regular Expressions

Couldn't find an answer to this and thought it might be a quick answer.
My company, a local news site, is working on migrating to WordPress from a proprietary CMS. Part of the challenge is we are restructuring URLs. I will be utilizing 301 redirects but my issue is as follows:
Example Page name: Story Name: is "this"
Example Old CMS Page URL: /story-name--is--this-/
New CMS Page URL: /news/2012/09/12/story-name-is-this/
The old CMS turned special characters and spaces into hyphens. WordPress will be configured to instead ignore special characters and simply turn spaces into hyphens. Additionally, the old CMS did not include the date in the URL, and I'm not sure the best route to take regarding adding the date.
Thanks!
You have two options. Either write a script that takes each old link, looks it up in your database to find the new link, and redirects the browser there, or enumerate the entire mapping of old links -> new links and create a 301 redirect for each of them (in either your vhost/server config or in an htaccess file):
Redirect 301 /story-name--is--this-/ /news/2012/09/12/story-name-is-this/
It's not clear what your real question is, and I am also not sure what regular expressions have to do with the problem.
There is no information about what your old CMS is capable of. Assuming you can intercept requests for old articles before they are rendered, you can build the new URL dynamically with whatever programming mechanisms your proprietary CMS offers and send a redirect back to the browser.
Again, assuming you have access to Java:
A. When generating the redirect URL you can access the article's date and form the 2012/09/12 part from it; SimpleDateFormat can format a Date into a string with a pattern such as yyyy/MM/dd.
B. You can use a similar approach with the titles and strip the special characters from the title string. For example, StringUtils from Apache Commons Lang lets you specify a set of characters to look for and replaces any it finds with a target character.
C. You concatenate the output of A and B to create the target redirect URL and send it back to the browser instead of the article itself.
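Steps A to C can be sketched as follows. The answer assumes Java; Python is used here only to keep the illustration short. The slug rules are assumptions read off the question (the old CMS turns every space or special character into a hyphen, the new one drops special characters and hyphenates spaces, both lowercased), and old_slug, new_slug and redirect_line are hypothetical helper names.

```python
import re
from datetime import date

def old_slug(title: str) -> str:
    # Assumed old CMS behaviour: every space or special character becomes a hyphen.
    return re.sub(r"[^a-z0-9]", "-", title.lower())

def new_slug(title: str) -> str:
    # Assumed new behaviour: drop special characters, then join the words with hyphens.
    words = re.sub(r"[^a-z0-9\s]", "", title.lower()).split()
    return "-".join(words)

def redirect_line(title: str, published: date) -> str:
    # A: format the date, B: clean the title, C: concatenate into the target URL.
    old_path = f"/{old_slug(title)}/"
    new_path = f"/news/{published.strftime('%Y/%m/%d')}/{new_slug(title)}/"
    return f"Redirect 301 {old_path} {new_path}"

print(redirect_line('Story Name: is "this"', date(2012, 9, 12)))
# prints: Redirect 301 /story-name--is--this-/ /news/2012/09/12/story-name-is-this/
```

Run over every article in the old database, a loop like this could either generate the full list of Redirect lines or compute the target on the fly.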

IIS 7 URL rewrite on WCF Service

Question Edited for better understanding:
I have a WCF service and all of my links look like:
https://192.168.1.31/ContactLibrary2.0HTTPS/Service.svc/..... .
I want to get rid of the Service.svc. I installed URL Rewrite in IIS but I don't know how to work with it. I searched a little and didn't find anything to help me with this particular problem.
Any idea ?
Assuming you are configuring the application hosted at /ContactLibrary2.0HTTPS directly (and not the website containing that directory, for example), you may add an exact match for:
rest/GetContact
with a rewrite url of:
Service.svc/rest/GetContact
Perhaps you wish to rewrite every action of Service.svc, however; then you would need a regular expression match for:
^rest/.*$
with a rewrite url of:
Service.svc/{R:0}
UPDATE
Assuming you also need to remove that string from the urls of your HTML pages, you would need to couple the aforementioned inbound rule with a new outbound rule, applied to the files you are interested in.
To do that, please:
add a new outbound rule to your website and give it a name;
add a new precondition with two rules (matching any of them):
{RESPONSE_CONTENT_TYPE} matches text/html
{RESPONSE_CONTENT_TYPE} matches application/xhtml+xml
configure the rule to match the response scope, matching the content within A tags:
should match the pattern using a regular expression;
with this pattern: ^(.*)(/Service\.svc/)(.*)$
case insensitive;
configure the action to be a rewrite, with this value: {R:1}/{R:3} (the slash is needed because the second group consumes both slashes around Service.svc)
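To see what the two patterns do to a concrete URL, here is a rough Python emulation. It is not IIS configuration: Python's re module merely stands in for the URL Rewrite regex engine, and inbound_rewrite and outbound_rewrite are names invented for the sketch. It keeps a single slash between the two captured parts of the outbound pattern, since the middle group consumes both slashes around Service.svc.

```python
import re

def inbound_rewrite(path: str) -> str:
    # Inbound rule: ^rest/.*$  ->  Service.svc/{R:0}   ({R:0} is the whole match).
    match = re.match(r"^rest/.*$", path)
    return "Service.svc/" + match.group(0) if match else path

def outbound_rewrite(href: str) -> str:
    # Outbound rule: remove "Service.svc" from a link, keeping one slash between the parts.
    match = re.match(r"^(.*)(/Service\.svc/)(.*)$", href, flags=re.IGNORECASE)
    return match.group(1) + "/" + match.group(3) if match else href

print(inbound_rewrite("rest/GetContact"))
print(outbound_rewrite(
    "https://192.168.1.31/ContactLibrary2.0HTTPS/Service.svc/rest/GetContact"))
```

The first call shows the request being mapped back onto Service.svc internally; the second shows a link in the HTML response losing the Service.svc segment.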

Need regex help for URL rewriting querystrings to friendly URLs

I updated my website CMS and the URL formats have changed. Where previously I had the URL /blog.aspx?Year=XXXX&Month=YY I now have /blog/XXXX/YY
Can someone help me create a regex for this?
Two additional notes:
it also has to support the year on its own (/blog.aspx?Year=XXXX)
the old Month urls use only 1 digit for single digit months (/blog.aspx?Year=2009&Month=2 instead of Month=02)
Here is what I came up with:
/blog.aspx[?]Year=([0-9]{4})([&]?)(Month=)?([0-9]*)
I can't seem to get it to work, as I still get a 404 on the page when I go to one of the above URLs.
Is this workable?
/blog.aspx\?Year=([0-9]{4})(?>\&?Month=?([0-9]{1,2})|)
works with these input
/blog.aspx?Year=1983&Month=2
/blog.aspx?Year=1983
/blog.aspx?Year=1983&Month=12
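This uses the (?>blabla|moomoo) syntax, an atomic group with two alternatives: if blabla cannot be matched, moomoo is tried instead. Here the second alternative is empty, which makes the Month part optional.
Though I suspect the regex is not the root problem here. Which CMS handles the redirect?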
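For testing the mapping outside the CMS, a small Python sketch is shown below. It swaps the atomic group for an optional non-capturing group, which older Python versions also accept, and it assumes the new URLs want a two-digit month because the target format is written /blog/XXXX/YY. The name new_blog_path is made up for the example.

```python
import re
from typing import Optional

# Year is required; "&Month=" with one or two digits is optional.
OLD_BLOG_URL = re.compile(r"^/blog\.aspx\?Year=([0-9]{4})(?:&Month=([0-9]{1,2}))?$")

def new_blog_path(old_url: str) -> Optional[str]:
    match = OLD_BLOG_URL.match(old_url)
    if match is None:
        return None  # not an old blog URL, leave it alone
    year, month = match.groups()
    if month is None:
        return f"/blog/{year}"
    # Assumption: pad single-digit months to two digits (2 -> 02).
    return f"/blog/{year}/{int(month):02d}"

for url in ["/blog.aspx?Year=2009&Month=2", "/blog.aspx?Year=1983", "/blog.aspx?Year=1983&Month=12"]:
    print(url, "->", new_blog_path(url))
```

If the three sample URLs map correctly here but the site still returns a 404, the rule is probably not being applied by the server at all, which fits the suspicion above.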
