Disallow users from replacing hyphens with periods on WordPress sites

After reviewing Google Analytics and ad traffic, we realized that people were able to find pages on client sites in a very odd way: by replacing the hyphens with periods.
For example...
Correct permalink: www.domain.com/this-is-a-link
Incorrect: www.domain.com/this.is.a.link
Both work and send the user to the same page, but I'm not sure why. We tried various browsers and it seems to work the same in all of them. Generally speaking, this might even be helpful to the user, but it is skewing the analytics.
I suspect the ad campaign folks created a link with the periods which started the problem. But even with fixing that, it doesn't answer the question of why this even works or how to disallow this behavior / functionality.
Any thoughts?

WordPress uses mod_rewrite for permalinks, and mod_rewrite uses regular-expression pattern matching against your URLs, driven by the rules in your .htaccess file, to distinguish what to rewrite from what not to rewrite.
In a regular expression, the . character matches any single character, and regular expressions are what mod_rewrite uses to determine what to rewrite.
To illustrate with your example: www.domain.com/this-is-a-link is the correct link that you want, but www.domain.com/this.is.a.link will also match, because an unescaped dot means "any character", so each . is happily accepted where a - would be.
You can read more about mod_rewrite to get a better understanding of why periods are being read as hyphens.
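For reference, the stock rules WordPress writes to .htaccess look like this. Note the final RewriteRule: its pattern is a bare ., which matches any character, so every pretty permalink gets funneled to index.php, where WordPress's own regex-based rules take over:

# BEGIN WordPress
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
RewriteRule ^index\.php$ - [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
</IfModule>
# END WordPress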
The only way to solve this is to override WordPress's default native mod_rewrite patterns, or to report it to the core community and ask them to track it as a bug. But this seems pretty common, even on large sites such as eBay, where the URL
http://www.ebay.com/rpp/halloween-events/sweet-treats
and the URL
http://www.ebay.com/rpp/halloween-events/sweet.treats
are both valid. I believe this is a limitation of mod_rewrite, so you might want to live with it.
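That said, if you do want to stop it, one workaround is to 301-redirect dotted URLs to their hyphenated equivalents before the WordPress rules run. This is only a sketch I haven't battle-tested against a full install, so treat it as a starting point; it would go above the # BEGIN WordPress block:

RewriteEngine On
# Leave real files and directories (stylesheets, uploads, wp-admin, ...) alone
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
# Swap the first period in the path for a hyphen and redirect;
# the client follows one redirect per period until none remain
RewriteRule ^([^.]+)\.(.+)$ /$1-$2 [R=301,L]

A request for /this.is.a.link becomes /this-is.a.link, then /this-is-a.link, and finally /this-is-a-link, and the 301s keep analytics and search engines pointed at the canonical hyphenated URL.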

Related

How to check if any URLs within my website contain specific Google Analytics UTM codes?

The website I manage uses Google Analytics to track URLs. Recently I found out that some of the URLs contain UTM codes and should not. I need some way of determining whether or not URLs that contain the following UTM codes utm_source=redirect or utm_source=redirectfolder are currently on the website and being redirected within the same website. If so, I will need to remove the UTM codes on those URLs, because Google Analytics automatically tracks URLs that redirect within the same domain. So it does not require UTM codes (and this actually hurts the analytics).
My apologies if I sound a little broken here, I am still trying to understand it all myself, as I am a new graduate with a CS degree and I am now the only web developer. I am not asking for anyone to write this for me, just if I could be pointed in the right direction to writing a ColdFusion script that may help with this.
So if I understand correctly, your codebase is riddled with problematic URLs. To clean up the URLs programmatically you'll need to do a couple of things up front:
1. Identify the querystring parameter/value pair that needs to be eliminated.
2. Create a worker file that can access all your .cfm and .cfc files (of interest).
3. Loop through the directories, reading, editing, and saving your files (be careful here: consider writing to new files rather than overwriting existing ones, unless you are sure).
4. Use a find/replace function or a regex to target and remove your troublesome parameters.
5. Save the file and move on in the loop.
OR:
You can use an IDE like Dreamweaver or Sublime Text to locate these via a regex search, then spot-check and remove them.
I would selectively remove the URL parameters by hand, but if you have so many pages that manual editing makes no sense, then programmatic removal is the way to go.
You will be using cfdirectory, cffile, and either reMatch() (build an array and reassemble the string) or find/replace with replaceNoCase().
Your cfdirectory call returns a query object, and you can spin through it just as you would a normal query with cfloop or cfoutput.
Pull one or two files out of your repo to develop your code against until you are comfortable; a rough sketch follows below. I would code in exit strategies (fail gracefully), like adding a locatable comment at each change spot so you can check it later manually, or escaping out if a file won't write, and many other try/catch opportunities.
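To make that concrete, here is a rough sketch of the loop, with hypothetical paths and a placeholder regex you would adapt; it writes changes to a copy instead of overwriting, per the caution above:

<!--- List the .cfm templates under a placeholder source directory --->
<cfdirectory action="list" directory="#expandPath('/myapp')#" name="qFiles" filter="*.cfm" recurse="true">
<cfloop query="qFiles">
    <cfif qFiles.type EQ "file">
        <cfset fullPath = qFiles.directory & "/" & qFiles.name>
        <cffile action="read" file="#fullPath#" variable="fileContent">
        <!--- Drop the offending parameter; matches utm_source=redirect and utm_source=redirectfolder --->
        <cfset cleaned = reReplaceNoCase(fileContent, "[?&]utm_source=redirect(folder)?", "", "all")>
        <cfif cleaned NEQ fileContent>
            <!--- Write to a sibling copy so every change can be reviewed before going live --->
            <cffile action="write" file="#fullPath#.cleaned" output="#cleaned#">
        </cfif>
    </cfif>
</cfloop>

One wrinkle to watch: if utm_source is the first parameter and others follow it, stripping it leaves a stray & where the ? used to be, so those URLs need a second cleanup pass (or a smarter regex).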
I hope this helps.

HTTP Followed By Four Slashes?

We've just enabled Flexible SSL (CloudFlare) on our website, and I was going through swapping all the http://example.com/ links to protocol-relative //example.com/ ones when I noticed the link to the Font Awesome CSS file was like this:
http:////maxcdn.bootstrapcdn.com/font-awesome/4.3.0/css/font-awesome.min.css
The http: is followed by four slashes. I've seen three (when using local files in the browser), and two is the general standard, but four?
So what does four do? Is it any different to two? And can I swap http:////example.com/ to //example.com/ or should it be ////example.com/?
Is it any different to two?
Well, one is in line with RFC 3986, the other is not. Section 3 clearly states that the separator between the scheme and the authority has to be ://. In the case of protocol-relative URLs, the start has to be //. If there is another slash after that, it has to be part of an absolute path reference.
The only way an extra pair of slashes could legitimately appear there is if they were part of the authority and left unencoded. That could happen if // were the start of:
a user name
a domain name
Neither one seems to be the case here, and I am pretty sure that (2) clashes heavily with the requirements for domain names, while (1) is almost guaranteed to cause interoperability issues. So I assume it's an error by whoever wrote that markup.
A quick test revealed that Firefox eliminates the bogus slashes in the URL, while w3m errors out.

What does //RK=0 mean at the end of a URL?

When I check the Google index with "site:http://example.com" I see a few webpages where the real page URL is followed by //RK=0. I have no idea what this means, how it got there, or why Google indexed it.
As a developer, I need to know why anyone would put such a suffix on the end of a URL, because it's nothing I coded for.
Copying DMAN's answer, one of the non-selected answers from: .htaccess rewrite rule remove everything after RK=0/RS=
Someone found where this mess is coming from.
http://xenforo.com/community/threads/server-logs-with-rk-0-rs-2-i-now-know-what-these-are.73853/
It looks like it's actually NOT malicious; it's something broken with Yahoo rewrites that creates URLs pointing to pages that don't exist.
The demo described on XenForo does replicate it, as does the pattern of the URLs that Yahoo is producing:
http://r.search.yahoo.com/_ylt=A0SO810GVXBTMyYAHoxLBQx./RV=2/RE=1399899526/RO=10/RU=http%3a%2f%2fkidshealth.org%2fkid%2fhtbw%2f/RK=0/RS=y2aW.Onf1Hs6RISRJ9Hye6gXvow-
It sure does look like the RV=, RE=, RU=, RK=, RS= values are of the same family; it's just that somewhere the argument concatenation is screwing up on their side.
Neal's additional comment:
Maybe at one time there was a bug or something odd going on at Yahoo. It caused Google to enter these extra URLs into its index. I am going to remove them with the Webmaster Tools "disavow" feature, as Google may be seeing them as duplicate content and thus contributing to a "Panda" penalty.

Concrete 5 search results page url

The Concrete5 search results page URL contains some parameters. How do I remove those parameters and make the URL user-friendly?
On an Apache server, I recommend using the mod_rewrite module and its RewriteEngine.
With this module you can specify aliases for internal URLs (including ones with parameters), and you can use regex to do it; see the example after the links below.
RewriteEngine on Wikipedia
mod_rewrite tutorial
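For example, a rule like the following (the paths are purely illustrative, not Concrete5-specific) aliases a friendly URL onto an internal parameterized one:

RewriteEngine On
# Serve the friendly /products/42 from the internal parameterized URL
RewriteRule ^products/([0-9]+)$ /product.php?id=$1 [L]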
Short answer: it's probably not worth the trouble.
Long answer...
I'm guessing you see three query parameters when using the search block:
query
search_paths[]
submit
The first parameter is required to make the searches work, but the other two can be dropped. When I build Concrete5 themes, I usually "hard-code" the HTML for the search form so that I can control which parameters are sent (basically, don't provide a "name" to the submit button, and don't include a "search_paths" hidden field).
The "query" parameter, though, is not going to be easy to get rid of. The problem is that for a search, you're supposed to have a parameter like that in the URL. You could work around this by using javascript -- when the search form is submitted, use some jquery to rewrite the request so it puts that parameter at the end of the URL (for example, http://example.com/search?query=test becomes http://example.com/search/test). Then, as #tuxtimo suggests, you add a rewrite rule to your .htaccess file to take that last piece of the URL and treat it as the ?query parameter that the system expects. But this won't work if the user doesn't have javascript enabled (and hence probably not for Googlebot either, which means that this won't really serve you any SEO purpose -- which I further imagine is the real reason you're asking this question to begin with).
Also, you will run into a lot of trouble if you ever add another page beneath the page that shows the search results (because the rewrite rule treats everything after the top-level search page path as a search parameter, so you can never actually reach an address that exists below that path).
So I'd just make a nice clean search form that only sends the ?query parameter and leave it at that -- I don't think those are really that much less user-friendly than /search-term would be.

How should I sanitize URLs so people don't put 漢字 or á or other things in them?

How should I sanitize URLs so people don't put 漢字 or other things in them?
EDIT: I'm using Java. The URL will be generated from a question the user asks on a form. It seems StackOverflow just removes the offending characters, but it also turns an á into an a.
Is there a standard convention for doing this? Or does each developer just write their own version?
The process you're describing is called "slugifying". There's no fixed mechanism for doing it; every framework handles it in its own way.
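Since you say you're using Java, here's a minimal sketch of one common approach (the class and method names are my own): Unicode-normalize so accented letters like á decompose and fold to plain a, then strip everything else, which also drops 漢字 entirely, much as you observed StackOverflow doing:

import java.text.Normalizer;

public class Slugifier {
    public static String slugify(String input) {
        // Decompose accented characters: á becomes a + combining accent
        String normalized = Normalizer.normalize(input, Normalizer.Form.NFD);
        // Strip the combining marks left over from decomposition
        String ascii = normalized.replaceAll("\\p{M}", "");
        // Collapse every run of non-alphanumerics (spaces, punctuation, 漢字) into one hyphen
        String slug = ascii.toLowerCase().replaceAll("[^a-z0-9]+", "-");
        // Trim leading and trailing hyphens
        return slug.replaceAll("^-+|-+$", "");
    }

    public static void main(String[] args) {
        System.out.println(slugify("This is á link, with 漢字!")); // prints: this-is-a-link-with
    }
}

One caveat, echoed in the answers below: distinct titles can collapse to the same slug, which is why StackOverflow pairs the slug with a numeric question ID that does the real work.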
Yes, I would sanitize/remove those characters: left in, they will either be inconsistent or look ugly once encoded.
If you're using Java, see the URLEncoder API docs.
Be careful, though! If you are removing elements such as odd characters, then two distinct inputs could yield the same stripped URL when they aren't meant to.
The specification for URLs (RFC 1738, Dec. '94) poses a problem, in that it limits the use of allowed characters in URLs to only a limited subset of the US-ASCII character set
This means such characters will get encoded, and URLs should be readable. Standards tend to be English-biased (what's that? Langist? Languagist?).
I'm not sure what the convention is in other countries, but if I saw tons of encoding in a URL sent to me, I would think it was stupid or suspicious ...
Unless the link is displayed properly, encoded by the browser and decoded at the other end ... but do you want to take that risk?
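To see the "ugly encoded" outcome concretely, this is what java.net.URLEncoder (mentioned above) produces for such input; a quick sketch:

import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class EncodeDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // Each UTF-8 byte is percent-encoded: prints %E6%BC%A2%E5%AD%97
        System.out.println(URLEncoder.encode("漢字", "UTF-8"));
        // Prints %C3%A1
        System.out.println(URLEncoder.encode("á", "UTF-8"));
    }
}

Readable to a machine, but exactly the kind of URL the comment above would flag as suspicious.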
StackOverflow seems to just remove those chars from the URL altogether :)
StackOverflow can afford to remove the characters because it includes the question ID in the URL. The slug containing the question title is for convenience, and isn't actually used by the site, AFAIK. For example, you can remove the slug and the link will still work fine: the question ID is what matters and is a simple mechanism for making links unique, even if two different question titles generate the same slug. Actually, you can verify this by trying to go to stackoverflow.com/questions/2106942/… and it will just take you back to this page.
Thanks, Mike Spross.
Which language are you talking about?
In PHP I think this is the easiest and would take care of everything:
http://us2.php.net/manual/en/function.urlencode.php
