What does //RK=0 mean at the end of a URL? - wordpress

When I check Google's index with "site:http://example.com", I see a few pages whose real URL is followed by //RK=0. I have no idea what this means, how it got there, or why Google indexed it.
As a developer, I need to know why anyone would put such a suffix on the end of a URL, because it's nothing I coded for.

Copying DMAN's answer, one of the non-accepted answers from: .htaccess rewrite rule remove everything after RK=0/RS=
Someone found where this mess is coming from.
http://xenforo.com/community/threads/server-logs-with-rk-0-rs-2-i-now-know-what-these-are.73853/
It looks like it is actually NOT malicious; something is broken in Yahoo's rewrites, which create URLs that point to pages that don't exist.
The demo described on xenforo does replicate it, and this is the pattern of the URLs that Yahoo is producing:
http://r.search.yahoo.com/_ylt=A0SO810GVXBTMyYAHoxLBQx./RV=2/RE=1399899526/RO=10/RU=http%3a%2f%2fkidshealth.org%2fkid%2fhtbw%2f/RK=0/RS=y2aW.Onf1Hs6RISRJ9Hye6gXvow-
Sure does look like the RV=, RE=, RU=, RK=, RS= values are of the same family. It's just that somewhere the arg concatenation is screwing up on their side.
Neal's additional comment:
Maybe at one time there was a bug or something odd going on with Yahoo. It caused Google to enter these extra URLs into its index. I am going to remove them with the WebmasterTools "disavow" feature, as Google may be seeing them as duplicate content, thus contributing to a "Panda" penalty.

Related

How to check if any URLs within my website contain specific Google Analytics UTM codes?

The website I manage uses Google Analytics to track URLs. Recently I found out that some of the URLs contain UTM codes and should not. I need some way of determining whether or not URLs that contain the following UTM codes utm_source=redirect or utm_source=redirectfolder are currently on the website and being redirected within the same website. If so, I will need to remove the UTM codes on those URLs, because Google Analytics automatically tracks URLs that redirect within the same domain. So it does not require UTM codes (and this actually hurts the analytics).
My apologies if I sound a little broken here, I am still trying to understand it all myself, as I am a new graduate with a CS degree and I am now the only web developer. I am not asking for anyone to write this for me, just if I could be pointed in the right direction to writing a ColdFusion script that may help with this.
So if I understand correctly, your codebase is riddled with problematic URLs. To clean up the URLs programmatically, you'll need to do a few things up front:
1. Identify the query string parameter name/value pair that needs to be eliminated.
2. Create a worker file to access all your .cfm and .cfc files (of interest).
3. Create a loop that goes through the directories and reads, edits and saves your files. Be careful here: unless you are sure, do not overwrite the existing files; write uniquely named copies instead.
4. Create a find/replace function or a regular expression to target and remove your troublesome parameters.
5. Save your file and move on in the loop.
OR:
You can use an IDE or editor like Dreamweaver or Sublime Text to locate these via a regex search, spot-check them, and remove them.
I would selectively remove the URL parameters, but if you have so many pages that it makes no sense, then programmatic removal would be the way to go.
You will be using cfdirectory, cffile, and either reMatch() (creating an array and rebuilding the string) or a find/replace with replaceNoCase().
Your cfdirectory call will return a variable that behaves like a query, so you loop over it the same way you do with a normal query and cfoutput.
Pull one or two files out of your repo to develop your code against until you are comfortable. I would code in exit strategies (fail gracefully), such as adding a locatable comment at each change spot so you can check it manually later, or escaping out if a file won't write, along with other try/catch opportunities.
I hope this helps.
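The steps above can be sketched roughly as follows. This is written in Python rather than ColdFusion, purely to show the shape of the loop and the pattern; the function names (strip_utm, scan_tree), the ".cleaned" copy suffix, and the UTF-8 encoding are assumptions for illustration, not part of the original answer. It also assumes plain "&" separators, so links written with "&amp;" would need an extra pattern.

```python
import re
from pathlib import Path

# The longer value is listed first so "redirectfolder" is not cut short; the
# lookahead stops the pattern matching values that merely start with "redirect".
UTM_PATTERN = re.compile(
    r"([?&])utm_source=(?:redirectfolder|redirect)(?![\w-])(&?)",
    re.IGNORECASE,
)


def strip_utm(text):
    """Remove the unwanted utm_source pair, keeping any other parameters."""
    def fix(match):
        lead, trail = match.group(1), match.group(2)
        # If more parameters follow, keep the separator that introduced the pair.
        return lead if trail else ""
    return UTM_PATTERN.sub(fix, text)


def scan_tree(root, suffixes=(".cfm", ".cfc"), write_copies=False):
    """Yield (path, hit_count) for every template that contains the pair."""
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in suffixes:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (OSError, UnicodeDecodeError):
            continue  # fail gracefully: skip files that cannot be read
        hits = len(UTM_PATTERN.findall(text))
        if hits == 0:
            continue
        if write_copies:
            # Write a sibling copy instead of overwriting the original file.
            cleaned = path.with_name(path.name + ".cleaned")
            cleaned.write_text(strip_utm(text), encoding="utf-8")
        yield path, hits
```

Running scan_tree with write_copies left at False only reports which files contain the parameters, which fits the advice to spot-check before changing anything.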

Get page through database, but based on the URL

I have a site with the URL looking like this:
http://example.com/index.php?id=1 - I want to turn that into http://example.com/1-page-title-goes-here. I do not want a redirect, because nobody knows it's index.php?id=1 yet, because the site has not launched yet.
The id=1 determines what the page should load. If id=1, it will do a SELECT from the database on that ID, then display the appropriate information on the page. Fairly simple, and it does the job really well.
I am running Nginx on my server, which I assume would be the place to handle this, but I am not sure. From what I can see around the web, the rewrite rule would check the first digit(s) after / and before -, then do the database lookup based on that. However, doesn't that mean you could type anything after the - and it would still be a legit link? Like: http://example.com/1-anything-can-go-here-and-it-would-not-matter. Am I right?
Which approach would be the best? The rewrite to check the first digit(s) would work, but what about the last part of the URL? Thanks in advance!
EDIT: I just noticed Stack Overflow works like this. If I add anything after http://stackoverflow.com/questions/32848332/get-page-through-database-but-based-on-the-url, it will change it back to the original URL. The same happens if I remove something from the URL.
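The behaviour described in the EDIT (look the page up by its leading ID, then send the visitor to the canonical URL if the rest does not match) can be sketched as below. This is a minimal sketch in Python, not Nginx configuration or PHP; the PAGES dict is a hypothetical stand-in for the database, and slugify and resolve are illustrative names.

```python
import re

# Hypothetical stand-in for the database table: id -> page title.
PAGES = {1: "Page Title Goes Here"}

# Leading digits, optionally followed by "-" and any slug text.
PATH_RE = re.compile(r"^/(\d+)(?:-([^/]*))?$")


def slugify(title):
    """Lowercase the title and collapse everything else into single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def resolve(path, pages=PAGES):
    """Return (status, payload): 200 with the title, 301 with the canonical
    path, or 404 with None."""
    match = PATH_RE.match(path)
    if match is None:
        return (404, None)
    page_id = int(match.group(1))
    title = pages.get(page_id)
    if title is None:
        return (404, None)
    canonical = f"/{page_id}-{slugify(title)}"
    if path != canonical:
        # Only the ID is trusted; any other slug is redirected.
        return (301, canonical)
    return (200, title)
```

With this shape, /1-anything-can-go-here still reaches the right page, but only after a redirect to /1-page-title-goes-here, so a single URL per page ends up being shared and indexed.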

Disallow users from replacing hyphens with periods on WordPress sites

After reviewing Google Analytics and ad traffic, we realized that people were able to find pages on client sites in a very odd way: by replacing the hyphens with periods.
For example...
Correct permalink: www.domain.com/this-is-a-link
Incorrect: www.domain.com/this.is.a.link
Both work and send the user to the same page. But I'm not sure why. We tried various browsers and it seems to work the same in all of them. Normally, this would be helpful to the user (generally speaking) but it is skewing the analytics.
I suspect the ad campaign folks created a link with the periods which started the problem. But even with fixing that, it doesn't answer the question of why this even works or how to disallow this behavior / functionality.
Any thoughts?
WordPress uses mod_rewrite for permalinks, and mod_rewrite uses pattern matching on your URLs, via the rules in your .htaccess file, to decide what to rewrite and what not to rewrite.
In regular expression pattern matching, which is what mod_rewrite uses to decide what to rewrite, the . character means "any character".
To illustrate this, take your example
www.domain.com/this-is-a-link
as the correct link that you want. In the case of
www.domain.com/this.is.a.link
it will also match, because a single dot means any character, so . and - end up being treated alike.
You can read more about mod_rewrite to get a better understanding of why periods are being treated as dashes.
The only ways to solve this are to rewrite WordPress's default mod_rewrite pattern or, perhaps more appropriately, to report it to the core community as a bug. But this seems pretty common, even on large sites such as eBay. Given the URL
http://www.ebay.com/rpp/halloween-events/sweet-treats
the URL
http://www.ebay.com/rpp/halloween-events/sweet.treats
is also valid. I believe this is a limitation of mod_rewrite, so you might want to live with it.
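The regex point the answer relies on can be checked in isolation. The sketch below uses Python's re module only to show how an unescaped dot behaves compared with an escaped one; the two patterns are made up for the demonstration and are not WordPress's actual rewrite rules.

```python
import re

# Unescaped dots: each "." matches any single character, including "-".
unescaped = re.compile(r"this.is.a.link")
# Escaped dots: each "\." matches only a literal period.
escaped = re.compile(r"this\.is\.a\.link")


def matches(pattern, slug):
    """True when the whole slug matches the pattern."""
    return pattern.fullmatch(slug) is not None
```

Here matches(unescaped, "this-is-a-link") is True while matches(escaped, "this-is-a-link") is False, which is the difference between "any character" and a literal period.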

How should I sanitize urls so people don't put 漢字 or á or other things in them?

How should I sanitize urls so people don't put 漢字 or other things in them?
EDIT: I'm using java. The url will be generated from a question the user asks on a form. It seems StackOverflow just removed the offending characters, but it also turns an á into an a.
Is there a standard convention for doing this? Or does each developer just write their own version?
The process you're describing is slugify. There's no fixed mechanism for doing it; every framework handles it in their own way.
Yes, I would sanitize/remove. Otherwise it will either be inconsistent or look ugly when encoded.
If you are using Java, see the URLEncoder API docs.
Be careful! If you are removing elements such as odd chars, then two distinct inputs could yield the same stripped URL when they don't mean to.
The specification for URLs (RFC 1738, Dec. '94) poses a problem, in that it limits the use of allowed characters in URLs to only a limited subset of the US-ASCII character set
This means it will get encoded. URLs should be readable. Standards tend to be English biased (what's that? Langist? Languagist?).
Not sure what the convention is in other countries, but if I saw tons of encoding in a URL sent to me, I would think it was stupid or suspicious ...
Unless the link is displayed properly, encoded by the browser and decoded at the other end ... but do you want to take that risk?
StackOverflow seems to just remove those chars from the URL altogether :)
StackOverflow can afford to remove the characters because it includes the question ID in the URL. The slug containing the question title is for convenience, and isn't actually used by the site, AFAIK. For example, you can remove the slug and the link will still work fine: the question ID is what matters and is a simple mechanism for making links unique, even if two different question titles generate the same slug. Actually, you can verify this by trying to go to stackoverflow.com/questions/2106942/… and it will just take you back to this page.
Thanks Mike Spross
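The slugify step and the ID-based uniqueness described above can be sketched like this. It is a minimal Python sketch (the question is about Java, where java.text.Normalizer plays a similar role), and it is one possible convention rather than a standard; slugify and question_path are illustrative names, and the /questions/ prefix simply mirrors the URL shape quoted above.

```python
import re
import unicodedata


def slugify(text):
    """Fold accents (á -> a), drop other non-ASCII characters, hyphenate."""
    # NFKD splits "á" into "a" plus a combining accent; encoding to ASCII with
    # errors ignored then drops the accent and characters such as 漢字.
    folded = unicodedata.normalize("NFKD", text)
    ascii_text = folded.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")


def question_path(question_id, title):
    """Build a path whose uniqueness comes from the ID, not from the slug."""
    slug = slugify(title)
    if not slug:
        # A title made only of stripped characters still gets a valid path.
        return f"/questions/{question_id}"
    return f"/questions/{question_id}/{slug}"
```

Two titles such as "Café!" and "cafe?" both reduce to the slug "cafe", and a title consisting only of 漢字 reduces to an empty slug, which is why the ID has to carry the uniqueness.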
Which language are you talking about?
In PHP I think this is the easiest and would take care of everything:
http://us2.php.net/manual/en/function.urlencode.php

Token replacement

I currently implement a replace function in the page render method which replaces commonly used strings - such as replace [cfe] with the root to the customer front end. This is because the value may be different based on the version of the site - for example the root to the image folder ([imagepath]) is /Images on development and live, but /Test/Images on test.
I have a catalogue of products for which I would like to change [productName] to a link to the catalogue page for that product. I would like to go through the entire page and replace all instances of [someValue] with the relevant link. Currently I do this by looping through all the products in the product database and replacing [productName] with the link to the catalogue page for that product. However, this is limited to products which exist in the database. "Links" to products which have been removed currently won't be replaced, so [someValue] will be displayed to the user. This does not look good.
So you should be able to see my problem from this. Does anyone know of a way to achieve what I would like to easily? I could use regexes, but I don't have much experience of those. If this is the easiest way, using "For Each Match As String In Regex.Matches(blah, blah)" then I am willing to look further into this.
However, at some point I would like to take this further, for example setting page layouts such as 3 columns with an image top right using [layout type="3colImageTopRight" imageURL="imageURL"]Content here[/layout]. I think I could more or less do this now, but I can't figure out how to deal with it if the imageURL were, say, [Image:Product01.gif]. I think regex.match("[[a-zA-Z]{0,}]") would match just [layout type="3colImageTopRight" imageURL="[Image:Product01.gif] (it would not get to the end of the layout tag). Obviously the above wouldn't quite work, as I haven't included double quotes in the match string or anything, but you get the general idea of what I am trying to do.
Does anyone have any ideas or pointers which could help me with this? Also if this is not strictly token replacement then please point me to what it is, so I can further develop this.
Aristos - I hope re-explaining this resolves the confusion.
Thanks in advance,
Regards,
Richard Clarke
@RichardClarke - I would go with regular expressions; they're not as terrible to learn as you might think, and with a bit of careful usage they will solve your problems.
I've always found this a very useful tool:
http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx
goes nicely with a cheat sheet ;-)
http://www.addedbytes.com/cheat-sheets/regular-expressions-cheat-sheet/
Good luck.
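As a rough idea of what the regex approach could look like for the simple [productName] case, here is a minimal sketch in Python (the question itself is in VB.NET, where Regex.Replace with a match evaluator fills the same role). The names replace_tokens and links, and the choice to fall back to the plain product name for unknown tokens, are assumptions for illustration. The layout tags from the question are not handled here.

```python
import re
from html import escape

# One token: "[" then anything except brackets, then "]".
TOKEN_RE = re.compile(r"\[([^\[\]]+)\]")


def replace_tokens(page, links):
    """Replace every [name] in one pass: known names become links, unknown
    names fall back to plain text so no brackets reach the user."""
    def render(match):
        name = match.group(1)
        url = links.get(name)
        if url is None:
            return escape(name)  # product no longer in the catalogue
        return f'<a href="{escape(url, quote=True)}">{escape(name)}</a>'
    return TOKEN_RE.sub(render, page)
```

Because the loop is driven by the tokens found in the page rather than by the rows in the product table, a token for a removed product is still matched and can be given a sensible fallback.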
