Greetings.
I've been trying to grok regexes and rewrites, but I'm a newbie at it. I have a simple site, one directory only. All I want is to rewrite domain.com/file.php?var=value to domain.com, making sure the user will only ever see domain.com in the address bar throughout site navigation (even if I start making the site more complex).
Simply speaking, it's freezing the URL. If the site grows I'd rather take some "clean URL" approach, but I'm still at basic PHP and I sense I'd have to RTFM on HTTP/1.1.
You can use an nginx rewrite rule like this:
rewrite ^([^.]*)$ /index.php
to send all requests that do not contain a "." to your index.php script. In your PHP script, explode the request URI to get the passed parameters, then include the page you want:
# URL: /people/jack_wilson
$path = strtok($_SERVER['REQUEST_URI'], '?'); # drop any query string first
$url_vars = explode('/', $path); # array('', 'people', 'jack_wilson')
$page = $url_vars[1];
if ($page == 'people') {
    $username = $url_vars[2];
    require('people.php'); # handle things from here
}
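For context, here is a fuller nginx server block this rewrite might sit in. This is a sketch, not a drop-in config: the server name, root path, and PHP-FPM socket path are assumptions you'd adjust to your own setup.

```nginx
server {
    listen 80;
    server_name domain.com;          # assumed domain
    root /var/www/site;              # assumed document root

    # send every extensionless request to the front controller
    rewrite ^([^.]*)$ /index.php;

    location ~ \.php$ {
        include fastcgi_params;
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        fastcgi_pass unix:/var/run/php-fpm.sock;   # assumed PHP-FPM socket
    }
}
```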
This type of "clean" URL stuff is typically accomplished with a wrapper frame for the entire site. The browser shows the URL of that frame only, and the contents can be whatever.
But I actually don't recommend it, because the user may want to bookmark the "dirty" URL.
This type of URL obfuscation diminishes usability for advanced users.
Not to be a jerk, but I typically only see this type of URL obfuscation on "artsy" sites that care more about what their address bar looks like than usability of their site. Making your users happy via enhanced usability is a better long term approach IMHO.
I guess freezing the url is driven by marketing desires, so here's the downside to that, from marketing standpoint as well: your users won't be able to send a link to a page they liked to their friends via IM, email or facebook, so it actually decreases the appeal of your site even for the most clueless users.
I don't (knowledgeably) think it's safe to have PHP variables showing in the address bar (although they'd show in the status bar of some browsers anyway...). Ideally the user wouldn't know what the site is using behind the scenes - I'm configuring error pages for starters. And yes, for aesthetics as well. If I can rewrite to domain.com/foo/bar I'd be happy, assuming nginx would handle the "translation back to ugly URLs" if it got such a request, which I think it does with location directives. But having domain.com/file.php?var1=value&var2=value kind of annoys me; it makes me feel I'm exposing the site too much.
Frames are not recommended (especially because of search engines) and I'm trying to stick to XHTML 1.1 Strict (so far so good). If the content is well designed, easy to navigate, accessible regardless of browser choice, intuitive, etc., I guess I could live with cute URLs :)
I'd gladly grok through any RTFM material regarding good web design techniques, PHP, HTTP/1.1, and whatever made you go "Yeah! That's what I've been looking for!" at 4am.
Thanks :)
[If I understand this site right, this ought to be a reply to the original post, not to an answer... sorry]
You could also look into a PHP framework such as CodeIgniter that will handle all of the PHP side of this for you.
http://codeigniter.com/
Hi. I keep running across websites which, when browsed or searched (using their own search function), return a static URL, e.g. ?id=16 or default.aspx, no matter what page of the website you visit after the search has been performed. This becomes a problem when I want to go directly to a post/page within one of these sites, so I'm wondering: does anyone know how I could actually find out what the absolute URL is?
That way I could navigate straight to it. I'm not really familiar with coding, but I have tried looking in the page source and wasn't really able to glean anything from there.
The basics around asp.net urls: http://www.codeproject.com/Articles/142013/There-is-something-about-Paths-for-Asp-net-beginne
It all really depends on what you're trying to find; as far as finding a back way to locate an absolute path, it's highly doubtful. If the owner of the site (most blogs) wants you to have a permalink to a page, they use URL rewriting to put things like the page title in the URI. A lot of MVC sites do this now.
The '?id=16' you're seeing is just a query string, a holder for other logic they are doing.
I have a self-hosted wordpress blog, and as almost expected, I found there's another blog scraping my contents, posting a perfect copy of my own posts (texts, images not hotlinked but fetched and reupped to the clone's server, html layout within the posts) with a few hours of delay.
However, I must confess I'm infuriated to see that when I search Google for keywords relevant to my posts, the scraping clone always comes first.
So here I am, open to suggestions: would you know how to prevent my site from being successfully scraped?
Technical details:
the clone blog appears to be self-hosted, and so am I; I'm on a Debian + Webmin + Virtualmin dedicated server
my RSS feed is already truncated halfway with a "read more on" link. I just thought: I should publish a post while assigning it a date like 2001-01-01 and see if it appears on the clone blog; that would let me know whether my RSS feed is still being used as the signal for "hey, it's scraping time!"
my logs can't identify the scraper among legit traffic; either it's non-identifiable, or it's lost in the flood of legitimate traffic
I already htaccess-banned and iptables-banned the .com domain of the clone, my contents are still cloned nonetheless
the clone website makes use of reverse proxies, so I can't trace where it is hosted and what actual IPs should be blocked (well, unless I iptables-ignore-ban half of Europe to ban the whole IP ranges of its data storage facility, but I'm slightly reluctant to that !)
I'm confident this isn't hand-made, the cloning has been running for two years now, every day without fail
only my new posts are cloned, not the rest of my website (not the sidebars, not the WordPress pages as opposed to WordPress posts, not the single pages), so setting up a jail.html to log whoever opens it won't work; no honey-potting
when my posts contain internal links pointing to another page of my website, the posts on the clone won't be rewritten and will still point to my own website
I'd love help and suggestions with this issue. The problem isn't being cloned as such, but losing traffic to that bot while I'm the original publisher.
You can't really stop them in the end, but you might be able to find them and mess with them. Try hiding the request IP in an HTML comment, in white-on-white text, or just somewhere out of the way, then see which IPs show up in the copies. You can also obfuscate that text by turning it into a hex string, or make it look like an error code, so it's less obvious to someone who doesn't know what you're doing and they don't catch on.
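To sketch the hex-string idea (Python here purely for illustration; on a WordPress site you'd write the equivalent in PHP, and the function names below are mine):

```python
# Encode the visitor's IP as a hex token that can be buried in an HTML
# comment; to a casual reader it just looks like an opaque debug code.
def ip_token(ip):
    return ip.encode("ascii").hex()

def ip_from_token(token):
    return bytes.fromhex(token).decode("ascii")

def tracer_comment(ip):
    # Embed this in every served page, then grep the scraper's copies
    # for "ref:" tokens and decode them to recover the fetching IPs.
    return "<!-- ref:%s -->" % ip_token(ip)
```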
In the end, though, I'm not sure how much it will buy you. If they're really inattentive, rather than shutting them down and calling attention to the fact that you're onto them, you can feed them gibberish or whatever whenever one of their IPs crops up. That might be fun and it's not too hard to make a gibberish generator by putting sample texts into a Markov chain.
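A gibberish generator along those lines really is only a few lines. A minimal word-level Markov chain sketch in Python (feed it your own sample text):

```python
import random

def build_chain(text):
    # Map each word to the list of words that follow it in the sample.
    words = text.split()
    chain = {}
    for a, b in zip(words, words[1:]):
        chain.setdefault(a, []).append(b)
    return chain

def gibberish(chain, length=50, seed=None):
    # Walk the chain to produce plausible-looking nonsense.
    rng = random.Random(seed)
    word = rng.choice(list(chain))
    out = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:              # dead end: restart anywhere
            followers = list(chain)
        word = rng.choice(followers)
        out.append(word)
    return " ".join(out)
```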
EDIT: Oh, and if pages aren't rewritten too much, you might be able to add some inline JS to make them link to you, if they don't strip that. Say, a banner that only shows up if they're not at your site, giving the original link to your articles and suggesting that people read that.
Are you willing to shut down your RSS feed? If so, you could do something like:
function fb_disable_feed() {
wp_die( __('No feed available,please visit our homepage!') );
}
add_action('do_feed', 'fb_disable_feed', 1);
add_action('do_feed_rdf', 'fb_disable_feed', 1);
add_action('do_feed_rss', 'fb_disable_feed', 1);
add_action('do_feed_rss2', 'fb_disable_feed', 1);
add_action('do_feed_atom', 'fb_disable_feed', 1);
It means that if you go to a feed page, it just returns the message in wp_die() on line two. We use it for 'free' versions of our WP software with an if-statement, so they can't hook into their RSS feeds to link to their main website; it's an upsell opportunity for us. It works well, is my point, haha.
Even though this post is a little old, I thought it would still be helpful to weigh in, in case other people see it and have the same question. Since you've eliminated the RSS feed from the mix and you're pretty confident it isn't a manual effort, what you need to do is get better at stopping the bots they are using.
First, I would recommend banning proxy servers in your IPTables. You can get a list of known proxy server addresses from Maxmind. This should limit their ability to anonymize themselves.
Second, make it harder for them to scrape. You could accomplish this in a couple of ways. You could render part or all of your site in JavaScript; if nothing else, at least render the links in JavaScript. This will make it significantly harder for them to scrape you. Alternatively, you can put your content within an iframe inside the pages, which will also make it somewhat harder to crawl and scrape.
All this said, if they really want your content they will pretty easily get by these traps. Honestly, fighting off webscrapers is an arms race. You cannot put any static trap in place to stop them, instead you have to continuously evolve your tactics.
For full disclosure, I am a co-founder of Distil Networks, and we offer an anti-scraping solution as a service.
I'm using wordpress. I have a post that has a pretty awkward url, like:
http://example.com/dogfood/why-do-dogs-like-bacon-i-will-investigate
I want to change the url to something more succinct, in hopes that it would improve my search results ranking:
http://example.com/dogfood/dogs-and-bacon
But this article is pretty popular now, and many third party sites have links pointing to the original url.
Is the best solution here to:
1. Duplicate the same post content at the new URL.
2. Leave the old post as-is (the URL will remain intact).
3. Set up a 301 redirect in my .htaccess file from the old URL to the new URL.
This way, links to my old article will still work, but (hopefully) new searches for dog food bacon will start ranking my new URL higher? Or can I simply delete the old post after I set up the redirect in my .htaccess file - no need to keep the old URL around, actually?
Thanks
Here are my thoughts:
Duplicate content is almost always a bad idea.
I think that leaving the old post URL intact is just fine. There's really no reason to change it. If what you want is to generate more traffic, there are other, more "kosher" ways of doing so: get other sites to link to your article or write an update to the article in a new post and add a link to the current article. Besides, I think people are more likely to search for "Why do dogs like bacon?" than "dogs and bacon".
Technically a 301 redirect in your .htaccess file would be the best solution if you insist on changing the URL (which I discourage), but it would be a b*** to maintain for every post you want to redirect. Consider using a plugin that will do this for you, like Redirection.
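For reference, the .htaccess one-liner would look something like this (slugs taken from the question; requires Apache's mod_alias):

```apache
# Permanent redirect from the old slug to the new one
Redirect 301 /dogfood/why-do-dogs-like-bacon-i-will-investigate /dogfood/dogs-and-bacon
```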
Currently reading Bloch's Effective Java (2nd Edition) and he makes a point to state, in bold, that overusing POSTs in web applications is inherently bad. Unfortunately, he doesn't specify why.
This startled me, because when I do any web development, all I ever use are POSTs! I have always steered clear of GETs for security reasons and because it felt more professional (long, unsightly URLs always bother me for some reason).
Are there performance differences between GET and POST? Can anyone elaborate on why overusing POSTs is bad? My understanding - and preliminary searches - seems to indicate that these two are handled very similarly by the web server. Thanks in advance!
You should use HTTP as it's supposed to be used.
GET should be used for idempotent, read queries (i.e. view an item, search for a product, etc.).
POST should be used for create, delete or update requests (i.e. delete an item, update a profile, etc.)
GET lets the user refresh the page, bookmark it, and send the URL to someone; POST doesn't allow any of that. A useful pattern is post/redirect/get (AKA redirect after post).
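The post/redirect/get pattern in a minimal sketch, using Python's built-in http.server purely for illustration (the /submit and /thanks paths are invented):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class PRGHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Consume the form body, then redirect so that a browser
        # refresh re-issues a harmless GET instead of a second POST.
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)
        self.send_response(303)              # "See Other": follow-up is a GET
        self.send_header("Location", "/thanks")
        self.end_headers()

    def do_GET(self):
        # The page the browser lands on: refreshable and bookmarkable.
        body = b"Saved. Safe to refresh or bookmark."
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```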
Note that, except for long search forms, GET URLs should be short. They should usually look like http://www.foo.com/app/product/view?productId=1245, or even http://www.foo.com/app/product/view/1245
You should almost always use GET when requesting content. Only use POST when you are either:
Transmitting sensitive information which should not appear in the URL bar, or
Changing state on the server (adding/changing/deleting stuff, although recently some web applications use POST to change, PUT to add and DELETE to delete.)
Here's the difference: If you want to give the link to the page to a friend, or save it somewhere, or even only add it to your bookmarks, you need the full URL of the page. Just like your address bar should say http://stackoverflow.com/questions/7810876/abusing-http-post at the moment. You can Ctrl-C that. You can save that. Enter that link again, you're back at this page.
Now when you use any action other than GET, there is simply no URL to copy. It's as if your browser said you were at http://stackoverflow.com/question. You can't copy that. You can't bookmark that. Besides, if you tried to reload this page, your browser would ask whether you want to send the data again, which is rather confusing for the non-tech-savvy users of your page. And annoying for everyone else.
However, you should use POST/PUT when transferring data. URLs can only be so long; you can't transmit an entire blog post in a URL. Also, if you reload such a page, you risk double-posting.
GET and POST are very different. Choose the right one for the job.
If you are using POST for security reasons, I might drop a mention of other security factors here. You need to ensure that you send the data from a form submit in encrypted form even if you are using POST.
As for the difference between GET and POST, it is as simple as this: GET is used to send a GET request, i.e. to fetch data from a page and act upon it, and that is the end of it.
POST, on the other hand, is used to POST data to the application. I am talking about transactions here (create, update or delete operations).
If you have a sensitive application that takes, say, an ID to delete a user, you would not want to use GET for it, because a witty user could cause mayhem simply by changing the ID at the end of the URL and deleting random users.
POST allows more data, and can also be used to send streams of files. GET has a limited size.
There is hardly any performance tradeoff between GET and POST.
I have been searching around for a way to simply request webpages with HTML5. Specifically, I want to do an HTTP(S) request to a different site. It sounds like this is not possible at first due to obvious security reasons, but I have been told this is possible (maybe with WebSockets?).
I don't want to use iframes, unless there is a way to make it so the external site being requested does not know that it is being requested through an iframe.
I've been having a really difficult time finding this information. All I want to do is load a webpage, and display it.
Does anyone have some code they could share, or some suggestions?
Thanks.
Sounds like you are trying to to circumvent Same Origin Policy. Don't do that :)
If it is just the request you want (and nothing else), there are a number of ways to do it, namely with new Image in JavaScript or iframe. The site will know it is in an iframe by checking top.location.href != window.location.href.
Funneling all your requests through a proxy might be a solution: the client addresses only the proxy, and the actual URL to retrieve would be a query string parameter.
Best regards, Carsten
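That proxy idea in a minimal stdlib-only Python sketch (illustration only; a real deployment must whitelist target hosts, otherwise this is an open proxy anyone can abuse):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs
from urllib.request import urlopen

class ProxyHandler(BaseHTTPRequestHandler):
    """Fetch the page named in ?url=... server-side and relay it, so the
    browser only ever talks to our own origin (no Same Origin Policy issue)."""

    def do_GET(self):
        query = parse_qs(urlparse(self.path).query)
        target = query.get("url", [None])[0]
        if target is None:
            self.send_error(400, "missing url parameter")
            return
        # NOTE: validate/whitelist `target` here before going live.
        with urlopen(target) as upstream:   # server-side fetch
            body = upstream.read()
            ctype = upstream.headers.get("Content-Type", "application/octet-stream")
        self.send_response(200)
        self.send_header("Content-Type", ctype)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```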