Scraping a site only works once - WordPress

I tried to scrape a specific field on Toys R Us's page...
http://www.toysrus.com/product/index.jsp?productId=13157031
with the selector "price".
It worked once on the first page load, then it never worked again. Do some sites have ways of preventing people from scraping their content? I'm kinda new to this, please be gentle. I was using WordPress, WP-Web-Scraper, and the following as the code in the page:
Price:
[wpws url="http://www.toysrus.com/product/index.jsp?productId=13157031"
selector="price" on_error="error_show" user-agent="diaperbot"]

markratledge is right. The immediate thing to consider is changing your user agent so you aren't identifying yourself. Here's a helpful link to a list of the most common/popular agents: http://techblog.willshouse.com/2012/01/03/most-common-user-agents/. Your IP is the other big thing. If you are scraping from only one IP, depending on your volume, you could get blocked quickly. You'll likely need to use a proxy. There are many out there, ranging from free to paid. I've found Ntrepid's tools to be useful (https://ion.ntrepidcorp.com/).

Do some sites have ways of preventing people from scraping their content?
Yes, they do. They might have detected the user-agent in your query and blocked your IP.
Why? Read the TOS about downloading their content: http://www.toysrus.com/helpdesk/index.jsp?display=safety&subdisplay=terms
That WP plugin is fairly primitive. If you want to scrape sites more efficiently and with better results, use Python, which has libraries well suited to scraping. Check http://www.google.com/search?q=python+scraper+tutorial
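For instance, a minimal sketch with the requests and BeautifulSoup libraries might look like the following (the browser-style User-Agent string and the assumption that the price sits in an element with class "price" are mine, not verified against the live page):

import requests
from bs4 import BeautifulSoup

URL = "http://www.toysrus.com/product/index.jsp?productId=13157031"
HEADERS = {
    # a common desktop browser UA instead of a custom bot name
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0 Safari/537.36",
}

response = requests.get(URL, headers=HEADERS, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
price = soup.find(class_="price")  # same "price" selector the shortcode used
print(price.get_text(strip=True) if price else "price element not found")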

The web scraper plugin has some issues with its cache; set the cache time to 0!

Related

How to write a crawler to crawl data from Instagram?

Is there any way of crawling / monitoring Instagram data for research purposes?
I tried the official API, but it only works in sandbox mode, which makes it impossible to crawl real info like followers. I need to monitor certain accounts, extend the range through their followers / followings, and track their behavior (e.g. how the number of likes progresses).
So does anybody have some suggestions, or could you give me some references about related crawling tasks?
Maybe you can try using the Beautiful Soup library and read the book by Ryan Mitchell: Web Scraping with Python. Basically, to get started you should understand the DOM, regexes, and how to algorithmically jump from page to page.
Also, check a site's ToS before you start, and know that it may have guidelines/rules against scraping; generally all sites these days have a robots.txt file that specifies how/what you're allowed to scrape.
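As a rough illustration of the jump-from-page-to-page idea with Beautiful Soup (the start URL and CSS selectors below are placeholders, not real Instagram endpoints; Instagram itself is heavily JavaScript-driven, so plain HTML scraping may not get far there):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "https://example.com/listing?page=1"  # placeholder start page
seen = set()

while url and url not in seen:
    seen.add(url)
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")

    for item in soup.select(".item-title"):  # placeholder selector for the data you want
        print(item.get_text(strip=True))

    next_link = soup.select_one("a.next")    # placeholder pagination link
    url = urljoin(url, next_link["href"]) if next_link else None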

Import.io - Can it replace Kimonolabs

I use Kimonolabs right now for scraping data from websites that have the same goal. To make it easy, let's say these websites are online shops selling stuff online (actually they are job websites with online application possibilities, but technically they look a lot like a webshop).
This works great. For each website a scraper API is created that goes through the available advanced search page to crawl all product URLs. Let's call this API the 'URL list'. Then a 'product API' is created for the product detail page that scrapes all necessary elements, e.g. the title, product text, and specs like the brand, category, etc. The product API is set to crawl daily using all the URLs gathered in the 'URL list'.
The gathered information for all products is then fetched from the Kimonolabs JSON endpoint by our own service.
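Roughly, that last fetch step looks like this sketch (the endpoint URL, API key and JSON field names are placeholders, not our real setup):

import requests

ENDPOINT = "https://www.kimonolabs.com/api/YOUR_API_ID?apikey=YOUR_KEY"  # placeholder

data = requests.get(ENDPOINT, timeout=30).json()
for product in data.get("results", {}).get("collection1", []):
    print(product.get("title"), product.get("brand"), product.get("category"))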
However, Kimonolabs will shut down its service at the end of February 2016 :-(. So I'm looking for an easy alternative. I've been looking at import.io, but I'm wondering:
Does it support automatic updates (letting the API scrape hourly/daily/etc)?
Does it support fetching all product URLs from a paginated advanced search page?
I'm tinkering around with the service. Basically, it seems to extract data via the same easy process as Kimonolabs. It's just unclear to me whether paginating the URLs needed for the product API and automatically keeping it up to date are supported.
Are there any import.io users here who can advise whether import.io is a useful alternative for this? Maybe even give some pointers in the right direction?
Look into Portia. It's an open source visual scraping tool that works like Kimono.
Portia is also available as a service and it fulfills the requirements you have for import.io:
automatic updates, by scheduling periodic jobs to crawl the pages you want, keeping your data up-to-date.
navigation through pagination links, based on URL patterns that you can define.
Full disclosure: I work at Scrapinghub, the lead maintainer of Portia.
Maybe you want to give Extracty a try. It's a free web scraping tool that allows you to create endpoints that extract any information and return it in JSON. It can easily handle paginated searches.
If you know a bit of JS you can write CasperJS endpoints and integrate any logic that you need to extract your data. It has a similar goal to Kimonolabs and can solve the same problems (if not more, since it's programmable).
If Extracty does not solve your needs you can checkout these other market players that aim for similar goals:
Import.io (as you already mentioned)
Mozenda
Cloudscrape
TrooclickAPI
FiveFilters
Disclaimer: I am a co-founder of the company behind Extracty.
I'm not that fond of Import.io, but it seems to me it allows pagination through bulk input URLs. Read here.
So far there hasn't been much progress in getting a whole website through the API:
Chain more than one API/Dataset: It is currently not possible to fully automate the extraction of a whole website with Chain API.
For example, if I want data that is found within category pages or paginated lists, I first have to create a list of URLs, run Bulk Extract, save the result as an import data set, and then chain it to another Extractor. Once set up, I would like to be able to do this more automatically, in one click.
P.S. If you are at all familiar with JS you might find this useful.
Regarding automatic updates:
This is a beta feature right now. I'm testing this for myself after migrating from Kimonolabs... You can enable this for your own APIs by appending &bulkSchedule=1 to your API URL. Then you will see a "Schedule" tab. In the "Configure" tab, select "Bulk Extract" and add your URLs; after this, the scheduler will run daily or weekly.

How Does an RSS Feed Work?

How's it going?
I've found a lot of detailed answers to specific problems with RSS feeds, but I can't really figure out how you actually USE one.
Could someone explain?
I see the RSS feed icon at the top of a lot of WordPress sites, including my own, but when I click it, it just seems to be a long XML file. I don't know what to do with it, or even why it would be there.
How do you use this? Are you meant to hit it with an API request, or is there a particular kind of software that you use?
Cheers
Before telling you what RSS is, let me describe a common problem that many people have.
Say there is a bunch of sites that you really like, and it's sort of a daily routine for you to go through them. They may be a news site, your friend's blog, but also Craigslist because you're currently looking for a new house, and maybe a weather site to know how late you should stay at work :)
The first thing you do when you get to work is open your web browser and these sites in new tabs. It's not particularly cumbersome because there are just 4 sites. But think about it: maybe there is a new blog that you start to like, and oh, these cartoons are really funny. Maybe there is also a bit of financial info that you're interested in, and the pictures that your brother is posting to Flickr every couple of days: they just had a new baby! Also, as you're trying to buy a house, you'd love a little raise, and you've figured that your boss really likes it when you tell her that you've read about your company in the news or when you tell her about a new competing product... There is also StackOverflow. You're desperately trying to get this "expert" badge and boost your reputation: this may help with your boss too, or even when you're looking for a new job.
Opening all these tabs is starting to take a toll and you keep forgetting an important one. You're also slowly getting tired of the different reading experience that all these sites have: small fonts, large fonts, ads all over... etc. Now you have a problem.
Imagine there is a tool that does the following: you can tell it what sites you care about, and then, this tool will look up the new stuff for you. It will show everything in a nice looking format. It should also help you identify what's really worth seeing ASAP or maybe have some kind of "serendipity" mode that you can go into and find interesting stuff that you would have missed otherwise. The tool will obviously send you to the original sites should you need more info about any particular story or classified...
This tool exists. It's usually called a Reader, mostly because it lets you read more things online. Oftentimes you'll see them called "RSS readers", because RSS is what they use to get the information from all these sites. RSS is the pipe. You as a user should probably not know about it, but that's what the readers depend on. In an ideal world, when you're on a site you like, you should just hit "follow" on a button like this one and then you'd be redirected to your reader of choice. Later, when new content is added, you'll get it straight in your reader.
To get a bit into more technical details, RSS (like Atom) is an XML flavor. It's a collection (mostly reverse chronological) of entries. Entries have at least a title and a link to the actual story. They should also include a unique identifier and could have other elements like a description, an image, tags, author information... etc.
RSS is great because it's content agnostic. It can be used to represent a lot of different things (as described in the little story) and decouples the publishing platform from the subscribing platform: they don't even know the other one exists. RSS is their lingua-franca.
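To make the structure concrete, here is a tiny Python sketch that parses a hand-written RSS snippet with the standard library and prints each entry's title and link:

import xml.etree.ElementTree as ET

SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <item>
      <title>First post</title>
      <link>https://example.com/first-post</link>
      <guid>https://example.com/first-post</guid>
      <description>A short summary of the post.</description>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(SAMPLE_RSS)
for item in root.iter("item"):
    print(item.findtext("title"), "->", item.findtext("link"))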
I wrote a blog post about this very question not long ago. Here's the link if you're interested in reading my personal interpretation. https://www.rss.com/whatisrss
An RSS XML file contains all the content of a page, stripped of any presentation markup. The XML represents the data in its rawest, most descriptive form. Many readers can interpret XML sources from a variety of places, and each formats all of the data in its own way.

Hukkster technology stack

I started using Hukkster.com a few days ago. It is really fast and accurate.
Hukkster's bookmarklet always fetches the correct price from the product page.
This happens for all the featured merchants it supports.
I was really curious to know what technology stack they might be using for such a fast and accurate response.
I have tried to search for everything I could on Google. I found nothing other than Hukkster success stories, Hukkster in the news, etc.
There was nothing related to the technology used by Hukkster.
It is Mozenda.
Found it. Here it is:
http://blogs.wsj.com/venturecapital/2012/08/29/the-founders-creators-of-new-shopping-app-hukkster-definitely-not-brogrammers/
The co-founders believed in their idea. There was just one problem–neither one knew how to code. They didn’t let that stop them. They developed a “paper prototype” that they could run without coding. They built a crawler using a data extraction service called Mozenda, and did the rest of Hukkster’s legwork with spreadsheets, emails and phones.
http://www.mozenda.com/

Crawling wikipedia

I'm crawling Wikipedia using a website downloader for Windows. I was looking through all the options in this tool to find a way to download Wikipedia pages for a specific period, for example from 2005 until now.
Does anyone have any idea about crawling the website for a specific period of time?
Why not download the SQL database containing all of Wikipedia?
You can then query it using SQL.
Give the Wikipedia API and your programming skills a try.
There should be no need to do web scraping; use the MediaWiki API to directly request the information you want. I'm not sure what you mean by "wikipedia pages for a specific period" - do you mean last edited at a certain time? If so, while skimming, I noticed an API call that lets you get a look at the last n revisions; just ask for the last revision and see what its date is.
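For example, a small Python sketch against the MediaWiki API that asks for the latest revision of one article and prints its timestamp (the article title is just an example; bump rvlimit for more history):

import requests

params = {
    "action": "query",
    "prop": "revisions",
    "titles": "Web scraping",   # example article title
    "rvlimit": 1,               # only the most recent revision
    "rvprop": "timestamp|user",
    "format": "json",
}
data = requests.get("https://en.wikipedia.org/w/api.php", params=params, timeout=10).json()

for page in data["query"]["pages"].values():
    print(page["title"], page["revisions"][0]["timestamp"])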
It depends on whether the website in question offers an archive, and most don't, so it's not possible in a straightforward way to crawl a sample starting from a specific date. But you can implement some intelligence in your crawler to read the page creation date or something like that.
You can also look at the Wikipedia API at http://en.wikipedia.org/w/api.php

Resources