Defensive web scraping techniques for a Scrapy spider - web-scraping

I have been web scraping for about 3 months now, and I have noticed that many of my spiders need to be constantly babysat because of websites changing. I use Scrapy, Python, and Crawlera to scrape my sites. For example, 2 weeks ago I created a spider and just had to rebuild it because the website changed their meta tags from singular to plural (so location became locations). Such a small change shouldn't be able to break my spiders, so I would like to take a more defensive approach to my collections moving forward. Does anyone have any advice for web scraping that allows for less babysitting? Thank you in advance!

Since you didn't post any code I can only give general advice.
Check whether there's a hidden API that retrieves the data you're looking for.
Load the page in Chrome, open the developer tools with F12 and look under the Network tab. Press Ctrl + F to search for the on-screen text you want to collect. If you find a request under the Network tab that returns the data as JSON, that is more reliable, since the backend of a website tends to change less frequently than the frontend.
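For example, if the Network tab does reveal such a JSON endpoint, the spider can request it directly instead of parsing HTML. A minimal sketch, where the endpoint URL and the field names are placeholders for whatever you actually find:

import json

import scrapy


class ListingsApiSpider(scrapy.Spider):
    name = "listings_api"
    # Hypothetical endpoint found under the Network tab - replace with the real one
    start_urls = ["https://example.com/api/listings?page=1"]

    def parse(self, response):
        data = json.loads(response.text)
        # Field names are placeholders; match them to the JSON you actually see
        for item in data.get("results", []):
            yield {
                "title": item.get("title"),
                "location": item.get("location"),
            }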
Be less specific with selectors. Instead of body > .content > #datatable > .row::text, use #datatable > .row::text; your spider will then be less likely to break on small layout changes.
Handle errors with try/except so that inconsistent or missing data doesn't abort the whole parse function.
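Putting the last two points together, a parse callback might look roughly like this; every selector and field name below is only illustrative:

import scrapy


class DefensiveSpider(scrapy.Spider):
    name = "defensive"
    start_urls = ["https://example.com/listings"]  # placeholder

    def parse(self, response):
        # A substring match tolerates the meta tag being renamed from
        # "location" to "locations", the kind of change mentioned above
        location = response.css('meta[name*="location"]::attr(content)').get()
        # Shorter selector: anchored only on the table id, not the full path
        for row in response.css("#datatable > .row"):
            try:
                price_text = row.css(".price::text").get(default="")
                yield {
                    "title": row.css(".title::text").get(),
                    "price": float(price_text.strip()) if price_text.strip() else None,
                    "location": location,
                }
            except ValueError:
                # One badly formatted row shouldn't end the whole parse
                self.logger.warning("Skipping malformed row on %s", response.url)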

Related

Scrape a page when URL does not change with page number - R

I want to scrape all the URLs from this page:
http://www.domainia.nl/QuarantaineList.aspx
I am able to scrape the first page, but I cannot move to the next page, because the page number is not in the URL. So how can I change pages while scraping? I've been looking into RSelenium, but could not get it working.
I'm running the following code to get at least the first page:
# Constructing the URLs to scrape
library(rvest)    # read_html(), html_nodes(), html_text(); also provides %>%
library(stringr)  # str_subset()

baseURL <- "http://www.domainia.nl/quarantaine/"
date <- gsub("-", "/", Sys.Date())
URL <- paste0(baseURL, date)
# Scraping the page
page <- read_html(URL) %>% html_nodes("td") %>% html_text()
links <- str_subset(page, pattern = "^\r\n.*.nl$")
links <- gsub(pattern = "\r\n", "", links) %>% trimws
I've looked at the site; it's using a JavaScript POST to refresh its contents.
Originally an HTTP POST was meant to send information to a server, for example the contents of a form somebody filled in. As such, it often includes information about the page you are coming from, which means you will probably need more information than just "page n".
If you want to get another page, like your browser would show you, you need to send a similar request. The httr package includes a POST function; I think you should take a look at that.
For knowing what to post, I think it's most useful to capture what your browser does and copy that. In Chrome, you can use Inspect and its Network tab to see what is sent and received; I bet other browsers have similar tools.
However, it looks like that website makes its money by showing that information, and if some other source showed the same things, they'd lose money. Therefore I doubt it's that easy to emulate; I suspect some part of the request differs every time, yet needs to be exactly right. For example, they could build checks to see whether the entire page was rendered, instead of discarded as you do. So I wouldn't be surprised if they intentionally make it very hard to do what you are trying to do.
Which brings me to an entirely different solution: ask them!
When I tried scraping a website with dynamically generated content for the first time, I was struggling as well. Until I explored the website some more, and saw that they had a link where you could download the entire thing, tidied up, in a nice csv-format.
And for a web server, people scraping the website is often inconvenient: it demands a lot more resources from the server than someone simply downloading a file.
It's quite possible they'll tell you "no", but if they really don't want you to get their data, I bet they've made it difficult to scrape. Maybe you'll just get banned if you make too many requests from the same IP, maybe some other method.
And it's also entirely possible that they don't want their data in the hands of a competitor, but that they'll give it to you if you only use it for a particular purpose.
(too big for a comment and it also has a salient image, but not an answer, per se)
Emil is spot on, except that this is an ASP.NET/SharePoint-esque site with binary "view states" and other really daft web practices that will make it nigh impossible to scrape with just httr.
When you do use the Network tab (again, as Emil astutely suggests), you can also use curlconverter to automatically build httr VERB functions out of requests "Copied as cURL".
For this site, assuming it's legal to scrape (it has no robots.txt, and I am not fluent in Dutch and did not see an obvious "terms and conditions"-like link), you can use something like splashr or Selenium to navigate, click and scrape, since they act like a real browser.
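As a rough illustration of that route, here is a sketch using Selenium's Python bindings (RSelenium follows the same flow); the pager locator below is a guess you would need to adjust after inspecting the real page:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get("http://www.domainia.nl/QuarantaineList.aspx")

cells = []
for page_number in range(2, 6):  # walk the first few pages as an example
    # Wait until the table cells are rendered before reading them
    WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, "td"))
    )
    cells.extend(td.text for td in driver.find_elements(By.CSS_SELECTOR, "td"))
    # Guessing the pager is a numbered link; inspect the page to find the real locator
    driver.find_element(By.LINK_TEXT, str(page_number)).click()

driver.quit()

In practice you may also need to wait for the old table to go stale after each click, since the postback replaces the content in place.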

Import.io - Can it replace Kimonolabs?

I use Kimonolabs right now for scraping data from websites that share the same goal. To make it easy, let's say these websites are online shops selling stuff (actually they are job websites with online application possibilities, but technically it looks a lot like a webshop).
This works great. For each website a scraper API is created that goes through the available advanced search page to crawl all product URLs. Let's call this API the 'URL list'. Then a 'product API' is created for the product detail page that scrapes all necessary elements, e.g. the title, product text and specs like the brand, category, etc. The product API is set to crawl daily using all the URLs gathered in the 'URL list'.
Then the gathered information for all products is fetched from the Kimonolabs JSON endpoint by our own service.
However, Kimonolabs will quit its service at the end of February 2016 :-(. So I'm looking for an easy alternative. I've been looking at import.io, but I'm wondering:
Does it support automatic updates (letting the API scrape hourly/daily/etc.)?
Does it support fetching all product URLs from a paginated advanced search page?
I'm tinkering around with the service. Basically, it seems to extract data via the same easy process as Kimonolabs. It's just unclear to me whether paginating the URLs needed for the product API, and automatically keeping it up to date, are supported.
Are there any import.io users here who can advise whether import.io is a useful alternative for this? Maybe even give some pointers in the right direction?
Look into Portia. It's an open source visual scraping tool that works like Kimono.
Portia is also available as a service and it fulfills the requirements you have for import.io:
automatic updates, by scheduling periodic jobs to crawl the pages you want, keeping your data up-to-date.
navigation through pagination links, based on URL patterns that you can define.
Full disclosure: I work at Scrapinghub, the lead maintainer of Portia.
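For a sense of what the two-stage setup from the question (the 'URL list' plus the product API) looks like in code, here is a rough, generic Scrapy sketch (Portia itself is built on top of Scrapy); every URL and selector below is a placeholder:

import scrapy


class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example-shop.com/search?page=1"]  # placeholder

    def parse(self, response):
        # Stage 1, the "URL list": collect every product URL on the results page
        for href in response.css("a.product-link::attr(href)").getall():
            yield response.follow(href, callback=self.parse_product)
        # Follow the pagination until it runs out
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_product(self, response):
        # Stage 2, the "product API": scrape the detail page
        yield {
            "url": response.url,
            "title": response.css("h1::text").get(),
            "brand": response.css(".brand::text").get(),
            "category": response.css(".category::text").get(),
        }

Scheduling a daily crawl is then a matter of a cron job or a scheduler such as Scrapyd or the Scrapinghub platform.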
Maybe you want to give Extracty a try. It's a free web scraping tool that allows you to create endpoints that extract any information and return it as JSON. It can easily handle paginated searches.
If you know a bit of JS you can write CasperJS endpoints and integrate any logic that you need to extract your data. It has a similar goal to Kimonolabs and can solve the same problems (if not more, since it's programmable).
If Extracty does not meet your needs, you can check out these other market players that aim for similar goals:
Import.io (as you already mentioned)
Mozenda
Cloudscrape
TrooclickAPI
FiveFilters
Disclaimer: I am a co-founder of the company behind Extracty.
I'm not that fond of Import.io, but it seems to me it allows pagination through bulk input URLs. Read here.
So far there has not been much progress in getting the whole website through the API:
Chain more than one API/Dataset: it is currently not possible to fully automate the extraction of a whole website with Chain API.
For example, if I want data that is found within category pages or paginated lists, I first have to create a list of URLs, run Bulk Extract, save the result as an imported data set, and then chain it to another Extractor. Once set up, I would like to be able to do this in one click, more automatically.
P.S. If you are somewhat familiar with JS you might find this useful.
Regarding automatic updates:
This is a beta feature right now. I'm testing this for myself after migrating from Kimonolabs... You can enable this for your own APIs by appending &bulkSchedule=1 to your API URL. Then you will see a "Schedule" tab. In the "Configure" tab, select "Bulk Extract" and add your URLs; after this, the scheduler will run daily or weekly.

How Does an RSS Feed Work?

How's it going?
I've found a lot of more detailed answers about specific problems with RSS feeds, but I can't really figure out how you actually USE one.
Could someone explain?
I see the RSS feed icon at the top of a lot of WordPress sites, including my own, but when I click it, it just seems to be a long XML file. I don't know what to do with it, or even why it would be there.
How do you use this? Are you meant to hit it with an API request, or is there a particular kind of software that you use?
Cheers
Before telling you what RSS is, let me describe a common problem that many people have.
Say there is a bunch of sites that you really like and it's sort of a daily routine for you to go through them. They may be a news site, your friend's blog, but also Craigslist because you're currently looking for a new house, and maybe a weather site to know how late you should stay at work :)
The first thing you do when you get to work is open your web browser and these sites in new tabs. It's not particularly cumbersome because there are just 4 sites. But think about it: maybe there is a new blog that you start to like, and oh, these cartoons are really funny. Maybe there is also a bit of financial info that you're interested in, and the pictures that your brother is posting to Flickr every couple of days: they just had a new baby! Also, as you're trying to buy a house, you'd love a little raise, and you've figured that your boss really likes it when you tell her that you've read about your company in the news or when you tell her about a new competing product... There is also StackOverflow. You're desperately trying to get this "expert" badge and boost up your reputation: this may help with your boss too, or even when you're looking for a new job.
Opening all these tabs is starting to take a toll and you keep forgetting an important one. You're also slowly getting tired of the different reading experience that all these sites have: small fonts, large fonts, ads all over... etc. Now you have a problem.
Imagine there is a tool that does the following: you can tell it what sites you care about, and then, this tool will look up the new stuff for you. It will show everything in a nice looking format. It should also help you identify what's really worth seeing ASAP or maybe have some kind of "serendipity" mode that you can go into and find interesting stuff that you would have missed otherwise. The tool will obviously send you to the original sites should you need more info about any particular story or classified...
This tool exists. It's usually called a reader, mostly because it lets you read more things online. Often you'll see them called "RSS readers", because RSS is what they use to get the information from all these sites. RSS is the pipe. You as a user should probably not have to know about it, but that's what readers depend on. In an ideal world, when you're on a site you like, you would just hit a "follow" button and be redirected to your reader of choice. Later, when new content is added, you'll get it straight in your reader.
To get a bit into more technical details, RSS (like Atom) is an XML flavor. It's a collection (mostly reverse chronological) of entries. Entries have at least a title and a link to the actual story. They should also include a unique identifier and could have other elements like a description, an image, tags, author information... etc.
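To make that concrete, here is a minimal, made-up RSS 2.0 document and the few lines of Python (standard library only) needed to read its entries:

import xml.etree.ElementTree as ET

# A made-up, minimal RSS 2.0 feed with a single entry ("item")
rss = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <link>https://example.com/</link>
    <item>
      <title>First post</title>
      <link>https://example.com/first-post</link>
      <guid>https://example.com/first-post</guid>
      <description>A short summary of the post.</description>
      <pubDate>Mon, 01 Jan 2018 00:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(rss)
for item in root.iter("item"):
    print(item.findtext("title"), "->", item.findtext("link"))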
RSS is great because it's content agnostic. It can be used to represent a lot of different things (as described in the little story) and it decouples the publishing platform from the subscribing platform: they don't even know the other one exists. RSS is their lingua franca.
I wrote a blog post about this very question not long ago. Here's the link if you're interested in reading my personal interpretation. https://www.rss.com/whatisrss
An XML file is the content of a page without any of the presentation markup; the XML represents the data in its rawest, most descriptive form. Many readers can interpret XML feeds from a variety of places and format all of the data in their own way.

Application to list the page elements of a URL

I need to make an application which will access a URL (like http://google.com), return the time spent loading all elements (images, CSS, JS, ...), and compare these results with previous results.
The application needs to be a desktop app, and I will save the information in a text file or XML and use this file to compare with previous results.
I have searched for a similar application, but found nothing...
There are some plugins for Firefox that list these elements, like YSlow or Firebug, but they are not what I need.
So I'm totally lost and I don't know how to start this work.
Is it even possible to make this application? Which language is best suited for this type of application?
Thanks!
This is a very open-ended question, so without you elaborating more on your requirements, you may not get any useful answers.
Some things you would need to answer are: how many URLs you want to check, where you want to store the results (database, files, etc.), and whether it needs to run on the desktop or on a server.
Personally, I like the statistics that cURL gives you - DNS time, connect time, receive time, etc. - so you could write something in PHP, but as I stress, that is a personal preference and may not suit your situation.
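Those libcurl timing counters are available from other bindings too, not just PHP's curl_getinfo(); as a rough sketch, here is how you could read them from Python with pycurl (the URL is just a placeholder):

import io

import pycurl

buf = io.BytesIO()
c = pycurl.Curl()
c.setopt(pycurl.URL, "http://google.com")  # page to measure
c.setopt(pycurl.WRITEDATA, buf)            # keep the body out of stdout
c.setopt(pycurl.FOLLOWLOCATION, True)
c.perform()

print("DNS lookup:  %.3f s" % c.getinfo(pycurl.NAMELOOKUP_TIME))
print("Connect:     %.3f s" % c.getinfo(pycurl.CONNECT_TIME))
print("First byte:  %.3f s" % c.getinfo(pycurl.STARTTRANSFER_TIME))
print("Total:       %.3f s" % c.getinfo(pycurl.TOTAL_TIME))
c.close()

Note that this times only the HTML document itself; to cover every image, CSS and JS file you would repeat the measurement for each resource referenced by the page, or drive a real browser instead.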

How to scrape websites such as Hype Machine?

I'm curious about website scraping (i.e. how it's done, etc.); specifically, I'd like to write a script to perform the task for the site Hype Machine.
I'm actually a software engineering undergraduate (4th year); however, we don't really cover any web programming, so my understanding of JavaScript / RESTful APIs / all things web is pretty limited, as we're mainly focused on theory and client-side applications.
Any help or directions greatly appreciated.
The first thing to look for is whether the site already offers some sort of structured data, or if you need to parse through the HTML yourself. Looks like there is an RSS feed of latest songs. If that's what you're looking for, it would be good to start there.
You can use a scripting language to download the feed and parse it. I use Python, but you could pick a different scripting language if you like. Here are some docs on how you might download a URL in Python and parse XML in Python.
Another thing to be conscious of when you write a program that downloads a site or RSS feed is how often your scraping script runs. If you have it run constantly so that you'll get the new data the second it becomes available, you'll put a lot of load on the site, and there's a good chance they'll block you. Try not to run your script more often than you need to.
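A small script along these lines covers both points: it downloads the feed, parses out the entries, and is meant to be run from a scheduler such as cron rather than in a constant loop (the feed URL is a placeholder):

import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = "https://example.com/feed"  # placeholder for the feed you found

# Download and parse the feed once per run; schedule the script (e.g. hourly
# via cron) instead of looping constantly, so you don't hammer the site.
with urllib.request.urlopen(FEED_URL) as response:
    tree = ET.parse(response)

for item in tree.iter("item"):
    print(item.findtext("title"), item.findtext("link"))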
You may want to check the following books:
"Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL"
http://www.amazon.com/Webbots-Spiders-Screen-Scrapers-Developing/dp/1593271204
"HTTP Programming Recipes for C# Bots"
http://www.amazon.com/HTTP-Programming-Recipes-C-Bots/dp/0977320677
"HTTP Programming Recipes for Java Bots"
http://www.amazon.com/HTTP-Programming-Recipes-Java-Bots/dp/0977320669
I believe the most important thing you must analyze is what kind of information you want to extract. If you want to extract entire websites like Google does, your best option is probably to look at tools like Nutch from Apache.org or the Flaptor solution (http://ww.hounder.org). If you need to extract particular areas of unstructured documents - websites, docs, PDFs - you can probably extend Nutch plugins to fit your particular needs (nutch.apache.org).
On the other hand, if you need to extract particular text or clipped areas of a website, where you set rules using the DOM of the page, what you need is probably closer to tools like mozenda.com. With those tools you will be able to set up extraction rules in order to scrape particular information from a website. You must take into consideration that any change to a webpage may break your robot.
Finally, if you are planning to develop a website using such information sources, you could purchase information from companies such as spinn3r.com, which sell particular niches of information ready to be consumed. You would be able to save lots of money on infrastructure.
Hope it helps!
Sebastian
Python has the feedparser module, located at feedparser.org, which handles RSS and Atom in their various flavours. No reason to reinvent the wheel.
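A minimal sketch of that approach, with a placeholder feed URL:

import feedparser

feed = feedparser.parse("https://example.com/feed")  # placeholder URL

print(feed.feed.get("title"))
for entry in feed.entries:
    print(entry.get("title"), entry.get("link"))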
