I'm currently using a service that provides a simple-to-use API for setting up web scrapers for data extraction. The extraction itself is simple: grab the title (both the text and the hyperlink URL) and two other text attributes from each item in a list of items that varies in length from page to page, up to a maximum of 30 items.
The service performs this function well; however, it is slow, at about 300 pages per hour. I'm currently scraping up to 150,000 pages of time-sensitive data (I must use the data within a few days or it becomes "stale"), and I expect that number to grow severalfold. My workaround is to clone these scrapers dozens of times and run them simultaneously on small sets of URLs, but this makes the process much more complicated.
My question is whether writing my own scraper with Scrapy (or some other solution) and running it from my own computer would perform better than this, or whether this magnitude is simply not within the scope of solutions like Scrapy, Selenium, etc. on a single, well-specced home computer (attached to an 80 Mbit down / 8 Mbit up connection).
Thanks!
You didn't provide the site you are trying to scrape, so I can only answer according to my general knowledge.
I agree Scrapy should be able to go faster than that.
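For a rough sense of the gap, here's a back-of-envelope estimate; the 16 concurrent requests and ~1 second per page (including parsing) are assumptions for illustration, not measurements:

```python
# Back-of-envelope throughput estimate for a concurrent scraper.
# Assumed figures: 16 concurrent requests, ~1 second per page.

def pages_per_hour(concurrency, seconds_per_page):
    """Pages scraped per hour with `concurrency` parallel fetches."""
    return int(concurrency / seconds_per_page * 3600)

def hours_for(total_pages, concurrency, seconds_per_page):
    """Total wall-clock hours to scrape `total_pages` pages."""
    return total_pages / pages_per_hour(concurrency, seconds_per_page)

print(pages_per_hour(16, 1.0))                 # 57600 pages/hour
print(round(hours_for(150_000, 16, 1.0), 1))   # 2.6 hours for the full set
```

Even under these modest assumptions, that is a couple of hundred times the 300 pages/hour you're seeing now, so the workload is well within reach of a single machine.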
With Bulk Extract, import.io is definitely faster; I have extracted 300 URLs in a minute, so you may want to give it a try.
You do need to respect the website's Terms of Use.
I have implemented a few web scraping projects, ranging from small to mid size (around 100,000 scraped pages), in the past. Usually my starting point is an index page that links to several pages with the details I want to scrape. Most of the time my projects worked in the end, but I always feel like I could improve the workflow (especially regarding the challenge of reducing the traffic I cause to the scraped web sites [and, connected to that topic, the risk of being banned :D]).
That's why I was wondering about your (best practice) approaches of web scraper designs (for small and mid size projects).
Usually I build my web scraping projects like that:
I identify a starting point, which contains the URLs I want to scrape data from. The starting point has a quite predictable structure, which makes it easy to scrape.
I take a glimpse at the endpoints I want to scrape and figure out some functions to scrape and process the data.
I collect all the URLs (endpoints) I want to scrape from my starting point and store them in a list (sometimes the starting point is several pages ... for example if search results are displayed and one page only shows 20 results ... but the structure of these pages is almost identical).
I start crawling the url_list and scrape the data I am interested in.
To scrape the data, I run some functions to structure and store the data in the format I need
Once I have successfully scraped the data for a URL, I mark it as "scraped" (so if I run into errors, timeouts or something similar, I don't have to start from the beginning, but can continue from where the process stopped)
I combine all the data I need and finish the project
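The crawl-and-resume part of the workflow above might be sketched roughly like this; `fetch` and `extract` are hypothetical placeholders for your own download and parsing functions:

```python
# Sketch of the workflow above: crawl a url_list, mark each URL as
# "scraped" on success, so a crash or timeout lets you resume where
# you left off instead of starting from the beginning.

def crawl(url_list, state, fetch, extract):
    """state maps url -> 'scraped' / 'error' for already-visited URLs."""
    results = []
    for url in url_list:
        if state.get(url) == "scraped":
            continue  # resume support: skip finished work
        try:
            html = fetch(url)
            results.append(extract(html))
            state[url] = "scraped"
        except Exception:
            state[url] = "error"  # retry these in a later pass
    return results
```

Persisting `state` to disk (a small JSON or SQLite file) between runs is what makes the resume actually survive a crash.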
Now I am wondering if it could be a good idea to modify this workflow and stop extracting/processing data while crawling. Instead, I would collect the raw data/the website, mark the URL as crawled, and continue crawling. When all websites are downloaded (or, if it is a bigger project, between bigger tasks) I would run functions to process and store the raw data.
Benefits of this approach would be:
if I run into errors caused by unexpected structure, I would not have to re-scrape all the previous pages. I would only have to change my code and run it on the stored raw data (which would minimize the traffic I cause)
as websites keep changing, I would have a pool of reproducible data
Cons would be:
especially as projects grow in size, this approach could require too much disk space
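A minimal sketch of the crawl-first/process-later variant, assuming compressed dumps on disk keyed by a hash of the URL (gzipping also softens the disk-space con, since HTML compresses well):

```python
import gzip
import hashlib
import os

# Crawl pass: only store compressed raw HTML on disk.
# Process pass: re-run extraction over the dumps as often as needed,
# without causing any new traffic to the scraped site.

def dump_path(url, directory):
    """Stable on-disk filename derived from the URL."""
    name = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return os.path.join(directory, name + ".html.gz")

def store_raw(url, html, directory):
    with gzip.open(dump_path(url, directory), "wt", encoding="utf-8") as f:
        f.write(html)

def load_raw(url, directory):
    with gzip.open(dump_path(url, directory), "rt", encoding="utf-8") as f:
        return f.read()
```

The processing pass then just walks the dump directory and calls your extraction functions on `load_raw(...)` output.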
Without knowing your goal, it's hard to say, but I think it's a good idea as far as debugging goes.
For example, if the sole purpose of your scraper is to record some product's price, but your scraper suddenly fails to obtain that data, then yes, it would make sense to kill the scraper.
But let's say the goal isn't just the price, but various attributes on a page, and the scraper is just failing to pick up one attribute due to something like a website change. If that were the case, and there is still value in scraping the other data attributes, then I would continue scraping but log the error. Another consideration is the failure rate. Web scraping is very finicky: sometimes web pages load differently or incompletely, and sometimes websites change. Is the scraper failing 100% of the time, or perhaps just 5% of the time?
Having the HTML dump saved on error would certainly help debug issues like a failing XPath. You could minimize the amount of space consumed with more careful error handling: for example, only save a file containing an HTML dump if one doesn't already exist for this specific kind of error (an XPath failing to return a value, a type mismatch, etc.).
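A sketch of that save-only-if-new idea; the error-kind names are made up for illustration:

```python
import os

# Keep at most one HTML dump per distinct error kind (e.g.
# "xpath_price_missing"), so repeated failures of the same kind
# don't fill the disk with identical evidence.

def save_dump_once(error_kind, html, directory):
    """Return True if a new dump was written, False if one existed."""
    path = os.path.join(directory, f"{error_kind}.html")
    if os.path.exists(path):
        return False
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    return True
```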
Re: getting banned. I would recommend using a scraping framework; for example, in Python there is Scrapy, which handles the flow of requests. Proxy services also exist to avoid getting banned. In the US at least, web scraping has been explicitly deemed legal. All companies account for web scraping traffic; you aren't going to break a service with 100k scrapes. Think about the millions of scrapes a day Walmart does on Amazon, and vice versa.
I am trying to use the set() API to write an object to Firebase. The object is fairly large; the serialized JSON is 2.6 MB in size. The root node has around 90 children, and in all there are around 10,000 nodes in the JSON tree.
The set() API seems to hang and never calls the callback.
It also seems to cause problems with the Firebase instance.
Any ideas on how to work around this?
Since this is a commonly requested feature, I'll go ahead and merge Robert and Puf's comments into an answer for others.
There are some tools available to help with big data imports, like firebase-streaming-import. What they do internally can also be engineered fairly easily for the do-it-yourselfer:
1) Get a list of keys without downloading all the data, using a GET request with shallow=true. Possibly do this recursively, depending on the data structure and dynamics of the app.
2) In some sort of throttled fashion, upload the "chunks" to Firebase using PUT requests or the API's set() method.
The critical thing to keep in mind here is that both the number of bytes in a request and the frequency of requests will have an impact on performance for others viewing the application, and will also count against your bandwidth.
A good rule of thumb is that you don't want to do more than ~100 writes per second during your import (preferably under 20, to maximize realtime speeds for other users), and that you should keep the data chunks in the low MBs, certainly not GBs per chunk. Keep in mind that all of this has to go over the internet.
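The two steps might be sketched like this; `put` here is a hypothetical stand-in for your actual PUT request to `<path>.json` or the SDK's set() call, and the rate limit is just the rule-of-thumb number above:

```python
import time

# Step 1 gives you a shallow key list; batch it into upload-sized
# chunks. Step 2 uploads one child node at a time, rate-limited so
# other users of the Firebase app stay responsive.

def chunk_keys(keys, chunk_size):
    """Split the shallow key list into upload-sized batches."""
    return [keys[i:i + chunk_size] for i in range(0, len(keys), chunk_size)]

def throttled_import(data, put, writes_per_second=20):
    """Upload each child node via `put`, staying under the write rate."""
    delay = 1.0 / writes_per_second
    for key, value in data.items():
        put(key, value)      # one small request per child node
        time.sleep(delay)    # stay well under ~100 writes/sec
```

With ~90 children of ~30 KB each, this turns one 2.6 MB set() into 90 small writes spread over a few seconds, which avoids the hang entirely.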
I know there are lots of similar questions out there like this, but all of the solutions are either ones I cannot use or ones that do not work. The basics of the issue: I have to make a web service call that returns a typed DataSet, and this DataSet can have 30,000 rows or more in some cases. So my issue is how do I make the page more responsive, and perhaps load everything else while the web service is still downloading the dataset?
Please note that normally I would never return this amount of data and would instead do paging on the server side, but the requirements for this really lock down what I can do. I can make the web service return JSON if need be, but my problem at that point is how to get the JSON data back into a format that the GridView could use to bind the data. I know there is an external library out there, but that is out as well.
Sad to say that the restrictions I have here are pretty obscene, but they are what they are and I cannot really change them.
TIA
-Stanley
A common approach to this kind of scenario is to page (in chunks) your data as it comes back. Do this asynchronously (separate thread). You might even be able to do this in only two chunks: first 1000 rows, then the rest. It will seem very responsive to your users. If there is any way to require the users to filter the result-set, to reduce the result-set, that would be ideal.
@Lostdreamer is right. Use jQuery to do two AJAX calls: the first call gets the first 1000 rows, then kicks off the second call (etc). Honestly, this is simply simulating what HTTP typically does (limiting packet sizes and loading multiple chunks).
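The chunking logic itself is trivial; sketched here in Python for brevity (in practice it would live in your service endpoint and jQuery success handlers):

```python
# Two-chunk paging: hand back a small first chunk right away so the
# grid can render something immediately, then the remainder.

def two_chunks(rows, first_chunk=1000):
    """Yield the first `first_chunk` rows, then everything else."""
    yield rows[:first_chunk]
    if len(rows) > first_chunk:
        yield rows[first_chunk:]
```

The first AJAX call binds chunk one to the grid; its success callback fires the second call for the rest, so the user never stares at a blank page while 30,000 rows download.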
I have a Flex web app that retrieves some names & addresses from a database. The project works fine, but I'd like to make it faster. Instead of making a call to the database for each name request, I could pre-load all names into an array and filter the array when the user makes a request. Before I go down this route, though, I wanted to check whether it is even feasible to have an application with 50,000 or 1 million elements in an array. What is the limit before it slows down the app? (I anticipate that it will have a lot to do with what else is going on in my app, but for this sake let's just assume the app ONLY consists of this huge array.)
Searching through a large array can be slower than necessary, particularly if you're talking about 1 million records.
Can you split it into a few still-large-but-smaller arrays? If you're always searching by account number, then divide them up based on the first digit or two digits.
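Sketched in Python rather than AS3 for brevity, the split-by-prefix idea might look like this; the `acct` field name is made up for illustration:

```python
from collections import defaultdict

# Bucket records by the first two digits of the account number, so a
# lookup only scans one small bucket instead of the whole array.

def bucket_by_prefix(records, key, digits=2):
    buckets = defaultdict(list)
    for rec in records:
        buckets[str(rec[key])[:digits]].append(rec)
    return buckets

def find(buckets, account_no, key, digits=2):
    """Scan only the one bucket this account number can live in."""
    return [r for r in buckets.get(str(account_no)[:digits], [])
            if str(r[key]) == str(account_no)]
```

With evenly distributed account numbers, two digits cuts each scan to roughly 1/100th of the full array.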
To directly answer your question, though: pure AS3 processing of a 50,000-element array should be fine. Once you get over 250,000, I'd think you need to break it up.
Displaying that many UI elements is different, though. If you try to bind a chart to a dataProvider with 10,000 elements, it's too much. The same goes for a list or datagrid.
But for pure model data, not UI-bound, I'd recommend up to 250,000 in my experience.
If you're loading large amounts of data (not sure if you're using Lists, though), you could check out James Ward's post about using AsyncListView with paging to grab the data in chunks as it's needed. I'm going to try to implement something like this soon. His runnable example uses 100,000 rows with paging of 100 (it works with HttpService/AMF-type calls):
http://www.jamesward.com/2010/10/11/data-paging-in-flex-4/
Yes, you could probably stuff a few million items in an array if you wanted to, and the Flash player wouldn't yell at you. But do you really want to?
Is the application going to take longer to start if it has to download the entire database locally before being able to work? If the additional time needed to download that much data isn't significant, are a few database lookups really worth optimizing?
If you have a good use case to do this, you're going to have to pay attention to the way you use those data structures. Looping over the array to find an item is going to be a bit slow, so you'll want to create indexes locally, most likely by using a few hash structures. The more flexible you allow the search queries to be, the more interesting the indexing issues will be.
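The hash-index idea might be sketched like so (again Python rather than AS3, for brevity); lookups become dictionary hits instead of full-array scans:

```python
# Build one hash map per searchable field up front; each lookup is
# then O(1) on average instead of a linear scan over the whole array.

def build_index(items, field):
    index = {}
    for item in items:
        index.setdefault(item[field], []).append(item)
    return index

# Hypothetical usage:
#   name_index = build_index(people, "name")
#   matches = name_index.get("Ann", [])
```

Every extra way you let users search (prefix match, case-insensitive, multiple fields) means another index or a cleverer key function, which is where the "interesting indexing issues" come in.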
I am building UI for a large product catalog (millions of products).
I am using Sql Server, FreeText search and ASP.NET MVC.
Tables are normalized and indexed. Most queries take less than a second to return.
The issue is this: let's say a user searches by keyword. On the search results page I need to display/query for:
The first 20 matching products (paged, sorted)
The total count of matching products, for paging
The list of stores across all matching products
The list of brands across all matching products
The list of colors across all matching products
Each query takes about 0.5 to 1 second. Altogether that is about 5 seconds.
I would like to get the whole page to load under 1 second.
There are several approaches:
Optimize the queries even more. I have already spent a lot of time on this one, so I am not sure it can be pushed further.
Load the products first, then load the rest of the information via AJAX. This is more of a workaround and will need UI revisions.
Re-organize the data to be more report-friendly. I have already aggregated a lot of fields.
I checked out several similar sites, for example zappos.com. Not only do they display the same information as I would like in under 1 second, but they also include statistics (the number of results in each category).
The following is a search for the keyword "white":
http://www.zappos.com/white
How do sites like Zappos and Amazon make their results, filters and stats appear almost instantly?
So you asked specifically "how does Zappos.com do this". Here is the answer from our Search team.
An alternative idea for your issue would be using a search index such as Solr. Basically, the way these work is that you load your data set into the system and it does a huge amount of indexing up front. My projects include product catalogs with 200+ data points for each of the 140k products, and the average return time is less than 20 ms.
The search indexing system I would recommend is Solr, which is based on Lucene. Both of these projects are open source and free to use.
Solr fits your described use case perfectly in that it can actually do all of those things in one query. You can use facets (essentially GROUP BY in SQL) to return the list of distinct data values across all applicable results. In the case of keywords, it also allows you to search across multiple fields in one query without performance degradation.
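As an illustration, here is a sketch (in Python) that builds a single Solr select URL returning the first 20 matches plus facet counts for store, brand, and color in one round trip; the host, core name, and field names are assumptions for illustration:

```python
from urllib.parse import urlencode

# One Solr query = products page + total count + all three facet
# lists, replacing the five separate SQL queries described above.

def facet_query_url(keyword):
    params = [
        ("q", keyword),
        ("rows", 20),              # first page of products
        ("facet", "true"),
        ("facet.field", "store"),  # distinct stores + counts
        ("facet.field", "brand"),  # distinct brands + counts
        ("facet.field", "color"),  # distinct colors + counts
    ]
    return "http://localhost:8983/solr/products/select?" + urlencode(params)
```

The response also includes `numFound` (your total count for paging) for free, so the whole results page comes back in one request.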
You could try replacing your aggregate queries with materialized indexed views of those aggregates. This will pre-compute all the aggregates and will be as fast as selecting any regular row data.
0.5 sec is too long on appropriate hardware. I agree with Aaronaught; the first thing to do is to convert it into a single SQL statement, or possibly a stored procedure, to ensure it's compiled only once.
Analyze your queries to see if you can create even better indexes (consider covering indexes), fine tune existing indexes, employ partitioning.
Make sure you have an appropriate hardware configuration: data, log, temp and even index files should be located on independent spindles. Make sure you have enough RAM and CPUs. I hope you are running a 64-bit platform.
After all this, if you still need more - analyze most used keywords and create aggregate result tables for top 10 keywords.
About Amazon: they most likely use superior hardware and also take advantage of CDNs. Also, they have thousands of servers serving up the content, and there are no performance bottlenecks; data is duplicated multiple times across several data centers.
As a completely separate approach, you may want to look into "in-memory" databases such as Caché; this is the fastest you can get on the DB side.