Web scraping design - best practices

I have implemented a few web scraping projects in the past, ranging from small to mid size (around 100,000 scraped pages). Usually my starting point is an index page that links to several pages with the details I want to scrape. In the end most of my projects worked, but I always feel like I could improve the workflow, especially regarding the challenge of reducing the traffic I cause to the scraped websites (and, connected to that, the risk of being banned).
That's why I was wondering about your best-practice approaches to web scraper design for small and mid size projects.
Usually I build my web scraping projects like this:
I identify a starting point which contains the URLs I want to scrape data from. The starting point has a quite predictable structure, which makes it easy to scrape.
I take a glimpse at the endpoints I want to scrape and figure out some functions to scrape and process the data.
I collect all the URLs (endpoints) I want to scrape from my starting point and store them in a list (sometimes the starting point spans several pages, for example when search results are paginated and one page only shows 20 results, but the structure of these pages is almost identical).
I start crawling the url_list and scrape the data I am interested in.
To scrape the data, I run some functions to structure and store the data in the format I need.
Once I have successfully scraped a URL, I mark it as "scraped" (so if I run into errors, timeouts or something similar, I don't have to start from the beginning but can continue from where the process stopped).
I combine all the data I need and finish the project.
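A minimal Python sketch of the steps above; the fetch and parse functions and the progress file name are hypothetical stand-ins for your own code:

```python
# Sketch of the single-pass workflow described above. `fetch` and
# `parse_detail_page` stand in for your own download/extract code;
# progress is kept in a plain text file so a crashed run can resume.
import pathlib

DONE_FILE = pathlib.Path("scraped_urls.txt")

def load_done():
    """Return the set of URLs already marked as scraped."""
    if DONE_FILE.exists():
        return set(DONE_FILE.read_text(encoding="utf-8").splitlines())
    return set()

def crawl(url_list, fetch, parse_detail_page):
    done = load_done()
    results = []
    with DONE_FILE.open("a", encoding="utf-8") as log:
        for url in url_list:
            if url in done:
                continue                      # resume: skip finished URLs
            html = fetch(url)                 # e.g. requests.get(url).text
            results.append(parse_detail_page(html))
            log.write(url + "\n")             # mark as "scraped" only on success
    return results
```

If a run dies halfway, the next run re-reads the progress file and only fetches the remaining URLs.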
Now I am wondering whether it would be a good idea to modify this workflow and stop extracting/processing data while crawling. Instead I would collect the raw data (the whole page), mark the URL as crawled and continue crawling. Once all pages are downloaded (or, for a bigger project, between larger tasks), I would run functions to process and store the raw data.
Benefits of this approach would be:
if I run into errors caused by unexpected structure, I would not have to re-scrape all the pages processed before; I would only have to change my code and run it on the stored raw data (which would minimize the traffic I cause)
as websites keep changing, I would have a pool of reproducible data
Cons would be:
especially as projects grow in size, this approach could require too much storage space
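The two-phase variant could be sketched like this, assuming pages fit on disk; the directory and helper names are made up for illustration:

```python
# Two-phase variant: phase 1 stores raw HTML on disk and marks the URL
# as crawled; phase 2 parses the stored files and can be re-run at will
# without causing any new traffic.
import hashlib
import pathlib

RAW_DIR = pathlib.Path("raw_pages")

def page_path(url):
    """Stable file name per URL (hashing avoids illegal characters)."""
    return RAW_DIR / (hashlib.sha1(url.encode()).hexdigest() + ".html")

def download_all(url_list, fetch):
    """Phase 1: fetch anything not yet stored; all traffic happens here."""
    RAW_DIR.mkdir(exist_ok=True)
    for url in url_list:
        path = page_path(url)
        if not path.exists():              # the file itself is the "crawled" marker
            path.write_text(fetch(url), encoding="utf-8")

def process_all(url_list, parse):
    """Phase 2: re-runnable against the stored raw data."""
    return {url: parse(page_path(url).read_text(encoding="utf-8"))
            for url in url_list}
```

When a parser bug surfaces, you fix parse() and re-run process_all(); download_all() never needs to touch the network again for pages it already has.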

Without knowing your goal it's hard to say, but I think it's a good idea as far as debugging goes.
For example, if the sole purpose of your scraper is to record some product's price, but your scraper suddenly fails to obtain that data, then yes, it would make sense to kill the scraper.
But let's say the goal isn't just the price but various attributes on a page, and the scraper is failing to pick up only one attribute due to something like a website change. In that case, if there is still value in scraping the other attributes, I would continue scraping but log the error. Another consideration is the failure rate. Web scraping is very finicky: sometimes pages load differently or incompletely, and sometimes websites change. Is the scraper failing 100% of the time, or perhaps just 5%?
Having the HTML dump saved on error would certainly help debug issues like a failing XPath. You could minimize the amount of space consumed with more careful error handling: for example, save a file containing an HTML dump only if one doesn't already exist for this specific error (an XPath failing to return a value, a type mismatch, etc.).
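That "one dump per distinct error" idea might look something like this sketch; the keying scheme is just an illustration:

```python
# Save an HTML dump only for errors we haven't seen yet, keyed by the
# failing step plus the exception type, so disk use stays bounded even
# when the same failure repeats across thousands of pages.
import pathlib

DUMP_DIR = pathlib.Path("error_dumps")

def dump_once(html, error, step):
    """Write the page once per (step, error-type) combination."""
    DUMP_DIR.mkdir(exist_ok=True)
    key = f"{step}-{type(error).__name__}.html"
    path = DUMP_DIR / key
    if not path.exists():                 # keep only the first occurrence
        path.write_text(html, encoding="utf-8")
    return path
```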
Re: getting banned. I would recommend using a scraping framework; in Python, for example, Scrapy handles the flow of requests. Proxy services also exist to help avoid bans. In the US at least, web scraping has been explicitly deemed legal. All companies account for web scraping traffic; you aren't going to break a service with 100k scrapes. Think about the millions of scrapes a day Walmart does on Amazon, and vice versa.
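Scrapy's DOWNLOAD_DELAY and AutoThrottle settings handle politeness for you; for a hand-rolled scraper, a per-domain delay is only a few lines. A sketch, not a full rate limiter:

```python
# A minimal per-domain politeness delay, the kind of thing Scrapy's
# DOWNLOAD_DELAY / AutoThrottle settings give you for free.
import time
from urllib.parse import urlparse

class Throttle:
    def __init__(self, delay_seconds):
        self.delay = delay_seconds
        self.last_hit = {}                    # domain -> monotonic timestamp

    def wait(self, url):
        """Sleep just long enough to honor the per-domain delay."""
        domain = urlparse(url).netloc
        last = self.last_hit.get(domain)
        if last is not None:
            remaining = self.delay - (time.monotonic() - last)
            if remaining > 0:
                time.sleep(remaining)         # be polite to this host
        self.last_hit[domain] = time.monotonic()
```

Call throttle.wait(url) before every request; requests to different domains are not delayed against each other.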

Related

SSRS dynamic report generation, PDF and subscriptions?

If this question is deemed inappropriate because it does not have a specific code question and is more "am I barking up the right tree," please advise me on a better venue.
If not: I'm a full stack .NET web developer with no SSRS experience, and my only knowledge comes from the last 3 sleepless nights. The app my team is working on requires end users to be able to create as many custom dashboards as they would like by creating instances of a dozen or so predefined widget types. Some widgets are as simple as a chart or table, and the user configures the widget to display a subset of possible fields selected from a larger set. We have a few widgets that are composites. The web client is all Angular and consumes a RESTful web API.
There are two more requirements: that a reasonable facsimile of each widget can be downloaded as a PDF report upon request, and at scheduled times. There are several solutions to this requirement, so I am not looking for alternate solutions. If SSRS would work, it would save us from having to build a scheduler and either find a way to leverage the existing Angular templates or create views based on them, populate them and convert that to a PDF. What I am looking for is help in understanding report generation best practices and how they interact with .NET assemblies.
My specific task is to investigate whether SSRS can create a report based on a composite widget and either download it as a PDF or schedule it as one, and if so, create a POC based on a composite widget that contains 2 line graphs and a table. The PDF versions do not need to be displayed the same way as the UI, where the graphs are on the same row and the table is below. I can show each graph on its own, as long as the display order is in reading order (left to right, then down to the next line).
An example case: the first graph shows the sales of Xboxes over the course of last year. The line graph next to it shows the number of new releases for the Xbox over the same period. The table below shows the number of Xbox accessories sold last year, grouped by accessory type (controller, headset, etc.) and by month, ordered by the total sales amount per month.
The example above would take 3 queries. The queries are unique to that user's specific instance of that widget on that specific dashboard. The user can group, choose sort columns and anything else that is applicable.
How these queries are created is not my task (at least not yet), so there is an assumption that a magic query engine creates and stores these SQL queries correctly in the database.
My target database is SQL Server 2012 and its Reporting Services. I'm disappointed it only supports the 2.0 CLR.
I have the rough outline of a plan, but given my lack of experience, any help with this would be appreciated.
It appears I can use the SOAP service for scheduling and management. That's straightforward.
The rest of my plan sounds pretty crazy. Any corrections, guidance and better suggestions would be welcome, or maybe a different methodology. The report server is a big security hole, and if I can accomplish the requirements by only referencing the Reporting Services namespaces, please point me in the right direction. If not, this is the process I have cobbled together after 3 days of research and a few simple MSDN tutorials. Here goes:
To successfully create the report definition, I will need to reference every possible field in the entire superset available. It isn't clear yet whether the superset for a table is the same as the superset for a graph, but for this POC I will assume they are. This way, I will only need a single stored procedure with an input parameter that identifies the correct query, which I will select and execute. The result set will be a small subset of the possible fields, but the stored procedure will return every field, with nulls in each row for the omitted fields, so that the report knows about every field. Terrible. I will probably be returning 5 columns with data and 500 full of nulls. There has to be a better way, and thinking about the performance hit is making me queasy, but that part was pretty easy. Now I have a deployable report. I have no idea how I would handle summaries. Would they be additional queries that I would just append to the result set? Maybe the magic query engine knows.
Now for some additional ugliness. I have to request the report URL with a query string that identifies the correct query. I am guessing I can also set the scheduler up with the correct parameter. But man, do I have issues. I could call the URL using HttpWebRequest for my download, but how exactly does the scheduler work? I would imagine it creates the report in a similar fashion, and I should be able to tell it what format to render. But for the download I would be streaming HTML. How would I tell the report server to convert it to a PDF and then stream it as such? Can that be set in the report's definition before deploying it? It has no problem with the conversion when I play around on the report server. But at least I've found a way to secure the report server by accessing it through the web API.
Then there is the issue of cleaning up the null columns. There are extension points, such as data processing extensions. I think these are almost analogous to a step in the web page life cycle, but I'm not sure exactly, or else they would be called events. I would need to find the right one so that I can remove the null data columns, or labels on a pie chart at null percent, without breaking the report, and I need to do it while it is still RDL. And just maybe, if I still haven't found a way, transform the RDL to a PDF and change the content type. It appears I can add .NET assemblies at the extension points. But is any of this correct? I am thinking like a developer, not like a seasoned SSRS pro. I'm trying, but any help pushing me in the right direction would be greatly appreciated.
I had tried revising that question a dozen times before asking, and it still seems unintelligible. Maybe my own answer will make my question clear, and hopefully save someone else from having to go through what I did, or at least serve as a quick dive into SSRS from a developer's standpoint.
Creating a typical SSRS report involves (a quick 40,000-foot overview):
1. Creating your data connection
2. Creating a SQL query or queries, which can be parameterized
3. Creating datasets that the query results will fill
4. Mapping dataset columns to report items: charts, tables, etc.
Then you build the report and deploy it to your report server, where the report can be requested by URL, with any SQL parameter values added as a query string:
http://reportserver/reportfolder/myreport?param1=data
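For what it's worth, SSRS URL access also accepts rendering commands such as rs:Format=PDF, which makes the server return a PDF directly. A hedged sketch of building such a request URL (the server, folder and parameter names are invented):

```python
# Build an SSRS URL-access request that asks the report server to render
# straight to PDF; folder, report and parameter names are hypothetical.
from urllib.parse import quote

def report_pdf_url(server, report_path, **params):
    """SSRS URL access: rs:Format=PDF tells the server to return a PDF."""
    url = f"http://{server}/ReportServer?{quote(report_path)}"
    url += "&rs:Command=Render&rs:Format=PDF"
    for name, value in params.items():
        url += f"&{name}={quote(str(value))}"
    return url
```

The resulting URL can then be fetched by any HTTP client (the question's HttpWebRequest, for instance) and the response body saved as the PDF.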
How this works: an RDL (Report Definition Language) file, which is just an XML document with a specific schema, is generated. The RDL has two elements that were relevant to me, DataSets and ReportItems. As the names imply, the first contains the queries, and the latter contains the graphs, charts, tables, etc. in the report, plus the mappings to the columns in the dataset.
When the report is requested, it goes through a processing pipeline on the report server. By implementing interfaces in the Reporting Services namespace, one can create .NET assemblies that transform the RDL at various stages in the pipeline.
Reporting Services also has two APIs: one for managing reports, and another for rendering. There is also the ReportViewer control, a .NET WebForms control that is pretty rich in functionality and can be used to create and render reports without even needing a report server instance. The report files the control generates are RDLC files, with the C standing for client.
Armed with all of this knowledge, I found several solution paths, but none of them were optimal for my purposes, and I have moved on to a solution that does not involve Reporting Services or RDL at all. But these may be of use to someone else.
I could transform the RDL file as it went through the pipeline. Not very performant, as this involved writing to the actual physical file and then removing the modifications after rendering. I was also using SQL Server 2012, which only supported the 2.0/3.5 framework.
Then there were the services. Using either service, I could retrieve an RDL template as a byte array from my application; I wasn't limited by the CLR version here. With the management service, I could modify the RDL and deploy it to the report server. I would only need to modify the RDL once, but given the number of files I would need, and having to manage them on the remote server, creating file structures by client/user/dashboard/report widget looked pretty ugly.
Alternatively, instead of deploying the RDL templates, why not just store them in the database as byte arrays? When I needed a specific instance, I could fetch the RDL template, add my queries and mappings to it, and then pass it to the execution service, which would render it. I could then save the resulting RDL in the database, where it would be much easier for me to manage. But now the report server would be useless: I would need my own services for management, and to create subscriptions and mail them I would need a queue service and an SMTP mailer, removing all the extras I would get from the report server, writing a ton of custom code, and still being bound by RDL. So I would be creating RDLM: RDL mess.
It was the wrong tool for the job, but it was an interesting exercise; I learned about Reporting Services from every angle, and was paid for most of that time. Maybe a blog post would be a better venue, but then I would need to go into much greater detail.

Performance limitations of Scrapy (and other non-service scraping/extraction solutions)

I'm currently using a service that provides a simple-to-use API for setting up web scrapers for data extraction. The extraction is rather simple: grab the title (both text and hyperlink URL) and two other text attributes from each item in a list of items that varies in length from page to page, with a maximum of 30 items.
The service performs this function well; however, it is somewhat slow at about 300 pages per hour. I'm currently scraping up to 150,000 pages of time-sensitive data (I must use the data within a few days or it becomes "stale"), and I expect that number to grow severalfold. My workaround is to clone these scrapers dozens of times and run them simultaneously on small sets of URLs, but this makes the process much more complicated.
My question is whether writing my own scraper using Scrapy (or some other solution) and running it from my own computer would achieve greater performance than this, or whether this magnitude is simply not within the scope of solutions like Scrapy, Selenium, etc. on a single, well-specced home computer (attached to an 80 Mbit down, 8 Mbit up connection).
Thanks!
You didn't provide the site you are trying to scrape, so I can only answer from general knowledge.
I agree that Scrapy should be able to go faster than that.
With its Bulk Extract feature, import.io is definitely faster; I have extracted 300 URLs in a minute. You may want to give it a try.
You do need to respect the website's terms of use.

Identify web-objects from redundant URIs using HTTP requests

I am struggling with an ill-constructed web-server log file, which I want to summarize to analyse attendance of the hosted site. Unfortunately for me, the architecture of the site is messy: there are no indexes of the hosted objects (HTML pages, JPG images, PDF documents, etc.), while several URIs can refer to the same page. For example:
http://www.site.fr/main.asp?page=foo.htm
http://www.site.fr/storage-tree/foo.htm
http://www.site.fr/specific.asp?id=200
http://www.site.fr/specific.asp?path=/storage-tree/foo.htm
etc., without any obvious regularities between the duplicate URIs.
How, conceptually and practically, can I efficiently identify the pages? As I see the problem, the idea is to construct an index linking the log's URIs to a unique object identifier constructed from HTTP requests. There are three loose constraints:
I use R for the statistical part and would therefore prefer to use it for the HTTP processing too
the logs consist of hundreds of thousands of different URIs (among which forms, search and database queries), so speed may be an issue
if I want to be able to tell, even three days or a month later, that a new URI is a previously identified page, I have to store the features I use to decide that two URIs refer to the same page; so storage space is an issue too
This is pretty easy with httr:
library(httr)
HEAD("http://gmail.com")$url
You will probably also want to check the status_code returned by HEAD, as failures often won't be redirected.
(One advantage of using httr over RCurl here is that it automatically preserves the connection across multiple HTTP calls to the same site, which makes things quite a bit faster.)
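The same redirect-resolving idea can be sketched in Python with a cache, so each URI costs at most one HTTP round trip even when the log is reprocessed; the resolver is pluggable and could, for example, wrap requests.head(url, allow_redirects=True).url:

```python
# Resolve each logged URI to its final (post-redirect) URL and cache the
# result, so a URI seen again days later maps to the same page without
# issuing another request.
class UriCanonicalizer:
    def __init__(self, resolve):
        # resolve: callable mapping a raw URI to its canonical URL,
        # e.g. lambda u: requests.head(u, allow_redirects=True).url
        self.resolve = resolve
        self.cache = {}                   # raw URI -> canonical URL

    def canonical(self, uri):
        if uri not in self.cache:
            self.cache[uri] = self.resolve(uri)   # one round trip, once
        return self.cache[uri]
```

Persisting self.cache (a plain URI-to-URL mapping) covers the storage constraint: it is the only feature needed to decide that two URIs are the same page.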

Tips for Integrating R code into a Mechanical Turk (e.g.) task?

I would like to randomize survey respondents on Mechanical Turk (or Survey Monkey, or a comparable web-based instrument) to particular conditions using my own R code. For example, the respondent might answer five background questions, then be exposed to a random question. I want to use the background data, run my R code on it, and return the question to the respondent immediately. (To be clear, I have a particular way I want to do the randomization in R that differs from complete randomization or random allocation of, e.g., 60% to one condition, 40% to the other.)
Any suggestions for how to go about integrating R code into a web-based survey like this?
Have you considered having MTurk query a web server you control, running R, to get its randomization? You could then just feed MTurk a spreadsheet with ID codes, put those ID codes in the URL to the web server, and have the web server keep track of which IDs it randomized to what.
A demonstration of how simple this might be is in Section 3 here:
http://biostat.mc.vanderbilt.edu/wiki/pub/Main/RApacheProject/paper.pdf
Another more end user-oriented walkthrough:
http://www.jstatsoft.org/v08/i10/paper
You could also look at Rweb, but that would be less secure. Many other options exist.
Basically you want Mechanical Turk to load a frame with your webpage in it. The webpage it requests would have a CGI submit embedded in it (e.g. MT loads a frame with the contents of the URL http://www.myserver.com/myproject.html?MTid=10473). Then your R script on the web server does the randomization, returns a webpage containing only the random number, and records on the web server which MTid was in the URL and which random number was generated. At the end, just merge the web server's data with the MT data by the MTid.
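The server-side bookkeeping in that flow is tiny; here it is sketched in Python rather than R (the class name and the use of choice() are illustrative, and you would substitute your own randomization rule behind the CGI endpoint):

```python
# Minimal server-side logic for the frame-based design: assign each MTid
# a condition once, record it, and return the same answer on repeat
# requests so a reloaded frame cannot re-randomize a respondent.
import random

class Randomizer:
    def __init__(self, conditions, seed=None):
        self.conditions = conditions
        self.assigned = {}                 # MTid -> condition (the server's log)
        self.rng = random.Random(seed)

    def assign(self, mt_id):
        if mt_id not in self.assigned:     # first visit: randomize
            self.assigned[mt_id] = self.rng.choice(self.conditions)
        return self.assigned[mt_id]        # repeat visit: same condition
```

The assigned dict is exactly what you later merge with the MTurk results by MTid.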

How to build large/busy RSS feed

I've been playing with RSS feeds this week, and for my next trick I want to build one for our internal application log. We have a centralized database table that our myriad batch and intranet apps use for posting log messages. I want to create an RSS feed off of this table, but I'm not sure how to handle the volume: there could be hundreds of entries per day even on a normal day, and an exceptional, make-you-want-to-quit kind of day might see a few thousand. Any thoughts?
I would make the feed a static file (you can easily serve thousands of those), regenerated periodically. That gives you a much broader choice of how to generate it, because generation doesn't have to finish in under a second; it can even take minutes. And users still get perfect download speed and reasonable update speed.
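A sketch of that static-file approach, assuming a list of (title, link) items pulled from the log table (the item shape and file name are assumptions); the file is swapped in atomically so readers never see a half-written feed:

```python
# Regenerate the feed as a static file on a schedule; os.replace makes
# the swap atomic, so the web server always serves a complete document.
import os
import tempfile
from xml.sax.saxutils import escape

def write_feed(items, path, title="Application log"):
    parts = ['<?xml version="1.0"?>',
             '<rss version="2.0"><channel>',
             f"<title>{escape(title)}</title>"]
    for item_title, link in items:
        parts.append(f"<item><title>{escape(item_title)}</title>"
                     f"<link>{escape(link)}</link></item>")
    parts.append("</channel></rss>")
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        f.write("\n".join(parts))
    os.replace(tmp, path)                  # atomic swap into place
```

A cron job (or scheduled task) calling write_feed() every few minutes is usually enough; the feed readers just poll the static URL.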
If you are building a system with notifications that must not be missed, then a pub-sub mechanism (using XMPP, one of the other protocols supported by ApacheMQ, or something similar) will be more suitable than a syndication mechanism. You need some measure of coupling between the system that is generating the notifications and the ones that are consuming them, to ensure that consumers don't miss notifications.
(You can do this using RSS or Atom as a transport format, but it's probably not a common use case; you'd need to vary the notifications shown based on the consumer and which notifications it has previously seen.)
I'd split up the feeds as much as possible and let users recombine them as desired. If I were doing it I'd probably think about using Django and the syndication framework.
Django's models could probably handle representing the data structure of the tables you care about.
You could have a URL pattern that captures the whole tail and splits it on '/' in the view, like: r'^rss/(?P<feeds>.+)$' (I can't test it right now, so it might not be perfect).
That way you could use URLs like:
http://feedserver/rss/batch-file-output/
http://feedserver/rss/support-tickets/
http://feedserver/rss/batch-file-output/support-tickets/ (the first two combined into one feed)
Then in the view:
def get_batch_file_messages():
    # Grab all the recent batch file messages here.
    # Maybe cache the result and only regenerate every so often.
    ...

# Other feed functions here.

feed_mapping = {'batch-file-output': get_batch_file_messages}

def rss(request, feeds):
    items_to_display = []
    for feed in feeds.strip('/').split('/'):
        items_to_display += feed_mapping[feed]()
    # Process and return the feed.
Having individual, chainable feeds means that users can subscribe to one feed at a time, or merge the ones they care about into one larger feed. Whatever's easier for them to read, they can do.
Without knowing your application, I can't offer specific advice.
That said, it's common in these sorts of systems to have a level of severity. You could have a query string parameter tacked onto the end of the URL that specifies the severity: if set to "DEBUG" you would see every event, no matter how trivial; if set to "FATAL" you'd only see events of "system failure" magnitude.
If there are still too many events, you may want to subdivide your events into some sort of category system; again, I would drive this with a query string parameter.
You can then have multiple RSS feeds for the various categories and severities. This should allow you to tune the alerts you get to an acceptable level.
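The severity cutoff taken from the query string could be as simple as this sketch (the level names and their ordering are an assumption):

```python
# Filter log events by a minimum severity taken from the feed URL's
# query string; level names follow the common DEBUG..FATAL ordering.
LEVELS = ["DEBUG", "INFO", "WARN", "ERROR", "FATAL"]

def filter_events(events, min_level):
    """Keep events at or above min_level; events are (level, message) pairs."""
    cutoff = LEVELS.index(min_level)
    return [e for e in events if LEVELS.index(e[0]) >= cutoff]
```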
In this case, it's more of a manager's dashboard: how much work was put into support today, is there anything pressing in the log right now, and for when we first arrive in the morning as a measure of what went wrong with batch jobs overnight.
Okay, I've decided how I'm going to handle this. I'm using the timestamp field on each row and grouping by day. It takes a little bit of SQL-fu to make it happen, since of course there's a full timestamp there and I need to be semi-intelligent about how I pick the log message to show from within each group, but it's not too bad. Further, I'm building it to let you select which application to monitor, and then showing every message (max 50) from a specific day.
That gets me down to something reasonable.
I'm still hoping for a good answer to the more generic question: "How do you syndicate many important messages, where missing a message could be a problem?"
