I'm running Rcrawler on a very large website, so it takes a very long time (3+ days with the default page depth). Is there a way to skip downloading all the HTML files to make the process faster?
I only need the URLs that are stored in the INDEX.
Or can anyone recommend another way to make Rcrawler run faster?
I have tried running it with a smaller page depth (5), but it is still taking forever.
I am dealing with the same issue. Depending on the source, in some cases I am even running at depth 1.
Best,
Janusz
I have implemented a few web scraping projects, ranging from small to mid size (around 100,000 scraped pages), in the past. Usually my starting point is an index page that links to several pages with the details I want to scrape. Most of the time my projects worked in the end, but I always feel like I could improve the workflow (especially regarding the challenge of reducing the traffic I cause to the scraped websites [and, connected to that topic, the risk of being banned :D]).
That's why I was wondering about your (best practice) approaches to web scraper design (for small and mid-size projects).
Usually I build my web scraping projects like this:
I identify a starting point, which contains the URLs I want to scrape data from. The starting point has quite a predictable structure, which makes it easy to scrape
I take a look at the endpoints I want to scrape and figure out some functions to scrape and process the data
I collect all the URLs (endpoints) I want to scrape from my starting point and store them in a list (sometimes the starting point spans several pages ... for example if search results are displayed and one page only shows 20 results ... but the structure of these pages is almost identical)
I start crawling the url_list and scrape the data I am interested in.
To scrape the data, I run some functions to structure and store the data in the format I need
Once I have successfully scraped the data, I mark the URL as "scraped" (so if I run into errors, timeouts or something similar, I don't have to start from the beginning, but can continue from where the process stopped)
I combine all the data I need and finish the project
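In Python terms, that workflow looks roughly like this minimal sketch (using requests and BeautifulSoup); the index URL, the CSS selectors and the file names are made-up placeholders, not from a real project:

# Sketch: collect endpoint URLs from the index, scrape each one,
# and mark it as "scraped" so a crashed run can be resumed.
import json, os, requests
from bs4 import BeautifulSoup

INDEX_URL = "https://example.com/search?page={}"   # hypothetical starting point
DONE_FILE = "scraped_urls.txt"

def collect_endpoints(pages):
    """Gather all detail-page URLs from the paginated index."""
    urls = []
    for page in range(1, pages + 1):
        soup = BeautifulSoup(requests.get(INDEX_URL.format(page)).text, "html.parser")
        urls.extend(a["href"] for a in soup.select("a.result-link"))
    return urls

def scrape(url):
    """Fetch one detail page and structure the data I am interested in."""
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    return {"url": url, "title": soup.select_one("h1").get_text(strip=True)}

done = set(open(DONE_FILE).read().split()) if os.path.exists(DONE_FILE) else set()
for url in collect_endpoints(pages=3):
    if url in done:
        continue                                   # already scraped, skip on resume
    with open("results.jsonl", "a") as out:
        out.write(json.dumps(scrape(url)) + "\n")
    with open(DONE_FILE, "a") as log:
        log.write(url + "\n")                      # mark the URL as scraped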
Now I am wondering if it could be a good idea to modify this workflow and stop extracting/processing data while crawling. Instead I would collect the raw data / the whole page, mark the URL as crawled and continue crawling. When all pages are downloaded (or, if it is a bigger project, between bigger tasks) I would run functions to process and store the raw data (a sketch of this follows the pros and cons below).
Benefits of this approach would be:
if I run into errors caused by unexpected structure, I would not have to re-scrape all the pages I had already fetched; I would only have to change my code and run it on the stored raw data (which would minimize the traffic I cause)
as websites keep changing, I would have a pool of reproducible data
Cons would be:
especially if projects grow in size, this approach could require too much disk space
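To make the idea concrete, the two-pass version would look roughly like this (again only a sketch with placeholder names; the crawl pass stores raw HTML and marks the URL as crawled, and the processing pass runs later without causing any new traffic):

import hashlib, os, requests
from bs4 import BeautifulSoup

RAW_DIR = "raw_html"                               # placeholder location for the raw dumps
os.makedirs(RAW_DIR, exist_ok=True)

def raw_path(url):
    return os.path.join(RAW_DIR, hashlib.sha1(url.encode()).hexdigest() + ".html")

def crawl(urls):
    """Pass 1: download and store the raw pages, nothing else."""
    for url in urls:
        path = raw_path(url)
        if os.path.exists(path):                   # URL already marked as crawled
            continue
        with open(path, "w", encoding="utf-8") as f:
            f.write(requests.get(url).text)

def process():
    """Pass 2 (later, offline): parse the stored HTML; re-runs cost no traffic."""
    for name in os.listdir(RAW_DIR):
        with open(os.path.join(RAW_DIR, name), encoding="utf-8") as f:
            soup = BeautifulSoup(f.read(), "html.parser")
        # ... extract, structure and store whatever fields are needed ...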
Without knowing your goal, it's hard to say, but I think it's a good idea as far as debugging goes.
For example, if the sole purpose of your scraper is to record some product's price, but your scraper suddenly fails to obtain that data, then yes, it would make sense to kill the scraper.
But let's say the goal isn't just the price, but various attributes on a page, and the scraper is just failing to pick up one attribute due to something like a website change. If that were the case, and there is still value in scraping the other data attributes, then I would continue scraping, but log the error. Another consideration would be the failure rate. Web scraping is very finicky; sometimes web pages load differently or incompletely, and sometimes websites change. Is the scraper failing 100% of the time, or just 5% of the time?
Having the HTML dump saved on error would certainly help debug issues like a failing XPath. You could minimize the amount of space consumed with more careful error handling: for example, only save a file containing an HTML dump if one doesn't already exist for that specific error (an XPath failing to return a value, a type mismatch, etc.).
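Something along these lines is what I mean; just a sketch with invented names (parse_price stands in for whatever extraction code you have), keeping one example page per error type so repeated failures don't fill the disk:

import logging, os

DUMP_DIR = "error_dumps"                           # invented location for the HTML dumps
os.makedirs(DUMP_DIR, exist_ok=True)

def extract_price(html, url):
    """Try to extract data; on failure, log it and keep one dump per error type."""
    try:
        return parse_price(html)                   # hypothetical parsing function
    except Exception as exc:
        dump = os.path.join(DUMP_DIR, type(exc).__name__ + ".html")
        if not os.path.exists(dump):               # only the first page that triggers this error
            with open(dump, "w", encoding="utf-8") as f:
                f.write(html)
        logging.error("%s while scraping %s", type(exc).__name__, url)
        return None                                # log and move on instead of dying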
Re: getting banned, I would recommend using a scraping framework. For example, in Python there is Scrapy, which handles the flow of requests. Also, proxy services exist to avoid getting banned. In the US at least, web scraping has been explicitly deemed legal. All companies account for web scraping traffic; you aren't going to break a service with 100k scrapes. Think about the millions of scrapes a day Walmart does on Amazon, and vice versa.
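If you go the Scrapy route, most of the "be polite / don't get banned" behaviour is just configuration; a settings.py fragment might look like the following (the values are only illustrative and should be tuned per site):

# settings.py -- example politeness settings; the numbers are only examples
ROBOTSTXT_OBEY = True                   # respect robots.txt
DOWNLOAD_DELAY = 1.0                    # base delay between requests, in seconds
CONCURRENT_REQUESTS_PER_DOMAIN = 4      # don't hammer a single host
AUTOTHROTTLE_ENABLED = True             # back off automatically when the site slows down
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
HTTPCACHE_ENABLED = True                # repeated runs during development hit the local cache
RETRY_TIMES = 2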
I've scraped a lot of data (Twitter user information) for research purposes, and at the moment all of it is stored as a list object in my global environment. Due to the Twitter rate limit I append entries frequently until I reach my goal (~200,000 entries). At the moment I have about 100,000 entries in this list object, at ~70 MB. The problem is that I want to save all of this to my SSD as a backup, but when I save my environment it runs the whole night and then throws an error. That means that if my computer crashes, I'll lose all my effort! When I save just the object with the list.save function from the rlist package, it also runs for several hours.
Do you have any suggestions how I should handle this issue? Thank you!
I think saveRDS should help.
saveRDS is used when you want to save only one object (and readRDS loads it back).
I'm currently using a service that provides a simple-to-use API for setting up web scrapers for data extraction. The extraction is rather simple: grab the title (both the text and the hyperlink URL) and two other text attributes from each item in a list of items that varies in length from page to page, with a maximum of 30 items.
The service performs this function well; however, it is somewhat slow, at about 300 pages per hour. I'm currently scraping up to 150,000 pages of time-sensitive data (I must use the data within a few days or it becomes "stale"), and I expect that number to grow several fold. My workaround is to clone these scrapers dozens of times and run them simultaneously on small sets of URLs, but this makes the process much more complicated.
My question is whether writing my own scraper using Scrapy (or some other solution) and running it from my own computer would achieve better performance than this, or whether this magnitude is simply not within the scope of solutions like Scrapy, Selenium, etc. on a single, well-specced home computer (attached to an 80 Mbit down / 8 Mbit up connection).
Thanks!
You didn't provide the site you are trying to scrape, so I can only answer according to my general knowledge.
I agree Scrapy should be able to go faster than that.
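For the kind of extraction you describe (a title link plus two text fields per list item), the spider itself stays tiny; something like the sketch below, where the start URL and CSS selectors are invented because you didn't name the site, and the concurrency numbers are just a starting point. Even with a modest delay, a single machine running Scrapy can usually fetch thousands of simple pages per hour, so 150k pages within a few days is realistic unless the site throttles you.

import scrapy

class ItemListSpider(scrapy.Spider):
    name = "item_list"
    start_urls = ["https://example.com/list?page=1"]   # placeholder, site unknown
    custom_settings = {
        "CONCURRENT_REQUESTS": 32,                     # illustrative, tune to the site
        "DOWNLOAD_DELAY": 0.25,
    }

    def parse(self, response):
        # Up to ~30 items per page; the CSS selectors are invented examples.
        for item in response.css("ul.results li"):
            yield {
                "title": item.css("a.title::text").get(),
                "url": item.css("a.title::attr(href)").get(),
                "attr1": item.css("span.attr1::text").get(),
                "attr2": item.css("span.attr2::text").get(),
            }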
With Bulk Extract, import.io is definitely faster; I have extracted 300 URLs in a minute, so you may want to give it a try.
You do need to respect the websites' terms of use.
I have written around 5k lines in 3 days for my new website. There are a lot of places where leaks or database queries could be slowing my page down, but the fact is that a single page request takes around 2 full seconds, which I think is very long.
1) How can I measure the exact time my page needs to load? (So I can compare, after disabling or changing a query, whether it works.)
2) How do I find the leak / the thing that is slowing down my ASP.NET site the most?
Use this in Page_Load:
Trace.IsEnabled = true;
It will show everything, with the time taken by each page life-cycle event.
You can see where the time is going and then proceed accordingly.
I use MiniProfiler on the applications I work on. If you have SQL Server as data store then use SQL Server Profiler to see what queries are being executed. Other than that, it's mostly grunt work when it comes to tracking performance bottlenecks.
You need to run a profiler to check the execution time each of your methods on a page is taking; many tools, both free and paid, are available for this. You can check out Glimpse, which is a nice free tool, available on NuGet and preferred by many.
I'm using Table Wizard + Migrate module to import nodes into my Drupal installation.
I need to import around 60,000 questions / answers (they are both nodes) and I thought it would have been an easy task.
However, the migrate process imports 4 nodes per minute, and it would take approximately 11 days to finish the importing.
I was wondering if I could make it faster by importing directly into MySQL. But I actually need to create 60,000 nodes, and I guess Drupal stores additional information in other tables... so it is not that safe.
What do you suggest I do? Wait 10 days?
Thanks
Table migrate should be orders of magnitude faster than that.
Are you using pathauto?
If yes, try disabling the pathauto module, often causes big performance problems on import.
Second, if disabling pathauto doesn't work, turn off all non-essential modules you may have running - some modules do crazy stuff. Eliminate other modules as the sources of the problem.
Third, is the MySQL log turned on? That can have a big performance impact (not at the level you are talking about, but it's something to consider).
Fourth, install Xdebug, and tail your MySQL log to see exactly what's happening.
What is your PHP memory limit?
Do you have plenty of disk space left?
If you're not doing it already, you should use drush to migrate the nodes in batches. You could even write a shell script for it if you want it automated. Using the command line should lower the time it takes to import the nodes a lot, and with a script you can make it an automated task that you don't have to worry about.
One thing I want to note, though: 4 nodes per minute is very low. I once needed to import some nodes from a CSV file using Migrate. It was 300 nodes, with a location and 4-5 CCK fields, and I did it in a matter of seconds. So if you only import 4 nodes per minute, you either have extremely complex nodes, or something fishy is going on.
What are the specs of the computer you are using for this? Where's the import source located?
This is a tough topic, but within Drupal it is actually very well covered. I don't know the ins and outs, but I do know where to look.
The Data Mining Drupal group has some pointers, knowledge and information on processing large amounts of data in PHP/Drupal.
Drupal core has batch functionality built in, called the Batch API, at your service when writing modules! For a working example, see this tutorial on CSV import.
4 nodes per minute is incredibly slow. Migrate shouldn't normally take that long. You could speed things up a bit by using Drush, but probably not enough to get a reasonable import time (hours, not days). That wouldn't really address your core problem: the import itself is taking too long. The overhead of the Migrate GUI isn't that big.
Importing directly into MySQL would certainly be faster, but there's a reason Migrate exists. Node database storage in Drupal is complicated, so it's generally best to let Drupal work it out rather than trying to figure out what goes where.
Are you using Migrate's hooks to do additional processing on each node? I'd suggest adding some logging to see what exactly is taking so long. Test it on 10 nodes at a time until you figure out the lag before doing the whole 60k.
We had a similar problem on a Drupal 7 install. We left an import running all weekend, and it only imported 1,000 lines of a file.
The funny thing is that exactly the same import on a pre-production machine was taking 90 minutes.
We ended up comparing the source code (making sure we were at the same commit in git), the database schema (identical), the number of nodes on each machine (not identical but similar)...
Long story made short, the only significant difference between the two machines was the max_execution_time option in the php.ini settings file.
The production machine had max_execution_time = 30, while the pre-production machine had max_execution_time = 3000. It looks like the migrate module has a kind of system to handle "short" max_execution_time that is less than optimal.
Conclusion: set max_execution_time = 3000 or more in your php.ini; that helps the Migrate module a lot.
I just wanted to add a note saying the pathauto disable really does help. I had an import of over 22k rows and before disabling it took over 12 hours and would crash multiple times during the import. After disabling pathauto and then running the import, it took only 62 minutes and didn't crash once.
Just a heads up: I created a module that disables the pathauto module before the import starts and then re-enables it once the feed finishes. Here's the code from the module in case anyone needs this ability:
/**
 * Implements hook_feeds_before_import().
 * Disable pathauto before the Feeds import starts.
 */
function YOURMODULENAME_feeds_before_import(FeedsSource $source) {
  $modules = array('pathauto');
  module_disable($modules);
  drupal_set_message(t('The @module module has been disabled for the import.', array('@module' => $modules[0])), 'warning');
}

/**
 * Implements hook_feeds_after_import().
 * Re-enable pathauto once the feed has finished.
 */
function YOURMODULENAME_feeds_after_import(FeedsSource $source) {
  $modules = array('pathauto');
  module_enable($modules);
  drupal_set_message(t('The @module module has been re-enabled.', array('@module' => $modules[0])), 'warning');
}