Distributed scraping to meet terms of access - web-scraping

I'm trying to scrape a relatively large number (approximately 10,000) of papers from PubMed. The terms of access state:
Do not make concurrent requests, even at off-peak times.
Does this rule out using a distributed method?
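For illustration, a strictly sequential fetcher (one request in flight at a time, with a pause between requests) would look something like the sketch below; the E-utilities endpoint, parameters, and delay value are assumptions for the example rather than anything stated in the terms.

# Sketch of a strictly sequential fetcher: a plain loop means there is never
# more than one request in flight. The endpoint, parameters, and delay are
# illustrative assumptions, not PubMed's stated limits.
import time
import requests

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
DELAY_SECONDS = 1.0  # assumed pause between requests; check current guidelines

def fetch_abstracts(pmids):
    results = {}
    for pmid in pmids:  # sequential: no concurrency, no distribution needed
        resp = requests.get(
            EFETCH_URL,
            params={"db": "pubmed", "id": pmid, "rettype": "abstract", "retmode": "text"},
            timeout=30,
        )
        resp.raise_for_status()
        results[pmid] = resp.text
        time.sleep(DELAY_SECONDS)
    return results

At one request per second, 10,000 papers is on the order of three hours, which is usually fast enough to make a distributed setup unnecessary.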

Related

Firebase and cartesian public reads

I'm working on a product that displays the results of running races. Races could have thousands of participants. So, in the days after a medium-sized event, there might be 3000 non-authenticated users wanting to browse 3000 results.
Although not every visitor will view all the results, the maximum damage at 3000 * 3000 would be 9,000,000 reads, which at $.06 (Google Cloud pricing) would cost $540,000 (Update: I'm a dummy, I missed the "per 100,000 documents" part, so this would only be $5.40).
Obviously, I wouldn't deliver all 3000 results for each visit - there would be paging and limits. Though, there's something inherently scary about the possibility of those costs.
Questions:
Is Firebase simply the wrong technology for this type of product?
Is Firebase not really intended for non-authenticated apps? Obviously DDoS becomes a concern with public access, and there's no real protection in Firebase for this.
Every post I've read on these topics assumes developers are building apps for authenticated users.
9,000,000 reads which at $.06 (Google cloud pricing) would cost $540,000
The Firestore pricing of $0.06 is per 100,000 document reads, so 9 million document reads cost $5.40.
Aside from that: you should model your data in a way that ensures you read the data that the user actually sees. For example, if all users will read the entirety of all 3,000 documents, consider using a data bundle to distribute that to them.
Realistically though, it is more likely that each user will read just a subset of the documents, and probably not all 3,000. So consider whether you can combine the part they will read into a more cost-efficient structure. For example, if these were news articles: you could store the headline and intro paragraph of the first 100 articles in a single document, and just read that document (let's call it the frontpage) into each client when it starts.
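As a rough sketch of that frontpage approach using the Python client (collection and field names here are invented for illustration, not taken from the question):

# Sketch of the "frontpage" idea: bundle the first N headlines into one
# document so each visitor costs one document read instead of N.
# Collection and field names are illustrative only.
from google.cloud import firestore

db = firestore.Client()

def publish_frontpage(limit=100):
    # Server-side job: read the newest articles once, write one summary doc.
    articles = (
        db.collection("articles")
        .order_by("published", direction=firestore.Query.DESCENDING)
        .limit(limit)
        .stream()
    )
    summaries = [
        {"id": a.id, "headline": a.get("headline"), "intro": a.get("intro")}
        for a in articles
    ]
    db.collection("meta").document("frontpage").set({"items": summaries})

def load_frontpage():
    # Client side: a single document read, however many items it holds.
    return db.collection("meta").document("frontpage").get().to_dict()

Each visitor then triggers one billed read for the list view, and further reads only for the individual results they actually open.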
There are many more ways to model the data, depending on the use-cases of your app. To learn more on how to think about such data modeling, I recommend reading NoSQL data modeling and watching the excellent Get to know Cloud Firestore video series.

Reuse of DynamoDB table

Coming from an SQL background, it's easy to fall into a SQL pattern and mindset when designing NOSQL databases like DynamoDB. However, there are many best practices that rely on merging different kinds of data with different schemas in the same table. This can be very efficient to query for related data, in lieu of SQL joins.
However, if I have two distinct types of data with different schemas that are never queried together, is there any reason, since the introduction of on-demand pricing for DynamoDB, to merge them into one table simply to keep the number of tables down? Prior to on-demand, you had to pay for capacity units per hour, so limiting the number of tables was reasonable. But with on-demand, is there any reason not to create 100 tables if you have 100 unrelated data schemas?
I would say that the answer is "no, but":
On-demand pricing is significantly more expensive than provisioned pricing. So unless you're just starting out with DynamoDB with a low volume of requests, or have extremely fluctuating demand, you are unlikely to use just on-demand pricing. Amazon has an interesting blog post titled Amazon DynamoDB auto scaling: Performance and cost optimization at any scale, where they explain how you can reserve some capacity for a year, automatically adjust provisioned capacity in 15-minute intervals (so-called autoscaling), and use on-demand pricing only for demand exceeding those. In such a setup, the cheapest prices are the long-term (yearly, and even 3-year) reservations, and having two separate tables may complicate that reservation.
The benefit of having one table would be especially pronounced if your application's usage of the two different tables fluctuates up and down over the day. The sum of the two demands will usually be flatter than each of the two demands, allowing the cheaper capacity to be used more and on-demand to be used less.
The reason why I answered "no, but" and not "yes" is that it's not clear how important these effects are in real applications, and how much you can save in practice by using one table instead of two. If the number of tables is not two but rather ten, or the number of tables changes as the application evolves, the savings may be even greater.
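For reference, here is a minimal boto3 sketch of the two billing modes being compared; the table names, key, and capacity numbers are placeholders, not recommendations:

# Sketch only: the same table created with on-demand vs. provisioned billing.
# Names and capacity numbers are illustrative.
import boto3

dynamodb = boto3.client("dynamodb")

# On-demand: no capacity planning, you pay per request.
dynamodb.create_table(
    TableName="events_on_demand",
    AttributeDefinitions=[{"AttributeName": "pk", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "pk", "KeyType": "HASH"}],
    BillingMode="PAY_PER_REQUEST",
)

# Provisioned: cheaper per request, but you commit to capacity up front,
# which is where reservations and autoscaling come into the picture.
dynamodb.create_table(
    TableName="events_provisioned",
    AttributeDefinitions=[{"AttributeName": "pk", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "pk", "KeyType": "HASH"}],
    BillingMode="PROVISIONED",
    ProvisionedThroughput={"ReadCapacityUnits": 50, "WriteCapacityUnits": 50},
)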

Azure CosmosDB - partition strategy for dictionary-like object collections

We need to move a huge amount of data out of our memory cache, as it takes up too much space. For that purpose, we are considering CosmosDB. The data structure and use cases are provided at the bottom. While testing it, I ran into a few issues I can't solve: single-item retrieval takes too long (around 2 seconds), transactions seem to cost more RUs than they should, and I can't decide on the optimal throughput.
So, I have these questions:
How should partitioning be handled with the provided data structure? And would it even have an effect?
General throughput during the week should be low (a few hundred requests per second), but we anticipate occasional spikes in requests (dozens of times more). How can we configure the container to avoid the risk of throttling without overpaying when usage is low?
Should I consider an alternative?
[
  {
    id: '<unique_id>',
    hash: '<custom_hash>',
    data: [{}, {}, ...]
  },
  ...
]
There are three use cases for the collection:
Read the whole collection, taking the ids and hashes to identify which items changed
Replace/insert a batch of items if there are changes
Read a single item, retrieving its data property values
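For what it's worth, a rough sketch with the Python SDK of one common approach: partition on /id so the single-item use case becomes a cheap point read. The endpoint, key, and names below are placeholders, and this is an illustration rather than a recommendation.

# Sketch only: /id as partition key makes read_item a point read
# (id + partition key), which is the cheapest way to fetch one item.
# Endpoint, key, and container names are placeholders.
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
database = client.create_database_if_not_exists("cache")
container = database.create_container_if_not_exists(
    id="items",
    partition_key=PartitionKey(path="/id"),
)

def upsert_batch(items):
    # Replace/insert use case: upsert writes or overwrites each item by id.
    for item in items:
        container.upsert_item(item)

def read_single(item_id):
    # Single-item use case: point read, no cross-partition query needed.
    return container.read_item(item=item_id, partition_key=item_id)

def list_ids_and_hashes():
    # Change-detection use case: project only id and hash to keep RU cost down.
    query = "SELECT c.id, c.hash FROM c"
    return list(container.query_items(query, enable_cross_partition_query=True))

For the spiky-load question, autoscale throughput on the container (a maximum RU/s that the service scales down from when traffic is low) is the usual middle ground between fixed provisioning and serverless, but which is cheaper depends on the actual traffic pattern.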

Performance limitations of Scrapy (and other non-service scraping/extraction solutions)

I'm currently using a service that provides a simple to use API to set up web scrapers for data extraction. The extraction is rather simple: grab the title (both text and hyperlink url) and two other text attributes from each item in a list of items that varies in length from page to page, with a max length of 30 items.
The service performs this function well; however, the speed is somewhat slow, at about 300 pages per hour. I'm currently scraping up to 150,000 pages of time-sensitive data (I must use the data within a few days or it becomes "stale"), and I expect that number to grow severalfold. My workaround is to clone these scrapers dozens of times and run them simultaneously on small sets of URLs, but this makes the process much more complicated.
My question is whether writing my own scraper using Scrapy (or some other solution) and running it from my own computer would achieve greater performance than this, or whether this magnitude is simply not within the scope of solutions like Scrapy, Selenium, etc. on a single, well-specced home computer (attached to an 80 Mbit down, 8 Mbit up connection).
Thanks!
You didn't provide the site you are trying to scrape, so I can only answer according to my general knowledge.
I agree Scrapy should be able to go faster than that.
With Bulk Extract, import.io is definitely faster; I have extracted 300 URLs in a minute, so you may want to give it a try.
You do need to respect the website ToUs.
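To give a sense of what that looks like in practice, here is a minimal Scrapy spider sketch; the start URL and CSS selectors are placeholders, and the concurrency/AutoThrottle settings are the main knobs that determine throughput on a single machine:

# Minimal Scrapy spider sketch. The start URL and CSS selectors are
# placeholders; the settings are the knobs that control throughput.
import scrapy

class ListingSpider(scrapy.Spider):
    name = "listings"
    start_urls = ["https://example.com/listings?page=1"]  # placeholder

    custom_settings = {
        "CONCURRENT_REQUESTS": 16,            # parallel requests overall
        "CONCURRENT_REQUESTS_PER_DOMAIN": 8,  # keep per-site load polite
        "AUTOTHROTTLE_ENABLED": True,         # back off if the site slows down
        "DOWNLOAD_DELAY": 0.25,               # base delay between requests
    }

    def parse(self, response):
        for item in response.css("li.item"):   # placeholder selector
            yield {
                "title": item.css("a::text").get(),
                "url": item.css("a::attr(href)").get(),
                "attr1": item.css(".attr1::text").get(),
                "attr2": item.css(".attr2::text").get(),
            }
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)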

Handling extremely large amounts of data in web-based applications

What would be the best way to store a very large amount of data for a web-based application?
Each record has just 3 fields, but there will be around 144 million records a day - stored for one month - 4,464,000,000 records total. Let's round up to 5 billion.
Data has to be searchable on keyword & return results as fast as possible to the end user.
Which programming language?
JSON / XML / Some Database System I've Never Heard Of?
What sort of infrastructure? Imagine this system is only serving the needs of a maximum of 1,000 users at the same time.
I assume the code is the same whether you're searching 10 records or 10 billion; you just have to be a whole lot more efficient. I also assume MySQL/PHP doesn't stand a chance, and we're going to be paying out a very large sum for a hosting solution.
Just need some guidance on where to start, really. Thank you!
There are many tools in the Big Data ecosystem (NoSQL databases, distributed computing, machine learning, search, etc) which can form an answer to your question. Since your application will be write-heavy, I would advocate Apache Cassandra for its excellent write-performance (although it requires more data modeling than a NoSQL/document database such as MongoDB). You also need a Solr or ElasticSearch based search solution, and Map/Reduce for indexes and queries.
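As a rough illustration of the kind of query-driven modeling Cassandra expects (keyspace, table, and column names are invented for this example), you would design the table around the keyword lookup, for instance partitioning by keyword and day so each search reads one bounded partition:

# Sketch only: model the table around the query. Partitioning by
# (keyword, day) keeps each keyword lookup inside one bounded partition.
# Keyspace, table, and column names are invented for illustration.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])  # placeholder contact point
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS events
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

session.execute("""
    CREATE TABLE IF NOT EXISTS events.records_by_keyword (
        keyword     text,
        day         date,
        recorded_at timestamp,
        field1      text,
        field2      text,
        field3      text,
        PRIMARY KEY ((keyword, day), recorded_at)
    ) WITH CLUSTERING ORDER BY (recorded_at DESC)
""")

# Query pattern: all records for a keyword on a given day, newest first.
rows = session.execute(
    "SELECT field1, field2, field3 FROM events.records_by_keyword "
    "WHERE keyword = %s AND day = %s",
    ("example", "2015-01-01"),
)

Exact keyword lookups fit this model directly; fuzzy or full-text matching is where the Solr or Elasticsearch layer mentioned above comes in.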
The programming language doesn't matter unless you have business end-users who will be writing queries against your Big Data, in which case you can use something very SQL-like such as Hive or Pig. To get you started, the following (recent) link might give you some idea of how to pick an analytics stack based on your needs - please note that every database or distributed computing paradigm specializes in some particular use case:
How we picked our analytics stack
Also look at High Scalability for various use cases on how companies tackle their scalability problems.
