Cost efficiencies and scalability in Firestore with reports [closed] - firebase

According to the Firestore documentation as I understand it, sub-collections should be used to save costs. My plan is to have one document per registered user and to nest all the other data I need under it. That way I will only pay for one read per user who signs in, and the rest of the information will live in sub-collections.
The future problem is reporting. With 100 users, a complete report over all of them costs 100 reads; with 50,000 users it will cost 50,000 reads. Additionally, although I don't know the topic of snapshots well, each of these would also generate additional costs on updates.
I would appreciate suggestions, or help clarifying the following:
Is it possible to have a single main document that contains all the information, and use it both for reports and for users to get their data? That is, instead of having N documents, one per user, have a single "maindoc" document whose sub-collections hold all the user data.
Note: as for complementary reporting solutions such as exporting the data to BigQuery or the export API, I don't consider them relevant here, since they also incur N reads, one per document.

I will only pay for one read per user who signs in, and the rest of the information will live in sub-collections
Not really. Firestore queries are shallow by nature, which means reading a document does not return the contents of its sub-collections. Sub-collections are there to make the data easier to organize and understand, not to save cost. Maybe check this question out for more information.
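To make the shallow-read behaviour concrete, here is a minimal sketch using the google-cloud-firestore Python client. The "users"/"orders" collection names and the uid value are placeholders, not something from the question:

```python
from google.cloud import firestore

db = firestore.Client()  # assumes default Google Cloud credentials are configured

uid = "some-user-id"  # placeholder document ID
user_ref = db.collection("users").document(uid)

# One billed read: returns only the fields stored on the user document itself.
user_snapshot = user_ref.get()
print(user_snapshot.to_dict())

# Sub-collection data is NOT included above; fetching it is a separate query,
# billed at one read per document returned.
for order in user_ref.collection("orders").stream():
    print(order.id, order.to_dict())
```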
The future problem is reporting
You get billed $0.06 for every 100,000 document reads (that's the price for my region; yours may differ), so unless you need to run the reports multiple times a day across millions of documents, I think it's OK.
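As a rough illustration of that pricing (the user count and report frequency below are made-up numbers, and the $0.06 per 100,000 reads rate varies by region):

```python
# Back-of-the-envelope read-cost estimate for full-collection reports.
users = 50_000               # documents read per report
reports_per_day = 4          # assumed reporting frequency
price_per_100k_reads = 0.06  # USD, region-dependent

monthly_reads = users * reports_per_day * 30
monthly_cost = monthly_reads / 100_000 * price_per_100k_reads
print(f"{monthly_reads:,} reads ≈ ${monthly_cost:.2f} per month")  # 6,000,000 reads ≈ $3.60
```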
Is it possible to have a single main document that contains all the information, and use it both for reports and for users to get their data?
This is a really bad idea, because you don't only get billed for document reads, you also get billed for network egress, i.e. the amount of network bandwidth that you use. Doing things this way means every user has to download a giant document, which slows down the app and takes a lot of bandwidth.
Would it be a better option to look at SQL alternatives whose cost is based on data size rather than on reads/writes?
This comes down to your use case, but for me the difference in pricing is not that large compared to other BaaS options, and Firebase's documentation is very hard to beat.

Related

Cosmos DB flat VS nested design [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 2 years ago.
Improve this question
Is this Cosmos DB (SQL API) query:
SELECT * FROM c WHERE c.Name = 'John'
Faster or cheaper than
SELECT * FROM c WHERE c.Personal.Name = 'John'
I'm trying to understand the consequences of designing my data flat VS nested (not normalized vs de-normalized).
Thanks
The two versions of the query you mention are probably very close in terms of cost, but in my experience the more important impact of model complexity is the cost of writes. Cosmos creates an index by default for every possible path from the item root. That means the more complex your model, the more paths will be indexed, which directly increases the cost of a write operation. As the indexing docs note:
By optimizing the number of paths that are indexed, you can substantially reduce the latency and RU charge of write operations.
So if you embed a Personal item within your root item with multiple properties, you make your item more expensive to write.
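One way to act on that, sketched here with the azure-cosmos Python SDK, is to narrow the indexing policy to the paths your queries actually filter on. The account URL, key, database name, container name, and included paths below are placeholders, not a definitive setup:

```python
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
database = client.get_database_client("mydb")

# Index only the paths that queries filter on and exclude everything else,
# so deeply nested properties stop adding to the RU charge of every write.
indexing_policy = {
    "indexingMode": "consistent",
    "includedPaths": [{"path": "/Name/?"}, {"path": "/Personal/Name/?"}],
    "excludedPaths": [{"path": "/*"}],
}

container = database.create_container_if_not_exists(
    id="people",
    partition_key=PartitionKey(path="/id"),
    indexing_policy=indexing_policy,
)
```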
There are also quite a few questions on StackOverflow from people asking how to write queries for their complicated object models, who never like the comment "why not a more straightforward model?" If you have the chance, avoid that fate. :)
In general, keeping items as simple and small as possible seems like the rule of thumb to follow. As always, test and see. The RU cost of a query is deterministic, so you can directly see the impact of a change just by tweaking variables and running a quick test.
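"Test and see" can be done straight from code, since the SDK exposes the request charge of each operation. A minimal sketch, assuming the container object from the previous snippet and using the query from the question:

```python
# Run the query and read the RU charge reported for the request.
items = list(container.query_items(
    query="SELECT * FROM c WHERE c.Personal.Name = 'John'",
    enable_cross_partition_query=True,
))
charge = container.client_connection.last_response_headers["x-ms-request-charge"]
print(f"{len(items)} items returned, {charge} RU consumed")
```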

Best way to persist data? [closed]

I have a complex JSON document which I need to persist across two POST requests. Currently I'm storing the serialized JSON in TempData, but the second POST never succeeds because of Error 400 (the size of the request headers is too long). I inspected the cookies in the Chrome debugger.
Am I doing MVC wrong? The data is probably too complex to be stored in TempData. For this example, though, the JSON is only 234 lines (I'm unsure whether that reflects the cookie size accurately). I know I could increase the allowed cookie size, but that wouldn't fix the real issue.
Should I be storing the data some other way?
Basically, in my project I'm posting a value to the controller (many times, via POST), which then uses the value to get a certain part of the JSON. Is Session the only alternative?
I'm still a novice at MVC, so forgive me if I've made a simple mistake.
First, TempData and Session are the same thing. The only difference is the length of persistence: in the former, just until the next request, while in the latter for the life of the session.
Second, session storage has to be configured. If you don't configure it, then something like TempData will attempt to use cookies to persist the data. Otherwise, it will use your session store. Basically, by using any actual session store, you should have no issues with the size of the data.
Third, you have not provided much information about what you're actually doing here, but for the most part sessions (Session or TempData) are a poor choice for persistence. The data you're trying to store between requests does not sound like it is user-specific, which makes sessions a particularly poor choice. Most likely you want a distributed cache here, though you could potentially get by with an in-memory cache. You should also consider whether you need to persist this data at all. It's far too common to over-optimize by worrying about, for example, running the same query against a database multiple times. Databases are designed to efficiently retrieve large amounts of data and, properly set up, can handle many thousands of simultaneous queries without breaking a sweat. Ironically, sometimes caching a query doesn't actually save you anything over just running the query, especially with distributed caching mechanisms.
Simple is better than complex. Start simple. Solve the problem in the most straight-forward way possible. If that involves issuing the same query multiple times, do so. It doesn't matter. Then, once you have a working solution, profile. If it's running slower than you like, or starts to fall down when fielding 1000s of requests, then look into ways to optimize it by caching, etc. Developers waste an enormous amount of time and energy trying to optimize things that aren't actually even problems.

Is web scraping allowed? [closed]

I'm working on a project that requires certain statistics from another website, and I've created an HTML scraper that automatically fetches this data every 15 minutes. However, I have stopped the bot for now, because their terms of use mention that they do not allow it.
I really want to respect this, especially if there is a law prohibiting me from taking this data, but I have contacted them by email several times without a single answer, so I've come to the conclusion that I'll simply grab the data, if it is legal to do so.
On certain forums I've read that it IS legal, but I would much rather get a more "precise" answer here on Stack Overflow.
And assuming this is in fact not illegal, would they have any software able to spot my bot making connections every 15 minutes?
Also, when talking about taking their data, we're talking about a single number for each "team", and I will convert this number into a figure of our own.
I'll quote Pablo Hoffman's (Scrapinghub co-founder) answer to "What is the legality of web scraping?", which I found on another site:
First things first: I am not a lawyer and these comments are solely based on my experience working at Scrapinghub; please seek legal assistance accordingly.
Here are a few things to consider when scraping public data from websites (note that the following addresses only US law):
As long as they don't crawl at a disruptive rate, scrapers do not breach any contract (in the form of terms of use) or commit a crime (as defined in the Computer Fraud and Abuse Act).
A website's user agreement is not enforceable as a browsewrap agreement, because companies do not provide sufficient notice of the terms to site visitors.
Scrapers access website data as a visitor, following paths similar to a search engine. This can be done without registering as a user (and explicitly accepting any terms). In Nguyen v. Barnes & Noble, Inc. the court ruled that simply placing a link to terms of use at the bottom of a webpage is not sufficient to "give rise to constructive notice." In other words, there is nothing on a public page implying that merely accessing the information is subject to any contractual terms. Scrapers give neither explicit nor implicit assent to any agreement, and therefore breach no contract.
Social networks, for example, assign the value of becoming a user (based on the call to action on the public page) as the ability to: i) gain access to full profiles, ii) identify common friends/connections, iii) get introduced to others, and iv) contact members directly. As long as a scraper makes no attempt to perform any of these actions, it does not gain "unauthorized access" to the service and thus does not violate the CFAA.
A thorough evaluation of the legal issues involved can be seen here: http://www.bna.com/legal-issues-raised-by-the-use-of-web-crawling-and-scraping-tools-for-analytics-purposes
There should be a robots.txt file in the root folder of the site.
It specifies which paths scrapers are forbidden to hit and which are allowed (sometimes with acceptable crawl delays specified).
If that file doesn't exist, nothing is explicitly restricted, and the failure to provide that information is the website owner's responsibility, not yours.
Also, here you can find some explanation of the robots exclusion standard.
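The robots exclusion standard can also be checked programmatically; Python's standard library ships a parser for it. A small sketch, where the site URL, path, and user-agent string are placeholders:

```python
from urllib.robotparser import RobotFileParser

robots = RobotFileParser("https://example.com/robots.txt")
robots.read()

user_agent = "MyStatsBot/1.0"
url = "https://example.com/teams/stats"

if robots.can_fetch(user_agent, url):
    print("robots.txt allows fetching this path")
else:
    print("robots.txt disallows this path for this user agent")

# If the site specifies a crawl delay, honour it rather than a fixed 15-minute interval.
print("crawl delay:", robots.crawl_delay(user_agent))
```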

How to store big data? [closed]

Suppose we have a web service that serves 20,000 users, each of whom is linked to 300 unique user-data entities containing whatever. Here's a naive approach to designing an example relational database able to store the above data:
Create table for users.
Create table for user data.
And thus the user-data table contains 6,000,000 rows.
Querying tables that have millions of rows is slow, especially since we have to deal with hierarchical data and do some uncommon computations quite different from SELECT * FROM userdata. At any given point we only need a specific user's data, not the whole thing - getting it is fast - but we have to do weird stuff with it later, multiple times.
I'd like our web service to be fast, so I thought of the following approaches:
Optimize the hell out of queries, do a lot of caching etc. This is nice, but these are just temporary workarounds. When database grows even further, these will cease to work.
Rewriting our model layer to use NoSQL technology. This is not possible due to the lack of relational database features, and even if we wanted this approach, early tests made some functionality even slower than it already was.
Implement some kind of scalability. (You hear about cloud computing a lot nowadays.) This is the most wanted option.
Implement some manual solution. For example, I could store all the users with names beginning with letter "A..M" on server 1, while all other users would belong to server 2. The problem with this approach is that I have to redesign our architecture quite a lot and I'd like to avoid that.
Ideally, I'd have some kind of transparent solution that would allow me to query seemingly uniform database server with no changes to code whatsoever. The database server would scatter its table data on many workers in a smart way (much like database optimizers), thus effectively speeding everything up. (Is this even possible?)
In both cases, achieving interoperability seems like a lot of trouble...
Switching from SQLite to Postgres or Oracle solution. This isn't going to be cheap, so I'd like some kind of confirmation before doing this.
What are my options? I want all my SELECTs and JOINs with indexed data to be real-time, but the bigger the userdata is, the more expensive queries get.
I don't think you should use NoSQL by default just because you have that amount of data. What kind of issue are you expecting it to solve?
IMHO this depends on your queries. You haven't mentioned any kind of massive write load, so SQL is still appropriate so far.
It sounds like you want to perform queries using JOINs. This can be slow on very large data sets even with appropriate indexes. What you can do is lower your level of decomposition and simply duplicate data (so it all lives in one database row and is fetched from disk together). If you are concerned about latency, avoiding joins is a good approach, but it still doesn't rule out SQL, since you can duplicate data in SQL as well.
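A minimal sketch of that duplication idea, using SQLite (which the question mentions) and one JSON blob per user so a single indexed lookup returns everything. The table and column names are made up for illustration:

```python
import json
import sqlite3

conn = sqlite3.connect("app.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS user_profile (
        user_id  INTEGER PRIMARY KEY,
        name     TEXT NOT NULL,
        entities TEXT NOT NULL  -- the ~300 per-user entities duplicated into one JSON blob
    )
""")

def save_profile(user_id, name, entities):
    # Denormalized write: everything for one user lands in a single row.
    conn.execute(
        "INSERT OR REPLACE INTO user_profile (user_id, name, entities) VALUES (?, ?, ?)",
        (user_id, name, json.dumps(entities)),
    )
    conn.commit()

def load_profile(user_id):
    # Single indexed lookup, no JOIN: the whole per-user payload comes back together.
    row = conn.execute(
        "SELECT name, entities FROM user_profile WHERE user_id = ?", (user_id,)
    ).fetchone()
    return (row[0], json.loads(row[1])) if row else None
```

The trade-off is that any partial update rewrites the whole blob, so this fits read-heavy, per-user access patterns best.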
Significant for your decision should be the structure of your queries: do you want to SELECT only a few fields in your queries (SQL), or do you always want to fetch the whole document (e.g. Mongo and JSON)?
The second significant criterion is scalability. NoSQL often relaxes the usual SQL guarantees (settling for eventual consistency, for example), so it can provide better results when scaling out.

ASP.NET application performance [closed]

I have an ASP.NET 4.0 application. I have an .mdf file in my App_Data folder in which I store some data. There is a "User" table with 15 fields and an "Answers" table with about 30 fields. In most of the scenarios on my website, the user retrieves some data from the "User" table and writes some data to the "Answers" table.
I want to test the performance of my application when about 10,000 users use the system. What will happen if 10,000 users log in and use the system at the same time, and how will performance be affected? What is the best practice for testing the performance of ASP.NET pages in general?
Any help will be appreciated.
Thanks in advance.
It reads like performance testing/engineering is not your core discipline. I would recommend hiring someone to either run this effort or assist you with it. Performance testing is a specialized development practice with specific requirement sets, tool expertise and analytical methods. It takes quite a while to become effective in the discipline even in the best case conditions.
In short, you begin with your load profile. You progress to definitions of the business process in your load profile. You then select a tool that can exercise the interfaces appropriately. You will need to set a defined initial condition for your testing efforts. You will need to set specific, objective measures to determine system performance related to your requirements. Here's a document which can provide some insight as a benchmark on the level of effort often required, http://www.tpc.org/tpcc/spec/tpcc_current.pdf
Something which disturbs me greatly is your use case of "at the same time," which is a practical impossibility for systems where the user agent is not synchronized to a clock tick. Users can be close, concurrent within a defined window, but true simultaneity is exceedingly rare.
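As one concrete example of the kind of tool such an effort might select (Locust, a Python load-testing library; the endpoints and pacing below are hypothetical, not taken from the question), note how simulated users pace themselves with think time rather than firing truly "at the same time":

```python
from locust import HttpUser, task, between

class AnswerUser(HttpUser):
    # 1-5 seconds of think time between actions per simulated user,
    # approximating "concurrent within a window" rather than simultaneity.
    wait_time = between(1, 5)

    @task(3)
    def view_profile(self):
        self.client.get("/user/profile")  # hypothetical read against the "User" table

    @task(1)
    def submit_answers(self):
        self.client.post("/answers", json={"questionId": 1, "value": "A"})  # hypothetical write
```

Run with something like locust -f loadtest.py --host=https://staging.example.com --users 10000 --spawn-rate 50, and watch response times and error rates as the simulated user count ramps up.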
