Performance of Firebase with large data sets

I'm testing Firebase for a project that may have a reasonably large number of keys, potentially millions.
I've tested loading a few tens of thousands of records using Node, and the load performance appears good. However, the Forge web UI becomes unusably slow and renders every single record if I expand my root node.
Is Firebase not designed for this volume of data, or am I doing something wrong?

It's simply a limitation of the Forge UI. It's still fairly rudimentary.
The real-time functions in Firebase are not only suited for, but designed for, large data sets. The fact that records stream in real time is perfect for this.
Performance is, as with any large data app, only as good as your implementation. So here are a few gotchas to keep in mind with large data sets.
DENORMALIZE, DENORMALIZE, DENORMALIZE
If a data set will be iterated, and its records can be counted in thousands, store it in its own path.
This is bad for iterating large data sets:
/users/uid
/users/uid/profile
/users/uid/chat_messages
/users/uid/groups
/users/uid/audit_record
This is good for iterating large data sets:
/user_profiles/uid
/user_chat_messages/uid
/user_groups/uid
/user_audit_records/uid
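As a rough illustration, a write can fan out across those top-level paths in a single multi-location update. This is a minimal sketch assuming the v8 (namespaced) Firebase JavaScript SDK and an already-initialised app; the uid, message id, and field values are placeholders:

```typescript
// Minimal sketch, assuming the v8 (namespaced) Firebase web SDK and an
// already-initialised app; uid and msgId are placeholder values.
import firebase from "firebase/app";
import "firebase/database";

const uid = "some-user-id";      // hypothetical user id
const msgId = "some-message-id"; // hypothetical chat message id

// A single multi-location update writes the profile and a chat message to
// separate top-level paths, so each list can later be iterated on its own.
firebase.database().ref().update({
  [`user_profiles/${uid}`]: { name: "Jane", email: "jane@example.com" },
  [`user_chat_messages/${uid}/${msgId}`]: { text: "hello", ts: Date.now() },
});
```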
Avoid 'value' on large data sets
Use child_added instead, since value must load the entire record set to the client.
Watch for hidden value operations on children
When you use child_added, you are essentially calling value on every child record. So if those children contain large lists, they will have to load all that data to return it. Hence the DENORMALIZE section above.
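For example, with the denormalized layout above you can stream chat messages one child at a time instead of pulling the whole branch with value. Again, a sketch assuming the v8 (namespaced) web SDK and a placeholder uid:

```typescript
// Sketch only: stream children incrementally rather than loading the whole
// list with a single 'value' read (v8 / namespaced SDK assumed).
import firebase from "firebase/app";
import "firebase/database";

const uid = "some-user-id"; // placeholder
const messagesRef = firebase.database().ref(`user_chat_messages/${uid}`);

// 'child_added' fires once per existing child and then for each new one,
// so the client never has to download the entire branch in one shot.
messagesRef.limitToLast(100).on("child_added", (snapshot) => {
  console.log(snapshot.key, snapshot.val());
});

// By contrast, this forces the whole branch to be loaded before the
// callback fires:
// messagesRef.on("value", (snapshot) => { /* entire list in memory */ });
```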

Related

Firebase BigQuery Export Schema Size Difference

We have migrated all of our old Firebase BigQuery events tables to the new schema using the provided script. One thing we noticed was that the size of the daily tables increased dramatically.
For example, the data from 4/1/18 in the old schema was 3.5 million rows and 8.7 GB. Once migrated, the new table for the same date is 32.3 million rows and 27 GB. That is nearly 10 times as many rows and over 3x the storage size.
Can someone tell me why the same data is so much larger in the new schema?
The result is that we are getting charged significantly more in BigQuery query costs when reading the tables from the new schema versus the old schema.
firebaser here
While increasing the size of the exported data definitely wasn't a goal, it is an expected side-effect of the new schema.
In the old storage format the events were stored in bundles. While I don't know exactly how the events were bundled, it was always a group of events, each with its own unique properties plus a set of shared properties. This meant that you frequently had to unnest the data in your query, or cross join the table with itself, to get to the raw data, and then combine and group it again to fit your requirements.
In the new storage format, each event is stored separately. This definitely increases the storage size, since properties that were shared between events in a bundle are now duplicated for each event. But the queries you write against the new format should be easier to read and can process the data faster, since they don't have to unnest it first.
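To make the shape difference concrete, here is a rough TypeScript sketch; the field names are illustrative, not the actual export columns:

```typescript
// Illustrative only: field names are made up to show the shape difference,
// not the real Firebase/BigQuery export schema.

// Old schema: one row per bundle; shared properties stored once, with the
// individual events nested inside a repeated field.
interface OldBundleRow {
  user_id: string;
  device: { model: string; os_version: string }; // shared across the bundle
  events: { name: string; timestamp: number; params: Record<string, string> }[];
}

// New schema: one row per event, so the formerly shared properties are
// repeated on every row. Row count and storage grow, but queries no longer
// need to UNNEST the nested events before filtering or grouping.
interface NewEventRow {
  user_id: string;
  device: { model: string; os_version: string }; // duplicated per event
  event_name: string;
  event_timestamp: number;
  event_params: Record<string, string>;
}
```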
So the larger storage size should come with slightly faster processing. But I can totally imagine the sticker shock when you see the difference and realize the improved speed doesn't always make up for it. I apologize if that is the case, and I've been assured that we don't have any other big schema changes planned from here on.

Write millions of documents into Riak

What is the best way to add a huge number of documents to Riak? Let's say there are millions of product records, which change very often (prices, ...), and we want to update all of them very frequently. Is there a better way than replacing keys one by one in Riak? Something like a bulk set of 1000 documents at once...
There are unfortunately no bulk operations available in Riak, so this has to be done by updating each object individually. If your updates arrive in bulk, however, it may be worthwhile revisiting your data model. If you can de-normalise your products, perhaps by storing a range of products in a single object, it might be possible to reduce the number of updates that need to be performed by grouping them, thereby reducing the load on the cluster.
When modelling data in Riak you usually need to look at access and query patterns in addition to the structure of the data, and make sure that the model supports all types of queries and latency requirements. This quite often means de-normalising your model by either grouping or duplicating data in order to ensure that updates and queries can be performed as efficiently as possible, ideally through direct K/V access.
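For example, one way to group products is to key one object per batch and write each group with a single PUT against Riak's HTTP API. This is only a sketch: the bucket name, batch size, and default HTTP port (8098) are assumptions, not taken from the question.

```typescript
// Sketch: group products into one Riak object per batch and store each group
// with a single HTTP PUT. Bucket name, batch size and the default HTTP port
// (8098) are assumptions.

interface Product { id: string; price: number }

const RIAK_URL = "http://localhost:8098";
const BATCH_SIZE = 1000;

async function storeProductGroups(products: Product[]): Promise<void> {
  for (let i = 0; i < products.length; i += BATCH_SIZE) {
    const batch = products.slice(i, i + BATCH_SIZE);
    const key = `batch_${i / BATCH_SIZE}`; // hypothetical grouping key
    // One object per group: one write instead of 1000 individual writes.
    await fetch(`${RIAK_URL}/buckets/products/keys/${key}`, {
      method: "PUT",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(batch),
    });
  }
}
```

The trade-off is that updating a single price now means reading and rewriting its whole group object, so this only pays off when updates genuinely arrive in bulk.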

How well does scriptDB work as substitute for storing data directly in Google Spreadsheets?

I want to retrieve data from Google Analytics API, create custom calculations and then push the aggregations to a Google Spreadsheets in order to reuse in Google Visualisation API app. My concern is that I'll hit the Spreadsheet cell quota very quickly with the raw data needed for the calculation.
I know scriptDB quota is 100MB but before I invest time and resources in learning how it works I'd like to get an idea whether it's feasible for storing raw analytics data (provided it's not too granular and it's just designed to answer specific questions) and how much of it I could realistically store in scriptDB (relative to spreadsheets) before I hit the quota.
Thanks
For bulk data access (e.g. reading a table for Visualization), a spreadsheet will have a speed advantage over ScriptDb (see "What is faster: ScriptDb or SpreadsheetApp?"). If you wish to support more sophisticated queries, though, to "answer specific questions" as you mention, then ScriptDb will give you an edge, as query times vary with the number of results but should be unaffected by the query criteria themselves.
With data in a spreadsheet, you will be able to obtain a DataTable for Visualization with a single Range.getDataTable() operation. With ScriptDb, you will need to write a script to build your DataTable.
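As a rough sketch of both routes in Apps Script (ScriptDb as it existed before its deprecation; the sheet name, the "metrics" record type, and the column names are hypothetical):

```typescript
// Sketch in Apps Script style. The sheet name, the 'metrics' record type and
// the column names are hypothetical; ScriptDb has since been deprecated.

// Spreadsheet route: a single call turns a range into a DataTable.
function tableFromSheet() {
  var range = SpreadsheetApp.getActive().getSheetByName("analytics").getDataRange();
  return range.getDataTable(true); // first row used as headers
}

// ScriptDb route: query the stored objects and build the DataTable yourself.
function tableFromScriptDb() {
  var db = ScriptDb.getMyDb();
  var results = db.query({ type: "metrics" }); // hypothetical record type
  var table = Charts.newDataTable()
      .addColumn(Charts.ColumnType.STRING, "Page")
      .addColumn(Charts.ColumnType.NUMBER, "Pageviews");
  while (results.hasNext()) {
    var row = results.next();
    table.addRow([row.page, row.pageviews]);
  }
  return table.build();
}
```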
Regarding size constraints, it's not possible to really compare the two without knowing the size of your individual data elements. You're already aware of the general constraints:
Spreadsheet: 40K cells, but you may hit an (unspecified) size limit before that, depending on data element sizes.
ScriptDb: 50MB, 100MB or 200MB depending on account type. The number of objects that can be stored is affected by the complexity (depth) of the objects and the size of the property names, and of course by the size of the data contained in the objects.
Ultimately, the question of which is best for your application is a matter of opinion, and of which factors matter most for the application. If the analytics data is tabular, then a spreadsheet offers advantages for implementation largely because of Range.getDataTable(), and is faster for bulk access. I'd recommend starting there, and considering a move to ScriptDb if and when you actually hit spreadsheet size or query performance limitations.

Pulling all the data from a SQL database and then filtering it using LINQ

I have a list of user permissions. I'm thinking of pulling all the data from the UserPermission table, putting it in cache, and then filtering it by userID using LINQ.
So the next time somebody tries to access the User Permission screen, I'll have the data in cache and can just filter and display what's needed based on the userID.
Is there any performance benefit in doing it this way? Or is it still faster to filter in the database/data layer?
There are two factors at play here:
1. The speed of filtering the data via LINQ
2. The speed of filtering the data via SQL and returning that data from your data store to the application
Generally speaking, for small sets of data, the difference between 1 and 2 will be negligible.
For large sets of data with a small result, 2 will have better performance.
For large sets of data with a large result, 1 will have better performance.
But of course this generalisation is totally useless as it all depends on your environment and your code.
You need to weigh up latency vs performance vs stale data.
It's always different for different use cases...
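As a language-agnostic sketch of the cache-then-filter approach (TypeScript here for illustration; in .NET the equivalent would be something like a MemoryCache plus a LINQ Where over the cached list, and the loadAllPermissions call and cache TTL below are hypothetical):

```typescript
// Sketch of the cache-then-filter pattern. loadAllPermissions() and the
// 5-minute TTL are hypothetical placeholders.

interface UserPermission { userId: number; permission: string }

let cache: { data: UserPermission[]; loadedAt: number } | null = null;
const TTL_MS = 5 * 60 * 1000;

// Hypothetical data-access call that pulls the whole UserPermission table.
declare function loadAllPermissions(): Promise<UserPermission[]>;

async function permissionsFor(userId: number): Promise<UserPermission[]> {
  if (!cache || Date.now() - cache.loadedAt > TTL_MS) {
    cache = { data: await loadAllPermissions(), loadedAt: Date.now() };
  }
  // In-memory filter: fast once cached, but the data can be stale and the
  // whole table has to fit in application memory.
  return cache.data.filter(p => p.userId === userId);
}
```

Whether this beats a per-user WHERE query depends on table size, how often permissions change, and how much staleness you can tolerate.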
But this kind of thing is already built into .NET - look up ASP.NET Authorization Providers.

Alternatives to DataTable

In my web application, I have a dynamic query that returns a huge amount of data into a DataTable, and this query is often re-run with different parameters, so the database is exhausted.
I want to fetch all the records, with no parameters, into an object and run queries (maybe with LINQ) against that object, so the database will not be exhausted.
Which objects can be used instead of DataTable?
This is one of my pet peeves - people who return all the data from the database.
There is absolutely no need for this unless you are doing reporting.
If you are doing reporting, then you need to increase your hardware capability so that the database can cope. This may also include tuning your database, rearranging tables, reindexing, regular rebuilding of indexes, updating statistics, archiving out old data, etc.
If you are NOT doing reporting, then start limiting how much data can be queried at any one time. Users DO NOT need to see massive quantities of data all at once. They need to see discrete amounts of data presented in a manageable and coherent way.
Another rule of thumb I like to observe is: let your database server do the work. It is made to manipulate lots of data; that is what it is good at, and it should have the power to do it. Pulling loads of data back to the client and then trying to manipulate it there is a foolish thing to do. If your client machines are more powerful than the database server, then you have issues.
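As a minimal illustration of pushing the limiting and filtering to the database rather than the client (the runQuery helper and the table/column names are hypothetical):

```typescript
// Sketch: ask the database for one page of filtered rows instead of the
// whole table. runQuery() and the table/column names are hypothetical.

interface Order { id: number; customerId: number; total: number }

// Hypothetical parameterised query helper provided by your data layer.
declare function runQuery<T>(sql: string, params: unknown[]): Promise<T[]>;

async function getOrdersPage(customerId: number, page: number, pageSize = 50): Promise<Order[]> {
  // The database filters, sorts and pages; only one screenful of rows ever
  // crosses the wire to the client.
  return runQuery<Order>(
    "SELECT id, customer_id, total FROM orders " +
    "WHERE customer_id = ? ORDER BY id LIMIT ? OFFSET ?",
    [customerId, pageSize, page * pageSize]
  );
}
```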
Never ever do this (except for caching)!
You are trying to reimplement DB mechanisms, like
persistent storage
index search and query strategy
replication
and so on
Spend your time on DB optimization instead (optimal schema, indexes, queries, partitioning).
