Loading Bulk data in Firebase

I am trying to use the set API to write an object to Firebase. The object is fairly large; the serialized JSON is 2.6 MB in size. The root node has around 90 children, and in all there are around 10,000 nodes in the JSON tree.
The set API seems to hang and never calls the callback.
It also seems to cause problems with the Firebase instance.
Any ideas on how to work around this?

Since this is a commonly requested feature, I'll go ahead and merge Robert and Puf's comments into an answer for others.
There are some tools available to help with big data imports, like firebase-streaming-import. What they do internally can also be engineered fairly easily for the do-it-yourselfer:
1) Get a list of keys without downloading all the data, using a GET request and shallow=true. Possibly do this recursively depending on the data structure and dynamics of the app.
2) In some sort of throttled fashion, upload the "chunks" to Firebase using PUT requests or the API's set() method.
The critical things to keep in mind here are that the number of bytes in a request and the frequency of requests will have an impact on performance for others using the application, and will also count against your bandwidth.
A good rule of thumb is that you don't want to do more than ~100 writes per second during your import, preferably fewer than 20 to maximize realtime speeds for other users, and that you should keep the data chunks in the low MBs, certainly not GBs, per chunk. Keep in mind that all of this has to go over the internets.
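To make that concrete, here is a rough Python sketch of those two steps against the REST API. The database URL, auth token, target path, and sleep interval are placeholders, not values from the question:

import json
import time

import requests

# Placeholders (assumptions): your database URL, an auth token/secret,
# and the path you are importing into.
DB_URL = "https://your-project.firebaseio.com"
PARAMS = {"auth": "YOUR_DATABASE_SECRET_OR_ID_TOKEN"}
PATH = "imported"

# 1) A shallow GET returns only the child keys at a location, not the data,
#    which is handy for checking what has already been written (resuming).
resp = requests.get(f"{DB_URL}/{PATH}.json", params={**PARAMS, "shallow": "true"})
existing = set((resp.json() or {}).keys())

# 2) Chunk the local export by top-level child and PUT each chunk, throttled.
with open("export.json") as f:
    data = json.load(f)

for key, child in data.items():
    if key in existing:
        continue  # already uploaded in a previous run
    requests.put(f"{DB_URL}/{PATH}/{key}.json", params=PARAMS, json=child).raise_for_status()
    time.sleep(0.1)  # ~10 writes/second, comfortably under the guideline above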

Related

How to delete a lot of data in the Realtime Database

I want to delete all the data in my Realtime Database without increasing the usage load on the database. There are 420,000+ records in it. Any idea how to delete that data?
If you can help me, it would be very useful.
There is support for deleting large nodes built into the Firebase CLI these days as explained in this blog How to Perform Large Deletes in the Realtime Database:
If you want to delete a large node, the new recommended approach is to use the Firebase CLI (> v6.4.0). The CLI automatically detects a large node and performs a chunked delete efficiently.
$ firebase database:remove /path/to/delete
My initial write-up is below. 👇 I'm pretty sure the CLI mentioned above implements precisely this approach, so that's likely a faster way to accomplish this, but I'm still leaving this explanation as it may be useful as background.
Deleting data is a write operation, so it's by definition going to put load on the database. Deleting a lot of data causes a lot of load, either as a spike in a short period or (if you spread it out) as a lifted load for a longer period. Spreading the load out is the best way to minimize impact for your regular users.
The best way to delete a long, flat list of keys (as you seem to have) is to:
Get a list of the keys, either from a backup of your database (which happens out of band), or by using the shallow parameter on the REST API.
Delete the data in reasonable batches, where reasonable depends on the amount of data you store per key. If each key is just a few properties, you could start deleting 100 keys per batch, and check how that impacts the load to determine if you can ramp up to more keys per batch.
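For illustration, here is a minimal Python sketch of those two steps against the REST API. The database URL, auth token, path, and batch size are placeholders. PATCHing null values removes the corresponding children, so each batch is a single request:

import time

import requests

# Placeholders (assumptions): your database URL, auth, and the flat list's path.
DB_URL = "https://your-project.firebaseio.com"
PARAMS = {"auth": "YOUR_DATABASE_SECRET_OR_ID_TOKEN"}
PATH = "items"
BATCH_SIZE = 100  # start small and ramp up while watching the load graph

# 1) Get just the keys (shallow), not the 420k records themselves.
resp = requests.get(f"{DB_URL}/{PATH}.json", params={**PARAMS, "shallow": "true"})
keys = list((resp.json() or {}).keys())

# 2) Delete in batches: PATCHing null values removes those children,
#    one request per batch.
for i in range(0, len(keys), BATCH_SIZE):
    batch = {key: None for key in keys[i : i + BATCH_SIZE]}
    requests.patch(f"{DB_URL}/{PATH}.json", params=PARAMS, json=batch).raise_for_status()
    time.sleep(0.5)  # spread the load out instead of spiking it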

Firebase DB slow download

I have a Firebase Realtime Database that I am using to track user analytics. Currently there are about 11,000 users, and each of them has quite a few entries (from ten to a few hundred, depending on how long they interacted with the app). The JSON file is 76 MB when I export the whole DB.
I am using this data only for analytics, so I look at it once per day or so; i.e. I need to download the whole DB to get all the data.
When I do that, it takes about 3-5 minutes to actually load the data. I can imagine that if there were ten times more users, it would no longer be usable because of the load time.
So I am wondering whether these load times are normal, and whether this is really bad practice? The reason I always download the whole DB is that I want overall numbers, i.e. how many users are registered and, for example, how many ads were watched. To do that, I need to go into each user, see how many ads they watched, and add them up. I can't do that without access to the data of all users.
This is the first time I am doing something like this at a somewhat larger scale, and those 76 MB are a bit surprising to me, as are the load times. It seems like this setup is not feasible long term.
If you only need this data yourself, consider using the automated backups to get access to the JSON. These backups are made out-of-band, meaning that they (unlike your current process) don't interfere with the handling of other client requests.
Additionally, if you're only using the database for gathering user analytics, consider offloading the data to a database that's more suitable for this purpose. So: use the Realtime Database for the users to send the data to you, but move it from there to a cheaper/better place after that.
For example, it is quite common to transfer the data to BigQuery, which has much better ad-hoc querying capabilities than Realtime Database.
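As a rough sketch of that offloading step, you could flatten the exported backup JSON into one row per user and load it with the BigQuery Python client. The table name and the adsWatched field below are made up for illustration:

import json

from google.cloud import bigquery

# Hypothetical names for illustration.
TABLE_ID = "my-project.analytics.user_stats"

# The Realtime Database backup is one big JSON object keyed by user id;
# flatten it into one row per user.
with open("backup.json") as f:
    users = json.load(f)

rows = [
    {"user_id": uid, "ads_watched": len(u.get("adsWatched", {}))}
    for uid, u in users.items()
]

client = bigquery.Client()
job = client.load_table_from_json(
    rows,
    TABLE_ID,
    job_config=bigquery.LoadJobConfig(autodetect=True),
)
job.result()  # wait for the load job to finish
print(f"Loaded {len(rows)} rows into {TABLE_ID}")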

Huge amount of RUs to write a document of 400 KB - 600 KB on Azure Cosmos DB

This is the log of my Azure Cosmos DB for the last write operations:
Is it possible that write operations on documents between 400 KB and 600 KB in size cost this much?
Here is my document (a list of coordinates):
At the beginning I thought it was a hot-partition problem, but afterwards I understood (I hope) that it is a problem with loading documents ranging in size from 400 KB to 600 KB. I wanted to understand whether there is something wrong with the database settings, the indexing policy, or something else, as it seems anomalous to me that about 3,000 RUs are used to load a 400 KB JSON document, when the documentation indicates that loading a document of around 100 KB takes about 50 RUs. The document to be loaded is a road route, so I don't know how else I could model it.
This is my indexing policy:
Thanks to everybody. I spent months on this problem without finding a solution...
It's hard to know for sure what the expected RU cost should be to ingest a 400 KB - 600 KB item. The cost of this operation will depend on the size of the item, your indexing policy, and the structure of the item itself. Greater hierarchy depth is more expensive to index.
You can get a good estimate for what the cost for a single write for an item will be using the Cosmos Capacity Calculator. In the calculator, click Sign-In, cut/paste your index policy, upload a sample document, reduce the writes per second to 1, then click calculate. This should give you the cost to insert a single item.
One thing to note here: if you have frequent updates to a small number of properties, I would recommend you split the document in two, one with the static properties and another that is frequently updated. This can drastically reduce the cost of updates on large documents.
Hope this is helpful.
You can also pull the RU cost for a write using the SDK.
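For example, with the Python SDK (a sketch only; your stack may differ), the charge of the last operation is reported in the x-ms-request-charge response header. Endpoint, key, database, container, and the partitionKey field are placeholders:

from azure.cosmos import CosmosClient

# Placeholders for illustration.
client = CosmosClient("https://your-account.documents.azure.com:443/", credential="YOUR_KEY")
container = client.get_database_client("your-db").get_container_client("your-container")

# A made-up document; "partitionKey" stands in for whatever your partition key path is.
item = {"id": "route-1", "partitionKey": "route-1", "coordinates": [[12.49, 41.89], [12.50, 41.90]]}
container.create_item(body=item)

# The RU charge of the last operation is reported in the response headers.
charge = container.client_connection.last_response_headers["x-ms-request-charge"]
print(f"Write cost: {charge} RUs")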
Check storage consumed
To check the storage consumption of an Azure Cosmos container, you can run a HEAD or GET request on the container and inspect the x-ms-resource-quota and x-ms-resource-usage headers. Alternatively, when working with the .NET SDK, you can use the DocumentSizeQuota and DocumentSizeUsage properties to get the storage consumed.
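A quick sketch of the same check with the Python SDK, assuming its populate_quota_info option (account, key, and container names are placeholders):

from azure.cosmos import CosmosClient

# Placeholders for illustration.
client = CosmosClient("https://your-account.documents.azure.com:443/", credential="YOUR_KEY")
container = client.get_database_client("your-db").get_container_client("your-container")

# Ask for quota info on the read, then inspect the usage header on the response.
container.read(populate_quota_info=True)
usage = container.client_connection.last_response_headers.get("x-ms-resource-usage", "")
print(usage)  # semicolon-separated values, including documentsSize and collectionSize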

BigQuery streaming best practice

I have been using Google BigQuery for some time now, loading data via file uploads.
As I get some delays with this method, I am now trying to convert my code to streaming.
I'm looking for the best solution here; what is the more correct way of working with BQ:
1. Using multiple (up to 40) different streaming machines, or directing traffic to one or more endpoints to upload the data?
2. Uploading one row at a time, or stacking up a list of 100-500 events and uploading that?
3. Is streaming the way to go, or should I stick with file uploads, in terms of high volumes?
Some more data:
- We are uploading ~1,500-2,500 rows per second.
- Using the .NET API.
- Data needs to be available within ~5 minutes.
I didn't find a reference for this elsewhere.
The big difference between streaming data and uploading files is that streaming is intended for live data that is being produced on real time while being streamed, whereas with uploading files, you would upload data that was stored previously.
In your case, I think streaming makes more sense. If something goes wrong, you would only need to re-send the failed rows, instead of the whole file. And it adapts better to the growing volume of data that I think you're handling.
The best practices in any case are:
Trying to reduce the number of sources that send the data.
Sending bigger chunks of data in each request instead of multiple tiny chunks.
Using exponential back-off to retry those requests that could fail due to server errors (These are common and should be expected).
There are certain limits that apply to Load Jobs as well as to Streaming inserts.
For example, when using streaming you should insert less than 500 rows per request and up to 10,000 rows per second per table.
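As a rough sketch of those practices with the Python client library (the question uses the .NET API, so treat the table name and fields as illustrative): buffer incoming events, stream them in batches of up to 500 rows, and retry transient failures with exponential back-off.

import random
import time

from google.cloud import bigquery

# Hypothetical table name for illustration.
TABLE_ID = "my-project.analytics.events"
MAX_ROWS_PER_REQUEST = 500  # stay under the streaming limit mentioned above

client = bigquery.Client()


def insert_batch(rows, max_attempts=5):
    """Stream one batch, retrying server errors with exponential back-off."""
    for attempt in range(max_attempts):
        try:
            errors = client.insert_rows_json(TABLE_ID, rows)
            if errors:
                print(f"Row-level errors: {errors}")  # bad rows are not transient
            return
        except Exception:
            # Transient server error: back off exponentially with jitter, then retry.
            time.sleep((2 ** attempt) + random.random())
    raise RuntimeError("Batch failed after retries")


# Example: buffer events and flush in chunks of up to 500 rows.
buffer = [{"user_id": str(i), "event": "ad_watched"} for i in range(1200)]
for i in range(0, len(buffer), MAX_ROWS_PER_REQUEST):
    insert_batch(buffer[i : i + MAX_ROWS_PER_REQUEST])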

What is the best way to handle time consuming dynamic generated reports downloads?

A website serving continuously updated content (think stock exchange) is required to generate reports on demand, and users download them as files. Users can customize the downloaded report based on lots of parameters.
What is the best practice for handling highly customized report downloads as (.xls) files?
How do you cache and improve performance?
It might be good to mention that the data is stored in RavenDB and the reports are expected to handle result sets of around 100K records.
Here are some pointers:
Make sure you have static indexes defined in RavenDB to match all possible reports. You don't want to use dynamically generated temp indexes for this.
Probably one or more parameters will drastically change the query, so you may need some conditional logic to choose which of several queries to run. This is especially true for different groupings, as they'll require a different map-reduce index.
Choose whether you want to limit your result set using standard paging with Skip and Take operators, or whether you are going to stream unbounded result sets.
However you build the actual report, do it in memory. Do not try to write it to disk first. Managing file permissions, locks, and cleanup is not worth the hassle. Plus, you risk taking servers down if they run out of disk space.
Preferably you should build the response and stream it out to your user in a single step, as to not require large amounts of memory on the server. Make sure you understand the yield keyword in C#, and that you work with IEnumerable and IQueryable directly whenever possible. Don't try to use .ToList() or .ToArray(), which will put the whole result set into memory.
With regard to caching, you could consider using a front-end cache like Memcached, but I'm not sure whether it will help you here. You probably want data that's as accurate as possible from your database. Introducing any sort of cache will require you to understand how and when to reset that cache. Keep in mind that Raven has several caching layers built in already. Build your solution without caching first, and then add caching if you need it.
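The answer above is about C# and RavenDB; purely to illustrate the build-and-stream-in-one-step idea, here is a minimal Python/Flask sketch that writes CSV rows from a hypothetical query iterator straight into the response, without ever materializing the full result set:

import csv
import io

from flask import Flask, Response

app = Flask(__name__)


def query_report_rows(params):
    """Hypothetical stand-in for a streamed database query (e.g. a Stream API)."""
    for i in range(100_000):
        yield {"symbol": f"TICKER{i % 50}", "price": 100 + i * 0.01}


@app.route("/report")
def report():
    def generate():
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=["symbol", "price"])
        writer.writeheader()
        for row in query_report_rows(params={}):
            writer.writerow(row)
            yield buffer.getvalue()   # hand this chunk to the client...
            buffer.seek(0)
            buffer.truncate(0)        # ...and reuse the buffer for the next row

    # The response is produced row by row, so server memory use stays flat
    # even for 100K-result reports.
    return Response(generate(), mimetype="text/csv",
                    headers={"Content-Disposition": "attachment; filename=report.csv"})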
