Uncontrolled Realm file growth

I am using realm.js 2.3.3 with a Node.js app for storing the latest state of IoT devices.
At present I have 3 schemas (10 columns each, all numbers and booleans), each with 1 row per device. We have around 1,600 devices live, and they send data every second. At most 200 devices are online at any given time.
I always keep Realm in write mode by calling beginTransaction at the start of the app; then, every second, I call commitTransaction to flush the latest state to disk, followed immediately by beginTransaction again. This also ensures that Realm is always in write mode. I never call compact, as it freezes the entire app for some time.
In all there are close to 5,000 rows, which should be around 1 MB of data (verified by calling compact). But the Realm file has grown to 290 MB in 2 days.
How can I keep the file size realistic?
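The write loop described above looks roughly like this (a minimal sketch; the schema is a made-up stand-in for one of the three, and the updates themselves would happen elsewhere between commits):

const Realm = require('realm');

// Hypothetical stand-in for one of the three 10-column schemas.
const DeviceStateSchema = {
  name: 'DeviceState',
  primaryKey: 'deviceId',
  properties: { deviceId: 'int', online: 'bool', temperature: 'double' },
};

const realm = new Realm({ schema: [DeviceStateSchema] });

// Enter write mode once at startup and stay there.
realm.beginTransaction();

setInterval(() => {
  realm.commitTransaction(); // flush the latest state to disk
  realm.beginTransaction();  // immediately re-enter write mode
}, 1000);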

Related

Firebase Get All Auth Users taking forever in Cloud Run

We're trying to fetch 86k of our Firebase users. Locally and on Firebase Functions it takes 2 minutes for all of them, but on Cloud Run it takes on average 20 seconds per call (you can only request 1k users per call, according to the Firebase docs).
Interestingly, fetching all Firebase Realtime Database users takes 15s, but on Cloud Run it took 365s:
2022-06-17T00:03:04.986000061Z grabbed users data from db, total: 86442 in 364.015s
2022-06-17T00:03:05.732000112Z Progress 1000 0.746s
2022-06-17T00:03:15.131999969Z Progress 2000 9.847s
2022-06-17T00:03:39.332999944Z Progress 3000 34.347s
2022-06-17T00:04:03.832999944Z Progress 4000 58.846s
2022-06-17T00:04:28.433000087Z Progress 5000 83.447s
2022-06-17T00:04:51.733000040Z Progress 6000 106.747s
2022-06-17T00:05:58.332000017Z Progress 7000 172.947s
Any thoughts on how to solve this? There are no special network settings in place on Cloud Run.
Background Info:
The Cloud Run instance is using Node.js 14 with 2 GB of memory, which stays at 8% usage. CPU usage stays around 10%. The user object is relatively small, but across all these users it adds up to about 60-70 MB. In Firebase Functions, only 256 MB of memory are needed to do the fetching.
PS: I've yet to test whether region makes a difference, as Cloud Run is in us-east1 and Functions are in us-central1. Will be testing soon.
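For context, the paginated fetch has roughly this shape (a sketch using the Admin SDK's listUsers, the 1k-per-call API mentioned above; the progress logging mirrors the output shown):

const admin = require('firebase-admin');
admin.initializeApp();

async function fetchAllUsers() {
  const users = [];
  let pageToken;
  const start = Date.now();
  do {
    // listUsers is capped at 1,000 users per call.
    const result = await admin.auth().listUsers(1000, pageToken);
    users.push(...result.users);
    pageToken = result.pageToken;
    console.log(`Progress ${users.length} ${((Date.now() - start) / 1000).toFixed(3)}s`);
  } while (pageToken);
  return users;
}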

Azure Cosmos DB Emulator slow (100 ms / request)

I am trying to set up the Azure Cosmos DB Emulator to work locally with integration tests, but I found that it is very slow.
I am reading a ~1 KB JSON document with the container.ReadItemAsync<T> method and awaiting the answer. I am calling this method in a loop, 100 times.
The execution time is consistently around 9.5-10 seconds, so one request takes around 100 milliseconds, which is very slow given that the service is running locally.
Why is this so slow, and how can I make it faster?
I expect at most 1 ms per request, considering it is all local disk I/O.
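For reference, the measurement loop is essentially the following (the question uses the .NET SDK's container.ReadItemAsync<T>; this sketch uses the JavaScript SDK, @azure/cosmos, purely for illustration, and the endpoint is the emulator default while the key and names are placeholders):

const { CosmosClient } = require('@azure/cosmos');

const client = new CosmosClient({ endpoint: 'https://localhost:8081', key: '<emulator key>' });

async function measure() {
  const container = client.database('testdb').container('items');
  const start = Date.now();
  for (let i = 0; i < 100; i++) {
    // Point read of a ~1 KB document; 'doc1' serves as both id and partition key here.
    await container.item('doc1', 'doc1').read();
  }
  console.log(`100 reads took ${Date.now() - start} ms`); // ~9.5-10 s in the question
}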
I tried the following, but none of it worked:
turning rate limiting on/off
creating the database/collection with various provisioning settings, with zero effect on performance (even 100k RU)
creating the db and collection manually vs. with the client SDK
the "Reset Data" item in the emulator tray menu
Further information:
The emulator version is 2.14.6.0 (68d4ca59)
I start the emulator from the start menu, but starting it from the command line doesn't change anything
I am using the Microsoft.Azure.Cosmos NuGet package, version 3.22.1
my CPU is an i7-8565U, but it isn't even fully used while the test is running
my system has 16 GB RAM
my system is running on a fast enough SSD: "NVMe SK hynix BC501 H", but while running the test the SSD usage is between 0 and 2%
the performance is the same if I increase the document size to 100 KB or even 1 MB
Creating your CosmosClientOptions with the AllowBulkExecution = true setting can cause this.
the SDK will construct batches and group operations; when a batch is full, it gets dispatched, but if the batch doesn't fill up, there is a timer that will dispatch it to make sure the operations complete. This timer is currently 100 milliseconds. So if the batch does not get filled up (for example, you are just sending 50 concurrent operations), then the overall latency might be affected.
Source: Introducing Bulk support in the .NET SDK

Cosmos DB Emulator hangs when pumping continuation token, segmented query

I have just added a new feature to an app I'm building. It uses the same working Cosmos/Table storage code that other features use to query and pump result segments from the Cosmos DB Emulator via the Tables API.
The emulator is running with:
/EnableTableEndpoint /PartitionCount=50
This is because I read that the emulator defaults to 5 unlimited containers and/or 25 limited ones, and since this is a Tables API app, the table containers are created as unlimited.
The table being queried is the 6th to be created and contains just 1 document.
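For reference, a token-pumping loop over that table looks roughly like this (the post's own code isn't shown; this sketch uses the JavaScript @azure/data-tables client, and the endpoint, account, and table names are placeholders):

const { TableClient, AzureNamedKeyCredential } = require('@azure/data-tables');

const client = new TableClient(
  '<emulator table endpoint>',
  'MyTable',
  new AzureNamedKeyCredential('<account>', '<key>')
);

async function pumpAllEntities() {
  const entities = [];
  // byPage() carries the continuation token between segments internally;
  // the loop ends when the service stops returning a token.
  for await (const page of client.listEntities().byPage({ maxPageSize: 100 })) {
    entities.push(...page);
  }
  return entities;
}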
It either takes around 30 seconds to run a simple query, "tripping" my Too Many Requests error handling/retry in the process, or it hangs seemingly forever with no results returned, and the emulator has to be shut down.
My understanding is that with 50 partitions I can make 10 unlimited tables/collections, since each is "worth" 5. See the documentation.
I have tried with rate limiting on and off, and jacked the RU/s up to 10,000 on the table. It always fails to query this one table. The data, including the files on disk, has been cleared many times.
It seems like a bug in the emulator. Note that the "Sorry..." error that I would expect to see upon creation of the 6th unlimited table, as per the docs, is never encountered.
After switching to a real Cosmos DB instance on Azure, this is looking like a problem with my dodgy code.
Confirmed: my dodgy code.
Stand down everyone. As you were.

Design advice on processing large-volume files in parallel

I am looking for design advice on the use case below.
I am designing an application that can process Excel/CSV/JSON files. They all contain the same columns/attributes; there are about 72 of them. These files may contain up to 1 million records.
I have two options for processing those files.
Option 1
Service 1: Read the content from the given file, convert each row into JSON, and save the records into a SQL table by batch processing (3k records per batch), as sketched below.
Service 2: Fetch those JSON records from the database table (saved in step 1), process them (validation and calculation), and save the final results into a separate table.
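A sketch of Option 1's Service 1 (the post uses .NET Core worker services; this assumes the Node mssql driver purely for illustration, and the table, column, and credential values are made up):

const sql = require('mssql');

async function saveBatches(rows) {
  const pool = await sql.connect({
    server: 'localhost',
    database: 'staging',
    user: '<user>',
    password: '<password>',
    options: { trustServerCertificate: true },
  });
  for (let i = 0; i < rows.length; i += 3000) {
    const table = new sql.Table('BatchProcessing');
    table.create = false; // the staging table already exists
    table.columns.add('RowJson', sql.NVarChar(sql.MAX), { nullable: false });
    for (const row of rows.slice(i, i + 3000)) {
      table.rows.add(JSON.stringify(row)); // each row stored as a JSON string
    }
    await pool.request().bulk(table); // one bulk insert per 3k batch
  }
}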
Option 2 (using Rabbit MQ)
Service 1: Read the content from the given file and send every row as a message to a queue, as sketched below. If the file contains 1 million records, this service will send 1 million messages to the queue.
Service 2: Listen to the queue created in step 1, process those messages (validation and calculation), and save the final results into a separate table.
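And a sketch of Option 2's Service 1 under the same assumptions (Node's amqplib here; the queue name and confirm interval are illustrative). Batching publisher confirms, as below, is one common way to keep per-message broker round-trips from dominating the runtime:

const amqp = require('amqplib');

async function publishRows(rows) {
  const conn = await amqp.connect('amqp://localhost');
  // A confirm channel lets us batch broker acknowledgements instead of firing blindly.
  const channel = await conn.createConfirmChannel();
  await channel.assertQueue('file-rows', { durable: true });

  for (let i = 0; i < rows.length; i++) {
    channel.sendToQueue('file-rows', Buffer.from(JSON.stringify(rows[i])), { persistent: true });
    // Wait for confirms every 10k messages so unconfirmed messages stay bounded.
    if (i > 0 && i % 10000 === 0) await channel.waitForConfirms();
  }
  await channel.waitForConfirms();
  await conn.close();
}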
POC experience with Option 1:
It took 5 minutes to read and batch-save the data into the table for 100K records (the job of service 1).
If the application tries to process multiple files in parallel, each containing 200K records, I sometimes see deadlocks.
No indexes or relationships are created on this batch-processing table.
Saving 3,000 records per batch to avoid table locks.
While the services are processing, results are trackable and progress can be queried; say, for "File 1.JSON", 50,000 records are processed successfully and the remaining 1,000 are in progress.
If service 1 finishes its job correctly and something then goes wrong with service 2, we still have good control for reprocessing those records, as they are persisted in the database.
I am planning to delete the data in the batch-processing table with a nightly SQL job once all records have been processed by service 2, so the table is fresh and ready to store the next day's data.
POC experience with Option 2:
Producing (service 1) and consuming (service 2) messages for a 100k-record file took around 2 hours 30 minutes.
There is no storage of file data in the database, so no deadlocks (unlike Option 1).
Results are not as trackable as in Option 1 while the services are processing the records, which matters for sharing status with the clients who sent the file for processing.
We can see the status of messages on the RabbitMQ management screen for monitoring purposes.
If service 1 reads a given file only partially and errors out, then to my knowledge there is no way to roll back the already-published messages in RabbitMQ, so the consumer keeps working on them.
I can horizontally scale the application with either option to speed up processing.
Per the above facts, both options have advantages and disadvantages. Is this a good use case for RabbitMQ? Is it advisable to produce and consume millions of records through RabbitMQ? Is there a better way to deal with this use case apart from these two options?
Please advise.
*** I am using .NET Core 5.0 and SQL Server 2019. Service 1 and Service 2 are .NET Core worker services (Windows jobs). All tests were done on my local machine, and RabbitMQ is installed in Docker (Docker is on my local machine).

How many clients are connected to my Firestore?

I am working on a Flutter app that fetches 341 documents from Firestore. After 2 days of analysis I found out that my read requests are increasing too much, so I made a chart in the Stackdriver Metrics Explorer, from which I learned that my app reads the 341 docs only a single time; it's the Firebase console that is increasing my reads.
Now, here are the questions that are bothering me:
1) How are reads counted when we view data in the console, and how can I reduce my read requests? Basically there are 341 docs, but it shows more than 600 reads whenever I refresh my console.
2) As you can see in the picture, there are two types of document reads, 'LOOKUP' and 'QUERY'; what's the exact difference between them?
3) I am getting data from Firestore with a single instance, and when I open my app the chart shows 1 active client, which is fine, but within the next 5 minutes the number of active clients starts to increase.
Can anybody please explain to me why this is happening?
For the last question, I tried disabling all the service accounts and then opened my app again, but got the same result.
Firestore.instance
    .collection("Lectures")
    .snapshots(includeMetadataChanges: true)
    .listen((d) {
      print(d.metadata.isFromCache);   // prints false every time
      print(d.documents.length);       // 341
      print(d.documentChanges.length); // 341
    });
This is the snippet I am using. When the app starts it runs only once.
I will try to answer your questions:
How are reads counted when we view data in the console, and how can I reduce my read requests? Basically there are 341 docs, but it shows more than 600 reads whenever I refresh my console.
Reads are counted based on how you query your Firestore database, plus any access to the database from the console. Using the Firebase console itself incurs reads, and even if you leave the console open while doing other things, new changes to the database will automatically incur reads as well. Any document read from the server is billed; it doesn't matter where the read came from, and the console is included in that.
Check this official documentation: under the "Manage data" title there is a note: "Note: Read, write, and delete operations performed in the console count towards your Cloud Firestore usage."
That said, if you think there is an issue with this, you can contact Firebase support directly for more detailed answers.
Note, however, that on the Firebase free plan you have 50K free reads per day.
A workaround that I found for this (thanks to Dependar Sethi):
Bookmarking the Usage tab of the Firestore page, so you basically 'skip' the Data tab.
Adding a dummy collection in a way that ensures it is the first collection (alphabetically), which then gets loaded by default on the Firestore page.
You can find his full solution here.
Also, you can optimise your queries to retrieve only the data you want, using the where() method and pagination with Firebase.
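For example (a sketch with the Node Admin SDK; 'Lectures' is from the snippet above, while the field names are hypothetical):

const admin = require('firebase-admin');
admin.initializeApp();
const db = admin.firestore();

async function readLess() {
  // Filter: fetch only the documents you need instead of the whole collection.
  const matching = await db.collection('Lectures').where('semester', '==', 1).get();

  // Paginate: read the collection 20 documents at a time with a cursor.
  const page1 = await db.collection('Lectures').orderBy('title').limit(20).get();
  const last = page1.docs[page1.docs.length - 1];
  const page2 = await db.collection('Lectures').orderBy('title').startAfter(last).limit(20).get();

  return { matching, page1, page2 };
}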
As you can see in the picture, there are two types of document reads, 'LOOKUP' and 'QUERY'; what's the exact difference between them?
I guess there is no important difference between them, but "QUERY" is getting the actual data (when you call the data() method) while "LOOKUP" is getting a reference to that data (without calling the data() method).
I am getting data from Firestore with a single instance, and when I open my app the chart shows 1 active client, which is fine, but within the next 5 minutes the number of active clients starts to increase.
For this question, considering the metrics you chose in Stackdriver, I can see 3 connected clients, and per the description of the "connected clients" metric:
The number of active connections. Each mobile client will have one connection. Each listener in admin SDK will be one connection. Sampled every 60 seconds. After sampling, data is not visible for up to 240 seconds.
So please check how many mobile clients are connected to this instance and how many listeners you have in your app. The sum of all of them is the actual number of connected clients that you are seeing in Stackdriver.
