Azure Cosmos DB Emulator slow (100 ms / request) - azure-cosmosdb

I am trying to set up the Azure Cosmos DB Emulator to work locally with integration tests but I found that it is very slow.
I am reading a ~1 KB JSON document with the container.ReadItemAsync<T> method and awaiting the result. I call this method in a loop, 100 times.
The execution time is consistently around 9.5-10 seconds, so one request takes around 100 milliseconds, which is very slow given that the service is running locally.
Why is this so slow and how can I make it faster?
I expect at most 1 ms / request considering it is all disk I/O.
I tried the following but they didn't work:
Turning Rate Limiting on/off
creating the database/collection with various provisioning settings (even 100k RU); this has zero effect on performance
creating the db and collection manually vs with the client SDK
"Reset Data" menu in the emulator tray menu
Further information:
The emulator version is 2.14.6.0 (68d4ca59)
I start the emulator from the start menu, but starting it from the command line doesn't change anything
I am using the Microsoft.Azure.Cosmos nuget package, version 3.22.1
my CPU is i7-8565U, but it isn't even fully used while the test is running
my system has 16 GB RAM
my system is running on a fast enough SSD: "NVMe SK hynix BC501 H", but while running the test the SSD usage is between 0 and 2%.
the performance is the same if I increase the document size to 100 KB or even 1 MB.

Creating your CosmosClientOptions with the AllowBulkExecution = true setting can cause this.
The SDK will construct batches and group operations. When the batch is full, it will get dispatched, but if the batch doesn't fill up, there is a timer that will dispatch it to make sure they complete. This timer currently is 100 milliseconds. So if the batch does not get filled up (for example, you are just sending 50 concurrent operations), then the overall latency might be affected.
Source: Introducing Bulk support in the .NET SDK
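To see why this matches the numbers in the question, here is a rough toy model of a timer-flushed batcher. It is only a sketch and not the SDK's actual implementation: the function name, the max_batch_size of 100 and the 1 ms service_ms are assumptions made up for illustration; only the 100 ms timer comes from the quoted post.

```python
import math

def estimated_total_ms(num_ops, concurrency, flush_timer_ms=100.0,
                       max_batch_size=100, service_ms=1.0):
    """Estimate wall time when operations are issued in waves of `concurrency`.

    Toy model (assumption): a full batch is dispatched immediately, while a
    partially filled batch waits for the flush timer before being dispatched.
    """
    waves = math.ceil(num_ops / concurrency)
    remaining = num_ops
    total = 0.0
    for _ in range(waves):
        in_wave = min(concurrency, remaining)
        remaining -= in_wave
        # A wave that leaves a partial batch behind has to wait for the timer.
        has_partial_batch = in_wave % max_batch_size != 0
        total += (flush_timer_ms if has_partial_batch else 0.0) + service_ms
    return total

# 100 reads, each awaited before the next one is issued (as in the question).
print(estimated_total_ms(100, concurrency=1))    # 10100.0
# The same 100 reads issued together fill one (hypothetical) batch.
print(estimated_total_ms(100, concurrency=100))  # 1.0
```

Under these assumptions, awaiting each read in a loop pays the timer once per read, which lands near the observed 9.5-10 seconds. For sequential point reads, leaving AllowBulkExecution at its default (false) avoids the batching path altogether.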

Related

Firebase Get All Auth Users taking Forever in cloudrun

We're trying to fetch 86k of our Firebase users. Locally and on Firebase Functions it takes 2 minutes for all of them, but in Cloud Run it takes on average 20 seconds per call (you can only request 1k users per call according to the Firebase docs).
Interestingly, getting all users from the Firebase Realtime Database takes 15 s, but in Cloud Run it took 365 s.
2022-06-17T00:03:04.986000061Z grabbed users data from db, total: 86442 in 364.015s
2022-06-17T00:03:05.732000112Z Progress 1000 0.746s
2022-06-17T00:03:15.131999969Z Progress 2000 9.847s
2022-06-17T00:03:39.332999944Z Progress 3000 34.347s
2022-06-17T00:04:03.832999944Z Progress 4000 58.846s
2022-06-17T00:04:28.433000087Z Progress 5000 83.447s
2022-06-17T00:04:51.733000040Z Progress 6000 106.747s
2022-06-17T00:05:58.332000017Z Progress 7000 172.947s
Any thoughts on how to solve this? No special network settings in place on cloud run.
Background Info:
Cloud run instance is using NodeJS 14. 2GB Memory which stays at 8% usage. CPU usage stays around 10%. The user object is relatively small, but across all these users it's about 60-70 MB. In firebase functions, only 256 MB of memory are required to do the fetching.
PS: I've yet to test if region makes a difference, as cloud run is in us-east1 and functions are in us-central1. Will be testing soon.

Cosmos DB Emulator hangs when pumping continuation token, segmented query

I have just added a new feature to an app I'm building. It uses the same working Cosmos/Table storage code that other features use to query and pump results segments from the Cosmos DB Emulator via the Tables API.
The emulator is running with:
/EnableTableEndpoint /PartitionCount=50
This is because I read that the emulator defaults to 5 unlimited containers and/or 25 limited and since this is a Tables API app, the table containers are created as unlimited.
The table being queried is the 6th to be created and contains just 1 document.
It either takes around 30 seconds to run a simple query and "trips" my Too Many Requests error handling/retry in the process, or it hangs seemingly forever, no results are returned, and the emulator has to be shut down.
My understanding is that with 50 partitions I can make 10 unlimited tables/collections, since each is "worth" 5. See documentation.
I have tried with rate limiting on and off, and jacked the RU/s to 10,000 on the table. It always fails to query this one table. The data, including the files on disk, has been cleared many times.
It seems like a bug in the emulator. Note that the "Sorry..." error that I would expect to see upon creation of the 6th unlimited table, as per the docs, is never encountered.
After switching to a real Cosmos DB instance on Azure, this is looking like a problem with my dodgy code.
Confirmed: my dodgy code.
Stand down everyone. As you were.

Uncontrolled Realm file growth

I am using realm.js 2.3.3 with node.js app for storing latest state of iot devices.
At present I have 3 schemas (10 columns each, all numbers and booleans), each with 1 row per device. We have around 1600 devices live and they send data every second. At most 200 devices are online at any given time.
I always keep realm in write mode by calling beginTransaction at start of the app and then every second I call commitTransaction to flush the latest state to disk followed by beginTransaction. This also ensures that realm is always in write mode. I never call compact as it freezes the entire app for some time.
In all there are close to 5000 rows which should be around 1MB of data (verified by calling compact). But the realm file has grown to 290 MB in 2 days.
How can I keep the file size realistic?

Google Cloud Bigtable: repeated grpc error code 13, then suddenly success

In short, we are sometimes seeing that a small number of Cloud Bigtable queries fail repeatedly (for 10s or even 100s of times in a row) with the error rpc error: code = 13 desc = "server closed the stream without sending trailers" until (usually) the query finally works.
In detail, our setup is as follows:
We are running a collection (< 10) of Go services on Google Compute Engine. Each service leases tasks from a pair of PULL task queues. Each task contains an ID of a bigtable row. The task handler executes the following query:
row, err := tbl.ReadRow(ctx, <my-row-id>,
    bigtable.RowFilter(bigtable.ChainFilters(
        bigtable.FamilyFilter(<my-column-family>),
        bigtable.LatestNFilter(1))))
If the query fails then the task handler simply returns. Since we lease tasks with a lease time between 10 and 15 minutes, a little while later the lease will expire on that task, it will be leased again, and we'll retry. The tasks have a max retry of 1000 so they can be retried many times over a long period. In a small number of cases, a particular task will fail with the grpc error above. The task will typically fail with this same error every time it runs for hours or days on end, before (seemingly out of the blue) eventually succeeding (or the task runs out of retries and dies).
Since this often takes so long, it seems unrelated to server load. For example right now on a Sunday morning, these servers are very lightly loaded, and yet I see plenty of these errors when I tail the logs. From this answer, I had originally thought that this might be due to trying to query for a large amount of data, perhaps near the max limit that cloud bigtable will support. However I now see that this is not the case; I can find many examples where tasks that have failed many times finally succeed and report only a small amount of data (e.g. <1 MB) was retrieved.
What else should I be looking at here?
edit: From further testing I now know that this is completely machine (client) independent. If I tail the log on one of the task leasing machines, wait for a "server closed the stream without sending trailers" error, and then try a one-off ReadRow query to the same rowId from another, unrelated, totally unused machine, I get the same error repeatedly.
This error is typically caused by having more than 256MB of data in your reply.
However, there is currently a bug in our server-side error handling code that allows some invalid characters in HTTP/2 trailers, which are not allowed by the spec. This means that some error messages that contain invalid characters will be seen as this kind of error. This should be fixed early next year.

How to prevent a Hangfire recurring job from restarting after 30 minutes of continuous execution

I am working on an asp.net mvc-5 web application, and I am facing a problem using the Hangfire tool to run long-running background jobs. The problem is that if the job execution exceeds 30 minutes, Hangfire will automatically initiate another job, so I end up with two identical jobs running at the same time.
Now I have the following:-
Asp.net mvc-5
IIS-8
Hangfire 1.4.6
Windows server 2012
Now I have defined a Hangfire recurring job to run at 17:00 each day. The background job mainly scans our network for servers and VMs and updates the DB, and the recurring job sends an email after completing the execution.
The recurring job used to work well when its execution took less than 30 minutes. But today, as our system has grown, the recurring job completed after 40 minutes instead of the 22-25 minutes it used to take, and I received 2 emails instead of one (and the time between the emails was around 30 minutes). I re-ran the job manually and noted that the problem is as follows:
"when the recurring job reaches 30 minutes of continuous execution, a
new instance of the recurring job will start, so I will have two
instances instead of one running at the same time, so that's why I received 2 emails."
If the recurring job takes less than 30 minutes (for example 29 minutes) I do not face any problem, but if its execution exceeds 30 minutes then for one reason or another Hangfire will initiate a new job.
Although the Hangfire dashboard shows only one active job during the execution, when I monitor our DB I can see from the SQL profiler that there are two jobs accessing the DB. This happens 30 minutes after the beginning of the recurring job (at 17:30 in our case), and that is why I received 2 emails, which means 2 recurring jobs were running in the background instead of one.
So can anyone advise on this please: how can I prevent Hangfire from automatically initiating a new recurring job if the current recurring job's execution exceeds 30 minutes?
Thanks
Did you look at InvisibilityTimeout setting from the Hangfire docs?
Default SQL Server job storage implementation uses a regular table as
a job queue. To be sure that a job will not be lost in case of
unexpected process termination, it is deleted from a queue only
upon a successful completion.
To make it invisible from other workers, the UPDATE statement with
OUTPUT clause is used to fetch a queued job and update the FetchedAt
value (that signals for other workers that it was fetched) in an
atomic way. Other workers see the fetched timestamp and ignore a job.
But to handle the process termination, they will ignore a job only
during a specified amount of time (defaults to 30 minutes).
Although this mechanism ensures that every job will be processed,
sometimes it may cause either long retry latency or lead to multiple
job execution. Consider the following scenario:
Worker A fetched a job (runs for an hour) and started it at 12:00.
Worker B fetched the same job at 12:30, because the default invisibility timeout had expired.
Worker C did not fetch the same job at 13:00, because it
was deleted after successful completion.
If you are using cancellation tokens, it will be set for Worker A at
12:30, and at 13:00 for Worker B. This may lead to the fact that your
long-running job will never be executed. If you aren’t using
cancellation tokens, it will be concurrently executed by Worker A and
Worker B (since 12:30), but Worker C will not fetch it, because it
will be deleted after successful performance.
So, if you have long-running jobs, it is better to configure the
invisibility timeout interval:
var options = new SqlServerStorageOptions
{
    InvisibilityTimeout = TimeSpan.FromMinutes(30) // default value
};
GlobalConfiguration.Configuration.UseSqlServerStorage("<name or connection string>", options);
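The Worker A / Worker B scenario from the quoted docs can be replayed with a small toy model. This is only a sketch of the timeout-based fetching described above, not Hangfire's code: the function name, the one-minute polling interval and the calendar date are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta

def fetch_times(job_start, job_duration, invisibility_timeout, poll_interval):
    """Return the times at which some worker fetches the job.

    Toy model (assumption): the job stays in the queue until the first worker
    finishes, and is visible to other workers once FetchedAt is older than
    the invisibility timeout.
    """
    fetched_at = None
    fetches = []
    finish = job_start + job_duration  # first worker deletes the job here
    now = job_start
    while now < finish:
        visible = fetched_at is None or now - fetched_at >= invisibility_timeout
        if visible:
            fetched_at = now  # this worker stamps FetchedAt and starts the job
            fetches.append(now)
        now += poll_interval
    return fetches

start = datetime(2020, 1, 1, 12, 0)  # arbitrary date; only the clock time matters
one_minute = timedelta(minutes=1)

# One-hour job with the default 30-minute timeout: fetched twice.
default = fetch_times(start, timedelta(hours=1), timedelta(minutes=30), one_minute)
print([t.strftime("%H:%M") for t in default])  # ['12:00', '12:30']

# Same job with a timeout longer than the job: fetched once.
longer = fetch_times(start, timedelta(hours=1), timedelta(hours=2), one_minute)
print([t.strftime("%H:%M") for t in longer])   # ['12:00']
```

In this model the 40-minute job from the question is picked up a second time at the 30-minute mark, which is consistent with the two emails. Setting the timeout above the longest expected run time removes the second fetch.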
As of Hangfire 1.5 this option is now Obsolete. Jobs that are being worked on are invisible to other workers.
Say goodbye to confusing invisibility timeout with unexpected
background job retries after 30 minutes (by default) when using SQL
Server. New Hangfire.SqlServer implementation uses plain old
transactions to fetch background jobs and hide them from other
workers.
Even after ungraceful shutdown, the job will be available for other
workers instantly, without any delays.
I was having trouble finding documentation on how to do this properly for a PostgreSQL database; every example I saw uses SQL Server. I found that the invisibility timeout is a property of the PostgreSqlStorageOptions object, here: https://github.com/frankhommers/Hangfire.PostgreSql/blob/master/src/Hangfire.PostgreSql/PostgreSqlStorageOptions.cs#L36. Luckily, through trial and error I figured out that UsePostgreSqlStorage has an overload that accepts this object. For .NET Core 2.0, when you set up the Hangfire PostgreSQL DB in the ConfigureServices method of the Startup class, add this (the default timeout is 30 minutes):
services.AddHangfire(config =>
    config.UsePostgreSqlStorage(Configuration.GetConnectionString("Hangfire1ConnectionString"), new PostgreSqlStorageOptions {
        InvisibilityTimeout = TimeSpan.FromMinutes(720)
    }));
I had this problem when using Hangfire.MemoryStorage as the storage provider. With memory storage you need to set FetchNextJobTimeout in MemoryStorageOptions; otherwise, by default, jobs will time out after 30 minutes and a new job will be executed.
var options = new MemoryStorageOptions
{
    FetchNextJobTimeout = TimeSpan.FromDays(1)
};
GlobalConfiguration.Configuration.UseMemoryStorage(options);
I would just like to point out that, even though the following is stated:
As of Hangfire 1.5 this option is now Obsolete. Jobs that are being worked on are invisible to other workers.
Say goodbye to confusing invisibility timeout with unexpected background job retries after 30 minutes (by default) when using SQL Server. New Hangfire.SqlServer implementation uses plain old transactions to fetch background jobs and hide them from other workers.
Even after ungraceful shutdown, the job will be available for other workers instantly, without any delays.
It seems that for many people using MySQL, PostgreSQL, MongoDB, InvisibilityTimeout is still the way to go: https://github.com/HangfireIO/Hangfire/issues/1197

Resources