Is there a formula to adjust the ConnectionPoolSettings and throughput for Cosmos DB Gremlin API? - azure-cosmosdb

I am using the Cosmos DB Gremlin API to insert a bunch of data into the Cosmos DB database. However, the behaviour is very inconsistent: the inserts sometimes succeed and sometimes time out or fail with too-many-requests errors. My throughput is set manually at 10,000 RU/s. Here are the settings I am using for the connection pool properties:
ConnectionPoolSettings = new ConnectionPoolSettings()
{
    MaxInProcessPerConnection = 10,
    PoolSize = 30,
    ReconnectionAttempts = 3,
    ReconnectionBaseDelay = TimeSpan.FromMilliseconds(500)
};
If I increase MaxInProcessPerConnection or PoolSize, I sometimes get 429 (Too Many Requests) errors. If I lower them too much, I sometimes get timeouts instead. Is there a formula or a rule of thumb on how to adjust these settings properly?
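For context, the closest thing to a formula I have come up with myself (this is only my own assumption, not anything documented) is to keep PoolSize × MaxInProcessPerConnection at or below the concurrency my provisioned RU/s can sustain, using the average request charge and latency I measure from my own responses:

using System;
using Gremlin.Net.Driver;

class PoolSizingSketch
{
    static void Main()
    {
        // All numbers below are placeholders measured from my own workload,
        // not values recommended by the SDK or the service.
        const double provisionedRuPerSecond = 10_000; // container throughput
        const double averageRuPerRequest = 15;        // taken from the request charge on responses
        const double averageRequestSeconds = 0.05;    // observed round-trip latency

        // Requests per second the RU budget can sustain without 429s.
        double sustainableRps = provisionedRuPerSecond / averageRuPerRequest;

        // Little's law: in-flight requests = arrival rate * time in system.
        int targetConcurrency = (int)Math.Ceiling(sustainableRps * averageRequestSeconds);

        // Split the concurrency budget between connections and requests per connection.
        int poolSize = 8; // arbitrary split; tuned by trial and error
        int maxInProcessPerConnection = Math.Max(1, targetConcurrency / poolSize);

        var connectionPoolSettings = new ConnectionPoolSettings
        {
            PoolSize = poolSize,
            MaxInProcessPerConnection = maxInProcessPerConnection,
            ReconnectionAttempts = 3,
            ReconnectionBaseDelay = TimeSpan.FromMilliseconds(500)
        };

        Console.WriteLine($"PoolSize={poolSize}, MaxInProcessPerConnection={maxInProcessPerConnection}");
    }
}

Is this roughly the right way to think about it, or is there a better rule of thumb?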

Related

Why does Local Variables include other partitions when selecting?

Currently, I have created the following partition scheme:
dbDate = database("", VALUE, 2010.01.01..2050.01.01)
dbSymbol = database("", HASH, [SYMBOL, 40])
db=database(db_name, COMPO, [dbDate, dbSymbol]);
When using the following code to query, I found that Local Variables also loads other date partitions:
db_name = "dfs://tick_database"
tb_name = "stock";
tb = loadTable(db_name, tb_name)
select * from tb where date(time)== 2021.08.24 and symbol == `000001.SZ 
As the amount of data increases, Local Variables will soon reach the 4 GB upper limit when selecting. How can I avoid this, or will the database handle it automatically so that there is no out-of-memory situation?
After executing tb = loadTable(db_name, tb_name), the information shown for tb is only the partition metadata; the specific partition data is loaded from disk into memory only when you double-click it, which lets users query a partitioned table more conveniently. loadTable only loads metadata, which can be understood as a table object that occupies very little memory. When you then execute a SQL query against it, only the data of the partitions involved in the query is loaded into memory. You can use getMemoryStat() to compare the memory used by the system before and after.

query cosmos db without enabling cross partition query

When querying Cosmos DB, there is an option to set enableCrossPartitionQuery to true.
I am wondering what happens if I do not set it. Which partition will be used for the query?
thanks
If your collection is partitioned, then query, update, and delete operations need a partition key.
If you don't set one, you may see an error complaining that the partition key is missing.
In that situation, if you don't want to set a partition key, or you don't know which partition the row data belongs to, you can set enableCrossPartitionQuery = true to avoid the error. Setting enableCrossPartitionQuery = true means the request will scan all the partitions to filter the data, so query performance is bound to decline.
If your data size is small, the impact may be small. However, if the data size is large, I suggest doing your best to avoid setting this property.
I tested the sample project https://github.com/Azure-Samples/azure-cosmos-db-sql-api-nodejs-getting-started.git, and it indeed does not require a partition key when the container is partitioned.
However, based on the statements in the Cosmos DB REST API documentation, and from testing the Java SDK, the partition key is required when querying a partitioned container. In any case, if you hit an error indicating a missing partition key, you can add enableCrossPartitionQuery = true to solve it. Still, I suggest providing the partition key for better query performance.
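For reference, here is a rough sketch of what the two options look like with the older .NET DocumentDB SDK (Microsoft.Azure.Documents); the endpoint, key, database, collection, and partition key value "A" are placeholders:

using System;
using System.Linq;
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;

class CrossPartitionQuerySketch
{
    static void Main()
    {
        // Placeholder endpoint, key, database and collection names.
        var client = new DocumentClient(
            new Uri("https://myaccount.documents.azure.com:443/"), "<primary-key>");
        var collectionUri = UriFactory.CreateDocumentCollectionUri("db", "coll");

        // Option 1: you know the partition key -- scope the query to that partition.
        var scoped = new FeedOptions { PartitionKey = new PartitionKey("A") };
        var inOnePartition = client
            .CreateDocumentQuery<Document>(collectionUri, "SELECT * FROM c", scoped)
            .ToList();

        // Option 2: you don't know the partition key -- allow a fan-out over all
        // partitions (slower and more expensive on large collections).
        var fanOut = new FeedOptions { EnableCrossPartitionQuery = true };
        var acrossPartitions = client
            .CreateDocumentQuery<Document>(collectionUri, "SELECT * FROM c", fanOut)
            .ToList();

        Console.WriteLine($"{inOnePartition.Count} / {acrossPartitions.Count} documents read");
    }
}

When the partition key is supplied, the request is scoped to that single partition even if the cross-partition flag is also set, which is consistent with the next answer below.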

Why Not Always Use EnableCrossPartitionQuery

If my Cosmos DB collection has multiple partitions, is there any reason NOT to set EnableCrossPartitionQuery to true?
I know it is necessary when running a query that could hit multiple partitions. But what if the query uses a valid partition key and will definitely only hit one partition: is there any performance loss or increased cost from setting that flag to true?
But what if the query uses a valid partition key and definitely will only hit one partition, is there any performance loss or increased cost because I set that flag to true?
To my knowledge, you need to set the partition key for a partitioned collection, and the cost will not change even if you also set EnableCrossPartitionQuery to true, because the request only scans the specific partition you already set. I did a sample test to verify it:
FeedOptions feedOptions = new FeedOptions();
PartitionKey partitionKey = new PartitionKey("A");
feedOptions.setPartitionKey(partitionKey);
feedOptions.setEnableCrossPartitionQuery(true);

FeedResponse<Document> queryResults = client.queryDocuments(
        "/dbs/db/colls/part",
        "SELECT * FROM c",
        feedOptions);

System.out.println("Running SQL query...");
for (Document document : queryResults.getQueryIterable()) {
    System.out.println(String.format("\tRead %s", document));
}
System.out.println(queryResults.getRequestCharge());
I think you don't have to struggle with this problem. The EnableCrossPartitionQuery option only needs to be used if a query against a partitioned collection is not scoped to a single partition key value. If you know the specific partition key, there is no need to set EnableCrossPartitionQuery.

Dealing with Indexer timeout when importing documentdb into Azure Search

I'm trying to import a rather large (~200M docs) documentdb into Azure Search, but I'm finding the indexer times out after ~24hrs. When the indexer restarts, it starts again from the beginning, rather than from where it got to, meaning I can't get more than ~40M docs into the search index. The data source has a highwater mark set like this:
var source = new DataSource();
source.Name = DataSourceName;
source.Type = DataSourceType.DocumentDb;
source.Credentials = new DataSourceCredentials(myEnvDef.ConnectionString);
source.Container = new DataContainer(myEnvDef.CollectionName, QueryString);
source.DataChangeDetectionPolicy = new HighWaterMarkChangeDetectionPolicy("_ts");
serviceClient.DataSources.Create(source);
The highwater mark appears to work correctly when testing on a small db.
Should the highwater mark be respected when the indexer fails like this, and if not how can I index such a large data set?
The reason the indexer is not making incremental progress even while timing out after 24 hours (the 24 hour execution time limit is expected) is that a user-specified query (QueryString argument passed to the DataContainer constructor) is used. With a user-specified query, we cannot guarantee and therefore cannot assume that the query response stream of documents will be ordered by the _ts column, which is a necessary assumption to support incremental progress.
So, if a custom query isn't required for your scenario, consider not using it.
Alternatively, consider partitioning your data and creating multiple datasource / indexer pairs that all write into the same index. You can use the DataSource.Container.Query parameter to provide a DocumentDB query that partitions your data using a WHERE filter. That way, each of the indexers will have less work to do and, with sufficient partitioning, will fit under the 24 hour limit. Moreover, if your search service has multiple search units, multiple indexers will run in parallel, further increasing the indexing throughput and decreasing the overall time to index your entire dataset.
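As a rough sketch of that second option (reusing the serviceClient and myEnvDef from your snippet; the bucket field, slice count, and index name are made-up placeholders), you could create several filtered datasources and matching indexers that all feed the same index:

const int sliceCount = 4;
for (int i = 0; i < sliceCount; i++)
{
    // Each datasource gets a WHERE filter that selects one slice of the collection.
    // "bucket" is a hypothetical field; any property that splits the data evenly works.
    var slice = new DataSource();
    slice.Name = $"docs-slice-{i}";
    slice.Type = DataSourceType.DocumentDb;
    slice.Credentials = new DataSourceCredentials(myEnvDef.ConnectionString);
    slice.Container = new DataContainer(
        myEnvDef.CollectionName,
        $"SELECT * FROM c WHERE c.bucket = {i}");
    // Change detection kept as in your snippet; note the earlier caveat about
    // custom queries and incremental progress.
    slice.DataChangeDetectionPolicy = new HighWaterMarkChangeDetectionPolicy("_ts");
    serviceClient.DataSources.Create(slice);

    // All indexers write into the same target index.
    var indexer = new Indexer();
    indexer.Name = $"docs-indexer-{i}";
    indexer.DataSourceName = slice.Name;
    indexer.TargetIndexName = "my-index";
    serviceClient.Indexers.Create(indexer);
}

With enough search units, these indexers run in parallel, which is what brings each slice under the 24 hour limit.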

PostgreSQL stack depth limit exceeded in hstore query though query is < 2 MB

I run into stack depth limit exceeded when trying to store a row from R to PostgreSQL. In order to address bulk upserts I have been using a query like this:
sql_query_data <- sprintf("BEGIN;
CREATE TEMPORARY TABLE
ts_updates(ts_key varchar, ts_data hstore, ts_frequency integer) ON COMMIT DROP;
INSERT INTO ts_updates(ts_key, ts_data) VALUES %s;
LOCK TABLE %s.timeseries_main IN EXCLUSIVE MODE;
UPDATE %s.timeseries_main
SET ts_data = ts_updates.ts_data,
ts_frequency = ts_updates.ts_frequency
FROM ts_updates
WHERE ts_updates.ts_key = %s.timeseries_main.ts_key;
INSERT INTO %s.timeseries_main
SELECT ts_updates.ts_key, ts_updates.ts_data, ts_updates.ts_frequency
FROM ts_updates
LEFT OUTER JOIN %s.timeseries_main ON (%s.timeseries_main.ts_key = ts_updates.ts_key)
WHERE %s.timeseries_main.ts_key IS NULL;
COMMIT;",
values, schema, schema, schema, schema, schema, schema, schema)
}
So far this query has worked quite well for updating millions of records while keeping the number of inserts low. Whenever I ran into stack size problems, I simply split my records into multiple chunks and carried on from there.
However, this strategy is running into trouble now. I don't have a lot of records anymore, just a handful in which the hstore is a little bigger, but it's not really 'large' by any means. I read suggestions by @Craig Ringer, who advises not to go near the 1 GB limit. So I assume the size of the hstore itself is not the problem, yet I receive this message:
Error in postgresqlExecStatement(conn, statement, ...) :
RS-DBI driver: (could not Retrieve the result : ERROR: stack depth limit exceeded
HINT: Increase the configuration parameter "max_stack_depth" (currently 2048kB), after ensuring the platform's stack depth limit is adequate.
)
EDIT: I did increase the limit to 7 MB and ran into the same error stating 7 MB is not enough. This is really odd to me, because the query itself is only 1.7 MB (I checked by pasting it into a text file). Can anybody shed some light on this?
Increase max_stack_depth as suggested by the hint. From the official documentation (http://www.postgresql.org/docs/9.1/static/runtime-config-resource.html):
The ideal setting for this parameter is the actual stack size limit enforced by the kernel (as set by ulimit -s or local equivalent), less a safety margin of a megabyte or so.
and
The default setting is two megabytes (2MB), which is conservatively small and unlikely to risk crashes.
Superusers can alter this setting per connection, or it can be set for all users through the postgresql.conf file (which requires a postgres server restart).
