Unordered insert into MongoDB using mongolite (R)

The following example performs an unordered insert of three documents. With unordered inserts, if an error occurs during an insert of one of the documents, MongoDB continues to insert the remaining documents in the array:
db.products.insert(
    [
        { _id: 20, item: "lamp", qty: 50, type: "desk" },
        { _id: 21, item: "lamp", qty: 20, type: "floor" },
        { _id: 22, item: "bulk", qty: 100 }
    ],
    { ordered: false }
)
Is this possible with mongolite? I am using a dataframe to insert data into mongo.

The mongo shell converts multiple insert statements into a bulk insert operation, which is where the ordered vs unordered behaviour applies. The Bulk API was introduced in MongoDB 2.6; older versions of MongoDB had a batch insert API which had an option to "continue on error" that defaulted to false.
The mongolite R package builds on the officially supported libmongoc driver, but as of mongolite 1.2 does not yet expose an option to control the behaviour of bulk inserts. However, the underlying mongolite C functions do have a stop_on_error boolean (default: TRUE) which maps to ordered vs unordered inserts.
I've submitted a pull request (mongolite #99) which will pass through the stop_on_error parameter for bulk inserts.
This doesn't change the default mongolite behaviour, which will be to stop at the first error encountered in a bulk insert. With stop_on_error set to FALSE, errors will be summarised for each batch of bulk inserts.
Sample usage (where data could be any valid parameter for insert() such as a data frame, named list, or character vector with JSON strings):
coll$insert(data, stop_on_error = FALSE)
It may make more sense to rename the parameter from stop_on_error to ordered for consistency with the bulk API, but I'll leave that to the discretion of the mongolite maintainer.
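For completeness, a minimal end-to-end sketch in R, assuming a mongolite build that includes the stop_on_error parameter from the pull request above (the connection details and collection name are illustrative):

library(mongolite)

# Connect to the target collection (adjust url/db/collection as needed)
coll <- mongo(collection = "products", db = "test", url = "mongodb://localhost")

# Data frame equivalent of the three documents in the shell example above
# (NA values are stored as null)
products <- data.frame(
  `_id` = c(20, 21, 22),
  item  = c("lamp", "lamp", "bulk"),
  qty   = c(50, 20, 100),
  type  = c("desk", "floor", NA),
  check.names = FALSE
)

# stop_on_error = FALSE requests unordered behaviour: keep inserting the
# remaining rows even if one of them fails (e.g. a duplicate _id)
coll$insert(products, stop_on_error = FALSE)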

Related

How to specify CosmosDb Synapse Link types when parquet type is incorrect?

I have a CosmosDb and a Synapse workspace linked. Almost everything works when using Synapse to create SQL views over the Cosmos data.
In Cosmos I have one data set with a property that is always zero. I know it is actually a decimal because it is a price, and future data is likely to contain decimal prices.
In Synapse I need to project this data into an SQL view where that column is correctly a decimal(19,4).
When I run an OPENROWSET query against the Cosmos data and attempt to specify the type for this property:
select *
from OPENROWSET(
    'CosmosDb',
    'account=myaccount;database=myDatabase;region=theRegion;key=xxxxxxxxxxxxxxx',
    [myCollection])
with (
    [salesPrice] float '$.salesPrice'
) as testQuery
I get the error:
Column 'salesPrice' of type 'FLOAT' is not compatible with external data type 'Parquet physical type: INT64', please try with 'BIGINT'.
Obviously a BIGINT here is going to fail as soon as I get a true decimal price.
I think the parquet type is getting set to BIGINT because in Cosmos all the values for this column are zero. I guess more generally it would be the same problem if the Cosmos property was all non-zero integers.
How can I force the type of salesPrice to be a decimal or float?
(I don't want to get sidetracked here on float vs. decimal for monetary values; I understand the difference, and this error happens either way.)
UPDATE
This problem also manifests itself in another way, without specifying a schema with OPENROWSET.
In a new CosmosDb collection insert a document such as:
{
    "myid" : 1,
    "price" : 0
}
If I wait a minute or so I can query this document from Synapse with:
select *
from OPENROWSET(
    'myCosmosDb',
    'account=myAccount;database=myDatabase;region=myRegion;key=xxxxxxxxxxxxxxxxxxx',
    [myCollection])
as testQuery;
and I get the expected results.
Now add a second document:
{
    "myid" : 1,
    "price" : 1.1
}
and re-run the query and I get the same error:
Column 'price' of type 'FLOAT' is not compatible with external data type 'Parquet physical type: INT64', please try with 'BIGINT'
Is there any way to work around or prevent these kinds of errors?
How about setting the document up like this:
{
    "myid" : "1",
    "price" : "1.1"
}
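If the values are stored as strings as suggested above, the OPENROWSET WITH clause can read them as varchar and the view can convert them explicitly. A rough sketch (same connection details as in the question; TRY_CONVERT returns NULL for values that don't parse):

select
    myid,
    TRY_CONVERT(decimal(19,4), price) as price
from OPENROWSET(
    'myCosmosDb',
    'account=myAccount;database=myDatabase;region=myRegion;key=xxxxxxxxxxxxxxxxxxx',
    [myCollection])
with (
    [myid]  varchar(20) '$.myid',
    [price] varchar(50) '$.price'
) as testQuery;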

Using the sort() cursor method without the default indexing policy in Azure Cosmos DB for MongoDB API

With Cosmos DB for MongoDB API (Version 3.4), the following find query in combination with the method cursor sort seems to behave incorrectly:
db.test.find({"field1": "value1"}).sort({"field2": 1})
The error occurs if all of the following conditions are met:
The default indexing policy was discarded, regardless of whether custom indexes were created afterwards using createIndex().
The find() query does not return any documents (Find(filter).Count() == 0)
The Sort document defining the sort order contains only one field. It doesn't matter whether this field exists or has been indexed. Using two fields in the sort document returns 0 hits, which is the correct behavior.
The error also occurs if all of the following conditions are met:
The default indexing policy was discarded
The find() query returns one or more documents
The Sort document contains exactly one field. This field has not been indexed.
The error message:
The index path corresponding to the specified order-by item is excluded.
The malfunction occurs only with Cosmos DB; with native MongoDB (MongoDB Atlas, v4.0) it behaves correctly.
Azure Cosmos DB for MongoDB API with the MongoDB 3.4 wire protocol (preview feature) is used. The problem occurs with both the MongoDB C#/.NET driver and the mongo shell.
In addition, the problem only occurs with find(). An equivalent aggregation pipeline containing $match and $sort behaves correctly.
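For reference, the equivalent aggregation pipeline that behaves correctly (same filter and sort as the find() query above):

db.test.aggregate([
    { $match: { "field1": "value1" } },
    { $sort: { "field2": 1 } }
])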
Reproduction
Create an Azure Cosmos DB Account with the "Azure Cosmos DB for MongoDB API". Enable the preview feature MongoDB 3.4 (Version 3.2 has not been tested).
Create a new database
Create a new collection, define a shard key
Drop the default indexing policy (using db.test.dropIndexes())
(Optional) Create new custom indexes
(Optional) Insert documents
Execute command in mongo shell (or the equivalent code with mongoDB C#/.NET driver):
db.test.find({"field1": "value1"}).sort({"field2": 1})
Expected result
All documents that match the query criteria. If there are none, no documents should be returned.
Actual result
Error: error: {
"_t" : "OKMongoResponse",
"ok" : 0,
"code" : 2,
"errmsg" : "Message: {\"Errors\":[\"The index path corresponding to the specified order-by item is excluded.\"]}\r\nActivityId: c50cc751-0000-0000-0000-000000000000, Request URI: /apps/[...]/, RequestStats: \r\nRequestStartTime: 2019-07-11T08:58:48.9880813Z, RequestEndTime: 2019-07-11T08:58:49.0081101Z, Number of regions attempted: 1\r\nResponseTime: 2019-07-11T08:58:49.0081101Z, StoreResult: StorePhysicalAddress: rntbd://[...]/, LSN: 359549, GlobalCommittedLsn: 359548, PartitionKeyRangeId: 0, IsValid: True, StatusCode: 400, SubStatusCode: 0, RequestCharge: 1, ItemLSN: -1, SessionToken: -1#359549, UsingLocalLSN: True, TransportException: null, ResourceType: Document, OperationType: Query\r\n, SDK: Microsoft.Azure.Documents.Common/2.4.0.0", [...]
Workaround
Adding an additional "dummy" field to the sort document prevents the error:
db.test.find({"field1": "value1"}).sort({"field2": 1, "dummyfield": 1}).count()
The workaround is not satisfactory; it could distort the results.
Am I doing something wrong, or is Cosmos DB behaving flawed here?
According to Microsoft support, an index needs to be created on the field being sorted; the default indexes can be dropped and custom indexes created. As for not having to modify the index every time a new field is added, there is no alternative other than performing a client-side sort. Unfortunately, client-side sorting takes a lot of CPU and memory on the client, and sorting on an index requires additional work whenever more fields have to be indexed.
Thus I did not find a really satisfying solution:
Using the default indexing policy. However, this can lead to a huge index.
Indexing all fields that need to be sorted on. Every time a new field has to be sorted, the indexing policy has to be modified manually.
Using client-side sorting only. In my opinion this severely limits MongoDB functionality.
Using the aggregation framework instead of the find method. This leads to increased complexity and traffic.
Migrating to native MongoDB.
Another option is to recreate a wildcard index so that every path is indexed again:
db.collection.createIndex({ "$**" : 1 });

Cosmos DB, C# SQL Api - case-insensitive WHERE clause

I am working on a project with Azure Cosmos DB using the C# SQL Api (DocumentDB) and need to know if it's possible to have a case-insensitive WHERE clause. From what I can find online it doesn't appear to be possible yet.
I want to write a query like:
SELECT l.CustomerName, l.LogDetail
FROM Logs l
WHERE l.CustomerName = 'Acme'
and have documents returned with CustomerName equal to "ACME", "Acme", or even "aCmE". I don't want to take the performance hit of a scan; I'd prefer the query to use an index.
I know I could create a second CustomerName field with all lowercase values to filter on, but I'm looking to see if I can avoid that. Is this possible?
Unfortunately, unless it was added in the past two months, this is not possible.
If you use ToLower() or ToUpper() on an indexed field it will result in a scan, so that is not an option.
Some valid solutions are, as you said, to add another field containing a lowercased copy of the string, or to only insert data with a consistent case. It sounds like your DB is case-insensitive anyway, so why not ensure that the stored cases really are consistent?
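For example (CustomerNameLower being a hypothetical extra field populated with a lowercased copy of CustomerName at write time), the filter becomes a plain indexed equality comparison:

SELECT l.CustomerName, l.LogDetail
FROM Logs l
WHERE l.CustomerNameLower = 'acme'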
At the time of this writing, there is now a LOWER function that can be used in Cosmos SQL API queries. This would enable you to write your query like this:
SELECT l.CustomerName, l.LogDetail
FROM Logs l
WHERE LOWER(l.CustomerName) = 'acme'
Here are the docs for the LOWER function.
There is now a StringEquals function which can be used to do case-insensitive comparisons.
SELECT STRINGEQUALS("abc", "abc", false) AS c1, STRINGEQUALS("abc", "ABC", false) AS c2, STRINGEQUALS("abc", "ABC", true) AS c3
returns
[{
    "c1": true,
    "c2": false,
    "c3": true
}]
Here is the documentation - https://learn.microsoft.com/en-us/azure/cosmos-db/sql-query-stringequals

Azure CosmosDB IS_DEFINED vs NOT IS_DEFINED

I was trying to query a collection which had a few documents. Some of the documents have an "Exception" property, while others don't.
My queries look something like this:
Records that do not contain Exception:
select COUNT(1) from doc c WHERE NOT IS_DEFINED(c.Exception)
Records that contain Exception:
select COUNT(1) from doc c WHERE IS_DEFINED(c.Exception)
But this does not seem to be working: while NOT IS_DEFINED returns some count, IS_DEFINED returns 0 records, even though there actually is matching data.
My data looks something like this (some documents contain the Exception property and others don't):
[{
    'Name': 'Sagar',
    'Age': 26,
    'Exception': 'Object reference not set to an instance of the object', ...
},
{
    'Name': 'Sagar',
    'Age': 26, ...
}]
Update
As Dax Fohl said in an answer, NOT IS_DEFINED is implemented now. See the Cosmos dev blog April updates for more details.
To use it properly, the queried property should be added to the collection's index.
Excerpt from the blog post:
Queries with inequality filters or filters on undefined values can now be run more efficiently. Previously, these filters did not utilize the index. When executing a query, Azure Cosmos DB would first evaluate other less expensive filters (such as =, >, or <) in the query. If there were inequality filters or filters on undefined values remaining, the query engine would be required to load each of these documents. Since inequality filters and filters on undefined values now utilize the index, we can avoid loading these documents and see a significant improvement in RU charge.
Here’s a full list of query filters with improvements:
Inequality comparison expression (e.g. c.age != 4)
NOT IN expression (e.g. c.name NOT IN (‘Luis’, ‘Andrew’, ‘Deborah’))
NOT IsDefined
Is expressions (e.g. NOT IsDefined(c.age), NOT IsString(c.name))
Coalesce operator expression (e.g. (c.name ?? ‘N/A’) = ‘Thomas’)
Ternary operator expression (e.g. c.name = null ? ‘N/A’ : c.name)
If you have queries with these filters, you should add an index for the relevant properties.
The main difference between IS_DEFINED and NOT IS_DEFINED is that the former utilizes the index while the latter does not (same with = vs. !=). It's most likely the case here that the IS_DEFINED query finishes in a single continuation, and thus you get the full COUNT result. On the other hand, it seems that the NOT IS_DEFINED query did not finish in a single continuation, and thus you got a partial COUNT result. You should get the full result by following the query continuation.
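A rough sketch of following the continuation with the .NET SDK v3 (Microsoft.Azure.Cosmos); container is assumed to be an already initialised Container, and depending on the SDK version the cross-partition aggregation may already be completed for you in a single page:

using System.Linq;
using Microsoft.Azure.Cosmos;

long total = 0;
// SELECT VALUE makes each page deserialize to plain numbers
var query = new QueryDefinition(
    "SELECT VALUE COUNT(1) FROM doc c WHERE NOT IS_DEFINED(c.Exception)");
FeedIterator<long> iterator = container.GetItemQueryIterator<long>(query);

while (iterator.HasMoreResults)
{
    // Each page may carry a partial count; summing across all pages
    // yields the full result once every continuation has been followed.
    FeedResponse<long> page = await iterator.ReadNextAsync();
    total += page.Sum();
}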

CosmosDB: applying a sort by a field removes all documents that do not have that field

We are migrating from MongoDB to Cosmos DB using the Mongo API.
We have encountered the following difference in query behavior around sorting.
Using the Cosmos DB Mongo API, sorting by a field removes all documents that don't have that field. Is it possible to modify the query to include the nulls, replicating the MongoDB behavior?
For example, if we have the following 2 documents:
[{
    "id":"p1",
    "priority":1
},{
    "id":"p2"
}]
performing:
sort({"priority":1})
Cosmos DB will return a single result, 'p1'.
MongoDB will return both results in the order 'p2', 'p1'; the documents where the field is null or missing come first.
As far as I know, null values are not included in the query's sort scan.
Here is a workaround: you could add a non-existent field to the sort document to force the engine to scan all the data.
Like this:
db.getCollection('brandotestcollections').find().sort({"test": 1, "aaaa":1})
The result then includes both documents.
I had the same problem and got it solved after some reading.
Refer to the document...
You have to update the indexing policy of the container to change the default way Cosmos DB sorts!
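For reference, creating a wildcard index from the mongo shell (the same approach shown in the earlier question on this page) would look like this; I have not verified that it restores the null-first ordering for documents missing the field, so treat it as a sketch to test against your data:

db.getCollection('brandotestcollections').createIndex({ "$**": 1 })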
