cosmos database to be indexed in pull approach + files - azure-cosmosdb

I have items and files. There is a 1:m relationship between items and files. Items are stored in a relational database and files in folders. The association between items and files is stored in the relational database. Files can be pdfs, word docs, email etc. I intend to POC cognitive search to be able to search items and associated documents.
My current understanding is, that a pull approach might be cheaper in comparison to the push approach when using cognitive search (the latency requirements are not stringent and eventual consistency is OK). Hence, I intend to move the data into a cosmos database, which can then be indexed via the pull approach. Curious, how does this work with the documents? Would I need to crack them on prem?
There is also the option of storing documents as attachments or in blob storage. The latter is most likely more future-proof. I would think that if I put documents into blob storage, Cognitive Search indexing would still need to crack the documents and apply skills?

This sounds like a good approach. In terms of data sources, Cognitive Search supports Cosmos DB, blob storage, and some relational databases. I would probably:
Create a new Cognitive Search resource in the Azure portal.
In that Cognitive Search resource, click "Import data" to create a new indexer (this is the "pull" option that you mention above). You may want to do this twice, assuming that your items are in CosmosDB or a relational DB, and your documents are stored separately in blob storage.
The first indexer has a data source that points to your items/relationship data in whichever DB you decide to put them in, applies any skills that you want, and puts everything in an index.
The second indexer has a different data source which points to your documents in blob storage, applies any skills that you want, and puts everything in the same index.
If you use indexers, they will take care of the document cracking. If you push data directly into the index, you will need to crack the documents yourself.
This gives a simple walkthrough of creating an indexer with the portal (skillset is optional, and change the data source to your own data): https://learn.microsoft.com/en-us/azure/search/cognitive-search-quickstart-blob
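If you end up scripting this instead of using the portal, here is a rough sketch of what the blob indexer from step 3 could look like with the Python azure-search-documents SDK. The service name, keys, container name, and index name are placeholders for illustration, and the target index is assumed to already exist:

# Hypothetical sketch: create a blob data source and an indexer that targets
# an existing index. Names and connection strings are placeholders.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexerClient
from azure.search.documents.indexes.models import (
    SearchIndexer,
    SearchIndexerDataContainer,
    SearchIndexerDataSourceConnection,
)

endpoint = "https://<your-search-service>.search.windows.net"
client = SearchIndexerClient(endpoint, AzureKeyCredential("<admin-key>"))

# Data source pointing at the blob container holding the PDFs/Word docs/emails.
data_source = SearchIndexerDataSourceConnection(
    name="documents-blob-ds",
    type="azureblob",
    connection_string="<storage-connection-string>",
    container=SearchIndexerDataContainer(name="documents"),
)
client.create_data_source_connection(data_source)

# Indexer that pulls from the blob data source into an existing index.
# Document cracking (extracting text from PDF/Word/email) happens inside
# the service here, so nothing needs to be cracked on-prem.
indexer = SearchIndexer(
    name="documents-blob-indexer",
    data_source_name="documents-blob-ds",
    target_index_name="items-and-documents-index",
)
client.create_indexer(indexer)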

Related

Creating a Firestore Database from existing files

I have a few music albums - basically just files in folders - that I want to upload to Firebase Storage.
One would usually run a function after a file has been uploaded to create a document containing the metadata about the song, but that's where I'm stuck.
I can get most of the info I need by reading the tracks' ID3 tags, but in a NoSQL database I think I'm supposed to not only create a document for each track but also a document for each album with an array of all tracks - or at least an array with all track IDs.
But when or how do I create the album document? Another example is the album cover: I want to save its URL inside each track document as well as in the corresponding album, but that means the artwork is the first thing I need to upload, because otherwise the URL doesn't exist yet.
I feel like I have to get this right before I start because updating everything afterwards is a pain.
Is using upload functions really the way to go here, or is there a tool or another way I'm missing?
Thank you very much.
You mentioned Firebase Storage, which is just a wrapper around Cloud Storage and is an object management system, not a database; I think you are actually referring to Cloud Firestore.
Since Firestore is, as you mentioned, a NoSQL DB, there is no single correct way to structure your schema, and it will definitely depend on each specific use case. However, you can take a look at the docs, which explain how to architect your schema when coming from a SQL to a NoSQL format.
Among other information, the main points are:
In general, you can treat documents as lightweight JSON records
You have complete freedom over what fields you put in each document
After you create the first document in a collection, the collection exists. If you delete all of the documents in a collection, it no longer exists.
You can use sub collections inside of collections
Deleting a document does not delete its subcollections!
And finally, to get an idea of how to structure the information, you can take a look at the repo "NoSQL-Spotify" by Luke Halley, which explains a NoSQL schema modeled on Spotify, so I think it should fit your need or at least give you a starting point.
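As a rough illustration of those points, here is a minimal sketch with the Python firebase-admin SDK. The collection and field names (albums, tracks, coverUrl) are just examples, not something prescribed by Firestore; the idea is to upload the artwork first, create the album document, and then create the track documents with the cover URL duplicated onto them:

# Hypothetical sketch: album document first, then one document per track
# in a subcollection, so the album ID and cover URL are known up front.
import firebase_admin
from firebase_admin import firestore

firebase_admin.initialize_app()  # uses application default credentials
db = firestore.client()

# Upload the artwork first (not shown here), so its URL already exists.
cover_url = "https://storage.googleapis.com/<bucket>/covers/some-album.jpg"

# Create the album document; its auto-generated ID is known immediately.
album_ref = db.collection("albums").document()
album_ref.set({"title": "Some Album", "artist": "Some Artist", "coverUrl": cover_url})

# Then create one document per track in a subcollection of that album,
# duplicating the cover URL so each track can be displayed on its own.
tracks = [{"title": "Track 1", "trackNo": 1}, {"title": "Track 2", "trackNo": 2}]
for track in tracks:
    album_ref.collection("tracks").document().set({**track, "coverUrl": cover_url})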

Resolve FK in firestore

I have some documents in Firestore with some fields in them. For example, the collection "details" looks like this:
{
id: "",
fields1: "",
userFK: Reference to users collection
}
Now I need to resolve userFK on the fly, meaning I don't want to first fetch all the documents and then query each userFK with userFK.get().
Is there any method for this, something like the $lookup that is supported in MongoDB?
In some cases I even want to fetch documents from the "details" collection based on specific fields in users.
There is no way to get documents of multiple types from Firestore with a single read operation. To get the user document referenced by userFK you will have to perform a separate read operation.
This is normal when using NoSQL databases like Cloud Firestore, as they typically don't support any server-side equivalent of a SQL JOIN statement. The performance of loading these additional details is not as bad as you may think though, so be sure to measure how long it takes for your use-case before writing it off as not feasible.
If this additional load is prohibitive for a scenario, an alternative is to duplicate the necessary data of the user into each details document. So instead of only storing the reference to their document, you'd for example also store the user name.
This puts more work on the write operation, but makes the read operations simpler and more scalable. This is the common trade-off of space vs time, where in NoSQL databases you'll often find yourself trading space for time: storing duplicate data so that reads stay fast.
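Here is a small sketch of both options with the google-cloud-firestore Python client; the document ID and the userName field are made-up examples for illustration, not part of your schema:

# Hypothetical sketch using the google-cloud-firestore Python client.
from google.cloud import firestore

db = firestore.Client()

# Option 1: resolve the reference with a second read.
detail_snap = db.collection("details").document("some-detail-id").get()
user_ref = detail_snap.get("userFK")   # comes back as a DocumentReference
user_snap = user_ref.get()             # second read to load the referenced user
print(detail_snap.get("fields1"), user_snap.get("name"))

# Option 2: denormalize - also store the user fields you need
# directly on the details document at write time.
db.collection("details").document("some-detail-id").set(
    {"fields1": "...", "userFK": user_ref, "userName": user_snap.get("name")},
    merge=True,
)

# Querying details by a duplicated user field then needs no join at all.
matches = db.collection("details").where("userName", "==", "Alice").stream()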
If you're new to NoSQL data modeling, I highly recommend:
NoSQL data modeling
Getting to know Cloud Firestore

CosmosDB API selection: does it dictate how the data is stored, or only how we communicate with the instance?

When creating a CosmosDB instance, we can choose the API that we will use to communicate with the instance (e.g. SQL, MongoDB, Cassandra, etc.)
What is not clear to me is: does this selection dictate how the data is stored, or only the way we communicate with the instance? For example, if we choose MongoDB, does it mean that Cosmos DB will store data in a MongoDB fashion?
The choice of API does not change how the data is stored. Cosmos DB always stores data using something called atom-record-sequence (ARS) which is essentially a set of primitive types, structs and arrays. The database engine translates the native ARS format into the data structures used by the various APIs (i.e. json documents, table rows, etc.)
So the answer to your question is that the choice of API only impacts how you communicate with the databases for that Cosmos DB account.
As David Makogon points out in his comment on another answer, while the way the data is stored is the same regardless of the API used, the content of the data will be different, because each API requires its own metadata so that the underlying data can be projected into the format expected by that API.
Here is a good technical overview of how Cosmos works under the hood.
https://azure.microsoft.com/en-us/blog/a-technical-overview-of-azure-cosmos-db/
Data is always stored in the same fashion (as a bunch of JSON documents); only the way you interact with the data changes.
https://learn.microsoft.com/en-us/azure/cosmos-db/introduction#develop-applications-on-cosmos-db-using-popular-open-source-software-oss-apis
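To make that concrete, here is a rough sketch of what "only the communication changes" means in practice: a different client library and wire protocol per API, with Cosmos DB projecting its internal storage format into the shape each API expects. The connection strings are placeholders, and note that an account is created for a single API, so these would be two separate accounts:

# Hypothetical sketch - placeholder connection strings, two separate accounts.

# Account created with the SQL (Core) API: talk to it with the Cosmos SDK.
from azure.cosmos import CosmosClient

sql_client = CosmosClient("https://<account>.documents.azure.com:443/", "<key>")
container = sql_client.get_database_client("appdb").get_container_client("items")
container.upsert_item({"id": "1", "name": "widget"})

# Account created with the MongoDB API: talk to it with a plain MongoDB driver.
from pymongo import MongoClient

mongo_client = MongoClient("mongodb://<account>:<key>@<account>.mongo.cosmos.azure.com:10255/?ssl=true")
mongo_client["appdb"]["items"].insert_one({"_id": "1", "name": "widget"})

# Either way, Cosmos DB persists the data in its internal format and projects it
# into the shape (JSON document, BSON document, ...) the chosen API expects.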

Relational behavior against a NoSQL document store for ODBC support

The first assertion is that document-style NoSQL databases such as MarkLogic and Mongo should store each piece of information in a nested/complex object.
Consider the following model
<patient>
  <patientid>1000</patientid>
  <firstname>Johnny</firstname>
  <claim>
    <claimid>1</claimid>
    <claimdate>2015-01-02</claimdate>
    <charge><amount>100</amount><code>374.3</code></charge>
    <charge><amount>200</amount><code>784.3</code></charge>
  </claim>
  <claim>
    <claimid>2</claimid>
    <claimdate>2015-02-02</claimdate>
    <charge><amount>300</amount><code>372.2</code></charge>
    <charge><amount>400</amount><code>783.1</code></charge>
  </claim>
</patient>
In the relational world this would be modeled as a patient table, claim table, and claim charge table.
Our primary desire is to simultaneously feed downstream applications with this data, but also perform analytics on it. Since we don't want to write a complex program for every measure, we should be able to put a tool on top of this. For example Tableau claims to have a native connection with MarkLogic, which is through ODBC.
When we create views using range indexes on our document model, the SQL against it in MarkLogic returns excessive repeating results. The charge numbers are also double counted with sum functions. It does not work.
The thought is that through these index, view, and possibly fragment techniques of MarkLogic, we can define a semantic layer that resembles a relational structure.
The documentation hints that you should create 1 object per table, but this seems to be against the preferred document db structure.
What is the data modeling and application pattern to store large amounts of document data and then provide a turnkey analytics tool on top of it?
If the ODBC connection is always going to return bad data and not be aware of relationships, then the ODBC support that all of these tools claim to have against NoSQL is not really true.
References
https://docs.marklogic.com/guide/sql/setup
https://docs.marklogic.com/guide/sql/tableau
http://www.marklogic.com/press-releases/marklogic-and-tableau-build-connection/
https://developer.marklogic.com/learn/arch/data-model
For your question: "What is the data modeling and application pattern to store large amounts of document data and then provide a turnkey analytics tool on top of it?"
The rule of thumb I use is that when I want to count "objects", I model them as separate documents. So if you want to run queries that count patients, claims, and charges, you would put them in separate documents.
That doesn't mean we're constraining MarkLogic to only relational patterns. In UML terms, a one-to-many relationship can be a composition or an aggregation. In a relational model, I have no choice but to model those as separate tables. But in a document model, I can do separate documents per object or roll them all together - the choice is usually based on how I want to query the data.
So your first assertion is partially true - in a document store, you have the option of nesting all your related data, but you don't have to. Also note that because MarkLogic is schema-agnostic, it's straightforward to transform your data as your requirements evolve (corb is a good option for this). Certain requirements may require denormalization to help searches run efficiently.
Brief example - a person can have many names (aliases, maiden name) and many addresses (different homes, work address). In a relational model, I'd need a persons table, a names table, and an addresses table. But I'd consider the names to be a composite relationship - the lifecycle of a name equals that of the person - and so I'd rather nest those names into a person document. An address OTOH has a lifecycle independent of the person, so I'd make that an address document and toss an element onto the person document for each related address. From an analytics perspective, I can now ask lots of interesting questions about persons and their names, and persons and addresses - I just can't get counts of names efficiently, because names aren't in separate documents.
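To make the "one document per record" idea concrete for the sample data in the question, here is a rough sketch in Python (standard library only; the element names come from the model above, and actually loading the resulting claim documents into MarkLogic is left out and would be done however you normally ingest documents):

# Hypothetical sketch: split the nested <patient> document into one standalone
# document per claim, carrying the parent patient id along, so that claims
# (and their charges) can be counted and related as their own records.
import xml.etree.ElementTree as ET

patient_xml = """
<patient>
  <patientid>1000</patientid>
  <firstname>Johnny</firstname>
  <claim>
    <claimid>1</claimid>
    <claimdate>2015-01-02</claimdate>
    <charge><amount>100</amount><code>374.3</code></charge>
  </claim>
  <claim>
    <claimid>2</claimid>
    <claimdate>2015-02-02</claimdate>
    <charge><amount>300</amount><code>372.2</code></charge>
  </claim>
</patient>
"""

patient = ET.fromstring(patient_xml)
patient_id = patient.findtext("patientid")

claim_docs = []
for claim in patient.findall("claim"):
    # Denormalize the patient id onto each claim so the claim document
    # stands on its own and can still be related back to the patient.
    ref = ET.SubElement(claim, "patientid")
    ref.text = patient_id
    claim_docs.append(ET.tostring(claim, encoding="unicode"))

for doc in claim_docs:
    print(doc)  # in practice: load each claim as its own document into MarkLogic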
I guess MarkLogic is a little atypical compared to other document stores. It works best when you don't store an entire table as one document, but one record per document. MarkLogic indexing is optimized for this approach, and handles searching across millions of documents easily that way. You will see that as soon as you store records as documents, results in Tableau will improve greatly.
Splitting documents into such small fragments also allows higher performance and a lower footprint. MarkLogic doesn't hold the data as persisted DOM trees that allow random access. Instead, it streams the data in a very efficient way, and relies on index resolution to pull relevant fragments quickly.
HTH!

DocumentDb and how to create folder?

New to documentdb and I am trying to determine the best way to store documents. We are uploading documents every 15 minutes and I need to keep them as easily separated by upload as possible. At first glance, I thought I could have a database and a collection for each upload. Then, I discovered you can only have 3 collections per database. This leaves me with either adding a naming convention or trying to use folders and paths. According to the same source (http://azure.microsoft.com/en-us/documentation/articles/documentdb-limits/), we are limited to 100 paths per collection. This leaves folders. I have been looking, but I haven't found anything concrete on creating folders within a collection. The object API doesn't have an obvious add/create method.
Is this possible? If so, are we limited to how many (assuming I stay within the allowed collection/database size)?
You could define a sequential naming convention and create a range index in the collection's indexing policy. That way, if you need to retrieve a range of documents, you can do so while leveraging the indexing capabilities of DocumentDB efficiently.
As a recommendation, examine the request charge response header on the requests you fire off during your tests. This lets you gauge how efficient your setup is (how taxing it is against the DB), which will translate into your cost structure for the service.
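Here is a rough sketch of that idea using the current Python SDK (azure-cosmos); the container name, the batch value format, and the uploadBatch property are made up for illustration, not something required by the service:

# Hypothetical sketch: sequential naming convention + range query per upload batch.
from azure.cosmos import CosmosClient

client = CosmosClient("https://<account>.documents.azure.com:443/", "<key>")
container = client.get_database_client("uploads-db").get_container_client("uploads")

# Write: prefix the id with the upload batch (every 15 minutes -> one batch value).
batch = "2015-06-01T12:15"
container.upsert_item({
    "id": f"{batch}_doc-001",
    "uploadBatch": batch,
    "payload": {"example": "document contents"},
})

# Read: pull everything from one upload batch (or a range of batches) back out.
items = container.query_items(
    query="SELECT * FROM c WHERE c.uploadBatch >= @start AND c.uploadBatch < @end",
    parameters=[
        {"name": "@start", "value": "2015-06-01T12:00"},
        {"name": "@end", "value": "2015-06-01T13:00"},
    ],
    enable_cross_partition_query=True,
)
for item in items:
    print(item["id"])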
Sorry about the comment. What we ended up doing was just dumping everything into one collection. The Azure DocumentDB query language (i.e. SQL-like) seems robust enough to handle detailed queries, though I am not sure what the efficiency will be like once we have a ton of documents in there.
