In particular, what about binary data? Cosmos DB's Core SQL (Document) API works with JSON, which does not easily allow binary data on the wire.
From what I have read, support is minimal. From the docs:
Some of Azure Cosmos DB's internal formats for encoding information, such as binary fields, are currently not as efficient as one might like. Therefore this can cause unexpected limitations on data size. For example, currently one couldn't use the full one Meg of a table entity to store binary data because the encoding increases the data's size.
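For what it's worth, the usual workaround in the Core SQL API is to base64-encode small binary payloads into a string property (and keep anything large in Blob Storage, storing only a reference in the document). A minimal sketch; the document shape and property names here are purely illustrative:

// Sketch: embedding a small binary payload in a Core SQL (JSON) document via base64.
// Property names ("contentType", "data") are illustrative, not a Cosmos DB convention.
const bytes = new Uint8Array([0xde, 0xad, 0xbe, 0xef]); // stand-in for your binary blob

const doc = {
  id: "attachment-1",
  contentType: "application/octet-stream",
  data: Buffer.from(bytes).toString("base64"), // base64 inflates size by roughly 4/3
};

console.log(bytes.length, doc.data.length); // 4 bytes become 8 base64 characters

That roughly one-third size inflation is exactly the kind of overhead the quoted passage is warning about.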
I have a 50-100 MB dataset that users need access to. It's static, so it doesn't make sense to host a server for it. There are two kinds of operations I'll perform on the data:
Reading objects by unique ObjectId. Each object is ~3 KB.
Full-text search through ~300,000 strings. Each string is 4-60 characters.
I'm considering storing the data as JSON files. The 300k strings will be stored separately, and I'll use https://github.com/nextapps-de/flexsearch or something similar to search over them (see the sketch below). I did something similar with a ~10 MB dataset back in 2016 using plain regex search, and it worked flawlessly.
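Here is a minimal sketch of that plan, assuming FlexSearch's 0.7-style Index API (add/search); the ids and strings are made up:

import { Index } from "flexsearch";

// Build the in-memory index over the ~300k strings once, at startup.
const index = new Index({ tokenize: "forward" });
const strings: Record<number, string> = { 1: "blue mountain bike", 2: "red racing bicycle" };
for (const [id, text] of Object.entries(strings)) index.add(Number(id), text);

// Full-text search returns matching ids, which map back to the stored objects.
const hits = index.search("bike"); // e.g. [1]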
Are there reasons to use RealmDB, SQLite, PouchDB or something else instead of just JSON?
I wish I had asked this question a year ago...
At the office where I currently work, we tried building an app with PouchDB and React Native. We saw PouchDB as an advantage because it wouldn't require our API to send all the data over and over again on every refresh triggered by the user; it would only send the data that had changed since the client's checkpoint. Since the data on the server was quite heavy (around 6k entries with more than 200 attributes each), we tried at all costs to go easy on the client's data plan.
Months after this implementation was in place, we added a search feature with many different options for sorting and filtering. Not only did we have to throw away our entire PouchDB implementation, we had to start from scratch and replace all of its logic with indexed JSON values. PouchDB's performance was extremely slow; it was taking more than 5 seconds to retrieve results, and we simply couldn't afford that kind of delay in our project.
In the end we managed to get very fast search by running FlexSearch over our indexed JSON. Don't make the same mistake we did; PouchDB cost us too much budget and precious time. It was a terrible choice.
Unfortunately I cannot offer proof or more details from a reputable source; I can only share the terrible personal experience I had when I thought we were reaching the end of a project and we had to start from scratch. It was a mess.
Oh boy, a bountied, opinion-based question!
I have about 5 years of experience with PouchDB specifically and a little with SQLite. I have only cursory experience with RealmDB - I tried it out and decided it was not a good fit for my hybrid/mobile needs.
PouchDB excels in one area hands down - synchronization/replication, just like its big brother CouchDB. Providing interaction with an offline database that synchronizes with a remote database is huge for many mobile apps. PouchDB is schemaless, leveraging JSON documents. With PouchDB one may choose among several data stores via adapters; since there can be quota headaches1 at your data size, the right choice would likely be the SQLite adapter. PouchDB does not support full-text search.
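For context, the replication story is roughly this simple (a sketch; the remote URL and database names are placeholders):

import PouchDB from "pouchdb";

const local = new PouchDB("entries");                            // on-device database
const remote = new PouchDB("https://example.com/couch/entries"); // CouchDB-compatible endpoint (placeholder)

// Continuous two-way replication; only changed documents travel over the wire.
local.sync(remote, { live: true, retry: true })
  .on("change", info => console.log("replicated", info))
  .on("error", err => console.error("sync error", err));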
SQLite is what its name implies - a relational database, requiring a schema. An advantage of SQLite is its platform support, and the size of the database is not subject to quota headaches the way web storage (e.g. IndexedDB) is. SQLite supports full-text search, and apps can ship with a canned database.
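To illustrate the search point, here is a sketch using SQLite's FTS5 module. better-sqlite3 is just a stand-in for whatever SQLite plugin your hybrid framework provides, and it assumes FTS5 is compiled into the bundled SQLite:

import Database from "better-sqlite3";

const db = new Database("strings.db"); // could be a canned database shipped with the app
db.exec("CREATE VIRTUAL TABLE IF NOT EXISTS phrases USING fts5(body)");

const insert = db.prepare("INSERT INTO phrases (body) VALUES (?)");
for (const s of ["blue mountain bike", "red racing bike"]) insert.run(s);

// MATCH runs a full-text query against the FTS index.
const hits = db.prepare("SELECT rowid, body FROM phrases WHERE phrases MATCH ?").all("bike");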
Between PouchDB and SQLite lies RealmDB - a schema-based object database that supports synchronization/replication. Like PouchDB, it does not support full-text search.
Now, your requirements:
Looking up object by id
300k static text
full-text search
I read 'static' to mean immutable.
Since your data does not change and full-text search is required, PouchDB and RealmDB would not be good choices. If there were a requirement to enhance, remove or add to the data, either would make sense, as changes to the data on a single server would replicate to the local database practically seamlessly.
SQLite might be a reasonable choice since it supports search and it is possible to deploy a canned database with the app. However, SQLite can be slow in hybrid apps.
So,
PouchDB and RealmDB would be massive overkill and not a good fit.
SQLite would add a fair bit of complexity.
For your specific requirements I'd stay on your path, though I have one concern: it appears flexsearch loads its index into memory. If that turns out to carry a performance penalty, then SQLite, with its ability to deploy a canned database and its built-in search facility, may prove a reasonable trade-off versus the added complexity.
Good luck!
1 Quota Headaches
I would say it really just depends on whether you want and NEED to leverage the power of relational queries. Because your data never changes, I would use JSON unless you are trying to perform complex comparisons across your data. In your case it sounds like you are just going to be looking up a particular ObjectId, so JSON is your best bet, especially because you say you won't need to change the data later.
If you organize your JSON so that your ObjectIds are in sorted order, you will easily be able to search quickly (see the sketch below).
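A sketch of that idea; the Entry shape and the objectId field name are made up for illustration:

// Binary search over an array of objects kept sorted by ObjectId.
interface Entry { objectId: string; payload: unknown; }

function findById(entries: Entry[], id: string): Entry | undefined {
  let lo = 0, hi = entries.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const probe = entries[mid].objectId;
    if (probe === id) return entries[mid];
    if (probe < id) lo = mid + 1; else hi = mid - 1;
  }
  return undefined; // not found
}

That said, if the whole dataset is loaded into memory anyway, a plain object or Map keyed by ObjectId gives constant-time lookups without needing any sorting.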
When creating a CosmosDB instance, we can choose the API that we will use to communicate with the instance (e.g. SQL, MongoDB, Cassandra, etc.)
What is not clear to me is whether this selection dictates how the data is stored, or only the way we communicate with the instance. For example, if we choose MongoDB, does that mean Cosmos DB will store data in a MongoDB fashion?
The choice of API does not change how the data is stored. Cosmos DB always stores data using something called atom-record-sequence (ARS), which is essentially a set of primitive types, structs and arrays. The database engine translates the native ARS format into the data structures used by the various APIs (i.e. JSON documents, table rows, etc.).
So the answer to your question is that the choice of API only impacts how you communicate with the databases for that Cosmos DB account.
As David Makogon points out in his comment on another answer, while the way the data is stored is the same regardless of the API used, the content of the data will be different, because each API requires its own metadata so that the underlying data can be projected into the format expected by each API.
Here is a good technical overview of how Cosmos works under the hood.
https://azure.microsoft.com/en-us/blog/a-technical-overview-of-azure-cosmos-db/
Data is always stored in the same fashion (as a bunch of JSON documents); only the way you interact with the data changes.
https://learn.microsoft.com/en-us/azure/cosmos-db/introduction#develop-applications-on-cosmos-db-using-popular-open-source-software-oss-apis
It seems like I'm hitting a 1 GB upper bound on my U-SQL input file size. Is there such a limit, and if so, how can it be increased?
Here's my case in a nutshell:
I'm working on a custom XML extractor that processes XML files of roughly 2.5 GB. These XML files conform to well-maintained XSD schemas. Using xsd.exe I've generated .NET classes for XML serialization. The custom extractor uses these deserialized .NET objects to populate the output rows.
This all works pretty neatly when running U-SQL on my local ADLA account from Visual Studio. Memory usage goes up to approximately 3 GB for a 2.5 GB input XML, so this should fit perfectly on a single vertex per file.
This still works great with <1 GB input files on the Data Lake.
However, when trying to scale things up on the Data Lake Store, it seems the job got terminated when it hit the 1 GB input file size boundary.
I know that streaming the outer XML and then deserializing the inner XML fragments is an alternative, but we don't want to create - and particularly maintain - too much custom code that depends on those externally managed schemas.
Therefore, raising the upper-limit would be great.
I see two issues right now. One that we can address, and one for which we have a feature under development for later this year.
By default, U-SQL assumes that you want to scale out processing over your file and will split it into 1 GB "chunks" for extraction. If your extractor needs to see all the data (e.g., in order to parse XML, JSON, or an image), you need to mark the extractor to process each file atomically (no splitting) in the following way:
[SqlUserDefinedExtractor(AtomicFileProcessing = true)]
public class MyExtractor : IExtractor
{ ...
Now, while a vertex has 3 GB of data, we currently limit the memory size for a UDO such as an extractor to 500 MB. So if you process your XML in a way that requires a lot of memory, you will currently still fail with a System.OutOfMemory error. We are working on adding annotations to UDOs that let you specify your memory requirements to override the default, but that is still under development at this point. The only ways to address this are to either make your data small enough, or - in the case of XML, for example - use a streaming parsing strategy that does not allocate too much memory (e.g., use the XmlReader interface).
I want to index blobs of type image and video.
From what I have read, Azure Search cannot index image and video types.
What I was thinking of doing is using the blob's metadata_storage_path. However, that is my key, and it is encoded.
Decoding it is really a performance killer.
Is there any way I can index images and videos using an Azure Search index?
If not, is there any other way?
IIUC, you want to index the metadata attached to the blob but not its content, correct? If so, set the dataToExtract parameter to storageMetadata, as described in Controlling which parts of the blob are indexed.
The cost of base64-decoding the encoded metadata_storage_path to correlate with the rest of your system is likely to be negligible compared to other work your app is doing, such as calls to the database or Azure Search. However, you can avoid the need for decoding if you fork metadata_storage_path into a new non-key field in your index, which won't need to be encoded. You can use field mappings to fork the field.
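As a sketch, the relevant pieces of the indexer definition would look roughly like this (names are placeholders; the shape mirrors the documented REST API for blob indexers):

// Illustrative indexer definition: metadata-only extraction plus a forked path field.
const indexerDefinition = {
  name: "media-indexer",          // placeholder
  dataSourceName: "media-blobs",  // placeholder
  targetIndexName: "media-index", // placeholder
  parameters: {
    configuration: { dataToExtract: "storageMetadata" }, // index blob metadata, skip content
  },
  fieldMappings: [
    // the key stays base64-encoded so it remains a valid document key
    { sourceFieldName: "metadata_storage_path", targetFieldName: "id",
      mappingFunction: { name: "base64Encode" } },
    // fork the same path into a plain, non-key field that needs no decoding
    { sourceFieldName: "metadata_storage_path", targetFieldName: "storagePath" },
  ],
};

console.log(JSON.stringify(indexerDefinition, null, 2)); // body for PUT /indexers/media-indexer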
In my hybrid mobile app, I want to allow the user to take their data to another device. There will be no server-side components from my end. The data the user carries would contain images, audio and video along with text, timestamps, etc. My design evolved as below:
1. Store each entry in a JSON file with image, audio and video as Data URIs and export this file to cloud sync platforms. The problem with this approach is that, even though JSON is better than XML, there could be better options; see below.
2. Store each entry in a BSON file with image, audio and video as Data URIs and export this file to cloud sync platforms. The problem with this approach is that, as mentioned on its site, the field names will still be repeated, and protobuf could be a better fit.
3. Store each entry in a protocol buffer file with image, audio and video as Data URIs and export this file to cloud sync platforms.
Then I stumbled across greenDAO, which mentions:
greenDAO lets you persist protocol buffer (protobuf) objects directly into the database.
What benefit would I get by storing the protobuf objects in a SQLite DB? Would I be able to export the SQLite file instead of a file containing the objects in protobuf format?
Well, the data still has to be serialized into the database somehow; greenDAO just hides that serialization from you. Since you have specific needs, you are probably best off building your own solution, better tailored to your needs.
If you don't anticipate the field names changing, why not just store the entries as database rows? This has a number of nice advantages, including the ability to have sortable and searchable entries.
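A sketch of the rows-based option (better-sqlite3 stands in for whatever SQLite plugin the hybrid framework provides, and the column names are invented). Large media would live on disk as files, with only their paths in the table:

import Database from "better-sqlite3";

const db = new Database("journal.db");
db.exec(`CREATE TABLE IF NOT EXISTS entry (
  id         TEXT PRIMARY KEY,
  created_at INTEGER NOT NULL,  -- unix timestamp, sortable
  body       TEXT NOT NULL,
  image_path TEXT,              -- media kept as files, referenced by path
  audio_path TEXT,
  video_path TEXT
)`);

// Sortable and searchable without deserializing every entry first.
const recent = db.prepare(
  "SELECT id, body FROM entry WHERE body LIKE ? ORDER BY created_at DESC LIMIT 20"
).all("%holiday%");

Moving the data to another device could then be as simple as copying the .db file together with the media folder.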