Get all values of some parameter for all documents in Marklogic - xquery

I'm trying to get 'xxx' parameter of all documents in Marklogic using query like:
(/doc/document)/xxx
But since we have very big documents database I get an error "Expanded tree cache full on host". I don't have admin rights for this server, so I can't change configuration. I suggest that I can use ranges while getting documents like:
(/doc/document)[1 to 1000]/xxx
and then
(/doc/document)[1000 to 2000]/xxx
etc, but I'm concerned that I do not know how it works, for example, what will happen if during this process database will be changed (f.e. a new document will be added), how will it affect the result documents list? Also I don't know which order it uses in case when I use ranges...
Please clarify, is this way can be appropriate or is there any other ways to get some parameter of all documents?

Depending on how big your database is there may be no way to get all the values in one transaction.
Suppose you have a trillion documents, the result set will be bigger then can be returned in one transaction.
Is that important ? Only your business case can tell.
The most efficient way of getting all "xxx" values is with a range index. You can see how this works
with cts:element-values ( https://docs.marklogic.com/cts:element-values )
You do need to be able to create a range index over the element "xxxx" to do this (ask your DBA).
Then cts:element-values() returns only those values and the chances of being able to return most or all of them
in memory in a signle transaction is much higher then using xpath (/doc/document/xxx) which as you wrote actualy returns all the "xxx" elements (not just their values). The most likely requires actually loading every document matching /doc and then parsing it and returning the xxx element. That can be both slow and inefficient.
A range index just stores the values and you can retrieve those without ever having to load the actual document.
In general when working with large datasets learning how to access data in MarkLogic using only indexes will produce the fastest results.

Related

Large arrays in Firestore Database (Best practices)

I am populating a series of dates and temperatures that I was thinking of storing in a Firestore Database to later be consumed by the front-end with the following structure:
{
date: ['1920-01-01', '1920-01-02', '1920-01-03', '1920-01-04', '1920-01-05', ...],
values: [20, 18, 19.5, 20.5, ...]
}
The array may consider a lot of years, so it turns huge, with thousands of entries. Firestore database started complaining about returning the too many index entries for entity error, and even if I get the data uploaded, the user interface Firebase -> Firebase Database -> Panel View collapses. That happens even with less than 3000 entries array.
The fact is that the data is consumed in the front-end with an array structure very similar to the one described above (I want to plot it using Echarts library). This way, I found this structure to be the more natural way, as any other alternative will require reversing the structure to arrays in the front-end.
Nevertheless, I see that Firestore Database very clearly does not like this structure. What should I do? What is the best practice for dealing with this kind of data in Firestore?
The indexes required for the most basic queries in Firestore are automatically created for you. However, there are some limits involved. So you're getting the following error:
too many index entries for entity
Because you hit the maximum number of index entries for a document, which is 40,000. If you add too many elements into an array or you add too many fields to a document, then you can reach the maximum limit.
So most likely the number of elements that exist in the date array + the number of elements that exist in the values array is bigger than 40k, hence the error.
To solve this, you might consider creating two separate documents, one for each array. If you still hit the maximum limit, then you might consider creating a document for each hour, and not for an entire day. In this way, you'll drastically reduce the number of elements that exist in an array.
If you don't find these solutions useful, then you have to set some "Single-field index exemptions" to avoid the above error.
Firestore is not the best tool to deal with time series. The best solution I found in Firestore was creating an independent document for each day in my data. Nevertheless, that raises the number of documents I need to fetch from the front-end side and, therefore, the costs.
By using large arrays in Firestore, you easily reach the index limit, and you are forced to remove the index, which I feel is a big red flag, suggesting checking another tool.
The solution I found, in case is useful for anyone, was building my API in Flask using MongoDB as a database. Although it takes more effort than just using Firestore, it deals better with time series and brings more flexibility.

How to structure data Firestore, for multiple user enteries

This is my first time using a NOSQL database and I'm really struggling to work out how to structure my data.
I have an app that predicts a users mood and then the user can select if that's right or not. So I need to save both the prediction and the actual result. I want to be able to pull the latest result from firebase and display it on the app.
I understand how I'd do this on an SQL DB and understand how to write an SQL query to get that data back out.
For my Firebase DB I thought of the following structure
the document name is the usersID and store multiple arrays based on the timestamp but I can't seem to user OrderBy on a document only a collection so not sure how to get this back.
The fact that this seems so difficult less me to believe I've implemented the DB wrong to begin with.
Structure of DB is as follows:
I should add that it all works fine for the USER_TABLE as its one document id and a single entry, so I've no problem retrieving that.
Thanks for your help!
orderBy is an instruction to the database to order documents on the server, before it returns them to your app. To store the fields inside the document, you can just do that inside your application code after it receives the document(s).
There is in itself nothing wrong with storing these entries in a single document, Just keep in mind that:
A document can be at most be 1MB in size, so make sure this fits your maximum number of entries.
Firestore only ever returns full documents, so you will either get all entries in a document, or none of them.
You won't be able to order or filter the entries inside a single document. If that is a requirement for you, consider storing each entry in its own document in a subcollection. Note that this will increase the number of documents each user reads though, which will increase the cost.

How to delete Single-field indexes that generated automatically by firestore?

update:
TLDR;
if you reached here, you should recheck the way you build your DB.
Your document(s) probably gets expended over time (due to nested list or etc.).
Original question:
I have a collection of documents that have a lot of fields. I do not query documents even no simple queries-
I am using only-
db.collection("mycollection").doc(docName).get().then(....);
in order to read the docs,
so I don't need any indexing for this collection.
The issue is that firestore generates Single-field indexes automatically, and due to the amount of fields cause limitation exceeding of indexing:
And if I trying to add a field to one of the documents it throws me an error:
Uncaught (in promise) Error: Too many indexed properties for entity: app: "s~myapp",path < Element { type: "tags", name: "aaaa" }>
at new FirestoreError (index.cjs.js:346)
at index.cjs.js:6058
at W.<anonymous> (index.cjs.js:6003)
at Ab (index.js:23)
at W.g.dispatchEvent (index.js:21)
at Re.Ca (index.js:98)
at ye.g.Oa (index.js:86)
at dd (index.js:42)
at ed (index.js:39)
at ad (index.js:37)
I couldn't find any way to delete these single-field-indexing or to tell firestore to stop generating them.
I found this in firestore console:
but there is no way to disable this, and to disable auto indexing for a specific collection.
Any way to do it?
You can delete simple Indexes in Firestore firestore.
See this answer for more up to date information on creating and deleting indexes.
Firestore composite index permutation explosion?
If you go in to Indexes after selecting the firestore database and then select "single" indexes there is an Add exemption button which allows you to specify which fields in a Collection (or Sub-collection) have simple indexes generated by Firestore. You have to specify the Collection followed by the field. You then specify every field individually as you cannot specify a whole collection. There does not seem to be any checking on valid Collections or field names.
The only way I can think to check this has worked is to do a query using the field and it should fail.
I do this on large string fields which have normal text in them as they would take a long time to index and I know I will never search using this field.
Firestore creates two indexes for every simple field (ascending and descending) but it is also possible to create an exemption which removes one of these if you will never need the second one which helps improve performance and makes it less likely to hit the index limits. In addition you can select whether arrays are indexed or not. If you create a lot of entries it an Array, then this can very quickly hit the firestore limits on the number of indexes, so care has to be taken when using indexes and it will often be best to take the indexes off Arrays since the designer may have no control over how many Array data items are added with the result that the maximum index limit is reached and the application will get an error as the original poster explained.
You can also remove any simple indexes if you are not using them even if a field is included in a complex index. The complex index will still work.
Other things to keep an eye on.
If you are indexing a timestamp field (or any field that increases or decreases sequentially between documents) and you are not using this to force a sequence in queries, then there is a maximum write rate of 500 writes per second for the collection. In this case, this limit can be removed by removing the increasing and decreasing indexes.
Note that unlike the Realtime Database, fields created with Auto-ID do not guarantee any ordering as they are generated by firestore to spread writes and avoid hotspots or bottlenecks where all writes (and therefore reads) end up at a single location. This means that a timestamp is often needed to generate ordering but you may be able to design your collections / sub-collections data layout to avoid the need for a timestamp. For example, if you are using a timestamp to find the last document added to a collection, it might be better to just store the ID of the last document added.
Large array or map fields can also cause the 20,000 index entries per document limit to be reached, so you can exempt the array from indexing (see screenshot below).
Once you have added one exemption, then you will get this screen.
See this link as well.
https://firebase.google.com/docs/firestore/query-data/index-overview
The short answer is you can't do that right now with Firebase. However, this is a good signal that you need to restructure your database models to avoid hitting limits such as the 1MB per document.
The documentation talks about the limitations on your data:
You can't run queries on nested lists. Additionally, this isn't as
scalable as other options, especially if your data expands over time.
With larger or growing lists, the document also grows, which can lead
to slower document retrieval times.
See this page for more information about the advantages and disadvantages on the different strategies for structuring your data: https://firebase.google.com/docs/firestore/manage-data/structure-data
As stated in the Firestore documentation:
Cloud Firestore requires an index for every query, to ensure the best performance. All document fields are automatically indexed, so queries that only use equality clauses don't need additional indexes. If you attempt a compound query with a range clause that doesn't map to an existing index, you receive an error. The error message includes a direct link to create the missing index in the Firebase console.
Can you update your question with the structure data you are trying to save?
A workaround for your problem would be to create compound indexes, or as a last resource, Firestore may not be suited to the needs for your app and Firebase Realtime Database can be a better solution.
See tradeoffs:
RTDB vs Firestore
I don't believe that there currently exists the switch that you are looking for, so I think that leaves the following,
Globally disable built-in indexes and create all indexes explicitly. Painful and they have limits too.
A workaround where you treat your Cloud Firestore unfriendly content like a BLOB, like so:
To store,
const objIn = {text: 'my object with a zillion fields' };
const jsonString = JSON.stringify(this.objIn);
const container = { content: this.jsonString };
To retrieve,
const objOut = JSON.parse(container.content);

A limit clarification for the new Firestore

So in the limits section (https://firebase.google.com/docs/firestore/quotas) of the new Firestore product from Firebase it says:
Maximum write rate to a collection in which documents contain
sequential values in an indexed field: 500 per second
We're pretty confused as to what that actually entails.
If we have, say, a root-level collection called users with 10 million entries in it, will this rate affect this collection in such a way, so only 500 users can update their data in any given second?
Can anyone clarify?
Sorry for the confusion; an example might help.
If your user documents contained a last-updated timestamp and you index on that timestamp then each new write would end up clustering around the same value (now) creating a hotspot in the index.
Similarly if you somehow assigned users a sequential value like a place in line or something like that this would also create a hotspot.
Incidentally this is why generated document IDs are random strings. This evenly distributes the writes on the primary key index.
If you avoid these kinds of patterns the sky's the limit, though during beta you'd hit the database-wide limit.
A quick additional note : for the moment all properties are indexed by default, so if you had a last-updated timestamp it would necessarily be indexed - so you would not be able to avoid the hotspoting.
Index disablement will be available down the road though.

Get an object from a bucket in riak without knowing its key

I am using a riak bucket to store a list of messages, using a UUID as the key and a json message as value. This is working fine.
What I need is an efficient way to get a single message from the bucket without knowing its key, at least in one of these two scenarios:
Get the last inserted object (this is my prefered approach).
Get a random object from the bucket (if the first alternative is not possible).
Is there any efficient way to achieve that?
I think one alternative could be to retrieve the keys in the bucket and then get the first one. But this means making two calls to riak, one to obtain all the keys (just to discard all but one) and a second one to obtain the object. It does not seem very efficient.
As Riak is a key-value store, the by far most efficient way to retrieve data is through the keys. Listing or retrieving all keys in a bucket, even if you only end up using the one returned first, is one of the least efficient operations you can perform as it causes Riak to scan ALL keys in the system (not just the bucket), and it is usually recommended NEVER to use this on a production system.
The most efficient way to get the last inserted object would probably be to store the id in a separate, known record in a different bucket. This would however require you to perform two writes on every insert and two reads for every read, but would do so in the most efficient way. You could possibly implement a post-commit hook (would have to be in Erlang as it is not currently not possible to write records using JavaScript functions) on the bucket containing messages to get the system to perform the update for you, which would remove the need for the last write.
If you write a lot of data to the bucket containing messages, you may want to adjust the separate bucket so that it does not allow multiple values and that the last value wins. This way you would reduce the risk of having lots of siblings created due to frequent updates to this single record across the system. This would always give you one of the last written records, but not necessarily the last one (especially if you frequently write messages to the database), as Riak does not support any type of atomicity and is an eventually consistent database.
You could also create one or more secondary indexes if you are using the leveldb backend, and use this to limit your scan to only recent records, which would be more efficient than a scann of all keys. You could then either select the most recent key or a random one through mapreduce, but this would be much less efficient than the previously described approach.
I can not think of any efficient way to retrieve a random record in a bucket from Riak unless you know the range of keys you have inserted and can decide randomly on the client which one to get. One way to do this would be to generate all keys in sequence rather than using a UUID, but that is naturally not a good idea in a highly concurrent distributed system.
1st task is pretty easy to implement:
Add post-commit hook that will write the last inserted key to some predefined key/bucket place
Get the key from that predefined key/bucket and issue a get query using them
It's still two operations but both are just gets that are fast. Plus additional overhead on hook but nothing too heavy either.
2nd scenario is also easy, but it is way too inefficient to be used practically:
Get all keys (extremely expensive operation)
Pick random
Issue get
I have come up with the same scenario. In My scenario I have to save the users. For that I required an auto increment Id. So what I did is, I placed the last inserted key in a separate bucket as like mentioned by "Christian Dahlqvist", every time I want to insert new record I fetch the last inserted key from that key bucket. Here we have only one value in that bucket with the key as "LastKey" which is always known to us. And I incremented the key based on the fetched key and again updated the key bucket. So always the key bucket contains the latest key in it.

Resources