Best way to update an entire document in MarkLogic - xquery

I would like to replace an XML document in a database without losing any of its metadata (e.g. permissions, properties, or collections). Managed documents (DLS) are not an option.
Using xdmp:document-insert() does not retain permissions, collections, etc.
Using xdmp:node-replace() works well on parts of the document, but requires knowing the root node in advance.
Is there a recommended way to update an entire document in MarkLogic?

You don't really need to know the root element itself. If you know the document URI, you can do something like:
xdmp:node-replace(fn:doc($uri)/*, $new-xml)
If you have any node of the document, you can also do:
xdmp:node-replace($node/fn:root(), $new-xml)
But just using xdmp:document-insert() isn't that much more difficult either:
xdmp:document-insert(
  $uri,
  $new-xml,
  xdmp:document-get-permissions($uri),
  xdmp:document-get-collections($uri),
  xdmp:document-get-quality($uri)
)
Note: document properties are preserved by xdmp:document-insert(). See also: http://docs.marklogic.com/xdmp:document-insert
Additionally, there is not much performance difference between these methods. The biggest difference in that respect is that xdmp:node-replace() requires a node from the original document, meaning the original has to be retrieved from the database first. If the replacement does not depend on the original document, xdmp:document-insert() will be fastest.
HTH!

+1 to @grtjn. Note that the reason xdmp:node-replace() is no more efficient than xdmp:document-insert() is that all document updates rewrite the entire document. It is a common and understandable misconception that xdmp:node-replace() operates like an RDBMS field update, 'touching' only the affected field. (In the RDBMS case, that is often a misconception as well.)
Similarly to not needing to read the old document body: if you know what the permissions, collections, and quality should be, you can supply those (or defaults) rather than querying them with xdmp:document-get-permissions() etc. It may not make a measurable difference, but as with xdmp:node-replace(), if you don't need to query a value it's simpler not to; it removes unneeded dependencies and error opportunities (such as: what if the document doesn't exist?).
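A minimal sketch of supplying those values explicitly; the role names, collection, and quality here are placeholders for whatever your application actually uses:
xdmp:document-insert(
  $uri,
  $new-xml,
  (: explicit permissions instead of xdmp:document-get-permissions($uri) :)
  (xdmp:permission("app-reader", "read"),
   xdmp:permission("app-writer", "update")),
  "my-app-collection",
  0
)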

Related

Is it always safe to use eventId as the Firestore document id?

This article here recommends using the eventId as the document id to prevent multiple creations of a document due to background process retries. Is it guaranteed that there will never be a collision?
The mentioned article shows how to avoid duplicate items created by retries of an unsuccessful function. In short, it says that if you use the add() method (reference) and the function is retried (having failed after the Firestore write), you can end up with two identical documents in Firestore, each with a different auto-generated ID.
As a solution, the author proposes building the document ID from the event ID and writing to it with set() (reference).
This approach guarantees that retries of the same function invocation will not create duplicate items.
Coming back to the question... I think you are worried that two different invocations could have the same event_id, so the document could be overwritten. I think this is possible, but in my opinion it's out of scope for the article, which answers a different question and keeps the use case as simple as possible to help explain the approach.
Let's imagine we have two different functions invoked by the same event, writing different content to the same collection. The result will be unpredictable, I think. However, in such a situation you can use the same mechanism, upgraded a little, e.g. <function_name>_<event_id>. Using the example from the article, it's a small change:
...
return db.collection('contents').doc('<function_name>_'+eventId).set(content).then
...
So, in my understanding, if you are afraid of collisions you should add extra elements to the document IDs you create, as in the example above.
From my point of view, whether you can use an event_id as a Firestore document ID depends on your context and requirements.
For example, from the "business" point of view: is the message/event really a unique business-related thing (so you really do want to avoid duplicating messages)? Or is there some other business entity that must be unique, about which there can be more than one message (each with a different event_id)?
On top of that, to the best of my knowledge, it may be good practice to generate Firestore document IDs randomly (as a hash, a GUID, etc.); in that case, search/retrieval from Firestore should work "faster". So I don't know whether the event_id is "random" enough in your context. Maybe it is OK, maybe not...
In my personal experience, I try to generate a document ID as the hex digest of a hash of a string (possibly a composed string) that is supposed to be unique in the business context. For example, say the event/message is a google.storage.object.finalize event. In that case, I would use some metadata about the underlying object/file. Depending on the business context and requirements, that might (or might not) be the bucket name, object name, size, md5 or crc32c, etc., or a combination of those elements... The chosen elements are concatenated into a string, a hash is calculated, and the hex digest of that hash becomes the document ID in the Firestore collection.
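A minimal sketch of that idea in Node.js; the metadata fields used here (bucket, name, md5Hash) are assumptions, and should be chosen to match your own uniqueness requirements:
const crypto = require('crypto');

// Compose a string that is unique in the business context,
// then use the hex digest of its hash as the document ID.
function docIdFor(object) {
  const key = `${object.bucket}/${object.name}/${object.md5Hash}`;
  return crypto.createHash('sha256').update(key).digest('hex');
}

// Hypothetical usage: db.collection('files').doc(docIdFor(event.data)).set(content);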

Using Firestore document's auto-generated ID versus using a custom ID

I'm currently deciding on my Firestore data structure.
I'll need a products collection, and the products items will live inside of it as documents.
Here are my product's fields:
uniqueKey: string
description: array of strings
images: array of objects
price: number
QUESTION
Should I use Firestore's auto-generated IDs as the IDs of my documents, or is it better to use my uniqueKey (which I'll query for on many occasions) as the document ID? Is one of the two the better option?
I imagine that if I use my uniqueKey, it will make my life easier when retrieving a single document, but I'll also have to query for more than one product on many occasions.
Using my uniqueKey as ID:
db.collection("products").doc("myUniqueKey").get();
Using my Firestore auto-generated ID:
db.collection("products").where("uniqueKey", "==", "myUniqueKey").get();
Is this enough of a reason to go with my uniqueKey instead of the auto-generated one? Is there a rule of thumb here? What's the best practice in this case?
In terms of making queries from a client, using only the information you've given in the question, I don't see much practical difference between a document get using its known ID and a query on a field that is also unique. Either way, an index is used on the server side, and it costs exactly 1 document read. The document get() might be marginally faster, but it's not worthwhile to optimize like this (in my opinion).
When making decision about data modeling like this, it's more important to think about things like system behavior under load and security rules.
If you're reading and writing a lot of documents whose IDs have a sequential property, you could run into hotspotting on those writes. So, if you want to use your own IDs, and you expect to read and write them in sequence under heavy load, you could have a problem. If you don't anticipate that situation, then it likely doesn't matter much which ID you use.
If you are going to use security rules to limit access to documents, and you use the contents of other documents to help with that, you'll need to be able to uniquely identify those documents in your rules. You can't perform a query against a collection in rules, so you might need meaningful IDs that give direct access when used by rules. If your own IDs can easily be used this way in security rules, that might be more convenient overall. If you're forced to use Firestore's generated IDs, it might become inconvenient, difficult, or expensive to maintain a relationship between your IDs and Firestore's IDs.
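A minimal sketch, in Firestore's security-rules language, of what meaningful IDs buy you; the allowlist collection is an invented placeholder, and note that rules can only do direct get()/exists() lookups by path, never queries:
service cloud.firestore {
  match /databases/{database}/documents {
    match /products/{sku} {
      // The user's UID doubles as a meaningful document ID in 'allowlist',
      // so the rule can look that document up directly by path.
      allow read: if exists(/databases/$(database)/documents/allowlist/$(request.auth.uid));
    }
  }
}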
In any event, the decision you're making is not just about which ID is "better" in a general sense, but which ID is better for your specific, anticipated situation, under load, with security in mind.

Creating Empty Documents in Firestore

I understand that empty documents within collections are removed automatically by the system in Firestore. However, I have a situation now where the name of the document serves a purpose. I have a collection named usernames, and within this, many documents with the ID being the username. For instance, usernames/bob_marley is what I might see in the database. The problem here is that, since the documents do not have any fields in them, they get removed automatically thereby defeating the purpose of the set-up. How should I be structuring my database in these cases?
Thank you
The easiest thing to do is simply not allow the document ever to become empty. Keep one property in it, for example exists = true, and make sure it never gets removed. Use a security rule to prevent removal if you're concerned about users accidentally doing this to themselves.
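A minimal sketch; the exists field is just a placeholder property whose only job is to keep the document non-empty:
// Reserve the username by writing a document that always keeps
// at least one field, so it is never treated as empty.
db.collection('usernames').doc('bob_marley').set({ exists: true });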
Another thing to do is re-evaluate what exactly you're trying to achieve with an empty document in the system, and whether it's worthwhile to think about structuring your data in a way that best supports the queries you want to perform.

DocumentDB and how to create folders?

New to DocumentDB, and I am trying to determine the best way to store documents. We are uploading documents every 15 minutes, and I need to keep them as easily separated by upload as possible. At first glance, I thought I could have a database and a collection for each upload. Then I discovered you can only have 3 collections per database. This leaves me with either adding a naming convention or trying to use folders and paths. According to the same source (http://azure.microsoft.com/en-us/documentation/articles/documentdb-limits/), we are limited to 100 paths per collection. That leaves folders. I have been looking, but I haven't found anything concrete on creating folders within a collection. The object API doesn't have an obvious add/create method.
Is this possible? If so, are we limited to how many (assuming I stay within the allowed collection/database size)?
You could define a sequential naming convention and create a range index in the collection's indexing policy. Then, if you need to retrieve a range of documents, the query can leverage DocumentDB's indexing capabilities efficiently.
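A minimal sketch with the Node.js documentdb SDK; the uploadedAt property (an epoch timestamp stamped on each document at upload), the endpoint, key, and collection link are all assumed placeholders:
const { DocumentClient } = require('documentdb');
const client = new DocumentClient('https://myaccount.documents.azure.com:443/', { masterKey: key });
const collectionLink = 'dbs/uploads/colls/items';

// Retrieve a single 15-minute upload window as a range query,
// which a range index on uploadedAt can answer efficiently.
const querySpec = {
  query: 'SELECT * FROM c WHERE c.uploadedAt >= @from AND c.uploadedAt < @to',
  parameters: [
    { name: '@from', value: 1432684800 },
    { name: '@to', value: 1432685700 }
  ]
};
client.queryDocuments(collectionLink, querySpec).toArray((err, docs) => {
  // docs now holds every document from that upload
});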
As a recommendation, examine the request charge response header (x-ms-request-charge) on the requests you fire off during your tests. This lets you gauge how efficient your setup is (how demanding it is on the database, which will translate into your cost for the service).
Sorry about the comment. What we ended up doing was just dumping everything into one collection. The Azure DocumentDB query language (i.e. SQL-like) seems robust enough to handle detailed queries, though I am not sure what the efficiency will be like once we have a ton of documents in there.

Using Lucene.Net as a primary lookup for lists before heading to the database, is this a good idea?

First of all, we are building an ASP.NET web application. I do not want to use Lucene as a database, per se, but rather as the primary look-up for displaying lists to the user. This would be a canned search against Lucene from which we would pull, say, all user information to be displayed in a grid. Is it a good idea to pull a (pageable) list of items from Lucene to display to the user in some sort of grid format? The only time we would call the database is when a user selects a specific record to view or update.
My concern is stale data coming from Lucene. I have been looking for information about adds and updates to an index, but it is unclear to me whether my scenario is better suited to a database than to Lucene. My fellow developers and I have been going back and forth about this, but unfortunately we don't know enough about how Lucene handles writes and reads.
I'm not sure if it's a good or bad fit for your use case. Hopefully I can give you some insight on how Lucene stores its data, and you can make a decision from that.
Lucene is extremely quick if you want to search for an item in its index. The time it takes to index items isn't as quick; it's by no means slow given everything it's doing, but it adds complexity, and you need to know what to do about it.
Lucene is essentially a document store, so each item in Lucene is a Document, which can hold a number of fields. Those fields are essentially key-value pairs, though right now Lucene only supports string and byte[] as value types, and strings only as keys. Each field can be indexed and/or analyzed (or neither). Indexing simply means you can search against that field's data, generally only via exact matches and wildcards. Analyzing gives you better searching capabilities, since it takes the string and tokenizes it; depending on the analyzer, it will tokenize differently. The most common approach uses whitespace and stopwords, essentially marking each word as a term unless it's something like a, an, the, as, etc.
The real killer for many use cases: you can't update a document in an index in place. When you pull out a document to update it and change a field, the call to UpdateDocument() actually marks the old document as deleted and inserts a new document.
Notice I said it marks it as deleted. That introduces another aspect of Lucene indexes: optimization of the index. As you write to an index, every so often a segment of the index is written to disk (it's temporarily stored in RAM for fast indexing). When you run a search on an index, Lucene needs to open all those segments to find the terms to search against (and it has to order them, too). This means that if you have many segments, searching can be slow. A call to Optimize() will not only merge the segments together, it will also remove any documents marked for deletion, lowering your index size as well.
However, optimizing your index requires around 1.5x more disk space while the optimization is running, sometimes more. Fortunately, Lucene.Net is transactional during an optimization, which means that not only will your index not be corrupted if an optimization fails, but any existing IndexReader you have open will still be able to search and read from the index while you're optimizing it.
In short, if it were me: if you expect to get only a single result from each search, I might not recommend Lucene. Lucene especially shines when you're searching through many documents for many documents; it's an inverted index, and that's what it's good at. For a single lookup, you may be better off with a database. Unfortunately, the only way you'll really find out is to benchmark it. Fortunately, Lucene.Net is at least very easy to set up for something like that.
Also, if you do use Lucene.Net, consider our 2.9.4g branch. You may not be able to use it, since it is technically not release code, but it is a bit faster than normal Lucene, as we've added generics and removed some of the costly boxing done in previous versions.
Lucene is not a good fit for the scenario you're describing; what you're describing is caching data.
Why not use the ASP.NET cache? If you need a more robust caching solution, there's memcached and a whole host of others... even NoSQL stores like Mongo, Redis, etc.
Obviously, you'll need to manually remove items from the cache on updates to stop serving stale data.
I think this is a viable solution, and I say this because there is a major open-source content management system using a technique very similar to what you've described. It's called Umbraco, and its version 5 is going to use a customized version of Lucene.NET as a sort of cache.
You can look at the project and source here: http://umbraco.codeplex.com/SourceControl/changeset/view/5a7c9af9bbf9
