I have one million XML documents like this in my MarkLogic staging database.
<Details>
  <Name>AA</Name>
  <EmpId>123</EmpId>
  <Account>
    <AccountNo>111</AccountNo>
    <IFSC>ABC</IFSC>
  </Account>
  <Account>
    <AccountNo>222</AccountNo>
    <IFSC>DEF</IFSC>
  </Account>
</Details>
In this XML, an employee has multiple account numbers. From this, I want to identify any employees that share the same account number: find the unique account numbers across all 1M documents, and then check whether an account number is associated with multiple employee IDs.
How do I achieve this?
One way to list all of the AccountNo values that appear in more than one employee document is to use the cts:value-co-occurrences() function with references to an element range index on AccountNo and cts:uri-reference() (which is available when the URI lexicon is enabled). Return the results as a map, with the AccountNo as the key and the document URI(s) as the value, then filter the entries in the map and report which AccountNo values are associated with more than one document URI.
(: build a map of AccountNo value => URIs of the documents containing it :)
let $accountNumber-to-URI :=
  cts:value-co-occurrences(
    cts:element-reference(xs:QName("AccountNo")),
    cts:uri-reference(),
    "map")
for $accountNumber in map:keys($accountNumber-to-URI)
(: keep only the AccountNo values associated with more than one URI :)
where tail(map:get($accountNumber-to-URI, $accountNumber))
return $accountNumber
Note that in order to be able to do this, you would need to have a range index on the AccountNo element.
I have an XML like below in my database:
<PersonalData>
  <Person>
    <Name></Name>
    <Age></Age>
    <AccountNo>
      <Number>123</Number>
      <SwiftCode>1235</SwiftCode>
    </AccountNo>
    <AccountNo>
      <Number>15523</Number>
      <SwiftCode>188235</SwiftCode>
    </AccountNo>
  </Person>
</PersonalData>
In this XML, I have multiple AccountNo nodes, and I have around 1M similar records in my database. I want to identify the count of AccountNo nodes in my entire database.
One way in which you can report the count of AccountNo elements would be to use an XPath and count:
count(//AccountNo)
You can also use cts:search and specify the AccountNo in the $expression XPath, and then count() the results:
count(cts:search(//AccountNo, cts:true-query()))
Another way to get a count of all the AccountNo elements would be to run a CoRB job that selects the docs containing those elements and then, in the process module, returns a line for every element in the doc and writes the results to a text file. Below is an example OPTIONS-FILE that could be used to achieve that:
URIS-MODULE=INLINE-XQUERY|let $uris := cts:uris('',(),cts:element-query(xs:QName("AccountNo"), cts:true-query())) return (count($uris), $uris)
PROCESS-MODULE=INLINE-XQUERY|declare variable $URI external; doc($URI)//AccountNo ! 1
PROCESS-TASK=com.marklogic.developer.corb.ExportBatchToFileTask
EXPORT-FILE-NAME=AccountNoCounts.txt
DISK-QUEUE=true
Then you could get the line count from the result file, which would tell you how many elements there are: wc -l AccountNoCounts.txt
If you need to be able to get this count often, and need the response to be fast, you could create a TDE that projects a row for each of the AccountNo elements, and then select the count with SQL (e.g. SELECT count(1) FROM Person.AccountNo) or use the Optic API against that TDE and op.count().
Given two DynamoDB tables: Books and Words, how can I create an index that associates the two? Specifically, I'd like to query to get all Books that contain a certain Word, and query to get all Words that appear in a specific Book.
The objective is to avoid scanning an entire table for these queries.
Based on your question I can't tell if you only care about unique words or if you want every word including duplicates. I'll assume unique words.
This can be done with a single table and a Global Secondary Index.
Create a table called BookWords with a Hash key of bookId and a Sort key of word. If you Query this table with a bookId you will get all of the unique words in that book.
Create a Global Secondary Index with a Hash key of word and a Sort key of bookId. If you Query this index with a word you will get all of the bookIds of books that contain that word.
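If it helps to see it concretely, below is a rough boto3 (Python) sketch of that layout. The table name BookWords follows the description above, but the index name word-bookId-index, the on-demand billing mode, and the keys-only projection are assumptions made for illustration, not requirements.

import boto3

dynamodb = boto3.client("dynamodb")

# One item per (bookId, word) pair; the GSI simply inverts the key order.
dynamodb.create_table(
    TableName="BookWords",
    AttributeDefinitions=[
        {"AttributeName": "bookId", "AttributeType": "S"},
        {"AttributeName": "word", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "bookId", "KeyType": "HASH"},
        {"AttributeName": "word", "KeyType": "RANGE"},
    ],
    GlobalSecondaryIndexes=[
        {
            "IndexName": "word-bookId-index",
            "KeySchema": [
                {"AttributeName": "word", "KeyType": "HASH"},
                {"AttributeName": "bookId", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "KEYS_ONLY"},
        }
    ],
    BillingMode="PAY_PER_REQUEST",
)
# (wait for the table to become ACTIVE before writing or querying)

# All unique words in a given book: query the base table.
words = dynamodb.query(
    TableName="BookWords",
    KeyConditionExpression="bookId = :b",
    ExpressionAttributeValues={":b": {"S": "book-123"}},
)

# All books that contain a given word: query the GSI.
books = dynamodb.query(
    TableName="BookWords",
    IndexName="word-bookId-index",
    KeyConditionExpression="word = :w",
    ExpressionAttributeValues={":w": {"S": "dynamodb"}},
)

Loading a book then means writing one item per unique word it contains, and neither lookup direction needs a full table scan.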
Depending on your use case, you will probably want to normalize the words. For example, is "Word" the same as "word"?
If you want all words, not just unique words, you can use a similar approach with a few small changes. Let me know.
Folks,
Given we have to store the following shopping cart data:
userID1 ['itemID1','itemID2','itemID3']
userID2 ['itemID3','itemID2','itemID7']
userID3 ['itemID3','itemID2','itemID1']
We need to run the following queries:
Give me all items (which is a list) for a specific user (easy).
Give me all users which have itemID3 (precisely my question).
How would you model this in DynamoDB?
Option 1: only have the Hash key? i.e.
HashKey(users) cartItems
userID1 ['itemID1','itemID2','itemID3']
userID2 ['itemID3','itemID2','itemID7']
userID3 ['itemID3','itemID2','itemID1']
Option 2, Hash and Range keys?
HashKey(users) RangeKey(cartItems)
userID1 ['itemID1','itemID2','itemID3']
userID2 ['itemID3','itemID2','itemID7']
userID3 ['itemID3','itemID2','itemID1']
But it seems that range keys can only be strings, numbers, or binary...
Should this be solved by having 2 tables? How would you model them?
Thanks!
Rule 1: Range keys in a DynamoDB table must be scalar, which is why the type must be string, number, or binary. You can't use a list, set, or map type.
Rule 2: You cannot (currently) create a secondary index on a nested or non-scalar attribute (see the Improving Data Access with Secondary Indexes in DynamoDB documentation). That means you cannot index cartItems, since index key attributes must be scalar, top-level attributes. You may need another table for this.
So, the simple answer to your question is another question: how do you use your data?
If you query the users for a given item (say itemID3 in your case) infrequently, a Scan operation with a filter expression may work just fine. To model your data, you may use the user ID as the HASH key and cartItems as a string set (SS type) attribute. For the queries, you provide a filter expression for the Scan operation like this:
contains(cartItems, :expectedItem)
and provide the value itemID3 for the :expectedItem placeholder in the value-map parameter.
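For illustration, a minimal boto3 (Python) sketch of that Scan; the table name Carts and the attribute names userID and cartItems are assumed here rather than taken from the question:

import boto3

dynamodb = boto3.client("dynamodb")

# Scan the whole table, keeping only items whose cartItems set contains itemID3.
# Note that a Scan still reads every item; the filter only trims the response.
resp = dynamodb.scan(
    TableName="Carts",
    FilterExpression="contains(cartItems, :expectedItem)",
    ExpressionAttributeValues={":expectedItem": {"S": "itemID3"}},
)
users = [item["userID"]["S"] for item in resp["Items"]]
# For large tables, keep calling scan() with ExclusiveStartKey=resp["LastEvaluatedKey"]
# until no LastEvaluatedKey is returned.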
If you run queries like this frequently, you could instead create another table with the item ID as the HASH key and the set of users having that item as a string set attribute. In that case, the second query in your question becomes the first kind of query against the other table.
Be aware that you then need to maintain the data in two tables for each CRUD action, which may be trivial with DynamoDB Streams.
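A minimal sketch of maintaining that second table with boto3, assuming a hypothetical table named ItemUsers with itemId as the HASH key and a users string-set attribute:

import boto3

dynamodb = boto3.client("dynamodb")

# Whenever an item is added to a user's cart, also record the user under that item.
# ADD on a string set appends the value(s); #u aliases the attribute name "users".
dynamodb.update_item(
    TableName="ItemUsers",
    Key={"itemId": {"S": "itemID3"}},
    UpdateExpression="ADD #u :user",
    ExpressionAttributeNames={"#u": "users"},
    ExpressionAttributeValues={":user": {"SS": ["userID2"]}},
)

# "Which users have itemID3?" is then a single GetItem, no scan needed.
resp = dynamodb.get_item(TableName="ItemUsers", Key={"itemId": {"S": "itemID3"}})
user_ids = resp["Item"]["users"]["SS"]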
I am new to indexes and DB optimization. I know there is a simple index on one column:
CREATE index ON table(col)
Presumably a B-tree will be created and search capabilities will be improved.
But what happens with a two-column index? And why is the order of definition important?
CREATE index ON table(col1, col2)
Yes, a B-tree index will be created in most databases if you don't specify another index type. A composite index is useful when queries take advantage of the combined selectivity of the composite columns.
The order of the columns in a composite index is important: searching with exact values for all the fields included in the index leads to minimal search time, but if values are provided for only some of the fields, the search can use only the leading fields to retrieve matching records.
I found the following example, which may help your understanding:
In the phone book example with a composite index created on the columns (city, last_name, first_name), if we search by giving exact values for all three fields, search time is minimal. But if we provide the values for city and first_name only, the search uses only the city field to retrieve all matched records; a sequential lookup then checks the match on first_name. So, to improve performance, one must ensure that the index is created in the order of the search columns.
I'm new to DynamoDB - I already have an application where the data gets inserted, but I'm getting stuck on extracting the data.
Requirements:
1. There must be a unique table per customer
2. Insert documents into the table (each doc has a unique ID and a timestamp)
3. Get X number of documents based on timestamp (ordered ascending)
4. Delete individual documents based on unique ID
So far I have created a table with a composite key (S:id, N:timestamp). However, when I come to query it, I realise that since my ID is unique and I can't do a wildcard search on it, I won't be able to extract a range of items...
So, how should I design my table to satisfy this scenario?
Edit: Here's what I'm thinking:
Primary index will be composite: (s:customer_id, n:timestamp), where the customer ID will be the same within a table. This will enable me to extract data based on a time range.
Secondary index will be a hash (s:unique_doc_id), whereby I will be able to delete items using this index.
Does this sound like the correct solution? Thank you in advance.
You can satisfy the requirements like this:
Your primary key will be h:customer_id and r:unique_id. This makes sure all the elements in the table have different keys.
You will also have an attribute for timestamp and will have a Local Secondary Index on it.
You will use the LSI for requirement 3, and a batchWrite API call to do batch deletes for requirement 4.
This solution doesn't require (1): all the customers can stay in the same table. (Heads up: there is a limit of 256 tables per account before you have to contact AWS to raise it.)
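As a concrete illustration, here is a rough boto3 (Python) sketch of that design. The table name Documents, the attribute names customer_id, unique_id, and created_at, and the index name are assumptions for the example, not part of the answer itself:

import boto3

dynamodb = boto3.client("dynamodb")

# Hash key customer_id, range key unique_id; LSI re-sorts each customer's
# items by the created_at timestamp attribute.
dynamodb.create_table(
    TableName="Documents",
    AttributeDefinitions=[
        {"AttributeName": "customer_id", "AttributeType": "S"},
        {"AttributeName": "unique_id", "AttributeType": "S"},
        {"AttributeName": "created_at", "AttributeType": "N"},
    ],
    KeySchema=[
        {"AttributeName": "customer_id", "KeyType": "HASH"},
        {"AttributeName": "unique_id", "KeyType": "RANGE"},
    ],
    LocalSecondaryIndexes=[
        {
            "IndexName": "customer_id-created_at-index",
            "KeySchema": [
                {"AttributeName": "customer_id", "KeyType": "HASH"},
                {"AttributeName": "created_at", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "ALL"},
        }
    ],
    BillingMode="PAY_PER_REQUEST",
)
# (wait for the table to become ACTIVE before writing or querying)

# Requirement 3: the oldest X documents for a customer, via the LSI.
resp = dynamodb.query(
    TableName="Documents",
    IndexName="customer_id-created_at-index",
    KeyConditionExpression="customer_id = :c",
    ExpressionAttributeValues={":c": {"S": "customer-1"}},
    ScanIndexForward=True,  # ascending timestamp order
    Limit=10,
)

# Requirement 4: delete an individual document by its primary key
# (or group several deletes together with batch_write_item).
dynamodb.delete_item(
    TableName="Documents",
    Key={"customer_id": {"S": "customer-1"}, "unique_id": {"S": "doc-42"}},
)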