How do I sort dynamically in XQuery on MarkLogic?
let $sortelement := 'Salary'
for $doc in collection('employee')
order by $doc/$sortelement
return $doc
PS: Sorting will change based on user input, like data, name in place of salary.
If Salary is the name of the element, then you could more generically select any element in the XPath with * and then apply a predicate filter to test whether the local-name() matches the variable for the selected element value $sortelement:
let $sortelement := 'Salary'
for $doc in collection('employee')
order by $doc/*[local-name() eq $sortelement]
return $doc
This manner of sorting all items in the collection may work with a smaller number of documents, but if you are working with hundreds of thousands or millions of documents, you may find that pulling back all the docs is either slow or blows out the Expanded Tree Cache.
A more efficient solution would be to create range indexes on the elements that you intend to sort on. You could then perform a search with options specified to order the results by cts:index-order with an appropriate reference to the indexed item, such as cts:element-reference(), cts:json-property-reference(), or cts:field-reference().
For example:
let $sortelement := 'Salary'
return
cts:search(doc(),
cts:collection-query("employee"),
cts:index-order(cts:element-reference(xs:QName($sortelement)))
)
Not recommended, because the chance of introducing security issues, runtime crashes and just 'bad results' is much higher and more difficult to control,
BUT available as a last resort:
ALL XQuery can be dynamically created as a string and then evaluated using xdmp:eval.
It is much better to follow the guidance of Mads and use the search APIs instead of XQuery FLWOR expressions. Note that these APIs actually 'compile down' to a data structure. This is what the 'cts constructors' do: https://docs.marklogic.com/cts/constructors
I find it helps to think of cts searches as a structured search described by data, for which the cts:xxx functions are simply helpers that create the data structure.
(They don't actually do any searching; they build up a data structure that is used to do the searching.)
If you look at the source to the search:xxx apis you can see how this is done.
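If the sort element really does come from user input, one common way to limit the risks mentioned above is to check it against an allow-list before it gets anywhere near a query. The following is only a sketch in Python of that idea; the set ALLOWED_SORT_ELEMENTS and the function safe_sort_element are hypothetical names, not part of any MarkLogic API.

```python
# Hypothetical allow-list of element names the application is willing to sort on.
ALLOWED_SORT_ELEMENTS = {"Salary", "Name", "Date"}


def safe_sort_element(user_input, default="Salary"):
    """Return user_input only if it is an allowed element name, else the default."""
    if user_input in ALLOWED_SORT_ELEMENTS:
        return user_input
    return default


chosen = safe_sort_element("Name")
rejected = safe_sort_element("Salary] | doc(), (1")  # injection attempt falls back
```

The validated name can then be passed on as the value of $sortelement, so arbitrary text never becomes part of an evaluated string.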
I've been reading the DynamoDB docs and was unable to work out whether it makes sense to query a Global Secondary Index using the 'contains' operator.
My problem is as follows: my DynamoDB document has a list of embedded objects, and every object has a 'code' field which is unique:
{
"entities":[
{"code":"entity1Code", "name":"entity1Name"},
{"code":"entity2Code", "name":"entity2Name"}
]
}
I want to be able to get all documents that contain entities with entity.code = X.
For this purpose I'm considering adding a Global Secondary Index that would contain all entity.codes that are present in current db document separated by a comma. So the example above would look like:
{
"entities":[
{"code":"entity1Code", "name":"entity1Name"},
{"code":"entity2Code", "name":"entity2Name"}
],
"entitiesGlobalSecondaryIndex":"entityCode1,entityCode2"
}
And then I would like to apply a filter expression on entitiesGlobalSecondaryIndex, something like: entitiesGlobalSecondaryIndex contains entityCode1.
Would this be efficient, or does using a global secondary index this way not make sense, so that DynamoDB will simply check the condition against every document, similar to a scan?
Any help is very appreciated,
Thanks
The contains operator cannot be used in a query's key condition at all: the partition key only supports equality, and the other key operators (begins_with, BETWEEN, >, <, etc.) require a range attribute, aka your Sort Key. contains is only available in filter expressions.
You can very well set up a GSI with some value as your PK and this code as your SK. However, GSIs are replications of the table, so there is a slight potential for the data in a GSI to lag behind that of the master copy. If the query you're running against this GSI isn't very frequent, then you're probably safe from that.
However, if you are trying to do this against the entire table at once, then it's no better than a scan.
If what you need is a specific Code to return all its documents at once, then you could do a GSI with that as the PK. If you add a date field as the SK of this GSI it would even be time sorted. If you query against that code in that index, you'll get every single one of them.
Since you may have multiple codes, if there aren't too many per document, you could maybe use a Sparse Index: if you have an entity with code "AAAA", then you also have an attribute named AAAA (or AAAAflag or something). It is always null/does not exist unless the entities contain that code. If you do a GSI on this AAAAflag attribute, it will only contain documents that contain that entity code, and ignore all documents where this attribute does not exist. This may work for you if you can also provide a good PK on this to keep the numbers well partitioned and if you don't have too many codes.
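To make the sparse index idea a bit more concrete, here is a small in-memory Python sketch. It makes no AWS calls; add_code_flags and sparse_index are hypothetical helpers that only imitate which items a GSI keyed on a flag attribute would end up containing.

```python
def add_code_flags(item, suffix="flag"):
    """Return a copy of the item with one '<code>flag' attribute per entity code."""
    flagged = dict(item)
    for entity in item.get("entities", []):
        flagged[entity["code"] + suffix] = 1
    return flagged


def sparse_index(items, flag_attribute):
    """Imitate a sparse GSI: only items that actually have the attribute appear."""
    return [item for item in items if flag_attribute in item]


items = [add_code_flags(i) for i in (
    {"id": "doc1", "entities": [{"code": "AAAA", "name": "a"},
                                {"code": "BBBB", "name": "b"}]},
    {"id": "doc2", "entities": [{"code": "BBBB", "name": "b"}]},
)]

ids_with_aaaa = [i["id"] for i in sparse_index(items, "AAAAflag")]
ids_with_bbbb = [i["id"] for i in sparse_index(items, "BBBBflag")]
```

Note that this needs one index per flag attribute, which is why it only seems practical when the number of distinct codes is small.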
Filter expressions, by the way, are different from all of the above. Filter expressions are run on the data that would be returned, after it has already been read out of the table. This is useful if you have a multi access pattern setup, but don't want a particular call to get all the documents associated with a particular PK, in the interests of keeping the data your code is working with concise. The query with a filter expression still retrieves everything from that query, but only presents what makes it past the filter.
If you are only querying against a particular PK at any given time and you want to know if it contains any entities of x, then a filter expression would work perfectly. Of course, this is only per PK and not for your entire table.
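The read-then-filter behaviour can be imitated with a few lines of Python. This is a simulation over a plain list, not a DynamoDB call; query_with_filter is a made-up function, and the two numbers it reports are meant to mirror the ScannedCount and Count fields of a real query response.

```python
def query_with_filter(table, pk, predicate):
    """Simulate a query on one partition key followed by a filter expression."""
    # Step 1: the query reads every item stored under the partition key.
    read = [item for item in table if item["pk"] == pk]
    # Step 2: the filter only trims what is handed back to the caller.
    returned = [item for item in read if predicate(item)]
    return returned, len(read), len(returned)


table = [
    {"pk": "customer1", "codes": ["entity1Code", "entity2Code"]},
    {"pk": "customer1", "codes": ["entity3Code"]},
    {"pk": "customer2", "codes": ["entity1Code"]},
]

returned, scanned_count, count = query_with_filter(
    table, "customer1", lambda item: "entity1Code" in item["codes"])
```

Two items are read for customer1 even though only one is returned, and the matching item under customer2 is never seen because the query is scoped to a single PK.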
If all you need is numbers, then you could do a count attribute on the document, or a meta document on that partition that contains these values and could be queried directly.
Lastly, and I have no idea if this would work or not: if your entities attribute is a map type, you might very well be able to filter against the entities' code, and maybe even with entities.code.contains(value) if it was an SK, but I do not know if this is possible or not.
So, I recently had an interview on MarkLogic with a well-known company. The interviewer asked me a question which I couldn't answer. There is XML example data as shown below.
He asked how to get only the employee id whose zipcode is 12345 and state is California using search, like cts:search.
The thing that came to my mind was to write XPath like below, but since he asked for search I couldn't answer.
let $x :=//employee/officeAddress[zipCode="38023"]/../employeeId/string()
return $x
xml dataset:
<employees>
<employee>
<employeeId>30004</employeeId>
<firstName>crazy</firstName>
<lastName>carol</lastName>
<designation>Director</designation>
<homeAddress>
<address>900 clean ln</address>
<street>quarky st</street>
<city>San Jose</city>
<state>California</state>
<zipCode>22222</zipCode>
</homeAddress>
<officeAddress>
<address>000 washington ave</address>
<street>bonaza st</street>
<city>San Francisco</city>
<state>California</state>
<zipCode>12345</zipCode>
</officeAddress>
</employee>
</employees>
Using XPath is a natural initial thought for many familiar with XML technologies and starting with MarkLogic. It was what I first started to do when I was just starting out.
Some XPath expressions can be optimized by the database and perform fast and efficiently, but there are also others that cannot and may not perform well.
Using cts:search and the built-in query constructs allows for optimized expressions that will leverage indexes, and allows you to further tune by analyzing xdmp:plan, xdmp:query-meters, and xdmp:query-trace.
An equivalent cts:search expression for the XPath, specifying the path to /employees/employee in the first $path parameter and combining cts:element-value-query with cts:and-query in the second $query parameter would be:
cts:search(/employees/employee,
cts:and-query((
cts:element-value-query(xs:QName("zipCode"), "12345"),
cts:element-value-query(xs:QName("state"), "California") )))/employeeId
You could also use a more generic $path to search against all documents and use a cts:element-query() to wrap the cts:element-value-query criteria, restricting the search to descendants of the employee element, and then XPath into the resulting document(s):
cts:search(doc(),
cts:element-query(xs:QName("employee"),
cts:and-query((
cts:element-value-query(xs:QName("zipCode"), "12345"),
cts:element-value-query(xs:QName("state"), "California") ))
)
)/employees/employee/employeeId
xpath I would have tried (not tested):
/employees/employee[officeAddress/zipCode = '38023' and officeAddress/state = 'California']/employeeId/string()
Note that you can use xdmp:plan on xpath too; it's interesting to see how it works vs cts:search.
In general you're better off putting as much into cts:search as possible vs xpath (and I like xpath!).
The question is a little ambiguous. Are there many employees in one document? Or many employees documents? Both?
Also, don't forget to add the appropriate position indexes, or you won't get much unfiltered help. Look at the plan before and after adding the indexes.
See also https://help.marklogic.com/Knowledgebase/Article/View/queries-constrained-to-elements
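One thing worth checking, as far as I understand the semantics, is that two separate element-value-query terms for zipCode and state are not tied to the same address element, whereas the XPath with officeAddress/... is. The Python sketch below (standard library only, not MarkLogic) shows the difference in plain terms; the second employee is invented for the illustration and both function names are hypothetical.

```python
import xml.etree.ElementTree as ET

XML = """<employees>
  <employee>
    <employeeId>30004</employeeId>
    <homeAddress><state>California</state><zipCode>22222</zipCode></homeAddress>
    <officeAddress><state>California</state><zipCode>12345</zipCode></officeAddress>
  </employee>
  <employee>
    <employeeId>30005</employeeId>
    <homeAddress><state>California</state><zipCode>12345</zipCode></homeAddress>
    <officeAddress><state>Nevada</state><zipCode>99999</zipCode></officeAddress>
  </employee>
</employees>"""


def ids_by_office(root, zip_code, state):
    # Both conditions must hold inside the officeAddress element.
    return [e.findtext("employeeId") for e in root.iter("employee")
            if e.findtext("officeAddress/zipCode") == zip_code
            and e.findtext("officeAddress/state") == state]


def ids_by_any_descendant(root, zip_code, state):
    # Looser test: any zipCode and any state below the employee may match.
    return [e.findtext("employeeId") for e in root.iter("employee")
            if any(z.text == zip_code for z in e.iter("zipCode"))
            and any(s.text == state for s in e.iter("state"))]


root = ET.fromstring(XML)
strict = ids_by_office(root, "12345", "California")
loose = ids_by_any_descendant(root, "12345", "California")
```

If the office address is what matters, wrapping the value queries in a cts:element-query on officeAddress would presumably give the stricter behaviour; xdmp:plan should confirm what actually gets resolved from the indexes.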
I am trying to create a function that will accept the name of a tag and a datetime value, drop an extent within a specific table which has that tag, and then ingest a new record into that table with the same tag and the input datetime value, a sort of 'update' simulation. I am not bothered about performance; it's just going to hold metadata, maybe 20-30 rows at most.
So this is how the create table looks:-
.create table MyTable(sometext:string,somevalue:datetime)
And shown below is my function creation step, which is failing:-
.create-or-alter function MyFunction(arg_sometext:string,arg_somedate:datetime)
{
.drop extents <| .show table MyTable extents where tags has arg_sometext;
.ingest inline into table MyTable with (tags="[arg_sometext]") <| arg_somedate
}
So you can see I am trying to do something simple -- I am suspecting that Kusto won't allow commands in a function. Is there any workaround for achieving this?
Generally:
Kusto mandates that control commands start with a dot (.), and that this must be the first character in the text of the command. As queries, functions, etc. don't start with a dot, this precludes them from invoking control commands.
This is an intentional limitation that prevents a wide range of code injection attacks. By imposing this rule, Kusto makes it easy to guarantee that any query that does not begin with a dot will only have read access to the data and metadata, and never be able to alter them.
Specifically, with regard to your scenario:
I'm assuming it's triggered automatically (even if you did have the option to create a function), which suggests you should be able to achieve your goal using Kusto's API / Client libraries and a simple script/app.
An alternative, and perhaps even better approach, would be to re-consider if you actually need to delete or update specific records, or you can use summarize arg_max() in order to query for only the latest "versions" of the records (you could also create a function which encapsulates that logic and overrides the table, by naming the function with the table's name).
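To illustrate what "only the latest version" means, here is a rough Python equivalent of running something like MyTable | summarize arg_max(somevalue, *) by sometext over append-only rows. It is just a sketch of the semantics, with made-up sample rows and a hypothetical function name; in Kusto the query engine does this for you.

```python
from datetime import datetime


def latest_per_tag(rows):
    """For each sometext value, keep the greatest somevalue seen."""
    latest = {}
    for sometext, somevalue in rows:
        if sometext not in latest or somevalue > latest[sometext]:
            latest[sometext] = somevalue
    return latest


# Rows are only ever appended; an "update" is just a newer row for the same tag.
rows = [
    ("tagA", datetime(2020, 1, 1)),
    ("tagB", datetime(2020, 1, 5)),
    ("tagA", datetime(2020, 2, 1)),
]

result = latest_per_tag(rows)
```

With this approach nothing ever needs to be dropped, which sidesteps the restriction on control commands inside functions.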
I have following declaration for collection
TYPE T_TABLE1 IS TABLE OF TABLE_1%ROWTYPE INDEX BY BINARY_INTEGER;
tbl1_u T_TABLE1;
tbl1_i T_TABLE1;
This table will keep growing and at the end, will be used in FORALL loop to do insert or update on TABLE_1.
Now there might be cases where I want to delete a certain element. So I am planning to create a procedure which will take the KEY (unique) and delete the element if that key is found.
PSEUDO CODE
FOR i in tbl1_u.FIRST..tbl1_u.LAST
LOOP
if tbl1_u(i).key = key then
tbl1_u.delete(i);
end if;
END LOOP;
My question is,
Once I delete the particular element, would the collection adjust automatically, i.e. would the index i be taken by the next element, or would that particular index remain null/invalid and possibly give me an exception if I use it in FORALL INSERT/UPDATE?
I don't think that I can pass a TABLE_1%ROWTYPE object to a procedure; do I have to create a record type?
Any other tips regarding managing collections for bulk delete/update/insert would be appreciated. Remember, I would be dealing with 2 tables: if I am inserting/updating in table_1, then it means I am deleting it from table_2, and vice versa.
Given that TABLE_1.KEY is unique you might consider using that as the index to your associative arrays. That way you can delete from the collections using the KEY value, which according to the pseudocode is available when doing the deletions. This would also save you having to iterate through the table to find the KEY you want, as the KEY would be the index - so your "deletion" pseudo-code would become:
tbl1_u.delete(key);
To answer your questions:
Since you're using associative arrays, when an element is deleted there is no "empty" space in the collection. The indexes for the elements, however, don't actually change, so the collection becomes sparse. Therefore you need to use the collection.PRIOR and collection.NEXT methods to loop through the collection, and a FORALL over it needs the INDICES OF clause rather than a FIRST..LAST range. But again, if you use the KEY value as the index you may not need to loop through the collections at all.
You can pass a TABLE_1%ROWTYPE as a parameter to a PL/SQL procedure or function.
You might want to consider using a MERGE statement, which could handle doing the inserts and updates in one step. This might allow you to maintain only a single collection. Might be worth looking into.
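As a rough analogy only (a Python dict is not a PL/SQL associative array), the sparse behaviour described in the first answer looks like this. The variable and function names are invented for the illustration.

```python
# Three elements with dense integer indexes, like a freshly filled collection.
tbl1_u = {1: "row a", 2: "row b", 3: "row c"}

# Deleting an element removes its index; the others are NOT renumbered.
del tbl1_u[2]
remaining_indexes = sorted(tbl1_u)


def dense_range_access(collection):
    """Walk FIRST..LAST as if the collection were dense; fails at the gap."""
    first, last = min(collection), max(collection)
    return [collection[i] for i in range(first, last + 1)]


def sparse_access(collection):
    """Walk only the indexes that exist, similar in spirit to NEXT or INDICES OF."""
    return [collection[i] for i in sorted(collection)]
```

Calling dense_range_access(tbl1_u) raises a KeyError at index 2, which is the same kind of failure a plain FIRST..LAST loop would run into after a delete, while sparse_access(tbl1_u) simply skips the gap.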
Share and enjoy.
I'm creating a caching system to take data from an SQLite database table using a sorted/filtered query and display it. The tables I'm pulling from can be potentially very large and, of course, I need to minimize impact on memory by only retaining a maximum number of rows in memory at any given time. This is easily done by using LIMIT and OFFSET to load only the records I need and update the cache as needed. Implementing this is trivial. The problem I'm having is determining where the insertion index is for a new record inserted into a particular query so I can update my UI appropriately. Is there an easy way to do this? So far the ideas I've had are:
Dump the entire cache, re-count the Query results (there's no guarantee the new row will be included), refresh the cache and refresh the entire UI. I hope it's obvious why that's not really desirable.
Use my own algorithm to determine whether the new row is included in the current query, whether it is included in the current cached results, and at what index it should be inserted if it's within the current cached scope. The biggest downfall of this approach is its complexity and the risk that my own sorting/filtering algorithm won't match SQLite's.
Of course, what I want is to be able to ask SQLite: Given 'Query A' what is the index of 'Row B', without loading the entire query results. However, so far I haven't been able to find a way to do this.
I don't think it matters, but this is all occurring on an iOS device, using the Objective-C programming language.
More Info
The Query and subsequent cache is based off of user input. Essentially the user can re-sort and filter (or search) to alter the results they're seeing. My reticence in simply recreating the cache on insertions (and edits, actually) is to provide a 'smoother' UI experience.
I should point out that I'm leaning toward option "2" at the moment. I played around with creating my own caching/indexing system by loading all the records in a table and performing the sort/filter in memory using my own algorithms. So much of the code needed to determine whether and/or where a particular record is in the cache is already there, so I'm slightly predisposed to use it. The danger lies in having a cache that doesn't match the underlying query. If I include a record in the cache that the query wouldn't return, I'll be in trouble and probably crash.
You don't need record numbers.
Save the values of the ordered field in the first and last records of the LIMITed query result.
Then you can use these to check whether the new record falls into this range.
In other words, assuming that you order by the Name field, and that the original query was this:
SELECT Name, ...
FROM mytab
WHERE some_conditions
ORDER BY Name
LIMIT x OFFSET y
then try to get at the new record with a similar query:
SELECT 1
FROM mytab
WHERE some_conditions
AND PrimaryKey = LastInsertedValue
AND Name BETWEEN CachedMin AND CachedMax
Similarly, to find out before (or after) which record the new record was inserted, start directly after the inserted record and use a limit of one, like this:
SELECT Name
FROM mytab
WHERE some_conditions
AND Name > MyInsertedName
AND Name BETWEEN CachedMin AND CachedMax
ORDER BY Name
LIMIT 1
This doesn't give you a number; you still have to check where the returned Name is in your cache.
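Here is a small end-to-end sketch of the two queries above using Python's built-in sqlite3 module, just to show the mechanics. The table contents, the id primary key column and the variable names are invented for the example, and some_conditions is left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytab (id INTEGER PRIMARY KEY, Name TEXT)")
conn.executemany(
    "INSERT INTO mytab (Name) VALUES (?)",
    [(n,) for n in ("Adams", "Baker", "Clark", "Evans", "Frank", "Green")])

# The cached page: rows 2 to 4 of the ordered result (Baker, Clark, Evans).
cache = [r[0] for r in conn.execute(
    "SELECT Name FROM mytab ORDER BY Name LIMIT 3 OFFSET 1")]
cached_min, cached_max = cache[0], cache[-1]

# A new record arrives.
new_name = "Davis"
new_id = conn.execute(
    "INSERT INTO mytab (Name) VALUES (?)", (new_name,)).lastrowid

# Query 1: does the new record fall inside the cached range?
in_window = conn.execute(
    "SELECT 1 FROM mytab WHERE id = ? AND Name BETWEEN ? AND ?",
    (new_id, cached_min, cached_max)).fetchone() is not None

# Query 2: which cached record comes directly after it?
successor = conn.execute(
    "SELECT Name FROM mytab WHERE Name > ? AND Name BETWEEN ? AND ? "
    "ORDER BY Name LIMIT 1",
    (new_name, cached_min, cached_max)).fetchone()

# Translate the successor into a position in the in-memory cache.
insert_at = cache.index(successor[0]) if successor else len(cache)
```

This assumes the sort column has no duplicates inside the window; with ties you would need a unique tie-breaker column in the ORDER BY and in the comparison.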
Typically you'd expect a cache to be invalidated if there were underlying data changes. I think dropping it and starting over will be your simplest, most maintainable solution. I would recommend it unless you have a very good reason not to.
You could write another query that just returned the row count (example below) to see if your cache should be invalidated. That would save recreating the cache when it did not change.
SELECT name,address FROM people WHERE area_code=970;
SELECT COUNT(rowid) FROM people WHERE area_code=970;
The information you'd need from sqlite to know when your cache was invalidated would require some rather intimate knowledge of how the query and/or index was working. I would say that is fairly high coupling.
Otherwise, you'd want to know where it was inserted with regards to the sorting. You would probably key each page on the sorted field. Delete anything greater than the insert/delete field. Any time you change the sorting you'd drop everything.
Something like the below would be a start if you were using C++. I realize you aren't doing C++, but hopefully it is evident as to what I'm trying to do.
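#include <set>
#include <string>
#include <vector>

struct Person {
    std::string name;
    std::string addr;
};

// One cached page of results, keyed on the sorted field.
struct Page {
    std::string key;
    std::vector<Person> persons;
    struct Less {
        bool operator()(const Page &lhs, const Page &rhs) const {
            return lhs.key.compare(rhs.key) < 0;
        }
    };
};

typedef std::set<Page, Page::Less> pages_t;
pages_t pages;

// Assumed helper, defined elsewhere: runs the SQL INSERT and reports success.
bool sql_insert(const Person &person);

void insert(const Person &person) {
    if (sql_insert(person)) {
        // lower_bound needs a Page, so build a probe keyed on the sort field
        // (assumed here to be the name).
        Page probe;
        probe.key = person.name;
        pages_t::iterator drop_cache_start = pages.lower_bound(probe);
        //... drop this page and everything after it
    }
}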
You'd have to do some wrangling to get different datatypes of key to work nicely, but it's possible.
Theoretically you could just leave the pages out of it and only use the objects themselves. The database would no longer "own" the data though. If you only fill pages from the database, then you'll have less data consistency worries.
This may be a bit off topic, but you aren't re-implementing views, are you? A view doesn't cache per se, but it isn't clear if caching is a requirement of your project.
The solution I came up with is not exactly simple, but it's currently working well. I realized that the index of a record in a Query Statement is also the Count of all its preceding records. What I needed to do was 'convert' all the ORDER statements in the query to a series of WHERE statements that would return only the preceding records and take a count of those records. It's trickier than it sounds (or maybe not...it sounds tricky). The biggest issue I had was making sure the query was, in fact, sorted in a way I could predict. This meant I needed to have an order column in the Order Parameters that was based off of a column with unique values. So, whenever a user sorts on a column, I append to the statement another order parameter on a unique column (I used a "Modified Date Stamp") to break ties.
Creating the WHERE portion of the statement requires more than just tacking on a bunch of ANDs. It's easier to demonstrate. Say you have 3 Order columns: "LastName" ASC, "FirstName" DESC, and "Modified Stamp" ASC (the tie breaker). The WHERE statement would have to look something like this ('?' = record value):
WHERE
"LastName" < ? OR
("LastName" = ? AND "FirstName" > ?) OR
("LastName" = ? AND "FirstName" = ? AND "Modified Stamp" < ?)
Each set of WHERE parameters grouped together by parentheses is a tie breaker. If, in fact, the record values of "LastName" are equal, we must then look at "FirstName", and finally "Modified Stamp". Obviously, this statement can get really long if you're sorting by a bunch of order parameters.
There's still one problem with the above solution. Comparisons against NULL values never evaluate to true, and yet when you sort, SQLite sorts NULL values first. Therefore, in order to deal with NULL values appropriately you've got to add another layer of complication. First, all equality comparisons, =, must be replaced by IS. Second, all < operations must be combined with an OR IS NULL to include NULL values appropriately on the < operator. This turns the above operation into:
WHERE
("LastName" < ? OR "LastName" IS NULL) OR
("LastName" IS ? AND "FirstName" > ?) OR
("LastName" IS ? AND "FirstName" IS ? AND ("Modified Stamp" < ? OR "Modified Stamp" IS NULL))
I then take a count of the RowID using the above WHERE parameter.
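For anyone who wants to try the counting idea, here is a minimal sketch with Python's sqlite3 module. It uses the non-NULL form of the WHERE clause, a made-up people table, and the integer primary key id as the unique tie breaker instead of a modified stamp; index_of is a hypothetical helper, not the code I actually ship.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (id INTEGER PRIMARY KEY, LastName TEXT, FirstName TEXT)")
conn.executemany(
    "INSERT INTO people (LastName, FirstName) VALUES (?, ?)",
    [("Smith", "Zoe"), ("Adams", "Bob"), ("Smith", "Amy"),
     ("Adams", "Bob"), ("Jones", "Cat")])

ORDER = "ORDER BY LastName ASC, FirstName DESC, id ASC"


def index_of(conn, row_id):
    """Index of a row in the ordered query = number of rows sorting before it."""
    last, first = conn.execute(
        "SELECT LastName, FirstName FROM people WHERE id = ?",
        (row_id,)).fetchone()
    (count,) = conn.execute(
        """SELECT COUNT(*) FROM people WHERE
               LastName < ? OR
               (LastName = ? AND FirstName > ?) OR
               (LastName = ? AND FirstName = ? AND id < ?)""",
        (last, last, first, last, first, row_id)).fetchone()
    return count


# Compare against the positions produced by actually running the ordered query.
ordered_ids = [r[0] for r in conn.execute("SELECT id FROM people " + ORDER)]
positions = {row_id: index_of(conn, row_id) for row_id in ordered_ids}
```

Each computed count matches the row's position in the fully ordered result, without ever loading that result for the lookup itself.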
It turned out easy enough for me to do mostly because I had already constructed a set of objects to represent various aspects of my SQL Statement which could be assembled to generate the statement. I can't even imagine trying to manipulate a SQL statement like this any other way.
So far, I've tested using this on several iOS devices with up to 10,000 records in a table and I've had no noticeable performance issues. Of course, it's designed for single record edits/insertions so I don't really need it to be super fast/efficient.