I recently had an interview with a well-known company about MarkLogic, and the interviewer asked me a question I couldn't answer. There is example XML data shown below.
He asked: how can you get only the employeeId of employees whose zipCode is 12345 and state is California using search, i.e. cts:search?
The only thing that came to mind was to write XPath like the following, but since he asked for a search-based approach, I couldn't answer:
let $x := //employee/officeAddress[zipCode="12345"]/../employeeId/string()
return $x
xml dataset:
<employees>
<employee>
<employeeId>30004</employeeId>
<firstName>crazy</firstName>
<lastName>carol</lastName>
<designation>Director</designation>
<homeAddress>
<address>900 clean ln</address>
<street>quarky st</street>
<city>San Jose</city>
<state>California</state>
<zipCode>22222</zipCode>
</homeAddress>
<officeAddress>
<address>000 washington ave</address>
<street>bonaza st</street>
<city>San Francisco</city>
<state>California</state>
<zipCode>12345</zipCode>
</officeAddress>
</employee>
</employees>
Using XPath is a natural first thought for many people familiar with XML technologies who are starting out with MarkLogic; it is what I reached for when I was starting out too.
Some XPath expressions can be optimized by the database and perform efficiently, but others cannot be optimized and may not perform well.
Using cts:search and the built-in query constructs produces optimized expressions that leverage indexes, and lets you tune further by analyzing xdmp:plan, xdmp:query-meters, and xdmp:query-trace.
An equivalent cts:search expression for the XPath, specifying the searchable path /employees/employee as the first $expression parameter and combining cts:element-value-query with cts:and-query in the second $query parameter, would be:
cts:search(/employees/employee,
  cts:and-query((
    cts:element-value-query(xs:QName("zipCode"), "12345"),
    cts:element-value-query(xs:QName("state"), "California"))))/employeeId
You could also use a more generic $expression to search against all documents, wrap the cts:element-value-query criteria in a cts:element-query() to restrict the search to descendants of the employee element, and then XPath into the resulting document(s):
cts:search(doc(),
  cts:element-query(xs:QName("employee"),
    cts:and-query((
      cts:element-value-query(xs:QName("zipCode"), "12345"),
      cts:element-value-query(xs:QName("state"), "California")))
  )
)/employees/employee/employeeId
The XPath I would have tried (not tested):
/employees/employee[officeAddress/zipCode = '12345' and officeAddress/state = 'California']/employeeId/string()
Note that you can use xdmp:plan on xpath too; it's interesting to see how it works vs cts:search.
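For example (a quick sketch; xdmp:plan accepts either a searchable XPath expression or a cts:search expression and returns its query plan):
xdmp:plan(/employees/employee[officeAddress/zipCode = "12345"]),
xdmp:plan(
  cts:search(/employees/employee,
    cts:element-value-query(xs:QName("zipCode"), "12345")))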
In general you're better off putting as much into cts:search as possible vs xpath (and I like xpath!).
The question is a little ambiguous. Are there many employees in one document? Or many employee documents? Both?
Also, don't forget to add the appropriate position indexes, or you won't get much unfiltered help. Look at the plan before and after adding the indexes.
See also https://help.marklogic.com/Knowledgebase/Article/View/queries-constrained-to-elements
Related
I want to perform a dynamic combination (AND/OR) search based on the parameters provided by the user.
Search Combination Example:
( (title = "United States" or isbn = "2345371242192") and author ="Jhon" )
In the above query, each parameter is matched at its own XPath, e.g. item/tigroup/title, item/isbn. The XPath is not provided by the user; I have to generate it dynamically along with the search combination.
How can the combination query be formed dynamically so it can be passed to BaseX?
The user can perform any kind of AND/OR search, and there can be multiple AND/OR criteria.
Any suggestions much appreciated
With xquery:eval, strings can be evaluated as XQuery expressions (see the documentation for more examples):
declare variable $QUERY := 'text()';
db:open('db')//*[xquery:eval($QUERY, map { '': . })]
Please note that it’s very risky to evaluate arbitrary strings as XQuery code. If the string contains user input, malicious strings may be passed on that do unexpected things. In the example above, a malicious string could be a file operation (e.g., file:delete(.)), or a query that runs very long and blocks your system.
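As a rough sketch for the combination above (assuming BaseX, a database named 'db', and the item/tigroup/title, item/isbn, and item/author paths mentioned in the question, all of which are placeholders), you could generate the predicate string from the user's criteria and evaluate it with each item as the context:
declare variable $PRED :=
  '(tigroup/title = "United States" or isbn = "2345371242192") and author = "Jhon"';
db:open('db')//item[xquery:eval($PRED, map { '': . })]
The same caution about evaluating user-influenced strings applies here.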
In MarkLogic XQuery, how can I sort dynamically?
let $sortelement := 'Salary'
for $doc in collection('employee')
order by $doc/$sortelement
return $doc
PS: The element to sort by will change based on user input, e.g. date or name in place of Salary.
If Salary is the name of the element, then you could more generically select any element in the XPath with * and then apply a predicate filter to test whether the local-name() matches the variable for the selected element value $sortelement:
let $sortelement := 'Salary'
for $doc in collection('employee')
order by $doc/*[local-name() eq $sortelement]
return $doc
This manner of sorting all items in the collection may work with a smaller number of documents, but if you are working with hundreds of thousands or millions of documents, you may find that pulling back all of the docs is either slow or blows out the Expanded Tree Cache.
A more efficient solution is to create range indexes on the elements you intend to sort on, and then perform a search with a cts:index-order option that orders the results by an appropriate reference to the indexed item, such as cts:element-reference(), cts:json-property-reference(), or cts:field-reference().
For example:
let $sortelement := 'Salary'
return
  cts:search(doc(),
    cts:collection-query("employee"),
    cts:index-order(cts:element-reference(xs:QName($sortelement)))
  )
Not recommended, because the chances of introducing security issues, runtime crashes, and just plain 'bad results' are much higher and more difficult to control --
BUT it is available as a last resort.
Any XQuery can be dynamically created as a string and then evaluated using xdmp:eval.
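For example (a minimal sketch of the dynamic-sort case; passing the user value as an external variable rather than concatenating it into the query string reduces, though does not eliminate, the risk):
let $sortelement := 'Salary'
let $query :=
  'declare variable $name external;
   for $doc in collection("employee")
   order by $doc/*[local-name() eq $name]
   return $doc'
return xdmp:eval($query, (xs:QName("name"), $sortelement))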
Much better to follow the guidance of Mads and use the search APIs instead of XQuery FLWOR expressions -- note that these APIs actually 'compile down' to a data structure. This is what the 'cts constructors' do: https://docs.marklogic.com/cts/constructors
I find it helps to think of cts searches as structured searches described by data -- the cts:xxx functions are simply helpers that create the data structure.
(They don't actually do any searching; they build up a data structure that is used to do the searching.)
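For example, wrapping a cts query in an element constructor serializes the data structure it builds (a quick illustration reusing the query from the first answer):
<query>{
  cts:and-query((
    cts:element-value-query(xs:QName("zipCode"), "12345"),
    cts:element-value-query(xs:QName("state"), "California")))
}</query>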
If you look at the source of the search:xxx APIs, you can see how this is done.
I am using fn:distinct-values, but I am running into case-sensitivity problems.
I need to remove duplicate values in the MarkLogic database.
Result:
Antony
antony
but I want a single result without duplicates: either Antony or antony.
If this is just a small set of values, you don't have to create a lexicon for this: distinct-values also takes a collation parameter:
distinct-values(("anthony","Anthony"),"http://marklogic.com/collation//S1")
It's all about collations.
I would suggest adding a lexicon to whatever attribute, element, or property you are referring to. When you set up the lexicon, you can define its collation to take care of this. In the end, no distinct-values call is needed, because the lexicon already holds a distinct list.
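For example (a sketch, assuming the values live in a <name> element, which is a placeholder, and that an element range index on it has been configured with the case-insensitive collation):
cts:element-values(xs:QName("name"), (), ("collation=http://marklogic.com/collation//S1"))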
You could use distinct-values if you normalized the content to uppercase or lowercase in a FLWOR expression in your code, but this is much more costly.
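For instance (hypothetical element name, just to illustrate the normalization):
distinct-values(
  for $name in doc()//name
  return lower-case($name)
)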
For your reference:
https://docs.marklogic.com/guide/search-dev/encodings_collations
https://docs.marklogic.com/guide/search-dev/lexicon
I have a Lucene index where one of the indexed fields contains a string that identifies the type of content.
For simplicity, say this field is called _type and will only ever contain typeone or typetwo.
I am using Lucene query parser syntax to query this index. Say my query is:
(+fieldone:term^3.0 +classname:term^2.0)
Is it possible to extend this to boost any results that have typeone in their _type field, whilst still returning typetwo records (albeit with a lower relevancy score)?
UPDATE
I've found a syntax which works but it uses the wildcard 'all documents' syntax which I suspect is not efficient. Advice appreciated.
(+fieldone:term^3.0 +classname:term^2.0) +(*:* _type:typeone^1.1)
Using just Lucene syntax you can simply keep the _type boost as SHOULD in the following way:
+fieldone:term^3.0 +classname:term^2.0 (_type:typeone)^2
You don't need wildcards.
Another solution would be to use the eDismax query parser; you can then use the bq or bf parameter to boost a particular value of a field. You can use one of the following approaches:
Solution 1: you can boost your term in the following way:
defType=edismax&bq=_type:"typeone"^3
or
Solution 2: you can use a query function in the following way:
defType=edismax&bf=if(termfreq(_type,"typeone"),3,if(termfreq(_type,"typetwo"),2,1))
where the results having _type=typeone are boosted by 3, those having typetwo are boosted by 2, and otherwise the boost is 1. You can modify that query according to your needs.
I am running the following query on Google BigQuery web interface, for data provided by Google Analytics:
SELECT *
FROM [dataset.table]
WHERE
hits.page.pagePath CONTAINS "my-fun-path"
I would like to save the results into a new table, however I am obtaining the following error message when using Flatten Results = False:
Error: Cannot query the cross product of repeated fields
customDimensions.value and hits.page.pagePath.
This answer implies that this should be possible: Is there a way to select nested records into a table?
Is there a workaround for the issue found?
Depending on what kind of filtering is acceptable to you, you may be able to work around this by switching from WHERE to OMIT IF. It will give different results, but perhaps those different results are acceptable.
The following will remove the entire hit record if some page inside of it meets the criteria. Note two things here:
It uses OMIT hits IF instead of the more commonly used OMIT RECORD IF.
The condition is inverted, because OMIT IF is the opposite of WHERE.
The query is:
SELECT *
FROM [dataset.table]
OMIT hits IF EVERY(NOT hits.page.pagePath CONTAINS "my-fun-path")
Update: see the related thread; I am afraid this is no longer possible.
It would be possible to use the NEST function and group by a field, but that's a long shot.
Using a FLATTEN call in the query:
SELECT *
FROM flatten([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910],customDimensions)
WHERE
hits.page.pagePath CONTAINS "m"
Thus, in the web UI:
setting a destination table
allowing large results
and NO flatten results
does the job correctly and the produced table matches the original schema.
I know this is an old question.
But it can now be achieved simply by using the standard SQL dialect instead of legacy SQL:
#standardSQL
SELECT t.*
FROM `dataset.table` t, UNNEST(t.hits) AS hit
WHERE
  hit.page.pagePath LIKE '%my-fun-path%'