MarkLogic - Search via Max/Min Filter - xquery

i want to search documents in MarkLogic.
My documents look like:
<product xmlns="myns/products">
<id>3114</id>
<materialNo xml:lang="en">1.1160</materialNo>
<steelName xml:lang="en">SWRCH24K</steelName>
<name xml:lang="en">wire, wire rod for cold heading</name>
<chemicalProperties>
<chemicalProperty>
<element>c</element>
<min>0.1900</min>
<max>0.2500</max>
</chemicalProperty>
<chemicalProperty>
<element>si</element>
<min>0.1000</min>
<max>0.3500</max>
</chemicalProperty>
<chemicalProperty>
<element>mn</element>
<min>1.3500</min>
<max>1.6500</max>
</chemicalProperty>
<chemicalProperty>
<element>p</element>
<max>0.0300</max>
</chemicalProperty>
</chemicalProperties>
</product>
So i want to search via max/min values of the chemical properties. To do so i use this xquery search (simple example):
cts:search(/, cts:and-query(
(cts:collection-query("test"),
cts:element-value-query(
fn:QName("myns/products", "name"),
"wire, wire rod for cold heading"),
cts:element-query(
fn:QName("myns/products", "chemicalProperty"),
cts:and-query(
(cts:element-value-query(
fn:QName("myns/products", "element"), "c"),
cts:or-query(
(cts:element-range-query(
fn:QName("myns/products", "max"), "<=", 0.2),
cts:and-not-query(
cts:element-range-query(
fn:QName("myns/products", "min"), "<=", 0.2),
cts:element-value-query(
fn:QName("myns/products", "max"), "*")))),
cts:or-query(
(cts:element-range-query(
fn:QName("myns/products", "min"), ">=", 0.1),
cts:and-not-query(
cts:element-range-query(
fn:QName("myns/products", "max"), ">=", 0.1),
cts:element-value-query(
fn:QName("myns/products", "min"), "*"))))))))))
The problem is that the query above will return the sample document.
The sub-queries (and-not) are there to check if max/min exists. In some cases there could be only min or only max value.
But this document is out of bounds?!
My database does have element range indexes on min and max. All other settings are defaults.
What is the problem? Any suggestions.
UPDATE
Ok, thanks for the suggestions but no. Enabling value position do not solve the problem. However a workaround is to remove the "and-not-query" and replace it with an "and-query" and to add new attributes to the documents:
<chemicalProperty hasMin="0" hasMax="1">...
indexing and querying those attributes works and returns the correct results.

It's possible that because of your index settings, cts:element-query returns true if the min and max queries match in any <chemicalProperty> in the same document, not constrained to a single <chemicalProperty>. I would only expect to see this in an unfiltered search, however, and I don't see that option in your call to cts:search.
First try enabling element value positions, which should allow the database to exclude matches from different elements using indexes.
An alternative solution is to use cts:near-query to constrain the values in the element query by position.

The problem seems to be that you are trying to use wildcards in the cts:element-value-query calls, but not declaring them wildcarded. Since nothing will match the literal "*", the cts:and-not-query does the opposite of what you intend.
You want something like this:
cts:element-value-query(
fn:QName("myns/products", "max"), "*", "wildcarded")
cts:element-value-query
Alternatively, you could enable one of the wildcard indexes, and ML will detect wildcard queries automatically.
If neither "wildcarded" nor "unwildcarded" is present, the database configuration and $text determine wildcarding. If the database has any wildcard indexes enabled ("three character searches", "two character searches", "one character searches", or "trailing wildcard searches") and if $text contains either of the wildcard characters '?' or '*', it specifies "wildcarded". Otherwise it specifies "unwildcarded".

Related

Neo4j - Project graph from Neo4j with GraphDataScience

So I have a graph of nodes -- "Papers" and relationships -- "Citations".
Nodes have properties: "x", a list with 0/1 entries corresponding to whether a word is present in the paper or not, and "y" an integer label (one of the classes from 0-6).
I want to project the graph from Neo4j using GraphDataScience.
I've been using this documentation and I indeed managed to project the nodes and vertices of the graph:
Code
from graphdatascience import GraphDataScience
AURA_CONNECTION_URI = "neo4j+s://xxxx.databases.neo4j.io"
AURA_USERNAME = "neo4j"
AURA_PASSWORD = "my_code:)"
# Client instantiation
gds = GraphDataScience(
AURA_CONNECTION_URI,
auth=(AURA_USERNAME, AURA_PASSWORD),
aura_ds=True
)
#Shorthand projection --works
shorthand_graph, result = gds.graph.project(
"short-example-graph",
["Paper"],
["Citation"]
)
When I do print(result) it shows
nodeProjection {'Paper': {'label': 'Paper', 'properties': {}}}
relationshipProjection {'Citation': {'orientation': 'NATURAL', 'aggre...
graphName short-example-graph
nodeCount 2708
relationshipCount 10556
projectMillis 34
Name: 0, dtype: object
However, no properties of the nodes are projected. I then use the extended syntax as described in the documentation:
# Project a graph using the extended syntax
extended_form_graph, result = gds.graph.project(
"Long-form-example-graph",
{'Paper': {properties: "x"}},
"Citation"
)
print(result)
#Errors
I get the error:
NameError: name 'properties' is not defined
I tried various variations of this, with or without " ", but none have worked so far (also documentation is very confusing because one of the docs always uses " " and in another place I did not see " ").
Also, note that all my properties are integers in the Neo4j db (in AuraDS), as I used to have the error that String properties are not supported.
Some clarification on the correct way of projecting node features (aka properties) would be very useful.
thank you,
Dina
The keys in the Python dictionaries that you use with the GraphDataScience library should be enclosed in quotation marks. This is different from Cypher syntax, where map keys are not enclosed with quotation marks.
This should work for you.
extended_form_graph, result = gds.graph.project(
"Long-form-example-graph",
{'Paper': {"properties": "x"}},
"Citation"
)
Best wishes,
Nathan

Use dictionary keys and associated values as wildcards in snakemake

I have a great number of analyses that need to be done in one go, and thus I thought that I can make a dictionary and parse the keys and values as wildcards (every snakemake run needs two wildcards to be used).
My dict will look like this:
myDict= {
"Apple": ["fruity","red","green"]
"Banana": ["fruity,"yellow"]
}
Here the first key in the dictionary will be wildcard1, here {Apple}, with the first value as wildcard2, here {fruity}, and run snakemake with these two until the final rule is has been run.
Then the same key will again be used ({Apple} as wildcard1) with the second associated value, here {red}, as wildcard2, and run snakemake until the last rule has been run.
Then after the final value belonging to {Apple} has been used as wildcard2, switch over to {Banana} as wildcard1 with its first value, {fruity} as wildcard2.
This will go on until all keys and their associated values have been used as wildcards and snakemake will stop. (That is keys as wildcard1, and their values as wildcard2).
My question is if this is possible, and if so, how can I achieve that?
I bet there is a way to do it with a single expand, but you can use a more verbose list comprehension. I'll take the files to be {wc1}_{wc2}.out for wildcards 1 and 2. Then you have
myDict= {
"Apple": ["fruity","red","green"],
"Banana": ["fruity","yellow"]
}
inputs = [expand('{wc1}_{wc2}.out',
wc1=key, wc2=value)
for key, value in myDict.items()]
# inputs = [['Apple_fruity.out', 'Apple_red.out', 'Apple_green.out'], ['Banana_fruity.out', 'Banana_yellow.out']]
rule all:
input: inputs
Edited to address comment:
To make two lists, keys and values, you can use
keys = []
values = []
for key, value in myDict.items():
for v in value:
keys.append(key)
values.append(v)
print(keys) # ['Apple', 'Apple', 'Apple', 'Banana', 'Banana']
print(values) # ['fruity', 'red', 'green', 'fruity', 'yellow']

How to add a value to the existing element value and return it as a new value

This is the xml file.
<?xml version="1.0" encoding="UTF-8"?>
<root>
<AtcoCode> System-Start-Date= 2018-05-16T12:35:48.6929328-04:00, " ", System-End-Date = 9999-12-31, " ", 150042010003</AtcoCode>
<NaptanCode>esxatgjd</NaptanCode>
<PlateCode>
</PlateCode>
<CleardownCode>
</CleardownCode>
<CommonName>Upper Park</CommonName>
<CommonNameLang>
</CommonNameLang>
<ShortCommonName>
</ShortCommonName>
<ShortCommonNameLang>
</ShortCommonNameLang>
<Landmark>Upper Park</Landmark>
<LandmarkLang>
</LandmarkLang>
<Street>High Road</Street>
<StreetLang>
</StreetLang>
<Crossing>
</Crossing>
<CrossingLang>
</CrossingLang>
<Indicator>adj</Indicator>
<IndicatorLang>
</IndicatorLang>
<Bearing>NE</Bearing>
<NptgLocalityCode>E0046286</NptgLocalityCode>
<LocalityName>Loughton</LocalityName>
<ParentLocalityName>
</ParentLocalityName>
<GrandParentLocalityName>
</GrandParentLocalityName>
<Town>Loughton</Town>
<TownLang>
</TownLang>
<Suburb>
</Suburb>
<SuburbLang>
</SuburbLang>
<LocalityCentre>1</LocalityCentre>
<GridType>U</GridType>
<Easting>541906</Easting>
<Northing>195737</Northing>
<Co-ordinates>51.64255,0.04944</Co-ordinates>
<StopType>BCT</StopType>
<BusStopType>MKD</BusStopType>
<TimingStatus>OTH</TimingStatus>
<DefaultWaitTime>
</DefaultWaitTime>
<Notes>
</Notes>
<NotesLang>
</NotesLang>
<AdministrativeAreaCode>080</AdministrativeAreaCode>
<CreationDateTime>2006-11-06T00:00:00</CreationDateTime>
<ModificationDateTime>2010-01-16T07:58:02</ModificationDateTime>
<RevisionNumber>5</RevisionNumber>
<Modification>rev</Modification>
<Status>act</Status>
</root>
How to achieve this?
Question: Create the path range index for the status element and fetch all the documents that has status del
after fetching all the documents, you need to create the new element called currentreservationnumber under RevisionNumber element.
The value of the currentrevisionnumber will be +1 to the RevisionNumber.
I think the warning about sequential numbers is related to system-wide unique numbers/ids (like Oracle sequence), so not a worry in this case?
If you only ever have one RevisionNumber, and you can find it without a path index, you can maybe get by with element-value query on the RevisionNumber since it's already indexed.
Given that you get the document somehow, it could be as simple as:
let $doc := fn:doc ('/foo.xml')
let $rev-node := $doc/root/RevisionNumber
return xdmp:node-insert-after ($rev-node, <currentreservationnumber>{$rev-node + 1}</currentreservationnumber>)
though remember to consider locking if you are doing a big query/update. And you might need to switch to node-replace if there is already a currentreservationnumber.

Passing list of search string in contains in FilterExpression

Is there any way to pass a list of search strings in the contains() method of FilterExpression in DynamoDb?
Something like below:
search_str = ['value-1', 'value-2', 'value-3']
result = kb_table.scan(
FilterExpression="contains (title, :titleVal)",
ExpressionAttributeValues={ ":titleVal": search_str }
)
For now I can only think of looping through the list and scanning the table multiple times (as in below code), but I think it will be resource heavy.
for item in search_str:
result += kb_table.scan(
FilterExpression="contains (title, :titleVal)",
ExpressionAttributeValues={ ":titleVal": item }
)
Any suggestions.
For the above scenario, the CONTAINS should be used with OR condition. When you give array as input for CONTAINS, DynamoDB will check for the SET attribute ("SS", "NS", or "BS"). It doesn't looks for the sub-sequence on the string attribute.
If the target attribute of the comparison is of type String, then the
operator checks for a substring match. If the target attribute of the
comparison is of type Binary, then the operator looks for a
subsequence of the target that matches the input. If the target
attribute of the comparison is a set ("SS", "NS", or "BS"), then the
operator evaluates to true if it finds an exact match with any member
of the set.
Example:-
movies1 = "MyMovie"
movies2 = "Big New"
fe1 = Attr('title').contains(movies1)
fe2 = Attr('title').contains(movies2)
response = table.scan(
FilterExpression=fe1 or fe2
)
a little bit late but to allow people to find a solution i give here my method.
lets assume that in your DB you have a props called 'EMAIL you want to filter your scan on this EMAIL with a list of value. you can proceed as following.
list_of_elem=['mail1#mail.com','mail2#mail.com','mail3#mail.com']
#set an empty string to create your query
stringquery=""
# loop each element in your list
for index,value in enumerate(list_of_elem):
# add your query of contains with mail value
stringquery=stringquery+f"Attr('EMAIL').contains('{value }')"
# while your value is not the last element in list add the 'OR' operator
if index < len(list_of_elem)-1:
stringquery=stringquery+ ' | '
dynamodb = boto3.resource('dynamodb')
# Use eval of your query string to parse the string as filter expression
tableUser = dynamodb.Table('mytable')
tableUser.scan(
FilterExpression=eval(stringquery)
)

How to match space in MarkLogic using CTS functions?

I need to search those elements who have space " " in their attributes.
For example:
<unit href="http:xxxx/unit/2 ">
Suppose above code have space in the last for href attribute.
I have done this using FLOWER query. But I need this to be done using CTS functions. Please suggest.
For FLOWER query I have tried this:
let $x := (
for $d in doc()
order by $d//id
return
for $attribute in data($d//#href)
return
if (fn:contains($attribute," ")) then
<td>{(concat( "id = " , $d//id) ,", data =", $attribute)}</td>
else ()
)
return <tr>{$x}</tr>
This is working fine.
For CTS I have tried
let $query :=
cts:element-attribute-value-query(xs:QName("methodology"),
xs:QName("href"),
xs:string(" "),
"wildcarded")
let $search := cts:search(doc(), $query)
return fn:count($search)
Your query is looking for " " to be the entirety of the value of the attribute. If you want to look for attributes that contain a space, then you need to use wildcards. However, since there is no indexing of whitespace except for exact value queries (which are by definition not wildcarded), you are not going to get a lot of index support for that query, so you'll need to run this as a filtered search (which you have in your code above) with a lot of false positives.
You may be better off creating a string range index on the attribute and doing value-match on that.

Resources