Elasticsearch PHP longest prefix match - symfony

I am currently using the FOSElasticaBundle in Symfony2 and I am having a hard time trying to build a search to match the longest prefix.
I am aware of the 100 examples that are on the Internet to perform autocomplete-like searches using this. However, my problem is a little different.
In an autocomplete type of search the database holds the longest alphanumeric string (in length of characters) and the user just provides the shortest portion, let's say the user types "jho" and Elasticsearch can easily provide "Jhon, Jhonny, Jhonas".
My problem is backwards, I would like to provide the longest alphanumeric string and I want Elasticsearch to provide me the biggest match in the database.
For example: I could provide "123456789" and my database can have [12,123,14,156,16,7,1234,1,67,8,9,123456,0], in this case the longest prefix match in the database for the number that the user provided is "123456".
I am just starting with Elasticsearch so I don't really have a close to working settings or anything.
If there is any information not clear or missing let me know and I will provide more details.
Update 1 (Using Val's 2nd Update)
Index: Download 1800+ indexes
Settings:
curl -XPUT localhost:9200/tests -d '{
"settings": {
"analysis": {
"analyzer": {
"edge_ngram_analyzer": {
"tokenizer": "edge_ngram_tokenizer",
"filter": [ "lowercase" ]
}
},
"tokenizer": {
"edge_ngram_tokenizer": {
"type": "edgeNGram",
"min_gram": "2",
"max_gram": "25"
}
}
}
},
"mappings": {
"test": {
"properties": {
"my_string": {
"type": "string",
"fields": {
"prefix": {
"type": "string",
"analyzer": "edge_ngram_analyzer"
}
}
}
}
}
}
}'
Query:
curl -XPOST localhost:9200/tests/test/_search?pretty=true -d '{
"size": 1,
"sort": {
"_script": {
"script": "doc.my_string.value.length()",
"type": "number",
"order": "desc"
},
"_score": "desc"
},
"query": {
"filtered": {
"query": {
"match": {
"my_string.prefix": "8092232423"
}
},
"filter": {
"script": {
"script": "doc.my_string.value.length() <= maxlength",
"params": {
"maxlength": 10
}
}
}
}
}
}'
With this configuration the query returns the following results:
{
"took" : 61,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 1754,
"max_score" : null,
"hits" : [ {
"_index" : "tests",
"_type" : "test",
"_id" : "AU8LqQo4FbTZPxBtq3-Q",
"_score" : 0.13441172,
"_source":{"my_string":"80928870"},
"sort" : [ 8.0, 0.13441172 ]
} ]
}
}
Bonus question
I would like to provide an array of numbers for that search and get the matching prefix for each one in an efficient way without having to perform the query each time

Here is my take at it.
Basically, what we need to do is to slice and dice the field (called my_string below) at indexing time with an edgeNGram tokenizer (called edge_ngram_tokenizer below). That way a string like 123456789 will be tokenized to 12, 123, 1234, 12345, 123456, 1234567, 12345678, 123456789 and all tokens will be indexed and searchable.
So let's create a tests index, a custom analyzer called edge_ngram_analyzer analyzer and a test mapping containing a single string field called my_string. You'll note that the my_string field is a multi-field declaring a prefixes sub-field which will contain all the tokenized prefixes.
curl -XPUT localhost:9200/tests -d '{
"settings": {
"analysis": {
"analyzer": {
"edge_ngram_analyzer": {
"tokenizer": "edge_ngram_tokenizer",
"filter": [ "lowercase" ]
}
},
"tokenizer": {
"edge_ngram_tokenizer": {
"type": "edgeNGram",
"min_gram": "2",
"max_gram": "25"
}
}
}
},
"mappings": {
"test": {
"properties": {
"my_string": {
"type": "string",
"fields": {
"prefixes": {
"type": "string",
"index_analyzer": "edge_ngram_analyzer"
}
}
}
}
}
}
}
Then let's index a few test documents using the _bulk API:
curl -XPOST localhost:9200/tests/test/_bulk -d '
{"index":{}}
{"my_string":"12"}
{"index":{}}
{"my_string":"1234"}
{"index":{}}
{"my_string":"1234567890"}
{"index":{}}
{"my_string":"abcd"}
{"index":{}}
{"my_string":"abcdefgh"}
{"index":{}}
{"my_string":"123456789abcd"}
{"index":{}}
{"my_string":"abcd123456789"}
'
The thing that I found particularly tricky was that the matching result could be either longer or shorter than the input string. To achieve that we have to combine two queries, one looking for shorter matches and another for longer matches. So the match query will find documents with shorter "prefixes" matching the input and the query_string query (with the edge_ngram_analyzer applied on the input string!) will search for "prefixes" longer than the input string. Both enclosed in a bool/should and sorted by a decreasing string length (i.e. longest first) will do the trick.
Let's do some queries and see what unfolds:
This query will return the one document with the longest match for "123456789", i.e. "123456789abcd". In this case, the result is longer than the input.
curl -XPOST localhost:9200/tests/test/_search -d '{
"size": 1,
"sort": {
"_script": {
"script": "doc.my_string.value.length()",
"type": "number",
"order": "desc"
}
},
"query": {
"bool": {
"should": [
{
"match": {
"my_string.prefixes": "123456789"
}
},
{
"query_string": {
"query": "123456789",
"default_field": "my_string.prefixes",
"analyzer": "edge_ngram_analyzer"
}
}
]
}
}
}'
The second query will return the one document with the longest match for "123456789abcdef", i.e. "123456789abcd". In this case, the result is shorter than the input.
curl -XPOST localhost:9200/tests/test/_search -d '{
"size": 1,
"sort": {
"_script": {
"script": "doc.my_string.value.length()",
"type": "number",
"order": "desc"
}
},
"query": {
"bool": {
"should": [
{
"match": {
"my_string.prefixes": "123456789abcdef"
}
},
{
"query_string": {
"query": "123456789abcdef",
"default_field": "my_string.prefixes",
"analyzer": "edge_ngram_analyzer"
}
}
]
}
}
}'
I hope that covers it. Let me know if not.
As for your bonus question, I'd simply suggest using the _msearch API and sending all queries at once.
UPDATE: Finally, make sure that scripting is enabled in your elasticsearch.yml file using the following:
# if you have ES <1.6
script.disable_dynamic: false
# if you have ES >=1.6
script.inline: on
UPDATE 2 I'm leaving the above as the use case might fit someone else's needs. Now, since you only need "shorter" prefixes (makes sense !!), we need to change the mapping a little bit and the query.
The mapping would be like this:
{
"settings": {
"analysis": {
"analyzer": {
"edge_ngram_analyzer": {
"tokenizer": "edge_ngram_tokenizer",
"filter": [
"lowercase"
]
}
},
"tokenizer": {
"edge_ngram_tokenizer": {
"type": "edgeNGram",
"min_gram": "2",
"max_gram": "25"
}
}
}
},
"mappings": {
"test": {
"properties": {
"my_string": {
"type": "string",
"fields": {
"prefixes": {
"type": "string",
"analyzer": "edge_ngram_analyzer" <--- only change
}
}
}
}
}
}
}
And the query would now be a bit different but will always return only the longest prefix but shorter or of equal length to the input string. Please try it out. I advise to re-index your data to make sure everything is setup properly.
{
"size": 1,
"sort": {
"_script": {
"script": "doc.my_string.value.length()",
"type": "number",
"order": "desc"
},
"_score": "desc" <----- also add this line
},
"query": {
"filtered": {
"query": {
"match": {
"my_string.prefixes": "123" <--- input string
}
},
"filter": {
"script": {
"script": "doc.my_string.value.length() <= maxlength",
"params": {
"maxlength": 3 <---- this needs to be set to the length of the input string
}
}
}
}
}
}

Related

How to find the parent keys of a jsonpath result

I have this json obj
{
"abc": {
"some_field": {
"title": "Token",
"type": "password"
},
"some_field_2": {
"title": "Domain",
"type": "text"
},
"some_field_3": {
"title": "token2",
"type": "password"
}
}
}
And I want to get a list of keys [some_field,some_field_3] where type=password
This jsonpath $..[?(#.type=='password')] returns:
[
{
"title": "Token",
"type": "password"
},
{
"title": "token2",
"type": "password"
},
]
What should I do?
Thanks.
You are almost there, just need to add a ~ to the end to get the key(s) instead of the value(s):
$..[?(#.type=='password')]~
Note: this might not work depending on your jsonpath engine. It works on https://jsonpath.com.

How to include synonyms in Elasticsearch using the R package elastic

I would like to include synonyms in Elasticsearch using the R package elastic, preferably at search time only. I can't get this working. Hope someone can help me out. Thanks!
Here I give one example assuming that brain, mind, and smart are synonyms.
My code in R...
library(elastic)
connection <- connect()
#index_delete(connection,"test")
index_create(connection, "test")
properties <-
'{
"properties": {
"sentence": {
"type": "text",
"position_increment_gap": 100
}
}
}'
mapping_create(connection, "test", body = properties)
sentences <- data.frame(sentence = c("This is a brain","This a a mind","This is fun","This is smart"))
document <- cbind(1,sentences)
colnames(document)[1] <- "document"
docs_bulk(connection,document,"test")
emptyBody <-
'{
"query": {
"match_phrase": {
"sentence": {
"query": "this mind",
"slop": 100
}
}
}
}'
Search(connection,"test",body=emptyBody)
... returns...
"This a mind"
But I want...
"This is a brain"
"This is a mind"
"This is smart"
Settings?...
Based on the documentations of the R package elastic and some general searches, I experimented with the following code block, putting it before the 'properties' code block, but that did not have any effect. :(
settings <- '{
"analysis": {
"analyzer": {
"synonym_analyzer": {
"tokenizer": "standard",
"filter": ["lowercase", "synonym_filter"]
}
},
"filter": {
"synonym_filter": {
"type": "synonym_graph",
"synonyms": [
"brain, mind, smart"
]
}
}
}
}
}'
index_analyze(connection, "test", body = settings)
Are you using the synonyms analyzer in the mapping field?
"mappings": {
"properties": {
"name": {
"type": "text",
"search_analyzer": "synonym_analyzer"
}
}
}
I found the solution
I had to create the index with particular settings (instead of using the index_analyze function.
settings <- '
{
"settings": {
"index": {
"analysis": {
"filter": {
"my_graph_synonyms": {
"type": "synonym_graph",
"synonyms": [
"mind, brain",
"brain storm, brainstorm, envisage"
]
}
},
"analyzer": {
"my_index_time_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"stemmer"
]
},
"my_search_time_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"stemmer",
"my_graph_synonyms"
]
}
}
}
}
},
"mappings": {
"properties": {
"sentence": {
"type": "text",
"analyzer": "my_index_time_analyzer",
"search_analyzer": "my_search_time_analyzer"
}
}
}
}'
index_create(connection, "test", body = settings)
Using the example shared by Alexander Marquardt.

jq error cannot iterate over null but need to make a different choice

I have a jq filter that selects the rows I need. But sometimes these lines can be empty, and then everything breaks and the rule does not work. I tried to use the if-then-else construct but to no avail.
A rule that works if you process the following json:
.metadata.namespace as $ns | (.spec.rules[0].match.any[].resources.kinds[] / "/") | [select(.[1])[0] // null, select(.[2])[1] // null, last] as [$version,$group,$kind] | {namespace: $ns, kind: $kind, group: $version, version: $group} | with_entries(select(.value!=null))
suitable json:
{
"apiVersion": "kyverno.io/v1",
"kind": "posdfsdf",
"metadata": {
"name": "e-eion",
"namespace": "kke",
"annotations": {
"policies.kyverno.io/title": "Dation",
"policies.kyverno.io/category": "Pod Security Standards (Restricted)",
"policies.kyverno.io/severity": "medium",
"policies.kyverno.io/subject": "Pod",
"kyverno.io/kyverno-version": "1.6.0",
"kyverno.io/kubernetes-version": "1.22-1.23",
"policies.kyverno.io/description": "se`. "
}
},
"spec": {
"validationFailureAction": "audit",
"background": true,
"rules": [
{
"name": "tion",
"match": {
"any": [
{
"resources": {
"kinds": [
"Pod"
]
}
}
]
},
"validate": {
"message": "Prisd",
"pattern": {
"spec": {
"=(eners)": [
{
"secxt": {
"altion": "false"
}
}
],
"=(i)": [
{
"sext": {
"alcalation": "false"
}
}
],
"containers": [
{
"setext": {
"an": "false"
}
}
]
}
}
}
}
]
}
}
example on which the rule stops working:
{
"apiVersion": "k/v1",
"kind": "Picy",
"metadata": {
"name": "denylation",
"namespace": "what",
},
"spec": {
"validationFailureAction": "audit",
"background": true,
"rules": [
{
"name": "deny-privilege-escalation",
"match": {
"resources": {
"kinds": [
"Pod"
]
}
},
"validate": {
"message": "Priviles[*].securityContext.allowPrind spec.initContalse`.",
"pattern": {
"spec": {
"=(iners)": [
{
"=(seext)": {
"=(aln)": "false"
}
}
],
"containers": [
{
"=(stext)": {
"=(al)": "false"
}
}
]
}
}
}
}
]
}
}
how can this be fixed? I need the rule to work out in any cases and give output
This should work :
.metadata.namespace as $ns |
((.spec.rules[0].match | .. | (objects | .resources.kinds[]?)) / "/") |
[select(.[1])[0] // null, select(.[2])[1] // null, last] as [$version,$group,$kind] |
{namespace: $ns, kind: $kind, group: $version, version: $group} |
with_entries(select(.value!=null))
The failing json gives the following error:
parse error: Expected another key-value pair at line 7, column 3
This is caused by the trailing comma found here:
"metadata": {
"name": "denylation",
"namespace": "what",
},
Removing that comma, the following error is thrown:
jq: error (at :44): Cannot iterate over null (null)
This is caused by the missing any key inside the match object.
We can 'catch' that error using a ? (docs):
(.spec.rules[0].match.any[]?.resources.kinds[] / "/")
^
But due to the missing key, the filter does not find anything and the output is empty.
Updated jqPlay

Elasticsearch & Elasticpress search by math_phrase with & inside query

I have problem with my queries when I'm using " or ' - then I expect match_phrase, but I don't know how I can retrieve posts when I'm using match_phrase with &
For example I'm using Something & Something as phrase, and when I'm didn't using ' and " I can see posts with Something & Something but there I'm using multi_match.
Something what I've tried:
{
"from": 0,
"size": 10,
"sort": {
"post_date": {
"order": "desc"
}
},
"query": {
"function_score": {
"query": {
"bool": {
"must": [
{
"match_phrase": {
"query": "Something & Something"
}
}
]
}
},
"exp": {
"post_date_gmt": {
"scale": "270d",
"decay": 0.5,
"offset": "90d"
}
},
"score_mode": "avg",
"boost_mode": "sum"
}
},
"post_filter": {
"bool": {
"must": [
{
"terms": {
"post_type.raw": [
"post"
]
}
},
{
"terms": {
"post_status": [
"publish"
]
}
}
]
}
}
}
But this doesn't return any post, and returning hits total 0. Anyone have any idea, or suggestions, what I'm doing wrong ?
match_phrase is very restrictive and in most of cases is recommended to use it inside a should clause to increase the score instead of a must, because it requires the user to type the value exactly as it is.
Example document
POST test_jakub/_doc
{
"query": "Something & Something",
"post_type": {
"raw": "post"
},
"post_status": "publish",
"post_date_gmt": "2021-01-01T12:10:30Z",
"post_date": "2021-01-01T12:10:30Z"
}
With this document searching for "anotherthing Something & Something" will return no results, that's why is a bad idea to use match_phrase here.
You can take 2 approaches
If you need this kind of tight queries take a look to the slop parameter that adds some flexibility to the match_phrase query allowing omit or transpose words in the phrase
Switch to a regular match query (recommended). In most cases this will work fine, but if you want to do extra score to the phrase matches you can add it as a should clause.
POST test_jakub/_search
{
"from": 0,
"size": 10,
"sort": {
"post_date": {
"order": "desc"
}
},
"query": {
"function_score": {
"query": {
"bool": {
"should": [
{
"match_phrase": {
"query": {
"query": "anotherthing something & something",
"slop": 2
}
}
}
],
"must": [
{
"match": {
"query": "anotherthing something & something"
}
}
]
}
},
"exp": {
"post_date_gmt": {
"scale": "270d",
"decay": 0.5,
"offset": "90d"
}
},
"score_mode": "avg",
"boost_mode": "sum"
}
},
"post_filter": {
"bool": {
"must": [
{
"terms": {
"post_type.raw": [
"post"
]
}
},
{
"terms": {
"post_status": [
"publish"
]
}
}
]
}
}
}
Last advice is to avoid using "query" as field name because leads to confusion and will break Kibana autocomplete on Dev Tools.

Azure Time Series Insights query: how to denote a variable such as 'cpu.thermal.average' in the query?

My series' got the value: cpu.thermal.average. I see the values in the TSI explorer for the given range of time.
By executing the following request, for "temperatures" I only get "null".
I am not sure about how to write $event.cpu.thermal.average equivalent.
Any idea?
{
"aggregateSeries": {
"timeSeriesId": [
"MySeries",
],
"searchSpan": {
"from": "2021-03-10T07:00:00Z",
"to": "2021-03-10T20:00:50Z"
},
"interval": "PT3600M",
"filter": null,
"inlineVariables": {
"latitudes": {
"kind": "numeric",
"value": {
"tsx": "$event.latitude"
},
"filter": null,
"aggregation": {
"tsx": "avg($value)"
}
},
"temperatures": {
"kind": "numeric",
"value": {
"tsx": "$event['cpu.thermal.average']"
},
"filter": null,
"aggregation": {
"tsx": "avg($value)"
}
}
},
"projectedVariables": [
"latitudes",
"temperatures"
]
}
}
From here: https://learn.microsoft.com/en-us/rest/api/time-series-insights/reference-time-series-expression-syntax#value-expressions
When accessing nested properties, the Type is required.
It should work if you use $event.cpu.thermal.average.Double

Resources