Creating a custom tokenizer in ElasticSearch NEST - asp.net

I have a custom document type in ES 2.5 with the following fields:
Title
DataSources
Content
Running a search is fine, except for the middle field - it's built/indexed using a delimiter of '|'.
ex: "|4|7|8|9|10|12|14|19|20|21|22|23|29|30"
I need to build a query that matches the search terms across all fields AND matches at least one number in the DataSources field.
So to summarize what I currently have:
QueryBase query = new SimpleQueryStringQuery
{
//DefaultOperator = !operatorOR ? Operator.And : Operator.Or,
Fields = LearnAboutFields.FULLTEXT,
Analyzer = "standard",
Query = searchWords.ToLower()
};
_boolQuery.Must = new QueryContainer[] {query};
That's the search words query.
foreach (var datasource in dataSources)
{
// Add DataSources with an OR
queryContainer |= new WildcardQuery { Field = LearnAboutFields.DATASOURCE, Value = string.Format("*{0}*", datasource) };
}
// Add this Boolean Clause to our outer clause with an AND
_boolQuery.Filter = new QueryContainer[] {queryContainer};
}
That's for the datasources query. There can be multiple datasources.
It doesn't work, and returns no results once the filter query is added on. I think I need some work on the tokenizer/analyzer, but I don't know enough about ES to figure that out.
EDIT: Per Val's comments below I have attempted to recode the indexer like this:
_elasticClientWrapper.CreateIndex(_DataSource, i => i
.Mappings(ms => ms
.Map<LearnAboutContent>(m => m
.Properties(p => p
.String(s => s.Name(lac => lac.DataSources)
.Analyzer("classic_tokenizer")
.SearchAnalyzer("standard")))))
.Settings(s => s
.Analysis(an => an.Analyzers(a => a.Custom("classic_tokenizer", ca => ca.Tokenizer("classic"))))));
var indexResponse = _elasticClientWrapper.IndexMany(contentList);
It builds successfully, with data. However, the query still isn't working right.
New query for DataSources:
foreach (var datasource in dataSources)
{
// Add DataSources with an OR
queryContainer |= new TermQuery {Field = LearnAboutFields.DATASOURCE, Value = datasource};
}
// Add this Boolean Clause to our outer clause with an AND
_boolQuery.Must = new QueryContainer[] {queryContainer};
And the JSON:
{"learnabout_index":{"aliases":{},"mappings":{"learnaboutcontent":{"properties":{"articleID":{"type":"string"},"content":{"type":"string"},"dataSources":{"type":"string","analyzer":"classic_tokenizer","search_analyzer":"standard"},"description":{"type":"string"},"fileName":{"type":"string"},"keywords":{"type":"string"},"linkURL":{"type":"string"},"title":{"type":"string"}}}},"settings":{"index":{"creation_date":"1483992041623","analysis":{"analyzer":{"classic_tokenizer":{"type":"custom","tokenizer":"classic"}}},"number_of_shards":"5","number_of_replicas":"1","uuid":"iZakEjBlRiGfNvaFn-yG-w","version":{"created":"2040099"}}},"warmers":{}}}
The Query JSON request:
{
"size": 10000,
"query": {
"bool": {
"must": [
{
"simple_query_string": {
"fields": [
"_all"
],
"query": "\"housing\"",
"analyzer": "standard"
}
}
],
"filter": [
{
"terms": {
"DataSources": [
"1"
]
}
}
]
}
}
}

One way to achieve this is to create a custom analyzer with a classic tokenizer which will break your DataSources field into the numbers composing it, i.e. it will tokenize the field on each | character.
So when you create your index, you need to add this custom analyzer and then use it in your DataSources field:
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"number_analyzer": {
"type": "custom",
"tokenizer": "number_tokenizer"
}
},
"tokenizer": {
"number_tokenizer": {
"type": "classic"
}
}
}
},
"mappings": {
"my_type": {
"properties": {
"DataSources": {
"type": "string",
"analyzer": "number_analyzer",
"search_analyzer": "standard"
}
}
}
}
}
As a result, if you index the string "|4|7|8|9|10|12|14|19|20|21|22|23|29|30", your DataSources field will effectively contain the following array of tokens: [4, 7, 8, 9, 10, 12, 14, 19, 20, 21, 22, 23, 29, 30]
Then you can get rid of your WildcardQuery and simply use a TermsQuery instead:
var terms = new TermsQuery { Field = LearnAboutFields.DATASOURCE, Terms = dataSources };
// Add this Boolean Clause to our outer clause with an AND
_boolQuery.Filter = new QueryContainer[] { terms };

At an initial glance at your code, I think one problem you might have is that any queries placed within a filter clause will not be analysed, so the value will not be broken down into tokens and will be compared in its entirety.
It's easy to forget this, so any values that require analysis need to be placed in the must or should clauses.
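If the data-source values do need analysis, one option is a match query in the must clause, since match queries are analyzed at search time. A minimal sketch in NEST 2.x, reusing the query, dataSources, and LearnAboutFields.DATASOURCE names from the question above (this is an illustration, not the poster's code):

// Sketch only: a match query runs its input through the search analyzer,
// so the values are tokenized before being compared to the indexed tokens.
var dataSourceQuery = new MatchQuery
{
    Field = LearnAboutFields.DATASOURCE,
    Query = string.Join(" ", dataSources) // any one matching token suffices (default OR)
};
_boolQuery.Must = new QueryContainer[] { query, dataSourceQuery };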

Related

Pacts: Matching rule for non-empty map (or a field which is not null) needed

I need help with writing my consumer Pacts using pact-jvm (https://github.com/DiUS/pact-jvm).
My problem is I have a field which is a list (an array) of maps. Each map can have elements of different types (strings or sub-maps), eg.
"validatedAnswers": [
{
"type": "typeA",
"answers": {
"favourite_colour": "Blue",
"correspondence_address": {
"line_1": "Main St",
"postcode": "1A 2BC",
"town": "London"
}
}
},
{
"type": "typeB",
"answers": {
"first_name": "Firstname",
"last_name": "Lastname",
}
}
]
but we're only interested in some of those answers.
NOTE: The above is only an example showing the structure of validatedAnswers. Each answers map has dozens of elements.
What we really need is this: https://github.com/pact-foundation/pact-specification/issues/38, but it's planned for v.4. In the meantime we're trying a different approach. What I'm attempting to do now is to specify that each element of the list is a non-empty map. Another approach is to specify that each element of the list is not null. Can any of this be done using Groovy DSL?
This:
new PactBuilder().serviceConsumer('A').hasPactWith('B')
.port(findAvailablePort()).uponReceiving(...)
.willRespondWith(status: 200, headers: ['Content-Type': 'application/json'])
.withBody {
validatedAnswers minLike(1) {
type string()
answers {
}
}
}
doesn't work because it means answers is expected to be empty ("Expected an empty Map but received Map( [...] )"; see also https://github.com/DiUS/pact-jvm/issues/298).
So what I would like to do is something like this:
.withBody {
validatedAnswers minLike(1) {
type string()
answers Matchers.map()
}
}
or:
validatedAnswers minLike(1) {
type string()
answers {
keyLike 'title', notNull()
}
}
or:
validatedAnswers minLike(1) {
type string()
answers notNull()
}
Can it be done?
I would create two separate tests for this, one test for each of the different response shapes, and have a provider state for each, e.g. given there are type B answers.
This way when you verify on provider side, it will only send those two field types.
The union of the two examples gives a contract that allows both.
You can do it without DSL, sample Groovy script:
class ValidateAnswers {
static main(args) {
/* Array with some samples */
List<Map> answersList = [
[
type: 'typeA',
answers: [
favourite_colour: 'Blue',
correspondence_address: [
line_1: 'Main St',
postcode: '1A 2BC',
town: 'London'
]
]
],
[
type: 'typeB',
answers: [
first_name: 'Firstname',
last_name: "Lastname"
]
],
[
type: 'typeC',
answers: null
],
[
type: 'typeD'
],
[
type: 'typeE',
answers: [:]
]
]
/* Iterating through all elements in list above */
for (answer in answersList) {
/* Print result of checking */
println "$answer.type is ${validAnswer(answer) ? 'valid' : 'not valid'}"
}
}
/**
* Method to recursively iterate through Maps.
* Returns true only if the value under the 'answers' key is a non-empty Map.
*/
static Boolean validAnswer(Map map, Boolean result = false) {
map.each { key, value ->
if (key == 'answers') {
result = value instanceof Map && value.size() > 0
} else if (value instanceof Map) {
// propagate the result of the recursion instead of discarding it
result = validAnswer(value as Map, result)
}
}
return result
}
}
Output is:
typeA is valid
typeB is valid
typeC is not valid
typeD is not valid
typeE is not valid

Querying Cosmos Nested JSON documents

I would like to turn this resultset
[
{
"Document": {
"JsonData": "{\"key\":\"value1\"}"
}
},
{
"Document": {
"JsonData": "{\"key\":\"value2\"}"
}
}
]
into this
[
{
"key": "value1"
},
{
"key": "value2"
}
]
I can get close by using a query like
select value c.Document.JsonData from c
however, I end up with
[
"{\"key\":\"value1\"}",
"{\"key\":\"value2\"}"
]
How can I cast each value to an individual JSON fragment using the SQL API?
As David Makogon said above, we need to transform such data within our app. We can do it as below:
string data = "[{\"key\":\"value1\"},{\"key\":\"value2\"}]";
List<Object> t = JsonConvert.DeserializeObject<List<Object>>(data);
string jsonData = JsonConvert.SerializeObject(t);
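The snippet above round-trips an array that is already in the target shape; a minimal sketch of the actual unwrapping, assuming the SELECT VALUE query returned its results as a JSON array of strings (queryResultJson and UnwrapJsonData are illustrative names, not part of the Cosmos SDK):

using System.Collections.Generic;
using System.Linq;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

static string UnwrapJsonData(string queryResultJson)
{
    // queryResultJson: ["{\"key\":\"value1\"}", "{\"key\":\"value2\"}"]
    var fragments = JsonConvert.DeserializeObject<List<string>>(queryResultJson);
    // parse each embedded string into a real JSON object
    var parsed = fragments.Select(JObject.Parse).ToList();
    return JsonConvert.SerializeObject(parsed); // [{"key":"value1"},{"key":"value2"}]
}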

DynamoDB SCAN with nested attribute

Can I scan DynamoDB by 'order.shortCode' in the given example? The console indicates I can't with dot notation, and I can't find any documentation on it.
{
"key2": "cj11b1ygp0000jcgubpe5mso3",
"order": {
"amount": 74.22,
"dateCreated": "2017-04-02T19:15:33-04:00",
"orderNumber": "cj11b1ygp0000jcgubpe5mso3",
"shortCode": "SJLLDE"
},
"skey2": "SJLLDE"
}
To scan by a nested attribute, you should use the ExpressionAttributeNames parameter to pass each path component (i.e. order and shortCode) separately into FilterExpression, as shown below:
var params = {
TableName: 'YOUR_TABLE_NAME',
FilterExpression: "#order.#shortCode = :shortCodeValue",
ExpressionAttributeNames: {
'#order': 'order',
"#shortCode": "shortCode"
},
ExpressionAttributeValues: {
':shortCodeValue': 'SJLLDE'
}
};
dynamodbDoc.scan(params, function(err, data) {
    // process data.Items here
});
Here is a link to documentation explaining this:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Expressions.ExpressionAttributeNames.html#Expressions.ExpressionAttributeNames.NestedAttributes

How to set a DynamoDB Map property value, when the map doesn't exist yet

How do you "upsert" a property to a DynamoDB row. E.g. SET address.state = "MA" for some item, when address does not yet exist?
I feel like I'm having a chicken-and-egg problem because DynamoDB doesn't let you define a sloppy schema in advance.
If address DID already exist on that item, of type M (for Map), the internet tells me I could issue an UpdateExpression like:
SET #address.#state = :value
with #address, #state, and :value appropriately mapped to address, state, and MA, respectively.
But if the address property does not already exist, this gives an error:
ValidationException: The document path provided in the update expression is invalid for update
So it appears I either need to:
Figure out a way to "upsert" address.state (e.g., SET address = {}; SET address.state = 'MA' in a single command)
or
Issue three (!!!) roundtrips: try it, SET address = {} on failure, and then try it again.
If the latter.... how do I set a blank map?!?
Ugh.. I like Dynamo, but unless I'm missing something obvious this is a bit crazy..
You can do it with two round trips: the first conditionally sets an empty map for address if it doesn't already exist, and the second sets the state:
db.update({
UpdateExpression: 'SET #a = :value',
ConditionExpression: 'attribute_not_exists(#a)',
ExpressionAttributeValues: {
":value": {},
},
ExpressionAttributeNames: {
'#a': 'address'
}
}, ...); // a ConditionalCheckFailedException here just means address already existed
Then:
db.update({
UpdateExpression: 'SET #a.#b = :v',
ExpressionAttributeNames: {
'#a': 'address',
'#b': 'state'
},
ExpressionAttributeValues: {
':v': 'whatever'
}
}, ...);
You cannot set nested attributes if the parent document does not exist. Since address does not exist you cannot set the attribute province inside it. You can achieve your goal if you set address to an empty map when you create the item. Then, you can use the following parameters to condition an update on an attribute address.province not existing yet.
var params = {
TableName: 'Image',
Key: {
Id: 'dynamodb.png'
},
UpdateExpression: 'SET address.province = :ma',
ConditionExpression: 'attribute_not_exists(address.province)',
ExpressionAttributeValues: {
':ma': 'MA'
},
ReturnValues: 'ALL_NEW'
};
docClient.update(params, function(err, data) {
if (err) ppJson(err); // an error occurred
else ppJson(data); // successful response
});
By the way, I had to replace state with province as state is a reserved word.
Another totally different method is to simply create the address node when creating the parent document in the first place. For example assuming you have a hash key of id, you might do:
db.put({
Item: {
id: 42,
address: {}
}
}, ...);
This will allow you to simply set the address.state value as the address map already exists:
db.update({
UpdateExpression: 'SET #a.#b = :v',
ExpressionAttributeNames: {
'#a': 'address',
'#b': 'state'
},
ExpressionAttributeValues: {
':v': 'whatever'
}
}, ...);
Some Kotlin code to do this recursively, regardless of how deep it goes. It sets the existence of the parent paths as a condition, and if the condition check fails, it recursively creates those paths first. It has to be in the library's package so it can access package-private fields/classes.
package com.amazonaws.services.dynamodbv2.xspec
import com.amazonaws.services.dynamodbv2.document.Table
import com.amazonaws.services.dynamodbv2.model.ConditionalCheckFailedException
import com.amazonaws.services.dynamodbv2.xspec.ExpressionSpecBuilder.attribute_exists
fun Table.updateItemByPaths(hashKeyName: String, hashKeyValue: Any, updateActions: List<UpdateAction>) {
val parentPaths = updateActions.map { it.pathOperand.path.parent() }
.filter { it.isNotEmpty() }
.toSet() // to remove duplicates
try {
val builder = ExpressionSpecBuilder()
updateActions.forEach { builder.addUpdate(it) }
if (parentPaths.isNotEmpty()) {
var condition: Condition = ComparatorCondition("=", LiteralOperand(true), LiteralOperand(true))
parentPaths.forEach { condition = condition.and(attribute_exists<Any>(it)) }
builder.withCondition(condition)
}
this.updateItem(hashKeyName, hashKeyValue, builder.buildForUpdate())
} catch (e: ConditionalCheckFailedException) {
this.updateItemByPaths(hashKeyName, hashKeyValue, parentPaths.map { M(it).set(mapOf<String, Any>()) })
this.updateItemByPaths(hashKeyName, hashKeyValue, updateActions)
}
}
private fun String.parent() = this.substringBeforeLast('.', "")
Here is a helper function I wrote in TypeScript that handles a single level of nesting, using a recursive method.
I refer to the top-level attribute as a column.
//usage
await setKeyInColumn('customerA', 'address', 'state', "MA")
// Updates a map value to hold a new key value pair. It will create a top-level address if it doesn't exist.
static async setKeyInColumn(primaryKeyValue: string, colName: string, key: string, value: any, _doNotCreateColumn?:boolean) {
const obj = {};
obj[key] = value; // creates a nested value like {address:value}
// Some conditions depending on whether the column already exists or not
const ConditionExpression = _doNotCreateColumn ? undefined:`attribute_not_exists(${colName})`
const AttributeValue = _doNotCreateColumn? value : obj;
const UpdateExpression = _doNotCreateColumn? `SET ${colName}.${key} = :keyval `: `SET ${colName} = :keyval ` ;
try{
const updateParams = {
TableName: TABLE_NAME,
Key: {key:primaryKeyValue},
UpdateExpression,
ExpressionAttributeValues: {
":keyval": AttributeValue
},
ConditionExpression,
ReturnValues: "ALL_NEW",
}
const resp = await docClient.update(updateParams).promise()
if (resp && resp.Attributes) {
// with ReturnValues ALL_NEW the updated item comes back under Attributes
return resp.Attributes[colName];
}
}catch(ex){
//if the column already exists, then rerun and do not create it
if(ex.code === 'ConditionalCheckFailedException'){
return this.setKeyInColumn(primaryKeyValue,colName,key, value, true)
}
console.log("Failed to Update Column in DynamoDB")
console.log(ex);
return undefined
}
}
I've got quite a similar situation. I can think of only one way to do this in one query/atomically:
Extract map values to top-level attributes.
Example
Given I have this post item in DynamoDB:
{
"PK": "123",
"SK": "post",
"title": "Hello World!"
}
And I want to later add an analytics entry to the same partition:
{
"PK": "123",
"SK": "analytics#december",
"views": {
// <day of month>: <views>
"1": "12",
"2": "457463",
// etc
}
}
Like in your case, it's not possible to increment/decrement the per-day view counters in a single query when the analytics item or its views map might not exist yet (the feature could ship later, or you may not want to write empty items).
Proposed solution:
{
"PK": "123",
"SK": "analytics#december",
// <day of month>: <views>
"1": "12", // or "day1" if "1" seems too generic
"2": "457463",
// etc
}
Then you could do something like this (increment +1 example):
{
UpdateExpression: "SET #day = if_not_exists(#day, :zero) + :one",
ExpressionAttributeNames: {
'#day': "1"
},
// update expressions don't allow literal operands, so 0 and 1
// have to be passed in as expression attribute values
ExpressionAttributeValues: {
':zero': 0,
':one': 1
}
}
if the day attribute value doesn't exist, if_not_exists sets the default value of 0
if the item in the database doesn't exist, the update API adds a new one
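For completeness, a minimal sketch of the same increment using the AWS SDK for .NET (AWSSDK.DynamoDBv2); the table name and key values here are assumed from the example above, not taken from the original question:

using System.Collections.Generic;
using System.Threading.Tasks;
using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.Model;

static async Task IncrementDayAsync(IAmazonDynamoDB db, string day)
{
    var request = new UpdateItemRequest
    {
        TableName = "Analytics", // assumed table name
        Key = new Dictionary<string, AttributeValue>
        {
            ["PK"] = new AttributeValue { S = "123" },
            ["SK"] = new AttributeValue { S = "analytics#december" }
        },
        // if_not_exists supplies 0 when the day attribute is missing;
        // UpdateItem itself creates the item when the key is not found
        UpdateExpression = "SET #day = if_not_exists(#day, :zero) + :one",
        ExpressionAttributeNames = new Dictionary<string, string> { ["#day"] = day },
        ExpressionAttributeValues = new Dictionary<string, AttributeValue>
        {
            [":zero"] = new AttributeValue { N = "0" },
            [":one"] = new AttributeValue { N = "1" }
        }
    };
    await db.UpdateItemAsync(request);
}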

All instances of a node by xxxx name

Is there a one-liner, or how can I get all instances of a named list in any node?
Say I get JSON where multiple nodes could have a sub-collection called "comments". How can I get all nodes that contain a collection of "comments"?
Thanks,
If you can provide an example of the JSON, I can give you a definitive answer.
However, I can post some of the JSON I'm parsing and you can see how it works and possibly shape it to fit your needs.
"abridged_cast": [
{
"name": "Clark Gable",
"characters": ["Rhett Butler"]
},
{
"name": "Vivien Leigh",
"characters": ["Scarlett O'Hara"]
},
{
"name": "Leslie Howard",
"characters": ["Ashley Wilkes"]
},
{
"name": "Olivia de Havilland",
"characters": ["Melanie Hamilton"]
},
{
"name": "Hattie McDaniel",
"characters": ["Mammy"]
}
],
Notice how abridged_cast is an array of values, and one value in that array (characters) is an array itself.
Here's how I fetch the data:
var castMembers = (JArray) x["abridged_cast"];
foreach (var castMember in castMembers)
{
CastMember member = new CastMember();
member.Actor = (string) castMember["name"];
var characters = (JArray) castMember["characters"];
foreach (var character in characters)
{
member.Characters.Add((string) character);
}
// add the member once, after all of its characters have been collected
movie.Cast.Add(member);
}
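To answer the original question more directly (finding every node that owns a "comments" collection), a hedged sketch using Json.NET's JSONPath support; json here stands for whatever document you are parsing:

using System.Linq;
using Newtonsoft.Json.Linq;

var root = JToken.Parse(json); // json: your document string
// "$..comments" is a recursive descent: it returns every property named
// "comments" anywhere in the tree, at any depth
var commentLists = root.SelectTokens("$..comments").OfType<JArray>();
foreach (JArray comments in commentLists)
{
    // comments.Parent is the JProperty; its Parent is the node that owns the collection
    var owner = (JObject) comments.Parent.Parent;
}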
