Indexing in Vespa is slow - bigdata

Indexing into my local Vespa instance is slow.
My configuration:
<container id="default" version="1.0">
    <search />
    <document-api />
    <nodes>
        <node hostalias="node1" />
    </nodes>
</container>
<content id="bo" version="1.0">
    <redundancy>1</redundancy>
    <documents>
        <document type="psearch" mode="index" />
    </documents>
    <nodes>
        <node hostalias="node1" distribution-key="0" />
    </nodes>
</content>
and schema:
schema psearch {
    document psearch {
        field Id type int {
            indexing: summary | attribute
            attribute: fast-search
        }
        field Name type string {
            indexing: summary | index | attribute
            index: enable-bm25
        }
        field AdId type string {
            indexing: summary | index | attribute
            index: enable-bm25
        }
        field Country type string {
            indexing: summary | index | attribute
            index: enable-bm25
        }
        field Avatar type string {
            indexing: summary | index | attribute
            index: enable-bm25
        }
        field Value type long {
            indexing: summary | attribute
            attribute: fast-search
        }
        field Numbers type int {
            indexing: summary | attribute
            attribute: fast-search
        }
        field BotLastTime type long {
            indexing: summary | attribute
            attribute: fast-search
        }
        field BotDailyCount type int {
            indexing: summary | attribute
            attribute: fast-search
        }
        field Platform type string {
            indexing: summary | index | attribute
            index: enable-bm25
        }
    }
    fieldset default {
        fields: Id, Name, AdId, Country, Avatar, Numbers, BotLastTime, BotDailyCount, Platform
    }
    rank-profile default {
        first-phase {
            expression: nativeRank(Id, Name, AdId, Country, Avatar, Numbers, BotLastTime, BotDailyCount, Platform)
        }
    }
}
I use the /document/v1 API to push documents into Vespa (a POST that puts a given document by ID):
https://docs.vespa.ai/en/reference/document-v1-api-reference.html
In my tests on a local Vespa instance, pushing 100k documents takes around 2.3 milliseconds per document.
I ran the same test with Elasticsearch, where the average is around 1.7 milliseconds per document. I am trying to get at least the same performance as with Elasticsearch.
Any idea how I can improve the time per document push?

Did you try vespa-feed-client (https://docs.vespa.ai/en/vespa-feed-client.html)? It is optimized for throughput and is normally the best client for pushing indexing load. This question was also asked at https://github.com/vespa-engine/vespa/issues/25715, where more answers can be found.
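As a rough illustration of why a throughput-oriented client helps, here is a minimal Python sketch that keeps several requests in flight instead of sending documents one at a time. It is not the vespa-feed-client itself: feed_concurrently, fake_send and max_in_flight are hypothetical names, and fake_send only stands in for a single /document/v1 POST so that the example runs without a Vespa instance.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def feed_concurrently(
    docs: Iterable[dict],
    send_document: Callable[[dict], bool],
    max_in_flight: int = 32,
) -> int:
    """Send documents with up to max_in_flight calls running at once.

    Returns the number of documents for which send_document reported success.
    """
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        return sum(1 for ok in pool.map(send_document, docs) if ok)


def fake_send(doc: dict) -> bool:
    # Hypothetical stand-in for one HTTP POST to /document/v1.
    # It "succeeds" whenever the document carries an Id field.
    return "Id" in doc


docs = ({"Id": i, "Name": f"user{i}"} for i in range(1000))
sent = feed_concurrently(docs, fake_send)
print(sent)
```

With a real send function, the total wall-clock time should depend far less on the latency of each individual request, which is the effect the feed client is designed to achieve.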

Related

Adding key name to value using jq

I am trying to dynamically assign each key name as its own value in my JSON.
This is the JSON I am using:
{
  "test1": "",
  "test2": "",
  "test3": ""
}
The result I would like to obtain looks like this:
{
  "test1": "test1",
  "test2": "test2",
  "test3": "test3"
}
I am not familiar with jq, and the closest result I got is using:
keys[] as $key | {"\($key)": "\($key)"} | .
Here is the output:
{
"test1": "test1"
}
{
"test2": "test2"
}
{
"test3": "test3"
}
with_entries lets you manipulate .key and .value for each field. Just set one to the value of the other:
with_entries(.value = .key)
{
"test1": "test1",
"test2": "test2",
"test3": "test3"
}
Following your approach, you could collect your result objects into an array using the array constructors […] around your filter, and then add up the array's items producing one merged object. (Note that | . can be dropped as it doesn't do anything but reproduce itself, and that the string interpolation "\($key)" is just the same as $key, given $key is a string, which is the case here as object field names are always strings.)
[keys[] as $key | {($key): $key}] | add
You may also entirely drop the use of variables as there is no other context interfering:
[keys[] | {"\(.)": .}] | add
And there is a shortcut for patterns like [.[] | …] called map:
keys | map({"\(.)": .}) | add
Alternatively, you also might want to consider using reduce for an iterative manipulation, and/or keys_unsorted which acts like keys but produces the keys in the original (unsorted) order:
reduce keys_unsorted[] as $key (.; .[$key] = $key)
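For comparison, and only as a sketch outside jq, the same transformation in Python is a dictionary comprehension that rebuilds the object with each value set to its key. The variable names raw, data and result are illustrative.

```python
import json

raw = '{"test1": "", "test2": "", "test3": ""}'
data = json.loads(raw)

# Same idea as with_entries(.value = .key):
# rebuild the object, setting every value to its own key.
result = {key: key for key in data}

print(json.dumps(result, indent=2))
```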

Remove multiple entries from an array of objects using jq

I have the following JSON and want to remove entries from the eBooks array if their language is not in the array ["Pascal", "Python"] (this will eventually be a dynamic array; it is just an example):
{
  "eBooks": [
    {
      "language": "Pascal",
      "edition": "third"
    },
    {
      "language": "Python",
      "edition": "four"
    },
    {
      "language": "SQL",
      "edition": "second"
    }
  ]
}
I was hoping to do something like this, which, if it worked, would delete the last entry (SQL) because it is not in the array. But it doesn't work:
jq '.ebooks[] | select ( .language | in(["Pascal", "Python"]))' ebooks.json
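You're almost there. Use del, IN and a capital B in eBooks :) Note that the command below deletes the entries whose language is in the list, leaving only SQL; to delete the entries that are not in the list instead, add | not after IN(...) inside select.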
jq 'del(.eBooks[] | select(.language | IN("Pascal", "Python")))' ebooks.json
{
"eBooks": [
{
"language": "SQL",
"edition": "second"
}
]
}
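The question asks for the opposite selection, removing the entries that are not in the list. As a sketch of that variant in Python rather than jq (the variable names doc and allowed are illustrative, and allowed could be built dynamically):

```python
import json

doc = json.loads("""
{"eBooks": [
  {"language": "Pascal", "edition": "third"},
  {"language": "Python", "edition": "four"},
  {"language": "SQL", "edition": "second"}
]}
""")

allowed = {"Pascal", "Python"}

# Keep only the entries whose language is in the allowed set,
# which removes every entry that is not listed.
doc["eBooks"] = [book for book in doc["eBooks"] if book["language"] in allowed]

print(json.dumps(doc, indent=2))
```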

Kusto: Permission based display of columns

I am trying to access function parameters within the case statement of a function and display either the data or "filtered" based on a permission flag. Is it possible?
Use case: typecast the value based on columnType, then check whether the user has permission to view the column, and display either the value or something like "filtered".
Here is what I tried:
function rls_columnCheck
.create-or-alter function rls_columnCheck(tableName:string, columnName: string, value:string, columnType:string, IsInGroupPII:bool, IsInGroupFinance:bool) {
    let PIIColumns = rls_getTablePermissions(tableName, "PII");
    let FinanceColumns = rls_getTablePermissions(tableName, "Finance");
    let val= case(columnType=="bool", tobool(value),
                  columnType=="datetime", todatetime(value),
                  columnType=="int", toint(value),
                  value);
    iif(columnName in (PIIColumns),
        iif(columnName in (FinanceColumns),
            iif(IsInGroupPII == true and IsInGroupFinance == true,
                val,
                "filtered"), // PII True, Fin True
            iif(IsInGroupPII == true,
                val,
                "filtered") // PII True, Fin False
        ),
        iif(columnName in (FinanceColumns),
            iif(IsInGroupFinance == true,
                val,
                "filtered"), // PII False, Fin True
            val // PII False, Fin False
        )
    );
}
Error:
Call to iff(): #then data type (int) must match the #else data type (string)
val in your function must have a single, well-defined data type that is known at "compile" time of the query.
You can't have different cases that each produce a different type (bool, datetime, int, string in your case statement), hence the error.
If it makes sense in your use case, you can try to always have val typed as string.
This is not a good way to implement RLS, because it causes the engine to run a function for every column of every record. It has several downsides:
Displaying the table's contents becomes slower (even if you have full permissions).
Queries on the table won't benefit from the indexes Kusto stores (suppose you query PermissionTesting2 | where Col1 has "blablabla": instead of checking the index for "blablabla", the engine has to scan all the data, because it has to apply a function to every single cell).
A better approach is to do something like this:
let UserCanSeePII = current_principal_is_member_of('aadgroup=group1@domain.com');
let UserCanSeeFinance = current_principal_is_member_of('aadgroup=group2@domain.com');
let ResultWithPII = YourTable | where UserCanSeePII and (not UserCanSeeFinance) | where ... | extend ...;
let ResultWithFinance = YourTable | where UserCanSeeFinance and (not UserCanSeePII) | where ... | extend ...;
let ResultWithPIIandFinance = YourTable | where UserCanSeeFinance and UserCanSeePII | where ... | extend ...;
let ResultWithoutPIIandFinance = YourTable | where (not UserCanSeePII) and (not UserCanSeeFinance) | where ... | extend ...;
union ResultWithPII, ResultWithFinance, ResultWithPIIandFinance, ResultWithoutPIIandFinance

Pick one of union

I am trying to use the type ImageURISource which is here - https://github.com/facebook/react-native/blob/26684cf3adf4094eb6c405d345a75bf8c7c0bf88/Libraries/Image/ImageSource.js#L15
type ImageURISource = {
  uri?: string,
  bundle?: string,
  method?: string,
  headers?: Object,
  body?: string,
  cache?: 'default' | 'reload' | 'force-cache' | 'only-if-cached',
  width?: number,
  height?: number,
  scale?: number,
};
export type ImageSource = ImageURISource | number | Array<ImageURISource>;
However, it is exported only as part of a union with two other types. Is it possible to pick just one member from a union?
I was hoping to do:
$Pick<ImageSource, ImageURISource>
It's not very pretty, but you could use refinement to specifically refine the type that you want out of it by doing something like this:
var source: ImageSource = {}
if (typeof source === "number" || Array.isArray(source)) throw new Error();
var uriSource = source;
type ImageURISource = typeof uriSource;
The downside here is that if they add more types to the union, your code would start failing again.
It seems like you'd be best off making a PR to react-native to expose that type.

Using jsonPath looking for a string

I'm trying to use jsonPath and the pick function to determine if a rule needs to run or not based on the current domain. A simplified version of what I'm doing is here:
global
{
    dataset shopscotchMerchants <- "https://s3.amazonaws.com/app-files/dev/merchantJson.json" cachable for 2 seconds
}
rule checkdataset is active
{
    select when pageview ".*" setting ()
    pre
    {
        merchantData = shopscotchMerchants.pick("$.merchants[?(@.merchant=='Telefora')]");
    }
    emit
    <|
        console.log(merchantData);
    |>
}
The console output I expect is the Telefora object; instead I get all three objects from the JSON file.
If instead of merchant=='Telefora' I use merchantID==16 then it works great. I thought jsonPath could do matches to strings as well. Although the example above isn't searching against the merchantDomain part of the json, I'm experiencing the same problem with that.
Your problem comes from the fact that, as stated in the documentation, the string equality operators are eq, neq, and like. == is only for numbers. In your case, you want to test if one string is equal to another string, which is the job of the eq string equality operator.
Simply swap == for eq in your JSONPath filter expression and you will be good to go:
global
{
    dataset shopscotchMerchants <- "https://s3.amazonaws.com/app-files/dev/merchantJson.json" cachable for 2 seconds
}
rule checkdataset is active
{
    select when pageview ".*" setting ()
    pre
    {
        merchantData = shopscotchMerchants.pick("$.merchants[?(@.merchant eq 'Telefora')]"); // replace == with eq
    }
    emit
    <|
        console.log(merchantData);
    |>
}
I put this to the test in my own test ruleset, the source for which is below:
ruleset a369x175 {
    meta {
        name "test-json-filtering"
        description <<
        >>
        author "AKO"
        logging on
    }
    dispatch {
        domain "exampley.com"
    }
    global {
        dataset merchant_dataset <- "https://s3.amazonaws.com/app-files/dev/merchantJson.json" cachable for 2 seconds
    }
    rule filter_some_delicous_json {
        select when pageview "exampley.com"
        pre {
            merchant_data = merchant_dataset.pick("$.merchants[?(@.merchant eq 'Telefora')]");
        }
        {
            emit <|
                try { console.log(merchant_data); } catch(e) { }
            |>;
        }
    }
}

Resources