Elasticsearch index limit? - symfony

I'm working with Symfony 4, FosElasticaBundle and Kibana, and I have approximately 600K items to index in Elasticsearch.
I'm using the command fos:elastica:populate, and when I run it I get something like:
Resetting myindex
0/546097 [>---------------------------] 0%
So, 546097 is the exact number of items I need to index, but after indexing, when I try to get all items from Elasticsearch with curl like this:
curl -XGET "http://localhost:9200/myindex/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query" : {
    "match_all" : {}
  }
}'
I get count=420,603.
When I check the index with Kibana, I see the same count, 420,603.
I have run the indexing several times (reset + populate), and it always ends at the same count of 420,603.
So my question is: what is this limit? Why does indexing stop at exactly 420,603 items?
Thanks !
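One way to narrow this down (a minimal sketch, assuming Python with the requests package against the local Elasticsearch instance from the question) is to compare the _count API with the search total; if both report 420,603, the remaining documents were never indexed rather than hidden by some search limit:
import requests

BASE = "http://localhost:9200/myindex"  # index name taken from the question

# Total documents according to the count API
count = requests.get(BASE + "/_count").json()["count"]

# Total hits according to a match_all search (size=0: no documents returned)
hits = requests.get(
    BASE + "/_search",
    json={"query": {"match_all": {}}, "size": 0},
).json()["hits"]["total"]

print("count API:   ", count)
print("search total:", hits)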

Related

Boto3 DynamoDb Query with Select Count without pagination

This is more of a concept clarification. I can find the actual counts using Boto3 via repeated queries using the LastEvaluatedKey of the previous response.
I want to count items matching certain conditions in DynamoDB. I am using Select='COUNT', which according to the docs [1] should just return the count of matching items, and my assumption was that the response would not be paginated.
COUNT - Returns the number of matching items, rather than the matching
items themselves.
When I try it via the AWS CLI, my assumption seems correct (like the REST API samples in the doc [1]):
aws dynamodb query \
--table-name 'my-table' \
--index-name 'classification-date-index' \
--key-condition-expression 'classification = :col AND #dt BETWEEN :start AND :end' \
--expression-attribute-values '{":col" : {"S":"INTERNAL"}, ":start" : {"S": "2020-04-10"}, ":end" : {"S": "2020-04-25"}}' \
--expression-attribute-names '{"#dt" : "date"}' \
--select 'COUNT'
{
    "Count": 18817,
    "ScannedCount": 18817,
    "ConsumedCapacity": null
}
But when I try it using Python 3 and Boto3, the response is paginated, and I have to repeat the query until LastEvaluatedKey is no longer returned.
In [22]: table.query(IndexName='classification-date-index', Select='COUNT', KeyConditionExpression= Key('classification').eq('INTERNAL') & Key('date').between('2020-04-10', '2020-04-25'))
Out[22]:
{'Count': 5667,
'ScannedCount': 5667,
'LastEvaluatedKey': {'classification': 'INTERNAL',
'date': '2020-04-14',
's3Path': '<redacted>'},
'ResponseMetadata': {'RequestId': 'TH3ILO0P47QB7GAU9M3M98BKJVVV4KQNSO5AEMVJF66Q9ASUAAJG',
'HTTPStatusCode': 200,
'HTTPHeaders': {'server': 'Server',
'date': 'Sat, 25 Apr 2020 13:32:36 GMT',
'content-type': 'application/x-amz-json-1.0',
'content-length': '230',
'connection': 'keep-alive',
'x-amzn-requestid': 'TH3ILO0P47QB7GAU9M3M98BKJVVV4KQNSO5AEMVJF66Q9ASUAAJG',
'x-amz-crc32': '133035383'},
'RetryAttempts': 0}}
I expected the same behaviour from the Boto3 SDK as from the AWS CLI, as the response seems smaller than 1 MB.
The docs are slightly conflicting ...
"Paginating Table Query Results" [2] page says :
DynamoDB paginates the results from Query operations. With pagination,
the Query results are divided into "pages" of data that are 1 MB in
size (or less). An application can process the first page of results,
then the second page, and so on. A single Query only returns a result
set that fits within the 1 MB size limit.
While the "Query" [1] page says:
A single Query operation will read up to the maximum number of items
set (if using the Limit parameter) or a maximum of 1 MB of data and
then apply any filtering to the results using FilterExpression.
[1] https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Query.html
[2] https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Query.Pagination.html
Just ran into this issue myself. The AWS CLI does automatic summation of the pages from the DynamoDB query. To stop it from doing this, add --no-paginate to your command, as listed on this page.
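For reference, a minimal sketch of the manual pagination Boto3 needs (table, index and key values taken from the question): keep re-issuing the query with ExclusiveStartKey and sum the per-page Count values until LastEvaluatedKey is no longer returned.
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("my-table")

def count_items():
    kwargs = dict(
        IndexName="classification-date-index",
        Select="COUNT",
        KeyConditionExpression=Key("classification").eq("INTERNAL")
        & Key("date").between("2020-04-10", "2020-04-25"),
    )
    total = 0
    while True:
        resp = table.query(**kwargs)
        total += resp["Count"]
        last_key = resp.get("LastEvaluatedKey")
        if not last_key:
            return total
        # Continue from where the previous page stopped
        kwargs["ExclusiveStartKey"] = last_key

print(count_items())  # should match the 18817 reported by the CLI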

Cannot handle the reserved keyword "Key" in the DynamoDB command line

I'm trying to execute a query against DynamoDB. The command line is as below:
aws dynamodb query --table-name History
--key-condition-expression "#k = :v1" --expression-attribute-names '{"#k":"Key"}' --expression-attribute-values file://query.json
JSON file:
{ ":v1": { "S":"cef50df4-b063-cebb-e0c0-08d651599ab7"} }
My table "History" has the hash key column "Key". When I execute this command line, it always tells me:
Error parsing parameter '--expression-attribute-names': Expected: '=',
received: ''' for input: '{#k:Key}'
Can someone tell me how to correct it? Thanks a lot.
The problem is in your JSON format '{"#k":"Key"}'.
Please change --expression-attribute-names '{"#k":"Key"}' to
--expression-attribute-names '{\"#k\":\"Key\"}' and try
Reference link: https://github.com/aws/aws-cli/issues/2298
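If the CLI quoting keeps getting in the way, the same query can be sketched in Python with Boto3 (assuming the table and key value from the question); the high-level resource API generates the #-placeholders for reserved words such as "Key" itself, so no manual escaping is needed:
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("History")

# Key("Key") is translated into an expression with a #-placeholder internally,
# so the reserved word never has to be escaped by hand.
resp = table.query(
    KeyConditionExpression=Key("Key").eq("cef50df4-b063-cebb-e0c0-08d651599ab7")
)
print(resp["Count"], "items")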

Filtering information from JIRA

I would like to get some information from a JIRA project using an HTTP request, e.g.:
curl -D- -u uname:pass -X PUT -d "Content-Type: application/json" http://localhost:8080/jira/rest/api/2/search?jql=project=XXX%20created='-5d'
In the response I receive a lot of information, but I would like to get only one field:
{"expand":"schema,names","startAt":0,"maxResults":50,"total":1234,"issues":
here - multiple lines....
Do you maybe have an idea how I can get only the "total":1234 field?
Thank you in advance.
Add the following to your URL:
&maxResults=0
Which will result in a return like:
{
  "startAt": 0,
  "maxResults": 0,
  "total": 3504,
  "issues": []
}
You can then pipe your curl to awk and get only the number with:
curl --silent "https://jira.atlassian.com/rest/api/2/search?jql=project=STASH%20&created=%27-5d%27&maxResults=0" | awk '{split($0,a,":"); print a[4]}' | awk '{split($0,a,","); print a[1]}'
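If you would rather avoid the awk gymnastics, the same trick works from Python (a sketch, assuming the requests package and the host/credentials from the question; the JQL is an assumed rendering of the original filter): with maxResults=0 the issues array comes back empty and "total" can be read straight from the JSON.
import requests

resp = requests.get(
    "http://localhost:8080/jira/rest/api/2/search",
    params={
        "jql": "project = XXX AND created >= -5d",  # adjust to your exact filter
        "maxResults": 0,                            # no issues, just the metadata
    },
    auth=("uname", "pass"),
)
print(resp.json()["total"])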

Using secondary index and key filter together in RIAK mapred

Is it possible to use a secondary index and key filters together in a map reduce query? Something like this:
*{"inputs":{ "bucket" :"ignore_bucket1",
"index" :"secindex_bin",
"key" :"secIndexVal",
"key_filters":[["and",
[["tokenize", "-", 5], ["greater_than_eq", "20120101"]],
[["tokenize", "-", 5], ["less_than_eq", "20120112"]]
]]
}}
Also, is it efficient to get a list of keys using the secondary index and then run a key filter on the returned keys?
As far as I know it is not possible to combine these in the input statement, as they represent very different ways of retrieving keys. It would be possible to implement this as you suggested, by using the secondary index to retrieve the initial set (avoiding a scan of all keys) and then implementing the key filtering logic as a map phase function.
Another, probably faster, way to get around it could perhaps be to create an additional compound binary secondary index, e.g. [secIndexVal]_[date]. If this is ensured to sort correctly, you could run a single secondary index range query on this and get the values you specified above.
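As a rough illustration of that compound-index idea (a sketch only; the index name secindexval_date_bin and the URL are assumptions based on Riak's HTTP secondary-index range interface, not something from the question): each object would be written with a "<secIndexVal>_<date>" index entry, and a single range query then covers both conditions.
import requests

bucket = "ignore_bucket1"                 # bucket name from the question
index = "secindexval_date_bin"            # hypothetical compound index
start, end = "secIndexVal_20120101", "secIndexVal_20120112"

url = "http://localhost:8098/buckets/%s/index/%s/%s/%s" % (bucket, index, start, end)
print(requests.get(url).json()["keys"])   # keys of the matching objects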
Afaik you cannot do this through the HTTP API.
As Christian mentioned, you can use a range query, but you don't need an alternative index, as you already have the primary key index, which can be referenced with $key as the index field:
olav#nyx ~ (master *%) » curl http://nyx:8098/riak/test/20120931 -d "31. Sep 2012"
olav#nyx ~ (master *%) » curl http://nyx:8098/riak/test/20121002 -d "02. Oct 2012"
olav#nyx ~ (master *%) » curl http://nyx:8098/riak/test/20121021 -d "21. Oct 2012"
olav#nyx ~ (master *%) » curl http://nyx:8098/riak/test/20121102 -d "The future"
olav#nyx ~ (master *%) » curl -X POST -H "content-type: application/json" \
-d @- http://localhost:8098/mapred \
<<EOF
{ "inputs":{
"bucket":"test"
, "index":"\$key"
, "start":"20121001"
, "end":"20121101"
}
, "query":[{
"reduce":{
"language":"erlang"
, "module":"riak_kv_mapreduce"
, "function":"reduce_identity"
, "keep":true
}
}]
}
EOF
# ...
[["test","20121021"],["test","20121002"]]
If you really want key filters, you can use the Erlang PB client and do something along these lines (you need riak_kv in your code path):
{ok, Pid} = riakc_pb_socket:start_link("127.0.0.1", 8087),
Index = {index, <<"test1">>, <<"field_int">>, <<"123">>},
{ok, Filter} = riak_kv_mapred_filters:build_filter([[<<"ends_with">>,"1"]]).
MapReduce = [
{ reduce
, {qfun, fun(X, F) -> lists:filter(fun({A, B}) -> F(B) end, X) end}
, riak_kv_mapred_filters:compose(Filter)
, true}],
riakc_pb_socket:mapred(Pid, Index, MapReduce).

Call Elasticsearch from a shell script to index a PDF document

I installed Elasticsearch 5.0.1 and the corresponding ingest-attachment plugin. I tried indexing a PDF document from a shell script as below:
#!/bin/ksh
var=$(base64 file_name.pdf)
var1=$(curl -XPUT 'http://localhost:9200/my_index4/my_type/my_id?pipeline=attachment&pretty' -d' { "data" : $var }')
echo $var1
I got this error:
{ "error" : { "root_cause" : [ { "type" : "exception", "reason" :
"java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field
[data]]; nested: IllegalArgumentException[Illegal base64 character 24];",
"header" : { "processor_type" : "attachment" } } ]...
Can anyone please help with resolving the above issue? I am not sure whether I am passing an invalid base64 character.
Please note that when I pass it like this, it works:
var1=$(curl -XPUT 'http://localhost:9200/my_index4/my_type/my_id?pipeline=attachment&pretty'
-d' { "data" : "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=" }')
I guess the issue has to do with the shell not expanding variables within single quotes; you need double quotes to expand them, i.e.
change -d' { "data" : $var }'
to
-d '{"data" : "'"$(base64 file_name.pdf)"'"}'
which passes the base64 stream directly,
(or)
-d '{"data" : "'"$var"'"}'
More about quoting and variables in ksh here.
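An equivalent sketch in Python (assuming the requests package and the index, type and pipeline names from the question) sidesteps the shell quoting problem entirely by building the JSON body programmatically:
import base64
import requests

with open("file_name.pdf", "rb") as f:
    # b64encode produces a single unwrapped string, so no stray newlines end up in the JSON
    data = base64.b64encode(f.read()).decode("ascii")

resp = requests.put(
    "http://localhost:9200/my_index4/my_type/my_id",
    params={"pipeline": "attachment", "pretty": "true"},
    json={"data": data},
)
print(resp.text)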
