worker_limit_reached on parallel map reduce jobs - riak

I have 50 hosts trying to run the map reduce job below on Riak. I am getting the error below where some of the hosts complain about the worker_limit being reached.
I'm looking for insight into whether I can tune the system to avoid this error. I couldn't find much documentation about worker_limit.
{"phase":0,"error":"[worker_limit_reached]","input":"{<<\"provisionentry\">>,<<\"R89Okhz49SDje0y0qvcnkK7xLH0\">>}","type":"result","stack":"[]"} with query MapReduce(path='/mapred', reply_headers={'content-length': '144', 'access-control-allow-headers': 'Content-Type', 'server': 'MochiWeb/1.1 WebMachine/1.10.8 (that head fake, tho)', 'connection': 'close', 'date': 'Thu, 27 Aug 2015 00:32:22 GMT', 'access-control-allow-origin': '*', 'access-control-allow-methods': 'POST, GET, OPTIONS', 'content-type': 'application/json'}, verb='POST', headers={'Content-Type': 'application/json'}, data=MapReduceJob(inputs=MapReduceInputs(bucket='provisionentry', key=u'34245e92-ccb5-42e2-a1d9-74ab1c6af8bf', index='testid_bin'), query=[MapReduceQuery(map=MapReduceQuerySpec(language='erlang', module='datatools', function='map_object_key_value'))]))

MapReduce in Riak does not scale well, so it does not work well as part of a user-facing service.
It is suitable for periodic administrative tasks, or pre-calculations when the number of jobs can be limited.
Since the map phase of the job is a coverage query, each map needs to involve at least 1/n_val (rounded up) of the vnodes, using one worker at each. Since you cannot guarantee that the selected coverage sets do not overlap, you should not expect to be able to run more simultaneous map reduce jobs than your worker limit setting.
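To make that concrete: with the default n_val of 3 and a 64-partition ring (defaults assumed here purely for illustration), each map involves at least ceil(64/3) = 22 vnodes, each of which ties up one pipe worker for the duration of the job.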
The default worker limit is 50 (https://github.com/basho/riak_pipe/blob/develop/src/riak_pipe_vnode.erl#L86), but you can adjust it by setting {worker_limit, N} in the riak_pipe section of app.config or advanced.config.
Keep in mind that each worker is a process, so you may need to increase the process limit for the erlang VM as well.
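As a minimal sketch, an advanced.config fragment raising the limit might look like this (the value 100 is purely illustrative, not a recommendation):

[
  {riak_pipe, [
    %% illustrative value; the default is 50
    {worker_limit, 100}
  ]}
].

If you raise it substantially, the Erlang VM's process limit (usually the +P flag in vm.args) may need to grow with it.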

Related

Understanding elasticsearch circuit_breaking_exception

I am trying to figure out why I am getting this error when indexing a document from a python web app.
The document in this case is a base64 encoded string of a file of size 10877 KB.
I post it to my web app, which then posts it via elasticsearch.py to my elastic instance.
My elastic instance throws an error:
TransportError(429, 'circuit_breaking_exception', '[parent] Data
too large, data for [<http_request>] would be
[1031753160/983.9mb], which is larger than the limit of
[986932838/941.2mb], real usage: [1002052432/955.6mb], new bytes
reserved: [29700728/28.3mb], usages [request=0/0b,
fielddata=0/0b, in_flight_requests=29700728/28.3mb,
accounting=202042/197.3kb]')
I am trying to understand why my 10877 KB file ends up at a size of 983mb as reported by elastic.
I understand that increasing the JVM max heap size may allow me to send bigger files, but I am more wondering why it appears the request size is 10x the size of what I am expecting.
Let us see what we have here, step by step:
[parent] Data too large, data for [<http_request>]
gives the name of the circuit breaker
would be [1031753160/983.9mb],
estimates how large the heap would be if the request were executed
which is larger than the limit of [986932838/941.2mb],
tells us the current setting of the circuit breaker above
real usage: [1002052432/955.6mb],
this is the real usage of the heap
new bytes reserved: [29700728/28.3mb],
actually an estimate of the impact this request will have (the size of the data structures that need to be created in order to process it). Your ~10 MB file apparently expands to about 28.3 MB in memory.
usages [
request=0/0b,
fielddata=0/0b,
in_flight_requests=29700728/28.3mb,
accounting=202042/197.3kb
]
This last part shows the usage of the individual child breakers (request, fielddata, in-flight requests, accounting) that feed into the estimation.
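Putting the numbers together also answers the "why ~983 MB for a ~10 MB file" part: the 983.9mb figure is not the size of your request, it is the projected total heap usage if the request were admitted.

  real usage           1002052432 bytes (~955.6mb)
+ new bytes reserved     29700728 bytes (~28.3mb)
= would-be usage       1031753160 bytes (~983.9mb), which exceeds the limit of 986932838 bytes (~941.2mb)

Your upload only accounts for the ~28.3mb of in_flight_requests; the breaker trips because the heap is already ~955mb full. (As a side note, the 941.2mb limit is consistent with the parent breaker's usual default of 95% of the JVM heap, which would put this node's heap at roughly 1 GB; that is an assumption based on default settings, not something visible in the error.)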

Cosmos Db library Microsoft.Azure.DocumentDB.Core (2.1.0) - Actual REST invocations

We are attempting to mock CosmosDb invocations with WireMock (https://github.com/WireMock-Net/WireMock.Net) so we can build integration tests for our .NET Core 2.1 microservice.
By looking at the WireMock instance Request/Response entries, we can observe the following:
1) GET towards "/"
We mock the returning metadata of databases
THIS IS OK
2) GET towards collection (in our case: "/dbs/Chunker/colls/RHTMLChunks")
Returns metadata about the collections
THIS IS OK
3) POST a Query that results in one document being returned towards the documents endpoint on the collection (in our case: "/dbs/Chunker/colls/RHTMLChunks/docs")
I have tried to emulate what we get when we do the exact same query towards the CosmosDb instance in Postman, including headers and response.
However I observe that the lib does the query again, and again, and again....
(I can see this by pausing in Visual Studio, then look at the RequestLog in WireMock)
Does anyone know what should be returned? I have set up WireMock to return the following JSON payload:
{
  "_rid": "q0dcAOelSAI=",
  "Documents": [
    {
      "id": "gL20020621z2D34-1",
      "ChunkSize": 658212,
      "TotalChunks": 2,
      "Metadata": {
        "Active": true,
        "PublishedDate": "",
      },
      "ChunkId": 1,
      "Markup": "<h1>hello</h1>",
      "MainDestination": "gL20020621z2D34",
      "_rid": "q0dcAOelSAIHAAAAAAAAAA==",
      "_self": "dbs/q0dcAA==/colls/q0dcAOelSAI=/docs/q0dcAOelSAIHAAAAAAAAAA==/",
      "_etag": "\"0100e92a-0000-0000-0000-5ba96cf70000\"",
      "_attachments": "attachments/",
      "_ts": 1537830135
    }
  ],
  "_count": 0
}
Problems:
1) Cannot find a .pdb belonging to Microsoft.Azure.DocumentDB.Core v2.1.0
2) What payload/headers should be returned so that the library will NOT blow up and retry when we invoke:
var response = await documentQuery.ExecuteNextAsync<DocumentDto>(); // this hangs forever
Please help :)
We're working on open sourcing the C# code base and some other fun improvements to make this easier. In the meantime, I'd advocate for using the emulator for local testing etc., although I understand mocking is still a lot faster and nicer - it'll just be hard :)
My best pointer is actually our Node.js code base, since that's public already. The query code is relatively hard to follow, but basically: you create a query, we look up all the partitions we need to talk to, then we send a request for each partition and keep querying until we don't get back a continuation token anymore (or MaxBufferedItemCount etc. goes over the limit, and we pause until it goes back down, and so on).
Effectively, we send out N requests for each partition, where N is the number of pages of results and can vary per partition and query. You'd likely be able to mock a single-partition, single-page response relatively easily, but a full multi-partition response isn't gonna be fun.
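One detail of that contract that matters for mocking (this is the documented Cosmos DB REST paging behaviour, offered as an assumption about what the SDK keys on, not something confirmed for this exact library version): a query response that has more pages carries an x-ms-continuation response header, and the client re-issues the query with that token until the header is absent. So a mocked "last page" should probably return the Documents payload with a 200 status and no x-ms-continuation header at all, roughly:

HTTP/1.1 200 OK
Content-Type: application/json
(no x-ms-continuation header on the final page)

{ "_rid": "q0dcAOelSAI=", "_count": 1, "Documents": [ ... ] }

If the mock returns a continuation header, even an empty one, the SDK may treat it as another page and query again.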
As I mentioned in the beginning, we've got some cool stuff coming, hopefully before the end of the year, which will make offline mocking easier, as well as open sourcing it finally. You might be better off with the emulator until then.

Microsoft Academic Graph CalcHistogram Being Aborted

I'm using the CalcHistogram endpoint to query the total number of paper entities for every year from around 1980 to 2018. A typical response looks like this:
{
  "expr": "Y=2001",
  "num_entities": 4179575,
  "histograms": []
}
That's 4179575 papers from the year 2001.
However, starting at around year 2002 (the exact year is not consistent), the return values are being aborted.
{
  "expr": "Y=2002",
  "histograms": [],
  "aborted": true
}
This is what my request looks like. I've tried using both GET and POST methods.
GET https://api.labs.cognitive.microsoft.com/academic/v1.0/calchistogram?expr=Y=2002&model=latest&count=10&offset=0 HTTP/1.1
Host: api.labs.cognitive.microsoft.com
Any ideas on why this is being aborted or how I can find the total number of papers for each year?
Thanks!
I found that the CalcHistogram endpoint also accepts a timeout parameter. The API will try to evaluate the query until the timeout (which defaults to 1000 milliseconds) is reached, at which point it returns aborted: true.
So, just add timeout=5000 to your request.
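For example (5000 is just an illustrative value; pick whatever your heavier queries need):

GET https://api.labs.cognitive.microsoft.com/academic/v1.0/calchistogram?expr=Y=2002&model=latest&count=10&offset=0&timeout=5000 HTTP/1.1
Host: api.labs.cognitive.microsoft.com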

IncompleteRead error when submitting neo4j batch from remote server; malformed HTTP response

I've set up neo4j on server A, and I have an app running on server B which is to connect to it.
If I clone the app on server A and run the unit tests, it works fine. But running them on server B, the setup runs for 30 seconds and fails with an IncompleteRead:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/nose-1.3.1-py2.7.egg/nose/suite.py", line 208, in run
    self.setUp()
  File "/usr/local/lib/python2.7/site-packages/nose-1.3.1-py2.7.egg/nose/suite.py", line 291, in setUp
    self.setupContext(ancestor)
  File "/usr/local/lib/python2.7/site-packages/nose-1.3.1-py2.7.egg/nose/suite.py", line 314, in setupContext
    try_run(context, names)
  File "/usr/local/lib/python2.7/site-packages/nose-1.3.1-py2.7.egg/nose/util.py", line 469, in try_run
    return func()
  File "/comps/comps/webapp/tests/__init__.py", line 19, in setup
    create_graph.import_films(films)
  File "/comps/comps/create_graph.py", line 49, in import_films
    batch.submit()
  File "/usr/local/lib/python2.7/site-packages/py2neo-1.6.3-py2.7-linux-x86_64.egg/py2neo/neo4j.py", line 2643, in submit
    return [BatchResponse(rs).hydrated for rs in responses.json]
  File "/usr/local/lib/python2.7/site-packages/py2neo-1.6.3-py2.7-linux-x86_64.egg/py2neo/packages/httpstream/http.py", line 563, in json
    return json.loads(self.read().decode(self.encoding))
  File "/usr/local/lib/python2.7/site-packages/py2neo-1.6.3-py2.7-linux-x86_64.egg/py2neo/packages/httpstream/http.py", line 634, in read
    data = self._response.read()
  File "/usr/local/lib/python2.7/httplib.py", line 532, in read
    return self._read_chunked(amt)
  File "/usr/local/lib/python2.7/httplib.py", line 575, in _read_chunked
    raise IncompleteRead(''.join(value))
IncompleteRead: IncompleteRead(131072 bytes read)
-------------------- >> begin captured logging << --------------------
py2neo.neo4j.batch: INFO: Executing batch with 2 requests
py2neo.neo4j.batch: INFO: Executing batch with 1800 requests
--------------------- >> end captured logging << ---------------------
The exception happens when I submit a sufficiently large batch. If I reduce the size of the data set, it goes away. It seems to be related to request size rather than the number of requests (if I add properties to the nodes I'm creating, I can have fewer requests).
If I use batch.run() instead of .submit(), I don't get an error, but the tests fail; it seems that the batch is rejected silently. If I use .stream() and don't iterate over the results, the same thing happens as .run(); if I do iterate over them, I get the same error as .submit() (except that it's "0 bytes read").
Looking at httplib.py suggests that we'll get this error when an HTTP response has Transfer-Encoding: Chunked and doesn't contain a chunk size where one is expected. So I ran tcpdump over the tests, and indeed, that seems to be what's happening. The final chunk has length 0x8000, and its final bytes are
"http://10.210.\r\n
0\r\n
\r\n
(Linebreaks added after \n for clarity.) This looks like correct chunking, but the 0x8000th byte is the first "/", rather than the second ".". Eight bytes early. It also isn't a complete response, being invalid JSON.
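For reference, HTTP/1.1 chunked framing (generic, not taken from the dump) looks like:

8000\r\n
<0x8000 bytes of data>\r\n
0\r\n
\r\n

i.e. each chunk is a hex byte count, CRLF, exactly that many bytes, CRLF, and the body terminates with a zero-length chunk.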
Interestingly, within this chunk we get the following data:
"all_relatio\r\n
1280\r\n
nships":
That is, it looks like the start of a new chunk, but embedded within the old one. This new chunk would finish in the correct location (the second "." of above), if we noticed it starting. And if the chunk header wasn't there, the old chunk would finish in the correct location (eight bytes later).
I then extracted the POST request of the batch, and ran it using cat batch-request.txt | nc $SERVER_A 7474. The response to that was a valid chunked HTTP response, containing a complete valid JSON object.
I thought maybe netcat was sending the request faster than py2neo, so I introduced some slowdown
cat batch-request.txt | perl -ne 'BEGIN { $| = 1 } for (split //) { select(undef, undef, undef, 0.1) unless int(rand(50)); print }' | nc $SERVER_A 7474
But it continued to work, despite being much slower now.
I also tried doing tcpdump on server A, but requests to localhost don't go over tcp.
I still have a few avenues that I haven't explored: I haven't worked out how reliably the request fails or under precisely which conditions (I once saw it succeed with a batch that usually fails, but I haven't explored the boundaries). And I haven't tried making the request from python directly, without going through py2neo. But I don't particularly expect either of these to be very informative. And I haven't looked closely at the TCP dump except for using wireshark's 'follow TCP stream' to extract the HTTP conversation; I don't really know what I'd be looking for there. There's a large section that wireshark highlights in black in the failed dump, and only isolated black lines in the successful dump, so maybe that's relevant?
So for now: does anyone know what might be going on? Anything else I should try to diagnose the problem?
The TCP dumps are here: failed and successful.
EDIT: I'm starting to understand the failed TCP dump. The whole conversation takes ~30 seconds, and there's a ~28-second gap in which both servers are sending ZeroWindow TCP frames - these are the black lines I mentioned.
First, py2neo fills up neo4j's window; neo4j sends a frame saying "my window is full", and then another frame which fills up py2neo's window. Then we spend ~28 seconds with each of them just saying "yup, my window is still full". Eventually neo4j opens its window again, py2neo sends a bit more data, and then py2neo opens its window. Both of them send a bit more data, then py2neo finishes sending its request, and neo4j sends more data before also finishing.
So I'm thinking that maybe the problem is something like, both of them are refusing to process more data until they've sent some more, and neither can send some more until the other processes some. Eventually neo4j enters a "something's gone wrong" loop, which py2neo interprets as "go ahead and send more data".
It's interesting, but I'm not sure what it means, that the penultimate TCP frame sent from neo4j to py2neo starts with \r\n1280\r\n - the beginning of the fake chunk. The \r\n8000\r\n that starts the actual chunk just appears part-way through an unremarkable TCP frame. (It was the third frame sent after py2neo finished sending its post request.)
EDIT 2: I checked to see precisely where python was hanging. Unsurprisingly, it was while sending the request - so BatchRequestList._execute() doesn't return until after neo4j gives up, which is why neither .run() nor .stream() did any better than .submit().
It appears that a workaround is to set the header X-Stream: true;format=pretty. (By default it's just true; it used to be pretty, but that was removed due to this bug - which looks like it's actually a neo4j bug, still seems to be open, but isn't currently an issue for me.)
It looks like, by setting format=pretty, we cause neo4j to not send any data until it's processed the whole of the input. So it doesn't try to send data, doesn't block while sending, and doesn't refuse to read until it's sent something.
Removing the X-Stream header entirely, or setting it to false, seems to have the same effect as setting format=pretty (as in, making neo4j send a response which is chunked, pretty-printed, doesn't contain status codes, and doesn't get sent until the whole request has been processed), which is kinda weird.
You can set the header for an individual batch with
batch._batch._headers['X-Stream'] = 'true;format=pretty'
Or set the global headers with
neo4j._add_header('X-Stream', 'true;format=pretty')

Parallel HTTP web crawler in Erlang

I'm working on a simple web crawler and have generated a bunch of static files that I try to crawl with the code at the bottom. I have two issues/questions I don't have an idea for:
1.) Looping over the sequence 1..200 throws me an error exactly after 100 pages have been crawled:
** exception error: no match of right hand side value {error,socket_closed_remotely}
in function erlang_test_01:fetch_page/1 (erlang_test_01.erl, line 11)
in call from lists:foreach/2 (lists.erl, line 1262)
2.) How to parallelize the requests, e.g. 20 concurrent requests?
-module(erlang_test_01).
-export([start/0]).

-define(BASE_URL, "http://46.4.117.69/").

to_url(Id) ->
    ?BASE_URL ++ io_lib:format("~p", [Id]).

fetch_page(Id) ->
    Uri = to_url(Id),
    {ok, {{_, Status, _}, _, Data}} = httpc:request(get, {Uri, []}, [], [{body_format,binary}]),
    Status,
    Data.

start() ->
    inets:start(),
    lists:foreach(fun(I) -> fetch_page(I) end, lists:seq(1, 200)).
1. Error message
socket_closed_remotely indicates that the server closed the connection, maybe because you made too many requests in a short timespan.
2. Parallelization
Create 20 worker processes and one process holding the URL queue. Let each process ask the queue for a URL (by sending it a message). This way you can control the number of workers.
An even more "Erlangy" way is to spawn one process for each URL! The upside to this is that your code will be very straightforward. The downside is that you cannot control your bandwidth usage or number of connections to the same remote server in a simple way.
