Parallel HTTP web crawler in Erlang

I'm writing a simple web crawler and have generated a bunch of static files that I try to crawl with the code at the bottom. I have two issues/questions I can't figure out:
1.) Looping over the sequence 1..200 throws an error after exactly 100 pages have been crawled:
** exception error: no match of right hand side value {error,socket_closed_remotely}
in function erlang_test_01:fetch_page/1 (erlang_test_01.erl, line 11)
in call from lists:foreach/2 (lists.erl, line 1262)
2.) How can I parallelize the requests, e.g. run 20 concurrent requests?
-module(erlang_test_01).
-export([start/0]).

-define(BASE_URL, "http://46.4.117.69/").

to_url(Id) ->
    ?BASE_URL ++ io_lib:format("~p", [Id]).

fetch_page(Id) ->
    Uri = to_url(Id),
    {ok, {{_, Status, _}, _, Data}} = httpc:request(get, {Uri, []}, [], [{body_format, binary}]),
    Status,
    Data.

start() ->
    inets:start(),
    lists:foreach(fun(I) -> fetch_page(I) end, lists:seq(1, 200)).

1. Error message
socket_closed_remotely indicates that the server closed the connection. Since it happens after exactly 100 requests, the server has most likely hit its limit on keep-alive requests per connection (Apache's default MaxKeepAliveRequests, for instance, is 100).
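A minimal sketch of a mitigation, assuming the server simply drops the persistent connection and that a fresh request transparently opens a new one (the retry counter is my own addition):
fetch_page(Id) ->
    fetch_page(Id, 3).

fetch_page(Id, 0) ->
    error({too_many_retries, Id});
fetch_page(Id, Retries) ->
    Uri = to_url(Id),
    case httpc:request(get, {Uri, []}, [], [{body_format, binary}]) of
        {ok, {{_Vsn, _Status, _Phrase}, _Headers, Data}} ->
            Data;
        {error, socket_closed_remotely} ->
            %% The server dropped the keep-alive connection;
            %% the next request opens a fresh one.
            fetch_page(Id, Retries - 1)
    end.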
2. Parallelization
Create 20 worker processes and one process holding the URL queue. Let each worker ask the queue for a URL (by sending it a message). This way you can control the number of workers; a sketch of the pattern follows below.
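A minimal sketch of that pattern (module name, message shapes, and the wait_for/1 helper are my own; fetch_page/1 stands in for the one from the question):
-module(crawler_pool).
-export([start/2]).

%% start(N, Ids): crawl all Ids with at most N concurrent workers,
%% e.g. crawler_pool:start(20, lists:seq(1, 200)).
start(N, Ids) ->
    inets:start(),
    Queue = spawn(fun() -> queue_loop(Ids) end),
    Self = self(),
    [spawn(fun() -> worker(Queue, Self) end) || _ <- lists:seq(1, N)],
    wait_for(N).

%% The queue process hands out one id per request, then answers 'done'
%% (it is left running here for brevity).
queue_loop([Id | Rest]) ->
    receive {next, From} -> From ! {url, Id}, queue_loop(Rest) end;
queue_loop([]) ->
    receive {next, From} -> From ! done, queue_loop([]) end.

%% Each worker pulls ids until the queue runs dry.
worker(Queue, Parent) ->
    Queue ! {next, self()},
    receive
        {url, Id} -> fetch_page(Id), worker(Queue, Parent);
        done      -> Parent ! finished
    end.

%% Block until all N workers have reported in.
wait_for(0) -> ok;
wait_for(N) -> receive finished -> wait_for(N - 1) end.

fetch_page(Id) ->
    Url = "http://46.4.117.69/" ++ integer_to_list(Id),
    httpc:request(get, {Url, []}, [], [{body_format, binary}]).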
An even more "Erlangy" way is to spawn one process per URL! The upside is that your code will be very straightforward. The downside is that you cannot easily control your bandwidth usage or the number of connections to the same remote server.

Related

Can I set thread priority in dotnet? (specifically interacting with Suave webserver, but it's a general question)

I have a tool that does a lot of calculations in a loop. It creates async tasks and then runs them in parallel. About 20% of the time, the CPU is maxed out.
At the same time, I have a REST API, implemented with Suave, used to query the data.
The issue is that when the CPU is busy, Suave will just not reply at all.
Right now, there are about 10 seconds of every minute during which REST calls are not processed while the calculations run; afterwards, requests are processed normally.
So I am trying to investigate whether thread priorities may be the solution for that.
I'm starting Suave like this:
let listening, server = startWebServerAsync configuration webApplication
server |> Async.Start
listening |> Async.RunSynchronously
but I was wondering if there is a way to set the priority of the server so that its code gets to run whenever a request arrives.
Alternatively, I start all the calculations like this:
snapshots
|> List.map (fun snapshot ->
    async {
        return dosomestuff...
    })
|> Async.Parallel
|> Async.RunSynchronously
Is there a way to lower the priority of this execution to give the web server a chance to reply?
Or should I insert some Thread.Sleep(1) calls in the computation to give the scheduler a chance to context-switch?
What I have tried:
I've tried to sprinkle the calculations with Thread.Sleep(0) and also Thread.Sleep(1) to see if it helps to do a context switch when there is a Suave request. No effect.
I've started the calculations in their own thread and set a low priority, like this:
let thread = Thread(ThreadStart(processLoop))
thread.Priority <- ThreadPriority.BelowNormal
thread.Start()
but this didn't change anything either.
More detail about Suave:
This is an example of one of the endpoints:
// get the correlation matrix
let private getCorrelationMatrix () =
    match ReportStore.getReport() with
    | Ok report ->
        {|
            StartTime = report.StartTime
            EndTime = report.EndTime
            Interval = report.Interval
            PublicationTime = report.PublicationTime
            CorrelationMatrix = report.CorrelationMatrix
        |}
        |> Json.serialize |> Successful.OK >=> setMimeType "application/json"
    | Result.Error e ->
        ServerErrors.INTERNAL_ERROR e
Here ReportStore.getReport() just reads the latest data, or error, from a mutable store.
The Suave endpoints are very lightweight: they just grab the latest data, or the latest error, from an array and return it.
It really looks like, when all cores are busy with the parallel execution, no other threads can preempt it. 10 seconds is a very long time to wait for a reply!
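One hedged workaround sketch, rather than thread priorities: cap the degree of parallelism so at least one core stays free for the web server. This assumes FSharp.Core 4.7 or later, which added the maxDegreeOfParallelism overload of Async.Parallel; snapshots and doSomeStuff stand in for the elided computation in the question.
open System

let results =
    snapshots
    |> List.map (fun snapshot -> async { return doSomeStuff snapshot })
    |> fun computations ->
        // Leave one core free so Suave's threads get scheduled promptly.
        Async.Parallel(computations, maxDegreeOfParallelism = max 1 (Environment.ProcessorCount - 1))
    |> Async.RunSynchronously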

IncompleteRead error when submitting neo4j batch from remote server; malformed HTTP response

I've set up neo4j on server A, and I have an app running on server B which is to connect to it.
If I clone the app on server A and run the unit tests, it works fine. But running them on server B, the setup runs for 30 seconds and fails with an IncompleteRead:
Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/nose-1.3.1-py2.7.egg/nose/suite.py", line 208, in run
self.setUp()
File "/usr/local/lib/python2.7/site-packages/nose-1.3.1-py2.7.egg/nose/suite.py", line 291, in setUp
self.setupContext(ancestor)
File "/usr/local/lib/python2.7/site-packages/nose-1.3.1-py2.7.egg/nose/suite.py", line 314, in setupContext
try_run(context, names)
File "/usr/local/lib/python2.7/site-packages/nose-1.3.1-py2.7.egg/nose/util.py", line 469, in try_run
return func()
File "/comps/comps/webapp/tests/__init__.py", line 19, in setup
create_graph.import_films(films)
File "/comps/comps/create_graph.py", line 49, in import_films
batch.submit()
File "/usr/local/lib/python2.7/site-packages/py2neo-1.6.3-py2.7-linux-x86_64.egg/py2neo/neo4j.py", line 2643, in submit
return [BatchResponse(rs).hydrated for rs in responses.json]
File "/usr/local/lib/python2.7/site-packages/py2neo-1.6.3-py2.7-linux-x86_64.egg/py2neo/packages/httpstream/http.py", line 563, in json
return json.loads(self.read().decode(self.encoding))
File "/usr/local/lib/python2.7/site-packages/py2neo-1.6.3-py2.7-linux-x86_64.egg/py2neo/packages/httpstream/http.py", line 634, in read
data = self._response.read()
File "/usr/local/lib/python2.7/httplib.py", line 532, in read
return self._read_chunked(amt)
File "/usr/local/lib/python2.7/httplib.py", line 575, in _read_chunked
raise IncompleteRead(''.join(value))
IncompleteRead: IncompleteRead(131072 bytes read)
-------------------- >> begin captured logging << --------------------
py2neo.neo4j.batch: INFO: Executing batch with 2 requests
py2neo.neo4j.batch: INFO: Executing batch with 1800 requests
--------------------- >> end captured logging << ---------------------
The exception happens when I submit a sufficiently large batch. If I reduce the size of the data set, it goes away. It seems to be related to request size rather than the number of requests (if I add properties to the nodes I'm creating, I can have fewer requests).
If I use batch.run() instead of .submit(), I don't get an error, but the tests fail; it seems that the batch is rejected silently. If I use .stream() and don't iterate over the results, the same thing happens as .run(); if I do iterate over them, I get the same error as .submit() (except that it's "0 bytes read").
Looking at httplib.py suggests that we'll get this error when an HTTP response has Transfer-Encoding: Chunked and doesn't contain a chunk size where one is expected. So I ran tcpdump over the tests, and indeed, that seems to be what's happening. The final chunk has length 0x8000, and its final bytes are
"http://10.210.\r\n
0\r\n
\r\n
(Linebreaks added after \n for clarity.) This looks like correct chunking, but the 0x8000th byte is the first "/", rather than the second ".". Eight bytes early. It also isn't a complete response, being invalid JSON.
Interestingly, within this chunk we get the following data:
"all_relatio\r\n
1280\r\n
nships":
That is, it looks like the start of a new chunk, but embedded within the old one. This new chunk would finish in the correct location (the second "." of above), if we noticed it starting. And if the chunk header wasn't there, the old chunk would finish in the correct location (eight bytes later).
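For comparison, a well-formed chunked body is just a series of hex chunk sizes, each followed by exactly that many bytes, terminated by a zero-length chunk (same linebreaks-after-\n convention as above):
4\r\n
Wiki\r\n
5\r\n
pedia\r\n
0\r\n
\r\n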
I then extracted the POST request of the batch, and ran it using cat batch-request.txt | nc $SERVER_A 7474. The response to that was a valid chunked HTTP response, containing a complete valid JSON object.
I thought maybe netcat was sending the request faster than py2neo, so I introduced some slowdown
cat batch-request.txt | perl -ne 'BEGIN { $| = 1 } for (split //) { select(undef, undef, undef, 0.1) unless int(rand(50)); print }' | nc $SERVER_A 7474
But it continued to work, despite being much slower now.
I also tried doing tcpdump on server A, but requests to localhost don't go over tcp.
I still have a few avenues that I haven't explored: I haven't worked out how reliably the request fails or under precisely which conditions (I once saw it succeed with a batch that usually fails, but I haven't explored the boundaries), and I haven't tried making the request from Python directly, without going through py2neo. But I don't particularly expect either of these to be very informative. I also haven't looked closely at the TCP dump except for using Wireshark's 'follow TCP stream' to extract the HTTP conversation; I don't really know what I'd be looking for there. There's a large section that Wireshark highlights in black in the failed dump, and only isolated black lines in the successful dump; maybe that's relevant?
So for now: does anyone know what might be going on? Anything else I should try to diagnose the problem?
The TCP dumps are here: failed and successful.
EDIT: I'm starting to understand the failed TCP dump. The whole conversation takes ~30 seconds, and there's a ~28-second gap in which both servers are sending ZeroWindow TCP frames - these are the black lines I mentioned.
First, py2neo fills up neo4j's window; neo4j sends a frame saying "my window is full", and then another frame which fills up py2neo's window. Then we spend ~28 seconds with each of them just saying "yup, my window is still full". Eventually neo4j opens its window again, py2neo sends a bit more data, and then py2neo opens its window. Both of them send a bit more data, then py2neo finishes sending its request, and neo4j sends more data before also finishing.
So I'm thinking that maybe the problem is something like, both of them are refusing to process more data until they've sent some more, and neither can send some more until the other processes some. Eventually neo4j enters a "something's gone wrong" loop, which py2neo interprets as "go ahead and send more data".
It's interesting, but I'm not sure what it means, that the penultimate TCP frame sent from neo4j to py2neo starts \r\n1280\r\n - the beginning of the fake-chunk. The \r\n8000\r\n that starts the actual chunk, just appears part-way through an unremarkable TCP frame. (It was the third frame sent after py2neo finished sending its post request.)
EDIT 2: I checked to see precisely where python was hanging. Unsurprisingly, it was while sending the request - so BatchRequestList._execute() doesn't return until after neo4j gives up, which is why neither .run() or .stream() did any better than .submit().
It appears that a workaround is to set the header X-Stream: true;format=pretty. (By default it's just true; it used to be pretty, but that was removed due to a bug which looks like it's actually a neo4j bug and still seems to be open, though it isn't currently an issue for me.)
It looks like, by setting format=pretty, we cause neo4j to not send any data until it's processed the whole of the input. So it doesn't try to send data, doesn't block while sending, and doesn't refuse to read until it's sent something.
Removing the X-Stream header entirely, or setting it to false, seems to have the same effect as setting format=pretty (as in, making neo4j send a response which is chunked, pretty-printed, doesn't contain status codes, and doesn't get sent until the whole request has been processed), which is kinda weird.
You can set the header for an individual batch with
batch._batch._headers['X-Stream'] = 'true;format=pretty'
Or set the global headers with
neo4j._add_header('X-Stream', 'true;format=pretty')

SWI-Prolog http_post and http_delete inexplicably hang

When I attempt to use SWI-Prolog's http_post/4, as follows:
:- use_module(library(http/http_client)).

update(URL, Arg) :-
    http_post(URL, form([update = Arg]), _, [status_code(204)]).
When I query this rule, and watch the TCP traffic, I see the HTTP POST request and reply with the expected 204 status code both occur immediately. However, Prolog hangs for up to 30 seconds before returning back 'true'. What is happening that prevents the rule from immediately returning?
I've tried this variant as well, but it also hangs:
:- use_module(library(http/http_client)).

update(URL, Arg) :-
    http_post(URL, form([update = Arg]), Reply, [status_code(204)]),
    close(Reply).
I have a similar issue with http_delete/3, but not with http_get/3.
The library docs state that http_post/4 "is equivalent to http_get/3, except for providing an input document, which is posted using http_post_data/3."
http_get/3 has timeout(+Timeout) among its options. That could help to lower the latency, but since it is set to +infinite by default, I fear it will not solve the issue. It seems the service you are calling keeps the connection alive up to some timeout.
Personally, I had to use http_open/3 instead of http_post/4 when calling Google API services over HTTPS...
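A sketch of that alternative (untested; http_open/3 accepts a post(Data) option whose argument is handed to http_post_data/3, and status_code(-Code) suppresses the library's error handling so the code can be inspected):
:- use_module(library(http/http_open)).

update(URL, Arg) :-
    http_open(URL, In,
              [ post(form([update = Arg])),
                status_code(Code)
              ]),
    close(In),    % closing the stream releases the connection
    Code == 204.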

Why does my concurrent Haskell program terminate prematurely?

I have a UDP server that reflects every ping message it receives (this works well, I think). On the client side I would then like to do two things:
make sure that I fired off N (e.g. 10000) messages, and
count the number of correctly received responses.
It seems that either because of the nature of UDP or because of the forkIO thing, my client code below ends prematurely/does not do any counting at all.
Also, I am very surprised to see that the function tryOnePing returns the Int 4 250 times. Why could this be?
main = withSocketsDo $ do
    s <- socket AF_INET Datagram defaultProtocol
    hostAddr <- inet_addr host
    thread <- forkIO $ receiveMessages s
    -- is there any better way, e.g. to run that in parallel and make sure
    -- that sending/receiving are asynchronous?
    -- forM_ [0 .. 10000] $ \i -> do
    --     sendTo s "ping" (SockAddrInet port hostAddr)
    -- actually this would be preferred since I can discard the Int 4 that
    -- it returns, but forM and forM_ are out of scope here?
    let tryOnePing i = sendTo s "ping" (SockAddrInet port hostAddr)
    pings <- mapM tryOnePing [0 .. 1000]
    let c = length $ filter (\x -> x == 4) pings
    -- killThread thread
    -- took that out to make sure the function receiveMessages does not
    -- end prematurely. still seems that it does
    sClose s
    print c
    -- return ()

receiveMessages :: Socket -> IO ()
receiveMessages socket = forever $ do
    -- also tried forM etc. here instead of forever, but no joy
    let recOnePing i = recv socket 1024
    msg <- mapM recOnePing [0 .. 1000]
    let r = length $ filter (\x -> x == "PING") msg
    print r
    print "END"
The main problem here is that when your main thread finishes, all other threads get killed automatically. You have to make the main thread wait for the receiveMessages thread, or it will in all likelihood simply finish before any responses have been received. One simple way of doing this is to use an MVar.
An MVar is a synchronized cell that can either be empty or hold exactly one value. The current thread will block if it tries to take from an empty MVar or insert into a full one.
In this case, we don't care about the value itself, so we'll just store a () in it.
We'll start with the MVar empty. Then the main thread will fork off the receiver thread, send all the packets, and try to take the value from the MVar.
import Control.Concurrent.MVar

main = withSocketsDo $ do
    -- prepare socket, same as before
    done <- newEmptyMVar
    -- we need to pass the MVar to the receiver thread so that
    -- it can use it to signal us when it's done
    forkIO $ receiveMessages sock done
    -- send pings, same as before
    takeMVar done -- blocks until receiver thread is done
In the receiver thread, we will receive all the messages and then put a () in the MVar to signal that we're done receiving.
receiveMessages socket done = do
    -- receive messages, same as before
    putMVar done () -- allows the main thread to be unblocked
This solves the main issue, and the program runs fine on my Ubuntu laptop, but there are a couple more things you want to take care of.
sendTo does not guarantee that the whole string will be sent. You'll have to check the return value to see how much was sent, and retry if not all of it went out; this can happen even for a short message like "ping" if the send buffer is full. (A sketch follows after these points.)
recv requires a connected socket. You'll want to use recvFrom instead. (Although it still works on my PC for some unknown reason).
Printing to standard output is not synchronized, so you might want to alter this so that the MVar will be used to communicate the number of received packets instead of just (). That way, you can do all the output from the main thread. Alternatively, use another MVar as a mutex to control access to standard output.
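Here is a minimal sketch of the retry loop from the first point, assuming the old String-based Network.Socket API used in the question, where sendTo returns the number of bytes actually sent:
import Control.Monad (when)
import Network.Socket

-- Keep sending until the whole message has gone out.
sendWhole :: Socket -> String -> SockAddr -> IO ()
sendWhole sock msg addr = do
    sent <- sendTo sock msg addr
    when (sent < length msg) $
        sendWhole sock (drop sent msg) addr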
Finally, I recommend reading the documentation of Network.Socket, Control.Concurrent and Control.Concurrent.MVar carefully. Most of my answer is stitched together from information found there.

"Throttled" async download in F#

I'm trying to download the 3000+ photos referenced from the xml backup of my blog. The problem I came across is that if just one of those photos is no longer available, the whole async gets blocked because AsyncGetResponse doesn't do timeouts.
ildjarn helped me put together a version of AsyncGetResponse which does fail on timeout, but using that gives a lot more timeouts - as though requests that are merely queued time out. It seems like all the WebRequests are launched 'immediately', so the only way to make it work is to set the timeout to the time required to download all of them combined, which isn't great because it means I have to adjust the timeout depending on the number of images.
Have I reached the limits of vanilla async? Should I be looking at reactive extensions instead?
This is a bit embarassing, because I've already asked two questions here on this particular bit of code, and I still haven't got it working the way I want!
I think there must be a better way to find out that a file is not available than using a timeout. I'm not exactly sure, but is there some way to make it throw an exception if a file cannot be found? Then you could just wrap your async code inside try .. with and you should avoid most of the problems.
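A minimal sketch of that idea (assuming WebRequest with the AsyncGetResponse extension from Microsoft.FSharp.Control.WebExtensions; an unavailable file then surfaces as a WebException instead of a hang, provided the server answers at all):
open System.Net
open Microsoft.FSharp.Control.WebExtensions

let tryDownloadString (url: string) = async {
    try
        let req = WebRequest.Create(url)
        use! resp = req.AsyncGetResponse()
        use stream = resp.GetResponseStream()
        use reader = new System.IO.StreamReader(stream)
        let! body = reader.ReadToEndAsync() |> Async.AwaitTask
        return Some body
    with :? WebException ->
        // e.g. 404 Not Found ends up here
        return None
}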
Anyway, if you want to write your own "concurrency manager" that runs certain number of requests in parallel and queues remaining pending requests, then the easiest option in F# is to use agents (the MailboxProcessor type). The following object encapsulates the behavior:
type ThrottlingAgentMessage =
    | Completed
    | Work of Async<unit>

/// Represents an agent that runs operations concurrently. When the number
/// of concurrent operations exceeds 'limit', they are queued and processed later.
type ThrottlingAgent(limit) =
    let agent = MailboxProcessor.Start(fun agent ->
        /// Represents a state when the agent is blocked
        let rec waiting () =
            // Use 'Scan' to wait for completion of some work
            agent.Scan(function
                | Completed -> Some(working (limit - 1))
                | _ -> None)
        /// Represents a state when the agent is working
        and working count = async {
            while true do
                // Receive any message
                let! msg = agent.Receive()
                match msg with
                | Completed ->
                    // Decrement the counter of work items
                    return! working (count - 1)
                | Work work ->
                    // Start the work item & continue in blocked/working state
                    async { try do! work
                            finally agent.Post(Completed) }
                    |> Async.Start
                    if count < limit then return! working (count + 1)
                    else return! waiting () }
        working 0)

    /// Queue the specified asynchronous workflow for processing
    member x.DoWork(work) = agent.Post(Work work)
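Usage is then just a matter of posting work items to the agent; a small hypothetical example:
let agent = ThrottlingAgent(3)
for i in 1 .. 10 do
    agent.DoWork(async {
        do! Async.Sleep 1000
        printfn "finished work item %d" i
    })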
Nothing is ever easy. :)
I think the issues you're hitting are intrinsic to the problem domain (as opposed to merely being issues with the async programming model, though they do interact somewhat).
Say you want to download 3000 pictures. First, in your .NET process, there is a limit (System.Net.ServicePointManager.DefaultConnectionLimit, if I recall the name correctly) that will throttle the number of simultaneous HTTP connections your .NET process can run (and the default is just 2, I think). So you could find that control and set it to a higher number, and it would help.
But then next, your machine and internet connection have finite bandwidth. So even if you could try to concurrently start 3000 HTTP connections, each individual connection would get slower based on the bandwidth pipe limitations. So this would also interact with timeouts. (And this doesn't even consider what kinds of throttles/limits are on the server. Maybe if you send 3000 requests it will think you are DoS attacking and blacklist your IP.)
So this is really a problem domain where a good solution requires some intelligent throttling and flow-control in order to manage how the underlying system resources are used.
As in the other answer, F# agents (MailboxProcessors) are a good programming model for authoring such throttling/flow-control logic.
(Even with all that, if most picture files are like 1MB but then there is a 1GB file mixed in there, that single file might trip a timeout.)
Anyway, this is not so much an answer to the question, as just pointing out how much intrinsic complexity there is in the problem domain itself. (Perhaps it's also suggestive of why UI 'download managers' are so popular.)
Here's a variation on Tomas's answer, because I needed an agent which could return results.
type ThrottleMessage<'a> =
    | AddJob of (Async<'a> * AsyncReplyChannel<'a>)
    | DoneJob of ('a * AsyncReplyChannel<'a>)
    | Stop

/// This agent accumulates 'jobs' but limits the number which run concurrently.
type ThrottleAgent<'a>(limit) =
    let agent = MailboxProcessor<ThrottleMessage<'a>>.Start(fun inbox ->
        let rec loop (jobs, count) = async {
            let! msg = inbox.Receive() // get next message
            match msg with
            | AddJob(job) ->
                if count < limit then // if not at limit, we work, else loop
                    return! work (job :: jobs, count)
                else
                    return! loop (job :: jobs, count)
            | DoneJob(result, reply) ->
                reply.Reply(result) // send back result to caller
                return! work (jobs, count - 1) // no need to check limit here
            | Stop -> return () }
        and work (jobs, count) = async {
            match jobs with
            | [] -> return! loop (jobs, count) // if no jobs left, wait for more
            | (job, reply) :: jobs -> // run job, post Done when finished
                async { let! result = job
                        inbox.Post(DoneJob(result, reply)) }
                |> Async.Start
                return! loop (jobs, count + 1) } // job started, go back to waiting
        loop ([], 0))

    member m.AddJob(job) = agent.PostAndAsyncReply(fun rep -> AddJob(job, rep))
    member m.Stop() = agent.Post(Stop)
In my particular case, I just need to use it as a 'one shot' 'map', so I added a static function:
    static member RunJobs limit jobs =
        let agent = ThrottleAgent<'a>(limit)
        let res =
            jobs
            |> Seq.map (fun job -> agent.AddJob(job))
            |> Async.Parallel
            |> Async.RunSynchronously
        agent.Stop()
        res
It seems to work ok...
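For example (hypothetical; the explicit type argument helps the compiler resolve the static member on the generic type):
let jobs =
    [ for i in 1 .. 10 ->
        async {
            do! Async.Sleep 500
            return i * i
        } ]

let squares = ThrottleAgent<int>.RunJobs 3 jobs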
Here's an out-of-the-box solution:
FSharpx.Control offers an Async.ParallelWithThrottle function. I'm not sure it is the best implementation, as it uses SemaphoreSlim, but the ease of use is great and, since my application doesn't need top performance, it works well enough for me. And since it is a library, if someone knows how to make it better, that's always a nice thing: libraries that perform well out of the box let the rest of us just use the code that works and get our work done!
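Usage is then nearly a drop-in replacement for Async.Parallel; a hedged sketch (assuming the throttle limit is the first curried argument; downloadPhoto is a hypothetical string -> Async<unit>):
open FSharpx.Control

let downloadAll (urls: string list) =
    urls
    |> List.map downloadPhoto
    |> Async.ParallelWithThrottle 20
    |> Async.RunSynchronously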
