I have some fairly straightforward F# async code that downloads a hundred random articles off of Wikipedia (for research).
For some reason, the code hangs at arbitrary points during the download. Sometimes it's after 50 articles, sometimes after 80.
The async code itself is short:
let parseWikiAsync(url:string, count:int ref) =
    async {
        use wc = new WebClientWithTimeout(Timeout = 5000)
        let! html = wc.AsyncDownloadString(Uri(url))
        let ret =
            try html |> parseDoc |> parseArticle
            with | ex -> printfn "%A" ex; None
        lock count (fun () ->
            if !count % 10 = 0 then
                printfn "%d" !count
            count := !count + 1
        )
        return ret
    }
Because I couldn't figure out through fsi what the problem was, I made WebClientWithTimeout, a System.Net.WebClient wrapper that allows me to specify a timeout:
type WebClientWithTimeout() =
    inherit WebClient()
    member val Timeout = 60000 with get, set
    override x.GetWebRequest uri =
        let r = base.GetWebRequest(uri)
        r.Timeout <- x.Timeout
        r
Then I use the async combinators to retrieve just over a hundred pages and weed out all the parseWikiAsync calls that return None (most of which are "disambiguation pages"), until I have exactly 100 articles:
let en100 =
    let count = ref 0
    seq { for _ in 1..110 -> parseWikiAsync("http://en.wikipedia.org/wiki/Special:Random", count) }
    |> Async.Parallel
    |> Async.RunSynchronously
    |> Seq.choose id
    |> Seq.take 100
When I compile the code and run it in the debugger, there are only three threads, of which only one is running actual code -- the Async pipeline. The other two have "not available" for location, and nothing in the call stack.
Which I think means that it's not stuck in AsyncDownloadString or anywhere in parseWikiAsync. What else could be causing this?
Oh, also, it initially takes about a full minute before the async code actually starts. After that it goes at a fairly reasonable pace until it hangs indefinitely.
Here's the call stack for the main thread:
> mscorlib.dll!System.Threading.WaitHandle.InternalWaitOne(System.Runtime.InteropServices.SafeHandle waitableSafeHandle, long millisecondsTimeout, bool hasThreadAffinity, bool exitContext) + 0x22 bytes
mscorlib.dll!System.Threading.WaitHandle.WaitOne(int millisecondsTimeout, bool exitContext) + 0x28 bytes
FSharp.Core.dll!Microsoft.FSharp.Control.AsyncImpl.ResultCell<Microsoft.FSharp.Control.AsyncBuilderImpl.Result<Microsoft.FSharp.Core.FSharpOption<Program.ArticleData>[]>>.TryWaitForResultSynchronously(Microsoft.FSharp.Core.FSharpOption<int> timeout) + 0x36 bytes
FSharp.Core.dll!Microsoft.FSharp.Control.CancellationTokenOps.RunSynchronously<Microsoft.FSharp.Core.FSharpOption<Program.ArticleData>[]>(System.Threading.CancellationToken token, Microsoft.FSharp.Control.FSharpAsync<Microsoft.FSharp.Core.FSharpOption<Program.ArticleData>[]> computation, Microsoft.FSharp.Core.FSharpOption<int> timeout) + 0x1ba bytes
FSharp.Core.dll!Microsoft.FSharp.Control.FSharpAsync.RunSynchronously<Microsoft.FSharp.Core.FSharpOption<Program.ArticleData>[]>(Microsoft.FSharp.Control.FSharpAsync<Microsoft.FSharp.Core.FSharpOption<Program.ArticleData>[]> computation, Microsoft.FSharp.Core.FSharpOption<int> timeout, Microsoft.FSharp.Core.FSharpOption<System.Threading.CancellationToken> cancellationToken) + 0xb9 bytes
WikiSurvey.exe!<StartupCode$WikiSurvey>.$Program.main#() Line 97 + 0x55 bytes F#
Wikipedia is not to blame here; this is a result of how Async.Parallel works. The type signature of Async.Parallel is seq<Async<'T>> -> Async<'T[]>: it returns a single Async value containing all of the results from the sequence, so it doesn't return until all of the computations in the seq<Async<'T>> have completed.
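To see that behaviour in isolation, here's a tiny snippet of my own (not taken from your code): RunSynchronously on an Async.Parallel batch blocks until the slowest computation in the batch finishes.

let slow = async {
    do! Async.Sleep 5000
    return "slow"
}
let fast = async { return "fast" }

// Blocks for roughly 5 seconds, then yields [| "slow"; "fast" |]
// (Async.Parallel preserves the order of the inputs in the results array).
let results =
    [ slow; fast ]
    |> Async.Parallel
    |> Async.RunSynchronously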
To illustrate, I modified your code so it tracks the number of outstanding requests, i.e., requests which have been sent to the server but whose responses have not yet been received and parsed.
open Microsoft.FSharp.Control
open Microsoft.FSharp.Control.WebExtensions
open System
open System.Net
open System.Threading
type WebClientWithTimeout() =
    inherit WebClient()
    let mutable timeout = -1
    member __.Timeout
        with get () = timeout
        and set value = timeout <- value
    override x.GetWebRequest uri =
        let r = base.GetWebRequest(uri)
        r.Timeout <- x.Timeout
        r
type ParsedDoc = ParsedDoc
type ParsedArticle = ParsedArticle
let parseDoc (str : string) = ParsedDoc
let parseArticle (doc : ParsedDoc) = Some ParsedArticle
/// A synchronized wrapper around Console.Out so we don't
/// get garbled console output.
let synchedOut =
    System.Console.Out
    |> System.IO.TextWriter.Synchronized
let parseWikiAsync(url : string, outstandingRequestCount : int ref) =
    async {
        use wc = new WebClientWithTimeout(Timeout = 5000)
        wc.Headers.Add ("User-Agent", "Friendly Bot 1.0 (FriendlyBot#friendlybot.com)")
        // Increment the outstanding request count just before we send the request.
        do
            // NOTE : The message must be created THEN passed to synchedOut.WriteLine --
            // piping it (|>) into synchedOut.WriteLine or using fprintfn causes a closure
            // to be created which somehow defeats the synchronization and garbles the output.
            let msg =
                Interlocked.Increment outstandingRequestCount
                |> sprintf "Outstanding requests: %i"
            synchedOut.WriteLine msg
        let! html = wc.AsyncDownloadString(Uri(url))
        let ret =
            try html |> parseDoc |> parseArticle
            with ex ->
                let msg = sprintf "%A" ex
                synchedOut.WriteLine msg
                None
        // Decrement the outstanding request count now that we've
        // received a response and parsed it.
        do
            let msg =
                Interlocked.Decrement outstandingRequestCount
                |> sprintf "Outstanding requests: %i"
            synchedOut.WriteLine msg
        return ret
    }
/// Writes a message to the console, passing a value through
/// so it can be used within a function pipeline.
let inline passThruWithMessage (msg : string) value =
    Console.WriteLine msg
    value

let en100 =
    let outstandingRequestCount = ref 0
    seq { for _ in 1..120 ->
            parseWikiAsync("http://en.wikipedia.org/wiki/Special:Random", outstandingRequestCount) }
    |> Async.Parallel
    |> Async.RunSynchronously
    |> passThruWithMessage "Finished running all of the requests."
    |> Seq.choose id
    |> Seq.take 100
If you compile and run that code, you'll see output like this:
Outstanding requests: 4
Outstanding requests: 2
Outstanding requests: 1
Outstanding requests: 3
Outstanding requests: 5
Outstanding requests: 6
Outstanding requests: 7
Outstanding requests: 8
Outstanding requests: 9
Outstanding requests: 10
Outstanding requests: 12
Outstanding requests: 14
Outstanding requests: 15
Outstanding requests: 16
Outstanding requests: 17
Outstanding requests: 18
Outstanding requests: 13
Outstanding requests: 19
Outstanding requests: 20
Outstanding requests: 24
Outstanding requests: 22
Outstanding requests: 26
Outstanding requests: 27
Outstanding requests: 28
Outstanding requests: 29
Outstanding requests: 30
Outstanding requests: 25
Outstanding requests: 21
Outstanding requests: 23
Outstanding requests: 11
Outstanding requests: 29
Outstanding requests: 28
Outstanding requests: 27
Outstanding requests: 26
Outstanding requests: 25
Outstanding requests: 24
Outstanding requests: 23
Outstanding requests: 22
Outstanding requests: 21
Outstanding requests: 20
Outstanding requests: 19
Outstanding requests: 18
Outstanding requests: 17
Outstanding requests: 16
Outstanding requests: 15
Outstanding requests: 14
Outstanding requests: 13
Outstanding requests: 12
Outstanding requests: 11
Outstanding requests: 10
Outstanding requests: 9
Outstanding requests: 8
Outstanding requests: 7
Outstanding requests: 6
Outstanding requests: 5
Outstanding requests: 4
Outstanding requests: 3
Outstanding requests: 2
Outstanding requests: 1
Outstanding requests: 0
Finished running all of the requests.
As you can see, all of the requests are made before any of them are parsed -- so if you're on a slower connection, or you're trying to retrieve a large number of documents, the server may drop the connection because it assumes you're not going to retrieve the response it's trying to send. Another issue with the code is that you need to explicitly specify the number of elements to generate in the seq, which makes the code less reusable.
A better solution would be to retrieve and parse the pages as they're needed by some consuming code. (And if you think about it, that's exactly what an F# seq is good for.) We'll start by creating a function that takes a Uri and produces a seq<Async<'T>> -- i.e., it produces an infinite sequence of Async<'T> values, each of which will retrieve the content from the Uri, parse it, and return the result.
/// Given a Uri, creates an infinite sequence whose elements are retrieved
/// from the Uri.
let createDocumentSeq (uri : System.Uri) =
#if DEBUG
    let outstandingRequestCount = ref 0
#endif
    Seq.initInfinite <| fun _ ->
        async {
            use wc = new WebClientWithTimeout(Timeout = 5000)
            wc.Headers.Add ("User-Agent", "Friendly Bot 1.0 (FriendlyBot#friendlybot.com)")
#if DEBUG
            // Increment the outstanding request count just before we send the request.
            do
                // NOTE : The message must be created THEN passed to synchedOut.WriteLine --
                // piping it (|>) into synchedOut.WriteLine or using fprintfn causes a closure
                // to be created which somehow defeats the synchronization and garbles the output.
                let msg =
                    Interlocked.Increment outstandingRequestCount
                    |> sprintf "Outstanding requests: %i"
                synchedOut.WriteLine msg
#endif
            let! html = wc.AsyncDownloadString uri
            let ret =
                try Some html
                with ex ->
                    let msg = sprintf "%A" ex
                    synchedOut.WriteLine msg
                    None
#if DEBUG
            // Decrement the outstanding request count now that we've
            // received a response and parsed it.
            do
                let msg =
                    Interlocked.Decrement outstandingRequestCount
                    |> sprintf "Outstanding requests: %i"
                synchedOut.WriteLine msg
#endif
            return ret
        }
Now we use this function to retrieve the pages as a stream:
let en100_Streaming =
#if DEBUG
    let documentCount = ref 0
#endif
    Uri ("http://en.wikipedia.org/wiki/Special:Random")
    |> createDocumentSeq
    |> Seq.choose (fun asyncDoc ->
        Async.RunSynchronously asyncDoc
        |> Option.bind (parseDoc >> parseArticle))
#if DEBUG
    |> Seq.map (fun x ->
        let msg =
            Interlocked.Increment documentCount
            |> sprintf "Parsed documents: %i"
        synchedOut.WriteLine msg
        x)
#endif
    |> Seq.take 50
    // None of the computations actually take place until
    // this point, because Seq.toArray forces evaluation of the sequence.
    |> Seq.toArray
If you run that code, you'll see that it pulls the results one at a time from the server and doesn't leave outstanding requests hanging around. Also, it's very easy to change the number of results you want to retrieve -- all you need to do is change the value you pass to Seq.take.
Now, while that streaming code works just fine, it doesn't execute the requests in parallel, so it could be slow for large numbers of documents. This is an easy problem to fix, though the solution may be a little non-intuitive. Instead of trying to execute the entire sequence of requests in parallel -- which is the problem in the original code -- let's create a function which uses Async.Parallel to execute small batches of requests in parallel, then uses Seq.collect to combine the results back into a flat sequence.
/// Given a sequence of Async<'T>, creates a new sequence whose elements
/// are computed in batches of a specified size.
let parallelBatch batchSize (sequence : seq<Async<'T>>) =
    sequence
    // Split the sequence into non-overlapping batches of 'batchSize' elements,
    // then run each batch in parallel and flatten the results.
    |> Seq.chunkBySize batchSize
    |> Seq.collect (fun batch ->
        batch
        |> Async.Parallel
        |> Async.RunSynchronously)
To utilize this function, we just need a few small tweaks to the code from the streaming version:
let en100_Batched =
    let batchSize = 10
#if DEBUG
    let documentCount = ref 0
#endif
    Uri ("http://en.wikipedia.org/wiki/Special:Random")
    |> createDocumentSeq
    // Execute batches in parallel
    |> parallelBatch batchSize
    |> Seq.choose (Option.bind (parseDoc >> parseArticle))
#if DEBUG
    |> Seq.map (fun x ->
        let msg =
            Interlocked.Increment documentCount
            |> sprintf "Parsed documents: %i"
        synchedOut.WriteLine msg
        x)
#endif
    |> Seq.take 50
    // None of the computations actually take place until
    // this point, because Seq.toArray forces evaluation of the sequence.
    |> Seq.toArray
Again, it's easy to change the number of documents you want to retrieve, and the batch size can easily be modified (though I suggest you keep it reasonably small). If you wanted to, you could make a few tweaks to the 'streaming' and 'batching' code so you could switch between them at run-time.
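For example, one way to allow that switch (a sketch of my own, not part of the original code; the helper name is made up) is to make the batch size optional and pick the strategy from it:

let retrieveDocuments batchSize uri =
    let docs = createDocumentSeq uri
    match batchSize with
    | Some size when size > 1 ->
        // Batched: run 'size' requests at a time in parallel.
        docs |> parallelBatch size
    | _ ->
        // Streaming: run the requests one at a time.
        docs |> Seq.map (fun doc -> Async.RunSynchronously doc)

Either way the result is a sequence of options, so callers can pipe it through Seq.choose, Seq.take, and Seq.toArray exactly as in the two versions above.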
One last thing -- with my code the requests shouldn't time-out, so you can probably get rid of the WebClientWithTimeout class and just use WebClient directly.
Your code doesn't seem to be doing anything particularly special, so I am going to assume that Wikipedia does not like your activity. Take a look at their bot policy. Digging a bit deeper, they also seem to have a strict User-Agent policy:
As of February 15, 2010, Wikimedia sites require a HTTP User-Agent
header for all requests. This was an operative decision made by the
technical staff and was announced and discussed on the technical
mailing list.[1][2] The rationale is, that clients that do not send a
User-Agent string are mostly ill behaved scripts that cause a lot of
load on the servers, without benefiting the projects. Note that
non-descriptive default values for the User-Agent string, such as used
by Perl's libwww, may also be blocked from using Wikimedia web sites
(or parts of the web sites, such as api.php).
User agents (browsers or scripts) that do not send a User-Agent header
may now encounter an error message like this:
Scripts should use an informative User-Agent string with contact information, or they may be IP-blocked without notice.
So, given everything I've found, they will probably not like what you are doing even if you add a proper user-agent, but you might as well give it a try:
wc.Headers.Add ("User-Agent", "Friendly Bot 1.0 (FriendlyBot#friendlybot.com)")
It also wouldn't hurt to avoid making so many connections to their servers.
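One way to do that (a rough sketch of my own, not from the question; the helper name is made up) is simply to pause between downloads so the requests trickle out instead of arriving in a burst:

// Requires: open System, open System.Net, open Microsoft.FSharp.Control.WebExtensions
let politeDownload (wc : WebClient) (url : string) = async {
    let! html = wc.AsyncDownloadString (Uri url)
    // Wait a second before letting the caller issue the next request.
    do! Async.Sleep 1000
    return html }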
Related
Issue
I have a client that needs to send the following custom data structure to an API:
#[derive(Serialize, Deserialize)]
pub struct FheSum {
pub server_keys: ServerKey,
pub value1: FheUint8,
pub value2: FheUint8,
}
The code for the client is the following:
let fhe_post: FheSum = FheSum {
server_keys: server_keys.to_owned(),
value1: value_api.to_owned(),
value2: value_api_2.to_owned(),
};
let client = reqwest::blocking::Client::builder()
.timeout(None)
.build().unwrap();
let response = client
.post("http://127.0.0.1:8000/computesum")
.json(&fhe_post)
.send().unwrap();
let response_json: Result<FheSumResult, reqwest::Error> = response.json();
match response_json {
Ok(j) => {
let result_api: u8 = FheUint8::decrypt(&j.result, &client_keys);
println!("Final Result: {}", result_api)
},
Err(e) => println!("{:}", e),
};
In the API, I have the following definition of an HttpServer:
HttpServer::new(|| {
let json_cfg = actix_web::web::JsonConfig::default()
.limit(std::usize::MAX);
App::new()
.app_data(json_cfg)
.service(integers::computesum)
})
.client_disconnect_timeout(std::time::Duration::from_secs(3000))
.client_request_timeout(std::time::Duration::from_secs(3000))
.max_connection_rate(std::usize::MAX)
.bind(("127.0.0.1", 8000))?
.run()
.await
And the associated endpoint the client is trying to access:
#[post("/computesum")]
pub async fn computesum(req: Json<FheSum>) -> HttpResponse {
let req: FheSum = req.into_inner();
let recovered: FheSum = FheSum::new(
req.server_keys,
req.value1,
req.value2,
);
set_server_key(recovered.server_keys);
let result_api_enc: FheSumResult = FheSumResult::new(recovered.value1 + recovered.value2);
HttpResponse::Ok()
.content_type(ContentType::json())
.json(&result_api_enc)
}
Problem
The structs are the same in both the client and the server. This code works when using common data types such as Strings. The issue arises when using these data structures. The memory occupied, obtained with mem::size_of_val (which returns the size in bytes), is the following:
Size 1: 2488
Size 2: 32
Size 3: 32
The results are in bytes, so, given the limit established in the HttpServer, this shouldn't be an issue. Timeouts have also been set to much higher values than commonly needed.
Even with these changes, the client always shows Killed and doesn't display the answer from the server, giving no clues about what the problem might be.
The client is killing the process before being able to process the server's response. I want to find a way to send these custom data types to the HTTP server without the connection closing before the operation has finished.
I have already tried different libraries for the client, such as the acw crate, apart from reqwest, and the result is the same. I have also tried not using reqwest in blocking mode, and the error persists.
I've been experimenting with using Tcl as a local server for a desktop application using a browser as the GUI, and using web sockets. Thus far the web sockets are working and I can send a request from the browser to Tcl, retrieve data from a SQLite database, and pass it back to the browser. And it's not difficult to set up a promise-like process in JS to "wait" for Tcl to respond in a non-blocking manner. Also, I've sent a request id on each message to Tcl such that when it responds JS knows which promise to resolve/reject.
My question concerns working in the other direction; that is, initiating a message from Tcl and waiting for the browser to respond and in a non-blocking manner.
In the code below I'm using a coroutine to store the callback procedure and any relevant arguments at the time the request is sent from Tcl to the browser; and then using the id to have the readable event handler call the correct coroutine and pass it the response. Then the coroutine calls the original callback with its arguments and the response received from the browser.
I tried to simulate a delay in sending two messages through a pipe at different times to see if this would work similar to the asynchronous manner of JS. It appears to provide similar results but, of course, is a very simple example.
Would you please tell me whether this is the correct approach to take to accomplish this, or suggest better ones? Thank you for considering my question.
Would you please also tell me why each coro$id doesn't require a separate coSend$id procedure? It doesn't matter if both messages are sent in the same millisecond; each calls the correct callback with the correct arguments and response. Thanks.
proc coSend {callBack args} {
set response [yield]
$callBack $response $args
}
proc redo {x} {
if {$x < 11} {
chan puts stdout "x: $x => [expr {$x*100}]ms gone by."
incr x
after 100 redo $x
} else {
set ::forever 1
}
}
proc ReadLine {chan} {
if {[catch {chan gets $chan line} len]} {
chan puts stdout "Error in chan $chan"
return
}
set response [string trim [lassign [split $line] id]]
coro$id $response
}
proc SendMessage {callBack args msg delay} {
coroutine coro[set id [GetMsgID]] coSend $callBack $args
after $delay MimicResponse "$id {$msg}"
}
proc MimicResponse {args} {
chan puts $::wchan $args
}
proc GetMsgID {} {
return [incr ::msg_id]
}
proc CallBack_1 {response args} {
chan puts stdout "$args $response"
}
proc CallBack_2 {response args} {
chan puts stdout "$args $response"
}
lassign [chan pipe] rchan wchan
chan configure $rchan -buffering line -blocking 1 -encoding iso8859-1 -translation crlf
chan configure $wchan -buffering line -blocking 1 -encoding iso8859-1 -translation crlf
chan event $rchan readable [list ReadLine $rchan]
set msg_id 0
redo 1
SendMessage CallBack_1 {arg11 arg12} "This is the response to the first message that should return second." 700
SendMessage CallBack_2 {arg21 arg22 arg23} "This is the response to the second message that should return first." 300
set forever 0
vwait forever
# Results are:
# x: 1 => 100ms gone by.
# x: 2 => 200ms gone by.
# x: 3 => 300ms gone by.
# {{arg21 arg22 arg23}} \{This is the response to the second message that should return first.\}
# x: 4 => 400ms gone by.
# x: 5 => 500ms gone by.
# x: 6 => 600ms gone by.
# x: 7 => 700ms gone by.
# {{arg11 arg12}} \{This is the response to the first message that should return second.\}
# x: 8 => 800ms gone by.
# x: 9 => 900ms gone by.
# x: 10 => 1000ms gone by.
In Tcl, coroutines are really little stacks; you can yield anywhere in the call hierarchy, and do not need to track what's going on. (In some other languages, this would make them "colorless"; you don't need the async/await ceremony to call them.) The things that need to be aware of how they've been called can use info coroutine to discover if they're running in a coroutine context or not.
The disadvantage of this is that coroutines are comparatively expensive, as each one carries a (Tcl) stack (but no C stack; Tcl avoids loading that up, and the work to achieve that stack reduction is what spun out both coroutine and tailcall), so you're best off associating one of them with each significant ongoing operation. An example would be having one for each connection made to a server when there's a non-trivial protocol present, at least until such time as the effort is transferred to another worker thread.
If you're doing much with I/O and coroutines, the coroutine package in Tcllib is recommended for handling the basics of making apparently synchronous code be actually asynchronous.
I'm building an application for a toy problem to learn more about SAFE. I have some background processes running server-side and occasionally they need to send a message unprompted to the connected clients. This means that I need a reference to the SocketHub from outside of any particular request.
Currently I have a mutable variable which I pass a value to when the Channel is joined:
let mainChannel = channel {
    join (fun ctx socketId ->
        task {
            printfn "Connected! Main Socket Id: %O" socketId
            let hub = ctx.GetService<Channels.ISocketHub>()
            webSocketHub <- Some hub // Passing the reference to a mutable variable
            task {
                do! Task.Delay 500
                let m = (socketId |> (SetChannelSocketId >> GameData))
                do! (harderSendMessage socketId "message" m "Problem sending SocketId")
            } |> ignore
            return Channels.Ok })
}
However, it seems to me like there should be a better way to get access to the hub - I just can't figure it out.
WHAT I HAVE:
My code flow is like this:
(1) construct request
(2) POST to URL
(3) Write results to output directory
#Bean
public IntegrationFlow validateRequest() {
return IntegrationFlows.from("REQUEST_CHANNEL")
.channel(c -> c.executor(new SimpleAsyncTaskExecutor()))
.handle(requestModifier, "constructRequest")
.handle(Http.outboundGateway("POST_URL", restTemplate)
.httpMethod(HttpMethod.POST)
.mappedRequestHeaders("ab*", "TraceabilityID", authenticator.getToken())
.charset("UTF-8")
.expectedResponseType(Response.class))
.handle(outputWriter, "writeToDir");
}
WHAT I NEED:
The timeout for the POST_URL is 20000 ms.
My code tries to write the response before it receives a timeout and gives a null pointer exception.
Which of the below approaches should I use?
-> Add wait() to Http.outboundGateway so that the thread waits at least 20 s for a response.
-> Make the whole thread sleep for 20 sec. Can you please give me an example for this?
Let's say I have some system "BlackBox" that I can't change, and I want to try to do some work with it for 2 seconds; if it does not finish in that time, stop and proceed with the rest of the application:
let log msg =
    let timestamp = (System.DateTime.UtcNow.ToString("o"))
    printfn "%s: %s" timestamp msg

type BlackBox() =
    let tcs = new System.Threading.Tasks.TaskCompletionSource<bool>()
    let work () = async {
        log "work started"
        let! result = tcs.Task |> Async.AwaitTask
        log "work finished"
        return result }
    member x.DoWork () = work ()
    member x.SetResult () = tcs.TrySetResult(true)

let tryDoWork (box:BlackBox) = async {
    try
        log "operration starting with 2sec timeout"
        Async.RunSynchronously(box.DoWork(), timeout = 2000) |> ignore
        log "operration succeeded"
    with
    | ex -> log "operation timedout" }
let sut = BlackBox()
tryDoWork sut |> Async.Start
log "waiting 5sec before setting task result"
Async.Sleep 5000 |> Async.RunSynchronously
log "setting task result"
sut.SetResult() |> ignore
// Output is:
// ------------------------------------
// 2016-02-24T16:45:11.0302884Z: waiting 5sec before setting task result
// 2016-02-24T16:45:11.0302884Z: operration starting with 2sec timeout
// 2016-02-24T16:45:11.0351932Z: work started
// 2016-02-24T16:45:16.0322394Z: setting task result
// 2016-02-24T16:45:16.0351731Z: work finished
// 2016-02-24T16:45:16.0361528Z: operation timedout
Async.RunSynchronously with a timeout normally throws a TimeoutException after 2 seconds, but not in this situation, because internally BlackBox is waiting for the Task to finish.
TCS (from @Carsten's comment) works for me: I have no access to the BlackBox internals and can't change them without changing the whole design of the application, and I only need a mechanism to add a timeout; cancelling the longer task is not required.
Other proposed things were:
@kevin: a snippet by Eirik Tsarpalis and Async.AwaitIAsyncResult.
As @kevin noticed, the reason my example doesn't work as expected is that
RunSynchronously with a timeout relies on the given async being cancelable
One more thing I found is how to cancel non-cancelable async operations.
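To make that concrete, here is a minimal sketch of my own (the name withTimeout is made up; this is not code from the question or from the linked snippets) that imposes a timeout on an async which ignores cancellation by racing it against Task.Delay. The underlying work keeps running; we simply stop waiting for it:

open System.Threading.Tasks

/// Waits for 'work' at most 'timeoutMs' milliseconds; returns None on timeout.
/// Note: the underlying computation is NOT cancelled -- we only stop waiting for it.
let withTimeout (timeoutMs : int) (work : Async<'T>) = async {
    let workTask = Async.StartAsTask work
    let! winner =
        Task.WhenAny (workTask :> Task, Task.Delay timeoutMs)
        |> Async.AwaitTask
    if obj.ReferenceEquals (winner, workTask) then
        return Some workTask.Result
    else
        return None }

// Usage against the BlackBox above; still gives up after 2 seconds:
// let result = withTimeout 2000 (sut.DoWork ()) |> Async.RunSynchronously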