Timeout on WebRequest with F#-Style Asynchronous Workflows

For a broader context, here is my code, which downloads a list of URLs.
It seems to me that there is no good way to handle timeouts in F# when using use! response = request.AsyncGetResponse() style URL fetching. I have pretty much everything working as I'd like it to (error handling and asynchronous request and response downloading), except for what happens when a website takes a long time to respond: my current code just hangs indefinitely. I've tried it against a PHP script I wrote that waits 300 seconds, and it waited the whole time.
I have found "solutions" of two sorts, both of which are undesirable.
AwaitIAsyncResult + BeginGetResponse
Like the answer by ildjarn on this other Stack Overflow question. The problem with this is that if you have queued many asynchronous requests, some are artificially blocked on AwaitIAsyncResult. In other words, the call to make the request has been made, but something behind the scenes is blocking the call. This causes the timeout on AwaitIAsyncResult to be triggered prematurely when many concurrent requests are made. My guess is that there is a limit on the number of requests to a single domain, or just a limit on total requests.
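For concreteness, that approach looks roughly like the following reconstruction (a sketch, not the linked answer's exact code):

open System.Net

// Start the request, wait on the IAsyncResult with a timeout, and abort
// the request if the wait expires. Note the timeout clock includes any
// time the request spends queued behind the scenes.
let fetchWithTimeout (req: WebRequest) timeoutMs = async {
    let iar = req.BeginGetResponse(null, null)
    let! completed = Async.AwaitIAsyncResult(iar, timeoutMs)
    if completed then
        return Some (req.EndGetResponse iar)
    else
        req.Abort()
        return None
}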
To support my suspicion I wrote a little WPF application to draw a timeline of when the requests seem to start and end. In my code linked above, notice the timer starts and stops on lines 49 and 54 (calling line 10). Here is the resulting timeline image.
When I move the timer start to after the initial response (so I am only timing the download of the contents), the timeline looks a lot more realistic. Note that these are two separate runs with no code change aside from where the timer is started: instead of taking startTime directly before use! response = request.AsyncGetResponse(), I take it directly afterwards.
To further support my claim, I made a timeline with Fiddler2. Here is the resulting timeline. Clearly the requests aren't starting exactly when I tell them to.
GetResponseStream in a new thread
In other words, synchronous request and download calls are made on a secondary thread. This does work, since GetResponseStream respects the Timeout property on the WebRequest object. But in the process, we tie up a thread for the entire time the request is on the wire and the response hasn't come back yet. We might as well write it in C#... ;)
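A sketch of what this second approach amounts to (assumed code, not from the post):

open System.IO
open System.Net

// The synchronous calls honour WebRequest.Timeout, but the thread-pool
// thread is blocked for the whole round trip.
let fetchSync (url: string) = async {
    let req = WebRequest.Create url
    req.Timeout <- 5000 // milliseconds; respected by the synchronous calls
    use resp = req.GetResponse()
    use rdr = new StreamReader(resp.GetResponseStream())
    return rdr.ReadToEnd()
}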
Questions
1. Is this a known problem?
2. Is there any good solution that takes advantage of F# asynchronous workflows and still allows timeouts and error handling?
3. If the problem is really that I am making too many requests at once, would the best way to limit the number of requests be to use a Semaphore(5, 5) or something like that? (A sketch of this idea appears after the questions.)
4. Side question: if you've looked at my code, can you see any stupid things I've done and could fix?
If there is anything you are confused about, please let me know.
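For reference, the semaphore-based throttle mentioned in question 3 might look like this sketch (SemaphoreSlim and the placeholder body are assumptions for illustration, not code from the question):

open System.Threading

// Allow at most 5 requests in flight at once; WaitAsync parks the
// workflow without blocking a thread.
let throttle = new SemaphoreSlim(5, 5)

let fetchThrottled (url: string) = async {
    do! throttle.WaitAsync() |> Async.AwaitTask
    try
        // ... issue the request and download the response here ...
        do! Async.Sleep 100 // placeholder for the real work
    finally
        throttle.Release() |> ignore
}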

AsyncGetResponse simply ignores any timeout value, so here's a solution we just cooked up:
open System
open System.IO
open System.Net
type Request = Request of WebRequest * AsyncReplyChannel<WebResponse>

// Agent that owns the actual call to AsyncGetResponse and replies on the
// channel once the response arrives.
let requestAgent =
    MailboxProcessor.Start <| fun inbox -> async {
        while true do
            let! (Request (req, port)) = inbox.Receive ()
            async {
                try
                    let! resp = req.AsyncGetResponse ()
                    port.Reply resp
                with
                | ex -> sprintf "Exception in child %s\n%s" (ex.GetType().Name) ex.Message |> Console.WriteLine
            } |> Async.Start
    }

let getHTML url =
    async {
        try
            let req = "http://" + url |> WebRequest.Create
            try
                // PostAndAsyncReply's optional timeout (1000 ms here) raises
                // a TimeoutException if the agent hasn't replied in time.
                use! resp = requestAgent.PostAndAsyncReply ((fun chan -> Request (req, chan)), 1000)
                use str = resp.GetResponseStream ()
                use rdr = new StreamReader (str)
                return Some <| rdr.ReadToEnd ()
            with
            | :? System.TimeoutException ->
                req.Abort()
                Console.WriteLine "RequestAgent call timed out"
                return None
        with
        | ex ->
            sprintf "Exception in request %s\n\n%s" (ex.GetType().Name) ex.Message |> Console.WriteLine
            return None
    } |> Async.RunSynchronously;;
getHTML "www.grogogle.com"
That is, we delegate the request to another agent and call it with an async timeout; if we do not get a reply from the agent in the specified amount of time, we abort the request and move on.
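Batch usage might then look like the following sketch; getHTMLAsync is a hypothetical variant of getHTML with the trailing Async.RunSynchronously removed, so it has type string -> Async<string option>:

// Hypothetical: many downloads share the one requestAgent and run in parallel.
let fetchAll urls =
    urls
    |> List.map getHTMLAsync
    |> Async.Parallel
    |> Async.RunSynchronously

fetchAll [ "www.example.com"; "www.example.org" ]
|> Array.iter (function
    | Some html -> printfn "got %d characters" html.Length
    | None -> printfn "timed out or failed")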

I see my other answer may fail to answer your particular question, so here's another implementation: a task limiter that doesn't require the use of a semaphore.
open System
type IParallelLimiter =
    abstract GetToken : unit -> Async<IDisposable>

type Message =
    | GetToken of AsyncReplyChannel<IDisposable>
    | Release

let start count =
    let agent =
        MailboxProcessor.Start(fun inbox ->
            // Each token posts Release back to the agent when disposed.
            let newToken () =
                { new IDisposable with
                    member x.Dispose () = inbox.Post Release }
            let rec loop n = async {
                // Scan skips GetToken requests while no tokens are left,
                // so they queue up until a Release arrives.
                let! msg = inbox.Scan <| function
                    | GetToken _ when n = 0 -> None
                    | msg -> async.Return msg |> Some
                return!
                    match msg with
                    | Release ->
                        loop (n + 1)
                    | GetToken port ->
                        port.Reply <| newToken ()
                        loop (n - 1)
            }
            loop count)
    { new IParallelLimiter with
        member x.GetToken () =
            agent.PostAndAsyncReply GetToken }
let limiter = start 100;;
for _ in 0..1000 do
    async {
        use! token = limiter.GetToken ()
        Console.WriteLine "Sleeping..."
        do! Async.Sleep 3000
        Console.WriteLine "Releasing..."
    } |> Async.Start
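To tie this back to the original question, each download could take a token before issuing its request; a sketch, where getHTMLAsync is again a hypothetical async version of the first answer's getHTML:

let fetchLimited url = async {
    // The token is an IDisposable; leaving the use! scope posts Release.
    use! token = limiter.GetToken ()
    return! getHTMLAsync url // hypothetical: string -> Async<string option>
}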

Related

In Rust+Tokio, should you return a oneshot::Receiver as a processing callback?

I'm making an API where the user can submit items to be processed, and they might want to check whether their item was processed successfully. I thought this would be a good place to use tokio::sync::oneshot channels: I'd return the receiver to the caller, and they can later await it to get the result they're looking for.
let processable_item = ...;
let where_to_submit: impl Submittable = get_submit_target();
let status_handle: oneshot::Receiver<SubmissionResult> = where_to_submit.submit(processable_item).await;
// ... do something that does not depend on the SubmissionResult ...
// Now we want to get the status of our submission
let status = status_handle.await;
Submitting the item involves creating a oneshot channel, and putting the Sender half into a queue while the Receiver goes back to the calling code:
#[async_trait]
impl Submittable for Something {
    async fn submit(item: ProcessableItem) -> oneshot::Receiver<SubmissionResult> {
        let (sender, receiver) = oneshot::channel();

        // Put the item, with the associated sender, into a queue
        let queue: mpsc::Sender<(ProcessableItem, oneshot::Sender<SubmissionResult>)> = get_processing_queue();
        queue.send((item, sender)).await.expect("Processing task closed!");

        return receiver;
    }
}
When I do this, cargo clippy complains (via the [clippy::async_yields_async] lint) that I'm returning oneshot::Receiver, which can be awaited, from an async function, and suggests that I await it then.
This is not what I want: the point is to allow a degree of background processing while the user doesn't need the SubmissionResult yet, as opposed to making them wait until it's available.
Is this API even a good idea? Does there exist a common approach to doing this?
Looks fine to me. This is a false positive of Clippy, so you can just silence it: #[allow(clippy::async_yields_async)].

How to concurrently crawl paginated webpages with unknown end?

I'm trying to write a web crawler in Rust using the tokio asynchronous runtime. I want to fetch/process multiple pages asynchronously, but I also want the crawler to stop when it reaches the end (in other words, if there is nothing left to crawl). So far I have used futures::future::try_join_all to get a collective result from the async functions I provide as futures, but this obviously requires the program to know the total number of pages to crawl beforehand. For example:
async fn fetch(_url: String) -> Result<String, ()> {
    tokio::time::sleep(std::time::Duration::from_millis(100)).await;
    Ok(String::from("foo"))
}

#[tokio::main]
async fn main() {
    let search_url = "https://example.com/?page={page_num}";

    let futures = (1..=3)
        .map(|page_num| search_url.replace("{page_num}", &page_num.to_string()))
        .map(|url| fetch(url));

    let _ = futures::future::try_join_all(futures).await.unwrap();
}
Rust Playground
In this simple example I have to know the total number of pages (1..=3) before actually fetching them. What I want is to not provide any range at all and to have a condition that stops the whole process (e.g. if the HTML result contains "not found").
I looked into futures::executor::block_on but I'm not sure if it is something that I can utilize for this task.
Here's roughly how to do this using Stream and .buffered():
use futures::{future, stream, StreamExt};

#[derive(Debug)]
struct Error;

async fn fetch_page(page: i32) -> Result<String, Error> {
    println!("fetching page: {}", page);

    // simulate loading pages
    tokio::time::sleep(std::time::Duration::from_millis(100)).await;
    if page < 5 {
        // successfully got page
        Ok(String::from("foo"))
    } else {
        // page doesn't exist
        Err(Error)
    }
}

#[tokio::main]
async fn main() {
    let pages: Vec<String> = stream::iter(1..)
        .map(fetch_page)
        .buffered(10)
        .take_while(|page| future::ready(page.is_ok()))
        .map(|page| page.unwrap())
        .collect()
        .await;

    println!("pages: {:?}", pages);
}
I'll go over the steps in main() in detail:
stream::iter(1..) creates an unbounded Stream of integers representing each page number
.map(fetch_page) of course will call fetch_page for each page number
.buffered(10) allows up to 10 fetch_page calls to run concurrently while preserving the original order
.take_while(|page| future::ready(page.is_ok())) keeps the stream going until a fetch_page returns an error; it uses futures::future::ready since the function passed to take_while must return a future
.map(|page| page.unwrap()) pulls out the successful pages; it won't panic because we know the stream stops when an error occurs
.collect() does essentially the same thing as for an iterator except you have to .await it
Running the above code prints out the following, showing that it tries 10 at a time but will only return up to the first failure:
fetching page: 1
fetching page: 2
fetching page: 3
fetching page: 4
fetching page: 5
fetching page: 6
fetching page: 7
fetching page: 8
fetching page: 9
fetching page: 10
pages: ["foo", "foo", "foo", "foo"]
This glosses over some nice-to-haves like handling non-missing-page errors or retrying, but I hope this gives you a good foundation. In those cases you might reach for the methods on TryStreamExt, which specially handle streams of Results.

How to get the cookie from a GET response?

I am writing a function that makes a GET request to a website and returns the response cookie:
extern crate futures;
extern crate hyper;
extern crate tokio_core;

use tokio_core::reactor::Core;
use hyper::Client;
use std::error::Error;
use hyper::header::Cookie;
use futures::future::Future;

fn get_new_cookie() -> Result<String, Box<Error>> {
    println!("Getting cookie...");
    let core = Core::new()?;
    let client = Client::new(&core.handle());
    println!("Created client");
    let uri = "http://www.cnn.com".parse().expect("Cannot parse url");
    println!("Parsed url");
    let response = client.get(uri).wait().expect("Cannot get url.");
    println!("Got response");
    let cookie = response
        .headers()
        .get::<Cookie>()
        .expect("Cannot get cookie");
    println!("Cookie: {}", cookie);
    Ok(cookie)
}

fn main() {
    println!("{:?}", get_new_cookie());
}
This doesn't work; it gets stuck on the client.get(...) call. The output I'm getting is:
Getting cookie...
Created client
Parsed url
and after that nothing happens.
What am I doing wrong, and how can I change it so it works?
As Stefan points out, by calling wait, you are putting the thread to sleep until the future has completed. However, that thread needs to run the event loop, so you've just caused a deadlock. Using Core::run is more correct.
As Francis Gagné points out, the "Cookie" header is used to send a cookie to the server; SetCookie is used to send a cookie to the client. SetCookie also holds a vector of all the cookies, which the code below joins together:
fn get_new_cookie() -> Result<String, Box<Error>> {
    println!("Getting cookie...");
    let mut core = Core::new()?;
    let client = Client::new(&core.handle());
    println!("Created client");
    let uri = "http://www.cnn.com".parse().expect("Cannot parse url");
    println!("Parsed url");
    let response = core.run(client.get(uri)).expect("Cannot get url.");
    println!("Got response");
    let cookie = response
        .headers()
        .get::<SetCookie>()
        .expect("Cannot get cookie");
    println!("Cookie: {:?}", cookie);
    Ok(cookie.join(","))
}
However, if you only want a synchronous API, use reqwest instead. It is built on top of hyper:
extern crate reqwest;

use std::error::Error;
use reqwest::header::SetCookie;

fn get_new_cookie() -> Result<String, Box<Error>> {
    let response = reqwest::get("http://www.cnn.com")?;

    let cookies = match response.headers().get::<SetCookie>() {
        Some(cookies) => cookies.join(","),
        None => String::new(),
    };

    Ok(cookies)
}

fn main() {
    println!("{:?}", get_new_cookie());
}
See the documentation for the wait method:
Note: This method is not appropriate to call on event loops or similar
I/O situations because it will prevent the event loop from making
progress (this blocks the thread). This method should only be called
when it's guaranteed that the blocking work associated with this
future will be completed by another thread.
Future::wait is already deprecated in the tokio-reform branch.
I'd recommend to design the full application to deal with async concepts (i.e. get_new_cookie should take a Handle and return a Future, not allocating its own event loop).
You could run the request with Core::run like this:
let response = core.run(client.get(uri)).expect("Cannot get url.");
reqwest 0.11 (and perhaps earlier) update
In the get_new_cookie function, I believe the code snippet to retrieve the cookies from a reqwest::Response goes something like:
// returns Option<&HeaderValue>
response.headers().get(http::header::SET_COOKIE)

Why doesn't Threading.Timer work in an async block?

This program works fine:
let mutable inc = 0

let a (o: obj) =
    let autoEvent = o :?> AutoResetEvent
    Console.WriteLine("a")
    inc <- inc + 1
    if inc = 3 then
        autoEvent.Set() |> ignore

let autoEvent = new AutoResetEvent(false)
let timer = new Timer(a, autoEvent, 0, 2000)
autoEvent.WaitOne() |> ignore
But when I put the same code in an async block, where I want to deal with a TCP client:
let mutable inc = 0

let a (o: obj) =
    let autoEvent = o :?> AutoResetEvent
    Console.WriteLine("a")
    inc <- inc + 1
    if inc = 3 then
        autoEvent.Set() |> ignore

let listener = new TcpListener(IPAddress.Parse("127.0.0.1"), 2000)

let private loop (client: TcpClient, sr: StreamReader, sw: StreamWriter) =
    async {
        let autoEvent = new AutoResetEvent(false)
        let timer = new Timer(a, autoEvent, 0, 2000)
        autoEvent.WaitOne() |> ignore
    }

let private startLoop () =
    while true do
        let client = listener.AcceptTcpClient()
        let stream = client.GetStream()
        let sr = new StreamReader(stream)
        let sw = new StreamWriter(stream)
        sw.AutoFlush <- true
        Async.Start(loop (client, sr, sw)) |> ignore

listener.Start()
startLoop()
listener.Stop()
the timer function does not quit after it has run three times. I want to know why. Thanks!
I first want to mention a few things. Instead of using Console.WriteLine("a"), just use printfn "a". Secondly, the snippet of code you gave does not terminate, so if you try it in FSI, it will continue running after the main thread finishes; this is likely not an issue in a console app.
To answer your question: it has to do with the async workflow. If you look at this article (Async Programming), you'll notice that they spawn the async computation as a child and then perform an async sleep to give the child a chance to start. This has to do with the way tasks are scheduled: the .NET Framework uses a "work-first" policy, so continuations typically don't get executed until a blocking event forces the thread to give up the current task. This is how I got the timer event to run:
open System
open System.Threading

let mutable inc = 0

let a (o: obj) =
    let autoEvent = o :?> AutoResetEvent
    printfn "a"
    inc <- inc + 1
    if inc = 3 then
        printfn "hit 3!"
        //autoEvent.Set() |> ignore

let private loop i =
    async {
        printfn "Started as child..."
        let aWrap (o: obj) = // so that we can see which child prints
            printfn "%d" i
        let autoEvent = new AutoResetEvent(false)
        let timer = new Timer(aWrap, autoEvent, 0, 2000)
        autoEvent.WaitOne() |> ignore
    }

let startLoopAsync () =
    async {
        let children =
            [1..3]
            |> List.map (fun i ->
                Async.StartChild(loop i)) // start as child
        do! Async.Sleep 100 // give chance for children to start
        children
        |> List.iter (Async.RunSynchronously >> ignore) // wait for all children
    }

startLoopAsync() |> (Async.RunSynchronously >> ignore) // wait for async loop start
Thread.Sleep(5000)
Note that I used StartChild. I recommend this because of the facts noted here: Async.Start vs. Async.StartChild. A child async task does not need to be given its own cancellation token; instead, it inherits its parent's. So, if I had assigned a cancellation token to startLoopAsync(), I could cancel that task and all its children would be cancelled as well.
Lastly, I recommend keeping a handle on timer in case you ever need to stop that recurring event; not keeping a handle means you cannot stop it without killing the process. That is what Thread.Sleep(5000) was for: it shows that after the async tasks finish, the timers keep triggering events until the process dies (which requires killing FSI if you use that to test). A sketch of both points follows.
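Both recommendations might look something like this sketch (the CancellationTokenSource wiring and the standalone timer are illustrative assumptions, not code from the post):

open System.Threading

// A token source passed to the parent is inherited by every StartChild child.
let cts = new CancellationTokenSource()
Async.Start(startLoopAsync(), cts.Token) // children inherit cts.Token
// ... later ...
cts.Cancel() // cancels the parent and, with it, all children

// Keeping the Timer handle lets you stop the recurring callback.
let timer = new Timer((fun _ -> printfn "tick"), null, 0, 2000)
// ... later ...
timer.Dispose() // stops further callbacks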
I hope this answers your question,
Cheers!

Generic reply from agent/mailboxprocessor?

I currently have an agent that does heavy data processing by constantly posting "work" messages to itself.
Sometimes clients of this agent want to interrupt the processing to access the data in a safe manner.
For this, I thought it would be nice to post an async to the agent that the agent can run whenever it's in a safe state. This works fine, and the message looks like this:
type Message = Sync of Async<unit> * AsyncReplyChannel<unit>
And the agent processing simply becomes:
let! msg = mailbox.Receive ()
match msg with
| Sync (work, reply) -> work |> Async.RunSynchronously |> reply.Reply
This works great as long as clients don't need to return a value from the async, since I've constrained the async/reply to be of type unit, and I cannot use a generic type in the discriminated union.
My best attempts to solve this have involved wrapper asyncs and wait handles, but this seems messy and not as elegant as I've come to expect from F#. I'm also new to async workflows in F#, so it's very possible that I've missed or misunderstood some concepts here.
So the question is: how can I return generic types in an agent response?
The thing that makes this difficult is that, in your current version, the agent would somehow have to calculate the value and then pass it to the channel, without knowing the type of the value. Doing that in a statically typed way in F# is tricky.
If you make the message generic, then it will work, but the agent will only be able to handle messages of one type (the type T in Message<T>).
An alternative is to simply pass Async<unit> to the agent and let the caller do the value passing for each specific type. So, you can write the message and agent just like this:
type Message = | Sync of Async<unit>

let agent = MailboxProcessor.Start(fun inbox -> async {
    while true do
        let! msg = inbox.Receive ()
        match msg with
        | Sync (work) -> do! work })
When you use PostAndReply, you get access to the reply channel - rather than passing the channel to the agent, you can just use it in the local async block:
let num = agent.PostAndReply(fun chan -> Sync(async {
    let ret = 42
    chan.Reply(ret) }))

let str = agent.PostAndReply(fun chan -> Sync(async {
    let ret = "hi"
    chan.Reply(ret) }))
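Building on that, a small generic helper can package the pattern; runOnAgent is an assumed name, not part of the original answer:

// The reply channel's type parameter is fixed at each call site, so one
// unit-typed message suffices for any result type.
let runOnAgent (work: Async<'T>) : 'T =
    agent.PostAndReply(fun chan -> Sync(async {
        let! result = work
        chan.Reply result }))

let answer = runOnAgent (async { return 6 * 7 }) // answer : int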
