How to concurrently crawl paginated webpages with unknown end?

I'm trying to write a web crawler in Rust using the tokio asynchronous runtime. I want to fetch and process multiple pages asynchronously, but I also want the crawler to stop when it reaches the end (in other words, when there is nothing left to crawl). So far I have used futures::future::try_join_all to get a collective result from the async functions that I provide as futures, but this obviously requires the program to know the total number of pages to crawl beforehand. For example:
async fn fetch(_url: String) -> Result<String, ()> {
    tokio::time::sleep(std::time::Duration::from_millis(100)).await;
    Ok(String::from("foo"))
}

#[tokio::main]
async fn main() {
    let search_url = "https://example.com/?page={page_num}";
    let futures = (1..=3)
        .map(|page_num| search_url.replace("{page_num}", &page_num.to_string()))
        .map(|url| fetch(url));
    let _ = futures::future::try_join_all(futures).await.unwrap();
}
In this simple example I have to know the total number of pages to go through (1..=3) before actually fetching them. What I want is to provide no range at all and instead have a condition that stops the whole process (e.g. if the HTML result contains "not found").
I looked into futures::executor::block_on but I'm not sure if it is something that I can utilize for this task.

Here's roughly how to do this using Stream and .buffered():
use futures::{future, stream, StreamExt};

#[derive(Debug)]
struct Error;

async fn fetch_page(page: i32) -> Result<String, Error> {
    println!("fetching page: {}", page);
    // simulate loading pages
    tokio::time::sleep(std::time::Duration::from_millis(100)).await;
    if page < 5 {
        // successfully got page
        Ok(String::from("foo"))
    } else {
        // page doesn't exist
        Err(Error)
    }
}

#[tokio::main]
async fn main() {
    let pages: Vec<String> = stream::iter(1..)
        .map(fetch_page)
        .buffered(10)
        .take_while(|page| future::ready(page.is_ok()))
        .map(|page| page.unwrap())
        .collect()
        .await;
    println!("pages: {:?}", pages);
}
I'll go over the steps in main() in detail:
stream::iter(1..) creates an unbounded Stream of integers representing each page number
.map(fetch_page) of course will call fetch_page for each page number
.buffered(10) this will allow up to 10 fetch_page calls to occur concurrently and will preserve the original order
.take_while(|page| future::ready(page.is_ok())) will keep the stream going until a fetch_page returns an error, it uses futures::future::ready since the function passed to take_while must return a future
.map(|page| page.unwrap()) will pull out the successful pages, it won't panic because we know the stream will stop when any errors occur
.collect() does essentially the same thing as for an iterator except you have to .await it
Running the above code prints out the following, showing that it tries 10 at a time but will only return up to the first failure:
fetching page: 1
fetching page: 2
fetching page: 3
fetching page: 4
fetching page: 5
fetching page: 6
fetching page: 7
fetching page: 8
fetching page: 9
fetching page: 10
pages: ["foo", "foo", "foo", "foo"]
This glosses over some nice-to-haves like handling non-missing-page errors or retrying, but I hope this gives you a good foundation. In those cases you might reach for the methods on TryStreamExt, which specially handle streams of Results.
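To make the point about non-missing-page errors a bit more concrete, here is a minimal sketch of the stop-or-propagate decision. It is written synchronously over an already ordered sequence of results so that it runs with the standard library alone; the same match could be applied to the ordered items that .buffered() yields. FetchError and collect_until_not_found are hypothetical names for illustration, not part of the code above or of the futures crate.

```rust
// Hypothetical error type: "page missing" is the stop signal,
// anything else is a real failure that should be reported.
#[derive(Debug, PartialEq)]
enum FetchError {
    NotFound,
    Other(String),
}

// Collect page bodies in order, stop quietly at the first NotFound,
// and return early with any other error.
fn collect_until_not_found<I>(results: I) -> Result<Vec<String>, FetchError>
where
    I: IntoIterator<Item = Result<String, FetchError>>,
{
    let mut pages = Vec::new();
    for result in results {
        match result {
            Ok(body) => pages.push(body),
            Err(FetchError::NotFound) => break, // reached the end of pagination
            Err(other) => return Err(other),    // a genuine failure
        }
    }
    Ok(pages)
}

fn main() {
    // Results as they might come out of an ordered, buffered stream.
    let results = vec![
        Ok(String::from("page 1")),
        Ok(String::from("page 2")),
        Err(FetchError::NotFound),
        Ok(String::from("never reached")),
    ];
    println!("{:?}", collect_until_not_found(results));
}
```

Whether a non-missing-page error should abort the crawl, be retried, or be skipped is a design choice; this sketch simply aborts.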

Related

In Rust+Tokio, should you return a oneshot::Receiver as a processing callback?

I'm making an API where the user can submit items to be processed, and they might want to check whether their item was processed successfully. I thought that this would be a good place to use tokio::sync::oneshot channels, where I'd return the receiver to the caller, and they can later await on it to get the result they're looking for.
let processable_item = ...;
let where_to_submit: impl Submittable = get_submit_target();
let status_handle: oneshot::Receiver<SubmissionResult> = where_to_submit.submit(processable_item).await;
// ... do something that does not depend on the SubmissionResult ...
// Now we want to get the status of our submission
let status = status_handle.await;
Submitting the item involves creating a oneshot channel, and putting the Sender half into a queue while the Receiver goes back to the calling code:
#[async_trait]
impl Submittable for Something {
    async fn submit(&self, item: ProcessableItem) -> oneshot::Receiver<SubmissionResult> {
        let (sender, receiver) = oneshot::channel();
        // Put the item, with the associated sender, into a queue
        // (the queue handle must be the Sender half in order to call .send())
        let queue: mpsc::Sender<(ProcessableItem, oneshot::Sender<SubmissionResult>)> = get_processing_queue();
        queue.send((item, sender)).await.expect("Processing task closed!");
        receiver
    }
}
When I do this, cargo clippy complains (via the [clippy::async_yields_async] lint) that I'm returning oneshot::Receiver, which can be awaited, from an async function, and suggests that I await it then.
This is not what I wanted, which is to allow a degree of background processing while the user doesn't need the SubmissionResult yet, as opposed to making them wait until it's available.
Is this API even a good idea? Does there exist a common approach to doing this?
Looks fine to me. This is a false positive of Clippy, so you can just silence it: #[allow(clippy::async_yields_async)].
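For what it's worth, the shape of this API (submit now, check the result later through a returned handle) can be sketched without any async machinery. The following is only an illustration under stated assumptions: it uses std::thread and std::sync::mpsc as stand-ins for a tokio task and tokio's mpsc/oneshot channels, and the Submitter type and the "double the number" processing step are made up for the example. The blocking recv() plays the role that .await on the oneshot::Receiver plays in the question.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical result type for the example.
type SubmissionResult = Result<u32, String>;

// Holds the sending half of the work queue.
struct Submitter {
    queue: mpsc::Sender<(u32, mpsc::Sender<SubmissionResult>)>,
}

impl Submitter {
    // Start a background worker that processes queued items.
    fn start() -> Self {
        let (tx, rx) = mpsc::channel::<(u32, mpsc::Sender<SubmissionResult>)>();
        thread::spawn(move || {
            for (item, reply) in rx {
                // "Processing" is just doubling the number here.
                // Ignore the error if the caller dropped its handle.
                let _ = reply.send(Ok(item * 2));
            }
        });
        Submitter { queue: tx }
    }

    // Enqueue the item and hand back a handle to its eventual result.
    fn submit(&self, item: u32) -> mpsc::Receiver<SubmissionResult> {
        let (reply_tx, reply_rx) = mpsc::channel();
        self.queue
            .send((item, reply_tx))
            .expect("processing thread closed");
        reply_rx
    }
}

fn main() {
    let submitter = Submitter::start();
    let status_handle = submitter.submit(21);
    // ... do something that does not depend on the result ...
    let status = status_handle
        .recv()
        .expect("worker dropped the reply sender");
    println!("{:?}", status);
}
```

The caller decides when (or whether) to wait on the handle, which is the behavior the question is after.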

Is there a way to poll several futures simultaneously in Rust async?

I'm trying to execute several sqlx queries in parallel, given by an iterator.
This is probably the closest I've got so far.
let mut futures = HashMap::new() // placeholder, filled HashMap in reality
    .iter()
    .map(async move |(_, item)| -> Result<(), sqlx::Error> {
        let result = sqlx::query_file_as!(
            // omitted
        )
        .fetch_one(&pool)
        .await?;
        channel.send(Enum::Event(result)).ignore();
        Ok(())
    })
    .collect();
futures::future::join_all(futures);
All queries and sends are independent from each other, so if one of them fails, the others should still get processed.
Furthermore, the async closure as written here is not possible.
Stable Rust did not have async closures when this was written (they were stabilized later, in Rust 1.85). On toolchains without them, have the closure return an async block instead:
move |(_, item)| async move { ... }
Additionally, make sure you .await the future returned by join_all in order to ensure the individual tasks are actually polled.
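Both points can be seen in a small std-only sketch. It builds futures from a closure that returns an async block, checks that none of them has started before anything awaits them, and then drives them to completion. The block_on function here is a toy busy-polling executor written just for this example (a real program would use tokio's runtime), run_demo is a hypothetical name, and the futures are awaited one after another rather than concurrently, so this illustrates laziness only, not what join_all does.

```rust
use std::future::Future;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// A waker that does nothing; enough for a busy-polling toy executor.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

// Toy executor: polls the future in a loop until it is ready.
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = Box::pin(fut);
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}

// Returns (futures started before awaiting, started after, results).
fn run_demo() -> (usize, usize, Vec<u32>) {
    let started = Arc::new(AtomicUsize::new(0));

    // A plain closure returning an async block.
    let futures: Vec<_> = vec![1u32, 2, 3]
        .into_iter()
        .map(|item| {
            let started = Arc::clone(&started);
            async move {
                started.fetch_add(1, Ordering::SeqCst);
                item * 10
            }
        })
        .collect();

    // Nothing has run yet: futures are lazy until polled.
    let before = started.load(Ordering::SeqCst);

    let results = block_on(async {
        let mut out = Vec::new();
        for fut in futures {
            out.push(fut.await);
        }
        out
    });

    let after = started.load(Ordering::SeqCst);
    (before, after, results)
}

fn main() {
    let (before, after, results) = run_demo();
    println!("started before await: {before}, after: {after}, results: {results:?}");
}
```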

How to process a vector as an asynchronous stream?

In my RSS reader project, I want to read my RSS feeds asynchronously. Currently, they're read synchronously thanks to this code block
self.feeds = self
    .feeds
    .iter()
    .map(|f| f.read(&self.settings))
    .collect::<Vec<Feed>>();
I want to make that code asynchronous, because it will allow me to better handle poor web server responses.
I understand I can use a Stream that I can create from my Vec using stream::from_iter(...) which transforms the code into something like
self.feeds = stream::from_iter(self.feeds.iter())
    .map(|f| f.read(&self.settings))
    // ???
    .collect::<Vec<Feed>>()
But then, I have two questions
How to have results joined into a Vec (which is a synchronous struct)?
How to execute that stream? I was thinking about using task::spawn but it doesn't seem to work ...
How to execute that stream? I was thinking about using task::spawn but it doesn't seem to work
In the async/await world, asynchronous code is meant to be executed by an executor, which is not part of the standard library but is provided by third-party crates such as tokio. task::spawn only schedules a single future to run on that executor; it does not run it on the spot.
How to have results joined into a vec (which is a sync struct)
The bread and butter of your rss reader seems to be f.read. It should be turned into an asynchronous function. Then the vector of feeds will be mapped into a vector of futures, which need to be polled to completion.
The futures crate has futures::stream::futures_unordered::FuturesUnordered to help you do that. FuturesUnordered itself implements Stream trait. This stream is then collected into the result vector and awaited to completion like so:
//# tokio = { version = "0.2.4", features = ["full"] }
//# futures = "0.3.1"
use tokio::time::delay_for;
use futures::stream::StreamExt;
use futures::stream::futures_unordered::FuturesUnordered;
use std::error::Error;
use std::time::{Duration, Instant};

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    let start = Instant::now();
    let feeds = (0..10).collect::<Vec<_>>();
    let res = read_feeds(feeds).await;
    dbg!(res);
    dbg!(start.elapsed());
    Ok(())
}

async fn read_feeds(feeds: Vec<u32>) -> Vec<u32> {
    feeds.iter()
        .map(read_feed)
        .collect::<FuturesUnordered<_>>()
        .collect::<Vec<_>>()
        .await
}

async fn read_feed(feed: &u32) -> u32 {
    delay_for(Duration::from_millis(500)).await;
    feed * 2
}
delay_for simulates the potentially expensive operation (in tokio 1.x it is named tokio::time::sleep). It also helps demonstrate that these reads do happen concurrently, without any explicit thread-related logic.
One nuance here: unlike with the synchronous version, the results are no longer in the same order as the feeds themselves. Whichever read finishes first ends up at the front. You need to deal with that somehow.
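One possible way to deal with it, sketched here under the assumption that each future is changed to return its input index alongside its value, is to sort the completed pairs by index afterwards. restore_order is a hypothetical helper, and the sample data simply imitates an out-of-order completion.

```rust
// Sort (index, value) pairs by their original index and drop the index.
fn restore_order<T>(mut tagged: Vec<(usize, T)>) -> Vec<T> {
    tagged.sort_by_key(|(index, _)| *index);
    tagged.into_iter().map(|(_, value)| value).collect()
}

fn main() {
    // Pretend these arrived in completion order rather than input order.
    let completed = vec![(2, "feed-c"), (0, "feed-a"), (1, "feed-b")];
    let ordered = restore_order(completed);
    println!("{:?}", ordered);
}
```

If order matters and you don't need results as soon as each one finishes, futures::future::join_all is an alternative, since it returns results in input order.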

Flutter multiple async methods for parallel execution

I'm still struggling with the async/await pattern, so I'm here to ask for some clarification.
I saw this page explaining the async/await pattern pretty well. I'm posting here the example that bothers me:
import 'dart:async';
Future<String> firstAsync() async {
  await Future<String>.delayed(const Duration(seconds: 2));
  return "First!";
}

Future<String> secondAsync() async {
  await Future<String>.delayed(const Duration(seconds: 2));
  return "Second!";
}

Future<String> thirdAsync() async {
  await Future<String>.delayed(const Duration(seconds: 2));
  return "Third!";
}

void main() async {
  var f = await firstAsync();
  print(f);
  var s = await secondAsync();
  print(s);
  var t = await thirdAsync();
  print(t);
  print('done');
}
In this example, each async method is called one after another, so the execution time for the main function is 6 seconds (3 x 2 seconds). However, I don't understand the point of asynchronous functions if they are executed one after another.
Aren't async functions supposed to execute in the background? Isn't the point of multiple async functions to speed up the process through parallel execution?
I think I'm missing something about asynchronous functions and the async/await pattern in Flutter, so if you could explain that to me, it would be much appreciated.
Best
Waiting on multiple Futures to complete using Future.wait()
If the order of execution of the functions is not important, you can use Future.wait().
The functions get triggered in quick succession; when all of them complete with a value, Future.wait() returns a new Future. This Future completes with a list containing the values produced by each function.
Future
    .wait([firstAsync(), secondAsync(), thirdAsync()])
    .then((List responses) => chooseBestResponse(responses))
    .catchError((e) => handleError(e));
or with async/await
try {
  List responses = await Future.wait([firstAsync(), secondAsync(), thirdAsync()]);
} catch (e) {
  handleError(e);
}
If any of the invoked functions completes with an error, the Future returned by Future.wait() also completes with an error. Use catchError() to handle the error.
Resource: https://v1-dartlang-org.firebaseapp.com/tutorials/language/futures#waiting-on-multiple-futures-to-complete-using-futurewait
The example is designed to show how you can wait for a long-running process without actually blocking the thread. In practice, if you have several of those that you want to run in parallel (for example: independent network calls), you could optimize things.
Calling await stops the execution of the method until the future completes, so the call to secondAsync will not happen until firstAsync finishes, and so on. If you do this instead:
void main() async {
  var f = firstAsync();
  var s = secondAsync();
  var t = thirdAsync();
  print(await f);
  print(await s);
  print(await t);
  print('done');
}
then all three futures are started right away, and then you wait for them to finish in a specific order.
It is worth highlighting that now f, s, and t have type Future<String>. You can experiment with different durations for each future, or changing the order of the statements.
For anyone new to this problem: the async package provides a class called FutureGroup. You can use it to wait on a set of futures running in parallel.
Sample:
// FutureGroup comes from package:async (import 'package:async/async.dart';)
final futureGroup = FutureGroup(); // instantiate it

Future<void> runAllFutures() async {
  // Add all the futures. This is not the best way; you could write an
  // extension method to add them all at once.
  futureGroup.add(hello());
  futureGroup.add(checkLocalAuth());
  futureGroup.add(hello1());
  futureGroup.add(hello2());
  futureGroup.add(hello3());
  // Call `.close()` to signal that no more futures will be added;
  // once closed, nothing more can be added to this group.
  futureGroup.close();
  // Wait for the group to finish (all futures inside it to finish).
  await futureGroup.future;
}
FutureGroup has some other useful members besides .future; check the documentation for more information.
Here's a sample usage Example One using await/async and Example Two using Future.then.
You can always combine them into a single future:
final results = await Future.wait([
  firstAsync(),
  secondAsync(),
  thirdAsync(),
]);
results will be a list of your return type, in this case a list of strings.
cheers.
Try this approach.
final List<Future<dynamic>> featureList = <Future<dynamic>>[];
for (final Partner partner in partnerList) {
  featureList.add(repository.fetchAvatar(partner.uid));
}
await Future.wait<dynamic>(featureList);
If you want truly parallel execution, you should switch to the multi-threading concept called Isolates
and mix it with async/await. You can also check this website for more:
https://buildflutter.com/flutter-threading-isolates-future-async-and-await/
Using async / await like that is useful when you need a resource before executing the next task.
Your example doesn't do anything really useful, but imagine you call firstAsync, which gives you an authorization token stored on your phone, then you call secondAsync, passing it this token, to execute an HTTP request asynchronously, and then you check the result of this request.
In this case you don't block the UI thread (user can interact with your app) and other tasks (get token, HTTP request...) are done in background.
I think you misunderstood how Flutter works. First, Flutter is not multi-threaded.
Second, since it isn't multi-threaded, it can't execute tasks in parallel, and it doesn't. Here are some links that will help you understand more: https://webdev.dartlang.org/articles/performance/event-loop
https://www.dartlang.org/tutorials/language/futures
Flutter doesn't put futures on another thread; instead, they are added to a queue. The links above cover the event loop and how futures work. Hope that helps, feel free to ask me :)

Generic reply from agent/mailboxprocessor?

I currently have an agent that does heavy data processing by constantly posting "work" messages to itself.
Sometimes clients of this agent want to interrupt this processing to access the data in a safe manner.
For this I thought that posting an async to the agent that the agent can run whenever it's in a safe state would be nice. This works fine and the message looks like this:
type Message = | Sync of Async<unit> * AsyncReplyChannel<unit>
And the agent processing simply becomes:
match mailbox.Receive () with
| Sync (async, reply) -> async |> Async.RunSynchronously |> reply.Reply
This works great as long as clients don't need to return some value from the async as I've constrained the async/reply to be of type unit and I cannot use a generic type in the discriminated union.
My best attempts to solve this has involved wrapper asyncs and waithandles, but this seems messy and not as elegant as I've come to expect from F#. I'm also new to async workflows in F# so it's very possible that I've missed/misunderstood some concepts here.
So the question is: how can I return generic types in an agent response?
The thing that makes this difficult is that, in your current version, the agent would somehow have to calculate the value and then pass it to the channel, without knowing what is the type of the value. Doing that in a statically typed way in F# is tricky.
If you make the message generic, then it will work, but the agent will only be able to handle messages of one type (the type T in Message<T>).
An alternative is to simply pass Async<unit> to the agent and let the caller do the value passing for each specific type. So, you can write message & agent just like this:
type Message = | Sync of Async<unit>

let agent = MailboxProcessor.Start(fun inbox -> async {
    while true do
        let! msg = inbox.Receive ()
        match msg with
        | Sync (work) -> do! work })
When you use PostAndReply, you get access to the reply channel - rather than passing the channel to the agent, you can just use it in the local async block:
let num = agent.PostAndReply(fun chan -> Sync(async {
    let ret = 42
    chan.Reply(ret) }))

let str = agent.PostAndReply(fun chan -> Sync(async {
    let ret = "hi"
    chan.Reply(ret) }))

Resources