I'd like to improve the integration of my async data collection with my rayon data processing by overlapping the retrieval and the processing. Currently, I pull lots of pages from a web site using normal async code. Once that is complete, I do the cpu-intensive work using rayon's par_iter.
It seems like I should be able to easily overlap the processing, so that I'm not waiting for every last page before I begin the grunt work. Every page that I retrieve is independent of the others, so there is no need to wait before the conversion.
Here's what I have working currently (simplified just a bit):
use rayon::prelude::*;
use futures::{stream, StreamExt};
use reqwest::{Client, Result};
const CONCURRENT_REQUESTS: usize = usize::MAX;
const MAX_PAGE: usize = 1000;
#[tokio::main]
async fn main() {
// get data from server
let client = Client::new();
let bodies: Vec<Result<String>> = stream::iter(1..MAX_PAGE+1)
.map(|page_number| {
let client = &client;
async move {
client
.get(format!("https://someurl?{page_number}"))
.send()
.await?
.text()
.await
}
})
.buffer_unordered(CONCURRENT_REQUESTS)
.collect()
.await;
// transform the data
let mut rows: Vec<MyRow> = bodies
.par_iter()
.filter_map(|body| body.as_ref().ok())
.map(|data| {
let page = serde_json::from_str::<MyPage>(data).unwrap();
page.rows
.iter()
.map(|x| MyRow::new(x))
.collect::<Vec<MyRow>>()
})
.flatten()
.collect();
// do something with rows
}
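One way to get the overlap, sketched with only the standard library so it stands alone: a producer hands each body to a channel as soon as it arrives, and the consumer converts bodies while later ones are still in flight. The producer thread is a stand-in for the async fetch loop, and parse_page and fetch_and_process are hypothetical names, not part of the code above.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical placeholder for the CPU-heavy conversion of one page into rows.
fn parse_page(body: &str) -> Vec<String> {
    body.split(',').map(|s| s.trim().to_string()).collect()
}

/// Sends every body through a channel and converts each one as it arrives.
fn fetch_and_process(pages: Vec<String>) -> Vec<String> {
    let (tx, rx) = mpsc::channel::<String>();

    // Stand-in for the async fetch loop: each body is sent once it is "retrieved".
    let producer = thread::spawn(move || {
        for body in pages {
            if tx.send(body).is_err() {
                break; // the receiver is gone, so stop producing
            }
        }
        // `tx` is dropped here, which closes the channel.
    });

    // `rx.iter()` yields bodies as they arrive and ends once the sender is dropped,
    // so conversion starts before the last page is in.
    let rows: Vec<String> = rx.iter().flat_map(|body| parse_page(&body)).collect();

    producer.join().expect("producer thread panicked");
    rows
}
```

In the real program the sender would presumably live inside each page's future, and the receiving side would run off the async runtime's threads. rayon's par_bridge could then spread the receiver's iterator over its pool, though I haven't tested that wiring against the code above.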
I built a LED clock that also displays weather. My program does a couple of different things in a loop, each thing with a different interval:
updates the LEDs every 50ms,
checks the light level (to adjust the brightness) every 1 second,
fetches weather every 10 minutes,
actually some more, but that's irrelevant.
Updating the LEDs is the most critical: I don't want this to be delayed when e.g. weather is being fetched. This should not be a problem as fetching weather is mostly an async HTTP call.
Here's the code that I have:
let mut measure_light_stream = tokio::time::interval(Duration::from_secs(1));
let mut update_weather_stream = tokio::time::interval(WEATHER_FETCH_INTERVAL);
let mut update_leds_stream = tokio::time::interval(UPDATE_LEDS_INTERVAL);
loop {
tokio::select! {
_ = measure_light_stream.tick() => {
let light = lm.get_light();
light_smooth.sp = light;
},
_ = update_weather_stream.tick() => {
let fetched_weather = weather_service.get(&config).await;
// Store the fetched weather for later access from the displaying function.
weather_clock.weather = fetched_weather.clone();
},
_ = update_leds_stream.tick() => {
// Some code here that actually sets the LEDs.
// This code accesses the weather_clock, the light level etc.
},
}
}
I realised the code doesn't do what I wanted it to do - fetching the weather blocks the execution of the loop. I see why - the docs of tokio::select! say the other branches are cancelled as soon as the update_weather_stream.tick() expression completes.
How do I do this in such a way that while fetching the weather is waiting on network, the LEDs are still updated? I figured out I could use tokio::spawn to start a separate non-blocking "thread" for fetching weather, but then I have problems with weather_service not being Send, let alone weather_clock not being shareable between threads. I don't want this complication, I'm fine with everything running in a single thread, just like what select! does.
Reproducible example
use std::time::Duration;
use tokio::time::{interval, sleep};
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
let mut slow_stream = interval(Duration::from_secs(3));
let mut fast_stream = interval(Duration::from_millis(200));
// Note how access to this data is straightforward, I do not want
// this to get more complicated, e.g. care about threads and Send.
let mut val = 1;
loop {
tokio::select! {
_ = fast_stream.tick() => {
println!(".{}", val);
},
_ = slow_stream.tick() => {
println!("Starting slow operation...");
// The problem: During this await the dots are not printed.
sleep(Duration::from_secs(1)).await;
val += 1;
println!("...done");
},
}
}
}
You can use tokio::join! to run multiple async operations concurrently within the same task.
Here's an example:
use std::cell::Cell;
use std::time::Duration;
async fn measure_light(halt: &Cell<bool>) {
while !halt.get() {
let light = lm.get_light();
// ....
tokio::time::sleep(Duration::from_secs(1)).await;
}
}
async fn blink_led(halt: &Cell<bool>) {
while !halt.get() {
// LED blinking code
tokio::time::sleep(UPDATE_LEDS_INTERVAL).await;
}
}
async fn poll_weather(halt: &Cell<bool>) {
while !halt.get() {
let weather = weather_service.get(&config).await;
// ...
tokio::time::sleep(WEATHER_FETCH_INTERVAL).await;
}
}
// example on how to terminate execution
async fn terminate(halt: &Cell<bool>) {
tokio::time::sleep(Duration::from_secs(10)).await;
halt.set(true);
}
#[tokio::main]
async fn main() {
let halt = Cell::new(false);
tokio::join!(
measure_light(&halt),
blink_led(&halt),
poll_weather(&halt),
terminate(&halt),
);
}
If you're using tokio::net::TcpStream or other non-blocking IO, the branches can make progress concurrently.
I've added a Cell flag for halting execution as an example. You can use the same technique to share any mutable state between join branches.
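To see why a plain Cell is enough here, this sketch replaces tokio with a deliberately tiny single-threaded poller. Everything in it (YieldNow, run_all, incrementer, observer, interleave) is made up for illustration and is not tokio's API. The point is only that the futures are polled in turn on one thread, so they can share a &Cell without Send or locks.

```rust
use std::cell::{Cell, RefCell};
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Returns `Pending` once and then completes; a stand-in for a real await point.
struct YieldNow(bool);

impl Future for YieldNow {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            Poll::Ready(())
        } else {
            self.0 = true;
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

/// Waker that does nothing; the toy poller below simply polls in a loop.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

type LocalTask<'a> = Pin<Box<dyn Future<Output = ()> + 'a>>;

/// Toy join: polls every future in order on the current thread until all finish.
fn run_all(mut tasks: Vec<LocalTask<'_>>) {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    while !tasks.is_empty() {
        tasks.retain_mut(|task| task.as_mut().poll(&mut cx).is_pending());
    }
}

/// Bumps the shared value once per step, yielding in between.
async fn incrementer(val: &Cell<u32>, steps: u32) {
    for _ in 0..steps {
        val.set(val.get() + 1);
        YieldNow(false).await;
    }
}

/// Records the value it sees at each step, yielding in between.
async fn observer(val: &Cell<u32>, seen: &RefCell<Vec<u32>>, steps: u32) {
    for _ in 0..steps {
        seen.borrow_mut().push(val.get());
        YieldNow(false).await;
    }
}

/// Runs both futures interleaved and returns what the observer saw.
fn interleave(steps: u32) -> Vec<u32> {
    let val = Cell::new(0);
    let seen = RefCell::new(Vec::new());
    let a: LocalTask<'_> = Box::pin(incrementer(&val, steps));
    let b: LocalTask<'_> = Box::pin(observer(&val, &seen, steps));
    run_all(vec![a, b]);
    seen.into_inner()
}
```

Because the two futures take turns, the observer sees each update made by the incrementer. With tokio::join! the scheduling is driven by real wakeups instead of a busy loop, but the sharing rules for the Cell are the same.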
EDIT: The same thing can be done with tokio::select!. The main difference from your code is that the actual "business logic" lives inside the futures awaited by select!, not in the branch handlers.
select! drops the unfinished futures as soon as one branch completes instead of waiting for them to exit on their own, so the halt termination flag is not necessary.
#[tokio::main]
async fn main() {
tokio::select! {
_ = measure_light() => {},
_ = blink_led() => {},
_ = poll_weather() => {},
}
}
Here's a concrete solution, based on the second part of stepan's answer:
use std::time::Duration;
use tokio::time::sleep;
#[tokio::main]
async fn main() {
// Cell is an acceptable complication when accessing the data.
let val = std::cell::Cell::new(1);
tokio::select! {
_ = async {loop {
println!(".{}", val.get());
sleep(Duration::from_millis(200)).await;
}} => {},
_ = async {loop {
println!("Starting slow operation...");
// During this await the other branch keeps printing dots.
sleep(Duration::from_secs(1)).await;
val.set(val.get() + 1);
println!("...done");
sleep(Duration::from_secs(3)).await;
}} => {},
}
}
I need a way to run the same function many times with different inputs.
And since the function depends on a slow web API, I need to run it concurrently and collect the results in one variable.
I use the following:
use tokio_stream::StreamExt;
async fn run(input: &str) -> Vec<String> {
vec![String::from(input), String::from(input)]
}
#[tokio::main]
async fn main() {
let mut input = tokio_stream::iter(vec!["1","2","3","4","5","6","7","8"]);
let mut handles = vec![];
while let Some(domain) = input.next().await {
handles.push(run(domain));
}
let mut results = vec![];
let mut handles = tokio_stream::iter(handles);
while let Some(handle) = handles.next().await {
results.extend(handle.await);
}
}
I know there is a way with the futures crate, but I don't know if I can use it with tokio. Also tokio_stream::StreamExt contains fold and map methods but I can't find a way to use them without calling await.
What is the best way to do this?
IIUC what you want, you can use tokio::spawn to launch your tasks in the background and futures::future::join_all to wait until they have all completed. E.g. something like this (untested):
async fn run(input: &str) -> Vec<String> {
vec![String::from(input), String::from(input)]
}
#[tokio::main]
async fn main() {
let input = vec!["1","2","3","4","5","6","7","8"];
// into_iter() yields &'static str values, so each spawned task is 'static.
let handles = input.into_iter().map(|domain| {
tokio::spawn(async move { run(domain).await })
});
// Each entry is a Result<Vec<String>, JoinError>, in input order.
let results = futures::future::join_all(handles).await;
}
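A thread-based analogue of the same spawn-then-join shape, using only the standard library so it can be run without tokio. Here run is a synchronous stand-in for the question's function and run_all is a hypothetical helper name. std::thread::spawn has the same 'static requirement as tokio::spawn, which is why the inputs are moved in by value.

```rust
use std::thread;

/// Synchronous stand-in for the question's `run`; the real one awaits a web API.
fn run(input: &str) -> Vec<String> {
    vec![String::from(input), String::from(input)]
}

/// Starts one worker per input, then joins them all and flattens the results.
fn run_all(inputs: Vec<&'static str>) -> Vec<String> {
    // Spawn everything first so the workers run concurrently.
    let handles: Vec<_> = inputs
        .into_iter()
        .map(|input| thread::spawn(move || run(input)))
        .collect();

    // Joining in spawn order keeps the results in input order.
    handles
        .into_iter()
        .flat_map(|handle| handle.join().expect("worker panicked"))
        .collect()
}
```

The eager collect() of the handles matters: without it the iterator is lazy, and each worker would only start when it is about to be joined.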
I began writing a program using the Druid crate and Crabler crate to make a webscraping application whose data I can explore. I only realized that merging synchronous and asynchronous programming was a bad idea long after I had spent a while building this program. What I am trying to do right now is have the scraper run while the application is open (preferably every hour).
Right now the scraper doesn't run until after the application is closed. I tried to use Tokio's spawn to make a separate thread that starts before the application opens, but this doesn't work because the Crabler future doesn't have the "Send" trait.
I tried to make a minimal functional program as shown below. The title_handler doesn't function as expected but otherwise it demonstrates the issue I'm having well.
Is it possible to allow the WebScraper to run while the application is open? If so, how?
EDIT: I tried using task::spawn_blocking() to run the application and it threw out a ton of errors, including that druid doesn't implement the trait Send.
use crabler::*;
use druid::widget::prelude::*;
use druid::widget::{Align, Flex, Label, TextBox};
use druid::{AppLauncher, Data, Lens, WindowDesc, WidgetExt};
const ENTRY_PREFIX: [&str; 1] = ["https://duckduckgo.com/?t=ffab&q=rust&ia=web"];
// Use WebScraper trait to get each item with the ".result__title" class
#[derive(WebScraper)]
#[on_response(response_handler)]
#[on_html(".result__title", title_handler)]
struct Scraper {}
impl Scraper {
// Print webpage status
async fn response_handler(&self, response: Response) -> Result<()> {
println!("Status {}", response.status);
Ok(())
}
async fn title_handler(&self, _: Response, el: Element) -> Result<()> {
// Get text of element
let title_data = el.children();
let title_text = title_data.first().unwrap().text().unwrap();
println!("Result is {}", title_text);
Ok(())
}
}
// Run scraper to get info from https://duckduckgo.com/?t=ffab&q=rust&ia=web
async fn one_scrape() -> Result<()> {
let scraper = Scraper {};
scraper.run(Opts::new().with_urls(ENTRY_PREFIX.to_vec()).with_threads(1)).await
}
#[derive(Clone, Data, Lens)]
struct Init {
tag: String,
}
fn build_ui() -> impl Widget<Init> {
// Search box
let l_search = Label::new("Search: ");
let tb_search = TextBox::new()
.with_placeholder("Enter tag to search")
.lens(Init::tag);
let search = Flex::row()
.with_child(l_search)
.with_child(tb_search);
// Describe layout of UI
let layout = Flex::column()
.with_child(search);
Align::centered(layout)
}
#[async_std::main]
async fn main() -> Result<()> {
// Describe the main window
let main_window = WindowDesc::new(build_ui())
.title("Title Tracker")
.window_size((400.0, 400.0));
// Create starting app state
let init_state = Init {
tag: String::from("#"),
};
// Start application
AppLauncher::with_window(main_window)
.launch(init_state)
.expect("Failed to launch application");
one_scrape().await
}
I want to build a program that collects weather updates and represents them as a stream. I want to call get_weather() in an infinite loop, with 60 seconds delay between finish and start.
A simplified version would look like this:
async fn get_weather() -> Weather { /* ... */ }
fn get_weather_stream() -> impl futures::Stream<Item = Weather> {
loop {
tokio::timer::delay_for(std::time::Duration::from_secs(60)).await;
let weather = get_weather().await;
yield weather; // This is not supported
// Note: waiting for get_weather() stops the timer and avoids overflows.
}
}
Is there any way to do this easily?
Using tokio::timer::Interval will not work when get_weather() takes more than 60 seconds:
fn get_weather_stream() -> impl futures::Stream<Item = Weather> {
tokio::timer::Interval::new_with_delay(std::time::Duration::from_secs(60))
.then(|| get_weather())
}
If that happens, the next call starts immediately. I want to keep exactly 60 seconds between the previous get_weather() finishing and the next get_weather() starting.
Use stream::unfold to go from the "world of futures" to the "world of streams". We don't need any extra state, so we use the empty tuple:
use futures::StreamExt; // 0.3.4
use std::time::Duration;
use tokio::time; // 0.2.11
struct Weather;
async fn get_weather() -> Weather {
Weather
}
const BETWEEN: Duration = Duration::from_secs(1);
fn get_weather_stream() -> impl futures::Stream<Item = Weather> {
futures::stream::unfold((), |_| async {
time::delay_for(BETWEEN).await;
let weather = get_weather().await;
Some((weather, ()))
})
}
#[tokio::main]
async fn main() {
get_weather_stream()
.take(3)
.for_each(|_v| async {
println!("Got the weather");
})
.await;
}
% time ./target/debug/example
Got the weather
Got the weather
Got the weather
real 3.085 3085495us
user 0.004 3928us
sys 0.003 3151us
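The answer passes the empty tuple because no state is needed. For the case where state is needed, here is the same contract sketched for plain iterators with only the standard library: the closure receives the current state and returns either an item plus the next state, or None to end. The unfold function and numbered_readings below are hypothetical helpers written for this illustration, not the futures crate's implementation.

```rust
/// Synchronous analogue of an unfold: threads a state through repeated calls.
fn unfold<S, T>(init: S, mut step: impl FnMut(S) -> Option<(T, S)>) -> impl Iterator<Item = T> {
    let mut state = Some(init);
    std::iter::from_fn(move || {
        // Take the state out, run one step, and store the next state back.
        let (item, next) = step(state.take()?)?;
        state = Some(next);
        Some(item)
    })
}

/// Uses a counter as the state to label each item and to stop after `limit` items.
fn numbered_readings(limit: u32) -> Vec<String> {
    unfold(0u32, |count| {
        if count < limit {
            Some((format!("reading #{}", count + 1), count + 1))
        } else {
            None
        }
    })
    .collect()
}
```

In the stream version the closure returns a future resolving to the same Option of (item, next state), which is where the delay and the get_weather() call go.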
See also:
How do I convert an iterator into a stream on success or an empty stream on failure?
How do I iterate over a Vec of functions returning Futures in Rust?
Creating a stream of values while calling async fns?
EDIT: I refactored the code to make it simpler.
I'm writing a small program to check a website for dead links, using tokio and reqwest to make the requests asynchronously without the need for threading. But I also need to be able to return something from each of the requests that tokio is running, namely whether the request failed or not.
fn fetch(req: Vec<&'static str>) {
let client = Client::new();
let (tx, rx) = mpsc::channel();
let req_len = req.len();
let work = stream::iter_ok(req)
.map(move |url| client.get(url).send())
.buffer_unordered(PARALLEL_REQUESTS)
.then(move |response| {
let this_tx = tx.clone();
match response {
Ok(x) => {
format_response(x);
this_tx.send(1).unwrap();
}
Err(x) => {
format_error(x);
}
}
future::ok(())
})
.for_each(|n| Ok(()));
tokio::run(work);
}
The code works, but I'd like some feedback as to what the best way of writing this in Rust would be.
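Since the goal is to get a per-request outcome back, one option is to map each response to a value and collect those values, instead of pushing numbers through a channel. The sketch below uses plain data in place of reqwest responses so it runs on its own. LinkStatus, classify, check_all and dead_links are hypothetical names, and treating any status of 400 or above as dead is my assumption, not something the question states.

```rust
/// Outcome of checking one link (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
enum LinkStatus {
    Alive(u16),
    Dead(String),
}

/// Maps one response (status code or transport error) to an outcome.
fn classify(response: Result<u16, String>) -> LinkStatus {
    match response {
        // Assumption: any status below 400 counts as a live link.
        Ok(code) if code < 400 => LinkStatus::Alive(code),
        Ok(code) => LinkStatus::Dead(format!("HTTP {code}")),
        Err(error) => LinkStatus::Dead(error),
    }
}

/// Builds a report with one entry per URL, keeping the input order.
fn check_all(responses: Vec<(&str, Result<u16, String>)>) -> Vec<(String, LinkStatus)> {
    responses
        .into_iter()
        .map(|(url, response)| (url.to_string(), classify(response)))
        .collect()
}

/// Lists the URLs whose check failed.
fn dead_links(report: &[(String, LinkStatus)]) -> Vec<&str> {
    report
        .iter()
        .filter(|(_, status)| matches!(status, LinkStatus::Dead(_)))
        .map(|(url, _)| url.as_str())
        .collect()
}
```

In the stream pipeline this would correspond to returning the classified value from the per-response closure and collecting the stream, so the caller receives the whole report once all requests are done.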