How might I avoid a global mutable variable in this code? - web-scraping

The following code is meant to print "There is page two." if it finds a certain div on this website:
use reqwest;
use select::document::Document;
use select::predicate::Name;
use std::io;
static mut DECIDE: bool = false;
fn page_two_filter(x: &str, url: &str) {
if x == "pSiguiente('?pagina=2')" {
unsafe {
DECIDE = true;
}
}
}
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
println!("Give me the URL with the search results?");
let mut url = String::new();
io::stdin()
.read_line(&mut url)
.expect("Failed to read line");
let url = url.trim();
let html = reqwest::get(url).await?.text().await?;
Document::from(html.as_str())
.find(Name("div"))
.filter_map(|n| n.attr("onclick"))
.for_each(|x| page_two_filter(x, url));
unsafe {
if DECIDE == true {
println!("There is page two.")
}
}
Ok(())
}
Dependencies from Cargo.toml
[dependencies]
futures = "0.3.15"
reqwest = "0.11.9"
scraper = "0.12.0"
select = "0.5.0"
tokio = { version = "1", features = ["full"] }
Is there a safer way, i.e. without the unsafe blocks of code, of doing what that code does?
To avoid global mutable variables, I tried redefining page_two_filter to return a bool and using the result of the iterator chain as the condition of an if statement, like so:
fn page_two_filter(x: &str, url: &str) -> bool {
if x == "pSiguiente('?pagina=2')" {
return true;
}
false
}
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
println!("Give me the URL with the search results?");
let mut url = String::new();
io::stdin()
.read_line(&mut url)
.expect("Failed to read line");
let url = url.trim();
let html = reqwest::get(url).await?.text().await?;
if Document::from(html.as_str())
.find(Name("div"))
.filter_map(|n| n.attr("onclick"))
.for_each(|x| page_two_filter(x, url))
{
println!("There is page two.")
}
Ok(())
}
but the compiler rejects this with:
mismatched types expected `()`, found `bool`

Instead of for_each(), I guess you need find().
This returns Some( found_element ) if found or None if not found.
You can then use the Option returned by find() with if let, match, is_some()...
if let Some(_) = Document::from(html.as_str())
.find(Name("div"))
.filter_map(|n| n.attr("onclick"))
.find(|x| page_two_filter(x, url))
{
println!("There is page two.")
}

First of all, the
mismatched types expected (), found bool
error occurs because for_each expects a closure that returns (), while page_two_filter now returns bool. Since for_each itself also returns (), it cannot be used as the condition of an if.
Secondly, the filter is actually a one-liner, which could be integrated into that very closure:
fn page_two_filter(x: &str, url: &str) -> bool {
x == "pSiguiente('?pagina=2')"
}
Lastly, you already use various iterator methods, so why not continue?
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
println!("Give me the URL with the search results?");
let mut url = String::new();
io::stdin().read_line(&mut url).expect("Failed to read line");
let url = url.trim();
let html = reqwest::get(url).await?.text().await?;
if let Some(_) = Document::from(html.as_str())
.find(Name("div"))
.filter_map(|n| n.attr("onclick"))
.find_map(|attr| if attr == "pSiguiente('?pagina=2')" {
Some(true)
} else {
None
}) {
println!("There is page two.");
}
Ok(())
}

You can use Iterator::any which returns true on first find of condition, false otherwise:
fn page_two_filter(x: &str, url: &str) -> bool {
x == "pSiguiente('?pagina=2')"
}
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
println!("Give me the URL with the search results?");
let mut url = String::new();
io::stdin()
.read_line(&mut url)
.expect("Failed to read line");
let url = url.trim();
let html = reqwest::get(url).await?.text().await?;
let found = Document::from(html.as_str())
.find(Name("div"))
.filter_map(|n| n.attr("onclick"))
.any(|x| page_two_filter(x, url));
if found {
println!("There is page two.");
}
Ok(())
}
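The short-circuiting behaviour of Iterator::any can be checked without any network access. The following is a minimal sketch that assumes the onclick values have already been collected; has_page_two is a hypothetical helper, and the sample values other than the pagination string are made up for illustration.

```rust
/// Returns true if any `onclick` value is the page-two handler.
/// `has_page_two` is a hypothetical helper, not part of the `select` crate.
fn has_page_two<'a>(onclicks: impl IntoIterator<Item = &'a str>) -> bool {
    onclicks
        .into_iter()
        .any(|attr| attr == "pSiguiente('?pagina=2')")
}

fn main() {
    // Made-up values standing in for the scraped `onclick` attributes.
    let with_next = ["verDetalle(1)", "pSiguiente('?pagina=2')", "verDetalle(2)"];
    let without_next = ["verDetalle(1)", "verDetalle(2)"];

    // `any` stops at the first match, so the third element is never inspected.
    let mut inspected = 0;
    let found = with_next.iter().any(|attr| {
        inspected += 1;
        *attr == "pSiguiente('?pagina=2')"
    });

    println!("found = {found}, inspected = {inspected}");
    println!("{}", has_page_two(with_next));
    println!("{}", has_page_two(without_next));
}
```

Because the result is a plain bool, no state has to outlive the iterator chain, which is what removes the need for the static mut.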

Related

How do you get multiple URLs at the same time in a synchronous function?

I am getting data from the OpenWeatherMap API. Currently the data is retrieved sequentially, which is slow. The function has to be synchronous because it is part of a library, but it can call an async function. How might I still make concurrent requests to increase performance? A solution that does not use reqwest works, but reqwest is preferred.
fn get_combined_data(open_weather_map_api_url: String, open_weather_map_api_key: String,
coordinates: Vec<String>, metric: bool) -> Vec<HashMap<String, String>> {
let urls: Vec<String> = get_urls(open_weather_map_api_url, open_weather_map_api_key,
coordinates.get(0).expect("Improper coordinates").to_string() + "," +
coordinates.get(1).expect("Improper coordinates"), metric);
let mut data: Vec<HashMap<String, String>> = Vec::new();
for url in urls {
let request = reqwest::blocking::get(url).expect("Url Get failed").json().expect("json expected");
data.push(request);
}
return data;
}
If your program isn't already async, the easiest way is probably to use rayon.
use reqwest;
use std::collections::HashMap;
use rayon::prelude::*;
fn get_combined_data(open_weather_map_api_url: String, open_weather_map_api_key: String,
coordinates: Vec<String>, metric: bool) -> Vec<HashMap<String, String>> {
let urls: Vec<String> = get_urls(open_weather_map_api_url, open_weather_map_api_key,
coordinates.get(0).expect("Improper coordinates").to_string() + "," +
coordinates.get(1).expect("Improper coordinates"), metric);
let data : Vec<_>= urls
.par_iter()
.map(|url| reqwest::blocking::get(url).expect("Url Get failed").json().expect("json expected"))
.collect();
return data;
}
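If adding a dependency is undesirable, the same fan-out can be sketched with scoped threads from the standard library (std::thread::scope, stable since Rust 1.63). This is a different technique from rayon: it spawns one OS thread per URL instead of using a thread pool, so it only suits short URL lists. fetch_all is a hypothetical helper, and the closure passed to it stands in for the blocking reqwest::blocking::get(url) call so that the sketch runs without network access.

```rust
use std::thread;

/// Runs `fetch` once per URL, each on its own scoped thread, and returns
/// the results in the same order as `urls`.
/// Hypothetical helper: `fetch` stands in for a blocking HTTP call.
fn fetch_all<T, F>(urls: &[String], fetch: F) -> Vec<T>
where
    T: Send,
    F: Fn(&str) -> T + Sync,
{
    thread::scope(|scope| {
        // Spawn every request first so that they all run concurrently.
        let handles: Vec<_> = urls
            .iter()
            .map(|url| {
                let fetch = &fetch;
                scope.spawn(move || fetch(url.as_str()))
            })
            .collect();
        // Then join in order; a panic in a worker surfaces here.
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    })
}

fn main() {
    let urls = vec![
        "https://example.com/a".to_string(),
        "https://example.com/b".to_string(),
    ];
    // Stand-in for the real request: just report the URL length.
    let lengths = fetch_all(&urls, |url| url.len());
    println!("{lengths:?}");
}
```

In the real function the closure would perform the blocking request and JSON decoding, and T would be HashMap<String, String>.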
The easiest is probably to use tokio's new_current_thread runtime and block on the data retrieval.
use std::collections::HashMap;
use tokio::runtime;
pub fn collect_data() -> Vec<HashMap<String, String>> {
let rt = runtime::Builder::new_current_thread()
.enable_all() // reqwest needs the I/O and time drivers
.build()
.expect("couldn't start runtime");
let urls = vec!["https://example.com/a", "https://example.com/b"];
rt.block_on(async move {
let mut data = vec![];
for url in urls {
data.push(async move {
reqwest::get(url)
.await
.expect("Url Get Failed")
.json()
.await
.expect("json expected")
});
}
futures::future::join_all(data).await
})
}
You need an asynchronous runtime in order to call asynchronous functions. The easiest way to get one is to use the #[tokio::main] attribute (which despite the name can be applied to any async function; the macro rewrites it into a synchronous function that starts a runtime and blocks on the body):
#[tokio::main]
async fn get_combined_data(
open_weather_map_api_url: String,
open_weather_map_api_key: String,
coordinates: Vec<String>,
metric: bool,
) -> Vec<HashMap<String, String>> {
let urls: Vec<String> = get_urls(
open_weather_map_api_url,
open_weather_map_api_key,
coordinates
.get(0)
.expect("Improper coordinates")
.to_string()
+ ","
+ coordinates.get(1).expect("Improper coordinates"),
metric,
);
futures::future::join_all(urls.into_iter().map(|url| {
async move {
reqwest::get(url)
.await
.expect("Url Get Failed")
.json()
.await
.expect("json expected")
}
})).await
}

Hyper 0.12.x : Implementing Service for a struct

In hyper 0.12.33, how do I implement hyper::service::Service for a struct?
I have tried the following but it is not sufficient as it seems that in 0.12 the Future trait is not provided automatically anymore for a struct that implements Service:
use futures::future::Future;
use hyper::{Body, Request, Response};
struct MyStruct;
impl MyStruct {
pub fn new() -> Self {
MyStruct
}
}
impl hyper::service::Service for MyStruct {
type ReqBody = Body;
type ResBody = Body;
type Error = hyper::Error;
type Future = Box<Future<Item = Response<Body>, Error = hyper::Error>>;
fn call(&mut self, req: Request<Body>) -> Self::Future {
unimplemented!()
}
}
fn main() {
let addr = "0.0.0.0:8080".parse().unwrap();
let server = hyper::Server::bind(&addr)
.serve(|| MyStruct::new())
.map_err(|e| eprintln!("server error: {}", e));
hyper::rt::run(server);
}
This gives me the following build errors:
Standard Error
Compiling playground v0.0.1 (/playground)
error[E0277]: the trait bound `MyStruct: futures::future::Future` is not satisfied
--> src/main.rs:26:10
|
26 | .serve(|| MyStruct::new())
| ^^^^^ the trait `futures::future::Future` is not implemented for `MyStruct`
|
= note: required because of the requirements on the impl of `hyper::service::make_service::MakeServiceRef<hyper::server::tcp::addr_stream::AddrStream>` for `[closure#src/main.rs:26:16: 26:34]`
error[E0599]: no method named `map_err` found for type `hyper::server::Server<hyper::server::tcp::AddrIncoming, [closure#src/main.rs:26:16: 26:34]>` in the current scope
--> src/main.rs:27:10
|
27 | .map_err(|e| eprintln!("server error: {}", e));
| ^^^^^^^
|
= note: the method `map_err` exists but the following trait bounds were not satisfied:
`&mut hyper::server::Server<hyper::server::tcp::AddrIncoming, [closure#src/main.rs:26:16: 26:34]> : futures::future::Future`
`hyper::server::Server<hyper::server::tcp::AddrIncoming, [closure#src/main.rs:26:16: 26:34]> : futures::future::Future`
This example shows one way. It compiles and runs with hyper v0.14.12, whose Service trait differs from the 0.12 trait in the question:
#![deny(warnings)]
use std::task::{Context, Poll};
use futures_util::future;
use hyper::service::Service;
use hyper::{Body, Request, Response, Server};
const ROOT: &str = "/";
#[derive(Debug)]
pub struct Svc;
impl Service<Request<Body>> for Svc {
type Response = Response<Body>;
type Error = hyper::Error;
type Future = future::Ready<Result<Self::Response, Self::Error>>;
fn poll_ready(&mut self, _cx: &mut Context<'_>) -> Poll<Result<(), Self::Error>> {
Ok(()).into()
}
fn call(&mut self, req: Request<Body>) -> Self::Future {
let rsp = Response::builder();
let uri = req.uri();
if uri.path() != ROOT {
let body = Body::from(Vec::new());
let rsp = rsp.status(404).body(body).unwrap();
return future::ok(rsp);
}
let body = Body::from(Vec::from(&b"heyo!"[..]));
let rsp = rsp.status(200).body(body).unwrap();
future::ok(rsp)
}
}
pub struct MakeSvc;
impl<T> Service<T> for MakeSvc {
type Response = Svc;
type Error = std::io::Error;
type Future = future::Ready<Result<Self::Response, Self::Error>>;
fn poll_ready(&mut self, _cx: &mut Context<'_>) -> Poll<Result<(), Self::Error>> {
Ok(()).into()
}
fn call(&mut self, _: T) -> Self::Future {
future::ok(Svc)
}
}
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
// pretty_env_logger::init();
let addr = "127.0.0.1:1337".parse().unwrap();
let server = Server::bind(&addr).serve(MakeSvc);
println!("Listening on http://{}", addr);
server.await?;
Ok(())
}
The indirection (MakeSvc -> Svc) appears to follow from the architecture of hyper, as described in this issue:
There's two steps involved here, and both make use of Service:
The MakeSvc is a Service that creates Svcs for each connection.
The Svc is a Service to handle requests on a single connection.

How to convert a future's output?

With Tokio's futures, if you want to convert an Error in the causal chain of combinators, you use from_err::<NewType>(). I want the same functionality, but instead for the Item in impl Future<Item = (), Error = ()>.
An example of some of my code:
let mut async_series_client = vec![];
async_series_client.push(Box::new(
SocketHandler::connect(
port,
addr,
handle,
tx_wave,
tx_linear,
KcpSessionManager::new(&handle2).unwrap(),
)
.from_err::<HyxeError>()
.join(tube)
.map_err(|mut err| err.printf()),
));
This returns ((),()) (Side question: does it return a tuple of () because of the join?). I want it to return just (). How can I do this?
Use Future::map. This is a parallel to Option::map, Result::map, and Iterator::map:
use futures::{future, Future}; // 0.1.27
fn some_future() -> impl Future<Item = i32, Error = ()> {
future::ok(42)
}
fn change_item() -> impl Future<Item = String, Error = ()> {
some_future().map(|i| i.to_string())
}
See also Stream::map.
With async/await syntax (stable since Rust 1.39), you may never need to use this combinator again as you can just use normal methods:
async fn some_future() -> i32 {
42
}
async fn change_output() -> String {
some_future().await.to_string()
}
Or Result::map:
async fn some_future() -> Result<i32, ()> {
Ok(42)
}
async fn change_output() -> Result<String, ()> {
some_future().await.map(|i| i.to_string())
}
But it still exists:
use futures::{Future, FutureExt}; // 0.3.0-alpha.16
async fn some_future() -> i32 {
42
}
fn change_output() -> impl Future<Output = String> {
some_future().map(|i| i.to_string())
}

BufWriter::write() doesn't write bytes to TcpStream

I've written an echo server and client in Rust. Here is my code:
Server:
use std::net::{TcpListener, TcpStream};
use std::thread;
use std::io::Write;
use std::io::BufReader;
use std::io::BufRead;
use std::io::BufWriter;
fn handle_connection(stream: TcpStream) {
let stream_clone = stream.try_clone().unwrap();
let mut reader = BufReader::new(stream);
let mut writer = BufWriter::new(stream_clone);
loop {
let mut s = String::new();
reader.read_line(&mut s).unwrap();
writer.write(s.as_bytes()).unwrap();
}
}
fn main() {
let listener = TcpListener::bind("127.0.0.1:8888")
.unwrap();
for stream in listener.incoming() {
thread::spawn(move || {
handle_connection(stream.unwrap());
});
}
}
Client:
use std::net::TcpStream;
use std::io;
use std::io::Write;
use std::io::BufReader;
use std::io::BufRead;
use std::io::BufWriter;
fn main() {
let stream = TcpStream::connect("127.0.0.1:8888")
.unwrap();
let stream_clone = stream.try_clone().unwrap();
let mut reader = BufReader::new(stream);
let mut writer = BufWriter::new(stream_clone);
loop {
let mut s = String::new();
let mut response = String::new();
io::stdin().read_line(&mut s).unwrap();
writer.write(s.as_bytes()).unwrap();
reader.read_line(&mut response).unwrap();
println!("{}", response.trim());
}
}
When I test the code, the server doesn't respond at all. My guess is that something is wrong with the write method. Am I right, or is there another reason?
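You need to flush the buffers with writer.flush(). BufWriter keeps written bytes in memory until its buffer fills up or it is flushed, so without a flush nothing reaches the TcpStream and each side waits for the other.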
Fixed server:
use std::net::{TcpListener, TcpStream};
use std::thread;
use std::io::Write;
use std::io::BufReader;
use std::io::BufRead;
use std::io::BufWriter;
fn handle_connection(stream: TcpStream) {
let stream_clone = stream.try_clone().unwrap();
let mut reader = BufReader::new(stream);
let mut writer = BufWriter::new(stream_clone);
loop {
let mut s = String::new();
reader.read_line(&mut s).unwrap();
writer.write(s.as_bytes()).unwrap();
writer.flush().unwrap();
}
}
fn main() {
let listener = TcpListener::bind("127.0.0.1:8888")
.unwrap();
for stream in listener.incoming() {
thread::spawn(move || {
handle_connection(stream.unwrap());
});
}
}
Client:
use std::net::TcpStream;
use std::io;
use std::io::Write;
use std::io::BufReader;
use std::io::BufRead;
use std::io::BufWriter;
fn main() {
let stream = TcpStream::connect("127.0.0.1:8888")
.unwrap();
let stream_clone = stream.try_clone().unwrap();
let mut reader = BufReader::new(stream);
let mut writer = BufWriter::new(stream_clone);
loop {
let mut s = String::new();
let mut response = String::new();
io::stdin().read_line(&mut s).unwrap();
writer.write(s.as_bytes()).unwrap();
writer.flush().unwrap();
reader.read_line(&mut response).unwrap();
println!("{}", response.trim());
}
}
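The buffering can be observed in isolation. This is a minimal sketch in which a Vec<u8> stands in for the TcpStream, so it needs no network; bytes_reaching_sink is a hypothetical helper name. It also uses write_all rather than write, since write is allowed to accept only part of the slice.

```rust
use std::io::{BufWriter, Write};

/// Writes one line through a `BufWriter` and reports how many bytes have
/// reached the underlying sink. Hypothetical helper for illustration.
fn bytes_reaching_sink(flush: bool) -> std::io::Result<usize> {
    // The Vec<u8> plays the role of the TcpStream.
    let mut writer = BufWriter::new(Vec::<u8>::new());

    // The line is small, so it stays in BufWriter's internal buffer.
    writer.write_all(b"hello\n")?;
    if flush {
        // Push the buffered bytes down into the sink.
        writer.flush()?;
    }
    Ok(writer.get_ref().len())
}

fn main() -> std::io::Result<()> {
    println!("without flush: {}", bytes_reaching_sink(false)?);
    println!("with flush: {}", bytes_reaching_sink(true)?);
    Ok(())
}
```

BufWriter also flushes when it is dropped, but the loops above never drop the writer, which is why the explicit flush after each line is needed.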

How would I make a TcpClient request per item in a futures Stream?

I have a concept project where the client sends a server a number (PrimeClientRequest), the server computes if the value is prime or not, and returns a response (PrimeClientResponse). I want the client to be a simple CLI which prompts the user for a number, sends the request to the server, and displays the response. Ideally I want to do this using TcpClient from Tokio and Streams from Futures-Rs.
I've written a Tokio server using services and I want to reuse the same codec and proto for the client.
Part of the client is a function called read_prompt which returns a Stream. Essentially it is an infinite loop at which each iteration reads in some input from stdin.
Here's the relevant code:
main.rs
use futures::{Future, Stream};
use std::env;
use std::net::SocketAddr;
use tokio_core::reactor::Core;
use tokio_prime::protocol::PrimeClientProto;
use tokio_prime::request::PrimeRequest;
use tokio_proto::TcpClient;
use tokio_service::Service;
mod cli;
fn main() {
let mut core = Core::new().unwrap();
let handle = core.handle();
let addr_string = env::args().nth(1).unwrap_or("127.0.0.1:8080".to_string());
let remote_addr = addr_string.parse::<SocketAddr>().unwrap();
println!("Connecting on {}", remote_addr);
let tcp_client = TcpClient::new(PrimeClientProto).connect(&remote_addr, &handle);
core.run(tcp_client.and_then(|client| {
client
.call(PrimeRequest { number: Ok(0) })
.and_then(|response| {
println!("RESP = {:?}", response);
Ok(())
})
})).unwrap();
}
cli.rs
use futures::{Future, Sink, Stream};
use futures::sync::mpsc;
use std::{io, thread};
use std::io::{Stdin, Stdout};
use std::io::prelude::*;
pub fn read_prompt() -> impl Stream<Item = u64, Error = ()> {
let (tx, rx) = mpsc::channel(1);
thread::spawn(move || loop {
let thread_tx = tx.clone();
let input = prompt(io::stdout(), io::stdin()).unwrap();
let parsed_input = input
.parse::<u64>()
.map_err(|_| io::Error::new(io::ErrorKind::Other, "invalid u64"));
thread_tx.send(parsed_input.unwrap()).wait().unwrap();
});
rx
}
fn prompt(stdout: Stdout, stdin: Stdin) -> io::Result<String> {
let mut stdout_handle = stdout.lock();
stdout_handle.write(b"> ")?;
stdout_handle.flush()?;
let mut buf = String::new();
let mut stdin_handle = stdin.lock();
stdin_handle.read_line(&mut buf)?;
Ok(buf.trim().to_string())
}
With the code above, the client sends a single request to the server before the client terminates. I want to be able to use the stream generated from read_prompt to provide input to the TcpClient and make a request per item in the stream. How would I go about doing this?
The full code can be found at joshleeb/tokio-prime.
The solution I have come up with (so far) is to use loop_fn from the futures-rs crate. It's not ideal, as a new connection still has to be made for every request, but it is at least a step in the right direction.
main.rs
use futures::{future, Future};
use std::{env, io};
use std::net::SocketAddr;
use tokio_core::reactor::{Core, Handle};
use tokio_prime::protocol::PrimeClientProto;
use tokio_prime::request::PrimeRequest;
use tokio_proto::TcpClient;
use tokio_service::Service;
mod cli;
fn handler<'a>(
handle: &'a Handle, addr: &'a SocketAddr
) -> impl Future<Item = (), Error = ()> + 'a {
cli::prompt(io::stdin(), io::stdout())
.and_then(move |number| {
TcpClient::new(PrimeClientProto)
.connect(addr, handle)
.and_then(move |client| Ok((client, number)))
})
.and_then(|(client, number)| {
client
.call(PrimeRequest { number: Ok(number) })
.and_then(|response| {
println!("{:?}", response);
Ok(())
})
})
.or_else(|err| {
println!("! {}", err);
Ok(())
})
}
fn main() {
let mut core = Core::new().unwrap();
let handle = core.handle();
let addr_string = env::args().nth(1).unwrap_or("127.0.0.1:8080".to_string());
let remote_addr = addr_string.parse::<SocketAddr>().unwrap();
println!("Connecting on {}", remote_addr);
let client = future::loop_fn((), |_| {
handler(&handle, &remote_addr)
.map(|_| -> future::Loop<(), ()> { future::Loop::Continue(()) })
});
core.run(client).ok();
}
cli.rs
use futures::prelude::*;
use std::io;
use std::io::{Stdin, Stdout};
use std::io::prelude::*;
#[async]
pub fn prompt(stdin: Stdin, stdout: Stdout) -> io::Result<u64> {
let mut stdout_handle = stdout.lock();
stdout_handle.write(b"> ")?;
stdout_handle.flush()?;
let mut buf = String::new();
let mut stdin_handle = stdin.lock();
stdin_handle.read_line(&mut buf)?;
parse_input(buf.trim().to_string())
}
fn parse_input(s: String) -> io::Result<u64> {
s.parse::<u64>()
.map_err(|_| io::Error::new(io::ErrorKind::Other, "invalid u64"))
}
