Iterate over large tsv.gz from network URL - networking

I would like to iterate over some network files (tsv.gz), parse them (load each row), and write only portions (i.e. columns) to files, e.g. the files at https://datasets.imdbws.com/ (ideally with flate2), but I can't seem to find any idioms for iterating over files from URIs. Should I use an external package like hyper and try to iterate over Body? If so, how can I convert a Body into something that implements Read? Here is some base code:
use flate2::read::GzDecoder;
use hyper::Client;
use std::io::{BufRead, BufReader}; // BufRead provides .lines()
#[tokio::main]
async fn main() -> Result<(), anyhow::Error> {
    let client = Client::new();
    let uri = "http://datasets.imdbws.com/title.basics.tsv.gz".parse()?;
    let body = client.get(uri).await?.into_body();
    let d = GzDecoder::new(body); // hyper::Body doesn't implement Read
    for line in BufReader::new(d).lines() {
        // do something with lines
    }
    Ok(())
}
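The row-and-column part of the task does not depend on where the bytes come from, so it can be sketched against any type that implements Read. The following is a minimal sketch under that assumption: the function name write_columns is hypothetical, the sample rows are illustrative stand-ins, and the network and gzip layers are left out on purpose (in the setting above, the input would be something like a GzDecoder wrapped around a blocking response body).

```rust
use std::io::{self, BufRead, BufReader, Read, Write};

/// Copies only the requested columns of a tab-separated stream to `out`.
/// `input` can be any `Read`; a byte slice is used below as a stand-in
/// for a decompressed network stream (assumption, not the real source).
fn write_columns<R: Read, W: Write>(input: R, mut out: W, columns: &[usize]) -> io::Result<()> {
    for line in BufReader::new(input).lines() {
        let line = line?;
        let fields: Vec<&str> = line.split('\t').collect();
        // Columns missing from a row are written as empty fields instead of panicking.
        let picked: Vec<&str> = columns
            .iter()
            .map(|&i| fields.get(i).copied().unwrap_or(""))
            .collect();
        writeln!(out, "{}", picked.join("\t"))?;
    }
    Ok(())
}

fn main() -> io::Result<()> {
    // Illustrative sample rows, not downloaded data.
    let data = "tconst\ttitleType\tprimaryTitle\ntt0000001\tshort\tCarmencita\n";
    let mut out = Vec::new();
    // Keep the first and third column of every row.
    write_columns(data.as_bytes(), &mut out, &[0, 2])?;
    print!("{}", String::from_utf8_lossy(&out));
    Ok(())
}
```

Because the rows are processed one at a time, memory use stays bounded by the longest line rather than by the file size.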

Related

Understanding how to use the Rust Tokio library

I want to learn the Rust Tokio library. To facilitate this, I want to write an async TCP logger in Rust.
Basically, it is a TCP client that connects to a TCP server (172.16.10.10 port 7777) and asynchronously logs the messages it receives to a log file. I want the main function to read user input from the console (in my case, waiting for the 'q' key to be pressed) to simulate the program doing some other task.
I expect to receive multiple TCP responses while waiting for the user to press the 'q' key.
I am trying to work out how to read and log multiple TCP responses independently of waiting for the user input:
let mut buf_reader = BufReader::new(&stream);
let mut data = vec![];
buf_reader.read_to_end(&mut data).await.unwrap();
log_writer.write_all(&data).await.unwrap();
Here is the code I have
use tokio::net::TcpStream;
use tokio::prelude::*;
use std::io::{stdin, stdout, Write, BufWriter, BufReader};
use std::fs::File;
#[tokio::main]
async fn main() {
    let ip = "172.16.10.10:7777";
    let mut stream = TcpStream::connect(ip).await.unwrap();
    let message = [0x16, 0x02];
    stream.write(&message).await.unwrap();
    // Open a file for logging
    let file = File::create("log.txt").unwrap();
    let mut log_writer = BufWriter::new(file);
    println!("Press 'q' to exit and receive response:");
    stdout().flush().unwrap();
    let mut input = String::new();
    stdin().read_line(&mut input).unwrap();
    if input.trim() == "q" {
        // SIMULATE doing time consuming task
        println!("Quitting");
    }
}
I tried the following, but it blocks on the user input in every iteration of the loop. This is not the behaviour I want. I want to read and log the TCP messages independently of waiting for the user input.
loop {
    stdin().read_line(&mut input).unwrap();
    if input.trim() == "q" {
        break;
    }
    let mut data = vec![];
    buf_reader.read_to_end(&mut data).await.unwrap();
    log_writer.write_all(&data).await.unwrap();
}
When I needed async multithreading, I defined multiple async fn to do what I want, then called them in async fn main as:
let handle1 = tokio::spawn(do_it("test_data.txt"));
let handle2 = tokio::spawn(do_something_else("test_data.txt"));
handle1.await.unwrap();
handle2.await.unwrap();
Since I'm zealous about keeping fn main as minimal as possible, this may not exactly work for you, but may give you a direction.
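The same "do the logging somewhere else while main waits" idea can be sketched without any async runtime. This is a minimal sketch that swaps tokio::spawn for a plain std::thread, so it is an analogy rather than Tokio code: spawn_logger is a hypothetical helper, a byte slice stands in for the TCP stream, and a Vec stands in for log.txt. It also logs each chunk as it arrives instead of calling read_to_end, which only returns once the peer closes the connection.

```rust
use std::io::{self, Read, Write};
use std::thread::{self, JoinHandle};

/// Copies everything from `source` to `log` on a background thread and
/// hands the writer back once the source reaches end-of-stream.
fn spawn_logger<R, W>(mut source: R, mut log: W) -> JoinHandle<io::Result<W>>
where
    R: Read + Send + 'static,
    W: Write + Send + 'static,
{
    thread::spawn(move || -> io::Result<W> {
        let mut buf = [0u8; 1024];
        loop {
            // Write each chunk as soon as it is read instead of waiting for EOF.
            let n = source.read(&mut buf)?;
            if n == 0 {
                break; // end of stream
            }
            log.write_all(&buf[..n])?;
        }
        log.flush()?;
        Ok(log)
    })
}

fn main() -> io::Result<()> {
    // Stand-ins: a byte slice instead of the TCP stream, a Vec instead of log.txt.
    let handle = spawn_logger(&b"first message\nsecond message\n"[..], Vec::new());

    // The main thread is free here, e.g. to block on stdin().read_line(..).

    let logged = handle.join().expect("logger thread panicked")?;
    print!("{}", String::from_utf8_lossy(&logged));
    Ok(())
}
```

With Tokio the structure would be the same, with the loop moved into an async fn passed to tokio::spawn and the blocking reads replaced by awaited ones.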

How to run multiple intervals in tauri with tokio

Currently, I am building a small application with Rust and Tauri, but I've got the following issue that I need to solve:
Things that I want to do simultaneously:
Checking every 10 sec if a specific application is running
Polling every second data from SharedMemory via winapi
Both of them are working fine but I tried to refactor stuff and now I've got the following problem:
When my frontend sends me an event that the application is ready (or inside .on_page_load()), I want to start both processes I mentioned before:
#[tauri::command]
async fn app_ready(window: tauri::Window) {
    let is_alive = false; // I think this needs to be a mutex or a mutex that is wrapped around Arc::new()
    tokio::join!(
        poll_acc_process(&window, &is_alive),
        handle_phycics(&window, &is_alive),
    );
}
Visual Studio Code is complaining about the following stuff: future cannot be sent between threads safely within impl futures::Future<Output = ()>, the trait std::marker::Send is not implemented for *mut c_void
c_void is the handle of CreateFileMappingW from the winapi crate.
async fn poll_acc_process(window: &Window, is_alive: &bool) {
    loop {
        window.emit("acc_process", is_alive).unwrap();
        // sleep() only returns a future; it must be awaited to actually pause
        tokio::time::sleep(time::Duration::from_secs(10)).await;
    }
}
async fn handle_phycics(window: &Window, is_alive: &bool) {
    // is_alive is a reference, so it is dereferenced for the loop condition
    while *is_alive {
        let s_handle = get_statics_mapped_file(); // _handle represents c_void here
        let s_memory = get_statics_mapview_of_file(s_handle);
        window
            .emit("update_statistics", Statics::new_from_memory(s_memory))
            .unwrap();
        let p_handle = get_physics_mapped_file(); // _handle represents c_void here
        let physics = get_physics_mapview_of_file(p_handle);
        window.emit("update_physics", physics).unwrap();
        if physics.current_max_rpm != 0 {
            let g_handle = get_graphics_mapped_file(); // _handle represents c_void here
            let g_memory = get_graphics_mapview_of_file(g_handle);
            window
                .emit("update_graphics", Graphics::new_from_mem(g_memory))
                .unwrap();
        }
        tokio::time::sleep(time::Duration::from_secs(1)).await;
    }
}
Is it possible to solve my problem somehow this way or should I try another approach?
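Regarding the comment on is_alive: one common way to share such a flag between two concurrent loops is an Arc<AtomicBool> rather than a plain bool or a Mutex. The following is only a sketch of that one aspect, using OS threads as a stand-in for the two Tokio futures; run_shared_flag and the counter are hypothetical, and it does not address the Send error caused by the *mut c_void handles.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// One loop polls while the shared flag is set; the other side clears the
/// flag once enough polls have happened. Returns the number of polls that ran.
fn run_shared_flag(polls_wanted: usize) -> usize {
    let is_alive = Arc::new(AtomicBool::new(true));
    let polls = Arc::new(AtomicUsize::new(0));

    // Stand-in for the polling loop: works only while the flag is set.
    let poller = {
        let is_alive = Arc::clone(&is_alive);
        let polls = Arc::clone(&polls);
        thread::spawn(move || {
            while is_alive.load(Ordering::SeqCst) {
                polls.fetch_add(1, Ordering::SeqCst);
                thread::sleep(Duration::from_millis(1));
            }
        })
    };

    // Stand-in for the process check: clears the flag when it decides to stop.
    while polls.load(Ordering::SeqCst) < polls_wanted {
        thread::sleep(Duration::from_millis(1));
    }
    is_alive.store(false, Ordering::SeqCst);

    poller.join().expect("poller thread panicked");
    polls.load(Ordering::SeqCst)
}

fn main() {
    let polls = run_shared_flag(3);
    println!("poller ran {} times before the flag was cleared", polls);
}
```

Each side holds its own clone of the Arc, so neither loop needs a borrowed &bool that would have to outlive the other.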

How to write type definition with tokio_serde::SymmetricallyFramed::new()?

I am trying to serialize CBOR using tokio-serde.
I can make a simple program work, but I need to actually store the tokio_serde::SymmetricallyFramed::new() in a structure to use it more than once.
(It consumes the socket, which is cool).
I can't seem to write a type that will store the value.
use futures::prelude::*;
use tokio::net::TcpStream;
use tokio_serde::formats::*;
use tokio_util::codec::{FramedWrite, LengthDelimitedCodec};
#[tokio::main]
pub async fn main() {
    // Connect to the server
    let socket = TcpStream::connect("127.0.0.1:17653").await.unwrap();
    // Delimit frames using a length header
    let length_delimited = FramedWrite::new(socket, LengthDelimitedCodec::new());
    // Serialize frames with CBOR
    let mut serialized = tokio_serde::SymmetricallyFramed::new(length_delimited, SymmetricalCbor::default());
    // Send the value
    serialized
        .send(vec![1i32,2,3])
        .await
        .unwrap()
}
This produces the right output. (Adapted from the JSON example in the tokio-serde crate, here: https://github.com/carllerche/tokio-serde/blob/master/examples/client.rs)
I want to put "serialized" into a structure (and hide how it is created in a fn), but I can't seem to write the right type.
use futures::prelude::*;
use serde_cbor;
use tokio::net::TcpStream;
use tokio_serde::formats::*;
use tokio_util::codec::{FramedWrite, LengthDelimitedCodec};
type CborWriter = tokio_serde::Framed<
    tokio_util::codec::FramedWrite<tokio::net::TcpStream, tokio_util::codec::LengthDelimitedCodec>,
    serde_cbor::Value,
    serde_cbor::Value,
    tokio_serde::formats::Cbor<serde_cbor::Value, serde_cbor::Value>,
>;
// something like this has been suggested, but so far no luck.
// fn setup_writer(socket: tokio::net::TcpStream) -> impl Sink<??>+Debug {
fn setup_writer(socket: tokio::net::TcpStream) -> CborWriter {
    // Delimit frames using a length header
    let length_delimited = FramedWrite::new(socket, LengthDelimitedCodec::new());
    // Serialize frames with CBOR
    let serialized = tokio_serde::SymmetricallyFramed::new(length_delimited, SymmetricalCbor::default());
    return serialized;
}
#[tokio::main]
pub async fn main() {
    // Connect to the server
    let socket = TcpStream::connect("127.0.0.1:17653").await.unwrap();
    // Serialize frames with CBOR
    let mut serialized = setup_writer(socket);
    // Send the value
    serialized
        .send(serde_cbor::Value::Array(vec![
            serde_cbor::Value::Integer(1i128),
            serde_cbor::Value::Integer(2i128),
            serde_cbor::Value::Integer(3i128),
        ]))
        .await
        .unwrap()
}
But I don't want to put serde_cbor::Value in. I should just be able to put the Serializable objects in. So I am obviously going in the wrong direction here. The JSON example in the tokio-serde crate is happy to put in/out serde_json::Value, but I shouldn't have to do that, I think.
A suggestion on Discord was made to change the first example as:
let mut serialized: () = tokio_serde::SymmetricallyFramed::new(length_delimited, SymmetricalCbor::default());
and let the compiler tell me what the type is:
= note: expected unit type `()`
found struct `tokio_serde::Framed<tokio_util::codec::FramedWrite<tokio::net::TcpStream, tokio_util::codec::LengthDelimitedCodec>, _, _, tokio_serde::formats::Cbor<_, _>>`
Well, I can't put _ into the type alias, or write it directly.
I think it should say something like "impl Serialize", but that's not yet a thing.
Obviously, the compiler gets the first example right, so there must be something that will go in there... but what?

How can I copy a vector to another location and reuse the existing allocated memory?

In C++, to copy the contents of a vector to another vector we use the assignment operator dest = src. However, in Rust src would be moved into dest and no longer usable.
I know the simplest answer is to do dest = src.clone() (for the sake of this question we'll assume T in Vec<T> is Clone). However - if I'm understanding correctly - this creates a brand new third vector with the copied contents of src and moves it into dest, throwing away dest's dynamically allocated array. If this is correct, it's a completely unnecessary dynamic allocation when we could have just copied the content directly into dest (assuming it had sufficient capacity).
Below is a function I've made that does exactly what I would like to do: empty out the dest vector and copy the elements of src to it.
// copy contents of src to dest without just cloning src
fn copy_content<T: Clone>(dest: &mut Vec<T>, src: &Vec<T>) {
    dest.clear();
    if dest.capacity() < src.len() {
        dest.reserve(src.len());
    }
    for x in src {
        dest.push(x.clone());
    }
}
Is there a way to do this with builtin or standard library utilities? Is the dest = src.clone() optimized by the compiler to do this anyway?
I know that if T has dynamic resources then the extra allocation from src.clone() isn't a big deal, but if T is e.g. i32 or any other Copy type then it forces an allocation where none are necessary.
Did you ever look at the definition of Clone? It has the well known clone method but also a useful but often forgotten clone_from method:
pub trait Clone : Sized {
    fn clone(&self) -> Self;
    fn clone_from(&mut self, source: &Self) {
        *self = source.clone()
    }
}
To quote the doc:
Performs copy-assignment from source.
a.clone_from(&b) is equivalent to a = b.clone() in functionality, but can be overridden to reuse the resources of a to avoid unnecessary allocations.
Of course a type such as Vec does not use the provided-by-default clone_from and defines its own in a more efficient way, similar to what you would get in C++ from writing dest = src:
fn clone_from(&mut self, other: &Vec<T>) {
    other.as_slice().clone_into(self);
}
with [T]::clone_into being defined as:
fn clone_into(&self, target: &mut Vec<T>) {
    // drop anything in target that will not be overwritten
    target.truncate(self.len());
    let len = target.len();
    // reuse the contained values' allocations/resources.
    target.clone_from_slice(&self[..len]);
    // target.len <= self.len due to the truncate above, so the
    // slice here is always in-bounds.
    target.extend_from_slice(&self[len..]);
}
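To see the effect from the caller's side, here is a small sketch that uses clone_from on a destination with spare capacity and checks whether the heap buffer was kept. The helper name copy_reusing and the sample values are made up for illustration; comparing the pointer before and after is just one way to observe the reuse.

```rust
/// Copies `src` into `dest` with `clone_from` and reports whether the
/// heap buffer of `dest` was reused (same pointer before and after).
fn copy_reusing<T: Clone>(dest: &mut Vec<T>, src: &Vec<T>) -> bool {
    let before = dest.as_ptr();
    dest.clone_from(src);
    before == dest.as_ptr()
}

fn main() {
    let src = vec![1, 2, 3];
    // The destination already owns a buffer that is large enough for src.
    let mut dest: Vec<i32> = Vec::with_capacity(16);
    dest.extend_from_slice(&[9, 9, 9, 9, 9]);

    let reused = copy_reusing(&mut dest, &src);
    println!(
        "dest = {:?}, capacity = {}, buffer reused = {}",
        dest,
        dest.capacity(),
        reused
    );
}
```

When dest does not have enough capacity, clone_from still works, but extend_from_slice has to grow the buffer, so the pointer may change in that case.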

How to chain tokio read functions?

Is there a way to chain the read_* functions in tokio::io in a "recursive" way?
I'm essentially looking to do something like:
read_until x then read_exact y then write response then go back to the top.
In case you are confused about which functions I'm talking about: https://docs.rs/tokio/0.1.11/tokio/io/index.html
Yes, there is a way.
read_until returns a struct ReadUntil, which implements the Future trait, which itself provides a lot of useful functions, e.g. and_then, which can be used to chain futures.
A simple (and silly) example looks like this:
extern crate futures; // 0.1.24
extern crate tokio_io; // 0.1.8
use futures::future::Future;
use std::io::Cursor;
use tokio_io::io::{read_exact, read_until};
fn main() {
    let cursor = Cursor::new(b"abcdef\ngh");
    let mut buf = vec![0u8; 2];
    println!(
        "{:?}",
        String::from_utf8_lossy(
            read_until(cursor, b'\n', vec![])
                .and_then(|r| read_exact(r.0, &mut buf))
                .wait()
                .unwrap()
                .1
        )
    );
}
Here I use a Cursor, which happens to implement the AsyncRead trait, and use the read_until function to read until a newline occurs (between 'f' and 'g').
Afterwards, to chain those, I use and_then to call read_exact on success, use wait to get the Result, unwrap it (don't do this in production, kids!) and take the second element of the tuple (the first one is the cursor).
Last, I convert the Vec into a String to display "gh" with println!.
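For comparison, the same read-until-then-read-exact sequence can be written with the blocking traits from the standard library. This is a sketch of the sequencing only, not of the Tokio API: read_until_then_exact is a hypothetical helper, and it uses std::io::BufRead and Read instead of futures, so nothing here is asynchronous.

```rust
use std::io::{self, BufRead, Cursor};

/// Reads up to and including `delim`, then reads exactly `n` more bytes.
/// Returns (delimited part, exact part).
fn read_until_then_exact<R: BufRead>(
    reader: &mut R,
    delim: u8,
    n: usize,
) -> io::Result<(Vec<u8>, Vec<u8>)> {
    let mut head = Vec::new();
    reader.read_until(delim, &mut head)?;
    let mut tail = vec![0u8; n];
    // read_exact comes from Read, which is a supertrait of BufRead.
    reader.read_exact(&mut tail)?;
    Ok((head, tail))
}

fn main() -> io::Result<()> {
    // Same input as in the example above, wrapped in an in-memory cursor.
    let mut cursor = Cursor::new(&b"abcdef\ngh"[..]);
    let (head, tail) = read_until_then_exact(&mut cursor, b'\n', 2)?;
    println!(
        "{:?} then {:?}",
        String::from_utf8_lossy(&head),
        String::from_utf8_lossy(&tail)
    );
    Ok(())
}
```

To "go back to the top", the call can be placed in a loop that writes the response after each pair and stops when read_exact reports an unexpected end of input.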

Resources