I am using a futures-rs powered version of the Rusoto AWS Kinesis library. I need to spawn a deep pipeline of AWS Kinesis requests to achieve high-throughput because Kinesis has a limit of 500 records per HTTP request. Combined with the 50ms latency of sending a request, I need to start generating many concurrent requests. I am looking to create somewhere on the order of 100 in-flight requests.
The Rusoto put_records function signature looks like this:
fn put_records(
&self,
input: &PutRecordsInput,
) -> RusotoFuture<PutRecordsOutput, PutRecordsError>
The RusotoFuture is a wrapper defined like this:
/// Future that is returned from all rusoto service APIs.
pub struct RusotoFuture<T, E> {
inner: Box<Future<Item = T, Error = E> + 'static>,
}
The inner Future is wrapped, but RusotoFuture still implements Future::poll(), so I believe it is compatible with the futures-rs ecosystem. RusotoFuture also provides a blocking call:
impl<T, E> RusotoFuture<T, E> {
/// Blocks the current thread until the future has resolved.
///
/// This is meant to provide a simple way for non-async consumers
/// to work with rusoto.
pub fn sync(self) -> Result<T, E> {
self.wait()
}
}
I can issue a request and sync() it, getting the result from AWS. I would like to create many requests, put them in some kind of queue/list, and gather finished requests. If the request errored I need to reissue the request (this is somewhat normal in Kinesis, especially when hitting limits on your shard throughput). If the request is completed successfully I should issue a request with new data. I could spawn a thread for each request and sync it but that seems inefficient when I have the async IO thread running.
I have tried using futures::sync::mpsc::channel from my application thread (not running from inside the Tokio reactor) but whenever I clone the tx it generates its own buffer, eliminating any kind of backpressure on send:
fn kinesis_pipeline(client: DefaultKinesisClient, stream_name: String, num_puts: usize, puts_size: usize) {
use futures::sync::mpsc::{ channel, spawn };
use futures::{ Sink, Future, Stream };
use futures::stream::Sender;
use rusoto_core::reactor::DEFAULT_REACTOR;
let client = Arc::new(KinesisClient::simple(Region::UsWest2));
let data = FauxData::new(); // a data generator for testing
let (mut tx, mut rx) = channel(1);
for rec in data {
tx.clone().send(rec);
}
}
Without the clone, I have the error:
error[E0382]: use of moved value: `tx`
--> src/main.rs:150:9
|
150 | tx.send(rec);
| ^^ value moved here in previous iteration of loop
|
= note: move occurs because `tx` has type `futures::sync::mpsc::Sender<rusoto_kinesis::PutRecordsRequestEntry>`, which does not implement the `Copy` trait
I have also looked at futures::sync::mpsc::spawn based on recommendations, but it takes ownership of the rx (as a Stream) and does not solve my problem with the clone of tx causing unbounded behavior.
I'm hoping if I can get the channel/spawn usage working, I will have a system which takes RusotoFutures, waits for them to complete, and then provides me an easy way to grab completion results from my application thread.
As far as I can tell, your problem with channel is not that a single clone of the Sender increases the capacity by one; it is that you clone the Sender for every item you're trying to send.
The error you're seeing without clone comes from your incorrect usage of the Sink::send interface. With clone you actually should see the warning:
warning: unused `futures::sink::Send` which must be used: futures do nothing unless polled
That is: your current code doesn't actually ever send anything!
In order to apply backpressure you need to chain those send calls; each one should wait until the previous one finished (and you need to wait for the last one too!); on success you'll get the Sender back. The best way to do this is to generate a Stream from your iterator by using iter_ok and to pass it to send_all.
Now you have a single SendAll future that you need to "drive". If you ignore the result and panic on error (.then(|r| { r.unwrap(); Ok::<(), ()>(()) })), you could spawn it as a separate task, but you may prefer to integrate it into your main application (i.e. return it in a Box).
// this returns a `Box<Future<Item = (), Error = ()>>`. you may
// want to use a different error type
Box::new(tx.send_all(iter_ok(data)).map(|_| ()).map_err(|_| ()))
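To see the effect of reusing one sender, here is a minimal sketch that uses the standard library's std::sync::mpsc::sync_channel as a stand-in for the futures channel (an assumption for illustration; the blocking semantics differ from futures-rs, and the function names are made up). A single sender that is never cloned is held back once the bound is reached:

```rust
use std::sync::mpsc::{sync_channel, TrySendError};
use std::thread;

/// Pushes items into a bounded channel from ONE sender until the channel
/// reports that it is full, and returns how many items were accepted.
fn fill_until_full(bound: usize) -> usize {
    // `_rx` is kept alive so the channel stays connected.
    let (tx, _rx) = sync_channel::<u32>(bound);
    let mut accepted = 0;
    for item in 0.. {
        match tx.try_send(item) {
            Ok(()) => accepted += 1,
            // The bound is reached: this is the backpressure signal.
            Err(TrySendError::Full(_)) => break,
            Err(TrySendError::Disconnected(_)) => break,
        }
    }
    accepted
}

/// Sends `count` items through a bounded channel with a blocking producer
/// and returns them in the order the consumer received them.
fn run_pipeline(count: u32, bound: usize) -> Vec<u32> {
    let (tx, rx) = sync_channel::<u32>(bound);
    let producer = thread::spawn(move || {
        for item in 0..count {
            // Blocks while `bound` items are already queued.
            tx.send(item).expect("receiver hung up");
        }
        // `tx` is dropped here, which ends the receiver's iteration.
    });
    let received: Vec<u32> = rx.iter().collect();
    producer.join().expect("producer panicked");
    received
}
```

For example, fill_until_full(4) accepts exactly four items, and run_pipeline(10, 2) delivers all ten items in order while never holding more than two in the queue.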
RusotoFuture::sync and Future::wait
Don't use Future::wait: it is already deprecated in a branch, and it usually won't do what you actually are looking for. I doubt RusotoFuture is aware of the problems, so I recommend avoiding RusotoFuture::sync.
Cloning Sender increases channel capacity
As you correctly stated, cloning Sender increases the capacity by one.
This seems to be done to improve performance: A Sender starts in the unblocked ("unparked") state; if a Sender isn't blocked it can send an item without blocking. But if the number of items in the queue hits the configured limit when a Sender sends an item, the Sender becomes blocked ("parked"). (Removing items from the queue will unblock the Sender at a certain time.)
This means that after the inner queue hits the limit each Sender still can send one item, which leads to the documented effect of increased capacity, but only if actually all the Senders are sending items - unused Senders don't increase the observed capacity.
The performance boost comes from the fact that as long as you don't hit the limit it doesn't need to park and notify tasks (which is quite heavy).
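The parking rule can be illustrated with a toy, single-threaded model. This is only a sketch of the behaviour described above, not the futures-rs implementation; ToyChannel, try_send, recv and max_queued are hypothetical names, and real senders are tasks that get notified rather than flags in a vector.

```rust
use std::collections::VecDeque;

/// Toy model of the parking rule (NOT the real futures-rs code).
struct ToyChannel {
    buffer: usize,         // configured limit of the channel
    queue: VecDeque<u32>,  // queued items
    parked: Vec<bool>,     // one "parked" flag per sender
}

impl ToyChannel {
    fn new(buffer: usize, senders: usize) -> Self {
        ToyChannel { buffer, queue: VecDeque::new(), parked: vec![false; senders] }
    }

    /// Returns false if `sender` is parked and has to wait.
    fn try_send(&mut self, sender: usize, item: u32) -> bool {
        if self.parked[sender] {
            return false;
        }
        // An unparked sender may always enqueue its item...
        self.queue.push_back(item);
        // ...but it becomes parked if that pushed the queue over the limit.
        if self.queue.len() > self.buffer {
            self.parked[sender] = true;
        }
        true
    }

    /// Removes one item and unparks one parked sender, if any.
    fn recv(&mut self) -> Option<u32> {
        let item = self.queue.pop_front();
        if item.is_some() {
            if let Some(flag) = self.parked.iter_mut().find(|p| **p) {
                *flag = false;
            }
        }
        item
    }
}

/// Lets the first `active` senders send until all of them are parked and
/// returns how many items ended up in the queue.
fn max_queued(buffer: usize, senders: usize, active: usize) -> usize {
    let mut ch = ToyChannel::new(buffer, senders);
    let mut progress = true;
    while progress {
        progress = false;
        for s in 0..active {
            if ch.try_send(s, 0) {
                progress = true;
            }
        }
    }
    ch.queue.len()
}
```

In this model a buffer of 1 with three active senders holds four items, while the same channel with only one of the three senders active holds two: idle clones do not add capacity.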
The private documentation at the top of the mpsc module describes more of the details.
I am new to Akka and still trying to understand the different Akka and streaming concepts. For a new feature I need to add an HTTP call to an already existing stream which works on an internal object. Something like this -
val step1Flow = Flow[SampleObject].filter(...--Filtering condition--...)
val step2Flow = Flow[SampleObject].map(obj => {
...
-- Business logic to update values in the obj --
...
})
...
override val flowGraph: Flow[SampleObject, SampleObject, NotUsed] =
bufferIn.via(Flow.fromGraph(GraphDSL.create() {
implicit builder =>
import GraphDSL.Implicits._
...
val step1 = builder.add(step1Flow)
val step2 = builder.add(step2Flow)
val step3 = builder.add(step3Flow)
...
source ~> step1 ~> step2 ~> step3 ~> merge
...
}
I need to add the new HTTP request flow (let's call it newFlow) after step1. All these flows have SampleObject as both inlet and outlet. My understanding is that newFlow would need to be blocking because its outlet must be SampleObject. For that I have used Await on the future of the HTTP call. The code looks like this -
val responseFuture: Future[(Try[HttpResponse], SomeContext)] =
Source
.single(httpRequest -> context)
.via(Retry(retrySettings).join(clientFlow))
.runWith(Sink.head)
...
val (httpTry, passedAlongContext) = Await.result(responseFuture, 30.seconds)
-- logic to process response and return SampleObject --
Now this works fine, but I think there should be a better way to do this without using Await. I also think this would block the main thread until the request completes, which is going to affect the app throughput.
Could you please tell me whether the approach I used is correct? And how do I make use of some other thread pool to handle these blocking calls so that my main thread pool is not affected?
This question seems very similar to mine but I do not understand it completely - connect Akka HTTP to Akka stream. Also, I can't change step2 or any of the later flows.
EDIT : Added some code details for the stream
I ended up using the approach mentioned in the question because I couldn't find anything better after looking around. Adding this step decreased the throughput of my application as expected, but there are approaches that can be used to increase it again. Check these awesome blogs by Colin Breck -
https://blog.colinbreck.com/maximizing-throughput-for-akka-streams/
https://blog.colinbreck.com/partitioning-akka-streams-to-maximize-throughput/
To summarize -
Use Asynchronous Boundaries for flows which are blocking.
Use Futures if possible and add callbacks to futures. There are several ways to do that.
Use Buffers. There are several types of buffers available, choose what suits your needs.
Other than these, you can use inbuilt flows like -
Use "Broadcast" to broadcast your events to multiple consumers.
Use "Partition" to partition your stream into multiple streams based on some condition.
Use "Balance" to partition your stream when there is no logical way to partition your events or they all could have different work loads.
You could use any one or multiple things from above options.
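The ideas of a buffer, a "Balance"-style fan-out and keeping blocking calls off the main thread are not specific to Akka. As a rough, language-neutral illustration, here is a sketch in Rust using only the standard library; it is an analogue under my own assumptions, not Akka code, and process_balanced and the 5 ms sleep (a stand-in for the blocking HTTP call) are made up for the example.

```rust
use std::sync::mpsc::{channel, sync_channel};
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

/// Distributes `inputs` over `workers` threads, each of which performs a
/// blocking step, and returns the sorted results.
fn process_balanced(inputs: Vec<u64>, workers: usize) -> Vec<u64> {
    assert!(workers > 0, "at least one worker is required");
    // Bounded queue: plays the role of a buffer with backpressure.
    let (work_tx, work_rx) = sync_channel::<u64>(workers);
    // Workers share one receiver, so whichever is free takes the next item.
    let work_rx = Arc::new(Mutex::new(work_rx));
    let (done_tx, done_rx) = channel::<u64>();

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let work_rx = Arc::clone(&work_rx);
            let done_tx = done_tx.clone();
            thread::spawn(move || loop {
                // The lock is released as soon as this statement ends.
                let job = work_rx.lock().unwrap().recv();
                match job {
                    Ok(n) => {
                        // Stand-in for a blocking call such as an HTTP request.
                        thread::sleep(Duration::from_millis(5));
                        done_tx.send(n * 2).unwrap();
                    }
                    Err(_) => break, // the work queue was closed
                }
            })
        })
        .collect();
    drop(done_tx); // only the workers hold result senders now

    for n in inputs {
        work_tx.send(n).unwrap(); // blocks while the buffer is full
    }
    drop(work_tx); // closing the queue lets the workers exit

    let mut results: Vec<u64> = done_rx.iter().collect();
    for handle in handles {
        handle.join().unwrap();
    }
    results.sort();
    results
}
```

The caller only blocks when the bounded queue is full, and completion order is not preserved, which is why the results are sorted before being returned.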
Boost's asio library allows the serialisation of asynchronous code in the following way. Handlers to asynchronous functions such as those which read from a stream, may be associated to a strand. A strand is associated with an "IO context". An IO context owns a thread pool. However many threads in the pool, it is guaranteed that no two handlers associated with the same strand are run concurrently. This makes it possible, for instance, to implement a state machine as if it were single-threaded, where all handlers for that machine serialise over a private strand.
I have been trying to figure out how this might be done with F#'s Async. I could not find any way to make sure that chosen sets of Async processes never run concurrently. Can anyone suggest how to do this?
It would be useful to know what is the use case that you are trying to implement. I don't think F# async has anything that would directly map to strands and you would likely use different techniques for implementing different things that might all be implemented using strands.
For example, if you are concerned with reading data from a stream, an F# async block lets you write code that is asynchronous but sequential. The following runs a single logical process (which might be moved between threads of a thread pool when you wait using let!):
open System.IO

let readTest () = async {
    let fs = File.OpenRead(@"C:\Temp\test.fs")
    let buffer = Array.zeroCreate 10
    let mutable read = 1
    while read <> 0 do
        // Each read starts only after the previous one has completed.
        let! r = fs.AsyncRead(buffer, 0, 10)
        printfn "Read: %A" buffer.[0 .. r-1]
        read <- r }

readTest() |> Async.Start
If you wanted to deal with events that occur without any control (i.e. push based rather than pull based), for example, when you cannot ask the system to read next buffer of data, you could serialize the events using a MailboxProcessor. The following sends two messages to the agent almost at the same time, but they are processed sequentially, with 1 second delay:
let agent = MailboxProcessor.Start(fun inbox -> async {
    while true do
        let! msg = inbox.Receive()
        printfn "Got: %s" msg
        do! Async.Sleep(1000)
})

agent.Post("hello")
agent.Post("world")
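The serialisation guarantee of a strand or an agent comes from having exactly one consumer that owns the state. As a rough analogue, here is a sketch in Rust using only the standard library; it is not F# and not asio, and Msg, spawn_counter and run_producers are hypothetical names. Handlers posted from many threads run one at a time on a single thread:

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread;

/// Messages understood by the agent.
enum Msg {
    Add(u64),
    Get(Sender<u64>), // carries a reply channel
}

/// Starts an agent thread that owns a counter and returns its mailbox.
fn spawn_counter() -> Sender<Msg> {
    let (tx, rx) = channel::<Msg>();
    thread::spawn(move || {
        // Only this thread touches `total`, so no lock is needed.
        let mut total = 0u64;
        // Messages are handled strictly one after another.
        for msg in rx {
            match msg {
                Msg::Add(n) => total += n,
                Msg::Get(reply) => {
                    let _ = reply.send(total);
                }
            }
        }
    });
    tx
}

/// Posts from several threads at once and returns the final total.
fn run_producers(producers: u64, per_producer: u64) -> u64 {
    let agent = spawn_counter();
    let handles: Vec<_> = (0..producers)
        .map(|_| {
            let agent = agent.clone();
            thread::spawn(move || {
                for _ in 0..per_producer {
                    agent.send(Msg::Add(1)).unwrap();
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    // All Add messages were sent before this point, so Get is handled last.
    let (reply_tx, reply_rx) = channel();
    agent.send(Msg::Get(reply_tx)).unwrap();
    reply_rx.recv().unwrap()
}
```

Even with four producer threads posting concurrently, the counter is never updated by two handlers at once, so run_producers(4, 1000) yields 4000 without any locking in the handler.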
I have a Rust Tokio TCP server. Each client is handled by the Tokio future chain that looks like this:
let stream = <TcpStream from elsewhere>;
let task = database_connection
.and_then(|connection| {
tokio::io::write_all(stream, SomeSuccessData);
}).map_err(|error| {
tokio::io::write_all(stream, SomeErrorData(error));
});
...
tokio::spawn(task);
The issue is I cannot use the same TcpStream in multiple branches of the chain, because tokio::io::write_all consumes the stream, even though it is supposed to be used in sequential manner. It is crucial to send different data depending on if there was, e.g., a database error.
How can I overcome this problem? Maybe there is a different API?
The documentation for io::write_all states:
Any error which happens during writing will cause both the stream and the buffer to get destroyed.
Since your code appears to be attempting to send a network message to indicate that the previous network message failed (which seems... dubious), the TcpStream is already gone by the time you try to send the second message.
The easiest solution is thus to clone the stream:
let stream2 = stream.try_clone().expect("Couldn't clone");
let task = database_connection
.and_then(|_| io::write_all(stream, b"success"))
.map_err(|_| io::write_all(stream2, b"error"));
If you only wanted to try to report the failure of the database connection, it's much easier: use Future::then instead of and_then:
let task = database_connection.then(|connection| match connection {
    // Only one arm runs, so both arms may consume the same `stream`.
    Ok(_) => io::write_all(stream, &b"success"[..]),
    Err(_) => io::write_all(stream, &b"error"[..]),
});
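The reason a single closure over both outcomes helps can be shown without Tokio. In this minimal sketch, OneShotStream is a hypothetical stand-in for a stream that is consumed by writing to it, and a plain Result plays the role of the finished database future. Because only one match arm ever runs, the same owned value can be moved into either arm, which two separate closures (one for success, one for failure) do not allow:

```rust
/// Hypothetical stand-in for a stream that is consumed by writing to it.
struct OneShotStream {
    log: Vec<u8>,
}

impl OneShotStream {
    /// Takes `self` by value, like a write that consumes the stream,
    /// and returns everything that was written.
    fn write_all(mut self, data: &[u8]) -> Vec<u8> {
        self.log.extend_from_slice(data);
        self.log
    }
}

/// Handles success and failure in ONE place, so `stream` is moved exactly
/// once, into whichever arm actually runs.
fn respond(stream: OneShotStream, db: Result<u32, String>) -> Vec<u8> {
    match db {
        Ok(_) => stream.write_all(b"success"),
        Err(_) => stream.write_all(b"error"),
    }
}
```

Calling respond with Ok(1) returns the bytes of "success", and calling it with an Err returns the bytes of "error"; no clone of the stream is needed in either case.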
I'm investigating F# agents that have multiple states, i.e., using the "let rec/and" keyword combination (per Expert F# 3.0's "Message Processing and State Machines") to provide multiple async blocks. The only example I've been able to find so far is the "throttling agent" discussed here (also Fssnip.net). Are there any other resources for learning this pattern?
edit: My specific application is an agent that has two states,
| StartFeed rateMultiplier replychannel ->
- replychannel out data values at a delay (provided with each value)
multiplied by rateMultiplier
- loop by using
thisAgent.Post(StartFeed rateMultiplier replychannel)
| Pause ->
I would like to provide some way to pass in a feed rate multiplier value that increases/decreases the delay by the passed-in multiplier in the "feed" async state, without interrupting the feed of values. I guess the question boils down to "how do you keep an async state block actively looping while still being aware of new messages?" Almost like skipping the inbox.Receive asynchronous wait, unless a message actually comes in? Inbox.scan?
edit 2: Given the message queue aspect of MailboxProcessor, I can see that an external message (with a different rateMultiplier value) that is received by the agent and placed in the queue will successfully change the rate without interrupting the flow of data values out. Any advice on the "Pause" would still be appreciated.
I have found Tomas Petricek's entry https://github.com/tpetricek/FSharp.AsyncExtensions/blob/master/src/Agents/BlockingQueueAgent.fs , which gives an agent, with the standard mailboxprocessor queue, a way to choose what async block it will employ to process the next incoming message (ie, let the agent 'change its state'):
inbox.Receive() is used for the 'standard state' - the agent's message 'inbox' queue is neither full nor empty (State #1)
inbox.Scan() is used for the 'edge' or limiting cases of empty (State #2) and full (State #3) message 'inbox' queue
the actions the agent (in whichever of the three states) can take in response to received messages are written as distinct async blocks that are given their own 'and' async block in the agent's 'let rec' loop; I had thought that 'let rec...and...' async blocks were restricted to having a message receipt function (.Receive, .Scan, etc.), which is incorrect: they may be any async block that maintains the desired control flow, as seen in the next feature of the 'let rec...and...' agent body:
once the agent, in whichever of the 3 states, responds to a new message by routing to the appropriate action, the action is itself finished with a call to another 'and' async block of the agent body 'let rec' loop, a 'chooseState()', an if/then block that determines which state will handle a new message and calls that 'and' async block from among the 3 available.
This example seems essential in demonstrating idiomatic use of the multi-state agent body construction, specifically how to combine the three functions of message receipt, response, and looping control as mutually recursive elements of a single 'let rec...and...and...' construction.
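The control flow described above (choose a state, receive or scan for a message that state can handle, act, then choose a state again) can be sketched outside F# as well. The following is a loose, single-threaded analogue in Rust, not a translation of BlockingQueueAgent; Msg, State, choose_state, scan and run_agent are hypothetical names, and the mutually recursive async blocks are replaced by a plain loop.

```rust
use std::collections::VecDeque;

#[derive(Debug, Clone, Copy, PartialEq)]
enum Msg {
    Add(u32),
    Get,
}

#[derive(Debug, PartialEq)]
enum State {
    Empty,
    Full,
    Running,
}

/// Decides which state handles the next message (the 'chooseState' step).
fn choose_state(queue_len: usize, max_len: usize) -> State {
    if queue_len == 0 {
        State::Empty
    } else if queue_len >= max_len {
        State::Full
    } else {
        State::Running
    }
}

/// Scan analogue: removes and returns the first message accepted by
/// `accept`, leaving all other messages queued in their original order.
fn scan(inbox: &mut VecDeque<Msg>, accept: impl Fn(&Msg) -> bool) -> Option<Msg> {
    let pos = inbox.iter().position(|m| accept(m))?;
    inbox.remove(pos)
}

/// Runs the agent over a pre-filled inbox and returns the delivered values.
fn run_agent(mut inbox: VecDeque<Msg>, max_len: usize) -> Vec<u32> {
    let mut queue = VecDeque::new();
    let mut delivered = Vec::new();
    loop {
        let next = match choose_state(queue.len(), max_len) {
            // Nothing to hand out: only Add messages are accepted.
            State::Empty => scan(&mut inbox, |m| matches!(m, Msg::Add(_))),
            // No room left: only Get messages are accepted.
            State::Full => scan(&mut inbox, |m| matches!(m, Msg::Get)),
            // Normal case: take whatever message comes first.
            State::Running => inbox.pop_front(),
        };
        match next {
            Some(Msg::Add(v)) => queue.push_back(v),
            Some(Msg::Get) => delivered.extend(queue.pop_front()),
            // No message that the current state can handle is left.
            None => return delivered,
        }
    }
}
```

With an inbox of Get, Add 1, Add 2, Add 3, Get, Get and a maximum length of 2, the early Get is skipped while the queue is empty and picked up later, and the values come out as 1, 2, 3.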
Of course other message-passing frameworks exist, but this is a general logic/routing design for a more complex agent, whatever the framework, so:
thanks, Tomas.
The problem
One data source generating data in format {key, value}
Multiple receivers each waiting for different key
Example
Data retrieval runs in a loop. Sometimes I will want to get the next value labelled with a given key by using
Value = MyClass:GetNextValue(Key)
I want my code to stop there until the value is ready (making some sort of future(?) value). I've tried using simple coroutines, but they work only when waiting for any data.
So the question I want to ask is something like How to implement async values in lua using coroutines or similar concept (without threads)?
Side notes
The main processing function will, apart from returning values to waiting consumers, process some of incoming data (say, labeled with special key) itself.
The full usage context should look something like:
-- in loop
ReceiveData()
ProcessSpecialData()
--
-- Called outside the loop:
V = RequestDataWithGivenKey(Key)
How to implement async values
You start by not implementing async values. You implement async functions: you don't get the value back until it has been retrieved.
First, your code must be in a Lua coroutine. I'll assume you understand the care and feeding of coroutines. I'll focus on how to implement RequestDataWithGivenKey:
function RequestDataWithGivenKey(key)
    local request = FunctionThatStartsAsyncGetting(key)
    if(not request:IsComplete()) then
        coroutine.yield()
    end
    --Request is complete. Return the value.
    return request:GetReturnedValue()
end
FunctionThatStartsAsyncGetting returns a request back to the function. The request is an object that stores all of the data needed to process the specific request. It represents asking for the value. This should be a C-function that starts the actual async getting.
The request will be either a userdata or an encapsulated Lua table that stores enough information to communicate with the C-code that's doing the async fetching. IsComplete uses the internal request data to see if that request has completed. GetReturnedValue can only be called when IsComplete returns true; it puts the value on the Lua stack, so that this function can return it.
Your external code simply needs to handle the async stuff internally. Between resumes of these Lua coroutines, you'll need to pump whatever async stuff is doing the fetching, if there are outstanding requests.
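The "pump until the value for this key is ready" part can also be sketched without coroutines or threads. The following is a minimal, pull-based illustration in Rust rather than Lua; it does not use the coroutine approach above, and KeyedReceiver and get_next_value are hypothetical names. The data source is modelled as an iterator of (key, value) pairs; values for other keys are buffered while a caller waits for its own key.

```rust
use std::collections::{HashMap, VecDeque};

/// Wraps a source of (key, value) pairs and hands values out per key.
struct KeyedReceiver<I: Iterator<Item = (String, i32)>> {
    source: I,
    // Values that arrived while somebody was waiting for a different key.
    buffered: HashMap<String, VecDeque<i32>>,
}

impl<I: Iterator<Item = (String, i32)>> KeyedReceiver<I> {
    fn new(source: I) -> Self {
        KeyedReceiver { source, buffered: HashMap::new() }
    }

    /// Returns the next value labelled `key`, pumping the source as needed.
    /// Returns None once the source is exhausted without such a value.
    fn get_next_value(&mut self, key: &str) -> Option<i32> {
        loop {
            // Serve from the buffer first to keep per-key ordering.
            if let Some(v) = self.buffered.get_mut(key).and_then(|q| q.pop_front()) {
                return Some(v);
            }
            // Otherwise receive one more item and file it under its key.
            let (k, v) = self.source.next()?;
            self.buffered.entry(k).or_default().push_back(v);
        }
    }
}
```

Given a source that yields ("a", 1), ("b", 2), ("a", 3), asking for "b" first returns 2 and leaves 1 buffered, so later requests for "a" return 1 and then 3.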