Making blocking http call in akka stream processing - http

I am new to akka and still trying to understand the different akka and streaming concepts. For some new feature i need to add a http call to already existing stream which is working on an internal object. Something like this -
val step1Flow = Flow[SampleObject].filter(...--Filtering condition--...)
val step2Flow = Flow[SampleObject].map(obj => {
...
-- Business logic to update values in the obj --
...
})
...
override val flowGraph: Flow[SampleObject, SampleObject, NotUsed] =
bufferIn.via(Flow.fromGraph(GraphDSL.create() {
implicit builder =>
import GraphDSL.Implicits._
...
val step1 = builder.add(step1Flow)
val step2 = builder.add(step2Flow)
val step3 = builder.add(step3Flow)
...
source ~> step1 ~> step2 ~> step3 ~> merge
...
}
I need to add the new http request flow (lets call it newFlow) after step1. All these flow have Inlet and Outlet as SampleObject. Now my understanding is that the newFlow would need to be blocking because the outlet need to be SampleObject only. For that I have used Await function on the http call future. The code looks like this -
val responseFuture: Future[(Try[HttpResponse], SomeContext)] =
Source
.single(httpRequest -> context)
.via(Retry(retrySettings).join(clientFlow))
.runWith(Sink.head)
...
val (httpTry, passedAlongContext) = Await.result(responseFuture, 30.seconds)
-- logic to process response and return SampleObject --
Now this works fine but i think there should be a better way to do this without using wait. Also i think this would block the main thread till the request completes, which is going to affect the app throughput.
Could you please guide if the approach i used is correct or not. And how do i make use of some other thread pool to handle these blocking call so my main threadpool is not affected
This question seems very similar to mine but i do not understand it completely - connect Akka HTTP to Akka stream . Also i can't change the step2 or further flows.
EDIT : Added some code details for the stream

I ended up using the approach mentioned in the question because i couldn't find anything better after looking around. Adding this step decreased the throughput of my application as expected, but there are approaches to increase that can be used. Check these awesome blogs by Colin Breck -
https://blog.colinbreck.com/maximizing-throughput-for-akka-streams/
https://blog.colinbreck.com/partitioning-akka-streams-to-maximize-throughput/
To summarize -
Use Asynchronous Boundaries for flows which are blocking.
Use Futures if possible and add callbacks to futures. There are several ways to do that.
Use Buffers. There are several types of buffers available, choose what suits your needs.
Other than these, you can use inbuilt flows like -
Use "Broadcast" to broadcast your events to multiple consumers.
Use "Partition" to partition your stream into multiple streams based
on some condition.
Use "Balance" to partition your stream when there is no logical way to partition your events or they all could have different work loads.
You could use any one or multiple things from above options.

Related

Returning multiple items in gRPC: repeated List or stream single objects?

gRPC newbie. I have a simple api:
Customer getCustomer(int id)
List<Customer> getCustomers()
So my proto looks like this:
message ListCustomersResponse {
repeated Customer customer = 1;
}
rpc ListCustomers (google.protobuf.Empty) returns (ListCustomersResponse);
rpc GetCustomer (GetCustomerRequest) returns (Customer);
I was trying to follow Googles lead on the style. Originally I had returns (stream Customer) for GetCustomers, but Google seems to favor the ListxxxResponse style. When I generate the code, it ends up being:
public void getCustomers(com.google.protobuf.Empty request,
StreamObserver<ListCustomersResponse> responseObserver) {
vs:
public void getCustomers(com.google.protobuf.Empty request,
StreamObserver<Customer> responseObserver) {
Am I missing something? Why would I want to go through the hassle of creating a ListCustomersResponse when I can just do stream Customer and get the streaming functionality?
The ListCustomersResponse is just streaming the whole list at once vs streaming each customer. Googles preference seems to be to return the ListCustomersResponse style all of the time.
When is it appropriate to use the ListxxxResponse vs the stream response?
This question is hard to answer without knowing what reference you're using. It's possible there's a miscommunication, or that the reference is simply wrong.
If you're looking at the gRPC Basics tutorial though, then I might have an inkling as to what caused a miscommunication. If that's indeed your reference, then it does not recommend returning repeated fields for streamed responses; your intuition is correct: you would just want to stream the singular Customer.
Here is what it says (screenshot intentional):
You might be reading rpc ListFeatures(Rectangle) as meaning an endpoint that returns a list [noun] of features. If so, that's a miscommunication. The guide actually means an endpoint to list [verb] features. It would have been less confusing if they just wrote rpc GetFeatures(Rectangle).
So, your proto should look more like this,
rpc GetCustomers (google.protobuf.Empty) returns (stream Customer);
rpc GetCustomer (GetCustomerRequest) returns (Customer);
generating exactly what you suspected made more sense.
Update
Ah I see, so you're looking at this example in googleapis:
// Lists shelves. The order is unspecified but deterministic. Newly created
// shelves will not necessarily be added to the end of this list.
rpc ListShelves(ListShelvesRequest) returns (ListShelvesResponse) {
option (google.api.http) = {
get: "/v1/shelves"
};
}
...
// Response message for LibraryService.ListShelves.
message ListShelvesResponse {
// The list of shelves.
repeated Shelf shelves = 1;
// A token to retrieve next page of results.
// Pass this value in the
// [ListShelvesRequest.page_token][google.example.library.v1.ListShelvesRequest.page_token]
// field in the subsequent call to `ListShelves` method to retrieve the next
// page of results.
string next_page_token = 2;
}
Yeah, I think you've probably figured the same by now, but here they have chosen to use a simple RPC, as opposed to a server-side streaming RPC (see here). I emphasize this because, I think the important choice is not the stylistic difference between repeated versus stream, but rather the difference between a simple request-response API versus a more complex and less-ubiquitous streaming API.
In the googleapis example above, they're defining an API that returns a fixed and static number of items per page, e.g. 10 or 50. It would simply be overcomplicated to use streaming for this, when pagination is already so well-understood and prevalent in software architecture and REST APIs. I think that is what they should have said, rather than "a small number." So the complexity of streaming (and learning cost to you and future maintainers) has to justified, that's all. Suppose you're actually fetching thousands of (x, y, z) items for a Point Cloud or you're creating a live-updating bid-ask visualizer for some cryptocurrency, e.g.
Then you'd start asking yourself, "Is a simple request-response API my best option here?" So it just tends to be that, the larger the number of items needing to be returned, the more streaming APIs start to make sense. And that can be for conceptual reasons, e.g. the items are a live-updating stream in time like the above crypto example, or architectural, e.g. it would be more efficient to start displaying results in the UI as partial data streams back. I think the "small number" thing you read was an oversimplification.

Chaining Handlers with MediatR

We are using MediatR to implement a "Pipeline" for our dotnet core WebAPI backend, trying to follow the CQRS principle.
I can't decide if I should try to implement a IPipelineBehavior chain, or if it is better to construct a new Request and call MediatR.Send from within my Handler method (for the request).
The scenario is essentially this:
User requests an action to be executed, i.e. Delete something
We have to check if that something is being used by someone else
We have to mark that something as deleted in the database
We have to actually delete the files from the file system.
Option 1 is what we have now: A DeleteRequest which is handled by one class, wherein the Handler checks if it is being used, marks it as deleted, and then sends a new TaskStartRequest with the parameters to Delete.
Option 2 is what I'm considering: A DeleteRequest which implements the marker interfaces IRequireCheck, IStartTask, with a pipeline which runs:
IPipelineBehavior<IRequireCheck> first to check if the something is being used,
IPipelineBehavior<DeleteRequest> to mark the something as deleted in database and
IPipelineBehavior<IStartTask> to start the Task.
I haven't fully figured out what Option 2 would look like, but this is the general idea.
I guess I'm mainly wondering if it is code smell to call MediatR.Send(TRequest2) within a Handler for a TRequest1.
If those are the options you're set on going with - I say Option 2. Sending requests from inside existing Mediatr handlers can be seen as a code smell. You're hiding side effects and breaking the Single Responsibility Principle. You're also coupling your requests together and you should try to avoid situations where you can't send one type of request before another.
However, I think there might be an alternative. If a delete request can't happen without the validation and marking beforehand you may be able to leverage a preprocessor (example here) for your TaskStartRequest. That way you can have a single request that does everything you need. This even mirrors your pipeline example by simply leveraging the existing Mediatr patterns.
Is there any need to break the tasks into multiple Handlers? Maybe I am missing the point in mediatr. Wouldn't this suffice?
public async Task<Result<IFailure,ISuccess>> Handle(DeleteRequest request)
{
var thing = await this.repo.GetById(request.Id);
if (thing.IsBeignUsed())
{
return Failure.BeignUsed();
}
var deleted = await this.repo.Delete(request.Id);
return deleted ? new Success(request.Id) : Failure.DbError();
}

Async ZeroMQ for nim

I have never used ZeroMQ and first heard of it an hour ago. But from the guide (this guide) it sounds like there are async I/O.
It also happens that there is a nim port : this one
So I was wondering, does the async magic has something to do with async/await which are keywords not present in the nim port (which is just c2nim). So is it just something that's internal to ZMQ and the API doesn't have to bother about it ?
I thought async/await was a vernacular thing that has to bubble up to the upper most main loop (framework loop) so the API would have to be async-aware.
Is this a complete misconception on my part ?
Native ZeroMQ API supports both blocking and non-blocking I/O-s.
For this purpose, there are flags, where zmq.NOBLOCK could be added, so as to achieve a non-blocking mode of operation.
The respective language-wrapper functionality decides . . .
If I read the nim ZeroMQ-wrapper, that you have mentioned above, it seems to me, that there is a hardcoded blocking version for both send() and recv() function-wrappers.
The wrapper also seems not to support correct wireline message sizing in case a nim-based node of a distributed-system meets another node, which is using ZeroMQ version 2.1.+, which is still interesting and common in heterogeneous distributed-system realms.
ZeroMQ has also a poll() method, equipped with a timeout parameter, so that your multiplexed I/O-operations may yield all wanted ways of how to operate multiple I/O-channels under some soft real-time control constraints.
While the accepted answer was true at the time; async with ZMQ now is built with the wrapper and there are examples provided :
See :
https://github.com/nim-lang/nim-zmq/blob/master/zmq/asynczmq.nim
https://github.com/nim-lang/nim-zmq/blob/master/examples/ex08_async_reqrep.nim
You can also work around the blocking behaviour or ZMQ in order to not block the async-dispatch loop manually with poll / sleepAsync :
let
zmq_timeout = 50
async_loop_time = 450 # spend more time on async stuff than on zmq stuff
var
conn = listen("tcp://127.0.0.1:36000", mode = PAIR=
poller = initZPoll([conn], ZMQ_POLLIN)
if poller.poll(timeout):
if events(poller[0]):
var res = poller[0].receive()
# Do async stuff
else:
waitFor sleepAsync(async_loop_time) # Calling sleepAsync is a trick to make the async dispatch loop progress for a time

Dealing with DB handles and initialization in functional programming

I have several functions that deal with database interactions. (like readModelById, updateModel, findModels, etc) that I try to use in a functional style.
In OOP, I'd create a class that takes DB-connection-parameters in the constructor, creates the database-connection and save the DB-handle in the instance. The functions then would just use the DB-handle from "this".
What's the best way in FP to deal with this? I don't want to hand around the DB handle throughout the entire application. I thought about partial application on the functions to "bake in" the handle, but that creates ugly boilerplate code, doing it one by one and handing it back.
What's the best practice/design pattern for things like this in FP?
There is a parallel to this in OOP that might suggest the right approach is to take the database resource as parameter. Consider the DB implementation in OOP using SOLID principles. Due to Interface Segregation Principle, you would end up with an interface per DB method and at least one implementation class per interface.
// C#
public interface IGetRegistrations
{
public Task<Registration[]> GetRegistrations(DateTime day);
}
public class GetRegistrationsImpl : IGetRegistrations
{
public Task<Registration[]> GetRegistrations(DateTime day)
{
...
}
private readonly DbResource _db;
public GetRegistrationsImpl(DbResource db)
{
_db = db;
}
}
Then to execute your use case, you pass in only the dependencies you need instead of the whole set of DB operations. (Assume that ISaveRegistration exists and is defined like above).
// C#
public async Task Register(
IGetRegistrations a,
ISaveRegistration b,
RegisterRequest requested
)
{
var registrations = await a.GetRegistrations(requested.Date);
// examine existing registrations and determine request is valid
// throw an exception if not?
...
return await b.SaveRegistration( ... );
}
Somewhere above where this code is called, you have to new up the implementations of these interfaces and provide them with DbResource.
var a = new GetRegistrationsImpl(db);
var b = new SaveRegistrationImpl(db);
...
return await Register(a, b, request);
Note: You could use a DI framework here to attempt to avoid some boilerplate. But I find it to be borrowing from Peter to pay Paul. You pay as much in having to learn a DI framework and how to make it behave as you do for wiring the dependencies yourself. And it is another tech new team members have to learn.
In FP, you can do the same thing by simply defining a function which takes the DB resource as a parameter. You can pass functions around directly instead of having to wrap them in classes implementing interfaces.
// F#
let getRegistrations (db: DbResource) (day: DateTime) =
...
let saveRegistration (db: DbResource) ... =
...
The use case function:
// F#
let register fGet fSave request =
async {
let! registrations = fGet request.Date
// call your business logic here
...
do! fSave ...
}
Then to call it you might do something like this:
register (getRegistrations db) (saveRegistration db) request
The partial application of db here is analogous to constructor injection. Your "losses" from passing it to multiple functions is minimal compared to the savings of not having to define interface + implementation for each DB operation.
Despite being in a functional-first language, the above is in principle the same as the OO/SOLID way... just less lines of code. To go a step further into the functional realm, you have to work on eliminating side effects in your business logic. Side effects can include: current time, random numbers, throwing exceptions, database operations, HTTP API calls, etc.
Since F# does not require you to declare side effects, I designate a border area of code where side effects should stop being used. For me, the use case level (register function above) is the last place for side effects. Any business logic lower than that, I work on pushing side effects up to the use case. It is a learning process to do that, so do not be discouraged if it seems impossible at first. Just do what you have to for now and learn as you go.
I have a post that attempts to set the right expectations on the benefits of FP and how to get them.
I'm going to add a second answer here taking an entirely different approach. I wrote about it here. This is the same approach used by MVU to isolate decisions from side effects, so it is applicable to UI (using Elmish) and backend.
This is worthwhile if you need to interleave important business logic with side effects. But not if you just need to execute a series of side effects. In that case just use a block of imperative statements, in a task (F# 6 or TaskBuilder) or async block if you need IO.
The pattern
Here are the basic parts.
Types
Model - The state of the workflow. Used to "remember" where we are in the workflow so it can be resumed after side effects.
Effect - Declarative representation of the side effects you want to perform and their required data.
Msg - Represents events that have happened. Primarily, they are the results of side effects. They will resume the workflow.
Functions
update - Makes all the decisions. It takes in its previous state (Model) and a Msg and returns an updated state and new Effects. This is a pure function which should have no side effects.
perform - Turns a declared Effect into a real side effect. For example, saving to a database. Returns a Msg with the result of the side effect.
init - Constructs an initial Model and starting Msg. Using this, a caller gets the data it needs to start the workflow without having to understand the internal details of update.
I jotted down an example for a rate-limited emailer. It includes the implementation I use on the backend to package and run this pattern, called Ump.
The logic can be tested without any instrumentation (no mocks/stubs/fakes/etc). Declare the side effects you expect, run the update function, then check that the output matches with simple equality. From the linked gist:
// example test
let expected = [SendEmail email1; ScheduleSend next]
let _, actual = Ump.test ump initArg [DueItems ([email1; email2], now)]
Assert.IsTrue(expected = actual)
The integrations can be tested by exercising perform.
This pattern takes some getting-used-to. It reminds me a bit of an Erlang actor or a state machine. But it is helpful when you really need your business logic to be correct and well-tested. It also happens to be a proper functional pattern.

How to change the dart-sqlite code from synchronous style to asynchronous?

I'm trying to use Dart with sqlite, with this project dart-sqlite.
But I found a problem: the API it provides is synchronous style. The code will be looked like:
// Iterating over a result set
var count = c.execute("SELECT * FROM posts LIMIT 10", callback: (row) {
print("${row.title}: ${row.body}");
});
print("Showing ${count} posts.");
With such code, I can't use Dart's future support, and the code will be blocking at sql operations.
I wonder how to change the code to asynchronous style? You can see it defines some native functions here: https://github.com/sam-mccall/dart-sqlite/blob/master/lib/sqlite.dart#L238
_prepare(db, query, statementObject) native 'PrepareStatement';
_reset(statement) native 'Reset';
_bind(statement, params) native 'Bind';
_column_info(statement) native 'ColumnInfo';
_step(statement) native 'Step';
_closeStatement(statement) native 'CloseStatement';
_new(path) native 'New';
_close(handle) native 'Close';
_version() native 'Version';
The native functions are mapped to some c++ functions here: https://github.com/sam-mccall/dart-sqlite/blob/master/src/dart_sqlite.cc
Is it possible to change to asynchronous? If possible, what shall I do?
If not possible, that I have to rewrite it, do I have to rewrite all of:
The dart file
The c++ wrapper file
The actual sqlite driver
UPDATE:
Thanks for #GregLowe's comment, Dart's Completer can convert callback style to future style, which can let me to use Dart's doSomething().then(...) instead of passing a callback function.
But after reading the source of dart-sqlite, I realized that, in the implementation of dart-sqlite, the callback is not event-based:
int execute([params = const [], bool callback(Row)]) {
_checkOpen();
_reset(_statement);
if (params.length > 0) _bind(_statement, params);
var result;
int count = 0;
var info = null;
while ((result = _step(_statement)) is! int) {
count++;
if (info == null) info = new _ResultInfo(_column_info(_statement));
if (callback != null && callback(new Row._internal(count - 1, info, result)) == true) {
result = count;
break;
}
}
// If update affected no rows, count == result == 0
return (count == 0) ? result : count;
}
Even if I use Completer, it won't increase the performance. I think I may have to rewrite the c++ code to make it event-based first.
You should be able to write a wrapper without touching the C++. Have a look at how to use the Completer class in dart:async. Basically you need to create a Completer, return Completer.future immediately, and then call Completer.complete(row) from the existing callback.
Re: update. Have you seen this article, specifically the bit about asynchronous extensions? i.e. If the C++ API is synchronous you can run it in a separate thread, and use messaging to communicate with it. This could be a way to do it.
The big problem you've got is that SQLite is an embedded database; in order to process your query and provide your results, it must do computation (and I/O) in your process. What's more, in order for its transaction handling system to work, it either needs its connection to be in the thread that created it, or for you to run in serialized mode (with a performance hit).
Because these are fairly hard constraints, your plan of switching things to an asynchronous operation mode is unlikely to go well except by using multiple threads. Since using multiple connections complicates things a lot (as you can't share some things between them, such as TEMP TABLEs) let's consider going for a single serialized connection; all activity will be serialized at the DB level, but for an application that doesn't use the DB a lot it will be OK. At the C++ level, you'd be talking about calling that execute from another thread and then sending messages back to the caller thread to indicate each row and the completion.
But you'll take a real hit when you do this; in particular, you're committing to only doing one query at a time, as the technique runs into significant problems with semantic effects when you start using two connections at once and the DB forces serialization on you with one connection.
It might be simpler to do the above by putting the synchronous-asynchronous coupling at the Dart level by managing the worker thread and inter-thread communication there. That would let you avoid having to change the C++ code significantly. I don't know Dart well enough to be able to give much advice there.
Myself, I'd just stick with synchronous connection processing so that I can make my application use multi-threaded mode more usefully. I'd be taking the hit with the semantics and giving each thread its own connection (possibly allocated lazily) so that overall speed was better, but I do come from a programming community that regards threads as relatively heavyweight resources, so make of that what you will. (Heavy threads can do things that reduce the number of locks they need that it makes no sense to try to do with light threads; it's about overhead management.)

Resources