Let's assume that Flink receives a stream of thousands of tweets per second and that somewhere in the process it needs to classify them as spam or not. I have a cluster of e.g. 20 machines that provide the "classification" microservice through a REST API; they can give a max throughput of 10k tweets per second and their latency is 3 seconds. This means that in the worst case I might have 30k tweets in flight, and that's ok. I guess that to consume this service from Flink, an implementation would be something like this:
public class Classifier implements MapFunction<Tweet, TweetWithClass> {
    @Override
    public TweetWithClass map(Tweet tweet) {
        TweetWithClass twc = new TweetWithClass(tweet);
        // hypothetical blocking HTTP client
        twc.classes = new Post("http://my.classifier.com", tweet.body).bodyAsStringArrayFromJson();
        return twc;
    }
}
DataSet<TweetWithClass> outTweets = inTweets.map(new Classifier()).setParallelism(30000);
Now, given this API, my guess is that Flink would have no choice other than starting 30k threads, and that would potentially be bad. I see in the source code that Flink uses Netty; I guess it could support this operation more efficiently by using asynchronous calls. If a fictitious, beautiful Netty/Flink/Java API existed, this would look something like this:
public class Classifier implements MapFunction<Tweet, TweetWithClass> {
    @Override
    public Future<TweetWithClass> map(Tweet tweet) {
        Future<String[]> classes = new NettyPost("http://my.classifier.com", tweet.body).asyncBodyAsStringArrayFromJson();
        return classes.onGet( (String[] cls) -> new TweetWithClass(tweet, cls) );
    }
}
DataSet<TweetWithClass> outTweets = inTweets.nettyMap(new Classifier()).setMaxParallelism(30000);
Is there a way to use asynchronous calls to have massive scalability with very few threads in Flink?
I know this is a relatively old question, but as of Flink 1.2 (which was released in February 2017), Flink offers an API for exactly this purpose.
It's called async I/O.
With async I/O you can perform asynchronous calls to external databases or in your case external web service and get the results with a callback inside a future.
More information can be found here: https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/stream/asyncio.html
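For your tweet-classification example, a minimal sketch of how this could look is below. It assumes a hypothetical non-blocking REST client (AsyncHttpClient here) that returns a CompletableFuture, and it reuses the question's Tweet/TweetWithClass types; also note that the exact AsyncFunction signature differs slightly across Flink versions (AsyncCollector in 1.2/1.3, ResultFuture in 1.5+), so treat this as a sketch rather than copy-paste code:
import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

public class AsyncClassifier extends RichAsyncFunction<Tweet, TweetWithClass> {

    private transient AsyncHttpClient client;   // hypothetical non-blocking REST client

    @Override
    public void open(Configuration parameters) {
        client = new AsyncHttpClient("http://my.classifier.com");
    }

    @Override
    public void asyncInvoke(Tweet tweet, ResultFuture<TweetWithClass> resultFuture) {
        // Fire the request without blocking the operator thread; complete the
        // Flink future from the client's callback once the response arrives.
        CompletableFuture<String[]> classes = client.post(tweet.body);
        classes.thenAccept(cls ->
                resultFuture.complete(Collections.singletonList(new TweetWithClass(tweet, cls))));
    }
}

// At most 30000 requests in flight per parallel instance, each with a 10 s timeout.
DataStream<TweetWithClass> outTweets = AsyncDataStream.unorderedWait(
        inTweets, new AsyncClassifier(), 10, TimeUnit.SECONDS, 30000);
The capacity argument (30000 here) bounds the number of in-flight requests per operator instance, so you get the concurrency you described without one thread per request.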
gRPC newbie. I have a simple api:
Customer getCustomer(int id)
List<Customer> getCustomers()
So my proto looks like this:
message ListCustomersResponse {
repeated Customer customer = 1;
}
rpc ListCustomers (google.protobuf.Empty) returns (ListCustomersResponse);
rpc GetCustomer (GetCustomerRequest) returns (Customer);
I was trying to follow Google's lead on the style. Originally I had returns (stream Customer) for GetCustomers, but Google seems to favor the ListXxxResponse style. When I generate the code, it ends up being:
public void getCustomers(com.google.protobuf.Empty request,
StreamObserver<ListCustomersResponse> responseObserver) {
vs:
public void getCustomers(com.google.protobuf.Empty request,
StreamObserver<Customer> responseObserver) {
Am I missing something? Why would I want to go through the hassle of creating a ListCustomersResponse when I can just do stream Customer and get the streaming functionality?
The ListCustomersResponse just returns the whole list at once, vs streaming each customer. Google's preference seems to be to return the ListCustomersResponse style all of the time.
When is it appropriate to use the ListXxxResponse vs the stream response?
This question is hard to answer without knowing what reference you're using. It's possible there's a miscommunication, or that the reference is simply wrong.
If you're looking at the gRPC Basics tutorial though, then I might have an inkling as to what caused a miscommunication. If that's indeed your reference, then it does not recommend returning repeated fields for streamed responses; your intuition is correct: you would just want to stream the singular Customer.
Here is what it says: the tutorial's ListFeatures RPC is declared as rpc ListFeatures(Rectangle) returns (stream Feature).
You might be reading rpc ListFeatures(Rectangle) as meaning an endpoint that returns a list [noun] of features. If so, that's a miscommunication. The guide actually means an endpoint to list [verb] features. It would have been less confusing if they just wrote rpc GetFeatures(Rectangle).
So, your proto should look more like this,
rpc GetCustomers (google.protobuf.Empty) returns (stream Customer);
rpc GetCustomer (GetCustomerRequest) returns (Customer);
generating exactly what you suspected made more sense.
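For illustration, a server-side implementation of that streaming variant might look roughly like this (a sketch only: the generated base class name CustomerServiceGrpc.CustomerServiceImplBase and the fetchCustomers() data-access call are assumptions, not something from your proto):
import io.grpc.stub.StreamObserver;

public class CustomerServiceImpl extends CustomerServiceGrpc.CustomerServiceImplBase {

    @Override
    public void getCustomers(com.google.protobuf.Empty request,
                             StreamObserver<Customer> responseObserver) {
        // Each customer is flushed to the client as soon as it is available,
        // instead of being buffered into one ListCustomersResponse message.
        for (Customer customer : fetchCustomers()) {
            responseObserver.onNext(customer);
        }
        responseObserver.onCompleted();
    }

    private Iterable<Customer> fetchCustomers() {
        // hypothetical data access, stands in for your real repository/DAO
        return java.util.Collections.emptyList();
    }
}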
Update
Ah I see, so you're looking at this example in googleapis:
// Lists shelves. The order is unspecified but deterministic. Newly created
// shelves will not necessarily be added to the end of this list.
rpc ListShelves(ListShelvesRequest) returns (ListShelvesResponse) {
  option (google.api.http) = {
    get: "/v1/shelves"
  };
}
...
// Response message for LibraryService.ListShelves.
message ListShelvesResponse {
  // The list of shelves.
  repeated Shelf shelves = 1;

  // A token to retrieve next page of results.
  // Pass this value in the
  // [ListShelvesRequest.page_token][google.example.library.v1.ListShelvesRequest.page_token]
  // field in the subsequent call to `ListShelves` method to retrieve the next
  // page of results.
  string next_page_token = 2;
}
Yeah, I think you've probably figured the same by now, but here they have chosen to use a simple RPC, as opposed to a server-side streaming RPC (see here). I emphasize this because, I think the important choice is not the stylistic difference between repeated versus stream, but rather the difference between a simple request-response API versus a more complex and less-ubiquitous streaming API.
In the googleapis example above, they're defining an API that returns a fixed and static number of items per page, e.g. 10 or 50. It would simply be overcomplicated to use streaming for this, when pagination is already so well-understood and prevalent in software architecture and REST APIs. I think that is what they should have said, rather than "a small number." So the complexity of streaming (and the learning cost to you and future maintainers) has to be justified, that's all. Suppose you're actually fetching thousands of (x, y, z) items for a point cloud, or creating a live-updating bid-ask visualizer for some cryptocurrency, for example.
Then you'd start asking yourself, "Is a simple request-response API my best option here?" So it just tends to be that, the larger the number of items needing to be returned, the more streaming APIs start to make sense. And that can be for conceptual reasons, e.g. the items are a live-updating stream in time like the above crypto example, or architectural, e.g. it would be more efficient to start displaying results in the UI as partial data streams back. I think the "small number" thing you read was an oversimplification.
I am working on a real-time project with Flink and I need to enrich the state of each card with prior transactions in order to compute transaction features, as described below:
For each card I have a feature that counts the number of transactions in the past 24 hours. On the other hand I have 2 data sources:
First, a database table which stores the transactions of cards until the end of yesterday.
Second, the stream of today's transactions.
So the first step is to fetch yesterday's transactions for each card from the database and store them in the card's state. The second step is to update this state with today's transactions that arrive on the stream and compute the number of transactions in the past 24 hours for them.
I tried to read the database data as a stream and connect it to the stream of today's transactions. To reach the above goal I used a RichCoFlatMapFunction. However, because the database data was not inherently a stream, the output was not correct. The RichCoFlatMapFunction is as follows:
transactionsHistory.connect(transactionsStream).flatMap(
    new RichCoFlatMapFunction<History, Tuple2<String, Transaction>, ExtractedFeatures>() {

        private ValueState<History> history;

        @Override
        public void open(Configuration config) throws Exception {
            this.history = getRuntimeContext().getState(
                new ValueStateDescriptor<>("card history", History.class));
        }

        // historical data
        @Override
        public void flatMap1(History history,
                             Collector<ExtractedFeatures> collector) throws Exception {
            this.history.update(history);
        }

        // new transactions from the stream
        @Override
        public void flatMap2(Tuple2<String, Transaction> transactionTuple,
                             Collector<ExtractedFeatures> collector) throws Exception {
            History history = this.history.value();
            Transaction transaction = transactionTuple.f1;
            ArrayList<History> prevDayHistoryList = history.prevDayTransactions;
            // This function returns the transactions that fall in the 24-hour
            // window before the current transaction, together with their count.
            Tuple2<ArrayList<History>, Integer> prevDayHistoryTuple =
                findHistoricalDate(prevDayHistoryList, transaction.transactionLocalDate);
            prevDayHistoryList = prevDayHistoryTuple.f0;
            history.prevDayTransactions = prevDayHistoryList;
            this.history.update(history);
            ExtractedFeatures ef = new ExtractedFeatures();
            ef.updateFeatures(transaction, prevDayHistoryTuple.f1);
            collector.collect(ef);
        }
    });
What is the right design pattern to achieve the above enriching requirement in a Flink streaming program?
I found the below question on Stack Overflow, which is similar to my question, but I couldn't solve my problem, so I decided to ask for help :)
Enriching DataStream using static DataSet in Flink streaming
Any help would be really appreciated.
However, because the database data was not inherently a stream, the output was not correct.
It certainly is possible to enrich streaming data with information coming from a relational database. What can be tricky, though, is to somehow guarantee that the enrichment data is ingested before it is needed. In general you may need to buffer the stream to be enriched until the enrichment data has been bootstrapped/ingested. One approach that is sometimes taken, for example, is to
run the app with the stream-to-be-enriched disabled
take a savepoint once the enrichment data has been fully ingested and stored in flink state
restart the app from the savepoint with the stream-to-be-enriched enabled
In the case you describe, however, it seems like a simpler approach would work. If you only need 24 hours of historic data, then why not ignore the database of historic transactions? Just run your application until it has seen 24 hours of streaming data, after which the historic database becomes irrelevant anyway.
But if you must ingest the historic data, and you don't like the savepoint-based approach outlined above, here are a couple of other possibilities:
buffer the un-enriched events in flink state (e.g. ListState or MapState) until the historic stream has been ingested (a sketch of this follows below)
write a custom SourceFunction that blocks the primary stream until the historic data has been ingested
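To make the first option concrete, here is a minimal sketch (untested, reusing the question's own History, Transaction and ExtractedFeatures types, with an invented emit() helper standing in for the feature computation) of buffering un-enriched events in ListState until the historical data for that card has arrived:
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
import org.apache.flink.util.Collector;

public class BufferingEnricher
        extends RichCoFlatMapFunction<History, Tuple2<String, Transaction>, ExtractedFeatures> {

    private ValueState<History> history;
    private ListState<Transaction> buffered;

    @Override
    public void open(Configuration config) {
        history = getRuntimeContext().getState(
                new ValueStateDescriptor<>("card history", History.class));
        buffered = getRuntimeContext().getListState(
                new ListStateDescriptor<>("buffered transactions", Transaction.class));
    }

    @Override
    public void flatMap1(History h, Collector<ExtractedFeatures> out) throws Exception {
        history.update(h);
        // Historical data has arrived: flush anything buffered in the meantime.
        for (Transaction t : buffered.get()) {
            emit(t, out);
        }
        buffered.clear();
    }

    @Override
    public void flatMap2(Tuple2<String, Transaction> t, Collector<ExtractedFeatures> out) throws Exception {
        if (history.value() == null) {
            buffered.add(t.f1);   // not bootstrapped yet, park the event
        } else {
            emit(t.f1, out);
        }
    }

    private void emit(Transaction t, Collector<ExtractedFeatures> out) {
        // Compute the 24-hour features against history.value() here, as in the
        // question's original flatMap2, and collect the result.
    }
}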
For a more thorough exploration of this topic, see Bootstrapping State In Apache Flink.
Better support for this use case is planned for a future release, btw.
I am working on an ASP .NET MVC 5 application that requires me to use the Task objects that were introduced in .NET 4.0. I am browsing through a few links that give an overview of Task objects. However, I could use a bit of help to check if I am going in the right direction.
Here is the stub that is generated by Visual Studio:
public Task<MyAppUser> FindByNameAsync(string userName) {
throw new System.NotImplementedException();
}
I have written a method called mySearch() that searches through a list. I could use this function for my implementation:
public Task<MyAppUser> FindByNameAsync(string userName) {
MyAppUser val = mySearch(userName);
return Task<MyAppUser>.FromResult<MyAppUser>(val);
}
While this may work, I am thinking I am not really utilizing the Task paradigm properly. Perhaps I can write the code as follows:
public Task<MyAppUser> FindByNameAsync(string userName) {
return Task<MyAppUser>.Factory.StartNew(() => mySearch(userName));
}
As I understand, I am simply returning a delegate as a Task object which the ASP.NET engine will execute as needed.
Am I using the Task paradigm correctly?
Don't ever return a new Task from an XXXAsync method - that's almost the worst thing you can do. In your case, using Task.FromResult is probably the best option (if you are indeed forced to use the XXXAsync methods and if you really don't have asynchronous I/O for the search method). In a web application, it's better to do the whole thing synchronously rather than appearing asynchronous while still taking up a different thread.
The reasoning is simple - asynchronous methods are a great way to conserve resources. Asynchronous I/O doesn't require a thread, so you can afford to reuse the current thread for other work until the data is actually ready. In ASP.NET, the callback will be posted back to a ThreadPool thread, so you've managed to increase your throughput essentially for free.
If you fake the asynchronous method by using Task.FromResult, it's true that this is lost. However, unlike in WinForms or WPF, you're not freezing the GUI, so there's no point in masking the lacking asynchronicity by spawning a new thread.
When you do the faking by using TaskFactory.StartNew or Task.Run, you're only making things worse, essentially - it's true that you release the original thread as with proper async I/O, but you also claim a new thread from the ThreadPool - so you're still blocking one thread, you just added a bunch of extra work for the plumbing.
@Luaan's answer is quite good. I just want to expound on a couple of principles for using async on ASP.NET.
1) Use synchronous method signatures for synchronous work.
I'm not sure why VS is generating an asynchronous stub. Since your mySearch just "searches through a list" (a synchronous operation), then your method should look like this instead:
public MyAppUser FindByName(string userName) {
return mySearch(userName);
}
2) Use async/await for asynchronous work (i.e., anything doing I/O). Do not use Task.Run or (even worse) Task.Factory.StartNew to fake asynchronous work within a request context.
For example, if you needed to search in a database (I/O), then that would be naturally asynchronous, and you should use the asynchronous APIs (e.g., EF6 has asynchronous queries):
public Task<MyAppUser> FindByNameAsync(string userName) {
return dbContext.Users.Where(x => x.Name == userName).FirstAsync();
}
If you're planning to have asynchronous APIs but for now you're just doing test/stub code, then you should use FromResult:
public Task<MyAppUser> FindByNameAsync(string userName) {
return Task.FromResult(mySearch(userName));
}
I have a really long integration test that simulates a sequential process involving many different interactions with a couple of Java servlets. The servlets' behavior depends on the values of the parameters being posted in the request, so I wanted to test every permutation to make sure my servlets are behaving as expected.
Currently, my integration test is in one long function called "testServletFunctionality()" that goes something like this:
//Configure a mock request
//Post request to servlet X
//Check database for expected changes
//Re-configure mock request
//Re-post request to servlet X
//Check database for expected changes
//Re-configure mock request
//Post request to servlet Y
//Check database for expected changes
...
and each configure/post/check step has about 20 lines of code, so the function is very long.
What is the proper way to break up or organize a long, sequential, repetitive integration test like this?
The main problem with integration tests (IT) is usually that the setup is very expensive. Tests usually should not depend on each other and the order in which they are executed but for ITs, test #2 will always fail if you don't run test #1 (login).
Sad.
The solution is to treat these tests like production code: Split long methods into several smaller ones, build helper objects that perform certain operations, so you can do this in your test:
@Test
public void someComplexTest() throws Exception {
    new LoginHelper().loginAsAdmin();
    // ...
}
or move this code into a base test class:
@Test
public void someComplexTest() throws Exception {
    loginHelper().loginAsAdmin();
    // ...
}
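Such a helper might look roughly like this (a sketch only; LoginHelper is the name used above, but MockHttpClient, MockResponse and the URLs are invented for illustration):
import static org.junit.Assert.assertEquals;

// Hypothetical helper that hides the ~20 lines of configure/post/check
// plumbing behind one intention-revealing call.
public class LoginHelper {

    private final MockHttpClient client = new MockHttpClient();  // assumed test utility

    // Logs in with admin credentials and fails fast if the login is rejected.
    public LoginHelper loginAsAdmin() {
        MockResponse response = client.post("/login", "user=admin&password=secret");
        assertEquals(200, response.getStatus());
        return this;  // fluent style, so calls can be chained inside a test
    }
}
Each step in the original test (configure a mock request, post to a servlet, check the database) can get a similar helper, so the long test method collapses into a readable sequence of domain-level calls.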
I am using Google AppEngine, in conjunction with PyAMF to provide RemoteObject support. In my Flex code I make several RemoteObject method calls at once which tends to batch the AMF Messages into a single HTTP request.
Most of the time this is fine, but AppEngine applies some strict per-request limits (in this case I am hitting a DeadlineExceededError - max 30 seconds). A number of service methods are expected to take upwards of 10 seconds, and if these are batched by the RemoteObject into one HTTP request... you see where this is going.
Now you could say "refactor your service calls," and that is also going on, but it's not the question being asked here. Is there a way to prevent Flex RemoteObject from batching AMF requests in situations like this?
I have done a fair amount of Googling on the subject and come up with bupkis. It seems to me that I would need to implement a custom version of mx.messaging.channels.AMFChannel or something of that nature, which seems way too hardcore for a feature like this.
Anyone have any pointers/insight?
Check out the concurrency property on RemoteObject.
The batching of AMF requests into HTTP happens at the NetConnection level. So unfortunately the best way to stop AMF requests from batching is to implement a custom version of mx.messaging.channels.AMFChannel. However, this is quite easy to do, and probably easier than queuing requests and calling them later.
Instead of using the default AMFChannel use the following instead:
package services
{
    import flash.events.AsyncErrorEvent;
    import flash.events.IOErrorEvent;
    import flash.events.NetStatusEvent;
    import flash.events.SecurityErrorEvent;
    import flash.net.NetConnection;

    import mx.messaging.MessageResponder;
    import mx.messaging.channels.AMFChannel;

    public class NonBatchingAMFChannel extends mx.messaging.channels.AMFChannel
    {
        public function NonBatchingAMFChannel(id:String = null, uri:String = null)
        {
            super(id, uri);
        }

        override protected function internalSend(msgResp:MessageResponder):void
        {
            // AMFChannel internalSend
            super.internalSend(msgResp);

            // Reset the net connection.
            _nc = new NetConnection();
            _nc.addEventListener(NetStatusEvent.NET_STATUS, statusHandler);
            _nc.addEventListener(SecurityErrorEvent.SECURITY_ERROR, securityErrorHandler);
            _nc.addEventListener(IOErrorEvent.IO_ERROR, ioErrorHandler);
            _nc.addEventListener(AsyncErrorEvent.ASYNC_ERROR, asyncErrorHandler);
            _nc.connect(this.url);
        }
    }
}
The magic happens by overriding the internalSend method. After running the super internalSend method (which queues the message responder), we will reset the NetConnection and all of its event handlers. This gets a new NetConnection ready for the next remoting message.
Note:
It's important to note that this is a custom non-batching AMFChannel; if you want to send AMF messages securely, you'll need to copy this class and extend the mx.messaging.channels.SecureAMFChannel class instead.
Credit:
Credit to Nick Joyce who answered his question here on a different forum.
You can create a pool of connections, and create another class that triggers the connections. Your application does not make the connections, it only feeds the pool.
Well, one way is apparently to roll your own AMFChannel that doesn't use NetConnection... I haven't tried it so I don't know how well it works.
http://blogs.adobe.com/pfarland/2008/06/using_amf_with_flashneturlload.html
I think what njoyce would like to do is to prevent AMF batching. Batching is good for multiple small calls, for example, but if you have very server-intensive calls, AMF batching should be prevented. Why?
one AMF call => one thread on server side
multiple AMF calls => all requests get handled through multiple threads
Pseudo Code:
private static var _collectionFillIndex:int;
private static var _collectionsToFill:Array = [];

public function doFillCollections():void {
    _collectionFillIndex = _collectionsToFill.length;
    Application.application.addEventListener( Event.ENTER_FRAME, onFrameEventHandler );
}

private function onFrameEventHandler( event:Event ):void {
    --_collectionFillIndex;
    if( _collectionFillIndex < 0 ) {
        Application.application.removeEventListener( Event.ENTER_FRAME, onFrameEventHandler );
        return;
    }
    // Trigger one fill per frame so each call goes out in its own request.
    _collectionsToFill[ _collectionFillIndex ].fill();
}