How to use a historical dataset to enrich a Flink DataStream

I am working on a real-time project with Flink, and I need to enrich the state of each card with prior transactions in order to compute transaction features, as described below:
For each card I have a feature that counts the number of transactions in the past 24 hours. On the other hand, I have two data sources:
First, a database table which stores the transactions of cards up until the end of yesterday.
Second, the stream of today's transactions.
So the first step is to fetch yesterday's transactions for each card from the database and store them in the card's state. The second step is to update this state with today's transactions, which arrive on the stream, and compute the number of transactions in the past 24 hours for each of them.
I tried to read the database data as a stream and connect it to today's transactions. To reach the above goal, I used a RichCoFlatMapFunction. However, because the database data is not inherently a stream, the output was not correct. The RichCoFlatMapFunction is as follows:
transactionsHistory.connect(transactionsStream).flatMap(
    new RichCoFlatMapFunction<History, Tuple2<String, Transaction>, ExtractedFeatures>() {

        private ValueState<History> history;

        @Override
        public void open(Configuration config) throws Exception {
            this.history = getRuntimeContext().getState(
                new ValueStateDescriptor<>("card history", History.class));
        }

        // historical data
        @Override
        public void flatMap1(History history,
                             Collector<ExtractedFeatures> collector) throws Exception {
            this.history.update(history);
        }

        // new transactions from the stream
        @Override
        public void flatMap2(Tuple2<String, Transaction> transactionTuple,
                             Collector<ExtractedFeatures> collector) throws Exception {
            History history = this.history.value();
            Transaction transaction = transactionTuple.f1;
            ArrayList<History> prevDayHistoryList = history.prevDayTransactions;

            // This function returns the transactions that fall within the 24-hour
            // window preceding the current transaction, together with their count.
            Tuple2<ArrayList<History>, Integer> prevDayHistoryTuple =
                findHistoricalDate(prevDayHistoryList, transaction.transactionLocalDate);

            prevDayHistoryList = prevDayHistoryTuple.f0;
            history.prevDayTransactions = prevDayHistoryList;
            this.history.update(history);

            ExtractedFeatures ef = new ExtractedFeatures();
            ef.updateFeatures(transaction, prevDayHistoryTuple.f1);
            collector.collect(ef);
        }
    });
What is the right design pattern to achieve the above enrichment requirement in a Flink streaming program?
I found the below question on Stack Overflow, which is similar to mine, but it didn't solve my problem, so I decided to ask for help :)
Enriching DataStream using static DataSet in Flink streaming
Any help would be really appreciated.

However, because the database data was not stream inherently, the output was not correct.
It certainly is possible to enrich streaming data with information coming from a relational database. What can be tricky, though, is to somehow guarantee that the enrichment data is ingested before it is needed. In general you may need to buffer the stream to be enriched until the enrichment data has been bootstrapped/ingested. One approach that is sometimes taken, for example, is to:
1. run the app with the stream-to-be-enriched disabled
2. take a savepoint once the enrichment data has been fully ingested and stored in Flink state
3. restart the app from the savepoint with the stream-to-be-enriched enabled
In the case you describe, however, it seems like a simpler approach would work. If you only need 24 hours of historic data, then why not ignore the database of historic transactions? Just run your application until it has seen 24 hours of streaming data, after which the historic database becomes irrelevant anyway.
But if you must ingest the historic data, and you don't like the savepoint-based approach outlined above, here are a couple of other possibilities:
buffer the un-enriched events in Flink state (e.g. ListState or MapState) until the historic stream has been ingested (see the sketch below)
write a custom SourceFunction that blocks the primary stream until the historic data has been ingested
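As a rough illustration of the buffering option, here is a minimal sketch using a KeyedCoProcessFunction. It assumes the two streams are keyed by card, reuses the History, Transaction and ExtractedFeatures types from the question, and leaves out the actual feature computation:

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.util.Collector;

// Sketch only: buffers live events per card until that card's history has arrived.
public class BufferingEnricher
        extends KeyedCoProcessFunction<String, History, Tuple2<String, Transaction>, ExtractedFeatures> {

    private ValueState<History> history;                 // enrichment data for this card
    private ListState<Transaction> bufferedTransactions; // events seen before the history arrived

    @Override
    public void open(Configuration config) {
        history = getRuntimeContext().getState(
            new ValueStateDescriptor<>("card history", History.class));
        bufferedTransactions = getRuntimeContext().getListState(
            new ListStateDescriptor<>("buffered transactions", Transaction.class));
    }

    // Historic records coming from the database source.
    @Override
    public void processElement1(History h, Context ctx, Collector<ExtractedFeatures> out) throws Exception {
        history.update(h);
        // Flush anything that was buffered while we were waiting for the history.
        Iterable<Transaction> pending = bufferedTransactions.get();
        if (pending != null) {
            for (Transaction t : pending) {
                emit(t, out);
            }
        }
        bufferedTransactions.clear();
    }

    // Live transactions from today's stream.
    @Override
    public void processElement2(Tuple2<String, Transaction> value, Context ctx,
                                Collector<ExtractedFeatures> out) throws Exception {
        if (history.value() == null) {
            bufferedTransactions.add(value.f1); // history not ingested yet, park the event
        } else {
            emit(value.f1, out);
        }
    }

    private void emit(Transaction transaction, Collector<ExtractedFeatures> out) {
        // The same feature computation as in the question (findHistoricalDate etc.) would go here.
        out.collect(new ExtractedFeatures());
    }
}

The essential point is that nothing is emitted for a card until that card's history has been written into state; until then the live events are parked in ListState.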
For a more thorough exploration of this topic, see Bootstrapping State In Apache Flink.
Better support for this use case is planned for a future release, btw.

Related

Returning multiple items in gRPC: repeated List or stream single objects?

gRPC newbie here. I have a simple API:
Customer getCustomer(int id)
List<Customer> getCustomers()
So my proto looks like this:
message ListCustomersResponse {
    repeated Customer customer = 1;
}
rpc ListCustomers (google.protobuf.Empty) returns (ListCustomersResponse);
rpc GetCustomer (GetCustomerRequest) returns (Customer);
I was trying to follow Google's lead on the style. Originally I had returns (stream Customer) for GetCustomers, but Google seems to favor the ListXxxResponse style. When I generate the code, it ends up being:
public void getCustomers(com.google.protobuf.Empty request,
        StreamObserver<ListCustomersResponse> responseObserver) {

vs:

public void getCustomers(com.google.protobuf.Empty request,
        StreamObserver<Customer> responseObserver) {
Am I missing something? Why would I want to go through the hassle of creating a ListCustomersResponse when I can just do stream Customer and get the streaming functionality?
The ListCustomersResponse approach returns the whole list in one message, versus streaming each customer individually. Google's preference seems to be the ListCustomersResponse style all of the time.
When is it appropriate to use the ListXxxResponse style versus a stream response?
This question is hard to answer without knowing what reference you're using. It's possible there's a miscommunication, or that the reference is simply wrong.
If you're looking at the gRPC Basics tutorial though, then I might have an inkling as to what caused a miscommunication. If that's indeed your reference, then it does not recommend returning repeated fields for streamed responses; your intuition is correct: you would just want to stream the singular Customer.
You might be reading rpc ListFeatures(Rectangle) as meaning an endpoint that returns a list [noun] of features. If so, that's a miscommunication. The guide actually means an endpoint to list [verb] features. It would have been less confusing if they just wrote rpc GetFeatures(Rectangle).
So, your proto should look more like this,
rpc GetCustomers (google.protobuf.Empty) returns (stream Customer);
rpc GetCustomer (GetCustomerRequest) returns (Customer);
generating exactly what you suspected made more sense.
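For what it's worth, here is a hedged sketch of the corresponding server-side implementation for the streaming variant. The CustomerServiceGrpc / CustomerServiceImplBase and loadCustomers names are illustrative, since the actual generated class names depend on your service definition:

import com.google.protobuf.Empty;
import io.grpc.stub.StreamObserver;
import java.util.Collections;

public class CustomerServiceImpl extends CustomerServiceGrpc.CustomerServiceImplBase {

    @Override
    public void getCustomers(Empty request, StreamObserver<Customer> responseObserver) {
        // Each customer is written to the stream individually instead of being
        // collected into a ListCustomersResponse wrapper.
        for (Customer customer : loadCustomers()) {
            responseObserver.onNext(customer);
        }
        responseObserver.onCompleted();
    }

    // Hypothetical data-access helper, just for illustration.
    private Iterable<Customer> loadCustomers() {
        return Collections.emptyList();
    }
}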
Update
Ah I see, so you're looking at this example in googleapis:
// Lists shelves. The order is unspecified but deterministic. Newly created
// shelves will not necessarily be added to the end of this list.
rpc ListShelves(ListShelvesRequest) returns (ListShelvesResponse) {
    option (google.api.http) = {
        get: "/v1/shelves"
    };
}

...

// Response message for LibraryService.ListShelves.
message ListShelvesResponse {
    // The list of shelves.
    repeated Shelf shelves = 1;

    // A token to retrieve next page of results.
    // Pass this value in the
    // [ListShelvesRequest.page_token][google.example.library.v1.ListShelvesRequest.page_token]
    // field in the subsequent call to `ListShelves` method to retrieve the next
    // page of results.
    string next_page_token = 2;
}
Yeah, I think you've probably figured the same by now, but here they have chosen to use a simple RPC, as opposed to a server-side streaming RPC (see here). I emphasize this because I think the important choice is not the stylistic difference between repeated versus stream, but rather the difference between a simple request-response API and a more complex, less ubiquitous streaming API.
In the googleapis example above, they're defining an API that returns a fixed and static number of items per page, e.g. 10 or 50. It would simply be overcomplicated to use streaming for this, when pagination is already so well understood and prevalent in software architecture and REST APIs. I think that is what they should have said, rather than "a small number." So the complexity of streaming (and the learning cost to you and future maintainers) has to be justified, that's all. Suppose you're actually fetching thousands of (x, y, z) items for a point cloud, or you're building a live-updating bid-ask visualizer for some cryptocurrency, for example.
Then you'd start asking yourself, "Is a simple request-response API my best option here?" So it tends to be that the larger the number of items that need to be returned, the more a streaming API starts to make sense. And that can be for conceptual reasons, e.g. the items are a live-updating stream in time, like the crypto example above, or for architectural ones, e.g. it would be more efficient to start displaying results in the UI as partial data streams back. I think the "small number" thing you read was an oversimplification.
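To make the trade-off concrete, here is a sketch of how a Java client would consume each style, assuming both variants exist side by side for comparison. The stub and generated types (CustomerServiceGrpc, its blocking stub, ListCustomersResponse) are the names protoc would typically produce and are assumptions here:

import com.google.protobuf.Empty;
import java.util.Iterator;

// Assumes the proto from the question; both RPCs take google.protobuf.Empty.
void printCustomers(CustomerServiceGrpc.CustomerServiceBlockingStub stub) {
    // Simple RPC: the whole list of customers arrives in one response message.
    ListCustomersResponse response = stub.listCustomers(Empty.getDefaultInstance());
    for (Customer c : response.getCustomerList()) {
        System.out.println(c);
    }

    // Server-streaming RPC: the blocking stub hands back an Iterator, and each
    // customer can be processed as soon as it arrives on the wire.
    Iterator<Customer> customers = stub.getCustomers(Empty.getDefaultInstance());
    while (customers.hasNext()) {
        System.out.println(customers.next());
    }
}

If the list fits comfortably in one response (or one page), the simple RPC is usually the easier API to build and maintain; the streaming form pays off when results are large or arrive over time.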

Why does flutter firestore plugin not worry about closing its sinks (snapshot streams)?

In the flutter firestore codebase you can find a comment about the stream it creates when you run snapshots() on a query.
// It's fine to let the StreamController be garbage collected once all the
// subscribers have cancelled; this analyzer warning is safe to ignore.
StreamController<QuerySnapshotPlatform> controller; // ignore: close_sinks
I want to wrap my resulting snapshot streams with a BehaviorSubject so I can keep track of the latest entry. This is useful when I have one stream at the top of a page that I want to be consumed by different widgets farther down in my tree without reloading the stream each time. Without keeping track in a BehaviorSubject or elsewhere, if a new widget starts listening to that stream, it does not get the most recent information from Firestore because it missed that event.
Can I also not worry about closing the behavior subject I am going to create as it will be garbage collected when there are no more listeners? Or is there another way to achieve what I am wanting?
I'm picturing code like this:
final snapshotStream = _firestore.collection('users').snapshots();
final behaviorSubjectStream = BehaviorSubject();
behaviorSubjectStream.addStream(snapshotStream);
return behaviorSubjectStream;
This will get a complaint that I don't close the behaviorSubjectStream. Is it ok to ignore?
That depends on how you listen to the subject.
From what you describe, it sounds safe to ignore the hint. When the subscriptions that listen to the subject are cancelled, the subject will be cancelled as well (when the garbage collector finds it).
There are situations where you have a subscription that is still listening, but you want the subject to stop emitting. In that case you will need to close() the subject.
You can test that the subject is correctly cancelled by adding
behaviorSubjectStream.onCancel = () {
print("onCancel");
};
Then you can test it by playing around with your app.

Queries depending on dataset in Firestore

Recently I migrated from the Firebase Realtime Database to Firebase Firestore because it says the speed of a query depends on the size of the dataset (number of documents) being retrieved and not on the number of documents in the collection. I checked with a varying number of documents in a collection (100, 5000 and 10000) and I was retrieving 20 documents in each query. What I saw was that the query time increased as I moved from 100 to 5000 to 10000 documents in the collection. Why is this happening?
Is it because Firestore is in beta?
Querying on Android (collection with 21000 docs):
collectionReference = FirebaseFirestore.getInstance()
        .collection("_countries").document("+91")
        .collection("_classes-exams").document(String.valueOf(mItem))
        .collection("_profiles").document(mOtherUserUid == null ? uid : mOtherUserUid)
        .collection("user_data");

collectionReference.document("_content").collection(articlesOrQuestions)
        .orderBy("mTimeStampAsDate", Query.Direction.DESCENDING)
        .limit(20)
        .get()
        .addOnCompleteListener(mCompleteListener)
        .addOnFailureListener(new OnFailureListener() {
            @Override
            public void onFailure(@NonNull Exception e) {
                Toast.makeText(getContext(), e.getMessage(), Toast.LENGTH_LONG).show();
            }
        });
Image of android monitor when querying the above collection reference: https://i.stack.imgur.com/QZaVX.jpg
You can see that the query took almost one minute by looking at the heap (after one minute the memory didn't change much and remained constant, and there is a sudden spike in the network section after one minute, from which you can infer that onComplete was called). What is happening between calling get() and the onComplete callback? This doesn't happen when querying small collections. And why is the query fast on the web but slow on Android?
Link to jsbin: http://jsbin.com/fusoxoviwa/1/edit?html,css,js,console,output
Did you write these collections from the same Android client that's now loading a subset of documents from them? If so, that would explain it.
The client-side cache in that case will contain information about all docs, and your app is spending time churning through that information.
If you try on a "clean client", it won't have any information cached and it should only spend time on documents that the client is requesting (or has requested before).
The behavior you're currently seeing should improve before Firestore exits beta, since the index will become more efficient, but also because it'll get some form of GC.

Does Realm support SELECT FOR UPDATE style read locking

I've spent a fair amount of time looking into the Realm database mechanics and I can't figure out if Realm is using row level read locks under the hood for data selected during write transactions.
As a basic example, imagine the following "queue" logic
assume the queue has an arbitrary number of jobs (we'll say 5 jobs)
async getNextJob() {
    let nextJob = null;
    this.realm.write(() => {
        let jobs = this.realm.objects('Job')
            .filtered('active == FALSE')
            .sorted([['priority', true], ['created', false]]);
        if (jobs.length) {
            nextJob = jobs[0];
            nextJob.active = true;
        }
    });
    return nextJob;
}
If I call getNextJob() twice concurrently and row-level read blocking isn't occurring, there's a chance that both calls will return the same job object when we query for jobs.
Furthermore, if I have outside logic that relies on up-to-date data in read logic (i.e. job.active == false when it is actually true at the current time), I need the read to block until update transactions complete. MVCC reads getting stale data do not work in this situation.
If read locks are being set in write transactions, I could make sure I'm always reading the latest data like so
let active = null;
this.realm.write(() => {
    const job = this.realm.pseudoQueryToGetJobByPrimaryKey();
    active = job.active;
});
// Assuming the above write transaction blocked the read until
// any concurrent updates touching the same job committed,
// the value for active can be trusted at this point in time.
if (active === false) {
    // code to start job here
}
So basically, TL;DR does Realm support SELECT FOR UPDATE?
PostgreSQL: https://www.postgresql.org/docs/9.1/static/explicit-locking.html
MySQL: https://dev.mysql.com/doc/refman/5.7/en/innodb-locking-reads.html
So basically, TL;DR does Realm support SELECT FOR UPDATE?
Well if I understand the question correctly, the answer is slightly trickier than that.
If there is no Realm Object Server involved, then realm.write(() => disallows any other writes at the same time, and updates the Realm to its latest version when the transaction is opened.
If there is Realm Object Server involved, then I think this still stands locally, but the Realm Sync manages the updates from remote, in which case the conflict resolution rules apply for remote data changes.
Realm does not allow concurrent writes. There is at most one ongoing write transaction at any point in time.
If the async getNextJob() function is called twice concurrently, one of the invocations will block on realm.write().
SELECT FOR UPDATE then works trivially, since there are no concurrent updates.

Flink transformation which does REST call (async, Future, Netty)

Let's assume that Flink receives a stream of thousands of tweets per second and that somewhere in the process, it needs to classify them as spam or not. I have a cluster of e.g. 20 machines that provide the "classification" microservice through a REST API; they can give a max throughput of 10k tweets per second and their latency is 3 seconds. This means that in the worst case, I might have 30k tweets in flight, and that's OK. I guess that to consume this service from Flink, an implementation would be something like this:
public class Classifier implements MapFunction<Tweet, TweetWithClass> {
    @Override
    public TweetWithClass map(Tweet tweet) {
        TweetWithClass twc = new TweetWithClass(tweet);
        // pseudocode: synchronous REST call to the classification service
        twc.classes = (new Post('http://my.classifier.com', data = tweet.body)).bodyAsStringArrayFromJson();
        return twc;
    }
}

DataSet<TweetWithClass> outTweets = inTweets.map(new Classifier()).setParallelism(30000);
Now, given this API, my guess is that Flink would have no choice other than starting 30k threads, and that would be potentially bad. I see in the source code that Flink uses Netty; I guess it could support this operation more efficiently by using asynchronous calls... If a fictitious, beautiful Netty, Flink and Java API existed, this would look something like this:
public class Classifier implements MapFunction<Tweet, TweetWithClass> {
    @Override
    public Future<TweetWithClass> map(Tweet tweet) {
        // pseudocode: asynchronous REST call returning a future
        Future<String[]> classes = (new NettyPost('http://my.classifier.com', data = tweet.body)).asyncBodyAsStringArrayFromJson();
        return classes.onGet( (String[] classes) -> new TweetWithClass(tweet, classes) );
    }
}

DataSet<TweetWithClass> outTweets = inTweets.nettyMap(new Classifier()).setMaxParallelism(30000);
Is there a way to use asynchronous calls to have massive scalability with very few threads in Flink?
I know this is a relatively old question, but as of Flink 1.2 (which came out in February 2017), Flink offers an API for exactly this purpose.
It's called async I/O.
With async I/O you can perform asynchronous calls to external databases, or in your case an external web service, and get the results back through a callback inside a future.
More information can be found here: https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/stream/asyncio.html
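For illustration, here is a rough sketch of an async I/O version of the classifier, using the ResultFuture-based API from more recent Flink releases. The callClassifier helper stands in for whatever non-blocking HTTP client you use and is not a real API; the Tweet and TweetWithClass types are the ones from the question:

import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

public class AsyncClassifier extends RichAsyncFunction<Tweet, TweetWithClass> {

    @Override
    public void asyncInvoke(Tweet tweet, ResultFuture<TweetWithClass> resultFuture) {
        // callClassifier is a placeholder for an asynchronous HTTP client call
        // (e.g. one backed by Netty); it must not block the calling thread.
        CompletableFuture<String[]> classes = callClassifier(tweet.body);
        classes.thenAccept(c ->
            resultFuture.complete(Collections.singleton(new TweetWithClass(tweet, c))));
    }

    // Placeholder so the sketch compiles; replace with a real async REST call.
    private CompletableFuture<String[]> callClassifier(String body) {
        return CompletableFuture.completedFuture(new String[0]);
    }
}

// In the job: capacity bounds the number of in-flight requests
// (e.g. the ~30k you estimated), not the number of threads.
DataStream<TweetWithClass> outTweets = AsyncDataStream.unorderedWait(
        inTweets, new AsyncClassifier(), 3, TimeUnit.SECONDS, 30000);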
