Event-Driven API with Real-Time Streaming Analytics from a Data Stream? (Kappa Architecture, IoT)

I've recently read up on common Big Data architectures (Lambda and Kappa), and I'm trying to put them into practice in the context of an IoT application.
As of right now, events are produced, ingested into a database, queried, and provided as a REST API (backend) for a (React) frontend. However, this architecture is not event-driven, as the frontend isn't notified or updated when there are new events; I use frequent HTTP requests to "simulate" a real-time application.
Now at first glance, the Kappa Architecture seems like the perfect fit for my needs, but I'm having trouble finding a technology that lets me write dynamic aggregation queries and serve them to a frontend.
As I understand it, frameworks like Apache Flink (or Spark Structured Streaming) are a great way to write such queries and apply them to the data stream, but the queries are static and can't be changed.
I'd like to find a way to filter, group, and aggregate events from a stream and provide the results to a frontend using WebSockets or SSE. As of right now, the aggregates don't need to be persisted, as they are strictly for visualization (this will probably change in the future).
I have added a Kafka broker to my application, and all events are ingested into a topic and ready for consumption.
Before I implemented Kafka, I tried to apply aggregation pipelines to my MongoDB change stream, which isn't fully supported and therefore doesn't fit my needs.
I tried Apache Druid, but it seems to only support a request/response pattern and can't stream query results for consumption.
I've looked into Apache Flink, but it seems you can only define static queries that are then submitted to the Flink cluster. Interactive/ad-hoc queries don't appear to be possible, which is a real pity, as it otherwise looked very promising.
I think I've found an approach that might work using Kafka + Kafka Streams (sketched at the end of this post), but I'm not really satisfied with it, which is why I'm writing this post.
My problem boils down to two questions:
How can I properly create interactive queries (filter, group (windowing), aggregate) and receive a continuous stream of results?
How can I serve this result stream to a frontend for visualization and thereby create a truly event-driven API?
I'd like to only rely on open-source/free software (Apache etc.).
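
For reference, here is a rough sketch of the Kafka Streams route I mentioned above. It is not my actual code: the topic names ("iot-events", "iot-aggregates"), the store name, the key/value types, and the 10-second window are placeholders, and it assumes events are keyed by device ID with a numeric payload.

```java
import java.time.Duration;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.state.WindowStore;

public class IotAggregationTopology {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "iot-aggregates-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.Double().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Events keyed by device ID, with a numeric measurement as the value.
        KStream<String, Double> events = builder.stream("iot-events");

        events
            .filter((deviceId, value) -> value != null)                          // filter
            .groupByKey()                                                         // group per device
            // Kafka Streams 3.x API; older versions use TimeWindows.of(...)
            .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofSeconds(10)))    // windowing
            .count(Materialized.<String, Long, WindowStore<Bytes, byte[]>>as("event-counts")) // aggregate
            .toStream()
            // Flatten the windowed key into a plain string key for downstream consumers.
            .map((windowedDeviceId, count) -> KeyValue.pair(
                    windowedDeviceId.key() + "@" + windowedDeviceId.window().start(), count))
            .to("iot-aggregates", Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

A thin web service could then consume the "iot-aggregates" topic with a regular Kafka consumer and forward each record to the browser over SSE or WebSockets, or read the "event-counts" store via Kafka Streams' interactive queries. What still bothers me is that the filter, window size, and aggregation are fixed at build time, which brings me back to the two questions above.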

Related

How to minimise Firebase Function Latency

As per the documentation, Firebase Functions are currently supported in only four regions: "us-central1", "us-east1", "europe-west1", and "asia-northeast1".
That means locations further away would incur more latency, and often that translates to lower performance.
How can this limitation be worked around?
1) Choose the location that is closest to you. You can set up test Cloud Functions in different regions and measure the round-trip latency. Only you can discover the specifics of your location.
2) Focus your software architecture on infrastructure that is locally available.
Use the client-side Firestore library directly as much as possible. It supports offline data, queueing writes to send later if you don't have an internet connection, and caching read data locally - you can't get lower latency than that! So make sure you use Firestore for CRUD operations.
3) Architect to use Cloud Functions for batch and background processing. If any business-logic processing is required, write the data to Firestore (using the client libraries) and have a Cloud Functions trigger do the processing on that write event. Have the trigger update the record with the result of the additional processing and its state. If you're using the client-side libraries, there is a way to have the updated data automatically pushed back to the client (a minimal client-side sketch follows this list).
You also have the bonus benefit of being able to control authorisation with Firebase Auth and Firestore security rules, whereas Functions run with admin-level access and don't have that per-user authorisation control.
4) Reduce chatter - minimise the number of Cloud Function calls overall, and make sure each Cloud Function does more in one go and returns more complete data in one go.
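
To illustrate point 3, here is a minimal client-side sketch using the Firestore Android SDK: a snapshot listener that receives the record again once the trigger has updated it. The "orders" collection and "state" field are hypothetical, and the Cloud Functions trigger itself (written with the Node.js Functions SDK) is not shown.

```java
import com.google.firebase.firestore.FirebaseFirestore;
import com.google.firebase.firestore.ListenerRegistration;

public class OrderStatusWatcher {

    // Watches a single document. The client writes the document, a Cloud Functions
    // trigger enriches it server-side, and this listener is called again with the
    // updated data - no polling involved.
    public ListenerRegistration watchOrder(String orderId) {
        FirebaseFirestore db = FirebaseFirestore.getInstance();
        return db.collection("orders").document(orderId)
                .addSnapshotListener((snapshot, error) -> {
                    if (error != null || snapshot == null || !snapshot.exists()) {
                        return; // log or surface the error in a real app
                    }
                    // "state" is whatever field the trigger writes back, e.g. "processed".
                    String state = snapshot.getString("state");
                    // ...update the UI with the server-side result here...
                });
    }
}
```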

One big call to db vs multiple small calls to implement auto suggest

I am working on a project and implementing its search functionality.
There is a text box with auto-suggest.
I have two ways to go:
Make a single call to the DB and filter the auto-suggest list on the client, or
Make multiple calls to the DB and update the auto-suggest list using AJAX.
What is the best solution performance-wise, and why?
It depends on how "heavy" both approaches are from the database perspective and how fast the auto-suggest response needs to be. A well-behaved application built on the connection-pool pattern should not need too many resources for the second approach; however, network traffic and latency come into play. On the other hand, the first approach might take more resources.
So I would recommend testing it out in realistic conditions using a load-testing tool like Apache JMeter, producing the same load against the two implementations and measuring which one works faster and consumes fewer resources. See The Real Secret to Building a Database Test Plan With JMeter to get familiar with database load testing.
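
For what it's worth, here is a minimal sketch of the second approach (one small, indexed query per keystroke) in plain JDBC. The products table, name column, and the LIKE-based prefix match are only an example - substitute your own schema.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class SuggestDao {

    private final Connection connection; // ideally borrowed from a connection pool

    public SuggestDao(Connection connection) {
        this.connection = connection;
    }

    // Second approach: fire one small prefix query per keystroke.
    // Fast only if there is an index on the "name" column.
    public List<String> suggest(String prefix, int limit) throws SQLException {
        String sql = "SELECT name FROM products WHERE name LIKE ? ORDER BY name LIMIT ?";
        try (PreparedStatement ps = connection.prepareStatement(sql)) {
            ps.setString(1, prefix + "%");
            ps.setInt(2, limit);
            try (ResultSet rs = ps.executeQuery()) {
                List<String> suggestions = new ArrayList<>();
                while (rs.next()) {
                    suggestions.add(rs.getString(1));
                }
                return suggestions;
            }
        }
    }
}
```

The first approach would instead run one broad query up front and filter the cached list in application (or browser) memory; JMeter can then drive both variants with the same simulated typing pattern to compare latency and resource usage.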

Effective subscription to data feeds

How can I effectively implement a subscription mechanism in G-WAN? Suppose I want G-WAN to aggregate data from various tickers and process it further. And, obviously, every feed provides its data in its own unique format.
The straightforward way would be to create the connections and subscribe to the data in the init() function of the connection handler, then parse the source info from the responses and dispatch the data from the main() function to dedicated queues. But this approach doesn't seem to make any use of G-WAN's efficient task-scheduling engine. So maybe dedicated software would solve the problem faster?
Another approach would be to create a dedicated servlet for every subscription. For that, in the main() function of the connection handler, I would need to rewrite the headers and include the names of the corresponding servlets. In this case I would employ the whole G-WAN machinery, but doesn't rewriting the headers negate all of G-WAN's performance advantage?
G-WAN already provides a simple publisher/subscriber engine, see the Comet servlet example.
This works fine with slow (typically 1 update per second) feeds.
For real-time and BigData feeds, there's no alternative to using G-WAN protocol handlers (to bypass connection handler rewrites and to precisely define the desired latency).
That's what happened for this project distributing 150 million messages per second via 75,000 channels to 1.5 million subscribers.
We have also made a (now famous) demo for the ORACLE OpenWorld expo in SFO that processed 1.2 billion TPS (transactions per second) on one single server, by using G-WAN as a cache for the ORACLE noSQL database (a Java KV store).
So the limits are more a question of precise tuning than G-WAN's core engine limitations.

How efficient is Meteor's DDP at syncing very large collections?

Meteor's DDP protocol works very well for syncing a small collection of data from a server to a browser-based client, which inherently limits the amount of data that is processed.
However, consider a situation where Meteor is being used to sync a large collection from one server to another, or just the DDP protocol itself is used to sync one MongoDB with another.
How efficient is DDP in this case (computationally)? How well does it scale to several clients? Is the limit to performance only bandwidth or will DDP hit some CPU bound as well? What is the largest amount of data that can be reasonably synced over DDP right now? Is DDP just the wrong approach for doing this (see references below)?
Some additional thoughts:
As far as I know, the current version of DDP keeps track of each client's entire collection, so it can't be asymptotically very efficient.
Smart Collections were created to improve the performance of server-to-client collection syncing. But it's unclear to me whether this improves DDP itself or something else.
See also:
How to implement real-time replication of MongoDB (or CouchDB) to many remote clients
DDP vs Straight MongoDB access for synching large amounts of data
EDIT:
After some empirical experience with this, I have to conclude that the answer is "not very efficient". See https://stackoverflow.com/a/21835534/586086 for an explanation.
Discussions with Meteor devs indicated that this problem will be addressed in the future with a revision of DDP and the publish-subscribe API, whereby the merge box will be removed and clients will handle merging. This will save CPU/memory on the server and allow for much larger datasets to be sent over the wire.
Basically, it is more a matter of what and how you are publishing to the client than of the number of clients. A request is usually handled in log2(N) time if the query is indexed, so it is quite easy for the server to recompute the result set even if (in the worst case) the whole collection changes. So, from the server side, you can get the new result sets to publish to the clients quite quickly (if they changed from the ones the clients already had).
The real problem (and a common error) comes when you publish everything to the client (as the former autopublish did), so build your publications wisely so that you only give the client what it is supposed to see. You can either prune the documents by hiding unneeded fields, or reduce the result set sent to the client by creating a publication whose parameters are specific to your data's scope of use.
Data reactivity (a session parameter bound to a publication) should also be handled with care: if, for example, you send a request each time a key is pressed in the search field, you can quickly overload the connection (still strongly depending on the size of the set you are publishing). We had to take care of this while building a real-estate service on Meteor; with a data set of several gigabytes, it was quite challenging to handle without blocking the pipe with overloaded data.
In terms of bandwidth, DDP is quite good because it supports clever entry updates (sending only the changed fields instead of the whole document); moving an item (server-side reordering) is, or will be, supported too.
Also take a look at this excellent answer about huge collections and what is done under the hood.

Recommendations using R with SimpleDB or BigQuery or using PHP with SimpleDB

I am currently working on a system that generates product recommendations like those on Amazon: "People who bought this also bought this..."
Current Scenario:
Extract the client's Google Analytics data and insert it into the database.
On the client's website, when a product page loads, an API call is made to get recommendations for the product being viewed.
When the API receives the product ID in the request, it looks in the database, retrieves the recommended product IDs (using association rules), and sends them as the response.
The list of these product IDs is then processed at the client end to get the product details (image, price, ...) and displayed on the website.
Currently I am using PHP and MySQL with the gapi package and a REST API, with storage on Amazon EC2.
My Question is:
Now, if I have to choose among the following, which would be the best choice for implementing the concept described above?
PHP with SimpleDB or BigQuery.
R with BigQuery.
RHIPE (R and Hadoop) with SimpleDB.
Apache Mahout.
Please help!
This isn't so easy to answer, because the constraints are fairly specialized.
The following considerations can be made, though:
BigQuery is not yet public. Thus, with a small user base, even if you are in the preview population, it will be harder to get advice on improvements.
Each of your options pairs a modeling system with a storage system. Apache Mahout is not a storage mechanism, so it won't necessarily work on its own. I used to believe that its machine-learning implementations were a pastiche of a few Google Summer of Code projects, but I've updated that view at the suggestion of a commenter. It still looks like it has rather uneven and spotty coverage of different algorithms, and it's not particularly clear how the components are supported or maintained. I encourage an evangelist for Mahout to address this.
As a result, this eliminates the 1st, 2nd, and 4th options.
What I don't quite get is the need for a real-time server to utilize Hadoop and RHIPE. That should be done in your batch processing for developing the recommendation models, not in real-time. I suppose you could use RHIPE as a simple one-stop front end for firing off queries.
I'd recommend using RApache instead of RHIPE, because you can get your packages and models pre-loaded. I see no advantage to using Hadoop in the front end, but it would be a very natural back end system for the model fitting.
(Update 1) Other interface options include RServe (http://www.rforge.net/Rserve/) and possibly RStudio in server mode. There are R/PHP interfaces (see comments below), but I suspect it would be better to access R through HTTP or TCP/IP.
(Update 2) Addressing the whole process, the basic idea I see is that you could query the data from PHP and pass to R or, if you wish to query from within R, look at the link in the comments (to the OmegaHat tools) or post a new question about R & SimpleDB - I'm sure someone else on SO would be able to give better insight on this particular connection. RApache would let you instantiate many R processes already prepared with packages loaded and data in RAM; thus you would only need to pass whatever data needs to be used for prediction. If your new data is a small vector then RApache should be fine, and it seems this is correct for the data being processed in real-time.
If you want a real-time API for recommendations based on data in a database, Apache Mahout does this directly. You want to use a ReloadFromJDBCDataModel, put a GenericItemBasedRecommender on top, and use the servlet-based wrapper in the examples module. It's probably a day or two of work to get familiar with the code and customize it to your needs, but it's pretty simple.
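
A minimal sketch of that setup with Mahout's Taste API (the 0.x releases), assuming the default taste_preferences(user_id, item_id, preference) table in MySQL; the log-likelihood similarity is just one reasonable choice.

```java
import java.util.List;
import javax.sql.DataSource;

import org.apache.mahout.cf.taste.impl.model.jdbc.MySQLJDBCDataModel;
import org.apache.mahout.cf.taste.impl.model.jdbc.ReloadFromJDBCDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class ProductRecommender {

    private final GenericItemBasedRecommender recommender;

    public ProductRecommender(DataSource dataSource) throws Exception {
        // ReloadFromJDBCDataModel caches the JDBC-backed preferences in memory
        // and can be refreshed periodically as new Analytics data arrives.
        DataModel model = new ReloadFromJDBCDataModel(new MySQLJDBCDataModel(dataSource));
        ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
        this.recommender = new GenericItemBasedRecommender(model, similarity);
    }

    // "People who bought this also bought..." for one product ID.
    public List<RecommendedItem> alsoBought(long productId, int howMany) throws Exception {
        return recommender.mostSimilarItems(productId, howMany);
    }
}
```

The servlet-based wrapper mentioned above can then expose alsoBought() as the HTTP endpoint that the product pages already call.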
When you get past about 100M data points, you would need to look at distributing the computation with Hadoop. That's a fair bit more complex. Mahout has a distributed recommender too, which you can customize.
