Effective subscription to data feeds - servlets

How to effectively implement subscription mechanism in G-Wan? Suppose, I want to make g-wan aggregate data from various tickers and farther process it. And, obviously, every feed provides the data in its unique format.
The straightforward way would be to create connections and subscribe to data in the init() function of the connection handler, then parse source info from the responses and dispatch data from the main() function to dedicated queues. But this approach doesn't seem to make any use of the effective task scheduling engine of G-Wan. So, may be a dedicated software would solve the problem faster?
Another approach would be creating dedicated servlets for every subscription. For that, in the main() func of the connection handler, I would need rewriting headers and including names of corresponding servlets. In this case I would employ the whole g-wan machinery. But doesn't the rewriting headers negate all performance advantage of g-wan?

G-WAN already provides a simple publisher/subscriber engine, see the Comet servlet example.
This works fine with slow (typically 1 update per second) feeds.
For real-time and BigData feeds, there's no alternative to using G-WAN protocol handlers (to bypass connection handler rewrites and to precisely define the desired latency).
That's what happened for this project distributing 150 million messages per second via 75,000 channels to 1.5 million subscribers.
We have also made a (now famous) demo for the ORACLE OpenWorld expo in SFO that processed 1.2 billion TPS (transactions per second) on one single server, by using G-WAN as a cache for the ORACLE noSQL database (a Java KV store).
So the limits are more a question of precise tuning than G-WAN's core engine limitations.

Related

Event-Driven-API with Real Time Streaming Analytics from Datastream? (Kappa-Architecture, IoT)

I've recently read up common Big Data architectures (Lambda and Kappa) and I'm trying to put it into practice in the context of an IoT Application.
As of right now, events are produced, ingested into a database, queried and provided as a REST-API (Backend) for a (React) Frontend. However, this architecture is not event driven as the front end isn't notified or updated if there are new events. I use frequent HTTP-Requests to "simulate" a real time application.
Now at first glance, the Kappa Architecture seems like the perfect fit for my needs, but I'm having trouble finding a technology that lets me write dynamic aggregation queries and serve them to a frontend.
As I understand, Frameworks like Apache Flink (or Spark Structured Streaming) are a great way to write such queries and apply them to the datastream, but they are static and can't be changed.
I'd like to find a way, how to filter, group, and aggregate events from a stream and provide them to a frontend using WebSockets or SSE. As of right now, the aggregates don't need to be persisted as they are strictly for visualization (this will probably change in the future).
I implemented a Kafka Broker into my application and all events are ingested into a topic and ready for consumption.
Before I implemented Kafka I tried to apply Aggregation Pipelines on my MongoDB Change Feed, which isn't fully supported and therefore doesn't fit my needs.
I tried using Apache Druid, but it seems as if it only supports a request/response-pattern and can't stream query results for consumption
I've looked into Apache Flink, but it seems as if you can only define static queries that are then committed to the Flink Cluster. It seems as if Interactive/Ad-hoc queries are not possible which is really sad, as it looked very promising otherwise.
I think I've found a way that could maybe work using Kafka + Kafka Streams, but I'm not really satisfied with it and this is why I'm writing this post.
My problem boils down to 2 questions:
How can I properly create interactive queries (filter, group (windowing), aggregate) and receive a continuous stream of results?
How can I serve this result stream to a frontend for visualization and therefore create an truly event-driven API?
I'd like to only rely on open-source/free software (Apache etc.).

How can I track WebSockets in Node.js (ws) using Azure Application Insights?

I'm using Node.js and ws for my WebSocket servers and want to know the best practice methods of tracking connections and incoming and outgoing messages with Azure Azure Application Insights.
It appears as though this service is really only designed for HTTP requests and responses so would I be fine if I tracked everything as an event? I'm currently passing the JSON.parse'd connection message values.
What to do here really depends on the semantics of your websocket operations. You will have to track these manually since the Application Insights SDK can't infer the semantics to map to Request/Dependency/Event/Trace the same way it can for HTTP. The method names in the API do indeed make this unclear for non-HTTP, but it becomes clearer if you map the methods to the telemetry schema generated and what those item types actually represent.
If you would consider a receiving a socket message to be semantically beginning an "operation" that would trigger dependencies in your code, you should use trackRequest to record this information. This will populate the information in the most useful way for you to take advantage of the UI in the Azure Portal (eg. response time analysis in the Performance blade or failure rate analysis in the Failures blade). Because this request isn't HTTP, you'll have to mend your data to fit the schema a bit. An example:
client.trackRequest({name:"WS Event (low cardinality name)", url:"WS Event (high cardinality name)", duration:309, resultCode:200, success:true});
In this example, use the name field to describe that items that share this name are related and should be grouped in the UI. Use the url field as information that more completely describes the operation (like GET parameters would in HTTP). For example, name might be "SendInstantMessage" and url might be "SendInstantMessage/user:Bob".
In the same way, if you consider sending a socket message to be request for information from your app, and has meaningful impact how your "operation" acts, you should use trackDependency to record this information. Much like above, doing this will populate the data in most useful way to take advantage of the Portal UI (Application Map in this case would then be able to show you % of failed Websocket calls)
If you find you're using websockets in a way that doesn't really fit into these, tracking as an event as you are now would be the correct use of the API.

How to handle network calls in Microservices architecture

We are using Micro services architecture where top services are used for exposing REST API's to end user and backend services does the work of querying database.
When we get 1 user request we make ~30k requests to backend service. We are using RxJava for top service so all 30K requests gets executed in parallel.
We are using haproxy to distribute the load between backend services.
However when we get 3-5 user requests we are getting network connection Exceptions, No Route to Host Exception, Socket connection Exception.
What are the best practices for this kind of use case?
Well you ended up with the classical microservice mayhem. It's completely irrelevant what technologies you employ - the problem lays within the way you applied the concept of microservices!
It is natural in this architecture, that services call each other (preferably that should happen asynchronously!!). Since I know only little about your service APIs I'll have to make some assumptions about what went wrong in your backend:
I assume that a user makes a request to one service. This service will now (obviously synchronously) query another service and receive these 30k records you described. Since you probably have to know more about these records you now have to make another request per record to a third service/endpoint to aggregate all the information your frontend requires!
This shows me that you probably got the whole thing with bounded contexts wrong! So much for the analytical part. Now to the solution:
Your API should return all the information along with the query that enumerates them! Sometimes that could seem like a contradiction to the kind of isolation and authority over data/state that the microservices pattern specifies - but it is not feasible to isolate data/state in one service only because that leads to the problem you currently have - all other services HAVE to query that data every time to be able to return correct data to the frontend! However it is possible to duplicate it as long as the authority over the data/state is clear!
Let me illustrate that with an example: Let's assume you have a classical shop system. Articles are grouped. Now you would probably write two microservices - one that handles articles and one that handles groups! And you would be right to do so! You might have already decided that the group-service will hold the relation to the articles assigned to a group! Now if the frontend wants to show all items in a group - what happens: The group service receives the request and returns 30'000 Article numbers in a beautiful JSON array that the frontend receives. This is where it all goes south: The frontend now has to query the article-service for every article it received from the group-service!!! Aaand your're screwed!
Now there are multiple ways to solve this problem: One is (as previously mentioned) to duplicate article information to the group-service: So every time an article is assigned to a group using the group-service, it has to read all the information for that article form the article-service and store it to be able to return it with the get-me-all-the-articles-in-group-x query. This is fairly simple but keep in mind that you will need to update this information when it changes in the article-service or you'll be serving stale data from the group-service. Event-Sourcing can be a very powerful tool in this use case and I suggest you read up on it! You can also use simple messages sent from one service (in this case the article-service) to a message bus of your preference and make the group-service listen and react to these messages.
Another very simple quick-and-dirty solution to your problem could also be just to provide a new REST endpoint on the articles services that takes an array of article-ids and returns the information to all of them which would be much quicker. This could probably solve your problem very quickly.
A good rule of thumb in a backend with microservices is to aspire for a constant number of these cross-service calls which means your number of calls that go across service boundaries should never be directly related to the amount of data that was requested! We closely monitory what service calls are made because of a given request that comes through our API to keep track of what services calls what other services and where our performance bottlenecks will arise or have been caused. Whenever we detect that a service makes many (there is no fixed threshold but everytime I see >4 I start asking questions!) calls to other services we investigate why and how this could be fixed! There are some great metrics tools out there that can help you with tracing requests across service boundaries!
Let me know if this was helpful or not, and whatever solution you implemented!

Use-cases for reactive streams using java 9 Flows in Servlets?

I'm looking for use-cases for using reactive streams within a servlet container (or just a HTTP server).
The Jetty project has started being asked: "is Jetty reactive?" and we've noticed the proposal to add reactive streams to java 9.
So we've started some experiments with using the reactive streams API for async servlet IO, which are interesting enough..... but lack any focus because we lack real use-cases to focus which concerns are most important.
So does anybody have any good use-cases that they could share/explain so that we can direct our jetty experiments to meet their needs. The sort of thing I've imagined is having a RS based database publisher sending objects all the way out on a HTTP response or websocket connection using Flow.Processors for the conversions along the way.
A viable use case is when consuming the POSTing of multi-part form data, particularly when uploading files.
The Typesafe ConductR project (disclaimer: I'm the Tech Lead for it), receives multi-part form data when a user loads a bundle. We use akka-streams/http.
We read off the first two parts of the stream as our protocol specifies that they must declare some meta data in order for us to know which node to write the bundles to. After some validation, we then determine the node to write them to, and connect the partially consumed stream. Thus the node that receives the request to upload the bundle negotiates which node it is going to write it to, while not having to consume the entire stream (which could be 200MB) and then write it out again.
Writing out multi-part form data is also a great use-case given that you can stream the file from disk as a source and pass it on to some http endpoint i.e. the client-side of what I describe above.
The benefits with both use-cases are that you minimise the amount of memory needed to move bytes over a network, and you only perform file IO where it is necessary.

How efficient is Meteor's DDP at syncing very large collections?

Meteor's DDP protocol works very well for syncing a small collection of data from a server to a browser-based client, which inherently limits the amount of data that is processed.
However, consider a situation where Meteor is being used to sync a large collection from one server to another, or just the DDP protocol itself is used to sync one MongoDB with another.
How efficient is DDP in this case (computationally)? How well does it scale to several clients? Is the limit to performance only bandwidth or will DDP hit some CPU bound as well? What is the largest amount of data that can be reasonably synced over DDP right now? Is DDP just the wrong approach for doing this (see references below)?
Some additional thoughts:
As far as I know, the current version of DDP keeps track of each client's entire collection, so it can't be asymptotically very efficient.
Smart Collections were created to improve the performance of server-to-client collection of syncing. But it's unclear to me if this is improving DDP or something else.
See also:
How to implement real-time replication of MongoDB (or CouchDB) to many remote clients
DDP vs Straight MongoDB access for synching large amounts of data
EDIT:
After some empirical experience with this, I have to conclude that the answer is "not very efficient". See https://stackoverflow.com/a/21835534/586086 for an explanation.
Discussions with Meteor devs indicated that this problem will be addressed in the future with a revision of DDP and the publish-subscribe API, whereby the merge box will be removed and clients will handle merging. This will save CPU/memory on the server and allow for much larger datasets to be sent over the wire.
Basically it is more a matter of what and how you are publishing to the client than the number of clients. A request is usually handled in log2(N) if indexed, therefore it is quite easy for the server to recompute the result set even if (in the worst case) the whole collection would change. So, from the server side you can quite quickly get the new result sets to publish to the clients (if they changed from the one they had already).
The real problem (and common error) comes when you do publish everything to the client (like with the former autopublish), so make a publication wisely so that you do only give what the client is supposed to see. You can either prune the documents hiding useless fields or reduce the result set to send to the client by creating a publication with specific to your data scope of use parameters.
Data reactivity (session parameter bound on a publication) should also be handled with care, if for example you are sending a request each time you press a key in the search field, you might quickly overload the connection (still strongly depending on the size of the set you are publishing). We had to take care of this trying to build real estate service over meteor, the data set being over several gigabytes it was quite challenging to handle this without blocking the pipe with overloaded data.
In term of bandwidth, the DDP is quite good because it does supports clever entries updating (sending only fields changes instead of the whole document), moving an item is (will be) supported to (server side reordering).
Also take a look on this excellent answer concerning huge collections, what is done under the hood.

Resources