I have a use case which involves pulling/streaming data from numerous HTTP endpoints, in excess of over 100.
I have a standalone java app that can manage up to 30+ client requests using ning async client http library but was looking for some ideas on how I could scale this up to handle much more.
The use case is to pull the data from the end points and push them into a kafka queue (similar to jms queue) for processing by a storm topology. The bit I'm stuck on is how to best efficiently get the http end point data into the Kafka queues in the first place.
thanks
Related
.. when the http response entity is not consumed, or the client tcp buffer becomes full, or when the rate of client taking from its tcp buffer is lower then the rate of server pushing data to it?
I am looking for a way for to achieve the following:
Let's assume that there is a backpressure-able source of data on the server, such as an Apache Kafka topic.
If I consume this source from a remote location it may be possible that the rate at which that remote location can consume is lower - this is solved if Kafka client or consumer is used.
However let's assume that the client is a browser and that exposing direct Kafka protocol / connectivity is not a possibility.
Further, let's assume that there is a possibility of getting all the value even if jumping over some messages.
For instance in case of compacted topics, getting only the latest values for each key is enough for a client, no need to go through intermediate values.
This would be equivalent to Flowable.onBackpressureLatest() or AkkaStreams.aggregateOnBackpressure or onBackpressureAggregate.
Would it be a way to expose the topic over HTTP REST (e.g. Server Side Events / chunked transfer-encoding) or over web-sockets, that would achieve this effect of skipping over intermediate values for each key?
Please advise, thanks
Akka http supports back pressure based on TCP protocol very well and you can read about using it in combination with streaming here
Kafka consumption and exposure via http with back pressure can be easily achieved in combination of akka-http, akka-stream and alpakka-kafka.
Kafka consumers need to do polling and alpakka covers back pressure with reduction of polling requests.
I don't see the necessity of skipping over the messages when back pressure is fully supported. Kafka will keep track of the offset consumed by a consumer group (the one you pick for your service or http connection) and this will guarantee eventual consumption of all messages. Of course, if you produce messages way faster in a topic, the consumer will never catch up. Let me know if this is your case.
As a final note, you may check out Confluent REST Proxy API, which allows you to read Kafka messages in a restful manner.
A gRPC newbie question here.
We have a source system that exposes a bi directional gRPC stream. In order to scale our application, we wanted to process the stream data in parallel. Is it possible to have concurrent / multiple gRPC clients consuming from the stream without any conflicts in data processing / during acknowledgement process etc?
Thanks
Is this in the context of a single streaming call? In that case the answer is no. You have a single gRPC client receiving one response stream and it can use worker threads to hand off messages from the stream.
If you are thinking of multiple gRPC clients in an application talking to the same server (I don't see any advantage of doing that) each one will make a separate call and will receive a separate response stream.
I want to connect to a tcp server with proprietary protocol which provides data upon queries.
The tcp data is successfully decoded using Netty4 ServerClientHandler.
Camel consumer(client mode) is used for connecting to the server.
I can send first query through ServerClientHandler on activation.
Requirement
Query for 2000 messages.
After processing 1000 messages ask for another 1000(total 3000) messages.
Repeat
Problem is after processing 1000 messages there is no way to request another 1000 messages.
I have tried to use Netty4 sync option. But what I need is multiple responses for one request.
I was thinking of using netty4 producer. Is there any way to send one request and receive multiple responses using camel netty4 producer. If it is possible I can route responses through camel actors.
I am going through camel netty source code with no success. If you can provide any starting point it will be really helpful.
We have a requirement to to support 10k+ users, where every user initiate a request and waits for a response from the server (the response can take as long as 20-30 seconds to arrive). it is only one request from the client, and after a long processing by the server, a response will be transmitted and then the connection will disconnect.
in the background, the server will do some DB search and wait for other background processes to notify on completion before responding to the client.
after doing some research i figured out we will need to use something like the atmosphere framework to support websockets/sse event/long polling along with an asynchronous server like netty (=> nettosphere) or jetty.
As for my experience - mostly Java EE world and Tomcat server.
my questions are:
what will be easier to implement in regard to my experience and our requirement: atmosphere + netty or atmoshphere+jetty? which one can scale better, has an easier learning curve and easier to implement other java technologies?
how do u implement in atmosphere a response that is sent only to the originating client and not broadcast to the rest of the clients? (all the examples i found are broadcast).
how can i implement in netty (or jetty) when using the atmosphere framework our response? i.e., the client send a request, after it is received in the server some background processes are run, and when they finish i need to locate the connection and transmit the response. is that achievable?
Some thoughts:
At 10k+ users, with 20-30 second response latency, you likely hit file descriptor limits if using just 1 network interface. Consider a solution that uses multiple network interfaces.
Your description of your request/response can be handled entirely with standard Servlet 3.0, standard HTTP/1.1, Async request handling, and large timeouts.
If your clients are web browsers, and you don't start sending a response from the server until the 20-30 second window, you might hit browser idle timeouts.
Atmosphere and Cometd do the same things, supporting long duration connections, with connection technique fallbacks, and with logical channel APIs.
I believe the AKKA framework will handle this sort of need. I am looking at using it to handle scaling issues possibly with a RabbitMQ to help off load work to potentially other servers that may be added later to scale as needed.
What's a good way to connect the synchronous http request/response model with an asynchronous queue based model?
When the user's HTTP request comes it generates a work request that goes onto a queue (beanstalkd in this case). One of the workers picks up the request, does the work, and prepares a response.
The queue model is not request/response - there are only requests, not responses. So the question is, how best do we get the response back into the world of HTTP and back to the user?
Ideas:
Beanstalkd supports light weight topics or queues (they call them tubes). We could create a tube for each request, have the worker create a message on that tube, and have the http process sit and wait on the tube for the response. Don't particularly like this one since it has apache processes sitting around taking memory.
Have the http client poll for the response. The user's initial HTTP request kicks off the job on the queue and returns immediately. The client (the user's browser) polls periodically for a response. On the backend the worker puts its response into memcached, and we connect nginx to memcached so the polling is light weight.
Use Comet. Similar to the second option, but with fancier http communication to avoid polling.
I'm leaning towards 2 since it's easy and well know (I haven't used comet yet). I'm guessing there's probably also a much better obvious model I haven't thought of. What do you think?
Here's how to implement request-response efficiently on JMS which might be helpful (though Java/JMS centric). The general idea is to create a temporary queue per client/thread then use correlationIDs to correlate requests to replies etc.
Polling is the simple solution; comet is the more efficient solution. You've got it nailed :)
I personally love comet (although I'm biased, since I helped write WebSync), it nicely lets your clients subscribe to a channel and get the message when your server process is ready. Works like a champ.
I'm looking to implement a Beanstalkd and memcached system to run a number of processes following a request - in this case, looking up information when a user logs in (the number of messages a user has waiting for example). The info is stored in Memcached and then read back on the next page load.
Without knowing a little more about what tasks you are doing though, it's not so easy to say what needs to be done, or how. Option #2 is however the simplest, and that may be all you need - depending on what you are pushing back into the workers.