Storm transformation which does an asynchronous REST call (async, Future, Netty)

Let's assume that Storm receives a stream of thousands of tweets per second and that somewhere in the process it needs to classify them as spam or not. I have a cluster of, say, 20 machines that provide the "classification" microservice through a REST API; together they give a maximum throughput of 10k tweets per second, and their latency is 3 seconds. This means that in the worst case I might have 30k tweets in flight, and that's OK. I guess that an implementation consuming this service from Storm would look something like this:
public static class RestBolt implements IRichBolt {
    ...
    @Override
    public void execute(Tuple tuple) {
        // hypothetical blocking HTTP helper
        String classes = new Post("http://my.classifier.com", tuple.getString(0)).send();
        _collector.emit(tuple, new Values(classes));
        _collector.ack(tuple);
    }
    ...
}
topologyBuilder.setBolt("rest-bolt", new RestBolt(), 30000);
Now, given this API, my guess is that Storm would have no choice but to start 30k threads, and that would be potentially bad. I see in the source code that Storm uses Netty, so I guess it could support this operation more efficiently with asynchronous calls. If a fictitious, beautiful Netty/Storm Java API existed, it would look something like this:
public static class RestBolt implements IRichBolt {
    ...
    @Override
    public void execute(Tuple tuple) {
        Future<String> classes = new AsyncPost("http://my.classifier.com", tuple.getString(0)).send();
        _collector.emit(tuple, new FutureValues(classes));
        _collector.ack(tuple);
    }
    ...
}
Is there a way to use asynchronous calls to have massive scalability with very few threads in Storm?
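This pattern is achievable today without a fictitious API: issue the HTTP call with a non-blocking client and complete the tuple from its callback, so a handful of executor threads can keep tens of thousands of requests in flight. Below is a minimal sketch assuming Storm 2.x and the JDK 11+ HttpClient; the classifier URL comes from the question, and everything else (class and field names) is illustrative:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class AsyncRestBolt extends BaseRichBolt {
    private transient OutputCollector collector;
    private transient HttpClient httpClient;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        this.httpClient = HttpClient.newHttpClient(); // non-blocking; uses a small internal thread pool
    }

    @Override
    public void execute(Tuple tuple) {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://my.classifier.com"))
                .POST(HttpRequest.BodyPublishers.ofString(tuple.getString(0)))
                .build();
        // sendAsync returns immediately, freeing the executor thread for the next tuple.
        httpClient.sendAsync(request, HttpResponse.BodyHandlers.ofString())
                .whenComplete((response, error) -> {
                    // The collector is not documented as thread-safe, so serialize access to it.
                    synchronized (collector) {
                        if (error == null) {
                            collector.emit(tuple, new Values(response.body()));
                            collector.ack(tuple);
                        } else {
                            collector.fail(tuple);
                        }
                    }
                });
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("classes"));
    }
}

With this shape the parallelism hint can stay small; the number of in-flight tuples is then bounded by topology.max.spout.pending rather than by a thread count.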

Related

Starting multiple Kafka listeners in Spring Kafka?

One of our dev teams is doing something I've never seen before.
First they're defining an abstract class for their consumers.
public abstract class KafkaConsumerListener {
    protected void processMessage(String xmlString) {
    }
}
Then they use 10 classes like the one below to create 10 individual consumers.
@Component
public class <YouNameIt>Consumer extends KafkaConsumerListener {

    private static final String <YouNameIt> = "<YouNameIt>";

    @KafkaListener(topics = "${my-configuration.topicname}",
                   groupId = "${my-configuration.topicname.group-id}",
                   containerFactory = <YouNameIt>)
    public void listenToStuff(@Payload String message) {
        processMessage(message);
    }
}
So with this they're trying to start 10 Kafka listeners (one class/object per listener). Each listener should have its own consumer group (with its own name) and consume from one (but different) topic.
They seem to use different ConcurrentKafkaListenerContainerFactories, each with a @Bean annotation, so they can assign a different groupId to each container factory.
Is something like that supported by Spring Kafka?
It seems that it worked until a few days ago; now one consumer group gets stuck all the time. It starts, reads a few records, and then hangs, and the consumer lag keeps growing.
Any ideas?
Yes, it is supported, but it's not necessary to create multiple factories just to change the group id; the groupId property on the annotation overrides the factory property.
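For example, a minimal sketch of the single-factory version, with hypothetical topic and property names; groupId on the annotation overrides whatever group id the shared factory carries:

@Component
public class OrdersConsumer extends KafkaConsumerListener {

    // Uses the default container factory; only the group id differs per listener.
    @KafkaListener(topics = "${my-configuration.orders.topic}",
                   groupId = "${my-configuration.orders.group-id}")
    public void listenToOrders(@Payload String message) {
        processMessage(message);
    }
}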
Problems like the one you describe are most likely the consumer thread being "stuck" in user code someplace; take a thread dump to see what the thread is doing.

gRPC bidirectional function returns RequestType instead of ResponseType

I'm learning gRPC using the official docs, but I find the method signatures of client streaming and bidirectional streaming very confusing (the two are the same).
From the docs, the function takes a StreamObserver<ResponseType> as its input parameter and returns a StreamObserver<RequestType> instance, as follows:
public StreamObserver<RequestType> bidirectionalStreamingExample(
        StreamObserver<ResponseType> responseObserver)
But in my mind, it should take the RequestType as input and return the ResponseType:
public StreamObserver<ResponseType> bidirectionalStreamingExample(
        StreamObserver<RequestType> requestObserver)
This confuses me very much, and I'm actually a little surprised that the answer didn't pop up when I searched Google; I thought many people would have the same question. Am I missing something obvious here? Why does gRPC define the signature like this?
Your confusion probably stems from being used to REST or other non-streaming frameworks, where request and response are often mapped to a function's parameter and return value. The paradigm shift here is that you're no longer supplying a request and getting back a response, but rather channels to drop requests and responses into. If you've studied C or C++, it's very much like going from
int get_square_root(int input);
to
void get_square_root(int input, int& output);
See how output's now a parameter? But in case that makes no sense at all (my fault :-) here's a more organic path:
Server Streaming
Let's start with the server streaming stub, even if your eventual goal is client streaming.
public void serverStreamingExample(
        RequestType request,
        StreamObserver<ResponseType> responseObserver)
Q: Why is the "response" in the parameter list? A: It's not the response that's in the parameter list, but rather a channel to feed the eventual response to. So for example:
public void serverStreamingExample(
        RequestType request,
        StreamObserver<ResponseType> responseObserver) {
    ResponseType response = processRequest(request);
    responseObserver.onNext(response); // this is the "return"
    responseObserver.onCompleted();
}
Why? Because the point of streaming is to keep alive a channel through which responses can keep flowing. If you could only return one response and then the function is done, that's not a stream. By supplying a channel, you as the developer can pass it along as needed, feeding it as many responses as you'd like via onNext() until you're satisfied and call onCompleted().
Client Streaming
Now, let's move on to the client streaming stub:
public StreamObserver<RequestType> clientStreamingExample(
        StreamObserver<ResponseType> responseObserver)
Q: Wait, what?! We know why the response is in the parameter list now, but how does it make sense to return a request? A: Again, we're not actually returning a request, but a channel for the client to drop requests into! Why? Because the point of client streaming is to let the client supply its request in pieces, which it can't do with a single, traditional call to the server. So here's one way this can be implemented:
class ClientStreamingExample {
    int piecesRcvd = 0;

    public StreamObserver<RequestType> myClientStreamingEndpoint(
            StreamObserver<ResponseType> responseObserver) {
        return new StreamObserver<RequestType>() {
            @Override
            public void onNext(RequestType requestPiece) {
                // do whatever you want with the request pieces
                piecesRcvd++;
            }

            @Override
            public void onCompleted() {
                // when the client says they're done sending request pieces,
                // send them a response back (but you don't have to! or it can
                // be conditional!)
                ResponseType response =
                        new ResponseType("received " + piecesRcvd + " pieces");
                responseObserver.onNext(response);
                responseObserver.onCompleted();
                piecesRcvd = 0;
            }

            @Override
            public void onError(Throwable t) { // StreamObserver's onError takes a Throwable
                piecesRcvd = 0;
            }
        };
    }
}
You might have to spend a little time studying this to fully understand it, but basically, since the client may now send a stream of requests, you have to define handlers for each request piece, as well as handlers for the client saying it's done or erroring out. (In my example, the server only responds when the client says it's done, but you're free to do anything you want: the server can respond before the client finishes, or not respond at all.)
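To complete the picture, here is a hedged sketch of the client side of the same call, following grpc-java's usual async-stub pattern; asyncStub and the message constructors are hypothetical stand-ins for whatever your .proto generates:

// The response handler: the client's channel for receiving the server's response(s).
StreamObserver<ResponseType> responseHandler = new StreamObserver<ResponseType>() {
    @Override
    public void onNext(ResponseType response) {
        System.out.println("server said: " + response);
    }

    @Override
    public void onError(Throwable t) {
        t.printStackTrace();
    }

    @Override
    public void onCompleted() {
        System.out.println("server closed the stream");
    }
};

// The returned observer is exactly the "channel to drop requests into" described above.
StreamObserver<RequestType> requestChannel =
        asyncStub.myClientStreamingEndpoint(responseHandler);
requestChannel.onNext(new RequestType("piece 1"));
requestChannel.onNext(new RequestType("piece 2"));
requestChannel.onCompleted(); // triggers the server's onCompleted() handler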
Bidirectional Streaming
This isn't really a thing! :-) What I mean is, tutorials just mean to point out that nothing's stopping you from implementing exactly the above on both sides. So you end up with two applications that send and receive requests in pieces, and send and receive responses in pieces. They call this setup bidirectional streaming, and they're correct to, but it's a little misleading since it's not doing anything technically different from client streaming; that's exactly why the signatures are the same. IMHO, tutorials should just add a note like this one rather than repeat the stub.
Optional: Just for "fun"...
We began with the C++ analogy of going from
int get_square_root(int input); // "traditional" request-response
to
void get_square_root(int input, int& output); // server streaming
Do we want to carry on this analogy? Of course we do.
🎵 Hello, C++ function pointers, my old friend... 🎶
void (*get_square_root_fn(int& output))(int); // client streaming (a function returning a function pointer)
And a demonstration of its use(lessness):
int main() { // aka the client
    int result;
    void (*fnPtr)(int) = server.get_square_root_fn(result);
    fnPtr(2);
    std::cout << result << std::endl; // 1 (sqrt(2) truncated to int), assuming the fn actually does sqrt
}

autoStartup for #StreamListener

Unlike @KafkaListener, it looks like @StreamListener does not support the autoStartup parameter. Is there a way to achieve the same behavior with @StreamListener? Here's my use case:
I have a generic Spring application that can listen to any Kafka topic and write to its corresponding table in my database. For some topics the volume is low, so processing a single message with very low latency is fine. For other, high-volume topics, the code should receive a microbatch of messages and write to the database using JDBC batching on a less frequent basis. Ideally the listener definitions would look something like this:
// low volume listener
@StreamListener(target = Sink.INPUT, autoStartup = "${application.singleMessageListenerEnabled}")
public void handleSingleMessage(@Payload GenericRecord message) ...

// high volume listener
@StreamListener(target = Sink.INPUT, autoStartup = "${application.multipleMessageListenerEnabled}")
public void handleMultipleMessages(@Payload List<GenericRecord> messageList) ...
For a low-volume topic, I would set application.singleMessageListenerEnabled to true and application.multipleMessageListenerEnabled to false, and vice versa for a high-volume topic. Thus only one of the listeners would actively be listening for messages at a time.
Is there a way to achieve this with @StreamListener?
First, please consider upgrading to the functional programming model, which should take you only minutes to refactor to; we've all but deprecated the annotation-based programming model. If you do, then what you're trying to accomplish is very easy:
@SpringBootApplication
public class SimpleStreamApplication {

    public static void main(String[] args) throws Exception {
        SpringApplication.run(SimpleStreamApplication.class);
    }

    @Bean
    public Consumer<GenericRecord> singleRecordConsumer() {...}

    @Bean
    public Consumer<List<GenericRecord>> multipleRecordConsumer() {...}
}
Then you can simply pass --spring.cloud.function.definition=singleRecordConsumer for the single-record case, or --spring.cloud.function.definition=multipleRecordConsumer, when starting the application, thus choosing which specific listener is activated.
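As a usage sketch (my addition, not from the original answer), the beans might be filled in like this; the logging bodies are placeholders, and the binding name follows Spring Cloud Stream's default <beanName>-in-0 convention:

@Bean
public Consumer<GenericRecord> singleRecordConsumer() {
    return record -> System.out.println("single record: " + record);
}

// Assumption, per the Kafka binder docs: a List<...> consumer additionally requires
// batch mode on its binding, e.g.
//   spring.cloud.stream.bindings.multipleRecordConsumer-in-0.consumer.batch-mode=true
@Bean
public Consumer<List<GenericRecord>> multipleRecordConsumer() {
    return records -> System.out.println("batch of " + records.size() + " records");
}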

Spring Data JPA - Java 8 Stream Support & Transactional Best Practices

I have a pretty standard MVC setup with Spring Data JPA Repositories for my DAO layer, a Service layer that handles Transactional concerns and implements business logic, and a view layer that has some lovely REST-based JSON endpoints.
My question is around wholesale adoption of Java 8 Streams into this lovely architecture: If all of my DAOs return Streams, my Services return those same Streams (but do the Transactional work), and my Views act on and process those Streams, then by the time my Views begin working on the Model objects inside my Streams, the transaction created by the Service layer will have been closed. If the underlying data store hasn't yet materialized all of my model objects (it is a Stream after all, as lazy as possible) then my Views will get errors trying to access new results outside of a transaction. Previously this wasn't a problem because I would fully materialize results into a List - but now we're in the brave new world of Streams.
So, what is the best way to handle this? Fully materialize the results inside of the Service layer as a List and hand them back? Have the View layer hand the Service layer a completion block so further processing can be done inside of a transaction?
Thanks for the help!
In thinking through this, I decided to try the completion-block solution I mentioned in my question. All of my service methods now take, as their final parameter, a results transformer that takes the Stream of model objects and transforms it into whatever result type is needed by the View layer. I'm pleased to report it works like a charm and has some nice side effects.
Here's my Service base class:
public class ReadOnlyServiceImpl<MODEL extends AbstractSyncableEntity, DAO extends AbstractSyncableDAO<MODEL>>
        implements ReadOnlyService<MODEL> {

    @Autowired
    protected DAO entityDAO;

    protected <S> S resultsTransformer(Supplier<Stream<MODEL>> resultsSupplier, Function<Stream<MODEL>, S> resultsTransform) {
        try (Stream<MODEL> results = resultsSupplier.get()) {
            return resultsTransform.apply(results);
        }
    }

    @Override
    @Transactional(readOnly = true)
    public <S> S getAll(Function<Stream<MODEL>, S> resultsTransform) {
        return resultsTransformer(entityDAO::findAll, resultsTransform);
    }
}
The resultsTransformer method here is a gentle reminder for subclasses not to forget the try-with-resources pattern.
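The controller below also calls getAllUpdatedSince, which the base class above doesn't show. Here is a minimal sketch of it under the same pattern; the DAO query method findByUpdatedAtAfter is my assumption, in Spring Data's derived-query style:

@Override
@Transactional(readOnly = true)
public <S> S getAllUpdatedSince(Date since, Function<Stream<MODEL>, S> resultsTransform) {
    // Same try-with-resources discipline as getAll, via resultsTransformer.
    return resultsTransformer(() -> entityDAO.findByUpdatedAtAfter(since), resultsTransform);
}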
And here is an example Controller calling in to the service base class:
public abstract class AbstractReadOnlyController<MODEL extends AbstractSyncableEntity,
                                                 DTO extends AbstractSyncableDTOV2,
                                                 SERVICE extends ReadOnlyService<MODEL>> {

    @Autowired
    protected SERVICE entityService;

    protected Function<MODEL, DTO> modelToDTO;

    protected AbstractReadOnlyController(Function<MODEL, DTO> modelToDTO) {
        this.modelToDTO = modelToDTO;
    }

    protected List<DTO> modelStreamToDTOList(Stream<MODEL> s) {
        return s.map(modelToDTO).collect(Collectors.toList());
    }

    // Read All
    protected List<DTO> getAll(Optional<String> lastUpdate) {
        if (!lastUpdate.isPresent()) {
            return entityService.getAll(this::modelStreamToDTOList);
        } else {
            Date since = new TimeUtility(lastUpdate.get()).getTime();
            return entityService.getAllUpdatedSince(since, this::modelStreamToDTOList);
        }
    }
}
I think it's a pretty neat use of generics to have the Controllers dictate the return type of the Services via Java 8 lambdas. While it's strange for me to see the Controller directly returning the result of a Service call, I do appreciate how tight and expressive this code is.
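As an illustration of how the generics line up, here's a hypothetical concrete pairing (Widget, WidgetDTO, and WidgetService are invented names, and WidgetDTO is assumed to have a constructor taking a Widget):

public class WidgetController extends AbstractReadOnlyController<Widget, WidgetDTO, WidgetService> {
    public WidgetController() {
        super(WidgetDTO::new); // the MODEL -> DTO mapping handed to the base class
    }
}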
I'd say this is a net positive for attempting a wholesale switch to Java 8 Streams. Hopefully this helps someone with a similar question down the road.

Business object to multiple tables

I want to try writing custom code; this is for my uni project. Suppose I have the tables UserCar, CarMake, and CarModel:
UserCar - userId, carId, carMakeId, CarModelId
CarMake - CarMakeId, MakeName
CarModel - CarModelId, ModelName
I want to display the user's car on the page, using a 3-layer architecture. How do I map these tables to a business object (or objects)? Could you help me please?
Well, you mention 3-layer architecture, so I guess you're looking at a Data/Application/Presentation approach. Of course, for that to make sense, you may need more than the brief details you gave in your question.
For instance, when we talk about the Application tier, it really makes sense to have one if you have application logic. With your brief info there isn't really any application logic other than displaying your data on screen. See this Wikipedia entry for more info on the topic of multitier (or n-tier) architecture (and 3-tier as a subset) in general.
That being said, if you have your 3 tables in a data store of some sort (such as a database), we can quickly make a 3-tier app like this.
1~ Data Tier:
Create classes that match the storage tables, such as (using C# syntax):
public class DT_UserCar
{
    public string userId;
    public string carId;
    public string carMakeId;
    public string CarModelId;
}
I'm using the DT_ prefix to indicate this class belongs to the Data Tier.
In addition, you need some code to let instances of these classes be read from storage and probably saved back to storage. You have options here. You could create a separate class that knows how to do all that, like
public class Storage
{
    public DT_UserCar ReadUserCar(string carId) { /* implementation */ }
    public DT_CarMake ReadCarMake(string carmakeId) { /* implementation */ }
    /* and so on... */
}
Or you could decide that each class should know how to serialize/deserialize itself to/from storage, and go with:
public class DT_UserCar
{
    public string userId;
    public string carId;
    public string carMakeId;
    public string CarModelId;

    public static DT_UserCar Read(string carId) { /* implementation */ }
    public void Write() { /* implementation */ }
}
A third, and much better, alternative (for bigger projects) is to use a third-party tool that takes care of all of this for you. After all, given the storage structure (e.g. the database schema), all of this code can be automated. I won't go into details here since you can find a lot of information about this sort of tool (ORM tools) and their characteristics, but mostly because it doesn't seem to be part of your exercise.
2~ Application Tier:
As I said, your use case doesn't seem to include a lot of business logic. However, you do mention that the data from those 3 storage tables should be merged somehow, so I'll take that as your one piece of business logic. Hence, we create a business class (or business entity, or domain entity, or domain model, whichever term you prefer; they all have different connotations but a lot in common) like this:
public class AT_UserCar
{
    public DT_UserCar _userCar;
    public DT_CarMake _carMake;
    public DT_CarModel _carModel;

    public AT_UserCar(DT_UserCar userCar, DT_CarMake carMake, DT_CarModel carModel)
    {
        _userCar = userCar;
        _carMake = carMake;
        _carModel = carModel;
    }
}
I'm using the AT_ prefix to indicate this class belongs to the Application Tier. Note that I would rather have those 3 as private members, but for the sake of brevity I'm relaxing some guidelines in this sample code.
Now, as we read an instance of this class from storage, we'll have to merge the proper DT_ objects into it. Again, you can have this code in the same AT_UserCar class, or split it out into a separate class like this one:
public class AT_UserCarReader
{
    public AT_UserCar Read(string userCarId, string carMakeId, string carModelId)
    {
        DT_UserCar userCar = DT_UserCar.Read(userCarId);
        DT_CarMake carMake = DT_CarMake.Read(carMakeId);
        DT_CarModel carModel = DT_CarModel.Read(carModelId);
        return new AT_UserCar(userCar, carMake, carModel);
    }
}
An equivalent AT_UserCarWriter class would do the inverse operation: receive a single AT_UserCar object and write the 3 separate objects extracted from it (a DT_UserCar, a DT_CarMake, and a DT_CarModel) to the data storage.
Note that most of this code can also be automated and there is a plethora of tools that will take care of it for you.
3~ Presentation Tier:
Finally, we get to display something on screen. The important thing here is to remember that your Presentation Tier should never deal directly with the Data Tier, but only with the Application Tier.
So, for instance, if I have to retrieve a UserCar by id and display it on a web page, I could write something like this:
AT_UserCar car = new AT_UserCarReader().Read(userCarId, carMakeId, carModelId);
tbox_userId.Text = car._userCar.userId;
That's, of course, a very small example, but I hope the quick run-through can help you out.
The core of 3-tier architecture (and n-tier, in general) is to separate different concerns into different layers. If you look at the example above, we targeted 3 concerns:
talking to the data storage: we did this exclusively in the Data Tier;
dealing with application logic such as 'merging' data from different tables into a single logical unit: we did this exclusively in the Application Tier;
dealing with presenting to screen the data and -in more general terms- interacting with the user: we did this exclusively in the Presentation Tier.
HTH.
Map the tables to Data Access Objects and use those in your Business Layer. Each of your DAOs will have properties corresponding to each column in the respective table; use any ORM of your liking (such as NHibernate) and you are good to go.
