Is there a way to process SignalR data using Spark Streaming?

I have a data source provided to me over SignalR, which I've never used before.
I can't find any documentation on how to ingest it with Spark Streaming. Is there a defined process for that?
If not, are there intermediate steps I should take first? For example, consume the data myself with a SignalR client, publish it to Kafka with a producer, and then read from Kafka via Structured Streaming?
Alternatively, I could try Airflow, but I'm not sure it fits either, since this is streaming data.
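One way to realize that intermediate step is a small bridge process: a SignalR client that forwards each message to Kafka, which Spark then reads via Structured Streaming. A minimal sketch, assuming the signalrcore and kafka-python packages, a local Kafka broker, and a hypothetical hub URL, event name, and topic (adjust all of these to your actual source):

```python
import json
import time

from kafka import KafkaProducer
from signalrcore.hub_connection_builder import HubConnectionBuilder

# Forward every SignalR message into a Kafka topic; the endpoint names
# below are hypothetical placeholders.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def forward_to_kafka(args):
    # signalrcore delivers each hub message as a list of arguments.
    producer.send("signalr-events", args)

hub = (
    HubConnectionBuilder()
    .with_url("https://example.com/streamHub")  # hypothetical hub URL
    .build()
)
hub.on("ReceiveData", forward_to_kafka)  # hypothetical event name
hub.start()

while True:  # keep the bridge process alive
    time.sleep(1)
```

On the Spark side, the topic can then be consumed with Structured Streaming in the usual way (the job needs the spark-sql-kafka package on its classpath):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("signalr-ingest").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "signalr-events")
    .load()
)

# Kafka values arrive as bytes; cast to string before parsing the JSON.
query = (
    events.selectExpr("CAST(value AS STRING) AS json")
    .writeStream.format("console")
    .start()
)
query.awaitTermination()
```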

Related

HTTP response times GUI

I'm looking for an application, available on CentOS, that lets me periodically check connectivity response times between that server and a specific port on a remote server (in this case one serving a SOAP API).
Ideally something that can send periodic API calls or, failing that, just telnet to that remote port, and that shows the results in a graph.
Does anyone know of an application that can do this, without me having to write a script that logs results to a file, which is harder to read from a timeline perspective?
After digging and testing a bit more, I ended up using netdata:
https://www.netdata.cloud/
Awesome tool, extremely simple to use and install.

In the C++ streaming gRPC API, is there any way to check how many messages are buffered in the grpc queue?

grpc_impl::ServerReaderWriter/grpc_impl::internal::ReaderInterface implement NextMessageSize(), but from the naming it looks like it'd only return the size of the immediate next message, and from this thread and the documentation it seems that the return value is only an upper bound.
For streaming applications (e.g. audio/video, text, any real time duplex streams), it'd be helpful to know how much data arrived from the client, so that it could be e.g. processed in bulk, or to measure non-realtimeness, or to adapt to variable streaming rates, etc.
Thanks for any pointers and explanations.
The current API does not provide such capabilities. It is normally recommended to keep reading from the stream, especially if the application is expecting to receive messages. If the application stops reading, gRPC will also stop reading at some point, depending on how the resource quota is configured. Even if the configuration is such that gRPC never stops reading, we risk gRPC consuming too much memory.
It seems to me that what you want is to build a layer on top of gRPC that buffers messages, so that you can process them in bulk and perform measurements.
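A minimal sketch of such a buffering layer, in Python for brevity (the question concerns the C++ API, but the pattern carries over): a background thread keeps reading from the stream into a queue, whose size approximates the backlog. All names here are illustrative.

```python
import queue
import threading

class BufferedReader:
    """Drain a streaming iterator (e.g. a gRPC read stream) into a queue
    on a background thread, so the consumer can see how many messages
    have arrived and process them in bulk."""

    _DONE = object()  # sentinel marking end of stream

    def __init__(self, message_iterator):
        self._queue = queue.Queue()
        threading.Thread(
            target=self._drain, args=(message_iterator,), daemon=True
        ).start()

    def _drain(self, messages):
        for msg in messages:
            self._queue.put(msg)
        self._queue.put(self._DONE)

    def pending(self):
        # Approximate number of buffered messages (Queue.qsize is advisory).
        return self._queue.qsize()

    def drain_pending(self):
        # Pull everything currently buffered, for bulk processing.
        batch = []
        while True:
            try:
                msg = self._queue.get_nowait()
            except queue.Empty:
                return batch
            if msg is self._DONE:
                return batch
            batch.append(msg)
```

Because the background thread never stops reading, gRPC's flow control stays satisfied; the trade-off is that memory use is unbounded unless the queue is given a maximum size.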

Should Airflow integrate with NiFi/StreamSets?

I know Airflow is called a workflow manager and NiFi a dataflow manager, but what does this mean exactly? The best explanation so far was that NiFi cares about data while Airflow cares about tasks, but I don't quite get this definition, and I couldn't find any other good explanation/article/video that explains how to integrate these systems, whether that is a good idea, or whether it's better to use each one on its own.
Also, I was wondering whether StreamSets or NiFi is better. I think StreamSets looks better in terms of UI and data monitoring, but I've heard it depends on the use case, and that NiFi is better if I only ingest data; again, I can't find much information on these questions.
As you said, Airflow is a workflow manager. That means it only tells other programs to run; it doesn't process data itself.
NiFi and StreamSets, on the other hand, receive, transform, and send data. That's why they are dataflow managers.
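A toy DAG makes the distinction concrete: every Airflow task just triggers some external program, and only ordering metadata, not data, flows between tasks. The DAG id and commands below are made up.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Airflow only schedules and triggers; the actual work happens in whatever
# the tasks invoke. DAG id, schedule, and commands here are illustrative.
with DAG(
    dag_id="nightly_ingest",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(
        task_id="extract",
        bash_command="some-ingest-tool --pull",  # hypothetical command
    )
    load = BashOperator(
        task_id="load",
        bash_command="some-load-tool --push",    # hypothetical command
    )
    extract >> load  # ordering only; no data flows through Airflow itself
```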

Most efficient way to transfer images continuously across the network

I have an architecture made of separate services that run independently. One of my services continuously fetches frames from a camera and sends them to another service, which performs some processing on each frame (face detection, face recognition, etc.) and sends back the results. The services can run on different machines.
Please suggest a good library that is fast at transferring frames between services. I already have a few options in mind, like Kafka and ZeroMQ, but I'm not sure which to choose.
Any good pipeline design is also welcome. Thanks
After experimenting with a Kafka pipeline in a very similar scenario, my suggestion is that Kafka is not a suitable choice here, since the frequency and size of the messages are the main concern when dealing with live image streams.
The choices worth testing are:
ZeroMQ
RabbitMQ
ZeroMQ will very likely perform well in this scenario; a minimal sketch follows below.
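For reference, here is a ZeroMQ sketch of the two services, assuming the pyzmq package and a PUSH/PULL socket pair; the endpoint and the frame encoding are placeholders:

```python
import zmq

# Producer side (camera service): push encoded frames downstream.
def run_producer(frame_source, endpoint="tcp://*:5555"):
    ctx = zmq.Context()
    sock = ctx.socket(zmq.PUSH)
    sock.bind(endpoint)
    for frame_bytes in frame_source:  # e.g. JPEG-encoded frames as bytes
        sock.send(frame_bytes)

# Consumer side (processing service): pull frames and process them.
def run_consumer(process, endpoint="tcp://producer-host:5555"):
    ctx = zmq.Context()
    sock = ctx.socket(zmq.PULL)
    sock.connect(endpoint)
    while True:
        frame_bytes = sock.recv()
        process(frame_bytes)
```

PUSH/PULL gives fair queuing out of the box: if several processing workers connect PULL sockets to the same producer, frames are load-balanced across them automatically.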

Best practice for selecting which Redis server to read in real time

I have an nginx server with a Redis master and two slaves of that master. The slaves are read-only and the master handles reads and writes. The nginx server runs FastCGI with spawned Python apps using pyredis.
When it comes time for a read from my nginx app, what is the best practice for determining which of the three servers gets the read? Is it determined in real time? Do I just do a simple round-robin selection in real time?
Again, I just have one master for now. Soon I will have two and will use consistent hashing in Python via http://pypi.python.org/pypi/hash_ring to select which server gets the keys.
For the interim, is it wise to select which server gets the read using the hash ring, even though they should be exact copies?
Thanks,
What you should do is abstract the code that does this, so your app logic doesn't have to change later when you split the data.
As for reading, I'd use just the slaves for that. You can use the hashing if you want, provided it doesn't affect your code and is abstracted away.
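A minimal sketch of that abstraction, assuming redis-py and round-robin reads across the slaves (host names and the class are illustrative; the selection policy can later be swapped for the hash ring without touching callers):

```python
import itertools

import redis  # redis-py; host names and ports below are illustrative

class RedisRouter:
    """Send writes to the master and rotate reads across the slaves,
    behind one interface so the selection policy can change later
    (e.g. to consistent hashing) without touching app logic."""

    def __init__(self, master, slaves):
        self._master = redis.Redis(*master)
        self._readers = itertools.cycle(
            [redis.Redis(*s) for s in slaves] or [self._master]
        )

    def set(self, key, value):
        return self._master.set(key, value)

    def get(self, key):
        return next(self._readers).get(key)

router = RedisRouter(
    master=("master-host", 6379),
    slaves=[("slave1-host", 6379), ("slave2-host", 6379)],
)
```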
