Airflow should integrate with NiFi/StreamSets? - airflow

I know Airflow is called workflow manager, nifi dataflow manager, but what this means exactly? The best explanation so far was that nifi cares about data while airflow cares about tasks, but I don't quite get this definition, and I couldn't find any other good explanation/article/video that explains how to integrate this systems, if it is a good idea or is better to use each one in their own.
Also I was thinking if it is better StreamSets or NiFi, I think streamsets looks better in UI and monitor the data, but I heard that depends on the case, that nifi is better if I only ingest data, but again I can't find much information about this questions.

As you said, Airflow is a workflow manager. It means that it only tells other programs to run. It doesn't process data, but tells other to run.
NiFi and StreamSets on the other hand, process data, transform it, receive it and send it. That's why they are dataflow managers.

Related

Apache Airflow: Orchestration Engine vs. Execution Engine

I continue to read that Airflow is meant to be an orchestration engine and not an execution engine. But what does that mean?
I know that Kubernetes is also called an orchestration tool. So is the idea that Airflow is simply meant to define high-level dependency graphs but not any of the code to actually perform said tasks? I am still really struggling to understand the paradigm that Airflow is operating with.

Is there a way to process SignalR data using Spark Streaming?

I have a data source provided to me using SignalR, which I’ve never used.
I can’t find any documentation on how to ingest it with Spark Streaming, is there a defined process for that?
If not, are there intermediate steps I should take first? Process the data myself from a signal-r client, and create a Kafka producer that’s then reading from that via Structured Streaming?
Alternatively, could try to use airflow but also not sure about that since it’s streaming data.

Differences between matillion and apache airflow

I want to use an ETL service, but i am stuck between Apache Airflow and Matillion.
Are they the same?
What are the main differences?
Airflow's primary use-case is orchestration / scheduling, not ETL. You can perform ETL tasks inside Airflow DAGs, but unless you're planning on implementing Airflow using a containerized / K8 architecture, you'll quickly see performance bottlenecks and even hung / stuck processes. There are ways to mitigate this, certainly, but it's not the primary use case.
Matillion's primary use-case is ETL (really ELT), so it's not going to suffer the same performance issues, or require a complex infrastructure to achieve that performance. It also provides a GUI based code-optional interface, so that you don't have to be a Python expert to achieve results quickly.
I actually view Airflow and Matillion as complimentary (potentially). If you have inter-application dependencies, for example, you can orchestrate Matillion workflow with Airflow, or another third-party scheduler, and gain the benefits of both.
I've never used Matillion. So I can't answer with respect to any specific use case you have.
But with the quick analysis on Matillion I can very well tell that Matillion and Airflow aren't the same at all.
Matillion is a Extract/Transform/Load tool. You can compare it with tools like AWS Glue / Apache NiFi / DMExpress.
Airflow is an orchestration tool. You can compare it with tools like oozie.
More importantly Matillion doesn't come free of cost.

Most efficient way to transfer images continuously across the network

I have an architecture in which there are separate services which run independently. My one of the service is continuously fetching frames from camera and sending them to another service which performs some processing on the frame (like face detection, face recognition etc) and sends back the results. Services can be run on different machines.
Please suggest any good library or something which is fast at transferring frames between services. I have already a few options in my mind like Kafka and ZeroMq but I am also confused between them, which to choose.
Any good pipeline design is also welcome. Thanks
After experimenting with the Kafka pipeline in a very similar scenario my suggestion is that Kafka is not a suitable choice for it. Since the frequency and size of messages are the concern here while dealing with the live image streams.
The choices that can be tested are:
Zero MQ
Rabbit MQ
Zero MQ will highly likely perform well in the given scenario.

Service Bus (Maybe?) for web farm

Part of this question is I'm not even sure what exactly I'll need to ask, so I'll start with the situation and work out from there.
One project I'm working on involves use of COMET via the aspComet library. The use case of the program is somewhat of a collaborative slideshow. One person runs the bulk of it, with one or more participants able to perform certain actions. Low latency between when an action is performed on screen
Previously, it was just running on one server. Now, we're wanting to scale it out a bit, more for reliability than performance reasons. So, we have some boxes out in Rackspace's cloud and all that fun stuff.
I knew from the start of this that I was going to need to make some changes to the way the COMET stuff works since different people in the same "show" might be on different servers, and I have no way of knowing what "show" they belong to until after they have already arrived on the site.
I initially tackled this using the WCF Mesh provider, which wasn't well documented to start with, now I'm running into issues with dispatching messages to it sometimes get lost, or delayed (I'm not 100% sure what is going on there), but it screws up the long poll for COMET and breaks things in rather strange ways (Clicking a button may trigger an event, or it'll hang for 10 seconds {long poll duration} and not actually do anything).
More research leads me to believe one of the .Net service bus providers may do what I need. However, I can't find examples that would cover what I need:
No single point of failure (outside of a database)
No hardcoding of peers.
Near-realtime (no polling, event based would be best)
My ideal solution would involve that when a server comes up, it lets other servers know of its existence (Even if it's just a row in a table somewhere), and they can start sending broadcast messages among each other, with each server being both a publisher and subscriber. This is what I somewhat had in the WCF Mesh provider, but I'm not overly confident in that code.
Can anyone point me in the right direction with this? Even proper terms to look for in the docs for service bus providers would be good at this point. Or are service buses not what I want? At this point I would settle for setting up a Jabber server on each web server and use that, if it could fit within my constraints.
I can't speak a ton to NServiceBus, but I expect the answers will be similar.
Single point of failure: MSMQ can use multicasting, which means each endpoint will broadcast it's existence and no DB table is needed. RabbitMQ uses this Exchange-to-Queue binding process which means as long as the Rabbit instance or cluster is up then messages still exist. RabbitMQ can be clustered, MSMQ cannot be. *Note: You might have issues with multicasting with Rackspace, no idea how they work. If so, you'll have to fall back on the runtime services for MSMQ (not RabbitMQ), that would create a single point of failure because everyone has a single point to coordinate control messages through.
Hard coding of peers: discussed above a bit; MSMQ's multicast handles it. Rabbit it can also be done, just bind queues to an exchange you want to listen to. MassTransit takes care of this for you.
Near-realtime: These both use messaging which is near real time. There's no polling in your message consumer code.
I think a service bus seems like a reasonable solution for what you're trying. Some more details would likely be needed, but the general messaging approach is correct. There are other more light weight messaging libraries if you decided you just want something on top of RabbitMQ and configure Rabbit to handle most of the stuff.
To get started with MassTransit, we have documentation up: http://readthedocs.org/projects/masstransit/ and mailing list http://groups.google.com/group/masstransit-discuss. Join the mailing list if you have future questions and someone will try and help you out.

Resources