Apache Nifi vs Gobblin - bigdata

I am assessing a big-data project, we would need to pull lots of big data sets from various internet sources (ftp, api, etc), do light transformations and light data quality / sanity checking (eg: row and columnar inspections), and push it downstream. Immediate focus is batchy, but anticipate supporting streaming down the line. Ease of support at scale is an important requirement.
We are looking at Apache Nifi and Gobblin, which seem to overlap in intention. What sort of use cases fit which platform best? How would they conform to the use case above?
Thanks!

My experience is with NiFi, and I've just had a look at Gobblin, but mainly, NiFi is an application in itself, where Gobblin is a framework.
In NiFi, you'll have a GUI, with very granular authorizations, that allow, several users to intervene on different part of the flow, monitor it, etc ...
One other thing is that NiFi is 'always on' and 'always in production' you are potentially able to make your modifications directly on the target, and as such, there are a few safeguards in order to avoid losing data (by mistake, I mean).
So, where I think both solutions can do more or less the same thing, if you have a workflow where you want to deploy once from time to time, Gobblin might be a better fit, but if you want something where you give some users permissions to intervene on parts of the flow directly in production, NiFi will be the best.
In the end, to keep the question oriented on programming:
NiFi allows to you program graphically, to give very granular permissions to your 'developers', and well as to update the 'program' (the NiFi flow) while it is running
Gobblin seems (from what little I've looked up) to work by defining jobs with text files, which seems to be more of a 'classical' development workflow, but that may fit better for your usage.

Related

How to avoid the forge model derivative queue

I want to use the forge viewer as a preview tool in my web app for generated data.
The problem I have is that the model derivative API is sometimes slow sometimes fast.
I read that this happens because the files are placed in a queue and being processed subsequentially.
In my opinion, this can be solved by:
Having the extraction.update webhook also tell me where I am in the queue. So I can inform my users with better progress information. Or when the queue is too long I can not stop the process.
Being able to have a private queue. I have no problem paying more credits if necessary.
Being able to generate svf2 files on my own server.
But I don't know if any of these options are possible. Or if there is another workaround.
Yes, that could be useful. I logged that request in our system: DERI-7940
Might be considered later on, but no plans currently
I'm not aware of any plans for that
We're always working on making the translation service better, but unfortunately, I cannot tell when it will meet your requirements - including the implementation of the webhook feature you mentioned.
SVF2 is specifically for very large models - is that what you are working with? If not, then I'm quite certain that translating to SVF would be faster.

Best .Net Actor/Process framework for coordinating Local and Networked clusters

We have a process that involves loading a large block of data, applying some transformations to it, and then outputting what has changed. We currently run a web app where multiple instances of these large blocks of data are processed in the same CLR instance, and this leads to garbage collection thrashing and OOM errors.
We have proven that hosting some tracked state in a longer running process works perfectly to solve our main problem. The issue we now face is, as a stateful system, we need to host it and manage coordination with other parts of the system (also change tracking instances).
I'm evaluating Actors in Service Fabric and Akka at the moment, there are a number of other options, but before I proceed, I would like peoples thoughts on this approach with the following considerations:
We have a natural partition point in our system (Authority) which means we can divide our top level data set easily. Each partition will be represented by a top level instance that needs to organise a few sub-actors in its own local cluster, but we would expect a single host machine to be able to run multiple clusters.
Each Authority Cluster of actors would ideally be hosted together on a single machine to benefit from local communication and some use of shared local resources to get around limits on message size.
The actors themselves should be separate processes on the same box (Akka seems to run local Actors in the same CLR instance, which would crash everything on OOM - is this true?), this will enable me to spin up a process, run the transformation through it, emit the results and tear it down without impacting the other instances memory / GC. I appreciate hardware resource contention would still be a problem, but I expect this to be more memory than CPU intensive, so expect a RAM heavy box.
Because the data model is quite large, and the messages can contain either model fragments or changes to model fragments, it's difficult to work with immutability. We do not want to clone every message payload into internal state and apply it to the model, so ideally any actor solution used would enable us to work with the original message payload. This may cause problems with restoring an actor state as it wants to save and replay these on wakeup, but as we have state tracking internally, we can just store the resulting output of this on sleep.
We need a coordinator that can spin up instances of an Authority Cluster. There needs to be some elasticity in terms of the number of VM's/Machines and the number of Authority Clusters hosted on them, and something needs to handle creation and destruction of these.
We have a lot of .NET code, all our models, transformations and validation is defined in it, and will need to be heavily re-used. Whatever solution will need to support .Net
My questions then are:
While this feels like a good fit for Actors, I have reservations and wonder if there is something more appropriate? Everything I have tried has come back to a hosted processes of some kind.
If actors are the right way to go, which tech stack would put me closest to what I am trying to achieve with the above concerns taken into account?
IMO (coming at this from a JVM Akka perspective, thus why I changed the akka tag to akka.net; I don't have a great knowledge about the CLR side of things), there seems to be a mismatch between
We do not want to clone every message payload into internal state and apply it to the model, so ideally any actor solution used would enable us to work with the original message payload.
and
The actors themselves should be separate processes on the same box (Akka seems to run local Actors in the same CLR instance, which would crash everything on OOM - is this true?)
Assuming that you're talking about the same OS process, those are almost certainly mutually incompatible: exchanging messages strongly suggests serialization and is thus isomorphic to a copy operation. It's possible that something using shared memory between OS processes could work, but you may well have to make a choice about which is more important.
Likewise, the parent/child relationship in the "traditional" (Erlang/Akka) style actor model trivially gives you the local cluster of actors (which, since they're running in the same OS process allows the Akka optimization of not copying messages until you cross an OS process boundary), while "virtual actor" implementations as found in Service Fabric or Orleans (or, I'd argue Cloudstate or Lagom) basically assume distribution.
Semantically, the virtual actor models implicitly assume that actors are eternal (though their eternal essence may not always be incarnate). For your use-case, this doesn't necessarily seem to be the case.
I think a cluster of Akka.Net instances with sharded Authority actors spawning shorter-lived child actors best fits, assuming that you're getting OOM issues from trying to process multiple large blocks of data simultaneously. You would have to implement the instance scale-up/down logic yourself.
I have not worked with Akka.net so I can't speak to that at all, but I'd be happy to speak to what you're talking about in a Service Fabric context.
Service Fabric has no issue with the concept of running multiple clusters. In its terminology, the whole of your system would be called an Application and would have a version when deployed to the SF cluster. If you wanted to create multiple instances of it, all you'd need to do is select what you wanted to call the deployed app instance and it'll stand up provisioning for you.
SF has a notion of placement constraints, metric balancing and custom rules that you can utilize if you think you can better balance the various resources than its automatic balancing (or you need to for network DMZ purposes). While I've never personally grouped things down to a single machine, I frequently limit access of services to single VM scale sets (we host in Azure).
To the last point though, you'll still have message size limits, but you can also override them to some degree. In your project containing service interfaces, just set the following attribute above your namespace:
[assembly:FabricTransportRemotingSettings(MaxMessageSize=<(long)new size in bytes>)] and you're good to go.
Services can be configured to run using a Shared or Exclusive process model.
Regarding your state requirement, it's not necessarily clear to me what you're trying to do, but I think you're saying that that it's not critical that your actors store any state since they can work from some centrally-provided model.
You might look then at volatile state persistence then as it'll mean that state is saved for the actors in memory, but should you lose the replicas, nothing is written to disk so it's all lost. Or if you don't care and are ok just sending the model to the actors for any work, you can configure them to be stateless.
On the other hand, if you're still looking to retain state in the actors and simply are concerned about immutability, rest assured that actor state isn't immutable and can be updated trivially. There are simply order of operation concerns you need to keep in mind (e.g. if you retrieve the state, make a change, save it, 1) you must commit the transaction for it to take and 2) if you modify the state but don't save it, it'll obviously not persist - pull a fresh copy in a new transaction for any modifications). There's a whole pile of guidelines here.
Assuming your coordinator is intended to save some sort of state, might I recommend a singleton stateful service. Presumably it's not receiving an inordinate amount of use so a single instance is sufficient and it can easily save state (without the annoyance of identifying which state is on which partition). As for spinning up services, I covered this in the first bullet, but use the ApplicationManager on the built-in FabricClient to set up new applications and the ServiceManager to create instances of necessary services within each.
Service Fabric supports .NET Core 3.1 through .NET 5 as of the latest 8.0 release though note a minor serialization issues with an easy workaround with .NET 5.
If you have an Azure support subscription, I'd encourage you to write to the team under Development questions and share your concerns. Alternatively, on the third Thursday of each month at 10 AM PST, they also have a community call on Teams that you're welcome to join and you can find past calls here.
Again, I can't speak to whether this is a better fit than Akka.NET, but our stack is built atop Service Fabric. While it has some shortcomings (what framework doesn't?) it's an excellent platform for distributed software development.

How to build a predictive dialer?

I need to build a reliable predictive dialer based on Asterisk. Currently the system we use includes Wombat and Asterisk, and we do not find this solution usable as Wombat provides a poor API and it's impossible to use it without regular manual operations.
The system we want:
Can be used solely via API or direct database queries (adding lists to campaigns, updating lists, starting campaigns, stopping campaigns etc.) so that it can be completely integrated into an existing product
Is free, or paid for annually independent to the usage rate
Is considered stable
Should be able to handle tens of thousands of calls per day, if it matters
Use vicidial.org or hire freelancer to build new core with your needed api.
You can also check OSdial for this, it also developed using asterisk.
We have been working with a preview of the next version of Wombat, through the Early Access program, and Wombat has a complete configuration and reporting JSON API and you can deploy it "headless" in order to scale up to thousands of parallel lines. If you ask Loway they can likely get you access to the Early Access program.
BTW, Vicidial is great for agent-based outbound, but imposes quite a large penalty on the number of agents per server - you cannot reasonably use it to do telecasting at the scale we are looking for as it would require too many servers. Wombat is leaner and can drive over one thousands channel per server. YMMV.
This question would be better placed on a "hire-a-freelancer" site like oDesk ... if you need custom programing done, those are the sorts of places to go to get manpower.
Your specifications are well within what is possible with Asterisk. I'd strongly recommend looking at Vici Dial and OS Dial as others have suggested; out of the can, they are pretty good.
The hard part of any auto-dialer is not the dialer, oddly enough. It's the prediction algorithms, the answering machine detection algorithms and the agent UI. Those are what makes or breaks an auto-dialer application for a company.

Sharing Logic Between the Browser and the Server

I'm working on an app which will, like most apps, have a whole boat load of buisness logic, almost all of which will need to be executed both on the server and the Flash-based client… And I'm trying to figure out the best (read: least complex) way to implement the rules engine.
These are the parameters of the problem:
The rules engine must both run in a web browser (ie, in Flash Player) and on the server. Duplicating the logic (eg, by writing a "server" version and a "client" version) would be an unacceptable risk.
The input/output data is fairly complex, so serialization is a nontrivial problem. We are currently using AMF for all of our serialization needs, and using another protocol would add significant complexity… So it should probably be avoided.
It is infeasible to implement a "rules description language". Experimentation has shown that rules are sufficiently complex that any such language would need to be Turing complete… Which would also add a significant amount of complexity.
The rules engine will not need to make some, but not very many, service calls.
Currently, the best contenders are:
Writing the code in ActionScript, then running it on the server. In theory it's possible to start up an AVM instance, get it long-polling a gateway, then pass data back and forth that way… But that seems less than ideal. Is there a "good" way of doing this?
Writing the code in Haxe. I don't know anything about Haxe's AMF support, so that could be a deal-breaker.
Something involving Tamarin. Seems like a viable option, but I haven't done enough research to tell either way.
So, what do you think? Are any of these options clearly better than others? Is there something I haven't though of that's worth considering?
Finally, thanks for reading this wall of text :)
How much data are you talking about? You can use Air if you want to run it on the server and access a queue or something.

Determining Website Capacity

A client of mine has a website and they need to determine how 'scalable' the site currently is. What I mean by this is the number of users browsing around the site concurrently.
It's a custom e-commerce app in .net, not written by myself and the code is... well lets just say, a bit dubious.
A much bigger company is looking to buy them / throw funding their way but they need some form of metrics to show how much load it can take before it falls apart. This big company has the ability to 'turn on the taps' to a huge user base - and obviously doesn't want to do that if the site is going to fall over with a sneeze of traffic.
What is a good metric to provide here? And how can I obtain it?
Edit: Question revised
I always use Apache's "ab" tool: link text
Run it from a different machine, preferably a BSD or Linux machine with no firewall rules that will limit the performance of the tool. Because otherwise the result might not be as reliable. If you use a Windows machine, make sure you're using one that isn't limiting the number of active TCP connections.
When using "ab", the number you're looking for it "Requests per second". Experiment with the concurrency switch to see how many concurrent users you can handle before you're getting a lot of errors, or when the requests per seconds is dropping rapidly.
When you are noticing the webserver is having serious issues you should restart the webserver, and let it rest for a while before continuing the test.
You'd be better off with a hosted load test, as this might give you more insight on realworld scenario's (something like http://www.scl.com/software-quality/hosted-load-test, no experience with them though).
Furthermore: scalability is as far as I know, not how many concurrent users can be served, but the way how easy it is to serve more when the site grows bigger (by adding extra servers etc, how easy is it for the website to scale up, does the codebase allow to use unlimited number of servers, etc.)
Well, I suppose it'll depend on what the client cares about.
Do they care about how many users to can access the site at once? Report on that, but running simultaneous requests from another server until it dies, then get the number.
Do they care about something else?
For me, when someone says they want it to 'scale', it really means they have no idea what they want. So try and talk to them, and get specific details of what, exactly, they want to see 'scaling', and then, once you find the areas to analyse, you can do so trivially, and attempt to improve them.

Resources