Difference between Apache Storm and Flink

I'm working with these two real-time data stream processing frameworks. I've searched everywhere but I can't find a clear difference between the two. In particular, I would like to know how they behave based on the size of the data, the topology, etc.

The difference is mainly on the level of abstraction you have on processing streams of data.
Apache Storm is a bit lower level, dealing with data sources (spouts) and processors (bolts) connected together to perform transformations and aggregations on individual messages in a reactive way.
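For a sense of what that low level looks like, here is a minimal sketch of wiring a word-count topology with the storm-core TopologyBuilder API (SentenceSpout, SplitBolt and WordCountBolt are hypothetical user classes):
// Wire one spout to two bolts; the groupings control how tuples are routed.
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("sentences", new SentenceSpout(), 2);
builder.setBolt("split", new SplitBolt(), 4)
       .shuffleGrouping("sentences");
builder.setBolt("count", new WordCountBolt(), 4)
       .fieldsGrouping("split", new Fields("word"));
StormTopology topology = builder.createTopology();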
There is a Trident API that abstracts away a little from this low-level, message-driven view into more aggregated, query-like constructs, which makes things easier to integrate. (There is also an SQL-like interface for querying data streams, but it is still marked as experimental.)
From the documentation:
TridentState wordCounts =
    topology.newStream("spout1", spout)
        .each(new Fields("sentence"), new Split(), new Fields("word"))
        .groupBy(new Fields("word"))
        .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))
        .parallelismHint(6);
Apache Flink has a more functional-like interface for processing events. If you are used to the Java 8 style of stream processing (or to other functional-style languages like Scala or Kotlin), this will look very familiar. It also has a nice web-based monitoring tool.
The nice thing about it is that it has built-in constructs for aggregating by time windows etc. (which you can probably do in Storm too, with Trident).
From the documentation:
DataStream<WordWithCount> windowCounts = text
    .flatMap(new FlatMapFunction<String, WordWithCount>() {
        @Override
        public void flatMap(String value, Collector<WordWithCount> out) {
            for (String word : value.split("\\s")) {
                out.collect(new WordWithCount(word, 1L));
            }
        }
    })
    .keyBy("word")
    .timeWindow(Time.seconds(5), Time.seconds(1))
    .reduce(new ReduceFunction<WordWithCount>() {
        @Override
        public WordWithCount reduce(WordWithCount a, WordWithCount b) {
            return new WordWithCount(a.word, a.count + b.count);
        }
    });
When I was evaluating the two, I went with Flink, simply because at the time it felt better documented and I got started with it much more easily. Storm was slightly more obscure. There is a course on Udacity which helped me understand it much better, but in the end Flink still felt like a better fit for my needs.
You might also want to look at this answer here, although it is a bit old, so both projects have likely evolved since then.

Related

Can ZeroMQ provide grounds for a bidirectional non-blocking asynchronous transmission?

I have a system which consists of two applications. Currently, the two applications communicate using multiple ZeroMQ PUB/SUB patterns, one for each specific type of transmission. The sockets are programmed in C.
For example, AppX uses a SUB formal-socket archetype for receiving an information struct from AppY and another PUB formal-socket archetype for transmitting raw bit blocks to AppY; AppY does the same, using PUB/SUB patterns for transmission and reception.
To be clear, AppX and AppY perform the following communications:
AppX -> AppY:
- raw bit blocks of 1 kbit (continuous)
- an integer command (not continuous, depends on the user)
AppY -> AppX:
- an information struct of 10 kbit (continuous)
The design target:
a) My goal is to use only one socket at each side for bidirectional communication in nonblocking mode.
b) I want the two applications to process queued received packets without excess delay.
c) I don't want AppX to crash if AppY crashes.
Q1: Would it be possible with ZeroMQ?
Q2: Can I use ROUTER/DEALER or any other pattern for this job?
I have read the guide but I could not figure out some aspects.
Actually, I'm not very experienced with ZeroMQ. I would be pleased to hear additional tips on this problem.
A1: Yes, this is possible with ZeroMQ, nanomsg, and similar tools.
Both ZeroMQ and its younger sister nanomsg share the vision of:
Scalable (which you did not emphasise yet)
Formal (hard-wired formal behaviour)
Communication (yes, it's about this)
Patterns (that are wisely carved and ready to re-use and combine as needed)
This said, if you prefer to have just one socket-pattern on each "side", then you have to choose a Formal Pattern that leaves you free from any hard-wired behaviour, so as to meet your goals.
So, a) "...only one" is doable -- with a solo zmq.PAIR (which some parts of the documentation still flag as an experimental device), an NN.BUS, or a PUSH/PULL pair if you step back from allowing just a single socket. Note that insisting on a single socket in fact eliminates the cool power of sharing the zmq.Context() instantiated IO-thread(s) and re-using the low-level IO-engine. If you spend a few minutes with the examples referred to below, you will soon realise that the very opposite policy is quite common and beneficial to the design targets: one uses more, even many, patterns in a system architecture.
The a) "...non-blocking" is doable by passing the proper zmq.NOBLOCK directive (ZMQ_DONTWAIT in the current C API) to the respective .send() / .recv() functions and by using fast, non-blocking .poll() loops in your application design architecture.
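As a minimal sketch of such a loop in the C API (the endpoint and buffer size are illustrative; in libzmq the flag is ZMQ_DONTWAIT, called ZMQ_NOBLOCK in older releases):
#include <zmq.h>

int main(void)
{
    void *ctx  = zmq_ctx_new();
    void *sock = zmq_socket(ctx, ZMQ_PAIR);
    zmq_connect(sock, "tcp://127.0.0.1:5555");

    zmq_pollitem_t items[] = { { sock, 0, ZMQ_POLLIN, 0 } };
    for (;;) {                                   /* cleanup omitted in this sketch */
        zmq_poll(items, 1, 10);                  /* wait at most 10 ms */
        if (items[0].revents & ZMQ_POLLIN) {
            char buf[1024];
            int n = zmq_recv(sock, buf, sizeof buf, ZMQ_DONTWAIT);
            if (n >= 0) { /* process n bytes here */ }
        }
        /* do other non-blocking work here */
    }
}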
On b) "...without ... delay": this relates to the remark on application design architecture above, as you may lose this just by relying on a poor selection and/or impossible tuning of an event-handler's internal timings and latency penalties. If you shape your design carefully, you may remain in full control of the delay/latency your system will experience, and not become a victim of some framework's black-box event-loop, where you can do nothing but wait for its surprises under heavy system or traffic loads.
On c) "... X crash after a Y crashed": this is doable on { ZeroMQ | nanomsg }-grounds, by a careful combination of the non-blocking mode of all functions plus a design that is able to handle exceptions in the situations where it does not receive any POS_ACK from the intended { local | remote }-functionality. In this very respect, it is fair to state that some of the Formal Communication Patterns do not have this flexibility, due to a sort of mandatory internal behaviour that is "hard-wired" internally, so due care is to be taken in selecting a proper FCP-archetype for each such still scalable but fault-resilient role.
Q2: No.
The best next step:
You might also feel interested in other ZeroMQ posts here, and do not miss the link to the book referred to there.
Q1: yes
Q2: no, ZMQ_DEALER should be used by both AppX and AppY.
See http://zguide.zeromq.org/c:asyncsrv. Notice that ZMQ_ROUTER in this example just aims to distribute requests from multiple clients to different threads, while ZMQ_DEALER does the real work.
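A minimal sketch of the AppX side of that DEALER/DEALER wiring (the endpoint is illustrative; AppY would zmq_bind its own DEALER and use the same non-blocking calls):
#include <zmq.h>

int main(void)
{
    void *ctx    = zmq_ctx_new();
    void *dealer = zmq_socket(ctx, ZMQ_DEALER);
    zmq_connect(dealer, "tcp://appY-host:5556");

    /* Either side may send or receive at any time without blocking: */
    zmq_send(dealer, "CMD", 3, ZMQ_DONTWAIT);

    char buf[1500];
    int n = zmq_recv(dealer, buf, sizeof buf, ZMQ_DONTWAIT);  /* -1/EAGAIN if empty */
    (void)n;

    zmq_close(dealer);
    zmq_ctx_destroy(ctx);
    return 0;
}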

Concurrent Read Access to Thread Object that Emulates Map

I am experiencing (very) slow page load times that increase proportionately to the number of active users on the system. I have a hunch that this is related to a custom defined thread object:
define stageStoreCache => thread {
    parent map
    public oncreate() => ..oncreate()
}
This stageStoreCache object simply mimics the behavior of a map whose data is available across the entire instance.
Many threads are reading it and very few threads are writing to it. Is this a poorly conceived solution for making a large map of data available across the instance? It's a fairly large map of maps that, when exported with map->asstring, can exceed 5 MB. The objective is to avoid translating data stored as JSON in the database into Lasso types on the fly.
It seems that the large size of the stageStoreCache is not what causes problems. It seems to really be the number of concurrent users on the system.
Thanks for any insight you can offer.
You said that this holds a map of maps and is rather large. If those sub-maps are large, it is possible that the way you are accessing the data is causing the issue. Here's what I mean: if you are doing something like this:
// Potential problem as it copies the sub-map each time
stageStoreCache->find('sub-map')->find('data')
stageStoreCache->find('sub-map')->find('other')
The problem is that each time stageStoreCache->find('sub-map') is called, it actually has to copy all the map data it finds for "sub-map" out of the thread object and into the thread requesting that data. If those sub-maps are large, this takes time. A better approach is to do this once and stash the result in a local variable:
// Better Approach
local(cache) = stageStoreCache->find('sub-map')
#cache->find('data')
#cache->find('other')
This at least only has to copy the "sub-map" over once. Another approach that might be better (only testing could tell) would be to refactor your code so that each call to stageStoreCache drills down to the data you actually want, and have just that small amount of data copied over.
// Might even be better as it just copies the values you want
stageStoreCache->drill('sub-map', 'data')
stageStoreCache->drill('sub-map', 'other')
Ultimately, I would love for Lasso to improve thread objects so that they never block for reads. (I had thought this had been submitted as a feature request, but I'm not finding it on Rhinotrac.) Until that happens, if none of my suggestions help, then you may need to investigate caching this data in something else, such as memcached.
Testing is the only way to tell for sure. But I would go a long way to avoid having a thread object that contains some 5 MB of data.
Take this snippet from the Lasso guide into consideration:
"all parameter values given to a thread object method are copied, as well as any return value of a thread object method"
http://www.lassoguide.com/language/threading.html
Meaning that one of the key features that makes Lasso 9 so fast, the extensive use of reference data, is lost.
Each time you have a call for stageStoreCache, all the data it contains will first be copied into the thread that asks for it. That is an awful lot of copying.
I have found that having settings and site-wide data contained in the smallest possible chunks is convenient and fast, and that it pays to only actually set them up when they are called for. This is unlike the old approach that had a config file included on every call, setting up a bunch of variables of which the majority maybe never got used on that particular call. Here's a trick that I'm using instead. Consider this:
define mysetting1 => var(__mysetting1) || $__mysetting1 := 'Setting 1 value'
define mysetting2 => var(__mysetting2) || $__mysetting2 := 'Setting 2 value'
define mysetting3 => var(__mysetting3) || $__mysetting3 := 'Setting 3 value'
Have this in a file that is read at startup, either in a LassoApp that's initiated or in a file in the startup folder.
These settings can then be called like this:
code blabla
mysetting2
more code blabla
mysetting1
mysetting2
The beauty is that, in this case, there is no wasted processing to initiate mysetting3, since it's never called for, while mysetting2 is called for several times but is still only initiated once.
This technique can be used for simple things like the above, but also to initiate complex types or methods, like session management, fetching POST or GET params, etc.
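For example, the same memoize-on-first-use idiom can hold a complex type such as a map (a hypothetical sketch following the define/var pattern above; names and values are illustrative):
define mysitemap => var(__mysitemap) || $__mysitemap := map('home' = '/', 'about' = '/about')
The map is only built on the first call to mysitemap; subsequent calls return the already-initialized variable.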

A MailboxProcessor that operates with a LIFO logic

I am learning about F# agents (MailboxProcessor).
I am dealing with a rather unconventional problem.
I have one agent (dataSource) which is a source of streaming data. The data has to be processed by an array of agents (dataProcessor). We can consider dataProcessor as some sort of tracking device.
Data may flow in faster than dataProcessor is able to process it.
It is OK to have some delay. However, I have to ensure that the agent stays on top of its work and does not get buried under obsolete observations.
I am exploring ways to deal with this problem.
The first idea is to implement a stack (LIFO) in dataSource. dataSource would send over the latest observation available when dataProcessor becomes available to receive and process the data. This solution may work, but it may get complicated, as dataProcessor would need to be blocked and re-activated and to communicate its status to dataSource, leading to a two-way communication problem. This problem may boil down to a blocking queue in the producer-consumer problem, but I am not sure.
The second idea is to have dataProcessor take care of message sorting. In this architecture, dataSource would simply post updates to dataProcessor's queue. dataProcessor would then use Scan to fetch the latest data available in its queue. This may be the way to go. However, I am not sure whether, in the current design of MailboxProcessor, it is possible to clear a queue of messages, deleting the older, obsolete ones. Furthermore, it is written here that:
Unfortunately, the TryScan function in the current version of F# is
broken in two ways. Firstly, the whole point is to specify a timeout
but the implementation does not actually honor it. Specifically,
irrelevant messages reset the timer. Secondly, as with the other Scan
function, the message queue is examined under a lock that prevents any
other threads from posting for the duration of the scan, which can be
an arbitrarily long time. Consequently, the TryScan function itself
tends to lock-up concurrent systems and can even introduce deadlocks
because the caller's code is evaluated inside the lock (e.g. posting
from the function argument to Scan or TryScan can deadlock the agent
when the code under the lock blocks waiting to acquire the lock it is
already under).
Having the latest observation bounced back may be a problem.
The author of this post, @Jon Harrop, suggests that
I managed to architect around it and the resulting architecture was actually better. In essence, I eagerly Receive all messages and filter using my own local queue.
This idea is surely worth exploring but, before starting to play around with code, I would welcome some input on how I could structure my solution.
Thank you.
Sounds like you might need a destructive-scan version of the mailbox processor. I implemented this with TPL Dataflow in a blog series that you might be interested in.
My blog is currently down for maintenance but I can point you to the posts in markdown format.
Part1
Part2
Part3
You can also check out the code on github
I also wrote about the issues with scan in my lurking horror post
Hope that helps...
tl;dr I would try this: take the Mailbox implementation from FSharp.Actor or Zach Bray's blog post, replace the ConcurrentQueue with a ConcurrentStack (plus add some bounded-capacity logic), and use this changed agent as a dispatcher to pass messages from dataSource to an army of dataProcessors implemented as ordinary MBPs or Actors.
tl;dr2 If workers are a scarce and slow resource and we need to process the message that is the latest at the moment a worker becomes ready, then it all boils down to an agent with a stack instead of a queue (with some bounded-capacity logic) plus a BlockingQueue of workers. The dispatcher dequeues a ready worker, then pops a message from the stack and sends it to the worker. After the job is done, the worker enqueues itself back when it becomes ready (e.g. before let! msg = inbox.Receive()). The dispatcher's consumer thread then blocks until any worker is ready, while the producer thread keeps the bounded stack updated. (A bounded stack could be done with an array + offset + size inside a lock; the one below is overly complex.)
Details
MailboxProcessor is designed to have only one consumer. This is even commented on in the source code of MBP here (search for the word 'DRAGONS' :) ).
If you post your data to an MBP, then only one thread can take it from the internal queue or stack.
In your particular use case, I would use ConcurrentStack directly, or better, wrapped in a BlockingCollection:
It will allow many concurrent consumers
It is very fast and thread safe
BlockingCollection has a BoundedCapacity property that allows you to limit the size of the collection. It throws on Add, but you can catch the exception or use TryAdd. If A is the main stack and B is a standby, then TryAdd to A; on false, Add to B and swap the two with Interlocked.Exchange, then process the needed messages in A, clear it, and make a new standby (or use three stacks if processing A could take longer than B takes to fill up again). This way you neither block nor lose any messages, but can discard unneeded ones in a controlled way.
BlockingCollection has methods like AddToAny/TakeFromAny, which work on arrays of BlockingCollections. This could help, e.g.:
dataSource produces messages to a BlockingCollection with ConcurrentStack implementation (BCCS)
another thread consumes messages from the BCCS and sends them to an array of processing BCCSs. You said that there is a lot of data, so you may sacrifice one thread to block and dispatch your messages indefinitely
each processing agent has its own BCCS, or is implemented as an Agent/Actor/MBP to which the dispatcher posts messages. In your case you need to send a message to only one processorAgent, so you may store the processing agents in a circular buffer to always dispatch a message to the least recently used processor.
Something like this:
(data stream produces 'T)
            |
   [dispatcher's BCCS]
            |
(a dispatcher thread consumes 'T and pushes to processors,
 manages capacity of the BCCS and the LRU queue)
      |                                  |
[processor1's BCCS/Actor/MBP]  ...  [processorN's BCCS/Actor/MBP]
      |                                  |
  (process)                          (process)
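As a minimal F# sketch of the dispatcher's bounded LIFO buffer described above (not production code; the message type and capacity are illustrative):
open System.Collections.Concurrent

// A bounded LIFO buffer: BlockingCollection over ConcurrentStack ("BCCS").
let stack  = ConcurrentStack<string>() :> IProducerConsumerCollection<string>
let buffer = new BlockingCollection<string>(stack, 100)   // bounded capacity

// Producer side (dataSource): TryAdd returns false instead of blocking when full.
let post msg =
    if not (buffer.TryAdd msg) then
        ()   // full: drop, or swap in a standby stack as suggested above

// Consumer side (dispatcher thread): Take blocks until an item is available;
// the newest message comes out first because the backing store is a stack.
let dispatch (send : string -> unit) =
    while true do send (buffer.Take())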
Instead of ConcurrentStack, you may want to read about the heap data structure. If you need your latest messages by some property of the messages, e.g. a timestamp, rather than by the order in which they arrive on the stack (e.g. if there could be delays in transit, so that arrival order <> creation order), you can get the latest message by using a heap.
If you still need Agent semantics/APIs, you could read several sources in addition to Dave's links, and somehow adapt the implementation to multiple concurrent consumers:
An interesting article by Zach Bray on an efficient Actors implementation. There you would need to replace the line execute true (under the comment // Might want to schedule this call on another thread.) with async { execute true } |> Async.Start or similar, because otherwise the producing thread would also be the consuming thread, which is not good for a single fast producer. However, for a dispatcher like the one described above, this is exactly what is needed.
The FSharp.Actor (aka Fakka) development branch and the FSharp MBP source code (first link above) could be very useful for implementation details. The FSharp.Actor library has been frozen for several months, but there is some activity in the dev branch.
Do not miss the discussion about Fakka on Google Groups in this context.
I have a somewhat similar use case, and for the last two days I have researched everything I could find on F# Agents/Actors. This answer is a kind of TODO list for myself to try these ideas, half of which were born while writing it.
The simplest solution is to greedily eat all messages in the inbox when one arrives and discard all but the most recent. Easily done using TryReceive:
let rec readLatestLoop oldMsg =
    async { let! newMsg = inbox.TryReceive 0
            match newMsg with
            | None -> return oldMsg
            | Some newMsg -> return! readLatestLoop newMsg }
let readLatest() =
    async { let! msg = inbox.Receive()
            return! readLatestLoop msg }
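For context, a hypothetical sketch of where such a loop would live (assuming string messages; the processing step is a placeholder):
let agent = MailboxProcessor<string>.Start(fun inbox ->
    let rec readLatestLoop oldMsg =
        async { let! newMsg = inbox.TryReceive 0
                match newMsg with
                | None -> return oldMsg
                | Some m -> return! readLatestLoop m }
    let rec loop () =
        async { let! msg = inbox.Receive()
                let! latest = readLatestLoop msg   // drain the inbox, keep the newest
                // ... process 'latest' here ...
                return! loop () }
    loop ())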
When faced with the same problem, I architected a more sophisticated and efficient solution that I call cancellable streaming, described in an F# Journal article here. The idea is to start processing messages and then cancel that processing if it is superseded. This significantly improves concurrency when significant processing is being done.

Occasionally Connected Application with Flex Air

This is the first time I have worked with this kind of application. I have very little experience, and so do the stakeholders. They want something like a Flex AIR application that is able to:
save data locally if the application is closed
synchronize data with the server side when needed
Now it is time for me to "do something" about this requirement before it gets fixed. I have many questions, but here are some less silly ones:
The requirement is about "Rarely" connected, not "Occasionally" connected, right?
If I can't change the requirement, what should I do: "hibernate" the AIR application like Windows does, or save only the data to a local DB? Are these possible?
Please give me some advice/recommendations.
Thanks,
P/S: Internally, we discussed the ADEP Data Services features. And I have a sample from Adobe: http://help.adobe.com/en_US/enterpriseplatform/10.0/AEPDeveloperGuide/WS562be5d616a63050-3e6e4f7d131900899a6-8000.html ==> I don't think I have fully understood it :)
You could indeed save data locally in SQLite when the application closes; the next time the application is launched, you can persist the changes to the server and retrieve updates if needed.
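For instance, a minimal sketch of opening a local AIR SQLite database to hold unsent changes (the file name and table are illustrative):
import flash.data.SQLConnection;
import flash.data.SQLStatement;
import flash.filesystem.File;

var conn:SQLConnection = new SQLConnection();
conn.open(File.applicationStorageDirectory.resolvePath("offline.db"));

var stmt:SQLStatement = new SQLStatement();
stmt.sqlConnection = conn;
stmt.text = "CREATE TABLE IF NOT EXISTS pending (id INTEGER PRIMARY KEY, payload TEXT)";
stmt.execute();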
Depending on complexity and data volume of your application, you could:
[free] implement the synchronization logic yourself using only remote objects and polling
[free] implement messaging on BlazeDS (data push) so the server can push updates in real time if needed
[expensive] use the all-in-one Adobe synchronization solution (LiveCycle)
It really depends on what kind of application and data we are talking about.
Cheers.
If possible, you can use a SharedObject...
public var _rememberSo:SharedObject;
_rememberSo = SharedObject.getLocal("loginData");
if (_rememberSo.data.username != undefined && _rememberSo.data.password != undefined)
{
    txtUserName.text = _rememberSo.data.username;
    txtPassword.text = _rememberSo.data.password;
    chkSavePassword.selected = true;
    btnLogin.setFocus();
}
else
{
    txtUserName.text = "";
    txtPassword.text = "";
    txtUserName.setFocus();
}
I have used this to save the password locally; you can use it for your data as appropriate...
It may not be the best choice, but we did choose the ADEP stuff:
Data model driven:
http://www.adobe.com/devnet/livecycle/articles/lcdses2_mdd_quickstart.html
--> which helps to generate code for both the server-side DAO and the SQLite DAO.
Adobe synchronization solution: here we created our own lastModified property to compare offline data with server data.
If anyone follows our path in the future, please note 3 things:
it is commercial, not open source
the documentation is good but not STRAIGHTFORWARD
the samples are good but not STRAIGHTFORWARD
I had to spend 2 days to understand things, but we are now confident with it.

Qt clearing an SQL query

What is the difference between
void QSqlQuery::clear ()
and
void QSqlQuery::finish ()
Based on the documentation, I don't see what the difference is. I'd like to know specifically when to use one over the other.
EDIT - Some more elaboration and info from documentation.
clear()
-Clears the result set and releases any resources held by the query.
Sounds like finish() does the same...
-Sets the query state to inactive.
Finish does the same.
finish()
-Instruct the database driver that no more data will be fetched from this query until it is re-executed.
What does this mean specifically? What is the consequence of this?
-It may be helpful in order to free resources such as locks or cursors if you intend to re-use the query at a later time.
Doesn't clear do the same? Doesn't clear release locks, cursors, etc?
-Sets the query to inactive.
clear does the same I believe.
-Bound values retain their values.
What is the point of this?
Qt comes with source code; you can see the difference by simply looking into the qsqlquery.cpp file.
So according to the source code:
clear - clears and resets the QSqlQuery object;
finish - resets the result member of the current query to an inactive state;
hope this helps, regards
The language used to describe these functions is similar, so it can definitely be a little confusing; I hope this explanation helps. Here's how I interpret and use these methods.
void QSqlQuery::finish ()
I think of this as a way of saying "I'm done with the query I just requested (e.g. no more reading/iterating), but I still plan on using that QSqlQuery object to do more work." You're just releasing any memory/resources used to get the values from the previous query. This really only makes a big, noticeable difference when you're dealing with large datasets over and over again, but I view it as good practice nonetheless.
void QSqlQuery::clear ()
This is my way of saying that I'm done with the QSqlQuery object and want to guarantee that none of the resources/memory I was using is left around while I dispose of the object. I rarely, if ever, use this, as I've found that its effectiveness can vary widely depending on the database you use, and if you're using modern C++ features it doesn't do a lot for you.
It's easier to understand the difference if you look at them as being written to solve a similar problem for two different time periods (e.g. old C code as opposed to modern C++).
They do very similar things but I'd recommend you just use finish().
For all who, like me, are wondering which method to invoke, I will share my research.
NOTE: I read the sources of the SQLite driver, so other database drivers may differ.
finish() resets the statement; in the SQLite context it calls sqlite3_reset.
clear() resets the whole QSqlQuery object; it clears bound values, the prepared statement, lastError(), lastQuery(), etc., and restores the default options for all of the object's parameters; in the SQLite context I think sqlite3_finalize is also called.
So you could visualize it as finish < clear. After finish() you can still call exec() to re-execute the query, but after clear() you must prepare the query again and bind its values before you can successfully re-execute it.
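A minimal sketch of that difference in practice (table and column names are illustrative):
QSqlQuery q;
q.prepare("SELECT name FROM users WHERE id = :id");
q.bindValue(":id", 1);
q.exec();
while (q.next()) { /* read q.value(0) ... */ }

q.finish();   // releases result-set resources; the prepared statement and bindings survive
q.exec();     // OK: re-executes the same prepared statement

q.clear();    // full reset: statement, bindings, lastError() are all gone
q.prepare("SELECT name FROM users WHERE id = :id");   // must prepare again
q.bindValue(":id", 2);
q.exec();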
