We want to schedule a workflow based on data availability but there is no particular frequency of data arrival. Also there could be multiple re-runs of data and hence multiple versions of the data for the day arriving at any time.
As I understand from the specification, currently it's mandatory to specify frequency parameter in coordinator.
However, we would like to trigger our workflow based on some event (data arrival or partition creation) only without depending on the frequency.
Seems this qualifies for Asynchronous Data Set. Does Oozie support Asynchronous Data Set yet?
The frequency parameter is mandatory, but you can specify an input event, something like this:
<datasets>
<dataset name="mydata" frequency="${coord:days(1)}" initial-instance="${initial_instance}" timezone="UTC">
<uri-template>${hcat_uri}/${hcatDatabase}/${hcatTable}/dt=${YEAR}${MONTH}${DAY}</uri-template>
</dataset>
</datasets>
<input-events>
<data-in name="MYDATA_IN" dataset="mydata">
<instance>${coord:current(0)}</instance>
</data-in>
</input-events>
https://oozie.apache.org/docs/3.1.3-incubating/CoordinatorFunctionalSpec.html#a6.1.4._Input_Events
So define a relatively low frequency and a meaningful unit and it will wait for the data availability. Probably it's make sense to specify a timeout for the coordinator, what's less than the frequency:
<timeout>[TIME_PERIOD]</timeout>
Or you can coordinate (start) your workflow directly (without coordinator) with e.g.: a cronjob, but that's not nice at all.
Related
I cannot find the answer from the document.
If I run query ingestion using
.set-or-append async
will the result be guaranteed?
Currently, we are running all those data grooming operations without the async keyword. Sometimes they jam our cluster, but we know they fail and can recover from a queue.
If we fire them async, then we don't have any control to know if they fail and recover.
What are the recommended ways to handle this?
For more context detail, we have to groom the data from 2 very large tables to the other 4 tables with aggregated data. UpdatePolicy is not really applied in this use case. We only need to run it once a week, since it's more like a weekly or monthly based aggregation. Unfortunately, when they run, we seem to easily get throttled.
Currently we are running all those data grooming operations without the async keyword. Sometimes they jam our cluster, but we know they fail and can recover from a queue.
This isn't accurate - these commands don't [currently] get queued. if a command fails, due to whatever reason (transient failure, throttling due to hitting ingestion capacity, etc.) - it fails, and it's the responsibility of the caller to retry it (if it makes sense to do so, based on the kind of the failure)
For more context detail, we have to groom the data from 2 very large tables to other 4 tables with aggregated data. UpdatePolicy is not really applied in this use case. We only need to run it once a week, since it's more like a weekly or monthly based aggregation. Unfortunately, when they run, we seem to easily get throttled.
The commands being frequently throttled is an indication that your cluster is frequently fully utilizing its ingestion capacity. as ingestion capacity scales linearly with the amount of cores in the cluster, you could choose to scale out/up your cluster, to get additional ingestion capacity (you could also do that on a schedule, temporarily, only when your additional ingestion load is expected to run).
note that you can (and probably should) monitor the state/status of commands you run using the async keyword - that could be done using the .show operations command (doc)
Say I want to start friendship between A and B.
Say I want to end friendship between A and B.
Those are two tasks I want to send to a queue having multiple consumers (workers).
I want to guarantee processing order so, how to avoid the second task to be performed before the first?
My solution: make tasks sticky (tasks about A are always sent to the same consumer).
Implementation: use RabbitMQ's exchanges and map tasks to the available consumers.
How do I map A to its consumer? I'm thinking about nginx's ip_hash. I think I need something similar.
I don't know if it is relevant but A and B are uuid.v4() UUIDs.
Can you point me out to the algorithm I need to accomplish mapping, please?
Well, there are two options:
make one exchange / queue for all events and guarantee that they're gonna be inserted in proper order. Create one worker for them. This costs more on inserting data (and doesn't give you option of scalability).
prepare your app for such situation, e.g. when you get message destroyFriendship and friendship does not exist - save message to db containing future friendship ending. Then you can have multiple workers making and destroying friendship and do not have to care about proper order. Simply do your job, make friends and if there's row in db about ending of friendship - destroy it (or simply do not create). Of course you need to check timestamp of creation/destroying time and check if destroying time was after creation time!
Of course you can count somehow hash of A/B, but it would be IMO more costfull then preparing app. Scalling app using excahnges/queues is not really good - you're going to create more and more queues and it's going to end up in too many queues/exchanges in rabbitmq.
If you have to use solution you specified - you can for example count crc32 from A and B, and using it's value calcalate to which queue task should be send. But having multiple consumers might result wrong here - what if one of consumers is blocked somehow and other receive message with destroying friendship? Using this solution I'd say that it's dangerous to have more than 1 worker per group of A/B.
Section speaks about a merge operation in FRP streams processing (Sodium library is used). Book shows a below diagram of streams combination, and says that when event enters FRP logic through a stream, it causes a cascade of state changes that happen in a transactional context, so all changes are atomic.
Streams of events - sDeselect, sSelect (see 2 events: "+" and "-") are originating from UI controls, since they happen within the same FRP transaction their carried events are considered simultaneous. Then book says
The merge implementation has to store the events in temporary storage
until a time when it knows it won’t receive any more input. Then it
outputs an event: if it received more than one, it uses the supplied
function to combine them; otherwise, it outputs the one event it
received.
Question: When it is a time when "no more input will come"? How merge function knows this moment? Is it simply the time it gets a value from the 2nd incoming stream on a given diagram or i'm missing smth? Can you illustrate it with a better streams example?
The way Sodium does this is to assign rank numbers to the structure of the directed graph of FRP logic held in memory, in such a way that if B depends on A, then B's rank will be higher than A's. (Cycles are broken in the graph traversal that assigns these ranks.) These numbers are then used as the priorities in a priority queue with low rank values processed first.
During event processing, when the priority queue contains nothing lower than the rank of the merge, then it is known that there can be no more input data for the merge, and it will output a value.
I am thinking about modelling a material flow network. There are processes which operate at a certain speed, buffers which can overflow or underflow and connections between these.
I don't see any problems modelling this in a classic Discrete Event Simulation (DES) fashion using a global event queue. I tried modelling the system without a queue but failed in early stages. Still I do not understand the underlying reason why a queue is needed, at least not for events which originate "inside" the network.
The idea of a queue-less DES is to treat the whole network as a function which takes a stream of events from the outside world and returns a stream of state changes. Every node in the network should only be affected by nodes which are directly connected to it. I have set some hopes on Haskell's arrows and Functional Reactive Programming (FRP) in general, but I am still learning.
An event queue looks too "global" to me. If my network falls apart into two subnets with no connections between them and I only ask questions about the state changes of one subnet, the other subnet should not do any computations at all. I could use two event queues in that case. However, as soon as I connect the two subnets I would have to put all events into a single queue. I don't like the idea, that I need to know the topology of the network in order to set up my queue(s).
So
is anybody aware of DES algorithms which do not need a global queue?
is there a reason why this is difficult or even impossible?
is FRP useful in the context of DES?
To answer the first point, no I'm not aware of any discrete-event simulation (DES) algorithms that do not need a global event queue. It is possible to have a hierarchy of event queues, in which each event queue is represented in its parent event queue as an event (corresponding to the time of its next event). If a new event is added to an event queue such that it becomes the queue's next event, then the event queue needs to be rescheduled in its parent to preserve the order of event execution. However, you will ultimately still boil down to a single, global event queue that is the parent of all of the others in hierarchy, and which dispatches each event.
Alternatively, you could dispense with DES and perform something more akin to a programmable logic controller (PLC) which reevaluates the state of the entire network every small increment of time. However, typically, that would be a lot slower (it may not even run as fast as real-time), because most of the time it would have nothing to do. If you pick too big a time increment, the simulation may lose accuracy.
The simplest answer to the second point is that, ultimately, to the best of my knowledge, it is impossible to do without a global event queue. Each simulation event needs to execute at the correct time, and - since time cannot run backwards - the order in which events are dispatched matters. The current simulation time is defined by the time that the current event executes. If you have separate event queues, you also have separate clocks, which would make things very confusing, to say the least.
In your case, if your subnetworks are completely independent, you could simulate each subnetwork individually. However, if the state of one subnetwork affects the state of the total network, and the state of the total network affects the state of each subnetwork, then - since an event is influenced by the events that preceded it, can only influence the events that follow, but cannot influence what preceded it - you have to simulate the whole network with a global event queue.
If it's any consolation, a true DES simulation does not perform any processing in between events (other that determining what the next event is), so there should be no wasted processing in one subnetwork if all the action is taking place in another.
Finally, functional reactive programming (FRP) is absolutely useful in the context of a DES. Indeed, I now write of lot of my DES simulations in Scala using this approach.
I hope this helps!
UPDATE: Since writing the above, I've used Sodium (an excellent FRP library, which was referenced by the OP in the comments below), and can add some further explanation: Sodium provides a means for subscribing to events, and for performing actions when those events occur. However, here I'm using the term event in a general sense, such as a button being clicked by a user in a GUI, or a network package arriving, etc. In other words, the events are not necessarily simulation events.
You can still use Sodium—or any other FRP library—as part of a simulation, to subscribe to simulation events and perform actions when they occur; however, these tools typically have no built-in support for simulation, and so you must incorporate a simulation engine as the source of simulation events, in the same way that a GUI is incorporated as the source of user interaction events. It is within this engine that the global event queue must reside.
Incidentally, if you are trying to perform parallel or distributed simulation model execution, things get considerably more complicated. You have multiple event queues in these situations, but they must be synchronized (giving the appearance of a single queue). The two basic approaches are conservative synchronization and optimistic synchronization.
We have a BizTalk application where the order of messages being inputted is very important and has to be kept, meaning they have to be outputted in the same order. Normally ordered delivery would do the trick here.
However I read that ordered delivery is only guaranteed when you connect a receive location directly to a send port. The moment you use orchestrations the order delivery isn't guaranteed anymore. Is there a way to work around or fix this? Because this kind of ruins our whole application and we've been working on this for months.
I read a work around from Microsoft where they use an extra field which has a counter and where they use an end orchestration which checks the counters. But this is way too much work for us to do now. So this work around is a no go. Plus not all messages are translated which creates holes in our flow and not all messages are coming from the same source either which makes this work around useless anyway.
Any other ideas?
Check out this page.
It explains that if you have an orchestration that follows the singleton pattern to ensure only one instance of the orchestration exists, and you make sure you set the orchestration's receive port to ordered delivery, than you should get a valid end-to-end ordered delivery scenario
To provide end-to-end ordered delivery the following conditions must be met:
Messages must be received with an adapter that preserves the order of the messages when submitting them to BizTalk Server. In BizTalk Server 2006, examples of such adapters are MSMQ, MQSeries, and MSMQT. In addition, HTTP or SOAP adapters can be used to submit messages in order, but in that case the HTTP or SOAP client needs to enforce the order by submitting messages one at a time.
You must subscribe to these messages with a send port that has the Ordered Delivery option set to True.
If an orchestration is used to process the messages, only a single instance of the orchestration should be used, the orchestration should be configured to use a sequential convoy, and the Ordered Delivery property of the orchestration's receive port should be set to True.
Resequencing strategies for ordered delivery in BizTalk:
I recently responded to a LinkedIn user's question regarding ordered delivery options in BizTalk.
I thought it would be useful for people to understand some of the strategies for re-sequencing messages using BizTalk.
Often as an BizTalk Developer, you are required to integrate to line-of-business systems which are unchangeable. This can be for one or more of many different reasons. As an example, the cost of changing a system can be too high or the vendor license states that support may be withdrawn if the system is changed.
This would not normally represent a problem where the vendor has provided a thoughtfully designed API as a point-of-integration endpoint. However, as many Integration Developers quickly learn, this is very rarely the case.
What do I mean by a thoughtfully designed API? Well, aside from all the SODA principals (service composition, fault contracts etc.), the most important feature of an API is to support the consumption of data which arrives in the wrong order.
This is a fairly simple thing to do. For example, if you are a vendor and you provide a HTTP operation as your integration point then one of the fields you could expose on your operation is a time-stamp or, even better, a sequence number. This means that if a call is made with an out-of-date payload, the relevant compensating mechanism can kick-in - which can be as simple as discarding the data.
This article discusses what to do when the vendor has not built this feature into an API, and as an integrator you therefore are forced to implement end-to-end ordered delivery as part of your integration solution.
As stated in my response to the user's post on LinkedIn (see link above), in BizTalk ordered delivery in any but the simplest of scenarios is complicated at best and at worst can represent a huge cost in increased complexity, both in terms of development and support. The basic reason is that BizTalk is designed to be massively concurrent to enable high throughput, and there is a direct and unavoidable conflict between concurrency and ordering. Shoe-horning E2E ordered delivery into a BizTalk solution relies on artefacts such as singleton orchestrations which introduce complexity and increase both failure rate and cost-per-failure numbers.
A far better solution is to maintain concurrent processing to as near as possible to the line-of-business system endpoints, and then implement what is called a re-sequencer wrapper around each of the endpoints which require data to be delivered in the correct order.
How to implement such a wrapper in BizTalk depends on some factors, which are outlined in the following table:
|Sequencing |Messages|Database |Wrapper |
|field |are |integration?|strategy |
| |deltas? | | |
|--------------|--------|------------|----------------------------------|
|n of a total m| N | Y |Stored procedure |
|n of a total m| N | N |Singleton orchestration |
|n of a total m| Y | Y |Batched singleton orchestration |
|n of a total m| Y | N |Batched singleton orchestration |
|Timestamp | N | Y |Stored procedure |
|Timestamp | N | N |Singleton orchestration |
|Timestamp | Y | Y |Buffer table with staggered reader|
|Timestamp | Y | N |Buffer table with staggered reader|
The first factor Sequencing field relates to the idea that in order to implement any kind of re-sequencer wrapper, as a minimum you will require that your message data contains some sequencing information. This can take the form of a source time-stamp; however a better, though rarer, kind of sequencing information consists of a sequence number combined with the total number of messages, for example, message 1 of 10, 2 of 10, etc..
The second factor Messages are deltas? relates to whether or not the payload of your message contains a single state change to the data or the sum of all past changes to the data. Put another way, is it possible to reconstruct the full current state of the data from this message? If the message payload contains just a single change then it may not be possible to reconstruct the state of the data from the single message, and in this instance your message is a delta.
The third factor Database integration? relates to whether or not the integration-entry-point to a system is a database. The reason this matters is that integrating at the database layer is a fairly common integration scenario, and if available can greatly simplify handling re-sequencing.
The strategies from the above table are described in detail below:
Stored procedure wrapper
This is the simplest of the resequencing strategies. A new stored procedure is created which queries the target data before making a decision about whether to update the target data. The decision can be as simple as Is the data I have newer than the data in the target system?
Of course, in order to implement this strategy, the target data also has to include the sequencing field of the source data, although an approximation can be made if necessary by relying on existing time-stamps which may already exist in the target data. The stored procedure wrapper can be contained either in the target database or ideally in a separate database.
Singleton orchestration wrapper
The idea behind this strategy is the singleton orchestration. This is a pattern you can implement to ensure that only a single instance of the orchestration will exist at any one time. There are many articles on the web demonstrating how to implement this pattern in BizTalk.
The core of the idea is that the singleton simply keeps a track of the most recent successfully processed message sequence (or time-stamp). If the singleton receives a message which is older than the most recent sequence it is simply discarded. This works because the messages are non-deltas, so the target system can commit only the most recent of a number of messages and the data will be in the most recent state. Only when data is committed successfully is the most recent sequence held by the singleton updated.
Batched singleton orchestration wrapper
This strategy is based on the Singleton orchestration wrapper above, except it is more complex. Rather than only keep the most recent sequence information in memory the singleton is required to create and hold a working set of messages in memory which it will re-order and then process once all expected messages from the batch have arrived. This is because the messages are deltas so the target system MUST receive each message in the order they were intended. Once the batch has been sent successfully the singleton can terminate.
To do this it is a requisite of the source data that it contain a correlation identifier of some description which allows the batch of messages to be defined. For example, processing a defined set of orders from a customer, the inbound messages must contain an identifier for the customer. This can then be used to route the messages to the singleton orchestration instance correlated with this customer. Furthermore the message sequence field available must be of the n of a total m form.
Once the singleton is initialised it assembles a working set of messages in memory and proceeds to populate it as new messages arrive. One way I have seen this done is using a System.Collections.Generic.List as the container for the working set. Once the list has been fully populated (list length = m) then it is assumed all messages in the batch have been received and the orchestration then loops over the working set in sequence and processes the messages into the target system.
One of the benefits of the batched singleton orchestration wrapper is it allows concurrent processing by correlation identifier. In the example above this means that messages from two customers would be processed concurrently.
Buffer table with staggered reader wrapper
Arguably the most complex of the strategies presented, this solution is to be used when you have delta messaging with a time-stamp-based sequencing field. It can be implemented with a database of some description which acts as a re-sequencing buffer.
It is worth noting here that this re-sequencing wrapper does not guarantee ordered delivery, but used well it makes ordered delivery highly likely.
As messages arrive, they are written into the buffer and in the same operation the buffer is reordered, so that the order of messages held in the buffer are always correct.
To create the buffer reader, have a receive location which reads the messages in the buffer before passing the messages to a send port with ordered delivery enabled, which then will process the messages into the target system. You can also use a singleton orchestration as an intermediary if your target system's API semantics are too complex for a send port.
However, using this wrapper as I have described it above will not enable ordered delivery, as the messages will almost certainly be committed to the buffer in the wrong order, which will result in the messages being processed into the target system in the same (wrong) order. This is where the staggered query comes in. This is a fancy way of saying your buffer query needs to only select data at intervals of time T, AND only select those rows where the row-number is lower than buffer total row count minus C.
This has the effect of allowing sequencing to occur over an appropriate timespan. T will be familiar to most BizTalk developers as the polling interval of some adapters (such as the WCF-SQL adapter). C is slightly more difficult to set, but by increasing this number you are reducing the chances that when you poll, you will miss a message older than the most recent one in your retrieved data set.
What T and C are depends on many things, although these values should be based on your latency SLA and your message volume (or throughput). As a guideline, if you have a SLA to deliver data into your target system within 30 seconds and you process 10 messages per second then T should be around 10 seconds and C should be around 100 rows.
Of course this only works if your messages for a given correlation id are sent by the source system during a short space of time (ideally back-to-back). The longer the interval between sends, the greater the required value of C, and the less effective the wrapper becomes.
One of the benefits of this strategy is you can also perform de-duplication of messages in the buffer if your data source is prone to sending duplicate messages and your target system endpoint is not idempotent. You can also use the buffer to implement FILO and other non-standard queueing semantics.
Conclusions
The strategies I have discussed here are ways of bending BizTalk to a task which is wasn't designed to do. As a result each has caveats around cost and complexity to support, and also may not work in certain scenarios. I would like to hear from anyone who has implemented other patterns for ordered delivery in BizTalk.