Oozie coordinator - file event based trigger - multiple firing

I'm trying to understand why an Oozie 4.2 coordinator job, which should wait for a dataset, fires multiple times. My coordinator job looks like this:
<coordinator-app name="ConfirmDataMasterTrigger"
                 frequency="${frequencyMins}"
                 start="${startTime}"
                 end="${endTime}"
                 timezone="${timeZoneDef}"
                 xmlns="uri:oozie:coordinator:0.4"
                 xmlns:sla="uri:oozie:sla:0.2">
    <controls>
        <timeout>${TimeOutMins}</timeout>
        <concurrency>${Concurrency}</concurrency>
        <execution>${Execution}</execution>
    </controls>
    <datasets>
        <dataset name="inputDS"
                 frequency="${coord:days(1)}"
                 initial-instance="${startTime}"
                 timezone="${timeZoneDef}">
            <uri-template>${triggerFileDir}</uri-template>
            <done-flag></done-flag>
        </dataset>
    </datasets>
    <input-events>
        <data-in name="ConfirmDataMasterTrigInput"
                 dataset="inputDS">
            <instance>${coord:current(0)}</instance>
        </data-in>
    </input-events>
    <action>
        <workflow>
            <app-path>${workflowAppPath}</app-path>
            <configuration>
                <property>
                    <name>SaveDateString</name>
                    <value>${coord:formatTime(coord:actualTime(),"-yyyyMMdd-HHmmss")}</value>
                </property>
                <property>
                    <name>WaitForThisInputData</name>
                    <value>${coord:dataIn('ConfirmDataMasterTrigInput')}</value>
                </property>
            </configuration>
        </workflow>
    </action>
</coordinator-app>
The properties file looks like this:
nameNode=hdfs://hc1m1.nec.co.nz:8020
jobTracker=hc1r1m2.nec.co.nz:8050
hdfsUser=oozie
wfProject=ConfirmDataMaster
oozie.libpath=${nameNode}/user/oozie/share/lib
oozie.use.system.libpath=true
oozie.wf.rerun.failnodes=true
moveFile=ConfirmDataMaster_edit.csv
sourceDir=${nameNode}/mule/sheets/input/ConfirmDataMaster/
targetDir=/mule/sheets/store/
sourceFile=${sourceDir}${moveFile}
targetFile=${targetDir}${moveFile}
frequencyMins=10
startTime=2016-07-31T12:00Z
endTime=2099-01-01T12:00Z
timeZoneDef=GMT+12:00
TimeOutMins=10
Concurrency=1
Execution=FIFO
triggerDir=trigger/
triggerFileDir=${sourceDir}${triggerDir}
doneFlag=trigger.dat
workflowAppPath=${nameNode}/user/${hdfsUser}/wf/${wfProject}
oozie.coord.application.path=${nameNode}/user/${hdfsUser}/wf/${wfProject}
Getting the workflow triggered by the coordinator on a dataset event is not the problem. What I am seeing is that the underlying workflow is triggered continuously. Can anyone advise what changes I should make, or point out my error? My workflow does clean up and delete the trigger path. Thanks in advance.

I'll answer my own question, because I've worked out the solution and it's a bit obvious really; I was just a little confused. The firing frequency is controlled by the coordinator and dataset frequencies, together with the trigger directory and file. If you don't want a trigger file, leave done-flag empty; the existence of the directory itself then acts as the trigger. If the done-flag element is omitted altogether, the default flag file is _SUCCESS.
So as long as the trigger is available, the workflow fires at the specified frequencies. I have changed my coordinator and dataset frequencies to 30 (minutes). As a final task, my workflow removes the trigger.

How to design the Oozie coordinator on arrival of input multiple times in a day

I have a requirement to run my coordinator whenever input arrives from another application. I may receive input once or several times a day, so whenever input arrives I need to trigger my processing. Can anyone please help me do this with an Oozie coordinator? I have tried the code below, but it runs only once a day; it does not fire again even if I receive more input after the first run.
<coordinator-app name="TEST_MASTER_C" frequency="${coord:days(1)}" start="${start_date}" end="${end_date}" timezone="Europe/Paris" xmlns="uri:oozie:coordinator:0.2" xmlns:sla="uri:oozie:sla:0.1">
    <controls>
        <timeout>1430</timeout>
        <execution>FIFO</execution>
    </controls>
    <datasets>
        <dataset name="inputFlag" frequency="${coord:days(1)}" initial-instance="${start_date}" timezone="Europe/Paris">
            <uri-template>#nameNode#/testpath</uri-template>
            <done-flag>${flag}</done-flag>
        </dataset>
    </datasets>
    <input-events>
        <data-in name="coordInputFlag" dataset="inputFlag">
            <instance>${coord:current(0)}</instance>
        </data-in>
    </input-events>

Application Insights - Custom Performance Counter

I can instantiate a new performance counter object, but how can I "track" it? There's "TrackTrace", "TrackException", etc., but there's no "TrackPerformanceCounter". Is there any workaround for that?
If you're using a real performance counter, you'd configure your ApplicationInsights.config to collect that performance counter, and it is then collected automatically (https://learn.microsoft.com/en-us/azure/application-insights/app-insights-performance-counters):
<Add Type="Microsoft.ApplicationInsights.Extensibility.PerfCounterCollector.PerformanceCollectorModule, Microsoft.AI.PerfCounterCollector">
    <Counters>
        <Add PerformanceCounter="\Objects\Processes"/>
        <Add PerformanceCounter="\Sales(photo)\# Items Sold" ReportAs="Photo sales"/>
    </Counters>
</Add>
If you're not using real performance counters and simply want to track the value of a number, you'd either call TrackMetric(nameOfThing, valueOfThing) directly, or attach the value to an event by passing it in the metrics parameter of any TrackEvent(nameOfEvent, properties, metrics) call.
There's no workaround. There are some open issues on the GitHub project about increasing or configuring the frequency of data capture, but nothing concrete as of September 2017.

What's the relation between load and for?

I am trying Tsung for the first time and need some clarification.
I am using the load tag as follows:
<load>
    <arrivalphase phase="1" duration="1" unit="minute">
        <users maxnumber="100000" interarrival="0.01" unit="second"/>
    </arrivalphase>
</load>
But how would the for loop below work?
<sessions>
    <session name="root" probability="100" type="ts_http">
        <for from="1" to="2" var="i">
            <request>
                <http url="/test/counter" method="POST" contents="bla=blu&amp;name=glop">
                </http>
            </request>
        </for>
    </session>
</sessions>
What I thought was that the loop would count from 1 to 2 and therefore send only two requests. However, when I run the XML file, I get hundreds of requests! Does this mean that each user in the arrivalphase sends two requests, as in the for loop above?
Can someone explain the relation between the for tag and the load tag in the example above?
Your analysis is right. During the one-minute arrival phase, Tsung creates 100 users per second, and each user sends two requests, as in the for loop above.
The load section defines the rules by which Tsung generates users; the session defines the logic that every user performs.
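As a rough check of the answer above, the expected request count can be estimated from the load and for settings. This is only a sketch: the class and method names are made up for illustration, it assumes every started user completes its session, and since interarrival is a mean value the real counts will vary somewhat.

```java
/** Hypothetical helper, not part of Tsung: estimates load from the XML settings. */
final class TsungLoadEstimate {

    /** Users started in one arrival phase: duration / interarrival, capped by maxnumber. */
    static long usersStarted(double durationSeconds, double interarrivalSeconds, long maxNumber) {
        long arrivals = Math.round(durationSeconds / interarrivalSeconds);
        return Math.min(arrivals, maxNumber);
    }

    /** Requests sent in total when every user runs the same for loop once. */
    static long totalRequests(long users, int loopFrom, int loopTo, int requestsPerIteration) {
        int iterations = loopTo - loopFrom + 1; // the loop bounds are inclusive
        return users * iterations * requestsPerIteration;
    }

    public static void main(String[] args) {
        // Values from the question: 1 minute phase, interarrival 0.01 s, maxnumber 100000.
        long users = usersStarted(60.0, 0.01, 100_000);
        long requests = totalRequests(users, 1, 2, 1);
        System.out.println(users + " users, " + requests + " requests");
    }
}
```

For the configuration in the question this works out to about 6,000 users and 12,000 requests, which matches the observation that far more than two requests are sent.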

Holding data processing for incomplete data sets with Mule and a collection-aggregator

I need to collect and process sets of files generated by another organization. For simplicity, say that a set consists of two files, a summary file and a detail file, named like SUM20150701.dat and DTL20150701.dat, which together constitute the set for date 20150701. The issue is that sets need to be processed in order, and the transmission of files from the outside organization can be error prone, so a file may be missing. If this occurs, that set of files should hold, as should any following sets that are found. As an example, at the start of the Mule process the source folder may contain SUM20150701.dat, SUM20150703.dat and DTL20150703.dat. That is, the data set for 20150701 is incomplete while 20150703 is complete. I need both data sets to hold until DTL20150701.dat arrives, and then to be processed in order.
In this simplified form of my mule process a source folder is watched for files. When found, they are moved to an archive folder and passed to the collection-aggregator using the date as the sequence and correlation values. When a set is complete, it is moved to a destination folder. A lengthy timeout is used on the collector to make sure incomplete sets are not processed:
<file:connector name="File" autoDelete="false" streaming="false" validateConnections="true" doc:name="File">
    <file:expression-filename-parser />
</file:connector>
<file:connector name="File1" autoDelete="false" outputAppend="true" streaming="false" validateConnections="true" doc:name="File" />
<vm:connector name="VM" validateConnections="true" doc:name="VM">
    <receiver-threading-profile maxThreadsActive="1"></receiver-threading-profile>
</vm:connector>
<flow name="fileaggreFlow2" doc:name="fileaggreFlow2">
    <file:inbound-endpoint path="G:\SourceDir" moveToDirectory="g:\SourceDir\Archive" connector-ref="File1" doc:name="get-working-files"
                           responseTimeout="10000" pollingFrequency="5000" fileAge="600000" >
        <file:filename-regex-filter pattern="DTL(.*).dat|SUM(.*).dat" caseSensitive="false"/>
    </file:inbound-endpoint>
    <message-properties-transformer overwrite="true" doc:name="Message Properties">
        <add-message-property key="MULE_CORRELATION_ID" value="#[message.inboundProperties.originalFilename.substring(5, message.inboundProperties.originalFilename.lastIndexOf('.'))]"/>
        <add-message-property key="MULE_CORRELATION_GROUP_SIZE" value="2"/>
        <add-message-property key="MULE_CORRELATION_SEQUENCE" value="#[message.inboundProperties.originalFilename.substring(5, message.inboundProperties.originalFilename.lastIndexOf('.'))]"/>
    </message-properties-transformer>
    <vm:outbound-endpoint exchange-pattern="one-way" path="Merge" doc:name="VM" connector-ref="VM"/>
</flow>
<flow name="fileaggreFlow1" doc:name="fileaggreFlow1" processingStrategy="synchronous">
    <vm:inbound-endpoint exchange-pattern="one-way" path="Merge" doc:name="VM" connector-ref="VM"/>
    <processor-chain doc:name="Processor Chain">
        <collection-aggregator timeout="1000000" failOnTimeout="true" doc:name="Collection Aggregator"/>
        <foreach doc:name="For Each">
            <file:outbound-endpoint path="G:\DestDir1" outputPattern="#[function:datestamp:yyyyMMdd.HHmmss].#[message.inboundProperties.originalFilename]" responseTimeout="10000" connector-ref="File1" doc:name="Destination"/>
        </foreach>
    </processor-chain>
</flow>
This correctly processes sets in order when all sets are complete. It also correctly waits for incomplete sets to fill, but it does not hold the sets that follow; that is, in the above example, set 20150703 will be processed while 20150701 is still waiting for its DTL file.
Is there a setting or another construct that will force the collection-aggregator element to wait if an earlier collection is not yet complete?
I am using the date part of the file name for both the correlation and sequence IDs, which does make sets process in the order I want when all sets are complete. It does not matter if dates are missing entirely (as with 20150702 in this case), only that existing files are processed in order and that sets are complete.
In the end, I could not get the collection-aggregator to do this. To overcome it, I built a Java class that contains Maps for the SUM and DTL files, with the correlation ID as the key, and a sorted list of open keys.
The Java class monitors for a completed set on the smallest key and signals back to the Mule flow when that set is available for processing.
The Mule flow must be put into synchronous mode while processing the files to prevent a data race. When processing is complete, the flow signals the Java class so that the set can be dropped from the list and Maps, and it receives an indication back if the next set is ready to process.
It is not the prettiest, and I would have preferred not to use custom features for this, but it gets the job done.
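The answer does not include its Java class, so the following is only a minimal sketch of how such a holder might look. The class name OrderedSetGate, its method names, and the choice to take the key from the characters between the three-letter prefix and the extension are all assumptions made for illustration; wiring it into the Mule flow is left out.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical sketch: holds SUM/DTL files per date key and releases sets strictly in key order. */
final class OrderedSetGate {
    private final Map<String, String> sumFiles = new HashMap<>();
    private final Map<String, String> dtlFiles = new HashMap<>();
    // Open keys, smallest first; yyyyMMdd strings sort chronologically.
    private final TreeSet<String> openKeys = new TreeSet<>();

    /** Registers a file such as SUM20150701.dat; the key is the date part of the name. */
    synchronized void addFile(String fileName) {
        String upper = fileName.toUpperCase(Locale.ROOT);
        int dot = fileName.lastIndexOf('.');
        if (dot <= 3 || !(upper.startsWith("SUM") || upper.startsWith("DTL"))) {
            throw new IllegalArgumentException("Unexpected file name: " + fileName);
        }
        String key = fileName.substring(3, dot);
        if (upper.startsWith("SUM")) {
            sumFiles.put(key, fileName);
        } else {
            dtlFiles.put(key, fileName);
        }
        openKeys.add(key);
    }

    /** Returns the smallest open key if its set is complete, otherwise null (all sets hold). */
    synchronized String nextReadyKey() {
        if (openKeys.isEmpty()) {
            return null;
        }
        String smallest = openKeys.first();
        boolean complete = sumFiles.containsKey(smallest) && dtlFiles.containsKey(smallest);
        return complete ? smallest : null;
    }

    /** Called after a set has been processed; drops it and reports the next ready key, if any. */
    synchronized String markProcessed(String key) {
        sumFiles.remove(key);
        dtlFiles.remove(key);
        openKeys.remove(key);
        return nextReadyKey();
    }
}
```

Because only the smallest open key is ever checked, a complete later set (20150703 in the example) stays held for as long as an earlier set (20150701) is missing a file.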

BizTalk Business Rules Engine - Repeating Elements

I'm trying to create what I think should be a relatively simple business rule to operate over repeating elements in an XML schema.
Consider the following XML snippet (this is simplified with namespaces removed, for readability):
<Root>
    <AllAccounts>
        <Account id="1" currentPayment="10.00" arrearsAmount="25.00">
            <AllCustomers>
                <Customer id="20" primary="true" canSelfServe="false" />
                <Customer id="21" primary="false" canSelfServe="false" />
            </AllCustomers>
        </Account>
        <Account id="2" currentPayment="10.00" arrearsAmount="15.00">
            <AllCustomers>
                <Customer id="30" primary="true" canSelfServe="false" />
                <Customer id="31" primary="false" canSelfServe="false" />
            </AllCustomers>
        </Account>
    </AllAccounts>
</Root>
What I want to do is to have two rules:
Set /Root/AllAccounts/Account[x]/AllCustomers/Customer[primary='true']/canSelfServe = true IF arrearsAmount < currentPayment
Set /Root/AllAccounts/Account[x]/AllCustomers/Customer[primary='true']/canSelfServe = false IF arrearsAmount >= currentPayment
Where [x] is 0...number of /Root/AllAccounts/Account records present in the XML.
I've tried two simple rules for this, and each rule seems to fire x * x times, where x is the number of Account records in the XML. I only want each rule to fire once for each Account record.
Any help greatly appreciated!
Thanks
Andrew
Make sure that the rules have the same Priority, just in case (I have had issues with priorities before). I've also seen that, at the Rules level, there is a property called Maximum Execution Loop Depth, which sets how many times a rule can be re-evaluated. Try setting it to 1 if you're sure that your rules should only be evaluated once per payload. I hope this helps.
Check your predicate. The rule fires once for each matching combo of fields used in the predicate.
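To illustrate why the count can come out as x * x, here is a small model in plain Java. It does not use the BizTalk rules API; it simply assumes that arrearsAmount and currentPayment are matched independently of each other, so every value of one is paired with every value of the other. The class and method names are hypothetical, and the numbers are taken from the two Account elements in the question.

```java
/** Hypothetical model of rule firing counts; not BizTalk code. */
final class RuleFiringCount {
    // arrearsAmount and currentPayment of Account 1 and Account 2 from the question.
    static final double[] ARREARS = {25.00, 15.00};
    static final double[] CURRENT = {10.00, 10.00};

    /** Fields matched independently: each arrearsAmount is paired with each currentPayment. */
    static int firingsWithIndependentFields() {
        int count = 0;
        for (double arrears : ARREARS) {
            for (double current : CURRENT) {
                if (arrears >= current) {
                    count++;
                }
            }
        }
        return count;
    }

    /** Both fields taken from the same Account: one evaluation per Account. */
    static int firingsPerAccount() {
        int count = 0;
        for (int i = 0; i < ARREARS.length; i++) {
            if (ARREARS[i] >= CURRENT[i]) {
                count++;
            }
        }
        return count;
    }
}
```

With the sample data, the independent pairing matches four combinations for the "arrearsAmount >= currentPayment" rule, while tying both fields to the same Account gives two, one per Account.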
