How to design the Oozie coordinator on arrival of input multiple times in a day

I need to schedule my coordinator to run whenever input arrives from another application. The input may arrive once or several times a day, and each arrival should trigger my processing. How can I do this with an Oozie coordinator? I tried the code below, but it runs only once a day; it does not run again if more input arrives after the first run.
<coordinator-app name="TEST_MASTER_C" frequency="${coord:days(1)}" start="${start_date}" end="${end_date}" timezone="Europe/Paris" xmlns="uri:oozie:coordinator:0.2" xmlns:sla="uri:oozie:sla:0.1">
<controls>
<timeout>1430</timeout>
<execution>FIFO</execution>
</controls>
<datasets>
<dataset name="inputFlag" frequency="${coord:days(1)}" initial-instance="${start_date}" timezone="Europe/Paris">
<uri-template>#nameNode#/testpath</uri-template>
<done-flag>${flag}</done-flag>
</dataset>
</datasets>
<input-events>
<data-in name="coordInputFlag" dataset="inputFlag">
<instance>${coord:current(0)}</instance>
</data-in>
</input-events>

Related

Segment Status Handling: NumAttempts WaitInterval params

The documentation (page 15 of 22) gives this example:
<EnhancedAirBookRQ xmlns="http://services.sabre.com/sp/eab/v3_2">
<OTA_AirBookRQ>
<HaltOnStatus Code="NN"/>
<OriginDestinationInformation>
<FlightSegment DepartureDateTime="2014-06-03T12:30:00" FlightNumber="1022" NumberInParty="1" ResBookDesigCode="F" Status="NN">
<DestinationLocation LocationCode="LAS"/>
<MarketingAirline Code="AA" FlightNumber="1022"/>
<OriginLocation LocationCode="DFW"/>
</FlightSegment>
</OriginDestinationInformation>
<RedisplayReservation NumAttempts="2" WaitInterval="100"/>
</OTA_AirBookRQ>
</EnhancedAirBookRQ>
Could you help me understand what I am missing if I set the NumAttempts and WaitInterval parameters like this?
My guess is that segments without a halt-on status will get a quick answer.

Inside OTA_AirBookRQ, HaltOnStatus and RedisplayReservation work together. Behind the scenes, the service books what you requested and then redisplays the reservation up to NumAttempts times, waiting WaitInterval milliseconds between checks.
On each check it looks at whether the status of the segments in the itinerary has changed and validates them against the <HaltOnStatus Code="NN"/> entries (there can be several HaltOnStatus elements). For NN, at least, it keeps checking; for other statuses, such as UC, it leaves the loop.
The reason is that NN means Sabre is still waiting for the airline's response to the booking request, and (I believe) anything else is treated as the airline's response, so it leaves the loop.
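
The loop described above can be sketched roughly as follows. This is only a model of the behaviour as I understand it from the description, not Sabre's actual implementation: `redisplay_until_settled` and `fetch_statuses` are hypothetical names, and `fetch_statuses` stands in for redisplaying the reservation and reading the segment status codes.

```python
import time


def redisplay_until_settled(fetch_statuses, halt_codes, num_attempts,
                            wait_interval_ms, pending_code="NN",
                            sleep=time.sleep):
    """Poll segment statuses up to num_attempts times.

    fetch_statuses is a hypothetical callable returning the current list
    of segment status codes. Returns (statuses, attempts_used, halted).
    """
    statuses = []
    attempts_used = 0
    for attempt in range(1, num_attempts + 1):
        statuses = fetch_statuses()
        attempts_used = attempt
        # Anything other than the pending code counts as the airline's answer.
        if pending_code not in statuses:
            break
        # Wait before the next redisplay, but not after the last attempt.
        if attempt < num_attempts:
            sleep(wait_interval_ms / 1000.0)
    # Assumption: the request halts if a segment ends on a HaltOnStatus code.
    halted = any(code in halt_codes for code in statuses)
    return statuses, attempts_used, halted
```

Under this model, NumAttempts="2" with WaitInterval="100" means at most two looks at the reservation, 100 milliseconds apart, and a segment that already has a final status on the first look returns without waiting.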

What's the relation between load and for?

I am trying Tsung for the first time and need some clarification.
I am using the load tag as follows:
<load>
<arrivalphase phase="1" duration="1" unit="minute">
<users maxnumber="100000" interarrival="0.01" unit="second"/>
</arrivalphase>
</load>
But how does the for loop below work?
<sessions>
<session name="root" probability="100" type="ts_http">
<for from="1" to="2" var="i">
<request>
<http url="/test/counter" method="POST" contents="bla=blu&amp;name=glop">
</http>
</request>
</for>
</session>
</sessions>
I thought the loop would count from 1 to 2 and so send only two requests, but when I ran the XML file I got hundreds of requests! Does this mean that each user created in the arrivalphase sends two requests, as in the for loop above?
Can someone explain the relation between the for tag and the load tag in this example?

Your analysis is right. During the one-minute phase, Tsung creates about 100 users per second (one every 0.01 seconds), and each user sends two requests, as in the for loop above.
The load section defines the rules by which Tsung generates users; the session defines the logic that each user performs.
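
As a rough check of the numbers, the arithmetic can be worked through in a few lines. This is only an estimate: `expected_requests` is a hypothetical helper, and it assumes users arrive at exactly the configured interarrival time, whereas real arrivals are, as far as I know, randomized around that mean.

```python
def expected_requests(duration_s, interarrival_s, maxnumber, requests_per_session):
    """Estimate users created in one arrival phase and the requests they send."""
    # One new user every interarrival_s seconds, capped by maxnumber.
    users = min(int(round(duration_s / interarrival_s)), maxnumber)
    # Every user runs the session once, so the for loop body runs per user.
    return users, users * requests_per_session


users, requests = expected_requests(
    duration_s=60,           # duration="1" unit="minute"
    interarrival_s=0.01,     # interarrival="0.01" unit="second"
    maxnumber=100000,        # maxnumber="100000"
    requests_per_session=2,  # <for from="1" to="2">
)
print(users, requests)  # 6000 12000
```

So the for loop bounds the requests per user, while the load section decides how many users there are.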

Oozie coordinator - file event based trigger - multiple firing

I'm trying to understand why an Oozie 4.2 coordinator job that should wait for a dataset fires multiple times. My coordinator job looks like this:
<coordinator-app name="ConfirmDataMasterTrigger"
frequency="${frequencyMins}"
start="${startTime}"
end="${endTime}"
timezone="${timeZoneDef}"
xmlns="uri:oozie:coordinator:0.4"
xmlns:sla="uri:oozie:sla:0.2">
<controls>
<timeout>${TimeOutMins}</timeout>
<concurrency>${Concurrency}</concurrency>
<execution>${Execution}</execution>
</controls>
<datasets>
<dataset name="inputDS"
frequency="${coord:days(1)}"
initial-instance="${startTime}"
timezone="${timeZoneDef}">
<uri-template>${triggerFileDir}</uri-template>
<done-flag></done-flag>
</dataset>
</datasets>
<input-events>
<data-in name="ConfirmDataMasterTrigInput"
dataset="inputDS">
<instance>${coord:current(0)}</instance>
</data-in>
</input-events>
<action>
<workflow>
<app-path>${workflowAppPath}</app-path>
<configuration>
<property>
<name>SaveDateString</name>
<value>${coord:formatTime(coord:actualTime(),"-yyyyMMdd-HHmmss")}</value>
</property>
<property>
<name>WaitForThisInputData</name>
<value>${coord:dataIn('ConfirmDataMasterTrigInput')}</value>
</property>
</configuration>
</workflow>
</action>
</coordinator-app>
With a properties file that looks like this:
nameNode=hdfs://hc1m1.nec.co.nz:8020
jobTracker=hc1r1m2.nec.co.nz:8050
hdfsUser=oozie
wfProject=ConfirmDataMaster
oozie.libpath=${nameNode}/user/oozie/share/lib
oozie.use.system.libpath=true
oozie.wf.rerun.failnodes=true
moveFile=ConfirmDataMaster_edit.csv
sourceDir=${nameNode}/mule/sheets/input/ConfirmDataMaster/
targetDir=/mule/sheets/store/
sourceFile=${sourceDir}${moveFile}
targetFile=${targetDir}${moveFile}
frequencyMins=10
startTime=2016-07-31T12:00Z
endTime=2099-01-01T12:00Z
timeZoneDef=GMT+12:00
TimeOutMins=10
Concurrency=1
Execution=FIFO
triggerDir=trigger/
triggerFileDir=${sourceDir}${triggerDir}
doneFlag=trigger.dat
workflowAppPath=${nameNode}/user/${hdfsUser}/wf/${wfProject}
oozie.coord.application.path=${nameNode}/user/${hdfsUser}/wf/${wfProject}
I have no problem getting a workflow triggered by a coordinator on a dataset-based event. What I am seeing is that the underlying workflow is triggered continuously. Can anyone advise what changes I should make, or point out my error? My workflow does clean up and delete the trigger path. Thanks in advance.

I'll answer my own question because I've worked out the solution, and it's a bit obvious really; I was just a little confused. The firing frequency is controlled by the coordinator and dataset frequencies, together with the trigger directory and file. If you don't want a trigger file, leave done-flag empty, so that the existence of the directory itself is the trigger. If the done-flag element is omitted altogether, the default flag file is _SUCCESS.
So while the trigger is available, the workflow fires at the specified frequencies. I have changed my coordinator and dataset frequencies to 30 (minutes). As a final task, my workflow removes the trigger.
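
To see why the trigger removal matters, here is a toy simulation of the behaviour described above. It is a deliberately simplified model, not Oozie's scheduler: `simulate` is a hypothetical function, and it assumes that one action is materialized every `frequency_min` minutes and fires only if the trigger is visible at that moment (real actions also wait up to the configured timeout).

```python
def simulate(frequency_min, horizon_min, trigger_arrivals,
             workflow_removes_trigger=True):
    """Return the minutes at which the simplified coordinator fires."""
    arrivals = sorted(trigger_arrivals)
    next_arrival = 0
    trigger_present = False
    fired = []
    for minute in range(0, horizon_min, frequency_min):
        # Any trigger that arrived up to this minute is now visible.
        while next_arrival < len(arrivals) and arrivals[next_arrival] <= minute:
            trigger_present = True
            next_arrival += 1
        if trigger_present:
            fired.append(minute)
            # The workflow's final task deletes the trigger path.
            if workflow_removes_trigger:
                trigger_present = False
    return fired


# Trigger arrives at minute 5; the coordinator checks every 30 minutes.
print(simulate(30, 120, [5], workflow_removes_trigger=True))   # [30]
print(simulate(30, 120, [5], workflow_removes_trigger=False))  # [30, 60, 90]
```

In this model, a trigger that is left in place causes one firing per coordinator interval, which matches the repeated firing seen in the question.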

Holding data processing for incomplete data sets with Mule and a collection-aggregator

I need to collect and process sets of files generated by another organization. For simplicity, say that a set consists of two files, a summary file and a detail file, named like SUM20150701.dat and DTL20150701.dat, which together constitute the set for date 20150701. The issue is that sets need to be processed in order, and the transmission of files from an outside organization can be error-prone, so a file may be missing. If this occurs, that set of files should be held, as should any later sets that are found. For example, at the start of the Mule process, the source folder may contain SUM20150701.dat, SUM20150703.dat and DTL20150703.dat. That is, the data set for 20150701 is incomplete while 20150703 is complete. I need both data sets to be held until DTL20150701.dat arrives, and then processed in order.
In this simplified form of my mule process a source folder is watched for files. When found, they are moved to an archive folder and passed to the collection-aggregator using the date as the sequence and correlation values. When a set is complete, it is moved to a destination folder. A lengthy timeout is used on the collector to make sure incomplete sets are not processed:
<file:connector name="File" autoDelete="false" streaming="false" validateConnections="true" doc:name="File">
<file:expression-filename-parser />
</file:connector>
<file:connector name="File1" autoDelete="false" outputAppend="true" streaming="false" validateConnections="true" doc:name="File" />
<vm:connector name="VM" validateConnections="true" doc:name="VM">
<receiver-threading-profile maxThreadsActive="1"></receiver-threading-profile>
</vm:connector>
<flow name="fileaggreFlow2" doc:name="fileaggreFlow2">
<file:inbound-endpoint path="G:\SourceDir" moveToDirectory="g:\SourceDir\Archive" connector-ref="File1" doc:name="get-working-files"
responseTimeout="10000" pollingFrequency="5000" fileAge="600000" >
<file:filename-regex-filter pattern="DTL(.*).dat|SUM(.*).dat" caseSensitive="false"/>
</file:inbound-endpoint>
<message-properties-transformer overwrite="true" doc:name="Message Properties">
<add-message-property key="MULE_CORRELATION_ID" value="#[message.inboundProperties.originalFilename.substring(5, message.inboundProperties.originalFilename.lastIndexOf('.'))]"/>
<add-message-property key="MULE_CORRELATION_GROUP_SIZE" value="2"/>
<add-message-property key="MULE_CORRELATION_SEQUENCE" value="#[message.inboundProperties.originalFilename.substring(5, message.inboundProperties.originalFilename.lastIndexOf('.'))]"/>
</message-properties-transformer>
<vm:outbound-endpoint exchange-pattern="one-way" path="Merge" doc:name="VM" connector-ref="VM"/>
</flow>
<flow name="fileaggreFlow1" doc:name="fileaggreFlow1" processingStrategy="synchronous">
<vm:inbound-endpoint exchange-pattern="one-way" path="Merge" doc:name="VM" connector-ref="VM"/>
<processor-chain doc:name="Processor Chain">
<collection-aggregator timeout="1000000" failOnTimeout="true" doc:name="Collection Aggregator"/>
<foreach doc:name="For Each">
<file:outbound-endpoint path="G:\DestDir1" outputPattern="#[function:datestamp:yyyyMMdd.HHmmss].#[message.inboundProperties.originalFilename]" responseTimeout="10000" connector-ref="File1" doc:name="Destination"/>
</foreach>
</processor-chain>
</flow>
This correctly processes sets in order if all sets are complete. It correctly waits for incomplete sets to fill, but it does not hold the sets that follow; in the example above, set 20150703 will be processed while 20150701 is still waiting for its DTL file.
Is there a setting or another construct that will force the collection-aggregator element to wait if there is an earlier collection which is not complete?
I am using the date part of the file name for both the correlation and sequence IDs, which does ensure that sets are processed in the order I want if all sets are complete. It does not matter if a date is missing entirely (as with 20150702 in this case), only that existing files are processed in order and that sets must be complete.

In the end, I could not get the collection-aggregator to do this. To work around it, I built a Java class which contains Maps for the SUM and DTL files, with the correlation ID as the key, and a sorted list of open keys.
The Java class monitors for a completed set on the smallest key and signals back to the Mule flow when that set is available for processing.
The Mule flow must be put into synchronous mode while processing the files to prevent a race condition. When processing is complete, the flow signals the Java class so that the set of data can be dropped from the list and Maps, and it receives an indication back if the next set is ready to process.
It is not the prettiest, and I would have preferred not to use custom code for this, but it gets the job done.
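
The bookkeeping described above can be sketched as follows. This is not the original Java class, which is not shown here; it is a minimal Python model of the same idea, and `OrderedSetGate` with its method names is hypothetical. It assumes file names of the form SUMyyyymmdd.dat and DTLyyyymmdd.dat, so that sorting the keys as strings gives date order.

```python
class OrderedSetGate:
    """Holds SUM/DTL files per date key and releases only the oldest complete set."""

    def __init__(self):
        self.sum_files = {}  # date key -> SUM file name
        self.dtl_files = {}  # date key -> DTL file name

    def add(self, filename):
        """Register an arrived file under its date key."""
        kind = filename[:3].upper()
        key = filename[3:].rsplit(".", 1)[0]
        if kind == "SUM":
            self.sum_files[key] = filename
        elif kind == "DTL":
            self.dtl_files[key] = filename
        else:
            raise ValueError("unexpected file name: " + filename)

    def next_ready(self):
        """Return (key, sum_file, dtl_file) for the smallest open key, if complete."""
        open_keys = sorted(set(self.sum_files) | set(self.dtl_files))
        if not open_keys:
            return None
        key = open_keys[0]
        if key in self.sum_files and key in self.dtl_files:
            return key, self.sum_files[key], self.dtl_files[key]
        # The oldest set is incomplete, so every later set is held as well.
        return None

    def complete(self, key):
        """Drop a processed set and report whether the next one is ready."""
        self.sum_files.pop(key, None)
        self.dtl_files.pop(key, None)
        return self.next_ready()


gate = OrderedSetGate()
for name in ["SUM20150701.dat", "SUM20150703.dat", "DTL20150703.dat"]:
    gate.add(name)
print(gate.next_ready())  # None: 20150701 is still missing its DTL file
gate.add("DTL20150701.dat")
print(gate.next_ready())  # ('20150701', 'SUM20150701.dat', 'DTL20150701.dat')
```

Calls into such a class would need to be serialized, which is what the synchronous processing strategy in the flow provides.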

Oozie coordinator with sysdate as start time

I want to run an Oozie coordinator with sysdate as the start time. How do I do that?
Is it possible to use sysdate as the start date? Will it catch up?

You can make the coordinator's "start" attribute refer to a variable, startTime, and then override its value with the system date from the command line, for example:
oozie job -run -config ./coord.properties -DstartTime=`date -u "+%Y-%m-%dT%H:00Z"`
Adjust the time format if you are not using the UTC time zone in your system.
Sample coordinator job XML:
<coordinator-app name="my-coord"
frequency="${frequency}" start="${startTime}" end="${end}" timezone="UTC"
xmlns="uri:oozie:coordinator:0.4">
<action>
<workflow> ...
Coordinator properties file coord.properties:
...
startTime=2014-05-19T22:00Z
end=2015-01-19T22:08Z
frequency=60 ...