Use case:
I have a coordinator which passes a directory with multiple files to a workflow.
The workflow has the following nodes:
java node 1 : Reads the file, does some JSON parsing, and extracts input values for the nodes below. Done using <capture-output> (see the sketch after this list).
pig node 1 : Does some processing. Requires the input values from the parsed JSON above.
pig node 2 : Same as above
pig node 3 : ................
..................
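Roughly, the wiring looks like this (the main class, pig script, and key names are made up here); the Java main writes its values as a java.util.Properties file to the path given by the oozie.action.output.properties system property:

<!-- java node 1: parses the JSON and publishes values via capture-output -->
<action name="java-node-1">
    <java>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <main-class>com.example.ParseConfig</main-class>
        <arg>${wfInput}</arg>
        <capture-output/>
    </java>
    <ok to="pig-node-1"/>
    <error to="fail"/>
</action>

<!-- pig node 1: reads one of the captured values as a script parameter -->
<action name="pig-node-1">
    <pig>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <script>step1.pig</script>
        <param>someKey=${wf:actionData('java-node-1')['someKey']}</param>
    </pig>
    <ok to="pig-node-2"/>
    <error to="fail"/>
</action>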
Problem:
The coordinator passes a directory name to the workflow.
I want to do the following:
for every file in directory {
java node 1 : get config from file X
pig node 1 : ...............
..............
}
Please suggest a way to do this.
Below is the coordinator:
LAST_ONLY
<datasets>
<dataset name="input" frequency="${datasetFrequency}" initial-instance="${datasetInitialInstance}" timezone="UTC">
<uri-template>${nameNode}/user/${coord:user()}/alertcampaign/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}</uri-template>
<done-flag></done-flag>
</dataset>
</datasets>
<input-events>
<data-in name="inputLogs1" dataset="input">
<instance>${coord:current(0)}</instance>
</data-in>
</input-events>
<action>
<workflow>
<app-path>${nameNode}/user/${coord:user()}/test.xml</app-path>
<configuration>
<property>
<name>wfInput</name>
<value>${coord:dataIn('inputLogs1')}</value>
</property>
</configuration>
</workflow>
What about creating a loop with a sub-workflow?
https://blog.cloudera.com/blog/2013/09/how-to-write-an-el-function-in-apache-oozie/
https://github.com/rkanter/oozie-subwf-repeat-example
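Roughly, the pattern in that repeat example is a workflow that calls itself as a sub-workflow until a counter runs out. Below is a minimal sketch, not the repo's exact code: fileIndex and fileCount are hypothetical properties, and your java node (or a one-off setup step) would have to supply fileCount and map fileIndex to a concrete file in the input directory.

<workflow-app name="per-file-loop" xmlns="uri:oozie:workflow:0.4">
    <!-- fileIndex counts iterations (defaults to 0); fileCount must be supplied by the caller -->
    <parameters>
        <property>
            <name>fileIndex</name>
            <value>0</value>
        </property>
        <property>
            <name>fileCount</name>
        </property>
    </parameters>

    <!-- in the real workflow, start would go to java-node-1 and the last pig node
         would transition to the "more-files" decision instead of straight to end -->
    <start to="more-files"/>

    <decision name="more-files">
        <switch>
            <case to="next-file">${fileIndex lt fileCount}</case>
            <default to="end"/>
        </switch>
    </decision>

    <!-- call this same workflow again with the counter advanced -->
    <action name="next-file">
        <sub-workflow>
            <app-path>${wf:appPath()}</app-path>
            <propagate-configuration/>
            <configuration>
                <property>
                    <name>fileIndex</name>
                    <value>${fileIndex + 1}</value>
                </property>
            </configuration>
        </sub-workflow>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Iteration ${fileIndex} failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
</workflow-app>

Each pending iteration keeps its parent workflow in RUNNING state on the Oozie server, so for a directory with very many files it may be simpler to loop inside a single java or shell action instead.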
Related
I am new to Oozie. I was testing the following coordinator.xml; when I submit the job it runs in a loop, but I want it to run every day at 1:00 AM. Can someone let me know what mistake I am making?
<coordinator-app name="cron-coord-jon" frequency="0 1 * * *" start="2009-01-01T05:00Z" end="2036-01-01T06:00Z" timezone="UTC"
xmlns="uri:oozie:coordinator:0.2">
<action>
<workflow>
<app-path>${workflowAppUri}</app-path>
<configuration>
<property>
<name>jobTracker</name>
<value>${jobTracker}</value>
</property>
<property>
<name>nameNode</name>
<value>${nameNode}</value>
</property>
<property>
<name>queueName</name>
<value>${queueName}</value>
</property>
</configuration>
</workflow>
</action>
</coordinator-app>
Your coordinator is likely not running in a loop, but rather submitting every 'missed' job since the start date you specified. Set the start date to the current day (e.g. 2019-06-03T00:00Z) and relaunch your coordinator.
If the start time is before 01:00, you should see a single job be launched for the day.
You may want to pass this in as a variable. Here is the call to date that will provide the current date & time in the correct format.
date -u "+%Y-%m-%dT%H:%MZ"
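For example, the coordinator's start attribute can reference a property instead of a literal value (startTime is just an illustrative name; the rest of the coordinator stays as above):

<coordinator-app name="cron-coord-jon" frequency="0 1 * * *"
                 start="${startTime}" end="2036-01-01T06:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.2">

At submit time, something like oozie job -config job.properties -DstartTime=$(date -u "+%Y-%m-%dT%H:%MZ") -run fills it in.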
According to https://oozie.apache.org/docs/3.3.1/WorkflowFunctionalSpec.html#a4.1_Workflow_Job_Properties_or_Parameters, we know:
When submitting a workflow job for the workflow definition above, 3 workflow job properties must be specified:
jobTracker:
inputDir:
outputDir:
I have a PySpark script that specifies its input and output locations in the script itself. I don't need or want an inputDir and outputDir in my workflow XML. When running my PySpark script via Oozie, I get these messages:
WARN ParameterVerifier:523 - SERVER[<my_server>] USER[-] GROUP[-] TOKEN[-] APP[-] JOB[-] ACTION[-] The application does not define formal parameters in its XML definition
WARN JobResourceUploader:64 - SERVER[<my_server>] Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
2018-05-24 11:52:29,844 WARN JobResourceUploader:171 - SERVER[<my_server>] No job jar file set. User classes may not be found. See Job or Job#setJar(String).
Based on https://github.com/apache/oozie/blob/master/core/src/main/java/org/apache/oozie/util/ParameterVerifier.java, my first warning is caused by the fact that I don't have an "inputDir":
else {
// Log a warning when the <parameters> section is missing
XLog.getLog(ParameterVerifier.class).warn("The application does not define formal parameters in its XML "
+ "definition");
}
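If I read ParameterVerifier correctly, the "formal parameters" it looks for are a <parameters> block at the top of the workflow XML, something like the following (the property name and default are just placeholders, not from my job):

<parameters>
    <property>
        <name>queueName</name>
        <value>default</value>
    </property>
</parameters>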
Can I get around this at all?
Update -- my XML structure
<action name="spark-node">
<spark xmlns="uri:oozie:spark-action:0.1" >
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
<property>
<name>mapred.input.dir</name>
<value>${inputDir}</value>
</property>
<property>
<name>mapred.output.dir</name>
<value>${outputDir}</value>
</property>
</configuration>
<master>yarn-master</master>
<!-- <mode>client</mode> -->
<name>oozie_test</name>
<jar>oozie_test.py</jar>
<spark-opts>--num-executors 1 --executor-memory 10G --executor-cores 1 --driver-memory 1G</spark-opts>
</spark>
<ok to="end" />
<error to="fail" />
</action>
The Oozie site says: "Commonly, workflow jobs are run based on regular time intervals and/or data availability. And, in some cases, they can be triggered by an external event."
Does anyone have an idea how to trigger an action with an external event?
The external event can be the availability of a file in some directory.
The Oozie coordinator has this facility. It is useful when you need to trigger a second workflow on completion of the first, dependent workflow.
The second coordinator keeps polling for the availability of success_trigger.txt in triggerdirpath, where triggerdirpath is the HDFS path in which success_trigger.txt is created by the first workflow.
<coordinator-app name="Xxx" frequency="${coord:days(1)}" start="${startTime2}" end="${endTime}" timezone="GMT-0700" xmlns="uri:oozie:coordinator:0.2">
<datasets>
<dataset name="check_for_SUCCESS" frequency="${coord:days(1)}" initial-instance="${startTime2}" timezone="GMT-0700">
<uri-template>${triggerdirpath}</uri-template>
<done-flag>success_trigger.txt</done-flag>
</dataset>
</datasets>
<input-events>
<data-in name="check_for_SUCCESS_data" dataset="check_for_SUCCESS">
<instance>${startTime2}</instance>
</data-in>
</input-events>
<action>
<workflow>
<app-path>${WF_path}</app-path>
<configuration>
<property><name>WaitForThisInputData</name><value>${coord:dataIn('check_for_SUCCESS_data')}</value></property>
<property><name>WhenToStart</name><value>${startTime2}</value></property>
<property><name>rundate</name><value>${coord:dataOut('currentFullDate')}</value></property>
<property><name>previousdate</name><value>${coord:dataOut('previousFullDate')}</value></property>
<property><name>currentyear</name><value>${coord:dataOut('currentYear')}</value></property>
<property><name>currentmonth</name><value>${coord:dataOut('currentMonth')}</value></property>
<property><name>currentday</name><value>${coord:dataOut('currentDay')}</value></property>
<property><name>previousbatchtime</name><value>${coord:formatTime(coord:dateOffset(coord:nominalTime(),-1,'DAY'),"yyyy-MM-dd")}</value></property>
<property><name>currentbatchtime</name><value>${coord:formatTime(coord:dateOffset(coord:nominalTime(),0,'DAY'),"yyyy-MM-dd")}</value></property>
<property><name>nextbatchtime</name><value>${coord:formatTime(coord:dateOffset(coord:nominalTime(),1,'DAY'),"yyyy-MM-dd")}</value></property>
</configuration>
</workflow>
</action>
</coordinator-app>
I have a buildModel.jar and a folder "conf" which contains a configuration file named config.properties.
The command line to run it looks like this:
hadoop jar /home/user1/buildModel.jar -t fp-purchased-products -i hdfs://Hadoop238:8020/user/user2/recommend_data/bought_together
After doing some analysis, it uses the DB information in the "config.properties" file to store data in a MongoDB.
Now I need to run it with a Hue Oozie workflow, so I used Hue to upload the jar file and the "conf" folder to HDFS, then created a workflow. I also added the "config.properties" file to the workflow.
This is the workflow.xml
<workflow-app name="test_service" xmlns="uri:oozie:workflow:0.4">
<start to="run_java_file"/>
<action name="run_java_file">
<java>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<main-class>xxx.xxx.recommender.buildModel.Application</main-class>
<arg>-t=fp-purchased-products</arg>
<arg>-i=hdfs://Hadoop238:8020/user/user2/recommend_data/bought_together</arg>
<file>/user/user2/service/build_model/conf/config.properties#config.properties</file>
</java>
<ok to="end"/>
<error to="kill"/>
</action>
<kill name="kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
And this is the workflow-metadata.json
{
  "attributes": {
    "deployment_dir": "/user/hue/oozie/workspaces/_user2_-oozie-31-1416890719.12",
    "description": ""
  },
  "nodes": {
    "run_java_file": {
      "attributes": {
        "jar_path": "/user/user2/service/build_model/buildModel.jar"
      }
    }
  },
  "version": "0.0.1"
}
After the analysis, it got an error when saving data to MongoDB. It seems that the Java program can't see config.properties.
Can anyone guide me on how to use Hue Oozie to run a Java program that has a config file?
Sorry for the late answer.
As Romain explained above, Hue will copy config.properties to the same directory as buildModel.jar. So I changed the code to let buildModel.jar read the config file from the same directory. It worked!
Can I use wildcards (e.g. *) or file patterns (e.g. {}) in Oozie move actions?
I am trying to move the results of my job into an archive directory.
State of the directory structure:
output
- 201304
- 201305
archive
- 201303
My action:
<fs name="archive-files">
<move source="hdfs://namenode/output/{201304,201305}"
target="hdfs://namenode/archive" />
<ok to="next"/>
<error to="fail"/>
</fs>
resulting error:
FS006: move, source path [hdfs://namenode/output/{201304,201305}] does not exist
Is there an easy way to move more than one file using a glob or bash-like syntax? I'm looking to do something similar to this hadoop command:
hadoop fs -mv hdfs://namenode/output/{201304,201305} hdfs://namenode/archive
Am I missing something? The hadoop fs command accepts globs. Does Oozie?
The Oozie HDFS (fs) action has quite limited functionality, which is fully described in the functional specification. To do something more complicated, you can use a Shell action. It lets you run arbitrary shell commands as part of a workflow, e.g. hadoop fs in your case.
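For example, a Shell action along these lines should work (untested sketch; node and path names are taken from the question, and it assumes the hadoop client is available on the worker nodes). Because the arguments are passed straight to hadoop fs, the glob expansion is done by the Hadoop client rather than by bash:

<action name="archive-files">
    <shell xmlns="uri:oozie:shell-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <exec>hadoop</exec>
        <argument>fs</argument>
        <argument>-mv</argument>
        <argument>hdfs://namenode/output/{201304,201305}</argument>
        <argument>hdfs://namenode/archive</argument>
    </shell>
    <ok to="next"/>
    <error to="fail"/>
</action>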
No - from my experience it doesn't look like it works.
FS006: move, source path [hdfs://nodename:8020/projects/blah/201*.gz] does not exist
In workflow.xml use this:
<action name="Movefiles">
<fs>
<move source='${SourcePath}' target='${DestinationPath}'/>
</fs>
<ok to="end"/>
<error to="fail"/>
</action>
and in job.properties write:
SourcePath=output/*/
DestinationPath=archive