Why is the following Oozie coordinator running in a loop? - oozie

I am new to Oozie. I was testing the following coordinator.xml. When I submit the job it runs in a loop, but I want it to run every day at 1:00 AM. Can someone tell me what I am doing wrong?
<coordinator-app name="cron-coord-jon" frequency="0 1 * * *" start="2009-01-01T05:00Z" end="2036-01-01T06:00Z" timezone="UTC"
xmlns="uri:oozie:coordinator:0.2">
<action>
<workflow>
<app-path>${workflowAppUri}</app-path>
<configuration>
<property>
<name>jobTracker</name>
<value>${jobTracker}</value>
</property>
<property>
<name>nameNode</name>
<value>${nameNode}</value>
</property>
<property>
<name>queueName</name>
<value>${queueName}</value>
</property>
</configuration>
</workflow>
</action>
</coordinator-app>

Your coordinator is most likely not running in a loop. It is submitting every "missed" run since the start date you specified (2009-01-01). Set the start date to the current day (e.g. 2019-06-03T00:00Z) and relaunch your coordinator.
If the start time is before 01:00, you should see a single job launched for the day.
You may want to pass the start date in as a variable. Here is the call to date that produces the current date and time in the correct format:
date -u "+%Y-%m-%dT%H:%MZ"
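To see why the original start date looks like a loop, here is a rough sketch that counts how many daily 01:00 UTC runs fall between a start time and "now". The function name missed_runs and the fixed value of now are assumptions for illustration only. The sketch ignores the coordinator's end time and any throttling or concurrency limits Oozie applies while catching up.

```python
from datetime import datetime, timedelta, timezone

def missed_runs(start, now, hour=1):
    """Count daily runs at `hour`:00 UTC whose nominal time lies in [start, now]."""
    # First nominal time at or after the coordinator start.
    first = start.replace(hour=hour, minute=0, second=0, microsecond=0)
    if first < start:
        first += timedelta(days=1)
    if first > now:
        return 0
    return (now - first).days + 1

# Hypothetical "current" time, chosen to match the example date in the answer.
now = datetime(2019, 6, 3, 12, 0, tzinfo=timezone.utc)
old_start = datetime(2009, 1, 1, 5, 0, tzinfo=timezone.utc)  # start from the question
new_start = datetime(2019, 6, 3, 0, 0, tzinfo=timezone.utc)  # start suggested above

backlog = missed_runs(old_start, now)     # thousands of catch-up runs
today_only = missed_runs(new_start, now)  # a single run for the day
print(backlog, today_only)
```

With the 2009 start date the backlog is several thousand runs, which is what shows up as jobs being submitted one after another.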

Related

Why is the task status always running when the HDFS server in NebulaGraph Explorer cannot be connected?

Sometimes the explorer fails to connect to the HDFS because of network fluctuations. In this case, the task status is always running.
You need to set the timeout period for HDFS connections as follows:
<configuration>
<property>
<name>ipc.client.connect.timeout</name>
<value>3000</value>
</property>
<property>
<name>ipc.client.connect.max.retries.on.timeouts</name>
<value>3</value>
</property>
</configuration>

Is it possible to run an Oozie Spark Action without specifying inputDir & outputDir?

According to https://oozie.apache.org/docs/3.3.1/WorkflowFunctionalSpec.html#a4.1_Workflow_Job_Properties_or_Parameters :
When submitting a workflow job for the workflow definition above, 3 workflow job properties must be specified:
jobTracker:
inputDir:
outputDir:
I have a PySpark script that specifies its input and output locations in the script itself. I neither need nor want an inputDir and outputDir in my workflow XML. When running my PySpark script via Oozie, I get these warnings:
WARN ParameterVerifier:523 - SERVER[<my_server>] USER[-] GROUP[-] TOKEN[-] APP[-] JOB[-] ACTION[-] The application does not define formal parameters in its XML definition
WARN JobResourceUploader:64 - SERVER[<my_server>] Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
2018-05-24 11:52:29,844 WARN JobResourceUploader:171 - SERVER[<my_server>] No job jar file set. User classes may not be found. See Job or Job#setJar(String).
Based on https://github.com/apache/oozie/blob/master/core/src/main/java/org/apache/oozie/util/ParameterVerifier.java , I believe my first warning is caused by the fact that I don't have an "inputDir":
else {
// Log a warning when the <parameters> section is missing
XLog.getLog(ParameterVerifier.class).warn("The application does not define formal parameters in its XML "
+ "definition");
}
Can I get around this at all?
Update: my XML structure
<action name="spark-node">
<spark xmlns="uri:oozie:spark-action:0.1" >
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
<property>
<name>mapred.input.dir</name>
<value>${inputDir}</value>
</property>
<property>
<name>mapred.output.dir</name>
<value>${outputDir}</value>
</property>
</configuration>
<master>yarn-master</master>
<!-- <mode>client</mode> -->
<name>oozie_test</name>
<jar>oozie_test.py</jar>
<spark-opts>--num-executors 1 --executor-memory 10G --executor-cores 1 --driver-memory 1G</spark-opts>
</spark>
<ok to="end" />
<error to="fail" />
</action>

Prevent Oozie from changing the CLASSPATH of a java action

I'm running a Java application in Oozie, and Oozie is adding something to the classpath. How do I know? When I run the application without Oozie it works perfectly fine, but with Oozie I get:
java.lang.NoSuchMethodError: org.apache.hadoop.yarn.webapp.util.WebAppUtils.getProxyHostsAndPortsForAmFilter(Lorg/apache/hadoop/conf/Configuration;)Ljava/util/List;
at org.apache.hadoop.yarn.server.webproxy.amfilter.AmFilterInitializer.initFilter(AmFilterInitializer.java:40)
at org.apache.hadoop.http.HttpServer.<init>(HttpServer.java:272)
at org.apache.hadoop.yarn.webapp.WebApps$Builder$2.<init>(WebApps.java:222)
at org.apache.hadoop.yarn.webapp.WebApps$Builder.start(WebApps.java:219)
at org.apache.hadoop.mapreduce.v2.app.client.MRClientService.serviceStart(MRClientService.java:136)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceStart(MRAppMaster.java:1058)
I even configured
<property>
<name>oozie.use.system.libpath</name>
<value>false</value>
</property>
<property>
<name>oozie.launcher.mapreduce.job.user.classpath.first</name>
<value>true</value>
</property>
But it doesn't help. How can I tell Oozie to leave my classpath completely alone?

jobTracker property in job.properties of oozie

I'm using hadoop-2.7.2 and oozie-4.0.1. What should the jobTracker value be in the job.properties file of an Oozie workflow? I referred to this link:
http://hadooptutorial.info/apache-oozie-installation-on-ubuntu-14-04/
which states that in the YARN architecture the job tracker runs on port 8032, and I'm currently using that. But in Hadoop's mapred-site.xml I have the value hdfs://localhost:54311 for the job tracker property.
I'm confused. Can anyone explain this, or provide some useful links for installing Oozie and running jobs on it?
Currently I'm not able to run workflow jobs on Oozie. The job stays in the Running state for a long time and then gets suspended with a connection error. The job DAG is also not generated; the UI throws an exception.
Any help would be appreciated.
In your properties file, just pass the ResourceManager address that you have configured in yarn-site.xml, or put the ResourceManager address directly in the workflow.xml file as
<job-tracker>localhost:8032</job-tracker>
When you run the job with the properties file, you also need to specify the host on which the Oozie server is running; I assume you had no issues with that part. If it still fails, paste the error message and update the question.
EDITED:
Configuration needed in yarn-site.xml:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<description>NM Webapp address.</description>
<name>yarn.nodemanager.webapp.address</name>
<value>${yarn.nodemanager.hostname}:8042</value>
</property>
<property>
<description>hostname </description>
<name>yarn.nodemanager.hostname</name>
<value>localhost</value>
</property>
You can specify either the hostname or localhost for a pseudo-distributed (single-node) cluster.
For an HA cluster, see the following:
https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html
In a production environment, you have probably configured a high-availability YARN cluster. In this case, the jobTracker setting in job.properties should be the value of yarn.resourcemanager.cluster-id.
An excerpt of my YARN configuration:
<property>
<name>yarn.resourcemanager.ha.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.resourcemanager.cluster-id</name>
<value>datayarn</value>
</property>
<property>
<name>yarn.resourcemanager.ha.rm-ids</name>
<value>resourcemanager1,resourcemanager2</value>
</property>
<property>
<name>yarn.resourcemanager.hostname.resourcemanager1</name>
<value>11.11.11.11</value>
</property>
<property>
<name>yarn.resourcemanager.hostname.resourcemanager2</name>
<value>11.11.11.12</value>
</property>
So the jobTracker value should be: datayarn
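The rule described in the answers above (ResourceManager address for a single-node setup, cluster-id when HA is enabled) can be sketched as a small helper that reads yarn-site.xml content and proposes a jobTracker value. The function names load_properties and job_tracker_value are hypothetical, and the fallback to port 8032 simply mirrors the localhost:8032 example above; treat this as an illustration of the rule, not as something Oozie itself does.

```python
import xml.etree.ElementTree as ET

def load_properties(xml_text):
    """Parse Hadoop-style <configuration> XML into a {name: value} dict."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value") for p in root.iter("property")}

def job_tracker_value(props, default_port=8032):
    """Pick a jobTracker value following the rule described in the answers."""
    ha_enabled = (props.get("yarn.resourcemanager.ha.enabled") or "false").strip().lower()
    if ha_enabled == "true":
        # HA cluster: use the logical cluster id, as the answer above suggests.
        return props["yarn.resourcemanager.cluster-id"]
    address = props.get("yarn.resourcemanager.address")
    if address:
        return address
    # Assumption: fall back to <hostname>:8032 when no explicit address is set.
    host = props.get("yarn.resourcemanager.hostname") or "localhost"
    return f"{host}:{default_port}"

# Excerpt modelled on the HA configuration shown above.
YARN_SITE_HA = """
<configuration>
  <property><name>yarn.resourcemanager.ha.enabled</name><value>true</value></property>
  <property><name>yarn.resourcemanager.cluster-id</name><value>datayarn</value></property>
  <property><name>yarn.resourcemanager.ha.rm-ids</name><value>resourcemanager1,resourcemanager2</value></property>
</configuration>
"""

ha_value = job_tracker_value(load_properties(YARN_SITE_HA))
single_node_value = job_tracker_value({})
print(ha_value, single_node_value)
```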

oozie > coordinator > how to trigger action with external event

The Oozie site says: "Commonly, workflow jobs are run based on regular time intervals and/or data availability. And, in some cases, they can be triggered by an external event."
Does anyone have an idea how to trigger an action with an external event?
An external event can be the availability of a file in some directory.
The Oozie coordinator has this facility. It is useful when you need to trigger a second workflow on completion of a first workflow it depends on.
The second coordinator keeps polling for the availability of success_trigger.txt in triggerdirpath.
triggerdirpath is the HDFS path where success_trigger.txt is created by the first workflow.
<coordinator-app name="Xxx" frequency="${coord:days(1)}" start="${startTime2}" end="${endTime}" timezone="GMT-0700" xmlns="uri:oozie:coordinator:0.2">
<datasets>
<dataset name="check_for_SUCCESS" frequency="${coord:days(1)}" initial-instance="${startTime2}" timezone="GMT-0700">
<uri-template>${triggerdirpath}</uri-template>
<done-flag>success_trigger.txt</done-flag>
</dataset>
</datasets>
<input-events>
<data-in name="check_for_SUCCESS_data" dataset="check_for_SUCCESS">
<instance>${startTime2}</instance>
</data-in>
</input-events>
<action>
<workflow>
<app-path>${WF_path}</app-path>
<configuration>
<property><name>WaitForThisInputData</name><value>${coord:dataIn('check_for_SUCCESS_data')}</value></property>
<property><name>WhenToStart</name><value>${startTime2}</value></property>
<property><name>rundate</name><value>${coord:dataOut('currentFullDate')}</value></property>
<property><name>previousdate</name><value>${coord:dataOut('previousFullDate')}</value></property>
<property><name>currentyear</name><value>${coord:dataOut('currentYear')}</value></property>
<property><name>currentmonth</name><value>${coord:dataOut('currentMonth')}</value></property>
<property><name>currentday</name><value>${coord:dataOut('currentDay')}</value></property>
<property><name>previousbatchtime</name><value>${coord:formatTime(coord:dateOffset(coord:nominalTime(),-1,'DAY'),"yyyy-MM-dd")}</value></property>
<property><name>currentbatchtime</name><value>${coord:formatTime(coord:dateOffset(coord:nominalTime(),0,'DAY'),"yyyy-MM-dd")}</value></property>
<property><name>nextbatchtime</name><value>${coord:formatTime(coord:dateOffset(coord:nominalTime(),1,'DAY'),"yyyy-MM-dd")}</value></property>
</configuration>
</workflow>
</action>
</coordinator-app>
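As a rough illustration of what the previousbatchtime, currentbatchtime and nextbatchtime properties in the coordinator above are intended to produce, here is a sketch that applies the same day offsets (-1, 0, +1) to a nominal time and formats the result as yyyy-MM-dd. The function name batch_dates and the sample nominal time are assumptions; the sketch does not model how Oozie handles the coordinator timezone or daylight saving.

```python
from datetime import datetime, timedelta, timezone

def batch_dates(nominal_time):
    """Mimic the three dateOffset/formatTime expressions for one nominal time."""
    fmt = "%Y-%m-%d"  # Python equivalent of the "yyyy-MM-dd" pattern
    offsets = {"previousbatchtime": -1, "currentbatchtime": 0, "nextbatchtime": 1}
    return {
        name: (nominal_time + timedelta(days=days)).strftime(fmt)
        for name, days in offsets.items()
    }

# Hypothetical nominal time of one coordinator action.
dates = batch_dates(datetime(2019, 6, 3, 1, 0, tzinfo=timezone.utc))
for name, value in dates.items():
    print(name, value)
```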
