I have Oozie 4.2.0 (HDP version) and I want to use 'max-retries' for my Spark action as well as my shell action.
When I submit the workflow, after the ERROR state the action goes into USER-RETRY state and then retries again.
When I look at oozie job -info for that action, it reports the number of retries as '0'.
I looked for the '-retries' option, but it is only available in the 5.x.x versions.
<workflow-app xmlns="uri:oozie:workflow:0.3" name="wf-name">
    <action name="a" retry-max="2" retry-interval="1">
        ...
    </action>
</workflow-app>
Is there any way I can see the number of retry attempts?
Yes, you can find it in the Oozie job logs. Open the Oozie web UI, click on the Oozie job, then select the action; it gives you a link to the Resource Manager page. If you go through the logs there, the retries are spelled out: when an attempt doesn't succeed, the log says something like 'sleep for an interval and retry', along with the attempt number (retry 2, etc.). Hope this helps.
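If you prefer the command line, the same information can be pulled with the Oozie CLI; a rough sketch, where the Oozie host, workflow ID, and action name are placeholders:
# Show the workflow's actions, including their status and retry counter
oozie job -oozie http://<oozie-host>:11000/oozie -info <workflow-id>
# Drill into a single action by appending @<action-name> to the workflow ID
oozie job -oozie http://<oozie-host>:11000/oozie -info <workflow-id>@<action-name>
# Fetch the Oozie job log, which records each retry attempt
oozie job -oozie http://<oozie-host>:11000/oozie -log <workflow-id>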
My Robot Framework test contains a part where I run an .exe file. I am using the Run Process keyword from the Process library.
${result} Run Process path_to_the_file/file.exe cwd=path_to_the_file/
When I run this locally, the file is executed and the response looks like this:
14:35:34.553 INFO Waiting for process to complete.
14:35:34.634 INFO Process completed.
14:35:34.634 TRACE Return: <robot.libraries.Process.ExecutionResult object at 0x055963D0>
14:35:34.634 INFO ${result} = <result object with rc 0>
When I run this test on TeamCity, the file is not executed (properly):
15:27:27.587 INFO Waiting for process to complete.
15:27:27.786 INFO Process completed.
15:27:27.786 TRACE Return: <robot.libraries.Process.ExecutionResult object at 0x012C1310>
15:27:27.786 INFO ${result} = <result object with rc 3221225781>
Edit: I tried googling that return code from the TeamCity run and I found something like this:
3221225781 = [$id=DLL_NOT_FOUND, $desc={Unable To Locate Component}
This application has failed to start because %hs was not found.
Reinstalling the application may fix this problem.]
Does anyone have experience with this sort of thing, and what can I do about it?
EDIT2: after deeper analysis I found out that there are missing DLLs on the agent. Once I add the missing DLLs, I'll know whether that was the source of this issue.
So the solution for this issue was to add the missing DLLs. Once I added the newest version of the Microsoft Visual C++ package, Run Process executed the .exe file properly.
There are a couple of options when re-running a workflow via the Oozie command line.
oozie.wf.rerun.failnodes
oozie.wf.rerun.skip.nodes
Option 1 works fine; however, re-running the workflow with option 2 throws error E0404.
oozie job -oozie http://<url>/oozie -Doozie.wf.rerun.skip.nodes=node1,node2 -rerun WFID
Error: E0404 : E0404: Only one of the properties are allowed [oozie.wf.rerun.skip.nodes OR oozie.wf.rerun.failnodes]
However, the following works fine.
oozie job -oozie http://<url>/oozie -Doozie.wf.rerun.failnodes=true -rerun WFID
Every time an Oozie job is executed in rerun mode, it will try to reuse the previous run's config file. You can, however, pass additional properties to it using the -D option, and that's how we pass oozie.wf.rerun.failnodes and oozie.wf.rerun.skip.nodes.
If you have already executed your job in rerun mode once with oozie.wf.rerun.failnodes=true, then in your next rerun you cannot use
oozie job -oozie http://<url>/oozie -Doozie.wf.rerun.skip.nodes=node1,node2 -rerun WFID
because when it tries to reuse the config file, the oozie.wf.rerun.failnodes property already exists in its properties, and that is when Oozie throws the error you have faced.
You could start the workflow from the beginning by giving the oozie.wf.rerun.failnodes=false property; that's what I do when I have already rerun a job. This is similar to oozie.wf.rerun.skip.nodes= with an empty skip list, i.e. we don't skip anything.
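For example, to restart the workflow from the beginning rather than only re-running the failed nodes, something like this should work (same placeholders as in the commands above):
oozie job -oozie http://<url>/oozie -Doozie.wf.rerun.failnodes=false -rerun WFID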
I'm really just looking to pick the community's brain for some leads in figuring out what is going on with the issue I'm having.
I'm writing a MR job with RHadoop (rmr2, v3.0.0) and things are great -- IO with HDFS, mapping, reducing. No problems. Life is great.
I'm trying to schedule the job with Apache Oozie, and am running into some issues:
Error in mr(map = map, reduce = reduce, combine = combine, vectorized.reduce, :
hadoop streaming failed with error code 1
I've read the rmr2 debugging guide, but nothing is really getting to stderr because the job fails before anything even gets scheduled.
In my head, everything points to a difference in environments. However, Oozie is running the job as the same user that I'm able to run everything with via the CLI, and all of the R environment variables (fetched with Sys.getenv()) are the same, except that some additional classpath entries are set with Oozie.
I can post more of the OS or Hadoop versions and config details, but sleuthing some version-specific bugs seems like a bit of a red herring as everything runs fine at the command line.
Anybody have any thoughts what might be some helpful next steps in hunting this beast down?
UPDATE:
I overrode the system function in the base package to log the user, the host name of the node, and the command being executed before the internal call to system. So before any system call is actually executed, I get something like the following on stderr:
user#host.name
/usr/bin/hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.6.0-102.jar ...
When run with Oozie, the command printed to stderr fails with an exit status of 1. When I run the command manually as that user on that host, it runs successfully. So essentially the EXACT same command with the SAME user on the SAME node fails with Oozie, but runs successfully from the CLI.
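One blunt way to narrow this down further would be to capture what each context actually sees and diff it; a rough sketch, assuming the same capture can be run both from the CLI and from inside the Oozie-launched job (file names are placeholders):
# Capture the environment and the Hadoop classpath as this context sees them
env | sort > /tmp/env_cli.txt
hadoop classpath > /tmp/classpath_cli.txt
# Repeat inside the Oozie-launched job, writing env_oozie.txt / classpath_oozie.txt,
# then compare the two captures
diff /tmp/env_cli.txt /tmp/env_oozie.txt
diff /tmp/classpath_cli.txt /tmp/classpath_oozie.txt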
I am trying to run a sh script through Oozie, but I am facing a problem:
Cannot run program "script.sh" (in directory
"/mapred/local/taskTracker/dell/jobcache/job_201312061003_0001/attempt_201312061003_0001_m_000000_0/work"):
java.io.IOException: error=2, No such file or directory.
Please help me with the necessary steps.
This error is really ambiguous. Here are some things that helped me solve it.
- If you are running Oozie workflows on a kerberized cluster, make sure to authenticate by passing your Kerberos keytab as an argument (see the sketch after these points):
...
<shell>
<exec>scriptPath.sh</exec>
<file>scriptPath.sh</file>
<file>yourKeytabFilePath</file>
</shell>
...
- In your shell file (scriptPath.sh), make sure to remove the first-line shell reference (the shebang):
#!/usr/bin/bash
Indeed, if this shell interpreter isn't deployed at that path on all data nodes, it can lead to this error code.
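On the Kerberos point above: once the keytab is shipped via the <file> element, the script itself can authenticate before calling any Hadoop commands. A minimal sketch of the first lines of scriptPath.sh, where the keytab file name and the principal are placeholders:
# Authenticate with the keytab that Oozie copied into the action's working directory
kinit -kt yourKeytabFilePath your_user@YOUR.REALM
# Commands below this point run with a valid Kerberos ticket
hdfs dfs -ls /tmp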
I had this same issue because of something really silly. I added a shell block in the workflow and selected the corresponding sendMail.sh, but I forgot to add the sendMail.sh file under FILE +.
An Oozie shell action is executed on a random Hadoop node, i.e. not locally on the machine where the Oozie server is running. As Oleksii says, you have to make sure that your script is on the node that executes the job.
See the following complete examples of executing a shell action and an ssh action:
https://github.com/airawat/OozieSamples/tree/master/oozieProject/workflowShellAction
https://github.com/airawat/OozieSamples/tree/master/oozieProject/workflowSshAction
workflow.xml :
...
<shell>
<exec>script.sh</exec>
<file>scripts/script.sh</file>
</shell>
...
Make sure you have scripts/script.sh in the same folder in HDFS, relative to the workflow application directory.
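A quick way to stage and verify the script, assuming the workflow application lives under a placeholder path such as /user/yourname/your-app in HDFS:
# Create the scripts folder next to workflow.xml and upload the script
hdfs dfs -mkdir -p /user/yourname/your-app/scripts
hdfs dfs -put -f script.sh /user/yourname/your-app/scripts/
# Confirm the file is where the <file> element expects it
hdfs dfs -ls /user/yourname/your-app/scripts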
In addition to what others said, this can also be caused by a shell script having wrong line endings (e.g. CRLF for Windows). At least this happened to me :)
Try giving the full HDFS path, like:
<exec>/user/nathalok/run.sh</exec>
<file>/user/nathalok/run.sh#run.sh</file>
and ensure that in job.properties the paths for the library and the workflow.xml are mentioned correctly:
oozie.libpath=hdfs://server/user/oozie/share/lib/lib_20150312161328/oozie
oozie.wf.application.path=hdfs://bcarddev/user/budaledi/Teradata_Flow
If your shell file exists in the correct directory of your project, then it's the shell file's format that causes this error. You need to convert the format from DOS to Unix using dos2unix: dos2unix xxxx.sh
Also make sure that the shell scripts use UNIX line endings. If the shell scripts were written in a Windows environment, Windows-specific end-of-line (EOL) characters are appended and the scripts are not recognized by Oozie, so you will get "No such file or directory" in Oozie shell actions.
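A small check-and-fix sequence before re-uploading the script (paths are placeholders):
# 'file' reports "with CRLF line terminators" when a script has Windows line endings
file script.sh
# Convert the line endings in place
dos2unix script.sh
# Push the fixed script back to the workflow's HDFS directory
hdfs dfs -put -f script.sh /user/yourname/your-app/script.sh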
The workflow.xml would look something like this:
<workflow-app name="HiveQuery_execution" xmlns="uri:oozie:workflow:0.5">
    <start to="shell-3c43"/>
    <kill name="Kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <action name="shell-3c43">
        <shell xmlns="uri:oozie:shell-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>/user/path/hivequery.sh</exec>
            <file>/user/path/hivequery.sh#hivequery.sh</file>
            <capture-output/>
        </shell>
        <ok to="End"/>
        <error to="Kill"/>
    </action>
    <end name="End"/>
</workflow-app>
Job.properties
jobTracker=xxxx.xxx.xxx.com:port
nameNode=hdfs://xxxx.xxx.xxx.com:port
It is better to configure this through the UI, as suggested above.
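If you do submit from the command line instead of the UI, the usual form looks like this, assuming job.properties also sets oozie.wf.application.path to the workflow directory as in the earlier example (host and port are placeholders):
oozie job -oozie http://<oozie-host>:11000/oozie -config job.properties -run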
I'm getting the below exception while running the sample Oozie examples.
I've modified the job.properties located at /examples/apps/map-reduce with the appropriate nameNode and jobTracker details.
I'm using the below command to run the oozie job:
"sudo oozie job -oozie http://ip-10-0-20-143.ec2.internal:11000/oozie -config examples/apps/map-reduce/job.properties -run"
Error: E0501 : E0501: Could not perform authorization operation, Failed on local exception: com.google.protobuf.InvalidProtocolBufferException: Protocol message end-group tag did not match expected tag.; Host Details : local host is: "ip-10-0-20-143.ec2.internal/10.0.20.143"; destination host is: "ip-10-0-20-144.ec2.internal":50070;
The hadoop core-site.xml also has the correct proxyuser details for oozie user.
I really don't know where it is going wrong. :(
I will answer in case someone googles their way to this page.
In my case the cause was using the HTTP address for the NameNode.
You should check your job configuration, and if it contains something like:
nameNode=yourhostname:50070
You should change it to something like this:
nameNode=hdfs://yourhostname:8020
Check your ports first of course!
Please note that the jobTracker parameter has a different notation. In my case it's:
jobTracker=yourhostname:8021
and it works fine.
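If you are unsure which RPC address your cluster uses, you can ask the HDFS client directly on a cluster node (assuming the client configuration is present there):
# Prints the filesystem URI, e.g. hdfs://yourhostname:8020 -- use this value for nameNode
hdfs getconf -confKey fs.defaultFS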
Hope it helps someone.