Oozie command to rerun workflow with skip nodes not working - oozie

There are a couple of options when re-running a workflow via the Oozie command line:
oozie.wf.rerun.failnodes
oozie.wf.rerun.skip.nodes
Option 1 works fine; however, re-running the workflow with option 2 throws error E0404.
oozie job -oozie http://<url>/oozie -Doozie.wf.rerun.skip.nodes=node1,node2 -rerun WFID
Error: E0404 : E0404: Only one of the properties are allowed [oozie.wf.rerun.skip.nodes OR oozie.wf.rerun.failnodes]
However, the command below works fine.
oozie job -oozie http://<url>/oozie -Doozie.wf.rerun.failnodes=true -rerun WFID

Every time an Oozie job is executed in rerun mode, it will try to reuse the previous run's config file. You can, however, pass additional properties to it using the -D option, and that's how we pass oozie.wf.rerun.failnodes and oozie.wf.rerun.skip.nodes.
If you have already executed your job in rerun mode once with oozie.wf.rerun.failnodes=true, then in your next run you cannot use
oozie job -oozie http://<url>/oozie -Doozie.wf.rerun.skip.nodes=node1,node2 -rerun WFID
because when Oozie tries to reuse the config file, the oozie.wf.rerun.failnodes property already exists in its properties, and that's when Oozie throws the error you are seeing.

You could start the workflow from the beginning by passing the oozie.wf.rerun.failnodes=false property; that's what I do when I have already rerun a job. This is similar to passing an empty oozie.wf.rerun.skip.nodes=, except that nothing gets skipped.
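For example (same command shape as above, with the URL and WFID being your own):
oozie job -oozie http://<url>/oozie -Doozie.wf.rerun.failnodes=false -rerun WFID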

Related

Apache Airflow problem - "a task with task_id create_tag_template_field_result is already in the DAG"

So, I have a problem even with a blank Airflow installation.
As soon as I try to run
airflow test tutorial print_date 2015-06-01
I get a raised exception which says
PendingDeprecationWarning: The requested task could not be added to the DAG because a task with task_id create_tag_template_field_result is already in the DAG. Starting in Airflow 2.0, trying to overwrite a task will raise an exception.
What is the reason for this (as I made literally no changes to the installation whatsoever)?
I also got that when, in a previous installation, I tried to run my own dag... but the "create_tag_template_field_result" was nowhere to be found in my code.
you can set the config arg load_examples = False to solve it.
The test command calls the get_dag function, which constructs a DagBag object; the DagBag constructor calls the collect_dags function.
When the config arg LOAD_EXAMPLES is True (the default), collect_dags collects all the DAGs in the example path; that's where the task create_tag_template_field_result comes from.
collect_dags then calls the add_task function for every example task, and that's where the create_tag_template_field_result task gets added again.
It was probably the quickstart that added this task the first time, without you realizing it.
As noted above, you can set the config arg load_examples = False to solve it.
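For example, assuming a default airflow.cfg location, either edit the [core] section or set the equivalent environment variable before starting Airflow:
# in airflow.cfg
[core]
load_examples = False
# or as an environment variable
export AIRFLOW__CORE__LOAD_EXAMPLES=False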
This warning is occurring in
/usr/local/lib/python3.7/dist-packages/airflow/example_dags/example_complex.py
so I remove or rename that file (for example, to a non-working name like *.py.back).
I had the same error with a fresh install.
I don't know if this helps, but I downgraded Airflow to version 1.10.10 (with Python 3.7) and the error was gone.
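If you go the downgrade route, a pip-based install would look something like this (you may also want to use Airflow's constraints file for your Python version):
pip install "apache-airflow==1.10.10"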

RHadoop Stream Job Fail with Apache Oozie

I'm really just looking to pick the community's brain for some leads in figuring out what is going on with the issue I'm having.
I'm writing a MR job with RHadoop (rmr2, v3.0.0) and things are great -- IO with HDFS, mapping, reducing. No problems. Life is great.
I'm trying to schedule the job with Apache Oozie, and am running into some issues:
Error in mr(map = map, reduce = reduce, combine = combine, vectorized.reduce, :
hadoop streaming failed with error code 1
I've read the rmr2 debugging guide, but nothing is really getting to the stderr because the job fails before anything even gets scheduled.
In my head, everything points to a difference in environments. However, Oozie is running the job as the same user that I'm able to run everything with via the CLI, and all of the R environment variables (fetched with Sys.getenv()) are the same, except that some additional classpath entries are set with Oozie.
I can post more of the OS or Hadoop versions and config details, but sleuthing some version-specific bugs seems like a bit of a red herring as everything runs fine at the command line.
Anybody have any thoughts what might be some helpful next steps in hunting this beast down?
UPDATE:
I overwrote the system function in the base package to log the user, the host name of the node, and the command being executed before the internal call to system. So before any system call is actually executed, I get something like the following in the stderr:
user#host.name
/usr/bin/hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.6.0-102.jar ...
When run with Oozie, the command printed to stderr fails with an exit status of 1. When I run the command on user#host.name, it runs successfully. So essentially the EXACT same command with the SAME user on the SAME node fails with Oozie, but runs successfully from the CLI.
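For anyone wanting to reproduce that kind of instrumentation, here is a rough sketch (not the original code) of wrapping base::system so each call logs the user, node, and command to stderr before running:
original_system <- base::system
logging_system <- function(command, ...) {
  # log who/where/what before delegating to the real system()
  message(Sys.info()[["user"]], "@", Sys.info()[["nodename"]])
  message(command)
  original_system(command, ...)
}
# replace the binding in the base namespace so calls made from inside packages are intercepted too
base_ns <- asNamespace("base")
unlockBinding("system", base_ns)
assign("system", logging_system, envir = base_ns)
lockBinding("system", base_ns)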

Hive query execution for custom udf is expecting hdfs jar path instead of local path in CDH4 with Oozie flow

We are migrating from CDH3 to CDH4, and as part of this migration we are moving all the jobs that we have on CDH3. We have noticed one critical issue: when a workflow is executed through Oozie to run a Python script which internally invokes a Hive query (hive -e {query}), in that Hive query we add a custom jar using add jar {LOCAL PATH FOR JAR} and create a temporary function for the custom UDF. Everything looks fine up to this point. But when the query starts executing with the custom UDF function, it fails with a distributed-cache FileNotFoundException that looks for the jar in the HDFS path instead of the local path.
I am not sure if I am missing some configuration here.
Exception trace:
WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated.
Please use org.apache.hadoop.log.metrics.EventCounter in all the
log4j.properties files. Execution log at:
/tmp/yarn/yarn_20131107020505_79b41443-b9f4-4d36-a0eb-4f0d79cd3ce9.log
java.io.FileNotFoundException: File does not exist:
hdfs://aa.bb.com:8020/opt/nfsmount/mypath/custom.jar
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:824)
at org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.getFileStatus(ClientDistributedCacheManager.java:288)
at org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.getFileStatus(ClientDistributedCacheManager.java:224)
at org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.determineTimestamps(ClientDistributedCacheManager.java:93)
..... .....
Any help on this is highly appreciated.
Regards,
GHK.
There are a few options. All the required jars should be in the classpath before you run the Hive query.
Option 1: add your custom jar with <file>/hdfs/path/to/your/jar</file> in the Oozie workflow.
Option 2: use the --auxpath /local/path/to/your/jar option when calling your Hive script from Python, e.g.: hive --auxpath /local/path/to/your.jar -e {query}
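For option 1, a rough sketch of what the hive action could look like in workflow.xml (the action name, script, and jar path are placeholders for your own):
<action name="run-hive-query">
    <hive xmlns="uri:oozie:hive-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <script>myquery.q</script>
        <file>${nameNode}/path/to/custom.jar</file>
    </hive>
    <ok to="end"/>
    <error to="fail"/>
</action>
With the jar shipped via <file>, Oozie places it in the action's working directory, so the add jar statement in the query can then reference just the jar's file name.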

Getting InvalidProtocolBufferException while running oozie job

I'm getting the below exception while running the sample oozie examples.
I've modified the job.properties located at the /examples/apps/map-reduce with the appropriate nameNode and jobTracker details.
I'm using the below command to run the oozie job:
"sudo oozie job -oozie http://ip-10-0-20-143.ec2.internal:11000/oozie -config examples/apps/map-reduce/job.properties -run"
Error: E0501 : E0501: Could not perform authorization operation, Failed on local exception: com.google.protobuf.InvalidProtocolBufferException: Protocol message end-group tag did not match expected tag.; Host Details : local host is: "ip-10-0-20-143.ec2.internal/10.0.20.143"; destination host is: "ip-10-0-20-144.ec2.internal":50070;
The hadoop core-site.xml also has the correct proxyuser details for oozie user.
I really don't know where it is going wrong.
I will answer in case someone googles their way to this page.
In my case the cause was in using http address for Name Node.
You should check your job configuration, and if it contains something like:
nameNode=yourhostname:50070
You should change it to something like this:
nameNode=hdfs://yourhostname:8020
Check your ports first of course!
Please note that the jobTracker parameter uses a different notation. In my case it's:
jobTracker=yourhostname:8021
and it works fine.
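Putting it together, the relevant part of job.properties might look like this (hostnames, ports, and the application path are placeholders; check them against your cluster):
nameNode=hdfs://yourhostname:8020
jobTracker=yourhostname:8021
queueName=default
oozie.wf.application.path=${nameNode}/user/${user.name}/examples/apps/map-reduce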
Hope it helps someone.

Scheduling R Script

I have written an R script that pulls some data from a database, performs several operations on it, and posts the output to a new database.
I would like this script to run every day at a specific time, but I cannot find any way to do this effectively.
Can anyone recommend a resource I could look at to solve this issue? I am running this script on a Windows machine.
Actually under Windows you do not even have to create a batch file first to use the Scheduler.
Open the scheduler: START -> All Programs -> Accessories -> System Tools -> Scheduler
Create a new Task
under tab Action, create a new action
choose Start Program
browse to Rscript.exe which should be placed e.g. here:
"C:\Program Files\R\R-3.0.2\bin\x64\Rscript.exe"
input the name of your file in the parameters field
input the path where the script is to be found in the Start in field
go to the Triggers tab
create new trigger
choose whether the task should run each day, each month, etc., be repeated several times, or whatever you like
Supposing your R script is mytest.r, located in D:\mydocuments\, you can create a batch file including the following command:
C:\R\R-2.10.1\bin\Rcmd.exe BATCH D:\mydocuments\mytest.r
Then add it as a new task to the Windows Task Scheduler, setting the triggering conditions there.
You could also omit the batch file. Set C:\R\R-2.10.1\bin\Rcmd.exe in the program/script textbox in task scheduler, and give as Arguments the rest of the initial command: BATCH D:\mydocuments\mytest.r
Scheduling R Tasks via Windows Task Scheduler (Posted on February 11, 2015)
taskscheduleR: R package to schedule R scripts with the Windows task manager (Posted on March 17, 2016)
EDIT
I recently adopted the use of batch files again, because I wanted the cmd window to be minimized (I couldn't find another way).
Specifically, I fill the windows task scheduler Actions tab as follows:
Program/script:
cmd.exe
Add arguments (optional):
/c start /min D:\mydocuments\mytest.bat ^& exit
Contents of mytest.bat:
C:\R\R-3.5.2\bin\x64\Rscript.exe D:\mydocuments\mytest.r params
Now there is a built-in option in RStudio to do this. To run the scheduler, first install the packages below:
install.packages('data.table')
install.packages('knitr')
install.packages('miniUI')
install.packages('shiny')
install.packages("taskscheduleR", repos = "http://www.datatailor.be/rcube", type =
"source")
After installing, go to
TOOLS -> ADDINS -> BROWSE ADDINS -> taskscheduleR -> select it and execute it.
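If you prefer code over the addin, the same package can also be called directly; a rough sketch, assuming your script is D:/mydocuments/mytest.r and should run daily at 09:00:
library(taskscheduleR)
# create a Windows scheduled task that runs the script every day at 09:00
taskscheduler_create(taskname = "mytest_daily",
                     rscript = "D:/mydocuments/mytest.r",
                     schedule = "DAILY",
                     starttime = "09:00")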
Setting up the task scheduler
Step 1) Open the task scheduler (Start > search Task Scheduler)
Step 2) Click "Action" > "Create Task"
Step 3) Select "Run only when the user is logged on", uncheck "Run with highest privileges", name your task,
configure for "Windows Vista/Windows Server 2008"
Step 4) Under the "Triggers" tab, set when you would like the script to run
Step 5) Under the "Actions" tab, put the full location of the Rscript.exe file, i.e.
"C:\Program Files\R\R-3.6.2\bin\Rscript.exe" (include the quotes)
Put the name of your script in the arguments field, wrapping it with -e and source() like this:
-e "source('C:/location_of_my_script/test.R')"
Troubleshooting an Rscript scheduled in the Task Scheduler
When you run a script using the Task Scheduler, it is difficult to troubleshoot any issues because you don't get any error messages.
This can be resolved by using the sink() function in R which will allow you to output all error messages to a file that you specify. Here is how you can do this:
# Set up error log ------------------------------------------------------------
error_log <- file("C:/location_of_my_script/error_log.Rout", open="wt")
sink(error_log, type="message")
try({
# insert your code here
})
# When the script is done, stop diverting messages and close the log file
sink(type="message")
close(error_log)
The other thing that you will have to change to make your Rscript work is to specify the full file path for any file paths in your script.
This will not work in task scheduler:
source("./functions/import_function.R")
You will need to specify the full file path of any scripts you are sourcing within your Rscript:
source("C:/location_of_my_script/functions/import_function.R")
Additionally, I would remove any special characters from any file paths that you are referencing in your R script. For example:
df <- fread("C:/location_of_my_data/file#2342.csv")
may not run. Instead, try:
df <- fread("C:/location_of_my_data/file_2342.csv")
Changing Windows passwords
Beware: changing Windows passwords will pause your Task Scheduler script(s). You will need to log back into the Task Scheduler and enter your password to get them started again.
I set up my tasks via the SCHTASKS program. For running scripts on startup, you would write something along the lines of
SCHTASKS /Create /SC ONSTART /TN MyProgram /TR "R CMD BATCH --vanilla d:\path\to\script.R"
See Microsoft's documentation for more details on SCHTASKS.
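Since the question asks for a daily run at a specific time, a variant along the same lines would be (task name, time, and script path are placeholders):
SCHTASKS /Create /SC DAILY /ST 09:00 /TN MyDailyRScript /TR "R CMD BATCH --vanilla d:\path\to\script.R"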
You can use Windows Task Scheduler.
If, after following any combination of these steps, you receive the "Argument Batch Ignored" error when R.exe runs, try this; it worked for me.
In Windows Task Scheduler:
Replace BATCH "C:\Users\desktop\yourscript.R" in the arguments field
with
CMD BATCH --vanilla --slave "C:\Users\desktop\yourscript.R"
