I am new to Oozie. Just wondering: how do I schedule a Sqoop job using Oozie? I know a Sqoop action can be added as part of an Oozie workflow, but how can I schedule that Sqoop action so it runs automatically, say every 2 minutes or at 8pm every day (just like a cron job)?
You need to create a coordinator.xml file with a start, end and frequency (the frequency controls how often the workflow is run). Here is an example:
<coordinator-app name="example-coord" xmlns="uri:oozie:coordinator:0.2"
                 frequency="${coord:days(7)}"
                 start="${start}"
                 end="${end}"
                 timezone="America/New_York">
    <controls>
        <timeout>5</timeout>
    </controls>
    <action>
        <workflow>
            <app-path>${wf_application_path}</app-path>
        </workflow>
    </action>
</coordinator-app>
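The coordinator only schedules a workflow application, so the Sqoop action itself goes into the workflow.xml under ${wf_application_path}. A minimal sketch, assuming a made-up JDBC connect string, table and target directory, and assuming ${jobTracker}/${nameNode} are defined in your properties file (replace all of these with your own values):
<workflow-app name="example-wf" xmlns="uri:oozie:workflow:0.4">
    <start to="sqoop-import"/>
    <action name="sqoop-import">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <!-- placeholder import command: connect string, table and target dir are made up -->
            <command>import --connect jdbc:mysql://db.example.com/mydb --table my_table --target-dir /data/imports/my_table -m 1</command>
        </sqoop>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Sqoop import failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
</workflow-app>
For the "every 2 minutes" case you would set frequency="${coord:minutes(2)}"; for a daily 8pm run, frequency="${coord:days(1)}" with a start timestamp at 20:00 in the coordinator's timezone.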
Then create a coordinator.properties file like this one:
host=namenode01
nameNode=hdfs://${host}:8020
wf_application_path=${nameNode}/oozie/deployments/example
oozie.coord.application.path=${wf_application_path}
start=2013-07-13T07:00Z
end=2013-09-30T23:59Z
Upload your coordinator.xml (and the workflow it references) to HDFS, then submit the coordinator job with something like:
oozie job -config coordinator.properties -run
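For the upload step itself, something along these lines should work (the HDFS path is the one assumed in the properties file above; coordinator.properties stays on the local filesystem, since -config reads it locally):
# deployment path assumed from the properties file above
hadoop fs -mkdir -p /oozie/deployments/example
hadoop fs -put coordinator.xml workflow.xml /oozie/deployments/example/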
Check the documentation at http://oozie.apache.org/docs/3.3.2/CoordinatorFunctionalSpec.html; it contains some examples.
I think the following blog post will be quite useful:
http://www.tanzirmusabbir.com/2013/05/chunk-data-import-incremental-import-in.html
I have a static pipeline with the following architecture:
main.py
setup.py
requirements.txt
module1/
    __init__.py
    functions.py
module2/
    __init__.py
    functions.py
dist/
    setup_tarball
The setup.py and requirements.txt files declare the non-native PyPI packages and the local functions that the Dataflow worker nodes need. The Dataflow options are written as follows:
import apache_beam as beam
from apache_beam.io import ReadFromText, WriteToText
from apache_beam.options.pipeline_options import PipelineOptions
from module2.functions import function_to_use
dataflow_options = ['--extra_package=./dist/setup_tarball', '--temp_location=<gcs_temp_location>', '--runner=DataflowRunner', '--region=us-central1', '--requirements_file=./requirements.txt']
So then the pipeline will run something like this:
options = PipelineOptions(dataflow_options)
p = beam.Pipeline(options=options)
transform = (p | ReadFromText(gcs_url) | beam.Map(function_to_use) | WriteToText(gcs_output_url))
result = p.run()  # submit the pipeline
result.wait_until_finish()
Running this locally, Dataflow takes around 6 minutes to complete, with most of the time going to worker startup. I tried automating this code with Composer and re-arranged the architecture as follows: my main (DAG) function in the dags folder, the modules in plugins, and setup_tarball and requirements.txt in the data folder. So the only parameters that really changed are:
'--extra_package=/home/airflow/gcs/data/setup_tarball'
'--requirements_file=/home/airflow/gcs/data/requirements.txt'
When I run this modified code in Composer, it works, but it takes much, much longer. Once the worker starts up, it takes anywhere from 20-30 minutes before actually running the pipeline (which itself takes only a few seconds). This is much longer than triggering Dataflow from my local code, which took only about 6 minutes to complete. I realize this question is very general, but since the code works, I don't think it's related to the Airflow task itself. Where would be a reasonable place to start troubleshooting this problem? At the Airflow level, what can be modified? How does Composer (Airflow) interact with Dataflow, and what could potentially cause this bottleneck?
It turns out that the problem was associated with Composer itself. The fix was to increase the capacity of the Composer environment, i.e., increase its vCPUs. I'm not sure why this would be the case, so if anyone has an idea of the underlying reason for this issue, your input would be much appreciated!
With an Oozie coordinator and workflow, I see the following in the Coord Job Log for a specific action:
JOB[0134742-190911204352052-oozie-oozi-C] ACTION[0134742-190911204352052-oozie-oozi-C#1] [0134742-190911204352052-oozie-oozi-C#1]::CoordActionInputCheck:: Missing deps: ${coord:latest(0)}#${coord:latest(0)}#${coord:latest(0)}#${coord:latest(0)}#${coord:latest(0)}#${coord:latest(0)}
It seems the full path names are missing. If the dataset instance is not specified with latest(0) in the coordinator, the full paths do appear, as seen here:
JOB[0134742-190911204352052-oozie-oozi-C] ACTION[0134742-190911204352052-oozie-oozi-C#1] [0134742-190911204352052-oozie-oozi-C#1]::CoordActionInputCheck:: Missing deps:hdfs://labs-xxx/data/funcxx/inputs/uploads/reports-for-targeting/20190923/14
Later the path is resolved as:
JOB[0134742-190911204352052-oozie-oozi-C] ACTION[0134742-190911204352052-oozie-oozi-C#1] [0134742-190911204352052-oozie-oozi-C#1]::ActionInputCheck:: File:hdfs://labs-xxx/data/funcxx/inputs/uploads/reports-for-targeting/20190923/14, Exists? :true
How can I see the full path name instead of the ${coord:latest(0)} strings?
You can check this via the Oozie CLI:
oozie job -info 0134742-190911204352052-oozie-oozi-C#1
I want to create an event-driven Oozie coordinator, but the directory path changes regularly. I don't want to hard-code the directory in the code.
<datasets>
    <dataset name="test_co" frequency="${coord:minutes(120)}" initial-instance="${coordStartDate}" timezone="${timezone}">
        <uri-template>${nameNode}/dynamicName</uri-template>
        <done-flag>_OK</done-flag>
    </dataset>
</datasets>
How can I run a shell script before this action is triggered, so that it builds the folder name and checks whether the _OK file is present inside that folder?
Oozie supports creating a dynamic directory structure, i.e. dated directories, using coordinator datasets (if possible, use them together with a <done-flag>). For example:
<datasets>
    <dataset name="logs" frequency="${coord:hours(1)}" initial-instance="2009-01-01T01:00Z" timezone="UTC">
        <uri-template>hdfs://bar:9000/app/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
    </dataset>
</datasets>
After running the above Oozie code today, i.e. on 22-03-2017 at 16:00, the directory structure would be: hdfs://bar:9000/app/logs/2017/03/22/16
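To have the coordinator both build the dated directory name and wait for the _OK file, you normally don't need a separate shell script: declare the dataset with a <done-flag> (as in your first snippet) and wire it into the coordinator through input-events. A rough sketch of the pieces that would sit inside the <coordinator-app> (names such as input_logs and ${wf_application_path} are placeholders):
<input-events>
    <data-in name="input_logs" dataset="logs">
        <instance>${coord:current(0)}</instance>
    </data-in>
</input-events>
<action>
    <workflow>
        <app-path>${wf_application_path}</app-path>
        <configuration>
            <property>
                <name>inputDir</name>
                <value>${coord:dataIn('input_logs')}</value>
            </property>
        </configuration>
    </workflow>
</action>
With the done-flag in place, the coordinator action only starts once the _OK file appears in the resolved directory, and the resolved path is handed to the workflow as inputDir.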
I'm a newbie in Oozie and I've read some Oozie shell action examples but this got me confused about certain things.
There are examples I've seen where there is no <file> tag.
Some examples, like the one from Cloudera here, repeat the shell script name in the file tag:
<shell xmlns="uri:oozie:shell-action:0.2">
    <exec>check-hour.sh</exec>
    <argument>${earthquakeMinThreshold}</argument>
    <file>check-hour.sh</file>
</shell>
Meanwhile, the example on Oozie's website writes the shell script reference (${EXEC} from job.properties, which points to the script.sh file) twice, separated by #:
<shell xmlns="uri:oozie:shell-action:0.1">
    ...
    <exec>${EXEC}</exec>
    <argument>A</argument>
    <argument>B</argument>
    <file>${EXEC}#${EXEC}</file>
</shell>
There are also examples I've seen where the path (HDFS or local?) is prepended before the script.sh#script.sh within the <file> tag.
<shell xmlns="uri:oozie:shell-action:0.1">
    ...
    <exec>script.sh</exec>
    <argument>A</argument>
    <argument>B</argument>
    <file>/path/script.sh#script.sh</file>
</shell>
As I understand, any shell script file can be included in the workflow HDFS path (same path where workflow.xml resides).
Can someone explain the differences in these examples and how <exec>, <file>, script.sh#script.sh, and the /path/script.sh#script.sh are used?
<file>hdfs:///apps/duh/mystuff/check-hour.sh</file> means "download that HDFS file into the Current Working Dir of the YARN container that runs the Oozie Launcher for the Shell action, using the same file name by default, so that I can reference it as ./check-hour.sh or simply check-hour.sh in the <exec> element".
<file>check-hour.sh</file> means "download that HDFS file -- from my user's home dir e.g. hdfs:///user/borat/check-hour.sh -- into etc. etc.".
<file>hdfs:///apps/duh/mystuff/check-hour.sh#youpi</file> means "download that HDFS file etc. etc., renaming it as youpi, so that I can reference it as ./youpi or simply youpi in the <exec> element".
Note that the Hue UI often inserts unnecessary # stuff with no actual name change. That's why you will see it so often.
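Putting that together, a sketch of the rename form (the HDFS path, the alias youpi, and the ${jobTracker}/${nameNode} properties are placeholders, following the examples above):
<shell xmlns="uri:oozie:shell-action:0.2">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <!-- exec refers to the local alias created by the # rename, not the original HDFS file name -->
    <exec>youpi</exec>
    <argument>${earthquakeMinThreshold}</argument>
    <file>hdfs:///apps/duh/mystuff/check-hour.sh#youpi</file>
</shell>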
I have a bunch of pig scripts that I'm running as a workflow in oozie. Some of the output files are very short and there are a couple I'd like to concatenate and include in the body of an email action. How would I go about doing this?
Use a shell action and send an email from a script.
workflow.xml :
...
<shell>
    <exec>email_hdfs_file.sh</exec>
    <file>scripts/email_hdfs_file.sh</file>
</shell>
...
Make sure scripts/email_hdfs_file.sh exists in HDFS, relative to the folder where workflow.xml resides.
email_hdfs_file.sh :
#1 download and merge multiple files into one
hadoop fs -getmerge /path/to/your/files part-all.txt
#2 put a command that emails part-all.txt file here
It's up to you how to implement #2
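For step #2, if a command-line mail client such as mailx is available on the node running the shell action (an assumption; it often is not installed on Hadoop workers), something like this would do:
# hypothetical recipient and subject; requires a mail client and working MTA on the node
mail -s "Pig output summary" someone@example.com < part-all.txt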