How does Oozie execute the prepare steps?

Does Oozie execute all the prepare steps (like delete) once at the beginning of the workflow?
Does it run every prepare step even if the given action is never "called"?
I found that it deleted one of my data folders even though I never called the action where the prepare step was specified. It also seems the folder was deleted right after the first action was called (which had no prepare step).

The prepare commands are executed at the start of each action. They are not executed all at once at the start of the workflow or anything like that. These commands are issued by the Oozie server.

Related

How to successfully exit a task midway within an Airflow dag?

I have a DAG that checks for files on an FTP server (Airflow runs on a separate server). If files exist, they get moved to S3 (we archive there). From there, the filename is passed to a Spark submit job. The Spark job processes the file via S3 (the Spark cluster is on a different server). I'm not sure if I need multiple DAGs, but here's the flow. What I'm looking to do is run a Spark job only if a file exists in the S3 bucket.
I tried using an S3 sensor, but it fails once it reaches its timeout, so the whole DAG is marked as failed.
check_for_ftp_files -> move_files_to_s3 -> submit_job_to_spark -> archive_file_once_done
I want everything after the FTP check to run ONLY when one or more files were moved into S3.
You can have 2 different DAGs. The first has only the S3 sensor and keeps running, let's say, every 5 minutes. If it finds the file, it triggers the second DAG. The second DAG submits the file to Spark and archives it once done. You can use TriggerDagRunOperator in the first DAG for triggering.
The answer Him gave will work.
Another option is the "soft_fail" parameter that sensors have (it is a parameter of BaseSensorOperator). If you set this parameter to True, then instead of failing on timeout the task is skipped, and all following tasks in the branch are skipped as well.
See the Airflow source code for more info.
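To make the effect of soft_fail concrete, here is a plain-Python sketch of the timeout behaviour. It deliberately does not import Airflow: `run_sensor`, `SensorTimeout`, and the returned state strings are hypothetical names invented for this illustration, and the real logic lives in BaseSensorOperator.

```python
import time


class SensorTimeout(Exception):
    """Hypothetical stand-in for the error a sensor raises on timeout."""


def run_sensor(poke, timeout, poke_interval, soft_fail=False):
    """Call poke() until it returns True or `timeout` seconds have passed.

    Returns "success" when poke() succeeds. On timeout, returns "skipped"
    if soft_fail is set, otherwise raises SensorTimeout (a failed task).
    """
    deadline = time.monotonic() + timeout
    while True:
        if poke():
            return "success"
        if time.monotonic() >= deadline:
            if soft_fail:
                # Soft failure: report a skip so downstream tasks are skipped too.
                return "skipped"
            raise SensorTimeout("condition not met before timeout")
        time.sleep(poke_interval)
```

With soft_fail left at False the timeout surfaces as an error, which is what marks the whole DAG run as failed in the question above.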

Jenkins - How to stall a job until a notification is received?

Is there any way that a Jenkins job can be paused until a notification is received, ideally with a payload as well?
I have a "test" job which runs a whole bunch of remote tests, and I'd like it to wait until the tests are done, at which point I send an HTTP notification via curl with a payload that includes a test success code.
Is this possible with any default Jenkins plugins?
If Jenkins 2.x is an option for you, I'd consider taking a look at writing a pipeline job.
See https://jenkins.io/doc/book/pipeline/
Perhaps you could create a pipeline with multiple stages, where:
The first batch of work (your test job) is launched by the first pipeline stage.
That stage is configured (via Groovy code) to wait until your tests are complete before continuing. This is of course easy if the command to run your tests blocks, but if your tests launch and then detach without providing an easy way to determine when they exit, you can probably add extra Groovy code to your stage to make it poll the machine where the tests are running, to discover whether the work is complete.
Subsequent stages can be run once the first stage exits.
As for passing a payload from one stage to another, that's possible too - for exit codes and strings, you can use Groovy variables, and for files, I believe you can have a stage archive a file as an artifact; subsequent stages can then access the artifact.
Or, as Hani mentioned in a comment, you could create two Jenkins jobs, and have your tests (launched by the first job) use the Jenkins API to launch the second job when they complete successfully.
As you suggested, curl can be used to trigger jobs via the API, or you can use a Jenkins API wrapper package for your preferred language (I've had success using the Python jenkinsapi package for this sort of work: http://pythonhosted.org/jenkinsapi/)
If you need to pass parameters from your API client code to the second Jenkins job, that's possible by adding parameters to the second job using the Parameterized Build features built into Jenkins: https://wiki.jenkins-ci.org/display/JENKINS/Parameterized+Build
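As a rough sketch of the API route, the snippet below builds the HTTP request that would start a parameterized job through Jenkins' `buildWithParameters` endpoint, using only the Python standard library. The server URL, job name, parameter name, and credentials are all hypothetical placeholders, and depending on the server's security settings extra steps (such as a CSRF crumb) may be needed.

```python
import base64
import urllib.parse
import urllib.request


def build_trigger_request(base_url, job_name, params, user, api_token):
    """Build (but do not send) a POST request that starts a parameterized job."""
    query = urllib.parse.urlencode(params)
    url = "{}/job/{}/buildWithParameters?{}".format(
        base_url.rstrip("/"), urllib.parse.quote(job_name), query
    )
    request = urllib.request.Request(url, data=b"", method="POST")
    # HTTP basic auth with a user name and API token.
    credentials = base64.b64encode(f"{user}:{api_token}".encode()).decode()
    request.add_header("Authorization", f"Basic {credentials}")
    return request


if __name__ == "__main__":
    req = build_trigger_request(
        "http://jenkins.example.com",  # hypothetical server
        "post-test-job",               # hypothetical job name
        {"TEST_RESULT": "0"},          # hypothetical build parameter
        "ci-user",                     # hypothetical user
        "secret-token",                # hypothetical API token
    )
    print(req.full_url)
    # Sending it would be: urllib.request.urlopen(req)
```

The remote tests could issue the equivalent request with curl when they finish, passing the success code as the build parameter.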

How to reschedule a coordinator job in OOZIE without restarting the job?

When I changed the start time of a coordinator job in job.properties in Oozie, the job did not pick up the changed time; instead it keeps running at the old scheduled time.
Old job.properties:
startMinute=08
startTime=${startDate}T${startHour}:${startMinute}Z
New job.properties:
startMinute=07
startTime=${startDate}T${startHour}:${startMinute}Z
The job is not running at the changed time (the 7th minute); it still runs at the 8th minute of every hour.
Can you please let me know how I can make the job pick up the updated properties (the changed timing) without restarting or killing the job?
You can't really change the timing of the coordinator through any method Oozie (v3.3.2) provides. When you submit a job, the contents of the properties file are stored in the database, whereas the actual workflow lives in HDFS.
Every time the coordinator executes, the workflow must be present at the path specified in the properties at submission time, but the properties file itself is no longer needed. In other words, the properties file does not come into the picture after the job has been submitted.
One hack is to update the time directly in the database with an SQL query, but I am not sure about the implications. The property might become inconsistent across the database.
You have to kill the job and resubmit a new one.
Note: Oozie provides a way to change the concurrency, end time, and pause time, as specified in the official docs.
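For the three values that can be changed, a minimal sketch of assembling the command line might look like this, assuming the `oozie job -change` sub-command described in the Oozie command-line documentation. The server URL and job ID are placeholders, and `build_change_command` is a hypothetical helper, not part of Oozie.

```python
# Keys the answer above names as changeable on a running coordinator.
ALLOWED_KEYS = {"endtime", "concurrency", "pausetime"}


def build_change_command(oozie_url, job_id, changes):
    """Return the argv list for an `oozie job -change` call."""
    unknown = set(changes) - ALLOWED_KEYS
    if unknown:
        # The start time, for example, is rejected here on purpose.
        raise ValueError(f"cannot change: {sorted(unknown)}")
    # Multiple key=value pairs are joined with ';' in a single -value argument.
    value = ";".join(f"{key}={val}" for key, val in changes.items())
    return ["oozie", "job", "-oozie", oozie_url,
            "-change", job_id, "-value", value]


if __name__ == "__main__":
    cmd = build_change_command(
        "http://localhost:11000/oozie",   # placeholder server URL
        "0000001-coord",                  # placeholder coordinator job ID
        {"endtime": "2013-12-01T05:00Z", "concurrency": 2},
    )
    print(" ".join(cmd))
    # To run it: subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids having to escape the `;` separator for the shell.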

Strict coordinator job ordering on Oozie

I have a coordinator on oozie that runs a series of tasks, each of which depends on the output of the last.
Each task outputs a dated folder and looks for the output of its predecessor using
${coord:latest(0)}
This all worked fine on my dev cluster when nothing else was running; every 5 minutes oozie would queue up another job, and in that 5 minutes the previous job had run so when the new job was set up it would see the directory it needed.
I run into problems on the production cluster: the jobs get submitted but sit in a queue and don't run for a while. Oozie still queues up another one every 5 minutes, and during its initialization stage each new job is assigned its 'previous' folder. That folder hasn't been created yet because the predecessor hasn't run, so the 'latest' function gives it the same input as the previous job. I then end up with 10 jobs all taking the same input...
What I need is a way of strictly preventing the next job in a coordinator sequence from even being created until its predecessor has finished running.
Is there a way this can be done?
Thanks for reading
This is the exact use case that Oozie was designed to solve. Oozie will wait for all data dependencies before launching.
Please try to understand the following configs in your coordinator.xml:
<datasets>
  <dataset name="my_data" frequency="${coord:days(1)}" initial-instance="2013-01-27T00:00Z">
    <uri-template>YOUR_DATA/${YEAR}${MONTH}${DAY}</uri-template>
  </dataset>
  ...
</datasets>
<input-events>
  <data-in name="my_data" dataset="my_data">
    <instance>${coord:current(-1)}</instance>
  </data-in>
</input-events>
<output-events>
  <data-out name="my_data" dataset="my_data">
    <instance>${coord:current(0)}</instance>
  </data-out>
</output-events>
The "coord:current(-1)" in input-events refers to the previous output. Oozie resolves the dataset URI template to "yesterday" and checks whether the data exists in HDFS by looking for a success flag, which by default is an empty file named "_SUCCESS" directly under the output directory. Oozie keeps waiting for this flag before launching the current workflow.
By the way, you can also set
<coordinator-app name="my_coordinator" frequency="${coord:days(1)}" start="${start_time}" end="${end_time}" ...>
to define the start time and end time of a coordinator job, so you can catch up on backlog data.
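To see which directory "coord:current(-1)" ends up pointing at, here is a small Python sketch of the date arithmetic under the dataset definition above. It is an approximation for illustration only: `resolve_current` is a hypothetical helper, it handles only the ${YEAR}, ${MONTH}, and ${DAY} placeholders, and it ignores time zones and instances that would fall before the initial instance.

```python
from datetime import datetime, timedelta


def resolve_current(uri_template, initial_instance, frequency, nominal_time, n):
    """Return the dataset URI that coord:current(n) would point to."""
    # Index of the latest dataset instance at or before the nominal time.
    index = (nominal_time - initial_instance) // frequency
    instance = initial_instance + (index + n) * frequency
    return (uri_template
            .replace("${YEAR}", f"{instance:%Y}")
            .replace("${MONTH}", f"{instance:%m}")
            .replace("${DAY}", f"{instance:%d}"))


if __name__ == "__main__":
    template = "YOUR_DATA/${YEAR}${MONTH}${DAY}"
    initial = datetime(2013, 1, 27)
    daily = timedelta(days=1)
    nominal = datetime(2013, 2, 1)  # example nominal time of one coordinator action

    previous = resolve_current(template, initial, daily, nominal, -1)
    current = resolve_current(template, initial, daily, nominal, 0)
    print("input :", previous)                 # directory the action waits for
    print("flag  :", previous + "/_SUCCESS")   # done flag Oozie looks for
    print("output:", current)
```

Because each action's input is computed from its own nominal time rather than from whatever directory happens to exist, queued actions no longer all pick up the same input.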

TFS2010 Team build - waiting for an "InvokeProcess" step to complete

I am performing a database restore as part of our TFS 2010 Team build. Since a number of databases are being restored, I am using a batch file which is invoked via the InvokeProcess activity.
I have a number of issues that I am uncertain about:
1. Does TFS wait for all the commands in the batch file to complete, or does it move to the next activity as soon as it has kicked off InvokeProcess?
2. Is there a way to have the build process wait for successful completion of the batch command?
I am using it as follows:
The FileName property of InvokeProcess has "c:\windows\system32\cmd.exe"
The Arguments property has the full path of my batch file.
Yes, InvokeProcess will wait for the external command to finish.
