I need to configure an Oozie coordinator dataset for the previous day - oozie

How can I schedule an Oozie coordinator to pick up the previous day's data, relative to the latest (current) day?
For example:
Suppose I need to pick up the payload path, and the dataset folders are named like this:
ts=202205190300/
ts=202205190030/
ts=202205190000/
ts=202205181200/
ts=202205180300/
ts=202205180000/
.....etc
Here, if I take ts=202205190300/ as the current dataset instance (i.e. the 19th of May), I need to pick up the previous day's payload, that is, whichever folders are present for the 18th of May. The following folders need to be processed:
ts=202205181200/
ts=202205180300/
ts=202205180000/
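The selection described above can be sketched in plain Python. This only illustrates the intended logic and is not Oozie configuration; the function name previous_day_folders is made up for the example, and the folder names are the ones listed in the question.

```python
from datetime import datetime, timedelta


def previous_day_folders(folders):
    """Return the folders dated one day before the newest folder's date."""
    # Parse 'ts=YYYYMMDDHHMM/' into a datetime for each folder name.
    stamps = {
        name: datetime.strptime(name.strip("/").split("=", 1)[1], "%Y%m%d%H%M")
        for name in folders
    }
    latest_day = max(stamps.values()).date()
    target_day = latest_day - timedelta(days=1)
    # Keep the original order of the listing.
    return [name for name in folders if stamps[name].date() == target_day]


folders = [
    "ts=202205190300/", "ts=202205190030/", "ts=202205190000/",
    "ts=202205181200/", "ts=202205180300/", "ts=202205180000/",
]
print(previous_day_folders(folders))
# ['ts=202205181200/', 'ts=202205180300/', 'ts=202205180000/']
```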
I have tried ${coord:current(-1)}, but it doesn't work for me.
Below is my XML file:
<coordinator-app name="hello-coord" frequency="${coord:days(1)}"
                 start="2022-05-19T08:00Z" end="2099-05-04T15:00Z" timezone="America/Los_Angeles"
                 xmlns="uri:oozie:coordinator:0.1">
  <datasets>
    <dataset name="din" frequency="${coord:days(1)}"
             initial-instance="2022-05-19T08:00Z" timezone="America/Los_Angeles">
      <uri-template>${baseFsURI}/${YEAR}/${MONTH}/${DAY}</uri-template>
      <done-flag>_SUCCESS</done-flag>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="input" dataset="din">
      <instance>${coord:current(-1)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>${wf_app_path}</app-path>
    </workflow>
  </action>
</coordinator-app>
Can you tell me whether I am configuring the coordinator with the correct values?
Thanks in advance.

Related

How to specify/use idempotent "date of execution" within dagster assets/jobs?

Coming from Airflow, I used Jinja templates such as {{ds_nodash}} to pass the execution date of a DAG into my scripts.
For example, I am able to detect and ingest a file on the first of August 2022 if its name is in the format FILE_20220801.csv. I would have a DAG with a sensor and an operator that use FILE_{{ds_nodash}}.csv in their code. That way I was sure my DAG was idempotent with regard to its execution date.
I am now looking into Dagster because the assets abstraction is quite attractive. Dagster is also easy to set up and test locally. But I cannot find similar Jinja templates that can ensure the idempotency of my executions.
In other words, how do I make sure data that was sent to me during a specific date is going to be processed the same way even if I run it 1, 2 or N days later?
If a file comes in every day (or hour, or week, etc.), and some of the assets that depend on the file have a partition for each file, then the recommended way to do this is with partitions. E.g.:
from typing import Optional

from dagster import DailyPartitionsDefinition, asset, define_asset_job, repository, sensor

# One partition per day; keys look like "20220801". start_date is written in the same format as fmt.
daily_partitions_def = DailyPartitionsDefinition(start_date="20200101", fmt="%Y%m%d")

@asset(partitions_def=daily_partitions_def)
def asset1(context):
    # The partition key, not the wall-clock date, decides which file is read.
    path = f"FILE_{context.partition_key}.csv"
    ...

@asset(partitions_def=daily_partitions_def)
def asset2(context):
    ...

def detect_file() -> Optional[str]:
    """Returns a value like '20220801', or None if no file is detected."""

all_assets_job = define_asset_job("all_assets", partitions_def=daily_partitions_def)

@sensor(job=all_assets_job)
def my_sensor():
    date_str = detect_file()
    if date_str:
        # Request a run for the partition that matches the detected file.
        return all_assets_job.run_request_for_partition(run_key=None, partition_key=date_str)

@repository
def repo():
    return [my_sensor, asset1, asset2]
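The idempotency comes from deriving the file name from the partition key rather than from the date on which the run happens. A minimal sketch of that idea without Dagster, assuming the FILE_<date>.csv naming from the question (the helper name file_for_partition is hypothetical):

```python
from datetime import datetime


def file_for_partition(partition_key: str) -> str:
    """Map a partition key such as '20220801' to the file it covers."""
    # Raises ValueError for keys that do not parse as a %Y%m%d date.
    datetime.strptime(partition_key, "%Y%m%d")
    return f"FILE_{partition_key}.csv"


# The result depends only on the key, so a rerun days later reads the same file.
print(file_for_partition("20220801"))  # FILE_20220801.csv
```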

Remove date from filename UNIX

I am working in UNIX and trying to write the following commands. I receive a source file daily whose filename is in the format:
ONSITE_EXTR_ONSITE_EXTR_20170707.
Since I receive a file daily, the file name changes with the current date: ONSITE_EXTR_ONSITE_EXTR_20170708, ONSITE_EXTR_ONSITE_EXTR_20170709, etc. I need to strip the date out of the filename and rename the file to ONSITE_EXTR_ONSITE_EXTR. After I have finished whatever data reading and processing I need to do, I need to change the file name back to, for example, ONSITE_EXTR_ONSITE_EXTR_20170707. Since the file is delivered daily, I can't hard-code the date in whatever commands I write. Any help would be greatly appreciated.
Depending on your toolchain, this may be as simple as running:
$ mv ONSITE_EXTR_ONSITE_EXTR_$(date +%Y%m%d) ONSITE_EXTR_ONSITE_EXTR
... before running the rest of your script, assuming you're using a Bash-like shell.
Having said that, you can simply use ONSITE_EXTR_ONSITE_EXTR_$(date +%Y%m%d) in your script wherever you need to access the file instead.
This is all assuming the script's run the same day and in the same time zone as the file is downloaded.
If you were using bash and you had the file name in a variable, you could do:
IN="ONSITE_EXTR_ONSITE_EXTR_20170707"
echo "${IN:0:23}"
which prints ONSITE_EXTR_ONSITE_EXTR
Googling gives all sorts of guides here...
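If the date in the name is not necessarily today's, the suffix can be split off by pattern instead of by position. A minimal sketch in Python, assuming the name always ends in an underscore followed by eight digits (the helper name split_date_suffix is made up for the example):

```python
import re

# Matches a trailing "_YYYYMMDD" (eight digits at the end of the name).
DATE_SUFFIX = re.compile(r"_(\d{8})$")


def split_date_suffix(filename):
    """Return (base_name, date_string) for names ending in _YYYYMMDD."""
    match = DATE_SUFFIX.search(filename)
    if match is None:
        raise ValueError(f"no _YYYYMMDD suffix in {filename!r}")
    return filename[:match.start()], match.group(1)


base, stamp = split_date_suffix("ONSITE_EXTR_ONSITE_EXTR_20170707")
print(base)               # ONSITE_EXTR_ONSITE_EXTR
print(f"{base}_{stamp}")  # ONSITE_EXTR_ONSITE_EXTR_20170707
```

Keeping the date string around makes it possible to rename the file back to its original name after processing, for instance with os.rename.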

Oozie done-flag EL functions

I am trying to use the built-in variables or EL functions provided by Oozie in the <done-flag> tag of the Oozie coordinator XML, but it seems this is not supported. Does anyone know of any other way to declare this?
Our done-flags have dates in them.
<done-flag>${YEAR}${MONTH}${DAY}.done</done-flag>
OR
<done-flag>${coord:formatTime(coord:actualTime(), 'yyyyMMdd')}</done-flag>
I get the following error when launching the oozie coordinator workflow.
Error: E1004 : E1004: Expression language evaluation error,
Unable to evaluate :${coord:formatTime(coord:actualTime(), 'yyyyMMdd')}:
Does anyone know of a way to achieve dynamic done-flag names?
Not 100% sure about what you want to do.
My understanding is that the Coordinator waits for a file named as "done-flag" before running a Workflow. Then the "coord:actualTime()" function can be used to know at what time the Workflow was actually started.
=> the documentation should stress that the phrase "coordinator action" actually means "workflow" in most cases...
If you want to check the clock time while the Coordinator is still waiting, the keywords YEAR - MONTH - DAY - HOUR - MINUTE are your only hope.
I got it working in the following way: by using the YEAR, MONTH and DAY variables in the uri-template and emptying out the <done-flag>.
<uri-template>
/donemarkers/dependency-job/${YEAR}${MONTH}${DAY}.done
</uri-template>
<!--<done-flag>${YEAR}${MONTH}${DAY}.done</done-flag>-->
<done-flag></done-flag>
I came to know from the logs that oozie first checks if there is a directory with the name specified by uri-template, if not then it checks if there is a file specified by the uri-template.
2015-07-28 19:40:46,225 INFO CoordActionInputCheckXCommand:539 - USER[-] GROUP[-] TOKEN[-] APP[-] JOB[0131647-140520191754742-oozie-oozi-C] ACTION[0131647-140520191754742-oozie-oozi-C#2] [0131647-140520191754742-oozie-oozi-C#2]::ActionInputCheck:: In checkResolvedUris...
2015-07-28 19:40:46,225 INFO CoordActionInputCheckXCommand:539 - USER[-] GROUP[-] TOKEN[-] APP[-] JOB[0131647-140520191754742-oozie-oozi-C] ACTION[0131647-140520191754742-oozie-oozi-C#2] [0131647-140520191754742-oozie-oozi-C#2]::ActionInputCheck:: In checkListOfPaths: /donemarkers/dependency-job/20150725.done is Missing.
2015-07-28 19:40:46,241 INFO CoordActionInputCheckXCommand:539 - USER[-] GROUP[-] TOKEN[-] APP[-] JOB[0131647-140520191754742-oozie-oozi-C] ACTION[0131647-140520191754742-oozie-oozi-C#2] [0131647-140520191754742-oozie-oozi-C#2]::ActionInputCheck:: File:/donemarkers/dependency-job/20150725.done, Exists? :true
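To see which path the coordinator ends up checking, the substitution of YEAR, MONTH and DAY into the uri-template can be imitated in Python. This is only a sketch of the string substitution, not Oozie's own resolution code; the function name resolve_uri is hypothetical, and the date is the one that appears in the log lines above.

```python
from datetime import datetime
from string import Template


def resolve_uri(uri_template: str, nominal_time: datetime) -> str:
    """Fill ${YEAR}, ${MONTH} and ${DAY} in a uri-template-like string."""
    return Template(uri_template).substitute(
        YEAR=f"{nominal_time.year:04d}",
        MONTH=f"{nominal_time.month:02d}",
        DAY=f"{nominal_time.day:02d}",
    )


template = "/donemarkers/dependency-job/${YEAR}${MONTH}${DAY}.done"
print(resolve_uri(template, datetime(2015, 7, 25)))
# /donemarkers/dependency-job/20150725.done
```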

Unitils dataset - insert into sybase datetime column

I have a unitils dataset attempting to enter data into a sybase datetime column, but it errors when I try and run the unit tests.
I cannot find anything online about inserting dates, so I tried declaring it as a String in the same way as the other columns, but I get the error:
org.unitils.core.UnitilsException: Error inserting test data from DbUnit dataset for method etc etc
at org.unitils.dbunit.DbUnitModule.insertDataSet(DbUnitModule.java:156)
at org.unitils.dbunit.DbUnitModule$DbUnitListener.beforeTestSetUp(DbUnitModule.java:556)
at org.unitils.core.Unitils$UnitilsTestListener.beforeTestSetUp(Unitils.java:273)
at org.unitils.UnitilsJUnit4TestClassRunner$TestListenerInvokingMethodRoadie.runBeforesThenTestThenAfters(UnitilsJUnit4TestClassRunner.java:151)
at org.junit.internal.runners.MethodRoadie.runTest(MethodRoadie.java:84)
at org.junit.internal.runners.MethodRoadie.run(MethodRoadie.java:49)
at org.unitils.UnitilsJUnit4TestClassRunner.invokeTestMethod(UnitilsJUnit4TestClassRunner.java:95)
at org.junit.internal.runners.JUnit4ClassRunner.runMethods(JUnit4ClassRunner.java:61)
at org.unitils.UnitilsJUnit4TestClassRunner.access$000(UnitilsJUnit4TestClassRunner.java:44)
at org.unitils.UnitilsJUnit4TestClassRunner$1.run(UnitilsJUnit4TestClassRunner.java:62)
at org.junit.internal.runners.ClassRoadie.runUnprotected(ClassRoadie.java:34)
at org.junit.internal.runners.ClassRoadie.runProtected(ClassRoadie.java:44)
at org.unitils.UnitilsJUnit4TestClassRunner.run(UnitilsJUnit4TestClassRunner.java:68)
at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
Caused by: org.unitils.core.UnitilsException: Error while executing DataSetLoadStrategy
at org.unitils.dbunit.datasetloadstrategy.impl.BaseDataSetLoadStrategy.execute(BaseDataSetLoadStrategy.java:46)
at org.unitils.dbunit.DbUnitModule.insertDataSet(DbUnitModule.java:230)
at org.unitils.dbunit.DbUnitModule.insertDataSet(DbUnitModule.java:153)
... 18 more
Caused by: org.dbunit.dataset.datatype.TypeCastException: Error casting value for table 'USER' and column 'SUSPEND_SD'
at org.dbunit.operation.AbstractBatchOperation.execute(AbstractBatchOperation.java:202)
at org.dbunit.operation.CompositeOperation.execute(CompositeOperation.java:78)
at org.unitils.dbunit.datasetloadstrategy.impl.CleanInsertLoadStrategy.doExecute(CleanInsertLoadStrategy.java:45)
at org.unitils.dbunit.datasetloadstrategy.impl.BaseDataSetLoadStrategy.execute(BaseDataSetLoadStrategy.java:44)
... 20 more
The only line inserting a date in my dataset is:
<USER ID="4" FULL_NAME="user3" SUSPEND_SD="25/12/2011 00:00" SUSPEND_ED="25/12/2013 00:00" PASSWORD="password3"/>
Can anyone help?
Thanks in advance.
OK, it turns out it has to be in exactly the format "YYYY-MM-DD HH:MM:SS", for example "2011-01-01 02:00:00".
I don't know how unitils works, but assuming your dataset is editable, why not change the dates to yyyy/mm/dd format?
That Java stack trace mentions casting, and I'd think that is a failure to convert dd/mm/yyyy to datetime.
Alternatively, the middleware probably has some means of configuring dates: you can probably tell it that dates are British (it might even be a regional setting).
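If the dataset values already exist in dd/mm/yyyy form, they can be converted to the "YYYY-MM-DD HH:MM:SS" format mentioned above before being pasted into the dataset. A small sketch in Python, assuming the input always looks like the value in the question (the function name to_dataset_timestamp is made up):

```python
from datetime import datetime


def to_dataset_timestamp(value: str) -> str:
    """Convert 'DD/MM/YYYY HH:MM' into 'YYYY-MM-DD HH:MM:SS'."""
    parsed = datetime.strptime(value, "%d/%m/%Y %H:%M")
    return parsed.strftime("%Y-%m-%d %H:%M:%S")


print(to_dataset_timestamp("25/12/2011 00:00"))  # 2011-12-25 00:00:00
print(to_dataset_timestamp("25/12/2013 00:00"))  # 2013-12-25 00:00:00
```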

XSL, comparing dates to exclude any past events

I have an RSS of an events feed. I would like to hide previous events.
Assuming XML data subset of
<Navigation Name="ItemList" Type="Children">
<Page ID="x32444" URL="..." Title="Class..."
EventStartDate="20090831T23:00:00" EventEndDate="20090904T23:00:00"
EventStartTime="20090830T15:30:00" EventEndTime="20090830T18:30:00" Changed="20090830T20:28:31" CategoryIds="" Schema="Event"
Name="Class of 2010 BAKE SALE"/>
<Page ID="x32443" URL="x32443.xml?Preview=true&Site=&UserAgent=&IncludeAllPages=true&tfrm=4" Title="Class of 2010 BAKE SALE"
Abstract="Treat yourself with our famous 10-star FRIED ICE CREAM!" EventStartDate="20090831T23:00:00" EventEndDate="20090904T23:00:00"
EventStartTime="20090830T15:30:00" EventEndTime="20090830T18:30:00" Changed="20090830T20:25:35" CategoryIds="" Schema="Event"
Name="Class of 2010 BAKE SALE"/>
<Page ID="x32426" URL="x32426.xml?Preview=true&Site=&UserAgent=&IncludeAllPages=true&tfrm=4" Title="Tribute to ..."
Abstract="Event to recognize and celebrate the lifetime of leadership and service ..."
EventStartDate="20091206T00:00:00" EventEndDate="20091206T00:00:00" EventStartTime="20090828T23:00:00" EventEndTime="20090828T04:00:00"
Changed="20090828T22:09:54" CategoryIds="" Schema="Event" Name="Tribute to ...."/>
</Navigation>
How would I not include anything past today's date?
<xsl:apply-templates select="Page[@EventStartDate=notBeforeToday()]"/>
Easiest with XSL parameters that you set from outside.
<xsl:param name="today" select="'undefined'" />
<!-- time passes... -->
<xsl:apply-templates select="Page[@EventStartDate &lt; $today]"/>
Your date format is such that you can compare it using string comparison, unless there are different timezones involved. You would simply set
20091001T00:00:00
as the param value for $today. Have a look into your XSLT processor's documentation to see how.
The alternative would be to use an extension function. Here it depends on which extension functions your XSLT processor supports, so this approach won't be portable.
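The reason plain string comparison works for this date format is that the fields are fixed-width and ordered from most to least significant, so lexicographic order matches chronological order. A short Python sketch of that property, using the EventStartDate values from the sample data and keeping only events that are not before today (the function name upcoming is made up):

```python
def upcoming(start_dates, today):
    """Keep timestamps that are not before `today`, using string comparison."""
    return [stamp for stamp in start_dates if stamp >= today]


start_dates = ["20090831T23:00:00", "20091206T00:00:00"]
print(upcoming(start_dates, "20091001T00:00:00"))  # ['20091206T00:00:00']
```

Note that in XPath 1.0 the relational operators convert their operands to numbers, so comparing these values as strings inside the stylesheet may require XPath 2.0 or a purely numeric form of the date.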
For this purpose, I usually add an extra date attribute to the XML that contains the day number since the year 1900,
for example @dateid='9876543' or @seconds="9876675446545".
Then I can easily compare it with today or another variable in the XSL.
You can also use this technique to compare times, using "Unix time" for example.
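Such a day number could be generated when the XML is produced. A minimal sketch in Python, assuming 1900-01-01 as the reference day and the timestamp format from the sample data (the names EPOCH and day_number are made up for the example):

```python
from datetime import date, datetime

EPOCH = date(1900, 1, 1)  # assumed reference day for the example


def day_number(stamp: str) -> int:
    """Days elapsed between 1900-01-01 and a 'YYYYMMDDTHH:MM:SS' timestamp."""
    day = datetime.strptime(stamp, "%Y%m%dT%H:%M:%S").date()
    return (day - EPOCH).days


# Numeric day numbers compare in the same order as the dates themselves.
print(day_number("20090831T23:00:00") < day_number("20091206T00:00:00"))  # True
```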

Resources