I'm currently using Data Factory V1.
I have a pipeline with 2 chained activities:
1. The first activity is a Copy Activity that extracts a table from SQL DB into a .tsv file in Data Lake Store.
2. The second activity is a Data Lake Analytics U-SQL activity that collects the data from the previously created .tsv file and adds it to an existing table in the Data Lake database.
Obviously, I only want the second activity to run after the first one, so I used the output dataset of the first activity as the input dataset of the second activity, and that works fine.
But if the first activity fails, the second activity gets stuck in the state "Waiting: Dataset dependencies (The upstream dependencies are not ready)".
I have the policy->timeout property set for the second activity, but it only seems to take effect after the activity has started. Since the activity never starts, it never times out and stays stuck.
How can I set a timeout for this "waiting" period?
Thank you
That is how V1 works. If your upstream dataset fails, the downstream activity will stay in the waiting state until the upstream dataset has completed successfully.
If you are using a schedule, you would want to fix the problem with the first activity and run the failed slice again. If you're working with a one-time pipeline, you have to run the whole pipeline again after fixing the problem.
The timeout only applies once the processing has actually started, as described in the Data Factory documentation:
If the data processing time on a slice exceeds the timeout value, it is canceled, and the system attempts to retry the processing. The number of retries depends on the retry property. When timeout occurs, the status is set to TimedOut.
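For reference, in a V1 pipeline definition those settings live in the activity's policy section; a minimal illustrative snippet (the values are examples only):

```json
"policy": {
    "timeout": "01:00:00",
    "retry": 3,
    "concurrency": 1
}
```

As the quoted documentation implies, these properties only take effect once the slice actually starts executing; they do not shorten the "Waiting: Dataset dependencies" period.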
Related
I have a DAG that inserts data into a SQL Server database. Some of the tasks take 24+ hours to run because the database they insert into is not high-performing.
I need to mark those tasks as complete automatically if they take more than 24 hours to run, so I can move on and start inserting the next day's worth of data (the DAG runs daily and the data source has new data coming in every day). How can I do this programmatically, without having to go into the UI to mark the task as 'Success' or 'Failed'?
You could follow an approach similar to the one shown in this Stack Overflow post: kill or terminate subprocess when timeout. Then, once the timeout occurs, you just need to make sure you don't raise any exception.
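A minimal sketch of that idea, assuming the long-running insert can be launched as a subprocess from the task's Python callable (the command and the 24-hour cutoff are illustrative):

```python
import subprocess

def run_insert_with_cutoff(cmd, timeout_seconds=24 * 60 * 60):
    """Run the insert job as a subprocess and give up after timeout_seconds."""
    proc = subprocess.Popen(cmd)
    try:
        proc.wait(timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()   # terminate the long-running insert
        proc.wait()   # reap the killed process
        # Deliberately swallow the timeout so the Airflow task still finishes
        # as a success and the next day's run can proceed.

# Hypothetical usage from a PythonOperator callable:
# run_insert_with_cutoff(["python", "insert_job.py"])
```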
I have built a micro-service with an API called deleteToken. When invoked, this API is supposed to change the status of the tuple in the DB corresponding to the token (identified by token id) to "MARK_DELETE". Once that tuple has status "MARK_DELETE", then after 30 days a REST call should be made to a downstream service API called deleteTokenFromPartner. There is no mandate that the call to deleteTokenFromPartner be made right after 30 days; it can also be made a few hours after the 30 days.

So what I thought was that I would write a scheduler (using Quartz / the Java Executor service) scheduled to run once every day. It will query the DB, find all rows with status = "MARK_DELETE" whose status update is older than 30 days, and then iteratively call deleteTokenFromPartner for each of those rows. There is one DB which is highly available, and we should not have any consistency issues since we delete after 30 days.

The problem I am seeing is that this micro-service has N instances, so every instance will query the DB, get the same set of rows, and make calls for the same rows. Is there any tweak I can make so that these duplicated calls are avoided? FYI, we don't make any config changes based on hostnames, and if only one instance ends up being the one capable of running the scheduler, that is also fine.
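One common way to keep N identical instances from processing the same rows is to have each run claim the eligible rows with a single atomic UPDATE before acting on them; a rough sketch of that idea (the table, column names, and date functions are assumptions, and Python is used here purely for brevity rather than the service's actual Java code):

```python
import uuid
import pyodbc  # any SQL driver works; pyodbc just keeps the sketch concrete

# Claim every MARK_DELETE row older than 30 days for this particular run.
# Because the UPDATE is a single atomic statement, each row can only be
# claimed once, so other instances running the same job match zero rows.
CLAIM_SQL = """
UPDATE tokens
SET status = 'DELETE_IN_PROGRESS', claimed_by = ?
WHERE status = 'MARK_DELETE'
  AND status_updated_at <= DATEADD(day, -30, GETDATE())
"""

FETCH_SQL = "SELECT token_id FROM tokens WHERE claimed_by = ?"


def delete_token_from_partner(token_id):
    # Placeholder for the REST call to the downstream deleteTokenFromPartner API.
    pass


def run_daily_cleanup(conn_str):
    run_id = str(uuid.uuid4())            # unique claim marker for this run
    conn = pyodbc.connect(conn_str)
    try:
        cur = conn.cursor()
        cur.execute(CLAIM_SQL, run_id)    # atomically claim the eligible rows
        conn.commit()
        cur.execute(FETCH_SQL, run_id)
        for (token_id,) in cur.fetchall():
            delete_token_from_partner(token_id)
    finally:
        conn.close()
```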
Azkaban's pipeline mode execution results in failed jobs once a sufficient number of jobs have accumulated in the backlog. This happens when executions of a flow take more time than its scheduling frequency.
Is there a way to make Azkaban launch a new instance only when the previous one has ended, while at the same time not skipping any instance? We cannot skip instances because the time parameter passed to each one is used to select the duration of data that has to be processed.
We receive many large data files daily in a variety of formats (i.e. CSV, Excel, XML, etc.). In order to process these large files we transform the incoming data into one of our standard 'collection' message classes (using XSLT and a pipeline component - either built-in or custom), disassemble the large transformed message into individual 'object' messages and then call a series of SOAP web service methods to handle business logic and database operations.
Unlike the other files we receive, the latest file contains all data rows each day; therefore, we have to handle the differences to prevent identical records from being re-processed every day.
I have a suitable mechanism for handling inserts and updates but am currently struggling with the deletes (where the record exists in the database but not in the latest file).
My current thought is to flag the deleted records in the database using a 'cleanup' task at the end of the entire process, but this would require a method to be called once all 'object' messages from the disassembled file have completed.
Is it possible to monitor individual messages from a multi-record file and call a method on completion of the whole file? Currently, all research is pointing to an orchestration with some sort of 'wait' but is this the only option?
Example: the file contains 100 vehicle records. This is disassembled into 100 individual XML messages, which are processed using 100 calls to a web service method. We wish to call the cleanup operation when all 100 messages are complete.
The best way I've found to handle the 'all rows every day' scenario is to pre-stage the data in SQL Server where it's easier to compare the 'current' set to the 'previous' set. The INTERSECT and EXCEPT operators make it pretty easy in most cases.
Then drain the records with a Polling statement.
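For the delete case in particular, EXCEPT gives you the keys that were present in the previous day's staged data but are missing from today's file; a small illustrative example (the staging table and column names are assumptions, and the driver is used only to make the snippet self-contained):

```python
import pyodbc

# Keys that existed in yesterday's staged set but are absent from today's
# full file are the records that should be flagged as deleted.
DELETED_KEYS_SQL = """
SELECT VehicleId FROM staging.Vehicles_Previous
EXCEPT
SELECT VehicleId FROM staging.Vehicles_Current
"""


def find_deleted_keys(conn_str):
    conn = pyodbc.connect(conn_str)
    try:
        cur = conn.cursor()
        cur.execute(DELETED_KEYS_SQL)
        return [row.VehicleId for row in cur.fetchall()]
    finally:
        conn.close()
```

The same pattern with INTERSECT identifies the rows that are identical in both sets and can safely be skipped.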
The component that does the de-batching would need to publish a start of batch message with the number of individual records and a correlation key.
The components that do the insert & update would need to publish a completion message with the same correlation key when they have finished processing.
The start-of-batch message would spin up an Orchestration that listens for the completion messages with that correlation key and counts them; either after it has received the correct number, or after a timeout period, it would call the cleanup or raise an exception.
I have a Windows Workflow service that is hosted in a console application. I have a count variable in the service whose value is incremented on each call; how can I make the count value persistent between calls?
EDITED: The workflow takes a timeout value as input and returns an id. If you pass 10 as the timeout value, the workflow delays for 10 seconds and returns the id 1. If, in the meantime, another client passes 3 as the timeout value, a new instance of the workflow has to be created, wait for 3 seconds, and return the new id value of 2.
If you are referring to a variable per workflow instance, you can create a variable at the root sequence of your workflow and store the value there. If the workflow is persisted to disk, this variable will be saved with it.
However, from your question it seems you are referring to a variable per workflow type. In that case there is no static variable per workflow type, and you need to manage the state outside of the workflow and persist it yourself.