Airflow - failing a task which returns no data?

What would be the best way to fail a task whose data comes from a BCP query (a command-line query against the MS SQL Server I am connecting to)?
I am downloading data from multiple tables every 30 minutes. If the data doesn't exist, the BCP command still creates a file (zero bytes). This makes it seem like the task was always successful, but in reality it means there is missing data on a replication server another team is maintaining.
bcp "SELECT * FROM database.dbo.table WHERE row_date = '2016-05-28' AND interval = 0" queryout /home/var/filename.csv -t, -c -S server_ip -U user -P password
The row_date and interval would be tied to the execution date in Airflow. I would like Airflow to show a failed task instance if the query returned no data, though. Any suggestions?
Check for file size as part of the task?
Create an upstream task which reads the first couple of rows and tells Airflow whether the query was valid or not?

I would use your first suggestion and check the file size as part of the task.
If it is not possible to do this in the same task as the query, create a new task with that specific purpose and give it an upstream dependency on the query task. In the case that the file is empty, just raise an exception in the task (see the sketch below).
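A minimal sketch of that check as its own task, assuming a PythonOperator; the file path, task id and dag object are placeholders, and the import path differs between Airflow 1.x and 2.x:

import os

from airflow.exceptions import AirflowException
from airflow.operators.python import PythonOperator  # airflow.operators.python_operator on 1.x

def check_bcp_output(path="/home/var/filename.csv", **context):
    # Fail the task instance if bcp wrote nothing, or wrote a zero-byte file.
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        raise AirflowException(f"No data returned by BCP query: {path}")

check_file = PythonOperator(
    task_id="check_bcp_output",
    python_callable=check_bcp_output,
    dag=dag,  # assumes the DAG object is defined elsewhere
)

# run_bcp_query >> check_file  # wire it downstream of the bcp task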

Related

How to successfully exit a task midway within an Airflow dag?

I have a DAG that checks for files on an FTP server (Airflow runs on a separate server). If files exist, they get moved to S3 (we archive there). From there, the filename is passed to a spark-submit job; the Spark job processes the file via S3 (the Spark cluster is on a different server). I'm not sure if I need multiple DAGs, but here's the flow. What I'm looking to do is only run the Spark job if a file exists in the S3 bucket.
I tried using an S3 sensor, but it fails/times out once it hits the timeout criteria, and the whole DAG is then marked failed.
check_for_ftp_files -> move_files_to_s3 -> submit_job_to_spark -> archive_file_once_done
I only want to run everything downstream of the FTP check when a file or files were actually moved into S3.
You can have 2 different DAGs. One only has the S3 sensor and keeps running, let's say, every 5 minutes. If it finds the file, it triggers the second DAG. The second DAG handles the S3 move and Spark submit, and archives when done. You can use TriggerDagRunOperator in the first DAG for the triggering (a rough sketch follows).
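A rough sketch of that watcher DAG, assuming Airflow 2.x import paths; the bucket key, dag ids and task ids are made up for illustration:

from airflow.operators.trigger_dagrun import TriggerDagRunOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

wait_for_file = S3KeySensor(
    task_id="wait_for_file",
    bucket_key="s3://my-bucket/incoming/*.csv",  # hypothetical location
    wildcard_match=True,
    poke_interval=5 * 60,   # check every 5 minutes
    dag=sensor_dag,         # the small watcher DAG, defined elsewhere
)

trigger_processing = TriggerDagRunOperator(
    task_id="trigger_processing",
    trigger_dag_id="process_s3_file",  # dag_id of the second DAG
    dag=sensor_dag,
)

wait_for_file >> trigger_processing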
The answer Him gave will work.
Another option is the "soft_fail" parameter that Sensors have (it is a parameter of the BaseSensorOperator). If you set it to True, then instead of failing, the sensor task is skipped when it times out, and all following tasks in the branch are skipped as well.
See the Airflow code for more info.
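A minimal sketch of the soft_fail route, again with placeholder bucket/key values; when the sensor times out it is marked skipped rather than failed, and everything downstream of it is skipped too:

from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor  # import path varies by Airflow version

check_s3 = S3KeySensor(
    task_id="check_s3",
    bucket_key="s3://my-bucket/incoming/*.csv",  # hypothetical location
    wildcard_match=True,
    timeout=30 * 60,   # give up after 30 minutes
    soft_fail=True,    # on timeout the task is marked skipped, not failed
    dag=dag,
)

# Downstream tasks (defined elsewhere) are skipped when the sensor skips:
# check_s3 >> submit_job_to_spark >> archive_file_once_done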

Process running in background after killing

I ran a kitchen.sh command on a Unix server which does some INSERT/UPDATE data loading from one table to another based on some logic.
But since the input data has a huge row count, I had to kill the process partway through with the following command:
kill -9 pid   (pid = process id)
I then checked on the server with ps -ef | grep kitchen and the process was not showing, so I thought it had been killed.
But I have now noticed that a few records are still being updated daily, so somehow the process is still running in the background. Any suggestions on how to check for that and how to resolve it?
Double-check the indexes on your DB table; you could see a huge improvement in INSERT/UPDATE performance after adding the proper indexes.

Change authorization level when initializing a SAS batch job

I run SAS batch jobs on a UNIX server and usually encounter the problem that I cannot overwrite SAS datasets in batch that were created locally by my user, without changing the authorization level of each file in Windows. Is it possible to sign on using my user id and password when initializing the batch job, so that I get full authorization (to my own files) in batch?
Another issue is that I don't have authorization to run UNIX commands using PIPE in a local remote session on the server, and hence cannot terminate my own sessions. It is, on the other hand, possible to run PIPE in batch, but this only allows me to terminate batch jobs. So I also wonder whether it is possible to run a PIPE command in batch using my id and password, as the batch user does not have authorization to cancel "local remote sessions" belonging to my user.
Example code for terminating process:
%let processid = 6938710;
%let unixcmd = "kill &processid";
%put executing &unixcmd;
filename unixcmd pipe &unixcmd.;
data _null_;            /* the command only runs when the pipe is actually read */
  infile unixcmd; input;
run;
There's a good and complete answer to your first point in the following SAS support page.
You can use the umask Unix command to specify the default file-permission policy applied to the permanent datasets created during a SAS session (be it batch or not).
If you are launching a Unix script which invokes a SAS batch session, you can put a umask command just before the SAS invocation.
Otherwise you can adopt a more permanent solution by including the umask command in one of the places listed in the above SAS support article.
You are probably interested in something like:
umask 002
This will assign rw-rw-r-- permissions to all newly created datasets.

Autosys job statistics from the last 3 months

I want to make a report of the start and end times of an Autosys job over the last three months.
How can I get this? Do I need to check archived history or logs?
If yes, please let me know the details.
TIA
Autosys internally uses an Oracle or Sybase database. As long as the data is available in the DB you can fetch it using the autorep command. To get a past run time, use the -r option.
For example: autorep -J JobA -r -30
The above will give you the job's 30th-most-recent run time.
However, due to the performance bottlenecks that historical data can cause in the DB, the DBAs generally purge it after a while. I have seen retention periods of 1 to 7 days, depending on the number of jobs and the power of the database instance.
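If the history is still in the database, one rough way to collect it into a report is to loop over past runs with autorep and save the raw output. This is only a sketch: the job name, run range and output file are placeholders, and parsing the start/end columns is left to whatever format your autorep version prints.

import subprocess

job = "JobA"                       # hypothetical job name
reports = []
for n in range(1, 181):            # roughly 3 months of twice-daily runs; adjust to your schedule
    out = subprocess.run(
        ["autorep", "-J", job, "-r", f"-{n}"],
        capture_output=True, text=True,
    )
    reports.append(out.stdout)     # each report contains the start/end time of that past run

with open("joba_history.txt", "w") as fh:
    fh.write("\n".join(reports))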
Another, approximate, way would be to use the log files created by Autosys, if the std_out attribute is specified with unique filenames.
For example, you can have the attribute as: std_out: $JOB_NAME.out.`date +%m.%s`
In this case the log file will be created as soon as the job starts, so you can get the start time from the filename using text functions on Unix, etc.
For the end time, you can use the last-modified time. This is where the approximation comes in, as it depends on whether your job wrote to the log file near the end; it can be close or far off depending on what the script's commands do.
This method will not give you the times for box jobs, as they never have a log attribute; for those you can rely on the first job in the box.
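As a sketch of that log-file approach, assuming the unique-filename convention above (month.epoch-seconds appended by date +%m.%s); the log directory is a placeholder, the start time is read from the filename and the end time is approximated by the file's last-modified timestamp:

import datetime
import glob
import os

# Filenames follow the JobA.out.<month>.<epoch-seconds> pattern from the std_out example above.
for path in sorted(glob.glob("/autosys/logs/JobA.out.*")):   # hypothetical log directory
    epoch = int(path.rsplit(".", 1)[-1])                     # the %s part = job start time
    start = datetime.datetime.fromtimestamp(epoch)
    end = datetime.datetime.fromtimestamp(os.path.getmtime(path))  # approximate end of the job
    print(path, start, end)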

How to check for existence of Unix System Services files

I'm running batch Java on an IBM mainframe under JZOS. The job creates 0 - 6 ".txt" outputs depending upon what it finds in the database. Then, I need to convert those files from Unix to MVS (ebcdic) and I'm using OCOPY command running under IKJEFT01. However, when a particular output was not created, I get a JCL error and the job ends. I'd like to check for the presence or absence of each file name and set a condition code to control whether the IKJEFT01 steps are executed, but don't know what to use that will access the Unix file pathnames.
I have resolved this issue by writing a COBOL program to check the converted MVS files and set return codes to control the execution of subsequent JCL steps. The completed job is now undergoing user acceptance testing. Perhaps it sounds like a kludge, but it does work and I'm happy to share this solution.
The simplest way to do this in JCL is to use BPXBATCH as follows:
//EXIST EXEC PGM=BPXBATCH,
// PARM='pgm /bin/cat /full/path/to/USS/file.txt'
//*
// IF (EXIST.RC = 0) THEN
//* do whatever you need to
// ENDIF
If the file exists, the step ends with CC 0 and the IF succeeds. If the file does not exist, you get a non-zero CC (256, I believe), and the IF fails.
Since there is no //STDOUT DD statement, there's no output written to JES.
The only drawback is that it is another job step, and if you have a lot of procs (like a compile/assemble job), you can run into the 255 step limit.
