I'm using Airflow 2 and trying to use the AWSAthenaOperator. I can run the operator and it works, but I can't find any way to determine what the file names are that it wrote.
task = AWSAthenaOperator(
    task_id="foo",
    database="mydb",
    query='select * from mytable limit 10;',
    aws_conn_id="athena_conn",
    output_location='s3://mybucket/myfolder',
)
It drops files in s3://mybucket/myfolder, which is great, but how do I find out from the task output what those file names are? I need to then take those names and pass them to other tasks downstream.
I have been digging through the AWSAthenaOperator and AWSAthenaHook code that seems to do the work underneath, but I can't find where it stores that information or how I'd retrieve it.
I think the problem is that the path you pass with output_location isn't unique. output_location is a templated field, so you can use execution_date to make it unique; that way each task writes its files to a different, known location, files from different tasks don't get mixed together, and you can reuse that path in downstream tasks.
task = AWSAthenaOperator(
    task_id="foo",
    database="mydb",
    query='select * from mytable limit 10;',
    aws_conn_id="athena_conn",
    output_location='s3://mybucket/myfolder/{{ execution_date }}',
)
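If you also need the exact file names, note (this goes beyond the original answer and depends on your provider version) that recent Amazon provider releases return the Athena query execution ID from the operator's execute(), so it is pushed to XCom by default, and Athena names the result file <query_execution_id>.csv inside the output location. A minimal sketch of a downstream task built on that assumption:

from airflow.operators.python import PythonOperator  # Airflow 2 import path

def handle_athena_output(output_location, **context):
    # Pull the query execution ID that the Athena task pushed to XCom.
    query_execution_id = context['ti'].xcom_pull(task_ids='foo')
    # Athena writes the result set as <query_execution_id>.csv (plus a
    # .csv.metadata file) under the configured output location.
    result_uri = "{}/{}.csv".format(output_location, query_execution_id)
    print(result_uri)  # hand this off to whatever needs the file

handle_results = PythonOperator(
    task_id='handle_athena_output',
    python_callable=handle_athena_output,
    op_kwargs={'output_location': 's3://mybucket/myfolder/{{ execution_date }}'},
    dag=dag,
)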
DataStage - There are 7 folders in a path, and in each folder there are 2 files, e.g. the 2 files are in the following format: filename = test_s1_YYYYMMDD.txt, test_s1_YYYYMMDD.done. The paths for these files are
user/test/test_s1/
user/test/test_s2/
...
user/test/test_s7/ (here s1, s2, ..., s7 represent the different folders)
In each of these folders the 2 files mentioned above are present, so how can I process each file in a sequence job?
First you need a job to process a file and the filename needs to be a parameter of that job.
For the Sequence level you need two levels - an inner one for the two files within each folder and an outer one for the different directories.
For the inner one you can either build a loop with two iterations or simply add the processing job twice to the sequence (which reduces complexity if it will always be exactly two files).
The outer Sequence is a loop where you parameterize the path so that the loop counter generates the flexible s1-s7 part of it.
Check out more details on loops here
You can use the loop counter (stage_label.$Counter) to parameterize your job.
How you process your files is an important decision that depends on what you want to do with them. Starting a job (or more) in a sequence for each file can lead to heavy overhead just for starting the jobs. Try loading all files at once in a parallel job using the Sequential File stage.
In the Sequential File Stage, set the appropriate Format. You can also set everything to none to just put each row into one column and process that in a later job. This makes the reading very flexible and forgiving. If your files all have the same structure, define your columns as needed.
To select the files, use File Patterns. In the Options of the Sequential File Stage, choose to have a File Name Column so you can process the filenames in a later job. You might also want to add a Row Number Column.
This method works pretty fast.
I wanted to use a file mask with the GoogleCloudStoragePrefixSensor. I can't use the GoogleCloudStoragePrefixSensor alone because I also need to match the ending of the file mask. Basically my file is like "tv_link_input_*.xml". So I tried using GoogleCloudStorageObjectSensor, but it keeps running without any output. Using Airflow 1.10.14. Code:
check_file_gcs = GoogleCloudStorageObjectSensor(
    task_id='check_file_gcs',
    bucket=source_bucket,
    object="processing/tv_link_input_*xml",
    poke='10',
    dag=dag
)
Thanks
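One possible direction, sketched under the assumption that the Airflow 1.10 contrib GoogleCloudStorageHook and PythonSensor are available: list the objects matching the prefix and check the suffix yourself, since the object sensors do not expand wildcards.

from airflow.contrib.hooks.gcs_hook import GoogleCloudStorageHook
from airflow.contrib.sensors.python_sensor import PythonSensor

def _mask_matched(**_):
    # List everything under the prefix, then test the ending by hand.
    hook = GoogleCloudStorageHook()  # uses the default GCP connection
    objects = hook.list(source_bucket, prefix='processing/tv_link_input_')
    return any(name.endswith('.xml') for name in (objects or []))

check_file_gcs = PythonSensor(
    task_id='check_file_gcs',
    python_callable=_mask_matched,
    poke_interval=10,
    dag=dag,
)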
I set up an Airflow server successfully. I want to run some test jobs but I am having trouble finding beginner guides which fit into what I am trying to do.
Current status:
Python scripts to download files from SFTP (any file which does not exist on the local machine) or create a file from a queryout
Pandas scripts to read the data into memory, modify it in some way to prepare it for the database (look for new dimensions, remap, add calculations). Load data to appropriate table in database. Send email summaries (pandas to_html)
The logic I have for most of my scripts is based on if the file has not been processed, then process it. 'Processed' files are either organized by filename in a db table, or I move the file to a special processed folder.
The other logic I have is based on the date in the filename. I compare the dates of files which exist versus dates which should exist (a range of dates). If the file does not exist, then I create it (usually a BCP or PSQL query).
Do I just have Airflow run these .py files? Or should I alter my scripts to use some of the Airflow parameters/jinja templating?
I almost feel like I could use the BashOperator for almost everything. Would this work?
import sys

# pg_engine is assumed to be a SQLAlchemy engine created elsewhere in the script
dag_input = sys.argv[1]

def alter_table(query, engine=pg_engine):
    fake_conn = engine.raw_connection()
    fake_cur = fake_conn.cursor()
    fake_cur.execute(query)
    fake_conn.commit()
    fake_cur.close()

query_list = [
    f'SELECT * from table_1 where report_date = \'{dag_input}\'',
    f'SELECT * from table_2 where report_date = \'{dag_input}\'',
]

for value in query_list:
    alter_table(value)
Then the dag would be something like this, with an airflow parameter used for the sys.argv?
templated_command = """
python download_raw.py "{{ ds }}"
"""
t3 = BashOperator(
    task_id='download_raw',
    bash_command=templated_command,
    dag=dag)
Since the code for this task is in python, I would use a PythonOperator.
Put a method in download_raw.py that takes **kwargs as parameters and you have access to everything in the context.
from download_raw import my_func

t3 = PythonOperator(
    task_id='download_raw',
    python_callable=my_func,
    provide_context=True,  # needed on Airflow 1.x so the context arrives as **kwargs; Airflow 2 passes it automatically
    dag=dag)

# inside download_raw.py
def my_func(**kwargs):
    context = kwargs
    ds = context['ds']
    ...  # (do your logic here)
I would do it like this, or your bash command could get hideous when you need several pieces of the context.
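If only the execution date is needed, another option (a sketch, not part of the original answer) is to hand it to the callable explicitly through op_kwargs, which is a templated field of PythonOperator:

from airflow.operators.python_operator import PythonOperator  # airflow.operators.python on Airflow 2
from download_raw import my_func  # my_func would then read ds from its kwargs

t3 = PythonOperator(
    task_id='download_raw',
    python_callable=my_func,
    op_kwargs={'ds': '{{ ds }}'},  # rendered by Jinja before the callable runs
    dag=dag)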
I've tried to find a table with the definition of each COMPSTAT return code (related to the tool Control-M Workload Automation), but without any success.
Can anyone tell me if such a table exists?
Thank you.
It's the return code from whatever task was being executed at that time. By convention, a zero value means 'OK', and anything non-zero means an error of some kind.
Different utilities (i.e. external commands) have different possible return codes, so if the command were SCP then you would look up the code in the SCP documentation, and find that for example, '67' meant 'key exchange failed'.
There is no table that contains the definition of each COMPSTAT return code.
OSCOMPSTAT stand for Control-M Operating System Completion Status.
The value of COMPSTAT is set by the exit code of the command that was called.
Example:
After calling the command [cat file1.txt] the value of COMPSTAT will be:
0 if the file "file1.txt" is found
1 if the file "file1.txt" is not found
After calling the command [ctmfw] the value of COMPSTAT will be:
0 if the specified file is found
7 if the specified file is not found
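As an illustration of that convention (plain OS behaviour, not Control-M itself), you can run a command and read its exit code directly; Control-M records the same number as OSCOMPSTAT:

import subprocess

# `cat` exits with 0 when the file exists and with a non-zero code (1 on most
# systems) when it does not; that exit code is what ends up in COMPSTAT.
result = subprocess.run(['cat', 'file1.txt'], capture_output=True)
print(result.returncode)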
The transaction AL11 returns a mapping of "directory parameters" to file paths on the application server AFAIK.
The trouble with transaction AL11 is that its program only calls C modules; there's almost no trace of SELECT statements or function calls to analyze there.
I want the ability to do this dynamically, in my code, like for instance a function module that took "DATA_DIR" as input and "E:\usr\sap\IDS\DVEBMGS00\data" as output.
This thread is about a similar topic, but it doesn't help.
Some other guy has the same problem, and he explains it quite well here.
I strongly suspect that the only way to get these values is through the kernel directly. Some of them can vary depending on the application server, so you probably won't be able to find them in the database. You could try this:
TYPE-POOLS abap.
TYPES: BEGIN OF t_directory,
log_name TYPE dirprofilenames,
phys_path TYPE dirname_al11,
END OF t_directory.
DATA: lt_int_list TYPE TABLE OF abaplist,
lt_string_list TYPE list_string_table,
lt_directories TYPE TABLE OF t_directory,
ls_directory TYPE t_directory.
FIELD-SYMBOLS: <l_line> TYPE string.
START-OF-SELECTION-OR-FORM-OR-METHOD-OR-WHATEVER.
* get the output of the program as string table
SUBMIT rswatch0 EXPORTING LIST TO MEMORY AND RETURN.
CALL FUNCTION 'LIST_FROM_MEMORY'
TABLES
listobject = lt_int_list.
CALL FUNCTION 'LIST_TO_ASCI'
EXPORTING
with_line_break = abap_true
IMPORTING
list_string_ascii = lt_string_list
TABLES
listobject = lt_int_list.
* remove the separators and the two header lines
DELETE lt_string_list WHERE table_line CO '-'.
DELETE lt_string_list INDEX 1.
DELETE lt_string_list INDEX 1.
* parse the individual lines
LOOP AT lt_string_list ASSIGNING <l_line>.
* If you're on a newer system, you can do this in a more elegant way using regular expressions
CONDENSE <l_line>.
SHIFT <l_line> LEFT DELETING LEADING '|'.
SHIFT <l_line> RIGHT DELETING TRAILING '|'.
SPLIT <l_line>+1 AT '|' INTO ls_directory-log_name ls_directory-phys_path.
APPEND ls_directory TO lt_directories.
ENDLOOP.
Try the following:
data : dirname type DIRNAME_AL11.

CALL 'C_SAPGPARAM' ID 'NAME'  FIELD 'DIR_DATA'
                   ID 'VALUE' FIELD dirname.
Alternatively, if you want to use your own parameters (AL11 -> Configure), read them out of table USER_DIR.