Very often, the files I am downloading have a date in the filename.
csat_surveys_2017_03_05.csv
03062017_roster.csv
My code currently handles this itself:
- Compare the dates in the processed-file list (based on slicing the filename) against the expected dates that should exist (some date range up to the current date)
- For each file I process, add the filename to a database table and only process new files that have not been added to that table
Can I (and should I) use the Airflow schedule date to replace the need to code this logic myself? Every day, my task is scheduled to run; I would take that scheduled date (minus 1 day, perhaps) and use that value as a parameter to build the filename to read (in pandas). If so, can I please see a clear example that I can use as a template?
Is that a better approach, and would that cover me if a file is missing or delayed for a few days (I would want the task to fail, then keep trying each day until it succeeds or until I notice it and can raise the issue to our clients)?
I'd say yes, using the execution_date is probably best practice.
To access it, you'll need a templated field. Some default operators have those already, or you might want to create your own operator, which will then look something like this:
In your DAG, you'd have the task as:
my_task = MyOperator(
    task_id='t1',
    filename='prefix_{{ ds }}_suffix')
ds is the Airflow macro for accessing the execution_date as a string representation of the date (YYYY-MM-DD).
And your MyOperator would look like:
class MyOperator(BaseOperator):
template_fields = ('filename')
def __init__(self, filename)
self.filename = filename
def execute(self, context):
download_file(self.filename)
do_other_stuff()
You can find more about how to parametrize tasks in the Macros section https://airflow.incubator.apache.org/code.html#macros
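To tie this back to the filenames in the question: the templated field accepts any Jinja expression over the execution date, including the "minus one day" idea. A minimal sketch reusing MyOperator from above (the strftime formats are my guesses at your actual naming patterns, so adjust as needed):
``` python
surveys = MyOperator(
    task_id='surveys',
    # e.g. csat_surveys_2017_03_05.csv
    filename='csat_surveys_{{ execution_date.strftime("%Y_%m_%d") }}.csv')

roster = MyOperator(
    task_id='roster',
    # scheduled date minus one day, rendered as MMDDYYYY, e.g. 03062017_roster.csv
    filename='{{ (execution_date - macros.timedelta(days=1)).strftime("%m%d%Y") }}_roster.csv')
```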
I'm using Jupyter and pandas read_sql. This works fine but looks ugly.
For instance, I have a query:
SELECT *
FROM table_a AS a
LIMIT 10;
I could show it nicely in a markdown cell like so:
``` mysql
SELECT *
FROM table_a AS a
LIMIT 10;
```
and I could execute it in a code cell like so:
pd.read_sql('SELECT * FROM table_a AS a LIMIT 10;', conn)
This involves copy/pasting and displays the query twice (not ideal if I want to simply export my notebook to a PDF report).
Is there a way to avoid the duplication by reading the markdown text into a Python string variable, or any other way?
The cell-magic answer cited by @Micah Kornfield in the question comments may be a good fit for many situations. However, the question says it is desirable to avoid duplication. Let's imagine that the SQL is huge and we don't want to see the same query more than once.
Unfortunately, right now (in 2021) there's no easy solution for this. In a Jupyter notebook there are two worlds: the backend, which is the kernel and in our case runs Python, and the frontend, which runs JavaScript. Only JavaScript sees the markdown cells. It is possible to make the backend and frontend communicate with each other; those methods are usually a little hacky, but we will rely on some of them anyway.
I have written a script that does the job in two different ways, which should give similar results. I will call these the file-read method and the JavaScript method.
First, please save the following file, markdown.py, in the same folder as the notebook (we are using a separate file because you specified that your notebook will eventually go into a report, and it is undesirable to ship this script together with the notebook):
from IPython.display import Javascript
from urllib.parse import unquote
from json import loads as jsonloads

def markread(cellnumber, notebookname=None, callbackvar=None):
    try:
        if type(cellnumber) is int:  # maybe check if (varname in globals()):
            if callbackvar is not None and type(callbackvar) is str:
                return Javascript("const mdtjs = Jupyter.notebook.get_cells().filter(c=>c.cell_type==\"markdown\")["+str(cellnumber)+"].get_text(); IPython.notebook.kernel.execute(\"mdtp = unquote('\"+encodeURI(mdtjs)+\"');mdtp=mdtp[mdtp.find('\\\\n',mdtp.find('```'))+1:min(mdtp.rfind('\\\\n'),mdtp.rfind('```'))].strip();"+callbackvar+"=mdtp;del mdtp\");")
            if notebookname is not None and type(notebookname) is str:
                if not notebookname.endswith('.ipynb'):
                    notebookname += '.ipynb'
                with open(notebookname) as f:
                    j = jsonloads(f.read())
                mdts = [''.join(c['source'][1:]).strip().strip('`').strip() for c in j['cells'] if c['cell_type'] == 'markdown']
                return mdts[cellnumber]
    except:
        return None
    return None
Now back to the notebook, to load the script, you have to import it:
from markdown import markread, unquote
The unquote is needed for the JavaScript method; otherwise you can skip it.
1. File read method:
Usage:
marktext = markread(2, notebookname='mynotebookname')
Here marktext will get the value of the third markdown cell in mynotebookname (third because we live in a zero-indexed world, so 2 means third; if you skip the '.ipynb' extension in notebookname, as in this case, it will be appended automatically). Important: this method reads the notebook file written on disk, not the hot state of things. If you have changed anything since the last save, things may go wrong.
2. Javascript method:
Usage:
markread(1, callbackvar='marktext')
Here we write the value of our second markdown cell to a variable called marktext. The JavaScript method is trickier: it is async, so we have to pass the name of the variable we want to write to (it must be a string with the variable's name, not the variable itself). It is also important to know that markread must be the last command in the cell, due to a limitation in how the JavaScript is invoked.
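With either method, once the query text sits in a Python variable it can go straight into read_sql, which is what the question was after. A minimal sketch (conn is assumed to be your existing database connection, as in the question):
``` python
import pandas as pd

# file-read method: grab the SQL from the third markdown cell of this notebook
marktext = markread(2, notebookname='mynotebookname')

# run it and show the result; conn is assumed to be your existing DB connection
df = pd.read_sql(marktext, conn)
df
```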
How it works
Internally, the file-read method just reads the notebook file, which is JSON, picks the values from 'cells' and keeps only the ones whose type is markdown.
The JavaScript method, however, is more complex. It invokes JS because JS has access to the cells, including the markdown ones, so JS reads the cell values (from Jupyter.notebook.get_cells), filters the markdown ones, invokes Python back and sends those markdown cells over, URL-encoded. The encoded cells are then decoded and assigned to the callbackvar. In both methods I made some assumptions, which may not be correct, about trimming the start and the end of the cell value (the ``` and whitespace).
There are ways to improve the code, for example making it auto-detect the notebook name for the file-read method, but that involves even more hacks: relying again on JavaScript to get the notebook name, or making a call to the API on port 8888 and having to deal with the session password. I believe the most important part is covered already by this script. If one method does not work, you will probably still have the other.
I have a DAG with one DataflowTemplateOperator that can deal with different JSON files. When I trigger the DAG I pass some parameters via {{ dag_run.conf['param1'] }} and it works fine.
The issue I have is trying to rename the task_id based on param1.
i.e. task_id="df_operator_read_object_json_file_{{dag_run.conf['param1']}}",
it complains that only alphanumeric characters are allowed,
or
task_id="df_operator_read_object_json_file_{}".format(dag_run.conf['param1']),
it does not recognise dag_run, plus the same alphanumeric issue.
The whole idea behind this is that when I look at the Dataflow jobs console and a job has failed, I know who the offender is based on param1. Dataflow job names are based on the task_id, like this:
df-operator-read-object-json-file-8b9eecec
and what I need is this:
df-operator-read-object-param1-json-file-8b9eecec
Any ideas if this is possible?
There is no need to generate a new operator per file.
DataflowTemplatedJobStartOperator has a job_name parameter, which is also templated, so it can be used with Jinja.
I didn't test it but this should work:
from airflow.providers.google.cloud.operators.dataflow import DataflowTemplatedJobStartOperator

op = DataflowTemplatedJobStartOperator(
    task_id="df_operator_read_object_json_file",
    job_name="df_operator_read_object_json_file_{{ dag_run.conf['param1'] }}",
    template='gs://dataflow-templates/your_template',
    location='europe-west3',
)
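One extra wrinkle, in case param1 can ever contain characters that Dataflow job names do not accept (they must consist of lowercase letters, digits, and dashes): a Jinja filter can normalize the value inside the same templated field. A sketch along the same lines as above, also untested:
``` python
from airflow.providers.google.cloud.operators.dataflow import DataflowTemplatedJobStartOperator

op = DataflowTemplatedJobStartOperator(
    task_id="df_operator_read_object_json_file",
    # lower-case param1 and swap underscores for dashes so the rendered job name is Dataflow-friendly
    job_name="df-operator-read-object-{{ dag_run.conf['param1'] | lower | replace('_', '-') }}-json-file",
    template='gs://dataflow-templates/your_template',
    location='europe-west3',
)
```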
I'm currently working on a project in which I create fixtures with AliceBundle to run tests to make sure my API endpoints work properly. Everything works fine, except for the DateTime properties.
No matter what string I pass it, eg: <dateTime('2019-09-23 14:00:00')>, it always gives me the wrong date and time, usually something like: 1998-10-25T14:29:45+01:00.
Even using <dateTime('now')> does not work -- it gives me some pre-2000s date & time as well, while that's exactly what some examples I'd found do use.
A fixture could look something like this:
Path\To\Task\Entity:
    my_task:
        title: 'My tasks'
        description: 'These are all the tasks just for me!!!'
        startsAt: <dateTime('now')>
        endsAt: <dateTime('now')>
        createdBy: '#some_higher_user'
Ideally I just want to pass it a string so I can define both a Date and Time and make sure it works properly, in the right format.
Any help would be appreciated.
Looking here https://github.com/nelmio/alice/blob/master/doc/advanced-guide.md#functions we read:
function can be a Faker or a PHP native (or registered in the global scope) function.
So I would recommend trying a PHP native function that creates a \DateTime object:
<date_create_from_format('Y-m-d H:i:s', '2019-09-23 14:00:00')>
// or
<date_create('now')>
That's how it works: the <dateTime()> function takes an argument called $max, so what it does is create a date between some starting date (not sure which one, probably something in the 1900 range or so) and that $max argument.
I guess you will want to use <dateTimeBetween()>, which takes a startDate and an endDate and creates a fake date between them. So if startDate = endDate, for example <dateTimeBetween('2019-09-23 14:00:00', '2019-09-23 14:00:00')>, you should get the desired fixed date.
Take a look at the fzaninotto/Faker library documentation. It's the library used by AliceBundle to actually fake data. There you can see what DateTime-related functions you can use.
I need a BigQueryOperator task like the following one, in which I need to save the result of a query to a partitioned table. However, "month_start" needs to be derived from the actual DAG execution_date. I couldn't find any documents or examples on how to read the execution_date in my DAG definition script (in Python). Looking forward to some help here.
FYR: I'm on Airflow 1.8.2.
t1_invalid_geohash_by_traffic = BigQueryOperator(
    task_id='invalid_geohash_by_traffic',
    bql='SQL/dangerous-area/InvalidGeohashByTraffic.sql',
    params=params,
    destination_dataset_table='mydataset.mytable${}'.format(month_start),
    write_disposition='WRITE_TRUNCATE',
    bigquery_conn_id=CONNECTION_ID,
    use_legacy_sql=False
)
I think I found the answer. Just ran into this blog: https://cloud.google.com/blog/big-data/2017/07/how-to-aggregate-data-for-bigquery-using-apache-airflow
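For reference, the gist of that approach applied to the task above: destination_dataset_table is a templated field (worth double-checking on 1.8.2), so the month-start partition suffix can be rendered from the execution date directly in the string. A sketch only; the date expression is my guess at what "month_start" should mean for this table:
``` python
t1_invalid_geohash_by_traffic = BigQueryOperator(
    task_id='invalid_geohash_by_traffic',
    bql='SQL/dangerous-area/InvalidGeohashByTraffic.sql',
    params=params,
    # first day of the execution date's month, rendered as YYYYMMDD for the partition decorator
    destination_dataset_table='mydataset.mytable${{ execution_date.replace(day=1).strftime("%Y%m%d") }}',
    write_disposition='WRITE_TRUNCATE',
    bigquery_conn_id=CONNECTION_ID,
    use_legacy_sql=False
)
```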
I am trying to write a file watcher job in autosys that would watch out for a particular file. The file name format would be filename_ddmmyyyy.
The requirement is that the file arrives at 7:15am every day; the file watcher job starts running at 6:50am and runs till 8am. If the file is received by then, the job is successful, otherwise an alert is raised.
Now what I am trying to do is watch for the file filename_ddmmyyyy for a particular day. E.g. if today is 22nd Feb 2013, the file name will be filename_22022013, and this is the file that I am looking for. If I use wildcards like filename_*, it would look for all possible files, which I don't want.
I am not sure how to do this in Windows.
Any help would be much appreciated.
Let me know in case of questions.
You will need to use the profile job attribute to initialize variables when the job starts. One of these variables will need to be the date pattern you are looking for (you'll need another process that outputs that dynamically). Then, once you set it to a variable in your profile script, you can refer to that variable name from within the watch_file attribute.
Create a global variable holding the date and use that variable.
Example: filename_$${GV_DATE}
where GV_DATE has the format ddmmyyyy.
Pretty late to answer, but here is an answer without using a global variable. You can use the formatted system date variable in the file name:
File_to_watch: filename_%date:~10,4%%date:~4,2%%date:~7,2%
(With the default US-style %date% layout of "Ddd MM/DD/YYYY" this expands to yyyymmdd; for the ddmmyyyy format in the question you would reorder the substrings, e.g. filename_%date:~7,2%%date:~4,2%%date:~10,4%.)