Apache airflow backfill automatic date - airflow

I have a DAG which I want to use for backfilling my database table.
from airflow import DAG
from airflow.operators.postgres_operator import PostgresOperator
from datetime import datetime, timedelta
args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2018, 4, 1),
    'retry_delay': timedelta(minutes=1),
}
dag = DAG(dag_id='airflow_backfill', default_args=args, schedule_interval='@daily')
"""
Task for inserting data per day
"""
task1 = PostgresOperator(
    task_id='insert_new_row',
    postgres_conn_id='aws_pg',
    sql="INSERT INTO airflow_test(date_at) VALUES('2018-04-01')",
    dag=dag,
)
task2 = PostgresOperator(
    task_id='update_team_name',
    postgres_conn_id='aws_pg',
    sql="UPDATE airflow_test SET team_name = (SELECT team_name FROM teams ORDER BY RANDOM() LIMIT 1) WHERE team_name is NULL",
    dag=dag,
)
task1.set_downstream(task2)
I am inserting one row per day into the database, starting from 1 April 2018, but the problem is that the date_at value is hard-coded.
My question is: is there any way to use the date of the backfill run as the inserted value? I want 'date_at' to be set automatically during backfilling, but I haven't found any Airflow environment/config variable that gives me the backfill date.
I am using apache airflow 1.9.0. Thanks.

EDITED: You should be able to use a Jinja template to get the execution date. The sql field of PostgresOperator is templated, and the ds variable holds the execution date as a YYYY-MM-DD string:
task1 = PostgresOperator(
    task_id='insert_new_row',
    postgres_conn_id='aws_pg',
    sql="INSERT INTO airflow_test(date_at) VALUES('{{ ds }}')",
    dag=dag,
)
https://airflow.apache.org/code.html#default-variables
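To see what the template produces during a backfill, the following stdlib-only sketch imitates the substitution for a few daily runs. It is not how Airflow renders templates internally (Airflow uses Jinja); the helper names daily_execution_dates and render_for are made up for illustration, and only the ds variable is handled.

```python
from datetime import date, timedelta

SQL_TEMPLATE = "INSERT INTO airflow_test(date_at) VALUES('{{ ds }}')"


def daily_execution_dates(start, end):
    """Yield one date per daily run from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def render_for(execution_date):
    # Stand-in for Jinja rendering: substitute only the ds variable,
    # which is the execution date formatted as YYYY-MM-DD.
    return SQL_TEMPLATE.replace("{{ ds }}", execution_date.isoformat())


statements = [
    render_for(d)
    for d in daily_execution_dates(date(2018, 4, 1), date(2018, 4, 3))
]
for statement in statements:
    print(statement)
```

With the templated task in place, a backfill such as airflow backfill airflow_backfill -s 2018-04-01 -e 2018-04-03 should insert one row per day, each with its own date_at.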

Related

Get multiple variable Aspentech infoplus 21

I am writing a script in order to connect to an Aspentech Infoplus 21 database server.
When I query a single tag, I have no problems:
import pandas as pd
import pyodbc
from datetime import datetime
from datetime import timedelta
#---- Connect to IP21
conn = pyodbc.connect("DRIVER={AspenTech SQLplus};HOST=192.xxx.x.xxx;PORT=10014")
#---- Query string
tag = 'BAN0E10TI110V'
end = datetime.now()
start = end-timedelta (days=2)
end = end.strftime("%Y-%m-%d %H:%M:%S")
start=start.strftime("%Y-%m-%d %H:%M:%S")
sql = "select TS,VALUE from HISTORY "\
"where NAME='%s' "\
"and PERIOD = 300*10 "\
"and REQUEST = 2 "\
"and TS between TIMESTAMP'%s' and TIMESTAMP'%s'" % (tag, start, end)
data = pd.read_sql(sql,conn) # Pandas DataFrame with your data!
When I try to query multiple tags read from a CSV file (script below), I cannot get the required data:
import pandas as pd
import pyodbc
from datetime import datetime
from datetime import timedelta
#---- Connect to IP21
conn = pyodbc.connect("DRIVER={AspenTech SQLplus};HOST=192.xxx.x.xxx;PORT=10014")
tags = pd.read_csv("C:\\Users\\xxx\\TAGcsvIN.csv", decimal=',', sep=';', parse_dates=True)
#---- Query string
end = datetime.now()
start = end-timedelta (days=2)
end = end.strftime("%Y-%m-%d %H:%M:%S")
start=start.strftime("%Y-%m-%d %H:%M:%S")
sql = "select TS,VALUE from HISTORY "\
"where NAME='%s' "\
"and PERIOD = 300*10 "\
"and REQUEST = 2 "\
"and TS between TIMESTAMP'%s' and TIMESTAMP'%s'" % (tags['TAGcsv'], start, end)
data = pd.read_sql(sql,conn) # Pandas DataFrame with your data!
Does anyone know how to query multiple tags via a CSV file?
I'm not proficient in Python, but if you want to query several tags, you should build a query like this:
"where NAME IN ('tag1', 'tag2', 'tagN') "\

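For the Infoplus 21 question above, the IN clause can be assembled from a list of tag names. This is a rough sketch under a few assumptions: the function name build_history_query is made up, the NAME column is added to the select list so rows from different tags can be told apart, and tag names come from a trusted file (names containing a single quote are rejected rather than escaped). With pandas, the list could come from tags['TAGcsv'].tolist().

```python
def build_history_query(tag_names, start, end):
    """Build one HISTORY query covering several tags via NAME IN (...)."""
    for name in tag_names:
        # Refuse quotes instead of guessing at the server's escaping rules.
        if "'" in name:
            raise ValueError("tag name contains a quote: %r" % name)
    quoted = ", ".join("'%s'" % name for name in tag_names)
    return (
        "select NAME, TS, VALUE from HISTORY "
        "where NAME in (%s) "
        "and PERIOD = 300*10 "
        "and REQUEST = 2 "
        "and TS between TIMESTAMP'%s' and TIMESTAMP'%s'" % (quoted, start, end)
    )


# Example tag names are placeholders, not real tags.
sql = build_history_query(
    ["TAG_A", "TAG_B"], "2024-01-01 00:00:00", "2024-01-03 00:00:00"
)
print(sql)
```

The resulting string can then be passed to pd.read_sql(sql, conn) as in the single-tag script.
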
DynamoDB primary key for date range and user id

I'm still trying to wrap my head around primary key selection in DynamoDB. My current structure is the following, where userId is HASH and sort is RANGE.
userId | sort                           | event
------------------------------------------------
1      | 2021-01-18#u2d3-f3d5-s22d-3f52 | ...
1      | 2021-01-08#f1d3-s30x-s22d-w2d3 | ...
2      | 2021-02-21#s2d2-u2d3-230s-3f52 | ...
2      | 2021-02-13#w2d3-e5d5-w2d3-3f52 | ...
1      | 2021-01-19#f2d4-f3d5-s22d-3f52 | ...
1      | 2020-12-13#f3d5-e5d5-s22d-w2d3 | ...
2      | 2020-11-11#e5d5-u2d3-s22d-0j32 | ...
What I want to achieve is to query all events for a particular user between date A and date B. I have tested a few solutions that all work, such as:
Figure out a closest common begins_with for the range I want. If date A is 2019-02-01 and date B is 2021-01-03, then it would be userId = 1 and begins_with (sort, 20), which would return everything from the twenty-first century.
Loop through all months between date A and date B and do a bunch of small queries like userId = 1 and begins_with (sort, 2021-01), then concat the results afterwards.
They all work but have their drawbacks. I'm also a bit unsure of when I'm just complicating things to the point where a scan might actually be worth it instead. Being able to use between would of course be the best option, but I need to put the unique #guid at the end of the range key in order to make each primary key unique.
Am I approaching this the wrong way?
I created a little demo app to show how this works.
You can just use the between condition. DynamoDB compares string sort keys byte by byte, so a key such as 2021-03-03#<guid> sorts after the bare date string 2021-03-03 and before 2021-03-04. The idea is to convert start date A to a string and use it as the lower bound of the range, then add one day to end date B, convert that to a string, and use it as the upper bound, so that every key from date B itself is still included.
The script creates this table (it will look different when you run it):
PK | SK
------------------------------------------------------
demo | 2021-02-26#a4d0f5f3-588a-49d9-8eaa-a3e2f9436ade
demo | 2021-02-27#92b9a41b-9fa5-4ee7-8663-7b801192d8dd
demo | 2021-02-28#e5d162ac-3bbf-417a-9ec7-4024410e1b01
demo | 2021-03-01#7752629e-dc8f-47e0-8cb6-5ed219c434b5
demo | 2021-03-02#dd89ca33-965c-4fe1-8bcc-3d5eee5d6874
demo | 2021-03-03#b696a7fc-ba17-47d5-9d19-454c19e9bccc
demo | 2021-03-04#ee30b1ce-3910-4a59-9e62-09f051b0dc72
demo | 2021-03-05#f0e2405f-6ce9-4fcb-a798-394f7a2f9490
demo | 2021-03-06#bcf76e07-7582-4fe3-8ffd-14f450e60120
demo | 2021-03-07#58d01231-a58d-4c23-b1ed-e525ba102b80
And when I run this function to select the items between two given dates, it returns the result below:
def select_in_date_range(pk: str, start: datetime, end: datetime):
    table = boto3.resource("dynamodb").Table(TABLE_NAME)
    start = start.isoformat()[:10]
    end = (end + timedelta(days=1)).isoformat()[:10]
    print(f"Requesting all items starting at {start} and ending before {end}")
    result = table.query(
        KeyConditionExpression=\
            conditions.Key("PK").eq(pk) & conditions.Key("SK").between(start, end)
    )
    print("Got these items")
    for item in result["Items"]:
        print(f"PK={item['PK']}, SK={item['SK']}")
Requesting all items starting at 2021-02-27 and ending before 2021-03-04
Got these items
PK=demo, SK=2021-02-27#92b9a41b-9fa5-4ee7-8663-7b801192d8dd
PK=demo, SK=2021-02-28#e5d162ac-3bbf-417a-9ec7-4024410e1b01
PK=demo, SK=2021-03-01#7752629e-dc8f-47e0-8cb6-5ed219c434b5
PK=demo, SK=2021-03-02#dd89ca33-965c-4fe1-8bcc-3d5eee5d6874
PK=demo, SK=2021-03-03#b696a7fc-ba17-47d5-9d19-454c19e9bccc
Full script to try it yourself.
import uuid
from datetime import datetime, timedelta
import boto3
import boto3.dynamodb.conditions as conditions
TABLE_NAME = "sorting-test"
def create_table():
    ddb = boto3.client("dynamodb")
    ddb.create_table(
        AttributeDefinitions=[{"AttributeName": "PK", "AttributeType": "S"}, {"AttributeName": "SK", "AttributeType": "S"}],
        TableName=TABLE_NAME,
        KeySchema=[{"AttributeName": "PK", "KeyType": "HASH"}, {"AttributeName": "SK", "KeyType": "RANGE"}],
        BillingMode="PAY_PER_REQUEST"
    )
def create_sample_data():
    pk = "demo"
    amount_of_events = 10
    table = boto3.resource("dynamodb").Table(TABLE_NAME)
    start_date = datetime.now()
    increment = timedelta(days=1)
    print("PK | SK")
    print("------------------------------------------------------")
    for i in range(amount_of_events):
        date = start_date.isoformat()[:10]
        unique_id = str(uuid.uuid4())
        sk = f"{date}#{unique_id}"
        print(f"{pk} | {sk}")
        start_date += increment
        table.put_item(Item={"PK": pk, "SK": sk})
def select_in_date_range(pk: str, start: datetime, end: datetime):
    table = boto3.resource("dynamodb").Table(TABLE_NAME)
    start = start.isoformat()[:10]
    end = (end + timedelta(days=1)).isoformat()[:10]
    print(f"Requesting all items starting at {start} and ending before {end}")
    result = table.query(
        KeyConditionExpression=\
            conditions.Key("PK").eq(pk) & conditions.Key("SK").between(start, end)
    )
    print("Got these items")
    for item in result["Items"]:
        print(f"PK={item['PK']}, SK={item['SK']}")
def main():
    pass
    # create_table()
    # create_sample_data()
    start = datetime.now() + timedelta(days=1)
    end = datetime.now() + timedelta(days=5)
    select_in_date_range("demo", start, end)
if __name__ == "__main__":
    main()
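Why the upper bound needs the extra day can be checked without DynamoDB at all. The sketch below compares UTF-8 encoded strings the way a byte-ordered, inclusive between would; the helper name in_range and the shortened key are made up for illustration.

```python
def in_range(sort_key, lower, upper):
    """Mimic an inclusive BETWEEN on byte-ordered string keys."""
    key = sort_key.encode("utf-8")
    return lower.encode("utf-8") <= key <= upper.encode("utf-8")


key = "2021-03-03#b696a7fc"

# Upper bound is the bare end date: the key is longer, so it sorts after it.
print(in_range(key, "2021-02-27", "2021-03-03"))

# Upper bound is the end date plus one day: the key now falls inside.
print(in_range(key, "2021-02-27", "2021-03-04"))
```

The first call prints False and the second prints True, which is why the function above adds one day to the end date before querying.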

Which column should I pick up as secondary index in this dynamodb table?

I have a dynamodb table with the following attributes:
PurchaseOrderNumber (partition key)
CustomerID
PurchaseDate
TotalPurchaseValue
My application must retrieve items from the table to calculate the total value of purchases for a particular customer over a date range. What secondary index should I add to the table?
Thank you.
You can create a Global Secondary Index whose partition key is CustomerID and whose sort key is PurchaseDate. With this index, you can query a particular customer by CustomerID within a date range of PurchaseDate. Note that a query against a secondary index has to name it through the IndexName parameter. You can try out the query code below in the AWS Lambda IDE :)
import boto3
from boto3.dynamodb.conditions import Key
def lambda_handler(event, context):
    dynamodb = boto3.resource("dynamodb")
    table_name = "your_table_name"
    table = dynamodb.Table(table_name)
    customer_id = 1
    start_date = "2019-12-10"
    end_date = "2019-12-16"
    response = table.query(
        # Placeholder index name (an assumption); use the name you gave the GSI
        IndexName="your_index_name",
        KeyConditionExpression=Key('CustomerID').eq(customer_id) &
            Key('PurchaseDate').between(start_date, end_date)
    )
    return response
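To get from the query result to the total the question asks about, the matching items still have to be summed, and a single query call may return only part of the result when it is large. A possible sketch, assuming every item carries a numeric TotalPurchaseValue (boto3 returns DynamoDB numbers as Decimal); the helper names query_all_items and total_purchase_value are made up, and table is expected to behave like a boto3 Table resource.

```python
from decimal import Decimal


def query_all_items(table, **query_kwargs):
    """Collect items across result pages by following LastEvaluatedKey."""
    kwargs = dict(query_kwargs)  # copy so the caller's arguments stay untouched
    items = []
    while True:
        response = table.query(**kwargs)
        items.extend(response.get("Items", []))
        last_key = response.get("LastEvaluatedKey")
        if last_key is None:
            return items
        # Ask for the next page, starting where the previous one stopped.
        kwargs["ExclusiveStartKey"] = last_key


def total_purchase_value(items):
    """Sum the TotalPurchaseValue attribute over all items."""
    return sum((item["TotalPurchaseValue"] for item in items), Decimal("0"))
```

In the handler above, the same IndexName and KeyConditionExpression arguments could be passed to query_all_items(table, ...) and the returned list handed to total_purchase_value.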

sqlalchemy with sqlite to_sql not creating table nor database

I am new to SQLAlchemy. Before I run this code there is no database; I want the code to create the database, add the table defined below, and insert the data. According to the documentation for to_sql, this code should create the table if it doesn't exist (it doesn't). When I run it, it throws an error saying that the table has no column "num 1", and it does not create the database. What am I doing wrong?
import pandas as pd
import sqlite3
from sqlalchemy import create_engine
date_stuff = [ (20171219, 13.71,28), (20171319, 144.71,33), (20171919, 99.99,99)]
labels = ['date', 'num 1' , 'num 2']
dev_env = "/home/test/Desktop/mtest/hvdata/"
db_name = "tinydatabase.db"
def new_sql_add ( todays_data ):
    todays_data.to_sql(name='mcm_trends', con = db ,if_exists='append')
if __name__ == '__main__' :
    db_path = dev_env + db_name
    db = create_engine('sqlite:///db_path')
    df_for_sql = pd.DataFrame.from_records( date_stuff , columns = labels)
    new_sql_add ( df_for_sql )

Python TimeDelta Add Day to Supplied Argument

Not sure how to approach this one.
The user supplies an argument, e.g. program.exe '2001-08-12'
I need to add a single day to that argument - this will represent a date range for another part of the program. I am aware that you can add or subtract from the current day but how does one add or subtract from a user supplied date?
import datetime
...
date=time.strptime(argv[1], "%y-%m-%d");
newdate=date + datetime.timedelta(days=1)
Arnaud's code is valid; here is how to use it:
>>> import datetime
>>> x=datetime.datetime.strptime('2001-08-12','%Y-%m-%d')
>>> newdate=x + datetime.timedelta(days=1)
>>> newdate
datetime.datetime(2001, 8, 13, 0, 0)
>>>
Okay, here's what I've got:
import sys
from datetime import datetime, timedelta
user_input = sys.argv[1] # Get their date string
year_month_day = user_input.split('-') # Split it into [year, month, day]
year = int(year_month_day[0])
month = int(year_month_day[1])
day = int(year_month_day[2])
# Add a timedelta instead of using day+1, which fails on the last day of a month
date_plus_a_day = datetime(year, month, day) + timedelta(days=1)
I understand this is a little long, but I wanted to make sure each step was clear. I'll leave shortening it up to you if you want it shorter.
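For reference, a shorter variant that combines strptime with timedelta might look like the sketch below. The function name next_day is made up, and the input is assumed to be in YYYY-MM-DD form; adding a timedelta handles month and leap-year boundaries.

```python
from datetime import datetime, timedelta


def next_day(date_string):
    """Parse a YYYY-MM-DD string and return the following day in the same format."""
    parsed = datetime.strptime(date_string, "%Y-%m-%d")
    return (parsed + timedelta(days=1)).strftime("%Y-%m-%d")


print(next_day("2001-08-12"))  # 2001-08-13
print(next_day("2001-08-31"))  # 2001-09-01
print(next_day("2000-02-28"))  # 2000-02-29
```

In the program itself, the argument would be passed as next_day(sys.argv[1]).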

Resources