How to pass parameters to MWAA (Airflow) DAG from Lambda?

I need help passing arguments (conf params) to an MWAA (Airflow) DAG from Lambda. The Lambda is used to trigger the DAG on an SQS event.
The DAG runs fine without command line params.
import boto3
import http.client
import base64
import ast
mwaa_env_name = 'dflow_dev_2'
dag_name = 'tpda_test'
mwaa_cli_command = 'dags trigger'
client = boto3.client('mwaa')
def lambda_handler(event, context):
    # get web token
    mwaa_cli_token = client.create_cli_token(
        Name=mwaa_env_name
    )
    conn = http.client.HTTPSConnection(mwaa_cli_token['WebServerHostname'])
    payload = "dags trigger " + dag_name + "--conf '{'name':'v111'}' "
    headers = {
        'Authorization': 'Bearer ' + mwaa_cli_token['CliToken'],
        'Content-Type': 'text/plain'
    }
    conn.request("POST", "/aws_mwaa/cli", payload, headers)
    res = conn.getresponse()
    data = res.read()
    dict_str = data.decode("UTF-8")
    mydata = ast.literal_eval(dict_str)
    return base64.b64decode(mydata['stdout'])

There may be an issue with the payload. The code below adds a space between the DAG name and --conf and changes the single quotes in the JSON string to double quotes.
conf = "{\"" + "name" + "\":\"" + "v111" + "\"}"
payload = "dags trigger " + dag_name + " --conf '{}'".format(conf)
Notice the different values for payload before and after.
Before:
dags trigger tpda_test--conf '{'name':'v111'}'
After:
dags trigger tpda_test --conf '{"name":"v111"}'
Reference: Add a configuration when triggering a DAG (AWS)
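As an extra safeguard (a sketch, not part of the referenced AWS example), the conf string can be built with json.dumps so the quoting is always valid JSON; dag_name is the same value used in the Lambda above:

import json

dag_name = 'tpda_test'                    # same DAG name as in the question
conf = json.dumps({"name": "v111"})       # -> {"name": "v111"}, double-quoted JSON
payload = "dags trigger {} --conf '{}'".format(dag_name, conf)
# payload == dags trigger tpda_test --conf '{"name": "v111"}'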

Related

Asyncio and Playwright scraping loop, How to await task to parse data at the end of the main function

I am new to asyncio and Playwright. I have done as much research as I can on my own, but after going back and forth I still cannot figure this out. I am not sure what I am doing wrong in this part of my code.
I loop over my URLs in my scraper with Playwright to get the XHR responses, and that part works fine. The issue is how to pass each response.json() through my parser function and make sure that parsing is the final step, so that no data is lost while the requests run concurrently. Ideally I would append all the responses and then parse them in one go, since a single final parse would cover the JSON retrieved from all input URLs. However, as you can see, I get the following error:
Exception in callback AsyncIOEventEmitter._emit_run.<locals>._callback(<Task finishe...osing scope")>) at C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\pyee\_asyncio.py:55
handle: <Handle AsyncIOEventEmitter._emit_run.<locals>._callback(<Task finishe...osing scope")>) at C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\pyee\_asyncio.py:55>
Traceback (most recent call last):
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\asyncio\events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\pyee\_asyncio.py", line 62, in _callback
    self.emit('error', exc)
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\pyee\_base.py", line 116, in emit
    self._emit_handle_potential_error(event, args[0] if args else None)
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\pyee\_base.py", line 86, in _emit_handle_potential_error
    raise error
  File "d:\Projects\AXS\ticketbuylinktest.py", line 33, in handle_response
    asyncio.create_task(data_parse(dataarr))
NameError: free variable 'data_parse' referenced before assignment in enclosing scope
I understand that I am supposed to move this, but I am not sure where it should be moved so that the parsing runs last.
import ast
from asyncio import tasks
import json
from operator import contains
from urllib import response
from playwright.async_api import async_playwright
import asyncio

urlarr = ['http://shop.samplesite.com/?c=XXX&e=10254802429572333','http://shop.samplesite.com/?c=XXXX&e=10254802429581183']
proxy_to_use = {"server": "http://myproxy.io:19992","username": "XXXXX","XXXX": "XXXXX"}

dataarr = []
finaldata = []

async def main(url):
    print("\n\n\n\nURL BEING RUNNED", url)

    async def handle_response(response):
        l = str(response.url)
        checkstring = 'utm_cid'
        para = '/veritix/Inv/v2/'
        if para in l:
            filterurl = response.url
            if checkstring in filterurl:
                print(response.url)
                await asyncio.sleep(5)
                data = await response.json()
                print("\n\n\n\nDATA:", data)
                dataarr.append(data)
                asyncio.create_task(data_parse(dataarr))

    async with async_playwright() as pw:
        browser = await pw.chromium.launch(
            headless=False,)
        page = await browser.new_page(user_agent='My user agent')
        # Data Extraction Code Here
        page.on("response", lambda response: asyncio.create_task(handle_response(response)))
        await page.goto(url)
        await page.wait_for_timeout(3*5000)
        await browser.close()
    #print("\n\n\n\n",dataarr)

    async def data_parse(dataarr):
        jsond = json.dumps(dataarr, indent=2)
        jsonf = json.loads(jsond)
        eventid = jsonf[0]['offerPrices'][0]['zonePrices'][0]['eventID']
        sectiondata = jsonf[0]['offerPrices'][0]['zonePrices'][0]['priceLevels']
        subdata = []
        subdata.append(eventid)
        for sec in sectiondata:
            section = sec['label']
            inventorycount = sec['availability']['amount']
            price = (sec['prices'][0]['base'])/100
            subdata.extend([section, inventorycount, price])
        finaldata.append(subdata)
        print(finaldata)

async def go_to_url():
    tasks = [main(url) for url in urlarr]
    await asyncio.wait(tasks)

asyncio.get_event_loop().run_until_complete(go_to_url())
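For what it's worth, the NameError in the traceback is ordinary Python closure behaviour rather than anything Playwright-specific: the wording suggests handle_response picks up data_parse as a free variable of main (i.e. data_parse is defined inside main in the actual file), and handle_response runs while the page is still loading, before execution has reached the async def data_parse line, so the name is not yet bound. A minimal reproduction with hypothetical names:

def outer():
    def inner():
        helper()          # 'helper' is a free variable taken from outer's scope

    inner()               # NameError: free variable 'helper' referenced before
                          # assignment in enclosing scope (Python 3.10 wording)

    def helper():
        print("defined too late; inner() has already run")

outer()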

Airflow XCom not getting resolved, returns task_instance string

I am facing an odd issue with xcom_pull where it always returns the literal xcom_pull string back:
"{{ task_instance.xcom_pull(dag_id = 'cf_test',task_ids='get_config_val',key='http_con_id') }}"
My requirement is simple: I push an XCom using a PythonOperator, and with xcom_pull I try to retrieve the value and pass it as the http_conn_id for SimpleHttpOperator, but the variable returns the string instead of the resolved xcom_pull value.
The PythonOperator is successfully able to push the XCom.
Code:
from datetime import datetime

import simplejson as json
from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from airflow.providers.http.operators.http import SimpleHttpOperator
from google.auth.transport.requests import Request

default_airflow_args = {
    "owner": "divyaansh",
    "depends_on_past": False,
    "start_date": datetime(2022, 5, 18),
    "retries": 0,
    "schedule_interval": "@hourly",
}

project_configs = {
    "project_id": "test",
    "conn_id": "google_cloud_storage_default",
    "bucket_name": "test-transfer",
    "folder_name": "processed-test-rdf",
}

def get_config_vals(**kwargs) -> dict:
    """
    Get config vals from airflow variable and store it as xcoms
    """
    task_instance = kwargs["task_instance"]
    task_instance.xcom_push(key="http_con_id", value="gcp_cloud_function")

def generate_api_token(cf_name: str):
    """
    generate token for api request
    """
    import google.oauth2.id_token

    request = Request()
    target_audience = f"https://us-central1-test-a2h.cloudfunctions.net/{cf_name}"
    return google.oauth2.id_token.fetch_id_token(
        request=request, audience=target_audience
    )

with DAG(
    dag_id="cf_test",
    default_args=default_airflow_args,
    catchup=False,
    render_template_as_native_obj=True,
) as dag:
    start = DummyOperator(task_id="start")

    config_vals = PythonOperator(
        task_id="get_config_val", python_callable=get_config_vals, provide_context=True
    )

    ip_data = json.dumps(
        {
            "bucket_name": project_configs["bucket_name"],
            "file_name": "dummy",
            "target_location": "/valid",
        }
    )

    conn_id = "{{ task_instance.xcom_pull(dag_id = 'cf_test',task_ids='get_config_val',key='http_con_id') }}"
    api_token = generate_api_token("new-cp")

    cf_task = SimpleHttpOperator(
        task_id="file_decrypt_and_validate_cf",
        http_conn_id=conn_id,
        method="POST",
        endpoint="new-cp",
        data=json.dumps(
            json.dumps(
                {
                    "bucket_name": "test-transfer",
                    "file_name": [
                        "processed-test-rdf/dummy_20220501.txt",
                        "processed-test-rdf/dummy_20220502.txt",
                    ],
                    "target_location": "/valid",
                }
            )
        ),
        headers={
            "Authorization": f"bearer {api_token}",
            "Content-Type": "application/json",
        },
        do_xcom_push=True,
        log_response=True,
    )
    print("task new-cp", cf_task)

    check_flow = DummyOperator(task_id="check_flow")
    end = DummyOperator(task_id="end")

    start >> config_vals >> cf_task >> check_flow >> end
Error Message:
raise AirflowNotFoundException(f"The conn_id `{conn_id}` isn't defined") airflow.exceptions.AirflowNotFoundException: The conn_id `"{{ task_instance.xcom_pull(dag_id = 'cf_test',task_ids='get_config_val',key='http_con_id') }}"` isn't defined
I have tried several different ways but nothing seems to be working.
Can someone point me in the right direction here?
Airflow-version : 2.2.3
Composer-version : 2.0.11
In SimpleHttpOperator the http_conn_id parameter is not a templated field, so you cannot use the Jinja engine with it; the parameter is simply not rendered. When you pass "{{ task_instance.xcom_pull(dag_id = 'cf_test',task_ids='get_config_val',key='http_con_id') }}" to the operator, you expect it to be replaced at runtime with the value stored in XCom by the previous task, but Airflow treats it as a regular string, which is exactly what the exception tells you: it tries to find a connection whose name is your very long string, cannot find one, and reports that the connection is not defined.
To solve it you can create a custom operator:
class MySimpleHttpOperator(SimpleHttpOperator):
    template_fields = SimpleHttpOperator.template_fields + ("http_conn_id",)
Then you should replace SimpleHttpOperator with MySimpleHttpOperator in your DAG.
This change makes the string you set in http_conn_id pass through the Jinja engine, so in your case it will be replaced with the XCom value as you expect.
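For reference, a sketch of how the subclass would slot into the DAG above (the task id and Jinja expression are the ones from the question; treat the rest as an illustration placed inside the same with DAG(...) block, not as tested code):

from airflow.providers.http.operators.http import SimpleHttpOperator

class MySimpleHttpOperator(SimpleHttpOperator):
    # http_conn_id is now part of template_fields, so Jinja renders it at runtime.
    template_fields = SimpleHttpOperator.template_fields + ("http_conn_id",)

# Inside the existing `with DAG(...) as dag:` block from the question:
cf_task = MySimpleHttpOperator(
    task_id="file_decrypt_and_validate_cf",
    # Rendered at runtime to "gcp_cloud_function", the value pushed by get_config_val.
    http_conn_id="{{ task_instance.xcom_pull(task_ids='get_config_val', key='http_con_id') }}",
    method="POST",
    endpoint="new-cp",
)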

Encoding problem with GET requests in Haskell

I'm trying to get some Json data from a Jira server using Haskell. I'm counting this as "me having problems with Haskell" rather than encodings or Jira because my problem is when doing this in Haskell.
The problem occurs when the URL (or query) has plus signs. After building my request for theproject+order+by+created, Haskell prints it as:
Request {
host = "myjiraserver.com"
port = 443
secure = True
requestHeaders = [("Content-Type","application/json"),("Authorization","<REDACTED>")]
path = "/jira/rest/api/2/search"
queryString = "?jql=project%3Dtheproject%2Border%2Bby%2Bcreated"
method = "GET"
proxy = Nothing
rawBody = False
redirectCount = 10
responseTimeout = ResponseTimeoutDefault
requestVersion = HTTP/1.1
}
But the request fails with this response:
- 'Error in the JQL Query: The character ''+'' is a reserved JQL character. You must
enclose it in a string or use the escape ''\u002b'' instead. (line 1, character
21)'
So it seems like Jira didn't like Haskell's %2B. Do you have any suggestions on what I can do to fix this, or any resources that might be helpful? The same request sans the +order+by+created part is successful.
The code (patched together from these examples):
{-# LANGUAGE OverloadedStrings #-}
import Data.Aeson
import qualified Data.ByteString.Char8 as S8
import qualified Data.Yaml as Yaml
import Network.HTTP.Simple
import System.Environment (getArgs)

-- auth' is echo -e "username:passwd" | base64
foo urlBase proj' auth' = do
  let proj = S8.pack (proj' ++ "+order+by+created")
      auth = S8.pack auth'
  request'' <- parseRequest urlBase
  let request'
        = setRequestMethod "GET"
        $ setRequestPath "/jira/rest/api/2/search"
        $ setRequestHeader "Content-Type" ["application/json"]
        $ request''
      request
        = setRequestQueryString [("jql", Just (S8.append "project=" proj))]
        $ setRequestHeader "Authorization" [S8.append "Basic " auth]
        $ request'
  return request

main :: IO ()
main = do
  args <- getArgs
  case args of
    (urlBase:proj:auth:_) -> do
      request <- foo urlBase proj auth
      putStrLn $ show request
      response <- httpJSON request
      S8.putStrLn $ Yaml.encode (getResponseBody response :: Value) -- apparently this is required
      putStrLn ""
    _ -> putStrLn "usage..."
(If you know a simpler way to do the above then I'd take such suggestions as well, I'm just trying to do something analogous to this Python:
import requests
import sys

if len(sys.argv) >= 4:
    urlBase = sys.argv[1]
    proj = sys.argv[2]
    auth = sys.argv[3]
    urlBase += "/jira/rest/api/2/search?jql=project="
    proj += "+order+by+created"
    h = {}
    h["content-type"] = "application/json"
    h["authorization"] = "Basic " + auth
    r = requests.get(urlBase + proj, headers=h)
    print(r.json())
)
project+order+by+created is the URL-encoded string for the actual request project order by created (with spaces instead of +). The function setRequestQueryString expects a raw request (with spaces, not URL-encoded), and URL-encodes it.
The Python script you give for comparison essentially does the URL-encoding by hand.
So the fix is to put the raw request in proj:
foo urlBase proj' auth' = do
  let proj = S8.pack (proj' ++ " order by created")  -- spaces instead of +
  ...
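The same idea carries over to the Python comparison script: hand the raw JQL (with spaces) to requests via params and let the library do the percent-encoding, just as setRequestQueryString does. A sketch with placeholder values for the server and credentials:

import requests

url_base = "https://myjiraserver.com/jira/rest/api/2/search"   # placeholder server
auth = "<base64 of username:passwd>"                           # placeholder credentials

headers = {
    "content-type": "application/json",
    "authorization": "Basic " + auth,
}

# Raw query value; requests URL-encodes the spaces, '=' and so on for you.
params = {"jql": "project=theproject order by created"}

r = requests.get(url_base, params=params, headers=headers)
print(r.json())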

Nginx server with uwsgi,flask and sleekxmpp

I'm trying to handle some messages using an nginx server with uWSGI, Flask and SleekXMPP.
Here is the code.
import ssl, json, logging, threading, time
from flask import Flask
from sleekxmpp import ClientXMPP
from sleekxmpp.exceptions import IqError, IqTimeout

smsg = """{
    "version":1,
    "type":"request",
    "messageId":"xxyyzz",
    "payload":
    {
        "deviceType":"ctlr",
        "command":"getDeviceInfo"
    }
}"""

class XMPP(ClientXMPP):
    rosterList = []

    def __init__(self, jid, password):
        ClientXMPP.__init__(self, jid, password)
        self.add_event_handler('session_start', self.session_start, threaded=True)
        self.add_event_handler('message', self.message, threaded=True)
        self.ssl_version = ssl.PROTOCOL_SSLv23

    def session_start(self, event):
        self.send_presence(pshow='online')
        try:
            self.rosterList.append(self.get_roster())
        except IqError as err:
            print 'Error: %s' % err.iq['error']['condition']
        except IqTimeout:
            print 'Error: Request time out'

    def message(self, msg):
        data = msg['body'][12:]
        dictData = json.loads(data)
        print data
        if 'payload' in dictData.keys():
            for lists in dictData['payload']['indexes']:
                print lists
        elif 'message' in dictData.keys():
            print 'Request accepted'

app = Flask(__name__)
#logging.basicConfig(level = logging.DEBUG)

xmpp = XMPP('jid', 'password')

class XmppThread(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)

    def run(self):
        if xmpp.connect(('server', '5222')):
            xmpp.process(block=True)

xt = XmppThread()
xt.start()

@app.route('/')
def send():
    xmpp.send_message(mto='receiver', mbody=smsg, mtype='chat')
    return '<h1>Send</h1>'
I run the code with uWSGI using these options:
[uwsgi]
uid = uwsgi
gid = uwsgi
pidfile = /run/uwsgi/uwsgi.pid
emperor = /etc/uwsgi.d
stats = /run/uwsgi/stats.sock
chmod-socket = 660
emperor-tyrant = true
cap = setgid,setuid
[uwsgi]
plugin = python
http-socket = :8080
wsgi-file = /var/www/uwsgi/flask_uwsgi.py
callable = app
module = app
enable-threads = True
logto = /var/www/uwsgi/flask_uwsgi.log
When I run uwsgi from the command line, e.g. '/usr/sbin/uwsgi --ini uwsgi.ini', it works well: I can send and receive messages. But when I run it as a service on CentOS 7, receiving works while sending does not.
Do I need some more options, or am I missing something?

Airflow dynamic DAG and Task Ids

I mostly see Airflow being used for ETL/Big data related jobs. I'm trying to use it for business workflows where a user action triggers a set of dependent tasks in the future. Some of these tasks may need to be cleared (deleted) based on certain other user actions.
I thought the best way to handle this would be via dynamic task ids. I read that Airflow supports dynamic DAG ids, so I created a simple Python script that takes a DAG id and a task id as command line parameters. However, I'm running into problems making it work: it gives a dag_id not found error. Has anyone tried this? Here's the code for the script (call it tmp.py), which I execute on the command line as python tmp.py 820 2016-08-24T22:50:00:
from __future__ import print_function
import os
import sys
import shutil
from datetime import date, datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

execution = '2016-08-24T22:20:00'

if len(sys.argv) > 2:
    dagid = sys.argv[1]
    taskid = 'Activate' + sys.argv[1]
    execution = sys.argv[2]
else:
    dagid = 'DAGObjectId'
    taskid = 'Activate'

default_args = {'owner': 'airflow', 'depends_on_past': False, 'start_date': date.today(), 'email': ['fake@fake.com'], 'email_on_failure': False, 'email_on_retry': False, 'retries': 1}

dag = DAG(dag_id=dagid,
          default_args=default_args,
          schedule_interval='@once',
          )
globals()[dagid] = dag

task1 = BashOperator(
    task_id=taskid,
    bash_command='ls -l',
    dag=dag)

fakeTask = BashOperator(
    task_id='fakeTask',
    bash_command='sleep 5',
    retries=3,
    dag=dag)

task1.set_upstream(fakeTask)

airflowcmd = "airflow run " + dagid + " " + taskid + " " + execution
print("airflowcmd = " + airflowcmd)
os.system(airflowcmd)
After numerous trials and errors, I was able to figure this out. Hopefully it will help someone. Here's how it works: you need an iterator or an external source (file/database table) to generate DAGs/tasks dynamically through a template. You can keep the DAG and task names static and just assign them ids dynamically in order to differentiate one DAG from another. You put this Python script in the dags folder. When you start the Airflow scheduler, it runs through this script on every heartbeat and writes the DAGs to the dag table in the database. If a DAG (unique dag id) has already been written, it simply skips it. The scheduler also looks at the schedule of each individual DAG to determine which one is ready for execution; if a DAG is ready, it executes it and updates its status.
Here's a sample code:
from airflow.operators import PythonOperator
from airflow.operators import BashOperator
from airflow.models import DAG
from datetime import datetime, timedelta
import sys
import time

dagid = 'DA' + str(int(time.time()))
taskid = 'TA' + str(int(time.time()))
input_file = '/home/directory/airflow/textfile_for_dagids_and_schedule'

def my_sleeping_function(random_base):
    '''This is a function that will run within the DAG execution'''
    time.sleep(random_base)

def_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime.now(), 'email_on_failure': False,
    'retries': 1, 'retry_delay': timedelta(minutes=2)
}

with open(input_file, 'r') as f:
    for line in f:
        args = line.strip().split(',')
        if len(args) < 7:  # need the dag suffix plus 6 date/time fields (indices 0-6)
            continue
        dagid = 'DAA' + args[0]
        taskid = 'TAA' + args[0]
        yyyy = int(args[1])
        mm = int(args[2])
        dd = int(args[3])
        hh = int(args[4])
        mins = int(args[5])
        ss = int(args[6])
        dag = DAG(
            dag_id=dagid, default_args=def_args,
            schedule_interval='@once', start_date=datetime(yyyy, mm, dd, hh, mins, ss)
        )
        myBashTask = BashOperator(
            task_id=taskid,
            bash_command='python /home/directory/airflow/sendemail.py',
            dag=dag)
        task2id = taskid + '-X'
        task_sleep = PythonOperator(
            task_id=task2id,
            python_callable=my_sleeping_function,
            op_kwargs={'random_base': 10},
            dag=dag)
        task_sleep.set_upstream(myBashTask)
f.close()
From How can I create DAGs dynamically?:
Airflow looks in you [sic] DAGS_FOLDER for modules that contain DAG objects in their global namespace, and adds the objects it finds in the DagBag. Knowing this all we need is a way to dynamically assign variable in the global namespace, which is easily done in python using the globals() function for the standard library which behaves like a simple dictionary.
for i in range(10):
    dag_id = 'foo_{}'.format(i)
    globals()[dag_id] = DAG(dag_id)
    # or better, call a function that returns a DAG object!
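A sketch of the "function that returns a DAG object" variant the FAQ hints at (the factory name, dates and bash command are illustrative, not from the original answers):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

def create_dag(dag_id):
    # Build one DAG per id; the caller registers it in the global namespace.
    dag = DAG(dag_id, start_date=datetime(2016, 8, 24), schedule_interval='@once')
    BashOperator(task_id='Activate', bash_command='ls -l', dag=dag)
    return dag

for i in range(10):
    dag_id = 'foo_{}'.format(i)
    globals()[dag_id] = create_dag(dag_id)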
Copying my answer from this question; this applies only to v2.3 and above.
This is achieved using Dynamic Task Mapping, which is available only in Airflow 2.3 and higher.
More documentation and example here:
Official Dynamic Task Mapping documentation
Tutorial from Astronomer
Example:
from datetime import datetime

from airflow import DAG
from airflow.decorators import task

@task
def make_list():
    # This can also be from an API call, checking a database, -- almost anything you like, as long as the
    # resulting list/dictionary can be stored in the current XCom backend.
    return [1, 2, {"a": "b"}, "str"]

@task
def consumer(arg):
    print(list(arg))

with DAG(dag_id="dynamic-map", start_date=datetime(2022, 4, 2)) as dag:
    consumer.expand(arg=make_list())

Example 2:
from airflow import XComArg

task = MyOperator(task_id="source")
downstream = MyOperator2.partial(task_id="consumer").expand(input=XComArg(task))
The graph view and tree view are also updated to show the mapped tasks.
Relevant issues here:
https://github.com/apache/airflow/projects/12
