I am trying to create a DAG with a PythonOperator and a set of default parameters that can be overridden in the UI. This is meant to be a working example that other users can modify and run.
Looking at other examples on SO, I came up with something like this:
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def example(ds=None, **kwargs):
    return ""


with DAG(
    dag_id="example_dag",
    start_date=datetime(2021, 1, 1),
    catchup=False,
    params={
        "accounts": [
            {
                "name": "account_1",
                "country": "usa"
            }
        ]
    },
) as dag:
    operator = PythonOperator(task_id="example_task", python_callable=example)
    operator
However, when I look at the params through the Airflow UI, I see a bunch of extra fields that mess up the structure above:
{
    "accounts": [
        {
            "__var": {
                "name": "account_1",
                "country": "usa"
            },
            "__type": "dict"
        }
    ]
}
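For illustration, reaching the values by going through __var would look something like this inside the python_callable (just a sketch against the serialized structure shown above, not how I want to write it):

def example(ds=None, **kwargs):
    params = kwargs["params"]
    # Sketch only: index through the serialization wrapper instead of params["accounts"][0]["name"].
    first_account = params["accounts"][0]["__var"]
    return first_account["name"]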
I have tried a number of different ways of accessing the data without going through __var, but nothing has worked.
How can I work with such default parameters? More generally, is there a way to populate default params for DAG Runs so these fields do not show up?
Airflow version: 2.0.2
Trying to create an EMR cluster, retrieving data from AWS Secrets Manager.
I am trying to write an Airflow DAG; my task is to get data from this get_secret function and use it in SPARK_STEPS:
import json

import boto3
from airflow.models import Variable


def get_secret():
    secret_name = Variable.get("secret_name")
    region_name = Variable.get("region_name")
    # Create a Secrets Manager client
    session = boto3.session.Session()
    client = session.client(service_name='secretsmanager', region_name=region_name)
    account_id = boto3.client('sts').get_caller_identity().get('Account')
    try:
        get_secret_value_response = client.get_secret_value(SecretId=secret_name)
        if 'SecretString' in get_secret_value_response:
            secret_str = get_secret_value_response['SecretString']
            secret = json.loads(secret_str)
            airflow_path = secret["airflow_path"]
            return airflow_path
    ...
I need to use the "airflow_path" return value in SPARK_STEPS below:
SPARK_STEPS = [
    {
        'Name': 'Spark-Submit Command',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': [
                'spark-submit',
                '--py-files',
                's3://' + airflow_path + '-pyspark/pitchbook/config.zip,s3://' + airflow_path + '-pyspark/pitchbook/jobs.zip,s3://' + airflow_path + '-pyspark/pitchbook/DDL.zip',
                's3://' + airflow_path + '-pyspark/pitchbook/main.py',
            ],
        },
    },
]
I saw on the internet that I need to use XCom. Is this right? And do I need to run this function in a PythonOperator first and then get the value? Please provide an example, as I am a newbie.
Thanks for your help.
Xi
Yes, if you would like to pass dynamic values, leveraging XCom push/pull might be easier.
Leverage a PythonOperator to push the data into XCom.
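A rough sketch of that approach (the task ids, the EmrAddStepsOperator usage and its import path, and the templated xcom_pull inside the step args follow the referenced example DAG and may need adjusting to your provider version and setup):

from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.operators.emr_add_steps import EmrAddStepsOperator

# Push: run get_secret() in a PythonOperator; its return value is stored in XCom.
get_airflow_path = PythonOperator(
    task_id="get_airflow_path",
    python_callable=get_secret,
)

# Pull: reference the pushed value with a Jinja template inside the steps definition.
airflow_path_tpl = "{{ ti.xcom_pull(task_ids='get_airflow_path') }}"

SPARK_STEPS = [
    {
        'Name': 'Spark-Submit Command',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': [
                'spark-submit',
                '--py-files',
                's3://' + airflow_path_tpl + '-pyspark/pitchbook/config.zip',
                's3://' + airflow_path_tpl + '-pyspark/pitchbook/main.py',
            ],
        },
    },
]

# steps is a templated field on EmrAddStepsOperator, so the xcom_pull expression
# is rendered at run time with the value pushed above. The job_flow_id below
# assumes a separate cluster-creation task (hypothetical task id).
add_steps = EmrAddStepsOperator(
    task_id="add_steps",
    job_flow_id="{{ ti.xcom_pull(task_ids='create_emr_cluster', key='return_value') }}",
    steps=SPARK_STEPS,
)

get_airflow_path >> add_steps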
See reference implementation:
https://github.com/apache/airflow/blob/7fed7f31c3a895c0df08228541f955efb16fbf79/airflow/providers/amazon/aws/example_dags/example_emr.py
https://github.com/apache/airflow/blob/7fed7f31c3a895c0df08228541f955efb16fbf79/airflow/providers/amazon/aws/example_dags/example_emr.py#L108
https://www.startdataengineering.com/post/how-to-submit-spark-jobs-to-emr-cluster-from-airflow/
I have an API that returns some nested JSON data that I would like Airflow to read. However, Airflow is giving me an error saying that the nested fields are NULL. These fields are not NULL, and I can see the JSON data in any manual GET requests I make to the API.
How do I modify my pipeline in order to read these nested fields?
My API returns a JSON object like:
{
    "email": "ronald@mcdonald.com",
    "first_name": "ronald",
    "last_name": "mcdonald",
    "permissions": {
        "make_burgers": true,
        "make_icecream": false
    }
}
My airflow pipeline object looks like:
class StaffPipeline(_DefaultPipeline):
    source_url = f'{config.MCDONALDS_BASE_URL}/staff'
    table_config = TableConfig(
        table_name='mcdonalds__staff',
        field_mapping=[
            ('email', sa.Column('email', sa.Text)),
            ('first_name', sa.Column('first_name', sa.Text)),
            ('last_name', sa.Column('last_name', sa.Text)),
            ('make_burgers', sa.Column('make_burgers', sa.Boolean)),
            ('make_icecream', sa.Column('make_icecream', sa.Boolean)),
        ],
        indexes=(LateIndex("email"), LateIndex("last_name")),
    )
My error message when trying to run this pipeline:
raise UnusedColumnError(error)
mcdonalds_data.operators.db_tables.UnusedColumnError: Column mcdonalds__staff_123456789.make_burgers only contains NULL values
Thank you!
The answer is that you have to use a nested structure in Airflow, like so:
class StaffPipeline(_DefaultPipeline):
    source_url = f'{config.MCDONALDS_BASE_URL}/staff'
    table_config = TableConfig(
        table_name='mcdonalds__staff',
        field_mapping=[
            ('email', sa.Column('email', sa.Text)),
            ('first_name', sa.Column('first_name', sa.Text)),
            ('last_name', sa.Column('last_name', sa.Text)),
            (
                ("permissions", "make_burgers"),
                sa.Column('make_burgers', sa.Boolean),
            ),
            (
                ("permissions", "make_icecream"),
                sa.Column('make_icecream', sa.Boolean),
            ),
        ],
        indexes=(LateIndex("email"), LateIndex("last_name")),
    )
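In case it helps to see what the tuple key means: my reading is that it acts as a path into the nested JSON, so ("permissions", "make_burgers") walks into the permissions object. A sketch of that resolution against the sample payload (only an illustration, not the pipeline framework's actual code):

record = {
    "email": "ronald@mcdonald.com",
    "first_name": "ronald",
    "last_name": "mcdonald",
    "permissions": {"make_burgers": True, "make_icecream": False},
}

def resolve(record, key):
    # A plain string key reads a top-level field; a tuple of keys walks into nested objects.
    if isinstance(key, str):
        return record.get(key)
    value = record
    for part in key:
        value = value.get(part) if isinstance(value, dict) else None
    return value

print(resolve(record, "email"))                          # ronald@mcdonald.com
print(resolve(record, ("permissions", "make_burgers")))  # True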
Is there any easy method to call an API from a WordPress website and return true or false, depending on whether some data is there?
Here is the API:
https://api.covalenthq.com/v1/137/address/0x3FEb1D627c96cD918f2E554A803210DA09084462/balances_v2/?&format=JSON&nft=true&no-nft-fetch=true&key=ckey_docs
Here is the JSON:
{
    "data": {
        "address": "0x3feb1d627c96cd918f2e554a803210da09084462",
        "updated_at": "2021-11-13T23:25:27.639021367Z",
        "next_update_at": "2021-11-13T23:30:27.639021727Z",
        "quote_currency": "USD",
        "chain_id": 137,
        "items": [
            {
                "contract_decimals": 0,
                "contract_name": "PublicServiceKoalas",
                "contract_ticker_symbol": "PSK",
                "contract_address": "0xc5df71db9055e6e1d9a37a86411fd6189ca2dbbb",
                "supports_erc": [
                    "erc20"
                ],
                "logo_url": "https://logos.covalenthq.com/tokens/137/0xc5df71db9055e6e1d9a37a86411fd6189ca2dbbb.png",
                "last_transferred_at": "2021-11-13T09:45:36Z",
                "type": "nft",
                "balance": "0",
                "balance_24h": null,
                "quote_rate": 0.0,
                "quote_rate_24h": null,
                "quote": 0.0,
                "quote_24h": null,
                "nft_data": null
            }
        ],
        "pagination": null
    },
    "error": false,
    "error_message": null,
    "error_code": null
}
I want to check whether there is an item with "PSK" as its contract_ticker_symbol; if it exists and its "balance" is > 0, then return true.
Is there any painless method, because I'm not a programmer...
The Python requests library can handle this. You'll have to install it with pip (the package installer for Python) first.
I also used a website called JSON Parser Online to see what was going on with all of the data first, so that I could make sense of it in my code:
import requests


def main():
    url = "https://api.covalenthq.com/v1/137/address/0x3FEb1D627c96cD918f2E554A803210DA09084462/balances_v2/?&format" \
          "=JSON&nft=true&no-nft-fetch=true&key=ckey_docs"
    try:
        response = requests.get(url).json()
        for item in response['data']['items']:
            # First, find 'PSK' in the list
            if item['contract_ticker_symbol'] == "PSK":
                # Now, check the balance (the API returns it as a string)
                if int(item['balance']) > 0:
                    return True
                else:
                    return False
    except requests.ConnectionError:
        print("Exception")


if __name__ == "__main__":
    print(main())
This is what is going on:
I am pulling all of the data from the API.
I am using a try/except clause because I need the code to handle the case where I can't make a connection to the site.
I am looping through all of the 'items' to find the item whose contract ticker symbol is 'PSK'.
I am checking the balance in that item and returning the logic that you wanted.
The script runs itself at the end, but you can always rename this function and have some other code call it when you want to check.
I need to filter an array in my AWS Step Functions state. This seems like something I should easily be able to achieve with JsonPath but I am struggling for some reason.
The state I want to process looks like this:
{
    "items": [
        {
            "id": "A"
        },
        {
            "id": "B"
        },
        {
            "id": "C"
        }
    ]
}
I want to filter this array by removing entries for which id is not in a specified whitelist.
To do this, I define a Pass state in the following way:
"ApplyFilter": {
"Type": "Pass",
"ResultPath": "$.items",
"InputPath": "$.items.[?(#.id in ['A'])]",
"Next": "MapDeployments"
}
This makes use of the JsonPath in operator.
Unfortunately when I execute the state machine I receive an error:
{
    "error": "States.Runtime",
    "cause": "An error occurred while executing the state 'ApplyFilter' (entered at the event id #8). Invalid path '$.items.[?(@.id in ['A'])]' : com.jayway.jsonpath.InvalidPathException: com.jayway.jsonpath.InvalidPathException: Space not allowed in path"
}
However, I don't understand what is incorrect about the syntax. When I test the expression in an online JsonPath evaluator, everything works correctly.
What is wrong with what I have done? Is there another way of achieving this sort of filter using JsonPath?
According to the official AWS docs for Step Functions,
The following are not supported in paths: @ .. , : ? *
https://docs.aws.amazon.com/step-functions/latest/dg/amazon-states-language-paths.html
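Since the filter cannot be expressed in the path itself, one workaround (not from the docs above, just a sketch) is to do the filtering in a Task state, for example a small Lambda, instead of the Pass state. The whitelist and event shape below are assumptions based on the state shown in the question:

# Sketch of a Lambda handler that applies the whitelist filter.
# The input is assumed to look like {"items": [{"id": "A"}, {"id": "B"}, ...]}.
ALLOWED_IDS = {"A"}

def handler(event, context):
    filtered = [item for item in event.get("items", []) if item.get("id") in ALLOWED_IDS]
    # Keep the same shape so downstream states (e.g. MapDeployments) still find $.items.
    return {**event, "items": filtered}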
I have a lambda function that converts my logs to this format:
{
    "events": [
        {
            "field1": "value",
            "field2": "value",
            "field3": "value"
        }, (...)
    ]
}
When I query it on S3, I get it in this format:
[
    {
        "events": [
            { (...) }
        ]
    }
]
And I'm trying to run a custom classifier on it, because the data I want is inside the objects held by 'events', not 'events' itself.
So I started with the simplest path I could think of that worked in my tests (https://jsonpath.curiousconcept.com/):
$.events[*]
And, sure, it worked in the tests, but when I run a crawler against the file, the table created includes only an events field with a struct inside it.
So I tried a bunch of other paths:
$[*].events
$[*].['events']
$[*].['events'].[*]
$.[*].events[*]
$.events[*].[*]
Some of these do not even make sense, and absolutely every one of them got me a schema with an events field marked as an array.
Can anyone point me to a better direction to handle this issue?