I have an API that returns some nested json data that I would like airflow to read. However, airflow is giving me an error saying that the nested fields are NULL. These fields are not NULL and I can see the json data in any manual get requests I make to the API.
How do I modify my pipeline in order to read these nested fields?
My API returns a JSON object like:
{
"email": "ronald#mcdonald.com",
"first_name": "ronald",
"last_name": "mcdonald",
"permissions": {
"make_burgers": true,
"make_icecream": false,
},
}
My airflow pipeline object looks like:
class StaffPipeline(_DefaultPipeline):
source_url = f'{config.MCDONALDS_BASE_URL}/staff'
table_config = TableConfig(
table_name='mcdonalds__staff',
field_mapping=[
('email', sa.Column('email', sa.Text)),
('first_name', sa.Column('first_name', sa.Text)),
('last_name', sa.Column('last_name', sa.Text)),
('make_burgers', sa.Column('make_burgers', sa.Boolean)),
('make_icecream', sa.Column('make_icecream', sa.Boolean)),
],
indexes=(LateIndex("email"), LateIndex("last_name")),
)
My error message when trying to run this pipeline:
raise UnusedColumnError(error)
mcdonalds_data.operators.db_tables.UnusedColumnError: Column mcdonalds__staff_123456789.make_burgers only contains NULL values
Thank you!
The answer is that you have to use a nested strucutre in airflow like so:
class StaffPipeline(_DefaultPipeline):
source_url = f'{config.MCDONALDS_BASE_URL}/staff'
table_config = TableConfig(
table_name='mcdonalds__staff',
field_mapping=[
('email', sa.Column('email', sa.Text)),
('first_name', sa.Column('first_name', sa.Text)),
('last_name', sa.Column('last_name', sa.Text)),
(
("permissions", "make_burgers"),
sa.Column('make_burgers', sa.Boolean),
),
(
("permissions", "make_icecream"),
sa.Column('make_icecream', sa.Boolean),
),
],
indexes=(LateIndex("email"), LateIndex("last_name")),
)
Related
Airflow version :2.0.2
Trying to create Emr cluster, by retrying data from AWS secrets manager.
I am trying to write an airflow dag and, my task is to get data from this get_secret function and use it in Spark_steps
def get_secret():
secret_name = Variable.get("secret_name")
region_name = Variable.get(region_name)
# Create a Secrets Manager client
session = boto3.session.Session()
client = session.client(service_name='secretsmanager', region_name=region_name)
account_id = boto3.client('sts').get_caller_identity().get('Account')
try:
get_secret_value_response = client.get_secret_value(SecretId=secret_name)
if 'SecretString' in get_secret_value_response:
secret_str = get_secret_value_response['SecretString']
secret=json.loads(secret_str)
airflow_path=secret["airflow_path"]
return airflow_path
...
I need to use "airflow_path" return value in below spark_steps
Spark_steps:
SPARK_STEPS = [
{
'Name': 'Spark-Submit Command',
"ActionOnFailure": "CONTINUE",
'HadoopJarStep': {
"Jar": "command-runner.jar",
"Args": [
'spark-submit',
'--py-files',
's3://'+airflow_path+'-pyspark/pitchbook/config.zip,s3://'+airflow_path+'-pyspark/pitchbook/jobs.zip,s3://'+airflow_path+'-pyspark/pitchbook/DDL.zip',
's3://'+airflow_path+'-pyspark/pitchbook/main.py'
],
},
},
I saw on the internet I need to use Xcom, is this right ?, and do I need to run this function in python operator first and then get the value. please provide an example as I am a newbie.
Thanks for your help.
Xi
Yes if you would like to pass dynamic stuff, leveraging xcom push/pull might be easier.
Leverage PythonOperator to push data into xcom.
See reference implementation:
https://github.com/apache/airflow/blob/7fed7f31c3a895c0df08228541f955efb16fbf79/airflow/providers/amazon/aws/example_dags/example_emr.py
https://github.com/apache/airflow/blob/7fed7f31c3a895c0df08228541f955efb16fbf79/airflow/providers/amazon/aws/example_dags/example_emr.py#L108
https://www.startdataengineering.com/post/how-to-submit-spark-jobs-to-emr-cluster-from-airflow/
is there any easy method to call APIs from Wordpress website and return true or false, depends if some data is there?
Here is the API:
https://api.covalenthq.com/v1/137/address/0x3FEb1D627c96cD918f2E554A803210DA09084462/balances_v2/?&format=JSON&nft=true&no-nft-fetch=true&key=ckey_docs
here is a JSON:
{
"data": {
"address": "0x3feb1d627c96cd918f2e554a803210da09084462",
"updated_at": "2021-11-13T23:25:27.639021367Z",
"next_update_at": "2021-11-13T23:30:27.639021727Z",
"quote_currency": "USD",
"chain_id": 137,
"items": [
{
"contract_decimals": 0,
"contract_name": "PublicServiceKoalas",
"contract_ticker_symbol": "PSK",
"contract_address": "0xc5df71db9055e6e1d9a37a86411fd6189ca2dbbb",
"supports_erc": [
"erc20"
],
"logo_url": "https://logos.covalenthq.com/tokens/137/0xc5df71db9055e6e1d9a37a86411fd6189ca2dbbb.png",
"last_transferred_at": "2021-11-13T09:45:36Z",
"type": "nft",
"balance": "0",
"balance_24h": null,
"quote_rate": 0.0,
"quote_rate_24h": null,
"quote": 0.0,
"quote_24h": null,
"nft_data": null
}
],
"pagination": null
},
"error": false,
"error_message": null,
"error_code": null
}
I want to check if there is "PSK" in contract_ticker_symbol, if it exist and "balance" is > 0 ... then return true.
Is there any painless method because I'm not a programmer...
The Python requests library can handle this. You'll have to install it with pip first (package installer for Python).
I also used a website called JSON Parser Online to see what was going on with all of the data first so that I would be able to make sense of it in my code:
import requests
def main():
url = "https://api.covalenthq.com/v1/137/address/0x3FEb1D627c96cD918f2E554A803210DA09084462/balances_v2/?&format" \
"=JSON&nft=true&no-nft-fetch=true&key=ckey_docs "
try:
response = requests.get(url).json()
for item in response['data']['items']:
# First, find 'PSK' in the list
if item['contract_ticker_symbol'] == "PSK":
# Now, check the balance
if item['balance'] == 0:
return True
else:
return False
except requests.ConnectionError:
print("Exception")
if __name__ == "__main__":
print(main())
This is what is going on:
I am pulling all of the data from the API.
I am using a try/except clause because I need the code to
handle if I can't make a connection to the site.
I am looping through all of the 'items' to find the correct 'item'
that includes the contract ticker symbol for 'PSK'.
I am checking the balance in that item and returning the logic that you wanted.
The script is running itself at the end, but you can always just rename this function and have some other code call it to check it.
I need to filter an array in my AWS Step Functions state. This seems like something I should easily be able to achieve with JsonPath but I am struggling for some reason.
The state I want to process looks like this:
{
"items": [
{
"id": "A"
},
{
"id": "B"
},
{
"id": "C"
}
]
}
I want to filter this array by removing entries for which id is not in a specified whitelist.
To do this, I define a Pass state in the following way:
"ApplyFilter": {
"Type": "Pass",
"ResultPath": "$.items",
"InputPath": "$.items.[?(#.id in ['A'])]",
"Next": "MapDeployments"
}
This makes use of the JsonPath in operator.
Unfortunately when I execute the state machine I receive an error:
{
"error": "States.Runtime",
"cause": "An error occurred while executing the state 'ApplyFilter' (entered at the event id #8). Invalid path '$.items.[?(#.id in ['A'])]' : com.jayway.jsonpath.InvalidPathException: com.jayway.jsonpath.InvalidPathException: Space not allowed in path"
}
However, I don't understand what is incorrect with the syntax. When I test here everything works correctly.
What is wrong with what I have done? Is there another way of achieving this sort of filter using JsonPath?
According to the official AWS docs for Step Functions,
The following in paths are not supported # .. , : ? *
https://docs.aws.amazon.com/step-functions/latest/dg/amazon-states-language-paths.html
I am trying to enable default MFA for Openstack user using Python keystoneclient API
keystoneclient.users.update
I have sample API curl command from Openstack documentation, where you update the "options" attribute of user account with JSON object
{
"user": {
"options": {
"multi_factor_auth_enabled": true,
"multi_factor_auth_rules": [
["password", "totp"]
]
}
}
}
when I am trying to update the same in Python code I am getting below error
keystoneauth1.exceptions.http.BadRequest: Invalid input for field 'options': u'{ "multi_factor_auth_enabled": true,
"multi_factor_auth_rules": [["password", "totp"]]}' is not of type
'object'
Failed validating 'type' in schema['properties']['options']:
my code is like this
MFA_dict = '{ "multi_factor_auth_enabled": true, "multi_factor_auth_rules": [["password", "totp"]]}'
user = keystone.users.update(user_id, options=MFA_dict)
Never mind I figured it out.
MFA_dict was string, so I had to just remove single quotes and make the object dictionary.
and make lower case true into uppercase True to make it boolean
MFA_dict = { "multi_factor_auth_enabled": True, "multi_factor_auth_rules": [["password", "totp"]]}
I have a lambda function that converts my logs to this format:
{
"events": [
{
"field1": "value",
"field2": "value",
"field3": "value"
}, (...)
]
}
When I query it on S3, I get in this format:
[
{
"events": [
{ (...) }
]
}
]
And I'm trying to run a custom classifier for it because the data I want is inside the objects kept by 'events' and not events itself.
So I started with the simplest path I could think that worked in my tests (https://jsonpath.curiousconcept.com/)
$.events[*]
And, sure, worked in the tests but when I run a crawler against the file, the table created includes only an events field with a struct inside it.
So I tried a bunch of other paths:
$[*].events
$[*].['events']
$[*].['events'].[*]
$.[*].events[*]
$.events[*].[*]
Some of these does not even make sense and absolutely every one of those got me an schema with an events field marked as array.
Can anyone point me to a better direction to handle this issue?