Authorisation error when running airflow via cloud composer - airflow

I get an error when trying to run DAG from cloud composer using the GoogleCloudStorageToBigQueryOperator.
Final error was: {'reason': 'invalid', 'location': 'gs://xxxxxx/xxxx.csv',
and when I follow the URL link to the error ...
{
"error": {
"code": 401,
"message": "Request is missing required authentication credential. Expected OAuth 2 access token, login cookie or other valid authentication credential. See https://developers.google.com/identity/sign-in/web/devconsole- project.",
"errors": [
{
"message": "Login Required.",
"domain": "global",
"reason": "required",
"location": "Authorization",
"locationType": "header"
}
],
"status": "UNAUTHENTICATED"
}
}
I have configured the Cloud Storage connection ...
Conn Id My_Cloud_Storage
Conn Type Google Cloud Platform
Project Id xxxxxx
Keyfile Path /home/airflow/gcs/data/xxx.json
Keyfile JSON
Scopes (comma seperated) https://www.googleapis.com/auth/cloud-platform
Code ..
from __future__ import print_function
import datetime
from airflow import models
from airflow import DAG
from airflow.operators import bash_operator
from airflow.operators import python_operator
from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator
default_dag_args = {
# The start_date describes when a DAG is valid / can be run. Set this to a
# fixed point in time rather than dynamically, since it is evaluated every
# time a DAG is parsed. See:
# https://airflow.apache.org/faq.html#what-s-the-deal-with-start-date
'start_date': datetime.datetime(2019, 4, 15),
}
with models.DAG(
'Ian_gcs_to_BQ_Test',
schedule_interval=datetime.timedelta(days=1),
default_args=default_dag_args) as dag:
load_csv = GoogleCloudStorageToBigQueryOperator(
task_id='gcs_to_bq_test',
bucket='xxxxx',
source_objects=['xxxx.csv'],
destination_project_dataset_table='xxxx.xxxx.xxxx',
google_cloud_storage_conn_id='My_Cloud_Storage',
schema_fields=[
{'name':'AAAA','type':'INTEGER','mode':'NULLABLE'},
{'name':'BBB_NUMBER','type':'INTEGER','mode':'NULLABLE'},
],
write_disposition='WRITE_TRUNCATE',
dag=dag)

Ok , now it's fixed.
Turns out it wasn't working because of the header row in the file, once I removed that it worked fine.
Pretty annoying, completely misleading error messages about invalid locations and authorization.

I had the exact same looking error. What fixed it for me was adding the location of my dataset to my operator. So first, check the dataset information if you are not sure the location. Then add it as a parameter in your operator. For example, my dataset was in us-west1 and I was using an operator that looked like this:
check1 = BigQueryCheckOperator(task_id='check_my_event_data_exists',
sql="""
select count(*) > 0
from my_project.my_dataset.event
""",
use_legacy_sql=False,
location="us-west1") # THIS WAS THE FIX IN MY CASE
GCP error messages don't seem to be very good.

Related

Cloud tasks permission error when adding TASK_ID

I have been creating a task following the convention from the documentation - projects/PROJECT_ID/locations/LOCATION_ID/queues/QUEUE_ID which in my real example would look something like this - projects/staging/locations/us-central1/queues/members. That is all working fine, but i wanted to add the TASK_ID so i can enable the de-duplication feature and i used this projects/PROJECT_ID/locations/LOCATION_ID/queues/QUEUE_ID/tasks/TASK_ID which translates to something like this projects/staging/locations/us-central1/queues/members/tasks/testing-id. When i try to use the TASK_ID i get the following error code:
{
"message": "The principal (user or service account) lacks IAM permission \"cloudtasks.tasks.create\" for the resource \"projects\/staging\/locations\/us-central1\/queues\/members\/tasks\/testing-id\" (or the resource may not exist).",
"code": 7,
"status": "PERMISSION_DENIED",
"details": [
{
"#type": "grpc-server-stats-bin",
"data": "<Unknown Binary Data>"
}
]
}
Why is this error happening? Why should adding the TASK_ID change what permission do i need?

Google Authorized Error - Provided Client_Id is somehow 'invalid' - Why?

For whatever reason I am getting "Invalid Client id" error when testing the authentication flow I set up with next-auth. Here are the steps to reproduce the error:
Setup credentials (Client id /client secret) via google.
Setup environment variables in my ".env.local" file.
error:
Error 400: invalid_request
Missing required parameter: client_id
/api/[next-auth].js file
import NextAuth from "next-auth";
import Providers from "next-auth/providers";
export default NextAuth({
// https://next-auth.js.org/configuration/providers
providers: [
Providers.Email({
server: process.env.EMAIL_SERVER,
from: process.env.EMAIL_FROM,
}),
Providers.Google({
clientId: process.env.GOOGLE_ID,
clientSecret: process.env.GOOGLE_SECRET,
}),
],
});
.env.local file:
NEXT_PUBLIC_GOOGLE_ID=MY_GOOGLE_ID
GOOGLE_SECRET=MY_GOOGLE_SECRET
Figured it out. Here is the answer:
This line:
NEXT_PUBLIC_GOOGLE_ID=MY_GOOGLE_ID
GOOGLE_SECRET=MY_GOOGLE_SECRET
Should look like this:
GOOGLE_ID=MY_GOOGLE_ID
GOOGLE_SECRET=MY_GOOGLE_SECRET
Basically, Don't prepend the "NEXT_PUBLIC_"

BigQueryInsertJobOperator - required Parameter is missing, but which?

I've been trying to get this operator working for some time since switching to airflow 2.0 BigQueryInsertJobOperator.
The error I'm seeing shows there is something missing from our connection, oddly enough this connection works in another DAG where we are using google's api to access google sheets:
export AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT=
"google-cloud-platform://?extra__google_cloud_platform__project=\analytics&extra__google_cloud_platform__keyfile_dict=
{\"type\": \"service_account\", \"project_id\": \"analytics\",
\"private_key_id\": \"${GCLOUD_PRIVATE_KEY_ID}\", \"private_key\": \"${GCLOUD_PRIVATE_KEY}\",
\"client_email\": \"d#lytics.iam.gserviceaccount.com\", \"client_id\": \"12345667\",
\"auth_uri\": \"https://accounts.google.com/o/oauth2/auth\",
\"token_uri\": \"https://accounts.google.com/o/oauth2/token\",
\"auth_provider_x509_cert_url\": \"https://www.googleapis.com/oauth2/v1/certs\",
\"client_x509_cert_url\": \"https://www.googleapis.com/robot/v1/metadata/x509/d#lytics.iam.gserviceaccount.com\"}"
This is the error I'm seeing:
{
"error": {
"code": 401,
"message": "Request is missing required authentication credential. Expected OAuth 2 access token, login cookie or other valid authentication credential. See https://developers.google.com/identity/sign-in/web/devconsole-project.",
"errors": [
{
"message": "Login Required.",
"domain": "global",
"reason": "required",
"location": "Authorization",
"locationType": "header"
}
],
"status": "UNAUTHENTICATED"
}
}
is there a way I can look up what else might be required in terms of formatting, etc, perhaps a really good example on how to get the correct connection setup for this Operator??
In my logs I'm seeing this error which makes me think perhaps it might not be a credential issue?
File "/usr/local/lib/python3.8/site-packages/google/cloud/_http.py", line 438, in api_request
raise exceptions.from_http_response(response)
google.api_core.exceptions.BadRequest: 400 POST https://bigquery.googleapis.com/bigquery/v2/projects/vice-analytics/jobs?prettyPrint=false: Required parameter is missing
Create a service account json key, which contains all the required info posted in your error message.
https://cloud.google.com/iam/docs/creating-managing-service-account-keys
Then you can paste the json key into the Airflow UI: Admin -> Connections in the json key field and reference this in your dag with: gcp_conn_id="name of connection you created"
Or add the json key as an env variable (on macos):
export GOOGLE_APPLICATION_CREDENTIALS="link to your json key file"

Airflow - predefine variables and connections in file

Is it possible to pre-define variables, connections etc. in a file so that they are loaded when Airflow starts up? Setting them through the UI is not great from a deployment perspective.
Cheers
Terry
I'm glad that someone asked this question. In fact since Airflow completely exposes the underlying SQLAlchemy models to end-user, programmatic manipulation (creation, updation & deletion) of all Airflow models, particularly those used to supply configs like Connection & Variable is possible.
It may not be very obvious, but the open-source nature of Airflow means there are no secrets: you just need to peek in harder. Particularly for these use-cases, I've always found the cli.py to be very useful reference point.
So here's the snippet I use to create all MySQL connections while setting up Airflow. The input file supplied is of JSON format with the given structure.
# all imports
import json
from typing import List, Dict, Any, Optional
from airflow.models import Connection
from airflow.settings import Session
from airflow.utils.db import provide_session
from sqlalchemy.orm import exc
# trigger method
def create_mysql_conns(file_path: str) -> None:
"""
Reads MySQL connection settings from a given JSON file and
persists it in Airflow's meta-db. If connection for same
db already exists, it is overwritten
:param file_path: Path to JSON file containing MySQL connection settings
:type file_path: str
:return: None
:type: None
"""
with open(file_path) as json_file:
json_data: List[Dict[str, Any]] = json.load(json_file)
for settings_dict in json_data:
db_name: str = settings_dict["db"]
conn_id: str = "mysql.{db_name}".format(db_name=db_name)
mysql_conn: Connection = Connection(conn_id=conn_id,
conn_type="mysql",
host=settings_dict["host"],
login=settings_dict["user"],
password=settings_dict["password"],
schema=db_name,
port=settings_dict.get("port", mysql_conn_description["port"]))
create_and_overwrite_conn(conn=mysql_conn)
# utility delete method
#provide_session
def delete_conn_if_exists(conn_id: str, session: Optional[Session] = None) -> bool:
# Code snippet borrowed from airflow.bin.cli.connections(..)
try:
to_delete: Connection = (session
.query(Connection)
.filter(Connection.conn_id == conn_id)
.one())
except exc.NoResultFound:
return False
except exc.MultipleResultsFound:
return False
else:
session.delete(to_delete)
session.commit()
return True
# utility overwrite method
#provide_session
def create_and_overwrite_conn(conn: Connection, session: Optional[Session] = None) -> None:
delete_conn_if_exists(conn_id=conn.conn_id)
session.add(conn)
session.commit()
input JSON file structure
[
{
"db": "db_1",
"host": "db_1.hostname.com",
"user": "db_1_user",
"password": "db_1_passwd"
},
{
"db": "db_2",
"host": "db_2.hostname.com",
"user": "db_2_user",
"password": "db_2_passwd"
}
]
Reference links
With code, how do you update an airflow variable?
Problem updating the connections in Airflow programatically
How to create, update and delete airflow variables without using the GUI?
Programmatically clear the state of airflow task instances
airflow/api/common/experimental/pool.py

Openstack API Authentication

Openstack noob here. I have setup an Ubuntu VM with DevStack, and am trying to authenticate with Keystone to obtain a token to be used for subsequent Openstack API calls. The identity endpoint shown on the “API Access” page in Horizon is: http://<DEVSTACK_IP>/identity.
When I post the below JSON payload to this endpoint, I get the error get_version_v3() got an unexpected keyword argument 'auth’.
{
"auth": {
"identity": {
"methods": [
"password"
],
"password": {
"user": {
"name": "admin",
"domain": {
"name": "Default"
},
"password": “AdminPassword”
}
}
}
}
}
Based on the Openstack docs, I should be hitting http://<DEVSTACK_IP>/v3/auth/tokens to obtain a token, but when I hit that endpoint, I get 404 Not Found.
I'm currently using Postman for testing this, but will eventually be doing programmatically.
Does anybody have any experience with authenticating against the Openstack API that can help?
Not sure whether you want to do it in a python way, but if you do, here is a way to do it:
from keystoneauth1.identity import v3
from keystoneauth1 import session
v3_auth = v3.Password(auth_url=V3_AUTH_URL,
username=USERNAME,
password=PASSWORD,
project_name=PROJECT_NAME,
project_domain_name="default",
user_domain_name="default")
v3_ses = session.Session(auth=v3_auth)
auth_token = v3_ses.get_token()
And you V3_AUTH_URL should be http://<DEVSTACK_IP>:5000/v3 since keystone is using port 5000 as a default.
If you do have a multi-domain devstack, you can change the domains, otherwise, they should be default
Just in case you don't have the client library installed: pip install python-keystoneclient
Here is a good doc for you to read about it:
https://docs.openstack.org/keystoneauth/latest/using-sessions.html
HTH

Resources