composer workflow fails at dataproc operator - airflow

I have a Composer environment set up in GCP and it is running a DAG as follows:
from airflow import DAG
from airflow.contrib.operators.dataproc_operator import DataProcPySparkOperator
from airflow.operators.bash_operator import BashOperator

with DAG('sample-dataproc-dag',
         default_args=DEFAULT_DAG_ARGS,
         schedule_interval=None) as dag:  # Here we are using dag as context
    # Submit the PySpark job.
    submit_pyspark = DataProcPySparkOperator(
        task_id='run_dataproc_pyspark',
        main='gs://.../dataprocjob.py',
        cluster_name='xyz',
        dataproc_pyspark_jars='gs://.../spark-bigquery-latest_2.12.jar'
    )
    simple_bash = BashOperator(
        task_id='simple-bash',
        bash_command="ls -la")
    submit_pyspark.dag = dag
    submit_pyspark.set_upstream(simple_bash)
and this is my dataprocjob.py
from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession.builder.appName('Jupyter BigQuery Storage').getOrCreate()
    table = "projct.dataset.txn_w_ah_demo"
    df = spark.read.format("bigquery").option("table", table).load()
    df.printSchema()
My Composer pipeline fails at the Dataproc step. In the Composer log stored in GCS, this is what I see:
[2020-09-23 21:40:02,849] {taskinstance.py:1059} ERROR - <HttpError 403 when requesting https://dataproc.googleapis.com/v1beta2/projects/lt-dia-pop-dis-upr/regions/global/jobs?clusterName=dppoppr004&alt=json returned "Not authorized to requested resource.">#-#{"workflow": "sample-dataproc-dag", "task-id": "run_dataproc_pyspark", "execution-date": "2020-09-23T21:39:42.371933+00:00"}
Traceback (most recent call last):
File "/usr/local/lib/airflow/airflow/models/taskinstance.py", line 930, in _run_raw_task
result = task_copy.execute(context=context)
File "/usr/local/lib/airflow/airflow/contrib/operators/dataproc_operator.py", line 1139, in execute
super(DataProcPySparkOperator, self).execute(context)
File "/usr/local/lib/airflow/airflow/contrib/operators/dataproc_operator.py", line 707, in execute
self.hook.submit(self.hook.project_id, self.job, self.region, self.job_error_states)
File "/usr/local/lib/airflow/airflow/contrib/hooks/gcp_dataproc_hook.py", line 311, in submit
num_retries=self.num_retries)
File "/usr/local/lib/airflow/airflow/contrib/hooks/gcp_dataproc_hook.py", line 51, in __init__
clusterName=cluster_name).execute()
File "/opt/python3.6/lib/python3.6/site-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper
return wrapped(*args, **kwargs)
File "/opt/python3.6/lib/python3.6/site-packages/googleapiclient/http.py", line 851, in execute
raise HttpError(resp, content, uri=self.uri)
googleapiclient.errors.HttpError: <HttpError 403 when requesting https://dataproc.googleapis.com/v1beta2/projects/lt-dia-pop-dis-upr/regions/global/jobs?clusterName=dppoppr004&alt=json returned "Not authorized to requested resource.">

It seems that the issue you are describing comes down to the Dataproc permissions you have granted to your application.
According to the documentation, you need different permissions to perform Dataproc tasks, for example:
dataproc.clusters.create permits the creation of Dataproc clusters in the containing project
dataproc.jobs.create permits the submission of Dataproc jobs to Dataproc clusters in the containing project
dataproc.clusters.list permits the listing of details of Dataproc clusters in the containing project
If you want to submit a Dataproc job, you need the dataproc.clusters.use and dataproc.jobs.create permissions.
To grant the correct privileges, you can follow the documentation for updating the service account you are using in your code and add the required permissions.
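If those roles live on a dedicated service account rather than the environment's default one, the operator can be pointed at it through an Airflow connection. A minimal sketch, where 'my_dataproc_conn' is a hypothetical connection whose service-account key holds dataproc.clusters.use and dataproc.jobs.create:
from airflow.contrib.operators.dataproc_operator import DataProcPySparkOperator

# Sketch only: this goes inside the `with DAG(...)` block from the question.
# 'my_dataproc_conn' is a hypothetical connection name; paths are elided as in
# the question and must match your setup.
submit_pyspark = DataProcPySparkOperator(
    task_id='run_dataproc_pyspark',
    main='gs://.../dataprocjob.py',
    cluster_name='xyz',
    gcp_conn_id='my_dataproc_conn',   # defaults to 'google_cloud_default'
    dataproc_pyspark_jars=['gs://.../spark-bigquery-latest_2.12.jar'],
)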

From a first read, this looks like the permissions on the Google Cloud account you are using to invoke the Dataproc APIs aren't adequate for the operator.

Related

Airflow DAG keeps failing

I am having an issue with an airflow DAG that keeps failing.
The error is shown below:
[2022-08-19 06:49:15,850] {taskinstance.py:1150} ERROR - task is not running but the task data does not show ended
Traceback (most recent call last):
File "path/taskinstance.py", line 984, in _run_raw_task
result = task_copy.execute(context=context)
File "path/asynchronous.py", line 30, in execute
self._execute(task_definition=self.task_definition, logger=self.logger)
File "path/asynchronous.py", line 159, in _execute
raise RuntimeError(f'task is not running but the task data does not show ended')
RuntimeError: task is not running but the task data does not show ended
I am a beginner in Airflow. The code for this was prepared by someone who has unfortunately left, and I am trying to troubleshoot it. Can anyone please tell me if they have any suggestions about what could be happening and how to fix it?
When I go to the Code tab in Airflow, the code looks like this:
import json
import logging

from a.component.s3.s3 import S3
from a.context.configuration import Configuration
from a.context.environment import AirflowEnvironment
from a.process.process import Process

airflow_env = AirflowEnvironment()
config = Configuration()
s3 = S3(config=config)

for dag_definition_path in [p for p in airflow_env.sequencing_run_dag_dir_path.glob('*.json') if p.is_file()]:
    with dag_definition_path.open() as inf:
        json_dict = json.load(fp=inf)
    process = Process.from_json_dict(json_dict=json_dict)
    logging.info(f'process found, {process.process_name}')
    logging.info(f'creating and registering dag for process, {process.process_name}')
    dag = process.create_dag(config=config, airflow_env=airflow_env)
    # register the DAG globally
    globals()[dag.dag_id] = dag
There is a manager routine running every 30 minutes that starts new DAGs if new files are added in a specific folder.
Thank you
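For context, the loop in that file follows Airflow's dynamic-DAG-generation pattern: the scheduler only picks up DAG objects bound to module-level names, which is why each generated DAG is written into globals(). A stripped-down sketch of the same idiom (hypothetical DAG ids and plain BashOperator tasks, not the questioner's Process classes):
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Hypothetical stand-in for the JSON-driven Process objects in the question.
for dag_id in ['example_run_a', 'example_run_b']:
    dag = DAG(dag_id, start_date=datetime(2022, 1, 1), schedule_interval=None)
    BashOperator(task_id='noop', bash_command='echo running', dag=dag)
    # Only DAGs reachable from the module's global namespace get scheduled.
    globals()[dag_id] = dag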

GCP Composer v1.18.6 and 2.0.10 incompatible with CloudSqlProxyRunner

In my Composer Airflow DAGs, I have been using the CloudSqlProxyRunner to connect to my Cloud SQL instance.
However, after updating Google Cloud Composer from v1.18.4 to 1.18.6, my DAG started to encounter a strange error:
[2022-04-22, 23:20:18 UTC] {cloud_sql.py:462} INFO - Downloading cloud_sql_proxy from https://dl.google.com/cloudsql/cloud_sql_proxy.linux.x86_64 to /home/airflow/dXhOYoU_cloud_sql_proxy.tmp
[2022-04-22, 23:20:18 UTC] {taskinstance.py:1702} ERROR - Task failed with exception
Traceback (most recent call last):
File "/opt/python3.8/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1330, in _run_raw_task
self._execute_task_with_callbacks(context)
File "/opt/python3.8/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1457, in _execute_task_with_callbacks
result = self._execute_task(context, self.task)
File "/opt/python3.8/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1513, in _execute_task
result = execute_callable(context=context)
File "/opt/python3.8/lib/python3.8/site-packages/airflow/decorators/base.py", line 134, in execute
return_value = super().execute(context)
File "/opt/python3.8/lib/python3.8/site-packages/airflow/operators/python.py", line 174, in execute
return_value = self.execute_callable()
File "/opt/python3.8/lib/python3.8/site-packages/airflow/operators/python.py", line 185, in execute_callable
return self.python_callable(*self.op_args, **self.op_kwargs)
File "/home/airflow/gcs/dags/real_time_scoring_pipeline.py", line 99, in get_messages_db
with SQLConnection() as sql_conn:
File "/home/airflow/gcs/dags/helpers/helpers.py", line 71, in __enter__
self.proxy_runner.start_proxy()
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/hooks/cloud_sql.py", line 524, in start_proxy
self._download_sql_proxy_if_needed()
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/hooks/cloud_sql.py", line 474, in _download_sql_proxy_if_needed
raise AirflowException(
airflow.exceptions.AirflowException: The cloud-sql-proxy could not be downloaded. Status code = 404. Reason = Not Found
Checking manually, https://dl.google.com/cloudsql/cloud_sql_proxy.linux.x86_64 indeed returns a 404.
Looking at the function that raises the exception, _download_sql_proxy_if_needed, it has this code:
system = platform.system().lower()
processor = os.uname().machine
if not self.sql_proxy_version:
    download_url = CLOUD_SQL_PROXY_DOWNLOAD_URL.format(system, processor)
else:
    download_url = CLOUD_SQL_PROXY_VERSION_DOWNLOAD_URL.format(
        self.sql_proxy_version, system, processor
    )
So, for whatever reason, in both of these latest images of Composer, processor = os.uname().machine returns x86_64. Previously, it returned amd64, and https://dl.google.com/cloudsql/cloud_sql_proxy.linux.amd64 is in fact a valid link to the binary we need.
I replicated this error in Composer 2.0.10 as well.
I am still investigating possible workarounds, but I am posting this here in case someone else encounters this issue or has already figured out a workaround, and to raise it with Google engineers (who, according to Composer's docs, monitor this tag).
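If you want to check which URL your own environment would generate, here is a quick sketch (assuming the hook's format string is "https://dl.google.com/cloudsql/cloud_sql_proxy.{}.{}"; verify against your installed provider version):
import os
import platform

# Rebuild the URL the hook would request (format string assumed from the
# provider source).
CLOUD_SQL_PROXY_DOWNLOAD_URL = "https://dl.google.com/cloudsql/cloud_sql_proxy.{}.{}"

system = platform.system().lower()   # e.g. 'linux'
processor = os.uname().machine       # 'x86_64' on the affected Composer images
print(CLOUD_SQL_PROXY_DOWNLOAD_URL.format(system, processor))
# prints .../cloud_sql_proxy.linux.x86_64, which 404s; the working binary
# is .../cloud_sql_proxy.linux.amd64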
My current workaround is patching the CloudSqlProxyRunner to hardcode the correct URL:
import os
import shutil
from inspect import signature

import httpx
from airflow.exceptions import AirflowException
from airflow.providers.google.cloud.hooks.cloud_sql import CloudSqlProxyRunner


class PatchedCloudSqlProxyRunner(CloudSqlProxyRunner):
    """
    This is a patched version of CloudSqlProxyRunner to provide a workaround for an
    incorrectly generated URL to the Cloud SQL proxy binary.
    """

    def _download_sql_proxy_if_needed(self) -> None:
        download_url = "https://dl.google.com/cloudsql/cloud_sql_proxy.linux.amd64"
        # the rest of the code is taken from the original method
        proxy_path_tmp = self.sql_proxy_path + ".tmp"
        self.log.info(
            "Downloading cloud_sql_proxy from %s to %s", download_url, proxy_path_tmp
        )
        # httpx has a breaking API change (follow_redirects vs allow_redirects)
        # and this should work with both versions (cf. issue #20088)
        if "follow_redirects" in signature(httpx.get).parameters.keys():
            response = httpx.get(download_url, follow_redirects=True)
        else:
            response = httpx.get(download_url, allow_redirects=True)  # type: ignore[call-arg]
        # Downloading to .tmp file first to avoid case where partially downloaded
        # binary is used by parallel operator which uses the same fixed binary path
        with open(proxy_path_tmp, "wb") as file:
            file.write(response.content)
        if response.status_code != 200:
            raise AirflowException(
                "The cloud-sql-proxy could not be downloaded. "
                f"Status code = {response.status_code}. Reason = {response.reason_phrase}"
            )
        self.log.info(
            "Moving sql_proxy binary from %s to %s", proxy_path_tmp, self.sql_proxy_path
        )
        shutil.move(proxy_path_tmp, self.sql_proxy_path)
        os.chmod(self.sql_proxy_path, 0o744)  # Set executable bit
        self.sql_proxy_was_downloaded = True
And then instantiate it and use it as I would the original CloudSqlProxyRunner:
proxy_runner = PatchedCloudSqlProxyRunner(path_prefix, instance_spec)
proxy_runner.start_proxy()
But I am hoping that this is properly fixed by someone at Google soon, either by fixing the os.uname().machine value or by uploading a Cloud SQL proxy binary at the URL currently generated in _download_sql_proxy_if_needed.
As mentioned by @enocom, this commit to support arm64 download links caused the side effect of generating broken download links. I assume the author of the commit thought that the Cloud SQL Proxy had binaries for each machine type, although in fact there are no Linux x86_64 links.
I have created an Airflow PR to fix the broken links; hopefully it will get merged soon and resolve this. I will update the thread with any news.
Update (I've been working with Jack on this): I just merged that PR! When a new version of the providers is added to PyPI, you'll need to add it to your Composer environment. In the meantime, as a workaround, you could take the fix from Jack's PR and use it as a local dependency. (Similar to the other reply here!) If you do this, I highly recommend setting a calendar reminder (maybe a month from now?) to remove the workaround and go back to importing from the provider package, just to make sure you don't miss out on other updates to it! :)

password_auth.py mushroom cloud error on Airflow webserver open when switching backend DBs

I'm trying to transition from the SQLite DB to PostgreSQL (based on the guide here: https://www.ryanmerlin.com/2019/07/apache-airflow-installation-on-ubuntu-18-04-18-10/ ) and am getting the mushroom cloud error at the initial screen of the webserver UI.
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.6/site-packages/flask/app.py", line 2446, in wsgi_app
response = self.full_dispatch_request()
File "/home/airflow/.local/lib/python3.6/site-packages/flask/app.py", line 1951, in full_dispatch_request
rv = self.handle_user_exception(e)
...
...
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/www/utils.py", line 93, in is_accessible
(not current_user.is_anonymous and current_user.is_superuser())
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/contrib/auth/backends/password_auth.py", line 114, in is_superuser
return hasattr(self, 'user') and self.user.is_superuser()
AttributeError: 'NoneType' object has no attribute 'is_superuser'
Looking at the webserver logs does not reveal much...
[airflow@airflowetl airflow]$ tail airflow-webserver.*
==> airflow-webserver.err <==
/home/airflow/.local/lib/python3.6/site-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: <http://initd.org/psycopg/docs/install.html#binary-install-from-pypi>.
""")
==> airflow-webserver.log <==
==> airflow-webserver.out <==
[2019-12-18 10:20:36,553] {settings.py:213} INFO - settings.configure_orm(): Using pool settings. pool_size=5, max_overflow=10, pool_recycle=1800, pid=72725
==> airflow-webserver.pid <==
72745
One thing that may be useful to note (since this appears to be due to some kind of password issue) is that before trying to switch to Postgres, I had set a bcrypt password according to the docs (https://airflow.apache.org/docs/stable/security.html#password) with this script:
import airflow
from airflow import models, settings
from airflow.contrib.auth.backends.password_auth import PasswordUser
user = PasswordUser(models.User())
user.username = 'airflow'
user.email = 'myemail@co.org'
user.password = 'mypasword'
session = settings.Session()
session.add(user)
session.commit()
session.close()
exit()
Anyone know what could be going on here or how to debug further?
Rerunning my user/password script fixed the problem.
I assume this is connected to the switch from the previously used SQLite DB to the new Postgres server. The user/auth info is stored in the Airflow backend DB (I'm rather new to Airflow, so I was not aware of these internals), and since I switched backends, the new backend did not have it yet. I therefore needed to rerun the script to write a new user/password to the new backend so I could log in with a password again (since my airflow.cfg uses auth_backend = airflow.contrib.auth.backends.password_auth).
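As a quick sanity check after switching backends, you can list the users the new metadata database actually knows about before opening the UI. A minimal sketch, using the same imports as the script above:
from airflow import models, settings

# An empty result means the password_auth user still has to be recreated
# in the new backend.
session = settings.Session()
for user in session.query(models.User).all():
    print(user.username, user.email)
session.close()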

Why does Airflow db upgrade from v1.8.2 to v1.10.3 fail?

I have airflow v1.8.2 and tried to upgrade to v1.10.3
After the update I ran the command
airflow upgradedb
and get the error:
...
File "/opt/nio/lib/airflow/sqlalchemy/engine/default.py", line 536, in do_execute
cursor.execute(statement, parameters)
sqlalchemy.exc.ProgrammingError: (psycopg2.ProgrammingError) column
dag.description does not exist
LINE 1: ...fileloc AS dag_fileloc, dag.owners AS dag_owners,
dag.descri...
^
[SQL: 'SELECT dag.dag_id AS dag_dag_id, dag.is_paused AS dag_is_paused,
dag.is_subdag AS dag_is_subdag, dag.is_active AS dag_is_active,
dag.last_scheduler_run AS dag_last_scheduler_run, dag.last_pickled AS
dag_last_pickled, dag.last_expired AS dag_last_expired,
dag.scheduler_lock AS dag_scheduler_lock, dag.pickle_id AS
dag_pickle_id,
dag.fileloc AS dag_fileloc, dag.owners AS dag_owners, dag.description
AS dag_description, dag.default_view AS dag_default_view,
dag.schedule_interval AS dag_schedule_interval \nFROM dag \nWHERE
dag.dag_id = %(dag_id_1)s \n LIMIT %(param_1)s'] [parameters:
{'dag_id_1': u'custom_feeds_unit_dna', 'param_1': 1}] (Background on
this error at: http://sqlalche.me/e/f405)
Why does the db upgrade fail?
Should I first update to Airflow v1.9.0 instead of v1.10.3?
As per the link in the error message, it seems to be a ProgrammingError. This implies that the schema is off, which could be dealt with by first upgrading to v1.9.x, because the upgradedb code is most likely written to upgrade from the previous version. So the upgradedb for v1.10.x is likely able to upgrade the schema from v1.9.x, and v1.9.x is likely able to upgrade from v1.8.x.
However, this takes more time, because each intermediate upgrade has to run its own set of schema migrations, which requires extra work (and resources).
If you don't need the metadata for the previous runs, you could use airflow resetdb to just reset the database and create the schema for the latest version.

Why can the imported PowerFactory module in Python only execute a single time?

The script below is able to run a piece of software called PowerFactory externally from Python, as follows:
#add the powerfactory.pyd path to the Python path
import sys
sys.path.append("C:\\Program Files\\DIgSILENT\\PowerFactory 2017 SP2\\Python\\3.6")
#import the powerfactory module
import powerfactory
#start PowerFactory in unattended mode (engine mode)
app = powerfactory.GetApplication()
#get the user
user = app.GetCurrentUser()
#activate the project
project = app.ActivateProject('Python Test')  #activate project "Python Test"
prj = app.GetActiveProject()  #returns the activated project
#run the python code below
ldf = app.GetFromStudyCase('ComLdf')  #calling the load flow command object
ldf.Execute()  #executing the load flow command
#get the list of lines contained in the project
Lines = app.GetCalcRelevantObjects('*.ElmLne')  #returns all relevant objects, i.e. all lines
for line in Lines:  #get each element out of the list
    name = line.loc_name  #get the name of the line
    value = line.GetAttribute('c:loading')  #return the loading of the line
    #print the results
    print('Loading of the line: %s = %.2f' % (name, value))
When the above code is executed in Spyder for the first time, it shows proper results. However, when re-executing the script, the following error appears:
Reloaded modules: powerfactory
Traceback (most recent call last):
File "<ipython-input-9-ae989570f05f>", line 1, in <module>
runfile('C:/Users/zd1n14/Desktop/Python Test/Call Digsilent in
Python.py', wdir='C:/Users/zd1n14/Desktop/Python Test')
File "C:\ProgramData\Anaconda3\lib\site-
packages\spyder\utils\site\sitecustomize.py", line 866, in runfile
execfile(filename, namespace)
File "C:\ProgramData\Anaconda3\lib\site-
packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Users/zd1n14/Desktop/Python Test/Call Digsilent in Python.py",
line 12, in <module>
user=app.GetCurrentUser()
RuntimeError: 'powerfactory.Application' already deleted
Referring to How can I exit powerfactory using Python in Unattended mode?, this may be because PowerFactory is still running. The only way I have found so far is to restart Spyder and execute the script again, which is very inefficient when I want to rewrite and debug the code.
It would be much appreciated if anyone could give me some advice on this problem.
I ran into the same problem. Python is still connected to PowerFactory and raises the error if you try to connect again. What basically worked for me was to kill the instance at the end of the script with
del app
Another idea during debugging could be:
try:
    ...  # do something in your script
finally:
    del app
That way the instance is killed in any case.
The way to solve this is to reload the powerfactory module by adding:
if __name__ == "__main__":
before import powerfactory.
The reason behind this may be found in: What does if __name__ == "__main__": do?.
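Putting the two suggestions together, here is a hedged sketch of how the script from the question could be restructured (same PowerFactory calls as above, with the handle released in a finally block so repeated runs in Spyder can reconnect):
import sys
sys.path.append(r"C:\Program Files\DIgSILENT\PowerFactory 2017 SP2\Python\3.6")
import powerfactory

def main():
    # engine (unattended) mode
    app = powerfactory.GetApplication()
    try:
        app.ActivateProject('Python Test')
        ldf = app.GetFromStudyCase('ComLdf')  # load flow command object
        ldf.Execute()
        for line in app.GetCalcRelevantObjects('*.ElmLne'):
            print('Loading of the line: %s = %.2f'
                  % (line.loc_name, line.GetAttribute('c:loading')))
    finally:
        # drop the reference so the next run in the same console can reconnect
        del app

if __name__ == '__main__':
    main()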
