Error when exporting from BigQuery to MySQL - airflow

I am trying to export a table from BigQuery to a Google Cloud SQL (MySQL) database.
I found an operator called BigQueryToMySqlOperator (documented here: https://airflow.apache.org/docs/apache-airflow-providers-google/stable/_api/airflow/providers/google/cloud/transfers/bigquery_to_mysql/index.html?highlight=bigquerytomysqloperator#module-airflow.providers.google.cloud.transfers.bigquery_to_mysql).
When I deploy the DAG containing this task onto Cloud Composer, the task always fails with the following error:
Traceback (most recent call last):
File "/opt/python3.8/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1113, in _run_raw_task
self._prepare_and_execute_task_with_callbacks(context, task)
File "/opt/python3.8/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1287, in _prepare_and_execute_task_with_callbacks
result = self._execute_task(context, task_copy)
File "/opt/python3.8/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1317, in _execute_task
result = task_copy.execute(context=context)
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/transfers/bigquery_to_mysql.py", line 166, in execute
for rows in self._bq_get_data():
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/transfers/bigquery_to_mysql.py", line 138, in _bq_get_data
response = cursor.get_tabledata(
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/hooks/bigquery.py", line 2508, in get_tabledata
return self.hook.get_tabledata(*args, **kwargs)
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/hooks/bigquery.py", line 1284, in get_tabledata
rows = self.list_rows(dataset_id, table_id, max_results, selected_fields, page_token, start_index)
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/common/hooks/base_google.py", line 412, in inner_wrapper
raise AirflowException(
airflow.exceptions.AirflowException: You must use keyword arguments in this methods rather than positional
I don't really understand why it is throwing this error. Can anyone help me figure out what went wrong, or how I should export data from BigQuery to a MySQL database? Many thanks for your help!
EDIT: My operator code basically looks like this:
transfer_data = BigQueryToMySqlOperator(
    task_id='task_id',
    dataset_table='origin_bq_table',
    mysql_table='dest_table_name',
    replace=True,
)

Based on the stacktrace, you are most likely using apache-airflow-providers-google==2.2.0.
airflow.exceptions.AirflowException: You must use keyword arguments in
this methods rather than positional
This error originates from the GoogleBaseHook and can be traced back to the BigQueryToMySqlOperator:
BigQueryToMySqlOperator > BigQueryHook > BigQueryConnection > BigQueryCursor > get_tabledata
The reason why you are getting the AirflowException is because get_tabledata
is called as part of the execute method.
Unfortunately, the test for the operator is not comprehensive, since it only checks whether the method was called with the correct parameters.
I think this will require a new release of the Google provider in which BigQueryToMySqlOperator calls list_rows with keyword arguments instead of going through get_tabledata, which calls list_rows with positional arguments.
I have also opened a GitHub issue in the Airflow repository.
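Until that fix lands, one possible workaround (not part of the original answer) is to bypass the transfer operator and move the rows yourself with the underlying hooks inside a PythonOperator. This is only a minimal sketch under assumed connection IDs (google_cloud_default, mysql_default) and an illustrative query; batching and error handling are left out:

from airflow.operators.python import PythonOperator
from airflow.providers.google.cloud.hooks.bigquery import BigQueryHook
from airflow.providers.mysql.hooks.mysql import MySqlHook

def bq_to_mysql(**_):
    # Read the BigQuery table into a DataFrame (fine for small/medium tables;
    # for large tables you would page through the results instead).
    bq_hook = BigQueryHook(gcp_conn_id='google_cloud_default', use_legacy_sql=False)
    df = bq_hook.get_pandas_df(sql='SELECT * FROM `my_project.my_dataset.origin_bq_table`')

    # Write the rows into the Cloud SQL (MySQL) table.
    mysql_hook = MySqlHook(mysql_conn_id='mysql_default')
    mysql_hook.insert_rows(
        table='dest_table_name',
        rows=df.itertuples(index=False, name=None),
        target_fields=list(df.columns),
        replace=True,
    )

transfer_data = PythonOperator(
    task_id='bq_to_mysql_workaround',
    python_callable=bq_to_mysql,
    dag=dag,  # assumes a DAG object named `dag` defined elsewhere
)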

Related

How to add custom component to Elyra's list of available airflow operators?

Trying to make my own component based on KubernetesPodOperator. I am able to define and add the component to the list of components but when trying to run it, I get:
Operator 'KubernetesPodOperator' of node 'KubernetesPodOperator' is not configured in the list of available operators. Please add the fully-qualified package name for 'KubernetesPodOperator' to the AirflowPipelineProcessor.available_airflow_operators configuration.
and this error:
Traceback (most recent call last):
File "/opt/conda/lib/python3.9/site-packages/tornado/web.py", line 1704, in _execute
result = await result
File "/opt/conda/lib/python3.9/site-packages/elyra/pipeline/handlers.py", line 120, in post
response = await PipelineProcessorManager.instance().process(pipeline)
File "/opt/conda/lib/python3.9/site-packages/elyra/pipeline/processor.py", line 134, in process
res = await asyncio.get_event_loop().run_in_executor(None, processor.process, pipeline)
File "/opt/conda/lib/python3.9/asyncio/futures.py", line 284, in __await__
yield self # This tells Task to wait for completion.
File "/opt/conda/lib/python3.9/asyncio/tasks.py", line 328, in __wakeup
future.result()
File "/opt/conda/lib/python3.9/asyncio/futures.py", line 201, in result
raise self._exception
File "/opt/conda/lib/python3.9/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/opt/conda/lib/python3.9/site-packages/elyra/pipeline/airflow/processor_airflow.py", line 122, in process
pipeline_filepath = self.create_pipeline_file(pipeline=pipeline,
File "/opt/conda/lib/python3.9/site-packages/elyra/pipeline/airflow/processor_airflow.py", line 420, in create_pipeline_file
target_ops = self._cc_pipeline(pipeline, pipeline_name)
File "/opt/conda/lib/python3.9/site-packages/elyra/pipeline/airflow/processor_airflow.py", line 368, in _cc_pipeline
raise ValueError(f"Operator '{component.name}' of node '{operation.name}' is not configured "
ValueError: Operator 'KubernetesPodOperator' of node 'KubernetesPodOperator' is not configured in the list of available operators. Please add the fully-qualified package name for 'KubernetesPodOperator' to the AirflowPipelineProcessor.available_airflow_operators configuration.
After looking through the source code, I can see these lines in processor_airflow.py:
# This specifies the default airflow operators included with Elyra. Any Airflow-based
# custom connectors should create/extend the elyra configuration file to include
# those fully-qualified operator/class names.
available_airflow_operators = ListTrait(
    CUnicode(),
    ["airflow.operators.slack_operator.SlackAPIPostOperator",
     "airflow.operators.bash_operator.BashOperator",
     "airflow.operators.email_operator.EmailOperator",
     "airflow.operators.http_operator.SimpleHttpOperator",
     "airflow.contrib.operators.spark_sql_operator.SparkSqlOperator",
     "airflow.contrib.operators.spark_submit_operator.SparkSubmitOperator"],
    help="""List of available Apache Airflow operator names.
    Operators available for use within Apache Airflow pipelines. These operators must
    be fully qualified (i.e., prefixed with their package names).
    """,
).tag(config=True)
though I am unsure whether this can be extended from the client.
The available_airflow_operators list is a configurable trait in Elyra. You’ll have to add the fully-qualified package name for the KubernetesPodOperator to this list in order for it to create the DAG correctly.
To do so, generate a config file from the command line with jupyter elyra --generate-config. Open the created file and add the following line (you can add it under the PipelineProcessor(LoggingConfigurable) heading if you prefer to keep the file organized):
c.AirflowPipelineProcessor.available_airflow_operators.append("airflow.providers.cncf.kubernetes.operators.kubernetes_pod.KubernetesPodOperator")
Change that string value to the correct package for your use case if it's not the above (make sure that it ends with the class name of the required operator). If you need to add multiple packages, you can also use extend rather than append.
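For instance (not from the original answer), adding multiple entries with extend might look like this in the generated config file; the Spark entry is just an illustrative second package path:

# In the generated config file (e.g. jupyter_elyra_config.py); keep only the
# entries that match the operators you actually use.
c.AirflowPipelineProcessor.available_airflow_operators.extend(
    [
        "airflow.providers.cncf.kubernetes.operators.kubernetes_pod.KubernetesPodOperator",
        "airflow.contrib.operators.spark_submit_operator.SparkSubmitOperator",
    ]
)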
Edit: here is the link to the relevant documentation

TypeError: string indices must be integers on ArangoDB

The arango module gives a weird error while accessing databases:
from arango import ArangoClient
client = ArangoClient(hosts='http://localhost:8529/')
sys_db = client.db('_system', username='root', password='root')
sys_db.databases()
below is the error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/arangovenv/lib/python3.6/site-packages/arango/database.py", line 699, in databases
    return self._execute(request, response_handler)
  File "/home/ubuntu/arangovenv/lib/python3.6/site-packages/arango/api.py", line 66, in _execute
    return self._executor.execute(request, response_handler)
  File "/home/ubuntu/arangovenv/lib/python3.6/site-packages/arango/executor.py", line 82, in execute
    return response_handler(resp)
  File "/home/ubuntu/arangovenv/lib/python3.6/site-packages/arango/database.py", line 697, in response_handler
    return resp.body['result']
TypeError: string indices must be integers
Calling the database module from "packages/arango/database.py" directly gives me the same error.
My env:
1) Ubuntu 16.04
2) python-arango==5.2.1
Any help appreciated.
If you are running it on some server, it may be a server issue; it was in my case, at least. I ran the following to clear the proxy and it worked fine.
export http_proxy=''
As I guessed, resp.body is not the data type the code expects: line 697 of database.py expects a dict-like body, but it is getting a plain string. For example:
>>> data = "MyName"
>>> print(data[0])
M
>>> print(data['anything'])
TypeError: string indices must be integers
The first print call gives a result, while the second throws the error.
I hope this helps solve your problem.
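To check which situation you are in, it can help to look at the raw HTTP response outside the driver. This is only a diagnostic sketch (it assumes the requests package and the same host/credentials as the question): if the body comes back as HTML or plain text rather than JSON, something like a proxy is rewriting the response, which matches the http_proxy fix above.

import requests

# Hit the same endpoint python-arango uses for listing databases and inspect the body.
resp = requests.get(
    "http://localhost:8529/_api/database",
    auth=("root", "root"),
)
print(resp.status_code)
print(resp.text[:200])  # JSON like {"result": [...]} is healthy; HTML or plain text points to a proxy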

Access parent dag context at subdag creation time in airflow?

I'm trying to access, at subdag creation time, some XCom data from the parent DAG. I searched for a way to achieve this but didn't find anything.
def test(task_id):
    logging.info(f' execution of task {task_id}')

def load_subdag(parent_dag_id, child_dag_id, args):
    dag_subdag = DAG(
        dag_id='{0}.{1}'.format(parent_dag_id, child_dag_id),
        default_args=args,
        schedule_interval="@daily",
    )
    with dag_subdag:
        r = DummyOperator(task_id='random')
        for i in range(r.xcom_pull(task_ids='take_Ana', key='the_message', dag_id=parent_dag_id)):
            t = PythonOperator(
                task_id='load_subdag_{0}'.format(i),
                default_args=args,
                python_callable=print_context,
                op_kwargs={'task_id': 'load_subdag_{0}'.format(i)},
                dag=dag_subdag,
            )
    return dag_subdag

load_tasks = SubDagOperator(
    task_id='load_tasks',
    subdag=load_subdag(dag.dag_id, 'load_tasks', args),
    default_args=args,
)
I got this error with my code:
1 | Traceback (most recent call last):
airflow_1 | File "/usr/local/lib/python3.6/site-packages/airflow/models.py", line 374, in process_file
airflow_1 | m = imp.load_source(mod_name, filepath)
airflow_1 | File "/usr/local/lib/python3.6/imp.py", line 172, in load_source
airflow_1 | module = _load(spec)
airflow_1 | File "<frozen importlib._bootstrap>", line 684, in _load
airflow_1 | File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
airflow_1 | File "<frozen importlib._bootstrap_external>", line 678, in exec_module
airflow_1 | File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
airflow_1 | File "/app/dags/airflow_dag_test.py", line 75, in <module>
airflow_1 | 'load_tasks', args),
airflow_1 | File "/app/dags/airflow_dag_test.py", line 55, in load_subdag
airflow_1 | for i in range(r.xcom_pull(task_ids='take_Ana', key='the_message', dag_id=parent_dag_id)):
airflow_1 | TypeError: xcom_pull() missing 1 required positional argument: 'context'
The error is simple: you are missing the context argument required by the xcom_pull() method. But you can't really just create a context to pass into this method; it is a Python dictionary that Airflow passes to anchor methods like pre_execute() and execute() of BaseOperator (the parent class of all Operators).
In other words, context becomes available only when an Operator is actually executed, not during DAG definition. And that makes sense because, in the taxonomy of Airflow, XComs are a communication mechanism between tasks in real time: they let tasks talk to each other while they are running.
But at the end of the day, XComs, just like every other Airflow model, are persisted in the backend meta-db. So of course you can retrieve them directly from there (obviously only the XComs of tasks that have run in the past). While I don't have a code snippet, you can have a look at cli.py, where they've used the SQLAlchemy ORM to work with models and the backend db. Do understand that this would mean a query being fired to your backend db every time the DAG-definition file is parsed, which happens rather frequently.
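Not part of the original answer, but a rough sketch of what such a direct metadata-DB lookup could look like on the Airflow 1.x line shown in your traceback (the dag/task/key names are placeholders taken from your snippet; depending on the version, the stored value may still need deserializing, e.g. with pickle):

from airflow.models import XCom
from airflow.utils.db import provide_session

@provide_session
def get_latest_xcom(dag_id, task_id, key, session=None):
    """Fetch the most recent XCom row for dag_id/task_id/key straight from the meta-db."""
    record = (
        session.query(XCom)
        .filter(XCom.dag_id == dag_id, XCom.task_id == task_id, XCom.key == key)
        .order_by(XCom.execution_date.desc())
        .first()
    )
    return record.value if record else None  # may be pickled/JSON-encoded depending on version

# Used at parse time, e.g. to size the loop in load_subdag():
n = get_latest_xcom(dag_id='parent_dag_id', task_id='take_Ana', key='the_message') or 0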
Useful links
How can one set a variable for use only during a certain dag_run
How to pull xcom value from other task instance in the same DAG run (not the most recent one)?
EDIT-1
After looking at your code snippet, I got alarmed. Assuming the value returned by xcom_pull() will keep changing frequently, the number of tasks in your DAG will also keep changing. This can lead to unpredictable behaviour (you should do a fair bit of research, but I don't have a good feeling about it).
I'd suggest you revisit your entire task workflow and condense it down to a design where the number of tasks and the structure of the DAG are known ahead of time (at the time the DAG-definition file is executed). You can of course iterate over a JSON file / the result of a SQL query (like the SQLAlchemy thing mentioned earlier) etc. to spawn your actual tasks, but that file / db / whatever shouldn't be changing frequently.
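Purely as an illustration (not from the original answer), a parse-time loop over a static config keeps the DAG shape fixed while avoiding copy-pasted tasks; the table names and callable below are placeholders:

from airflow.operators.python_operator import PythonOperator  # Airflow 1.x import path, matching the question

def load_table(table_name, **_):
    print('loading {0}'.format(table_name))

TABLES = ['orders', 'customers', 'invoices']  # static config; could equally be read from a JSON file at parse time

for table in TABLES:
    PythonOperator(
        task_id='load_{0}'.format(table),
        python_callable=load_table,
        op_kwargs={'table_name': table},
        dag=dag,  # assumes a DAG object named `dag` defined elsewhere in the file
    )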
Do understand that merely iterating over a list to generate tasks is not problematic; what's NOT possible is to have the structure of your DAG depend on the result of an upstream task. For example, you can't have n tasks created in your DAG based on an upstream task calculating the value of n at runtime.
So this is not possible
Airflow dynamic tasks at runtime
Is there a way to create dynamic workflows in Airflow
Dynamically create list of tasks
But this is possible (including what you are trying to achieve; even though the way you are doing it doesn't seem like a good idea)
Dynamically Generating DAGs in Airflow
Airflow DAG dynamic structure
etsy/boundary-layer
ajbosco/dag-factory
EDIT-2
So, as it turns out, generating tasks from the output of upstream tasks is possible after all, although it requires a significant amount of knowledge of Airflow's internal workings as well as a tinge of creativity.
In fact, unless you really understand it, I would strongly recommend staying away from it.
But for those who know no bounds, here's the trick: Proper way to create Dynamic Workflows in Airflow
EDIT-3
Airflow 2.3 added Dynamic Task Mapping. It can be used to iterate over a list and spin up a Task for each item.
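For reference, a minimal Dynamic Task Mapping sketch (requires Airflow 2.3+; the DAG and task names here are made up):

import pendulum
from airflow.decorators import dag, task

@dag(start_date=pendulum.datetime(2023, 1, 1), schedule_interval=None, catchup=False)
def mapped_example():

    @task
    def get_items():
        # In practice this could be a query result or an upstream task's return value.
        return [1, 2, 3]

    @task
    def process(item):
        print(f'processing {item}')

    # One mapped task instance is created at runtime for each element returned above.
    process.expand(item=get_items())

mapped_example()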

Weird AttributeError: OpenMDAO says param has not been initialized when I run my simulation under mpirun

I am running developmental scientific code. I am stuck on a cryptic error message and am curious what the OpenMDAO team thinks. When I run the code in serial, it works with no issues, but when I run it under mpirun, OpenMDAO throws this error:
Traceback (most recent call last):
File "test/exampleOptimizationAEP.py", line 129, in <module>
prob['ratedPower'] = ratedPower
.....
File "/scratch/jquick/test/lib/python2.7/site-packages/openmdao-1.7.3-py2.7.egg/openmdao/core/vec_wrapper.py", line 1316, in __setitem__
(self.name, name))
AttributeError: 'params' has not been initialized, setup() must be called before 'ratedPower' can be accessed
I am not sure how to approach this. There is nothing obviously different about the ratedPower variable in the code. What information does this error give me about what is going wrong?
This is a bug in OpenMDAO <= v1.7.2. Look at the output of check_setup and see the list of parameters without associated unknowns; you will find that variable in there. When running in parallel (because of the bug), you cannot set any hanging params (ones without associated unknowns) in your setup script.
The way to fix it is to add an IndepVarComp for any variable whose value you need to initialize.
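A minimal sketch of that fix using the pre-2.0 OpenMDAO API from the question's traceback; MyComp here is a placeholder component standing in for the real model:

from openmdao.api import Problem, Group, IndepVarComp, Component

class MyComp(Component):
    """Placeholder component with a 'ratedPower' param, standing in for the real model."""
    def __init__(self):
        super(MyComp, self).__init__()
        self.add_param('ratedPower', val=0.0)
        self.add_output('dummy', val=0.0)

    def solve_nonlinear(self, params, unknowns, resids):
        unknowns['dummy'] = params['ratedPower'] * 2.0

root = Group()
# Give the formerly hanging param an explicit source by promoting both components
# on the same variable name.
root.add('ratedPower_ivc', IndepVarComp('ratedPower', 0.0), promotes=['ratedPower'])
root.add('my_comp', MyComp(), promotes=['ratedPower'])

prob = Problem(root)
prob.setup(check=True)       # the check output lists params without associated unknowns
prob['ratedPower'] = 5000.0  # now settable, including under mpirun
prob.run()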

products.sqlalchemypas-1.0-py2.6.egg AttributeError: getGroupsForPrincipal

I am using products.sqlalchemypas-1.0-py2.6.egg for authenticating users from an MSSQL table. Authentication works as expected, but now I'm trying to implement a groups plugin to get groups from a different table. What's happening is that when I try to log in, it gives me an error saying AttributeError: getGroupsForPrincipal.
The error traceback is:
2012-02-21T15:33:14 INFO Zope Ready to handle requests
2012-02-21T15:39:25 ERROR Zope.SiteErrorLog 1329838765.580.598770330561 http://localhost:8060/dev/login_form
Traceback (innermost last):
Module ZPublisher.Publish, line 115, in publish
Module ZPublisher.BaseRequest, line 596, in traverse
Module Products.PluggableAuthService.PluggableAuthService, line 235, in validate
Module Products.PluggableAuthService.PluggableAuthService, line 735, in _findUser
Module Products.PluggableAuthService.PluggableAuthService, line 668, in _getGroupsForPrincipal
AttributeError: getGroupsForPrincipal
My definition in plugin.py is:
def getGroupsForPrincipal(self, principal=getSecurityManager().getUser().getId(), request=None):
    "Getting groups from SIMS"
    import pdb; pdb.set_trace()
    groups = []
    results = self.simsGroupForUser(username=principal)
    for row in results.dictionaries():
        group = row.get('group')
        groups.append(group)
    return groups
I don't know why it's not able to reach this method in plugin.py; I do have an implements block where I declared this interface, and the groups interface does show up on my acl_users PAS object.
[added]
I've tried importing my plugin in the debugger and calling this method, and I get the same error, so I don't know whether I need to define anything specific for PAS to pick up this method. I did declare in my implements class that it implements IGroupsPlugin.
Any comment is a great help, as always.
I don't think your method definition does what you expect it to: principal=getSecurityManager().getUser().getId() will evaluate the default parameter at import time rather than at method execution time.
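For illustration (building on the question's own method and assuming the same imports and simsGroupForUser helper in plugin.py), resolving the principal at call time would look roughly like this:

def getGroupsForPrincipal(self, principal=None, request=None):
    "Getting groups from SIMS"
    # Resolve the current user when the method runs, not when the module is imported.
    if principal is None:
        principal = getSecurityManager().getUser().getId()
    groups = []
    results = self.simsGroupForUser(username=principal)
    for row in results.dictionaries():
        groups.append(row.get('group'))
    return groups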
Just found that my file had wrong indentation; that's why it was giving the attribute error. Thanks all for your time and comments.
