Snowflake: Assign values from a previous statement in a SQL query
Requirement: Assign values from the previous statement to the next statement in a SQL query, as I run the query via SnowflakeOperator in Airflow.
SQL:
BEGIN
    app = 'abc';
    env = select current_database();
    start_time = select current_timestamp()::timestamp_ntz(9);
    end_time = select current_timestamp()::timestamp_ntz(9);
    duration = (end_time.getTime() - start_time.getTime()) / 1000;
    insert into proc_runtimes
        (env, app, task, start_time, end_time, duration, message)
    values
        (env, app, 'Job Start', start_time.toISOString(), end_time.toISOString(), duration, log_message);
END
EDIT:
Requirement: Assign values from the previous statement to the next statement in a SQL query, as I run the query via SnowflakeOperator in Airflow.
Error: The Airflow SnowflakeOperator is not able to execute the anonymous block statement in the SQL file.
SQL:
BEGIN
    let app := 'abc';
    let env := current_database();
    let start_time := current_timestamp()::timestamp_ntz(9);
    let end_time := current_timestamp()::timestamp_ntz(9);
    let duration := DATEDIFF(seconds, end_time, start_time);
    let log_message := 'some log';
    INSERT INTO proc_runtimes
        (env, app, task_name, start_time, end_time, duration, message)
    SELECT
        :env, :app, 'Job Start', :start_time, :end_time, :duration, :log_message;
END;
Error:
[2022-08-16, 19:38:43 UTC] {cursor.py:696} INFO - query: [BEGIN let env := current_database();]
[2022-08-16, 19:38:43 UTC] {cursor.py:720} INFO - query execution done
[2022-08-16, 19:38:43 UTC] {connection.py:509} INFO - closed
[2022-08-16, 19:38:44 UTC] {connection.py:512} INFO - No async queries seem to be running, deleting session
[2022-08-16, 19:38:44 UTC] {taskinstance.py:1889} ERROR - Task failed with exception
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/snowflake/operators/snowflake.py", line 120, in execute
execution_info = hook.run(self.sql, autocommit=self.autocommit, parameters=self.parameters)
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/snowflake/hooks/snowflake.py", line 301, in run
cur.execute(sql_statement)
File "/home/airflow/.local/lib/python3.7/site-packages/snowflake/connector/cursor.py", line 782, in execute
self.connection, self, ProgrammingError, errvalue
File "/home/airflow/.local/lib/python3.7/site-packages/snowflake/connector/errors.py", line 273, in errorhandler_wrapper
error_value,
File "/home/airflow/.local/lib/python3.7/site-packages/snowflake/connector/errors.py", line 324, in hand_to_other_handler
cursor.errorhandler(connection, cursor, error_class, error_value)
File "/home/airflow/.local/lib/python3.7/site-packages/snowflake/connector/errors.py", line 210, in default_errorhandler
cursor=cursor,
snowflake.connector.errors.ProgrammingError: 001003 (42000): 01a6551a-0501-b736-0251-83014fb1394b: SQL compilation error:
syntax error line 3 at position 34 unexpected '<EOF>'.
Variables should be defined with LET (:= is the assignment operator) and can be accessed later:
Test table:
create or replace table proc_runtimes(env TEXT,
app TEXT,
task_name TEXT,
start_time timestamp_ntz(9),
end_time timestamp_ntz(9),
duration TEXT,
message TEXT);
Main block:
BEGIN
    let app := 'abc';
    let env := current_database();
    let start_time := current_timestamp()::timestamp_ntz(9);
    let end_time := current_timestamp()::timestamp_ntz(9);
    let duration := DATEDIFF(seconds, start_time, end_time);  -- start first, then end, so the duration is non-negative
    let log_message := 'some log';
    INSERT INTO proc_runtimes
        (env, app, task_name, start_time, end_time, duration, message)
    SELECT
        :env, :app, 'Job Start', :start_time, :end_time, :duration, :log_message;
END;
Check:
SELECT * FROM proc_runtimes;
Resolved the issue with the statement below:
execute immediate $$
BEGIN
....
....
....
END;
$$
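For reference, a minimal sketch of how the wrapped block can be submitted from a SnowflakeOperator, assuming Airflow 2.x with the Snowflake provider, an existing connection named snowflake_default, and that the operator is defined inside a with DAG(...) block (the task id and connection id are placeholders). Wrapping the block in EXECUTE IMMEDIATE $$ ... $$ keeps the hook from splitting it at the inner semicolons:

# Minimal sketch; connection and task ids are placeholders.
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

LOG_RUNTIME_SQL = """
EXECUTE IMMEDIATE $$
BEGIN
    LET app := 'abc';
    LET env := current_database();
    LET start_time := current_timestamp()::timestamp_ntz(9);
    LET end_time := current_timestamp()::timestamp_ntz(9);
    LET duration := DATEDIFF(seconds, start_time, end_time);
    LET log_message := 'some log';
    INSERT INTO proc_runtimes
        (env, app, task_name, start_time, end_time, duration, message)
    SELECT :env, :app, 'Job Start', :start_time, :end_time, :duration, :log_message;
END;
$$
"""

# To be defined inside a `with DAG(...)` block.
log_runtime = SnowflakeOperator(
    task_id="log_runtime",                  # placeholder task id
    snowflake_conn_id="snowflake_default",  # placeholder connection id
    sql=LOG_RUNTIME_SQL,
)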
Requirement: I am trying to avoid using Variable.get() and instead use the Jinja-templated {{var.json.variable}}.
I have defined the variables in JSON format, as in the example below, and stored them in the secrets manager as snflk_json.
snflk_json
{
    "snwflke_acct_request_memory": "4000Mi",
    "snwflke_acct_limit_memory": "4000Mi",
    "schedule_interval_snwflke_acct": "0 12 * * *",
    "LIST": [
        "ABC.DEV", "CDD.PROD"
    ]
}
Issue 1: Unable to retrieve the schedule interval from the JSON variable.
Error: Invalid timetable expression: Exactly 5 or 6 columns has to be specified for iterator expression.
I tried to use it in the DAG as below:
schedule_interval = '{{var.json.snflk_json.schedule_interval_snwflke_acct}}',
Issue 2:
I am trying to loop over LIST to create a task for each entry; I tried as below, but in vain.
with DAG(
    dag_id = dag_id,
    default_args = default_args,
    schedule_interval = '{{var.json.usage_snwflk_acct_admin_config.schedule_interval_snwflke_acct}}',
    dagrun_timeout = timedelta(hours=3),
    max_active_runs = 1,
    catchup = False,
    params = {},
    tags = tags
) as dag:
    shares = '{{var.json.snflk_json.LIST}}'
    for s in shares:
        sf_tasks = SnowflakeOperator(
            task_id=f"{s}",
            snowflake_conn_id=snowflake_conn_id,
            sql=sqls,
            params={"sf_env": s},
        )
Error
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/baseoperator.py", line 754, in __init__
validate_key(task_id)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/helpers.py", line 63, in validate_key
raise AirflowException(
airflow.exceptions.AirflowException: The key '{' has to be made of alphanumeric characters, dashes, dots and underscores exclusively
Airflow parses the DAG every few seconds (30 by default), so the for loop actually runs over the literal string {{var.json.snflk_json.LIST}}, and that is why you get that error.
You should use dynamic task mapping (available since Airflow 2.3), or move the per-share work into a Python task that runs at execution time, as in the sketch below.
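A hedged sketch of the dynamic-task-mapping route, assuming Airflow >= 2.3, that the snflk_json variable exists in the secrets backend, and that the connection id and per-share SQL here are only placeholders for the real statements:

from airflow.decorators import task
from airflow.models import Variable
from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook


@task
def get_shares():
    # Runs at execution time, so the secrets backend is not hit on every DAG parse.
    return Variable.get("snflk_json", deserialize_json=True)["LIST"]


@task
def run_share_sql(sf_env: str):
    hook = SnowflakeHook(snowflake_conn_id="snowflake_default")  # placeholder connection id
    hook.run(f"SELECT '{sf_env}'")  # placeholder for the real per-share SQL


# Inside the `with DAG(...)` block: one mapped task instance per entry in LIST.
run_share_sql.expand(sf_env=get_shares())

Note that mapped instances share a single task_id (distinguished by a map index) rather than one task_id per share as in the original loop.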
I have a Snowflake SQL file with a query like the one below. In the SnowflakeOperator, can I return a value so that I can pass an XCom to the next task?
How can I get only the last query id returned for XCom? Basically, I need to get the Snowflake last query id into XCom.
In SQL File:
select columns from tableA ;
select last_query_id();
Error: Multiple SQL statements in a single API call are not supported; use one API call per statement instead.
Or is there a way I can get the query id below returned to XCom?
Code:
class LastQueryId(SnowflakeOperator):
    def execute(self, context: Any) -> None:
        """Run query on snowflake"""
        self.log.info('Executing: %s', self.sql)
        hook = SnowflakeHook(snowflake_conn_id=self.snowflake_conn_id,
                             warehouse=self.warehouse, database=self.database,
                             role=self.role, schema=self.schema, authenticator=self.authenticator)
        result = hook.run(self.sql, autocommit=self.autocommit, parameters=self.parameters)
        self.query_ids = hook.query_ids
        if self.do_xcom_push and len(self.query_ids) > 0:
            return self.query_ids[-1]
UPDATED: I was able to get the Snowflake query id with the code above, but the log also shows the result of the query. How can I avoid that in the log?
[2022-06-23, 20:43:39 UTC] {cursor.py:696} INFO - query: [SELECT modifieddate, documentdate...]
[2022-06-23, 20:43:40 UTC] {cursor.py:720} INFO - query execution done
[2022-06-23, 20:43:56 UTC] {snowflake.py:307} INFO - Statement execution info - {'MODIFIEDDATE': datetime.datetime(2022, 6, 23, 11, 42, 34, 233000), 'DOCUMENTDATE': datetime.datetime(2015, 10, 1, 0, 0)...}
[2022-06-23, 20:43:56 UTC] {snowflake.py:307} INFO - Statement execution info - {'MODIFIEDDATE': datetime.datetime(2022, 6, 23, 13, 50, 45, 377000), 'DOCUMENTDATE': datetime.datetime(2021, 7, 1, 0, 0)...}
[2022-06-23, 20:43:56 UTC] {snowflake.py:307} INFO - Statement execution info - {'MODIFIEDDATE': datetime.datetime(2022, 6, 23, 11, 36, 51, 583000), 'DOCUMENTDATE': datetime.datetime(2015, 8, 31, 0, 0)...}
....
....
....
[2022-06-23, 20:43:56 UTC] {snowflake.py:311} INFO - Rows affected: 22116
[2022-06-23, 20:43:56 UTC] {snowflake.py:312} INFO - Snowflake query id: 01a5259b-0501-98f3-0251-830144baa623
[2022-06-23, 20:43:56 UTC] {connection.py:509} INFO - closed
SnowflakeOperator already stores the query_ids, but it does not push them to XCom.
You can create a custom operator as:
from typing import Any

from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator


class MySnowflakeOperator(SnowflakeOperator):
    def execute(self, context: Any) -> Any:
        """Run query on Snowflake and return the last query id."""
        self.log.info('Executing: %s', self.sql)
        hook = self.get_db_hook()
        execution_info = hook.run(self.sql, autocommit=self.autocommit, parameters=self.parameters)
        self.query_ids = hook.query_ids
        if self.do_xcom_push and len(self.query_ids) > 0:
            return self.query_ids[-1]  # last query_id
If you want to maintain the original operator functionality then you can do:
class MySnowflakeOperator(SnowflakeOperator):
    def execute(self, context: Any) -> Any:
        parent_return_value = super().execute(context)
        if self.do_xcom_push and len(self.query_ids) > 0:
            self.xcom_push(
                context,
                key="last_query_id",
                value=self.query_ids[-1],
            )
        return parent_return_value
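If it helps, a hypothetical downstream sketch for the second variant, pulling the id pushed under the "last_query_id" key (the "run_query" and "report_id" task ids are placeholders):

from airflow.operators.python import PythonOperator


def report_last_query_id(ti):
    # Pull the id pushed by the MySnowflakeOperator task named "run_query" (placeholder name).
    last_id = ti.xcom_pull(task_ids="run_query", key="last_query_id")
    print(f"Snowflake last query id: {last_id}")


report_id = PythonOperator(task_id="report_id", python_callable=report_last_query_id)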
I am having issues with calling TaskGroups: the error log thinks my job id is avg_speed_20220502_22c11bdf instead of just avg_speed, and I can't figure out why.
Here's my code:
with DAG(
    'debug_bigquery_data_analytics',
    catchup=False,
    default_args=default_arguments) as dag:

    # Note to self: the bucket region and the dataproc cluster should be in the same region
    create_cluster = DataprocCreateClusterOperator(
        task_id='create_cluster',
        ...
    )

    with TaskGroup(group_id='weekday_analytics') as weekday_analytics:
        avg_temperature = DummyOperator(task_id='avg_temperature')
        avg_tire_pressure = DummyOperator(task_id='avg_tire_pressure')
        avg_speed = DataprocSubmitPySparkJobOperator(
            task_id='avg_speed',
            project_id='...',
            main='gs://.../.../avg_speed.py',
            cluster_name='spark-cluster-{{ ds_nodash }}',
            region='...',
            dataproc_jars=['gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar'],
        )
        avg_temperature >> avg_tire_pressure >> avg_speed

    delete_cluster = DataprocDeleteClusterOperator(
        task_id='delete_cluster',
        project_id='...',
        cluster_name='spark-cluster-{{ ds_nodash }}',
        region='...',
        trigger_rule='all_done',
    )

    create_cluster >> weekday_analytics >> delete_cluster
Here's the error message I get:
google.api_core.exceptions.InvalidArgument: 400 Job id 'weekday_analytics.avg_speed_20220502_22c11bdf' must conform to '[a-zA-Z0-9]([a-zA-Z0-9\-\_]{0,98}[a-zA-Z0-9])?' pattern
[2022-05-02, 11:46:11 UTC] {taskinstance.py:1278} INFO - Marking task as FAILED. dag_id=debug_bigquery_data_analytics, task_id=weekday_analytics.avg_speed, execution_date=20220502T184410, start_date=20220502T184610, end_date=20220502T184611
[2022-05-02, 11:46:11 UTC] {standard_task_runner.py:93} ERROR - Failed to execute job 549 for task weekday_analytics.avg_speed (400 Job id 'weekday_analytics.avg_speed_20220502_22c11bdf' must conform to '[a-zA-Z0-9]([a-zA-Z0-9\-\_]{0,98}[a-zA-Z0-9])?' pattern; 18116)
[2022-05-02, 11:46:11 UTC] {local_task_job.py:154} INFO - Task exited with return code 1
[2022-05-02, 11:46:11 UTC] {local_task_job.py:264} INFO - 1 downstream tasks scheduled from follow-on schedule check
In Airflow the task identifier is task_id. However, when using TaskGroups you can have the same task_id in different groups, so tasks defined inside a task group get the identifier group_id.task_id.
For apache-airflow-providers-google>7.0.0:
The bug has been fixed. It should work now.
For apache-airflow-providers-google<=7.0.0:
You are having issues because DataprocJobBaseOperator has:
:param job_name: The job name used in the DataProc cluster. This name by default
is the task_id appended with the execution data, but can be templated. The
name will always be appended with a random number to avoid name clashes.
The problem is that Airflow adds the '.' character and Google doesn't accept it, so to fix your issue you must override the default of the job_name parameter with a string of your choice. You can set it to the task_id if you wish.
I opened https://github.com/apache/airflow/issues/23439 to report this bug; in the meantime you can follow the suggestion above.
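A hedged sketch of the workaround applied to the operator from the question (the '...' placeholders are kept from the original; job_name is inherited from DataprocJobBaseOperator):

from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitPySparkJobOperator

avg_speed = DataprocSubmitPySparkJobOperator(
    task_id='avg_speed',
    job_name='avg_speed',  # dot-free name, so the generated Dataproc job id matches the allowed pattern
    project_id='...',
    main='gs://.../.../avg_speed.py',
    cluster_name='spark-cluster-{{ ds_nodash }}',
    region='...',
    dataproc_jars=['gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar'],
)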
I need to get the last two successful execution dates of an Airflow job to use in my current run.
Example:
Execution date    Job status
2020-05-03        success
2020-05-04        fail
2020-05-05        success
Question:
When I run my job on May 6th, I should get the values of May 3rd and 5th into variables. Is that possible?
You can leverage SQLAlchemy magic to retrieve the execution_dates of the last 'n' successful runs:
from typing import List, Optional

from pendulum import Pendulum

from airflow.models.taskinstance import TaskInstance
from airflow.settings import Session
from airflow.utils.session import provide_session  # in Airflow 1.10 this lives in airflow.utils.db
from airflow.utils.state import State


def last_execution_date(
    dag_id: str, task_id: str, n: int, session: Optional[Session] = None
) -> List[Pendulum]:
    """
    Query the Airflow metadata database and return the execution dates
    of the most recent successful runs.

    Args:
        dag_id: DAG name
        task_id: task name
        n: number of runs
        session: SQLAlchemy session connected to the Airflow metadata DB

    Returns:
        List of execution dates of the most recent n successful runs.
    """
    query_val = (
        session.query(TaskInstance)
        .filter(
            TaskInstance.dag_id == dag_id,
            TaskInstance.task_id == task_id,
            TaskInstance.state == State.SUCCESS,
        )
        .order_by(TaskInstance.execution_date.desc())
        .limit(n)
    )
    execution_dates: List[Pendulum] = [ti.execution_date for ti in query_val]
    return execution_dates


# The function above can be used as a utility and wrapped with provide_session:
last_success_date_fn = provide_session(last_execution_date)  # provide_session can also be used as a decorator.
This snippet is tested end to end and can be used in prod.
I referred to the tree() method of views.py when coming up with this script.
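A hypothetical usage sketch for the example in the question, where the two most recent successful runs are May 3rd and May 5th (the dag and task ids are placeholders):

# provide_session injects the session argument; newest execution date comes first.
last_two = last_success_date_fn(dag_id="my_dag_id", task_id="my_task_id", n=2)
print(last_two)  # e.g. the May 5th run followed by the May 3rd run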
Alternatively, you can run this SQL query against Airflow's metadata DB to retrieve the last n execution dates of successful runs:
SELECT execution_date
FROM task_instance
WHERE dag_id = 'my_dag_id'
AND task_id = 'my_task_id'
AND state = 'success'
ORDER BY execution_date DESC
LIMIT n
In the latest version of Airflow:
def last_execution_date(dag_id: str, task_id: str, n: int):
    session = Session()
    query_val = (
        session.query(TaskInstance)
        .filter(
            TaskInstance.dag_id == dag_id,
            TaskInstance.task_id == task_id,
            TaskInstance.state == State.SUCCESS,
        )
        .order_by(TaskInstance.execution_date.desc())
        .limit(n)
    )
    execution_dates = list(map(lambda ti: ti.execution_date, query_val))
    return execution_dates
How do I programmatically get the name of an ODBC driver's DLL file for a given ODBC driver? For example, given "SQL Server Native Client 10.0", I want to find the name of that driver's DLL file: sqlncli10.dll. I can see this in REGEDIT in the "Driver" entry in the registry under HKEY_LOCAL_MACHINE\SOFTWARE\ODBC\ODBCINST.INI. If I try to read the value from the registry in my code, it returns an empty string. I also tried the ODBC API function SQLDrivers. The code below successfully returns all the values of the attributes in the Attribs variable except "Driver". Everything else is there (APILevel, ConnectFunctions, CPTimeout, etc.), but "Driver" is not in the list.
repeat
  Status := SQLDrivers(HENV, SQL_FETCH_NEXT, PAnsiChar(DriverName), 255,
                       NameLen, PAnsiChar(Attribs), 1024, AttrLen);
  if Status = 0 then begin
    List.Add(DriverName);
    List.Add(Attribs);
  end;
until Status <> 0;
You can use SQLGetInfo() with InfoType=SQL_DRIVER_NAME
I hope it will look like this:
Status := SQLGetInfo(ConnEnv, SQL_DRIVER_NAME, PAnsiChar(DriverName), 255, NameLen);
But this function works only with an already connected database.
I tried SQLDrivers() and you are right: in my environment this function also does not return the DLL name. So I tried to read it from the registry, and it worked this way:
DLLName := RegGetStringDirect(HKEY_LOCAL_MACHINE, 'SOFTWARE\ODBC\ODBCINST.INI\' + DriverName, 'Driver');
For driver: IBM INFORMIX ODBC DRIVER I got: C:\informix\bin\iclit09b.dll
For driver: SQL Server I got: C:\WINDOWS\system32\SQLSRV32.dll
RegGetStringDirect() is my own function, based on the Windows API, for reading a value from the registry.
EDIT:
Two functions by Ron Schuster to read the "SQL Server" ODBC driver DLL name, moved from a comment:
procedure TForm1.Button1Click(Sender: TObject);
// using Windows API calls
var
  KeyName, ValueName, Value: string;
  Key: HKEY;
  ValueSize: Integer;
begin
  ValueName := 'Driver';
  KeyName := 'SOFTWARE\ODBC\ODBCINST.INI\SQL Native Client';
  if RegOpenKeyEx(HKEY_LOCAL_MACHINE, PChar(KeyName), 0, KEY_READ, Key) = 0 then
    if RegQueryValueEx(Key, PChar(ValueName), nil, nil, nil, @ValueSize) = 0 then begin
      SetLength(Value, ValueSize);
      RegQueryValueEx(Key, PChar(ValueName), nil, nil, PByte(Value), @ValueSize);
      ShowMessage(Value);
    end;
end;
procedure TForm1.Button2Click(Sender: TObject);
// using TRegistry class
var
  KeyName, ValueName, Value: string;
  Reg: TRegistry;
begin
  ValueName := 'Driver';
  KeyName := 'SOFTWARE\ODBC\ODBCINST.INI\SQL Native Client';
  Reg := TRegistry.Create;
  try
    Reg.RootKey := HKEY_LOCAL_MACHINE;
    if Reg.OpenKeyReadOnly(KeyName) then begin
      Value := Reg.ReadString(ValueName);
      ShowMessage(Value);
    end;
  finally
    Reg.Free;
  end;
end;