Airflow not reading latest data from table

I have an Airflow DAG with three tasks, all using the PostgresOperator and all with autocommit = True. Briefly:
1. sp1: reads events from the main event tracking table and inserts them into an event tracking temp table. I checked the temp table after the task ran to make sure the events were loaded.
2. sp2: runs business logic based on the event tracking temp table.
3. sp3: truncates the event tracking temp table.
call_populate_landing_event_tracking_temp_micro_batch = PostgresOperator(
    task_id="call_populate_landing_event_tracking_temp_micro_batch",
    postgres_conn_id="postgres_default",
    sql="call ods_landing.sp_populate_landing_event_tracking_temp_mirco_batch()",
    autocommit=True,
)

call_mirco_batch = PostgresOperator(
    task_id="call_mirco_batch",
    postgres_conn_id="postgres_default",
    sql="call ods_landing.{{ params.batch_sp_name }}",
    autocommit=True,
)

call_truncate_landing_event_tracking_temp_micro_batch = PostgresOperator(
    task_id="call_truncate_landing_event_tracking_temp_micro_batch",
    postgres_conn_id="postgres_default",
    sql="call ods_landing.sp_truncate_landing_event_tracking_temp_micro_batch()",
    trigger_rule="all_done",
    autocommit=True,
)

call_populate_landing_event_tracking_temp_micro_batch >> call_mirco_batch >> call_truncate_landing_event_tracking_temp_micro_batch
The behaviour I'm seeing is that when I trigger the DAG in Airflow, I am not getting the latest changes.
If I run sp1 > sp2 > sp3 manually via DBeaver, it works as expected.
The issue seems to lie with task 2: Airflow appears to be reading an old version of the tracking table, as if it were cached somewhere.
I have also tried a DAG using the PythonOperator with an explicit commit, cursor close, and connection close, but it does not help.
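For reference, a simplified sketch of that PythonOperator attempt (import paths assume Airflow 2 provider packages; the SQL shown is just the first procedure):

from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


def call_sp_with_explicit_commit():
    # Open a connection via the hook, call the procedure, then commit and
    # close the cursor and connection explicitly.
    hook = PostgresHook(postgres_conn_id="postgres_default")
    conn = hook.get_conn()
    cur = conn.cursor()
    try:
        cur.execute("call ods_landing.sp_populate_landing_event_tracking_temp_mirco_batch()")
        conn.commit()
    finally:
        cur.close()
        conn.close()


call_sp_python = PythonOperator(
    task_id="call_sp_with_explicit_commit",
    python_callable=call_sp_with_explicit_commit,
)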
Any idea where I should be checking?

Related

Error "Invalid job id" when DAG is triggered manually from UI using BigQueryInsertJobOperator

I am trying to trigger a stored procedure from Cloud Composer using BigQueryInsertJobOperator. The DAG runs fine when triggered on schedule, but it fails with the error below when triggered manually from the Airflow UI.
Invalid jobID "airflow_DQ_create_stored_procedure_2021_11_02T01_36_02.229065_00_00_XXXXXXXXXXXXXXXXXX". Job IDs must be alphanumeric (plus underscores and dashes) and must be at most 1024 characters long.
As far as I can see, the generated job ID contains only alphanumeric characters and '_', and it is nowhere near 1024 characters long.
Both manual and scheduled triggering should be possible. Please help!
EDIT-1:
It works if we configure the job ID from our end, but it throws the error if BigQuery generates the job ID automatically.
Below is the code snippet:
import datetime

from airflow import models
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

yesterday = datetime.datetime.combine(
    datetime.datetime.today() - datetime.timedelta(1),
    datetime.datetime.min.time())

default_dag_args = {
    'start_date': yesterday
}

with models.DAG(
        'DQ_create',
        schedule_interval='@daily',
        default_args=default_dag_args
) as dag:
    Stored_Procedure = BigQueryInsertJobOperator(
        task_id='stored_procedure',
        configuration={
            "query": {
                "query": "CALL `project.dataset.procedure`() ",
                "useLegacySql": False,
            }
        },
    )

    Stored_Procedure
You've found a bug in the BigQueryInsertJobOperator class. This class is responsible for constructing the job_id, which it does by taking the values of your DAG variables and concatenating them:
exec_date = context['execution_date'].isoformat()
job_id = f"airflow_{self.dag_id}_{self.task_id}_{exec_date}_{uniqueness_suffix}"
The execution_date you're using has sub-second (microsecond) precision, so its isoformat() string, and therefore the exec_date variable, contains a period.
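For example (illustrative timestamp, matching the one in the error message):

import datetime

datetime.datetime(2021, 11, 2, 1, 36, 2, 229065).isoformat()
# -> '2021-11-02T01:36:02.229065'  (note the period before the microseconds)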
You have four options to fix this:
1. Create a subclass of BigQueryInsertJobOperator that truncates its execution_date down to the second before it calls super().execute(context); the resulting job_id string will then contain no period (see the sketch after this list).
2. Create a subclass of BigQueryInsertJobOperator that dynamically generates a unique job_id before it calls super().execute(context).
3. Schedule your DAG to run at the top of the minute (or at least the top of the second!), so the execution_date has no sub-second component.
4. Upgrade to Airflow v2.0, where this bug is fixed by stripping out the period (https://github.com/apache/airflow/commit/47b05a87f004dc273a4757ba49f03808a86f77e7#diff-529929b4ca60ce73b8da0f45d8a5c43c2d4e391b913fe78b39892899f812951e).
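For example, a minimal sketch of option 1 (the subclass name is illustrative, and whether mutating the context this way is appropriate should be verified against your Airflow/provider version):

from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator


class SecondPrecisionBigQueryInsertJobOperator(BigQueryInsertJobOperator):
    """Hypothetical subclass: drop sub-second precision from execution_date
    so the job_id built by the parent class contains no period."""

    def execute(self, context):
        context["execution_date"] = context["execution_date"].replace(microsecond=0)
        return super().execute(context)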

Why does Hangfire wait for 15s every few seconds when polling sql server for jobs?

I've inherited a system that uses Hangfire with SQL Server job storage. When a job is scheduled to run immediately, we usually notice it takes a few seconds before it's triggered.
Looking at SQL Profiler when running in my dev environment, the SQL run against the Hangfire DB looks like this:
exec sp_executesql N'delete top (1) JQ
output DELETED.Id, DELETED.JobId, DELETED.Queue
from [HangFire].JobQueue JQ with (readpast, updlock, rowlock, forceseek)
where Queue in (@queues1) and (FetchedAt is null or FetchedAt < DATEADD(second, @timeout, GETUTCDATE()))',N'@queues1 nvarchar(4000),@timeout float',@queues1=N'MYQUEUENAME_master',@timeout=-1800
-- Exactly the same SQL as above is executed about 6 times/second for about 3-4 seconds,
-- then nothing for about 2 seconds, then:
exec sp_getapplock @Resource=N'HangFire:recurring-jobs:lock',@DbPrincipal=N'public',@LockMode=N'Exclusive',@LockOwner=N'Session',@LockTimeout=5000
exec sp_getapplock @Resource=N'HangFire:locks:schedulepoller',@DbPrincipal=N'public',@LockMode=N'Exclusive',@LockOwner=N'Session',@LockTimeout=5000
exec sp_executesql N'select top (@count) Value from [HangFire].[Set] with (readcommittedlock, forceseek) where [Key] = @key and Score between @from and @to order by Score',N'@count int,@key nvarchar(4000),@from float,@to float',@count=1000,@key=N'recurring-jobs',@from=0,@to=1596053348
exec sp_executesql N'select top (@count) Value from [HangFire].[Set] with (readcommittedlock, forceseek) where [Key] = @key and Score between @from and @to order by Score',N'@count int,@key nvarchar(4000),@from float,@to float',@count=1000,@key=N'schedule',@from=0,@to=1596053348
exec sp_releaseapplock @Resource=N'HangFire:recurring-jobs:lock',@LockOwner=N'Session'
exec sp_releaseapplock @Resource=N'HangFire:locks:schedulepoller',@LockOwner=N'Session'
-- Then nothing is executed for about 8-10 seconds, then:
exec sp_executesql N'update [HangFire].Server set LastHeartbeat = @now where Id = @id',N'@now datetime,@id nvarchar(4000)',@now='2020-07-29 20:09:19.097',@id=N'ps12345:19764:fe362d1a-5ee4-4d97-b70d-134fdfab2b87'
-- Then about 500ms-2s later I get
exec sp_executesql N'delete top (1) JQ ... -- i.e. same as the first query
The update LastHeartbeat query only appears every other cycle (from a brief inspection, so that may not be exactly right).
It looks like there are at least 3 threads running the DELETE query against JQ, since I can see several RPC:Starting events before the RPC:Completed, suggesting they are executed in parallel rather than sequentially.
I don't know if that's normal, but it seems odd since I thought we had just one 'consumer' of the jobs.
I only have one queue in my dev environment, although in live we'd have 20-50, I'd guess.
Any suggestions on where I should look for the configuration that’s causing:
a) the 8-10s pause between checking for jobs
b) the number of threads that are checking for jobs - it seems like I have too many
After writing this I realised we were using an old version, so I upgraded from 1.5.x to 1.7.12, upgraded the database, and changed the startup config to this:
app.UseHangfireDashboard();

GlobalConfiguration.Configuration
    .UseSqlServerStorage(connstring, new SqlServerStorageOptions
    {
        CommandBatchMaxTimeout = TimeSpan.FromMinutes(5),
        QueuePollInterval = TimeSpan.Zero,
        SlidingInvisibilityTimeout = TimeSpan.FromMinutes(5),
        UseRecommendedIsolationLevel = true,
        PrepareSchemaIfNecessary = true, // Default value: true
        EnableHeavyMigrations = true     // Default value: false
    })
    .UseAutofacActivator(_container);

JobActivator.Current = new AutofacJobActivator(_container);
but if anything the problem is now worse, or at least the same but faster: about 20 calls to delete top (1) JQ... happen within about 1s, then the other queries, then a 15s wait, and then it starts all over again.
To be clear, the main problem is that if any job is added during that 15s delay, it takes the remainder of that 15s before my job is executed. A secondary problem is that it hits SQL Server more than needed: 20 times in a second is a bit much, at least for my needs.
(Cross-posted to hangfire forums)
If you don't set QueuePollInterval, Hangfire with SQL Server storage defaults to polling every 15s. So the first thing to do if you have this problem is to set QueuePollInterval to something smaller, e.g. 1s.
But in my case, even when I set that, it wasn't having any effect. The reason was that I was calling app.UseHangfireServer() before calling GlobalConfiguration.Configuration.UseSqlServerStorage() with the SqlServerStorageOptions.
When you call app.UseHangfireServer(), it uses the current value of JobStorage.Current. My code had set that:
var storage = new SqlServerStorage(connstring);
JobStorage.Current = storage;
then later called
app.UseHangfireServer()
then later called
GlobalConfiguration.Configuration
    .UseSqlServerStorage(connstring, new SqlServerStorageOptions
    {
        CommandBatchMaxTimeout = TimeSpan.FromMinutes(5),
        QueuePollInterval = TimeSpan.Zero,
        SlidingInvisibilityTimeout = TimeSpan.FromMinutes(5),
        UseRecommendedIsolationLevel = true,
        PrepareSchemaIfNecessary = true,
        EnableHeavyMigrations = true
    })
Reordering the calls so that UseSqlServerStorage(connstring, new SqlServerStorageOptions { ... }) runs before app.UseHangfireServer() means the SqlServerStorageOptions take effect.
I would suggest checking the Hangfire BackgroundJobServerOptions to see what polling interval you have set there. This defines how long the Hangfire server waits before checking whether there are any jobs in the queue to execute.
From the Hangfire documentation:
Hangfire Server periodically checks the schedule to enqueue scheduled jobs to their queues, allowing workers to execute them. By default, the check interval is equal to 15 seconds, but you can change it by setting the SchedulePollingInterval property on the options you pass to the BackgroundJobServer constructor:
var options = new BackgroundJobServerOptions
{
    SchedulePollingInterval = TimeSpan.FromMinutes(1)
};

var server = new BackgroundJobServer(options);

The conn_id isn't defined

I'm learning Airflow and I'm trying to understand how connections work.
I have a first DAG with the following code:
c = Connection(
    conn_id='aws_credentials',
    conn_type='Amazon Web Services',
    login='xxxxxxxx',
    password='xxxxxxxxx'
)

def list_keys():
    hook = S3Hook(aws_conn_id=c.conn_id)
    logging.info(f"Listing Keys from {bucket}/{prefix}")
    keys = hook.list_keys(bucket, prefix=prefix)
    for key in keys:
        logging.info(f"- s3://{bucket}/{key}")
In this case it works fine; the connection is passed to the S3Hook correctly.
Then I have a second DAG:
redshift_connection = Connection(
    conn_id='redshift',
    conn_type='postgres',
    login='duser',
    password='xxxxxxxxxx',
    host='xxxxxxxx.us-west-2.redshift.amazonaws.com',
    port=5439,
    schema='db'
)

aws_connection = Connection(
    conn_id='aws_credentials',
    conn_type='Amazon Web Services',
    login='xxxxxxxxx',
    password='xxxxxxxx'
)

def load_data_to_redshift(*args, **kwargs):
    aws_hook = AwsHook(aws_connection.conn_id)
    credentials = aws_hook.get_credentials()
    redshift_hook = PostgresHook(redshift_connection.conn_id)
    sql_stmnt = sql_statements.COPY_STATIONS_SQL.format(aws_connection.login, aws_connection.password)
    redshift_hook.run(sql_stmnt)

dag = DAG(
    's3_to_Redshift',
    start_date=datetime.datetime.now()
)

create_table = PostgresOperator(
    task_id='create_table',
    postgres_conn_id=redshift_connection.conn_id,
    sql=sql_statements.CREATE_STATIONS_TABLE_SQL,
    dag=dag
)
This DAG returns the following error: The conn_id `redshift` isn't defined.
Why is that? What is the difference between my first and second DAG? Why does the connection seem to work in the first example and not in the second?
Thanks.
Connections are usually created using the UI or CLI as described here and stored by Airflow in the database backend. The operators and the respective hooks then take a connection ID as an argument and use it to retrieve the usernames, passwords, etc. for those connections.
In your case, I suspect you created a connection with the ID aws_credentials using the UI or CLI. So, when you pass its ID to S3Hook it successfully retrieves the credentials (from Airflow's database backend, not from the Connection object that you created).
But, you did not create a connection with the ID redshift, therefore, AwsHook complains that it is not defined. You have to create the connection as described in the documentation first.
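For illustration, a sketch (not code from the question) of what the hook does under the hood: it resolves the ID via the metadata database, never via a Connection object instantiated in the DAG file.

from airflow.hooks.base import BaseHook  # airflow.hooks.base_hook.BaseHook on Airflow 1.10

# Raises "The conn_id `redshift` isn't defined" unless the connection was
# created beforehand via the UI or CLI, as described above.
conn = BaseHook.get_connection("redshift")
print(conn.host, conn.port, conn.login)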
Note: the reason for not defining connections in DAG code is that DAG code is usually stored in a version control system (e.g., Git), and storing credentials there would be a security risk.

How can I log sql execution results in airflow?

I use Airflow Python operators to execute SQL queries against a Redshift/Postgres database. In order to debug, I'd like the DAG to return the results of the SQL execution, similar to what you would see when executing locally in a console.
I'm using psycopg2 to create a connection/cursor and execute the SQL. Having this logged would be extremely helpful to confirm the parsed parameterized SQL and to confirm that data was actually inserted (I have painfully experienced issues where differences in environments caused unexpected behavior).
I do not have deep knowledge of Airflow or the low-level workings of the Python DB-API, but the psycopg2 documentation does seem to refer to some methods and connection configurations that may allow this.
I find it very perplexing that this is difficult to do, as I'd imagine it would be a primary use case of running ETLs on this platform. I've heard suggestions to simply create additional tasks that query the table before and after, but this seems clunky and ineffective.
Could anyone please explain how this may be possible, and if not, explain why? Alternate methods of achieving similar results are welcome. Thanks!
So far I have tried the connection.status_message() method, but it only seems to return the first line of the SQL and not the results. I have also attempted to create a logging cursor, which produces the SQL but not the console results:
import logging
import sys

import psycopg2 as pg
from psycopg2.extras import LoggingConnection

conn = pg.connect(
    connection_factory=LoggingConnection,
    ...  # other connection parameters omitted
)
conn.autocommit = True

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)
logger.addHandler(logging.StreamHandler(sys.stdout))
conn.initialize(logger)

cur = conn.cursor()

sql = """
INSERT INTO mytable (
    SELECT *
    FROM other_table
);
"""

cur.execute(sql)
I'd like the logger to return something like:
sql> INSERT INTO mytable (
SELECT ...
[2019-07-25 23:00:54] 912 rows affected in 4 s 442 ms
Let's assume you are writing an operator that uses the Postgres hook to do something in SQL.
Anything printed inside an operator is logged.
So, if you want to log the statement, just print the statement in your operator:
print(sql)
If you want to log the result, fetch the result and print it.
E.g.
result = cur.fetchall()
for row in result:
    print(row)
Alternatively you can use self.log.info in place of print, where self refers to the operator instance.
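For instance, a minimal sketch along these lines (the hook import path assumes the Airflow 2 Postgres provider; the connection ID and helper name are illustrative):

from airflow.providers.postgres.hooks.postgres import PostgresHook


def run_and_log(sql, **context):
    hook = PostgresHook(postgres_conn_id="postgres_default")
    conn = hook.get_conn()
    cur = conn.cursor()
    cur.execute(sql)
    print(sql)                # the rendered statement ends up in the task log
    print(cur.statusmessage)  # e.g. "INSERT 0 912"
    if cur.description:       # only statements that return rows can be fetched
        for row in cur.fetchall():
            print(row)
    conn.commit()
    cur.close()
    conn.close()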
OK, so after some trial and error I've found a method that works for my setup and objective. To recap, my goal is to run ETLs via Python scripts, orchestrated in Airflow. Referring to the documentation for statusmessage:
Read-only attribute containing the message returned by the last command:
The key is to manage logging in the context of the transactions executed on the server. To do this, I had to explicitly set con.autocommit = False and wrap the SQL blocks with BEGIN TRANSACTION; and END TRANSACTION;. If you read cur.statusmessage directly after a statement that deletes or inserts, you will get a response such as 'INSERT 0 92380'.
This still isn't as verbose as I would prefer, but it is a much better than nothing, and is very useful for troubleshooting ETL issues within Airflow logs.
Side notes:
- When autocommit is set to False, you must explicitly commit transactions.
- It may not be necessary to state transaction begin/end in your SQL. It may depend on your DB version.
import logging

import psycopg2 as psy

con = psy.connect(...)  # connection parameters omitted
con.autocommit = False
cur = con.cursor()
try:
    cur.execute(some_sql)  # some_sql: your statement(s), e.g. wrapped in BEGIN TRANSACTION; ... END TRANSACTION;
    logging.info(f"Cursor statusmessage: {cur.statusmessage}")
    con.commit()
except Exception:
    con.rollback()
finally:
    con.close()
There is some buried functionality within psycopg2 that I'm sure could be used for this, but the documentation is pretty thin and there are no clear examples. If anyone has suggestions on how to use things such as log objects, or how to retrieve the connection PID to pull back additional information, I'd welcome them.

Use DB to generate airflow tasks dynamically

I want to run an Airflow DAG like so:
I have 2 Airflow workers, W1 and W2.
On W1 I have scheduled a single task (W1-1), but on W2 I want to create X number of tasks (W2-1, W2-2 ... W2-X).
The number X and the bash command for each task will be derived from a DB call.
All tasks for worker W2 should run in parallel after W1 completes.
This is my code:
dag = DAG('deploy_single', catchup=False, default_args=default_args, schedule_interval='16 15 * * *')

t1 = BashOperator(
    task_id='dummy_task',
    bash_command='echo hi > /tmp/hi',
    queue='W1_queue',
    dag=dag)

get_all_engines = "select full_command, queue_name from internal_airflow_hosts where logical_group = 'live_engines';"

db_creds = json.loads(open('/opt/airflow/db_creds.json').read())
conn_dict = db_creds["airflowdb_local"]
connection = psycopg2.connect(**conn_dict)
cursor = connection.cursor()
cursor.execute(get_all_engines)
records = cursor.fetchall()

i = 1
for record in records:
    t = BashOperator(
        task_id='script_test_' + str(i),
        bash_command="{full_command} ".format(full_command=str(record[0])),
        queue=str(record[1]),
        dag=dag)
    t.set_upstream(t1)
    i += 1

cursor.close()
connection.close()
However, when I run this, the task on W1 completes successfully but all tasks on W2 fail. In the Airflow UI I can see that it resolves the correct number of tasks (10 in this case), but each of these 10 fails.
Looking at the logs, I saw that on W2 (which is on a different machine), Airflow could not find the db_creds.json file.
I do not want to provide the DB creds file to W2.
My question is: how can an Airflow task be created dynamically in this case?
Basically, I want to run a DB query on the Airflow server and assign tasks to one or more workers based on the results of that query. The DB will contain updated info about which engines are active, etc., and I want the DAG to reflect this. From the logs, it looks like each worker runs the DB query. Providing DB access to each worker is not an option.
Thank you @viraj-parekh and @cwurtz.
After much trial and error, I found the correct way to use Airflow Variables for this case.
Step 1) We create another script called gen_var.py and place it in the DAG folder. This way, the scheduler will pick it up and generate the variables. If the code for generating the variables were inside the deploy_single DAG, we would run into the same dependency issue, because the workers would try to process that DAG too.
"""
Code that goes along with the Airflow tutorial located at:
https://github.com/airbnb/airflow/blob/master/airflow/example_dags/tutorial.py
"""
import json

import psycopg2
from airflow.models import Variable
from psycopg2.extensions import AsIs

get_all_engines = "select full_command, queue_name from internal_airflow_hosts where logical_group = 'live_engines';"

db_creds = json.loads(open('/opt/airflow/db_creds.json').read())
conn_dict = db_creds["airflowdb_local"]
connection = psycopg2.connect(**conn_dict)
cursor = connection.cursor()
cursor.execute(get_all_engines)
records = cursor.fetchall()

hosts = {}
i = 1
for record in records:
    comm_dict = {}
    comm_dict['full_command'] = str(record[0])
    comm_dict['queue_name'] = str(record[1])
    hosts[i] = comm_dict
    i += 1

cursor.close()
connection.close()

Variable.set("hosts", hosts, serialize_json=True)
Note the call with serialize_json=True. Airflow will try to store the variable as a string; if you want it stored as a dict, use serialize_json=True (Airflow will still store it as a string internally via json.dumps).
Step 2) Simplify the DAG and read this "hosts" variable (now deserialized to get the dict back) like so:
hoztz = Variable.get("hosts", deserialize_json=True)
for key in hoztz:
    host = hoztz.get(key)
    t = BashOperator(
        task_id='script_test_' + str(key),
        bash_command="{full_command} ".format(full_command=str(host.get('full_command'))),
        queue=str(host.get('queue_name')),
        dag=dag)
    t.set_upstream(t1)
Hope it helps someone else.
One way to do this would be to store the information in an Airflow Variable.
You can fetch the information needed to dynamically generate the DAG (and necessary configs) in a Variable and have W2 access it from there.
Variables are an Airflow model that can be used to store static information (information that does not have an associated timestamp) that all tasks can access.
