What precisely is the difference between the "airflow initdb" command and the "airflow resetdb" command?
Is it really necessary to have 2 different commands?
When is it appropriate to use one vs the other?
The doc says ...
airflow initdb: Initialize the metadata database
airflow resetdb: Burn down and rebuild the metadata database
This doesn't tell me much.
My best guess is that
airflow initdb is to be used only the first time that the database is created from the airflow.cfg
airflow resetdb is to be used if any changes to that configuration are required.
When I run them, neither changes the timestamp on the SQLite database, but resetdb seems to be much noisier.
airflow initdb:
(.sandbox) [airflow@localhost airflow]$ airflow initdb
[2020-01-01 21:49:21,603] {settings.py:252} INFO - settings.configure_orm(): Using pool settings. pool_size=5, max_overflow=10, pool_recycle=1800, pid=24917
DB: postgresql+psycopg2://airflow@localhost:5432/airflow_mdb
[2020-01-01 21:49:22,257] {db.py:368} INFO - Creating tables
INFO [alembic.runtime.migration] Context impl PostgresqlImpl.
INFO [alembic.runtime.migration] Will assume transactional DDL.
Done.
airflow resetdb:
(.sandbox) [airflow@localhost airflow]$ airflow resetdb
[2020-01-01 21:49:46,579] {settings.py:252} INFO - settings.configure_orm(): Using pool settings. pool_size=5, max_overflow=10, pool_recycle=1800, pid=25045
DB: postgresql+psycopg2://airflow@localhost:5432/airflow_mdb
This will drop existing tables if they exist. Proceed? (y/n)y
[2020-01-01 21:49:49,984] {db.py:389} INFO - Dropping tables that exist
[2020-01-01 21:49:50,062] {migration.py:154} INFO - Context impl PostgresqlImpl.
[2020-01-01 21:49:50,063] {migration.py:161} INFO - Will assume transactional DDL.
[2020-01-01 21:49:50,070] {db.py:368} INFO - Creating tables
INFO [alembic.runtime.migration] Context impl PostgresqlImpl.
INFO [alembic.runtime.migration] Will assume transactional DDL.
INFO [alembic.runtime.migration] Running upgrade -> e3a246e0dc1, current schema
INFO [alembic.runtime.migration] Running upgrade e3a246e0dc1 -> 1507a7289a2f, create is_encrypted
INFO [alembic.runtime.migration] Running upgrade 1507a7289a2f -> 13eb55f81627, maintain history for compatibility with earlier migrations
INFO [alembic.runtime.migration] Running upgrade 13eb55f81627 -> 338e90f54d61, More logging into task_instance
INFO [alembic.runtime.migration] Running upgrade 338e90f54d61 -> 52d714495f0, job_id indices
INFO [alembic.runtime.migration] Running upgrade 52d714495f0 -> 502898887f84, Adding extra to Log
INFO [alembic.runtime.migration] Running upgrade 502898887f84 -> 1b38cef5b76e, add dagrun
INFO [alembic.runtime.migration] Running upgrade 1b38cef5b76e -> 2e541a1dcfed, task_duration
INFO [alembic.runtime.migration] Running upgrade 2e541a1dcfed -> 40e67319e3a9, dagrun_config
INFO [alembic.runtime.migration] Running upgrade 40e67319e3a9 -> 561833c1c74b, add password column to user
INFO [alembic.runtime.migration] Running upgrade 561833c1c74b -> 4446e08588, dagrun start end
INFO [alembic.runtime.migration] Running upgrade 4446e08588 -> bbc73705a13e, Add notification_sent column to sla_miss
INFO [alembic.runtime.migration] Running upgrade bbc73705a13e -> bba5a7cfc896, Add a column to track the encryption state of the 'Extra' field in connection
INFO [alembic.runtime.migration] Running upgrade bba5a7cfc896 -> 1968acfc09e3, add is_encrypted column to variable table
INFO [alembic.runtime.migration] Running upgrade 1968acfc09e3 -> 2e82aab8ef20, rename user table
INFO [alembic.runtime.migration] Running upgrade 2e82aab8ef20 -> 211e584da130, add TI state index
INFO [alembic.runtime.migration] Running upgrade 211e584da130 -> 64de9cddf6c9, add task fails journal table
INFO [alembic.runtime.migration] Running upgrade 64de9cddf6c9 -> f2ca10b85618, add dag_stats table
INFO [alembic.runtime.migration] Running upgrade f2ca10b85618 -> 4addfa1236f1, Add fractional seconds to mysql tables
INFO [alembic.runtime.migration] Running upgrade 4addfa1236f1 -> 8504051e801b, xcom dag task indices
INFO [alembic.runtime.migration] Running upgrade 8504051e801b -> 5e7d17757c7a, add pid field to TaskInstance
INFO [alembic.runtime.migration] Running upgrade 5e7d17757c7a -> 127d2bf2dfa7, Add dag_id/state index on dag_run table
INFO [alembic.runtime.migration] Running upgrade 127d2bf2dfa7 -> cc1e65623dc7, add max tries column to task instance
INFO [alembic.runtime.migration] Running upgrade cc1e65623dc7 -> bdaa763e6c56, Make xcom value column a large binary
INFO [alembic.runtime.migration] Running upgrade bdaa763e6c56 -> 947454bf1dff, add ti job_id index
INFO [alembic.runtime.migration] Running upgrade 947454bf1dff -> d2ae31099d61, Increase text size for MySQL (not relevant for other DBs' text types)
INFO [alembic.runtime.migration] Running upgrade d2ae31099d61 -> 0e2a74e0fc9f, Add time zone awareness
INFO [alembic.runtime.migration] Running upgrade d2ae31099d61 -> 33ae817a1ff4, kubernetes_resource_checkpointing
INFO [alembic.runtime.migration] Running upgrade 33ae817a1ff4 -> 27c6a30d7c24, kubernetes_resource_checkpointing
INFO [alembic.runtime.migration] Running upgrade 27c6a30d7c24 -> 86770d1215c0, add kubernetes scheduler uniqueness
INFO [alembic.runtime.migration] Running upgrade 86770d1215c0, 0e2a74e0fc9f -> 05f30312d566, merge heads
INFO [alembic.runtime.migration] Running upgrade 05f30312d566 -> f23433877c24, fix mysql not null constraint
INFO [alembic.runtime.migration] Running upgrade f23433877c24 -> 856955da8476, fix sqlite foreign key
INFO [alembic.runtime.migration] Running upgrade 856955da8476 -> 9635ae0956e7, index-faskfail
INFO [alembic.runtime.migration] Running upgrade 9635ae0956e7 -> dd25f486b8ea, add idx_log_dag
INFO [alembic.runtime.migration] Running upgrade dd25f486b8ea -> bf00311e1990, add index to taskinstance
INFO [alembic.runtime.migration] Running upgrade 9635ae0956e7 -> 0a2a5b66e19d, add task_reschedule table
INFO [alembic.runtime.migration] Running upgrade 0a2a5b66e19d, bf00311e1990 -> 03bc53e68815, merge_heads_2
INFO [alembic.runtime.migration] Running upgrade 03bc53e68815 -> 41f5f12752f8, add superuser field
INFO [alembic.runtime.migration] Running upgrade 41f5f12752f8 -> c8ffec048a3b, add fields to dag
INFO [alembic.runtime.migration] Running upgrade c8ffec048a3b -> dd4ecb8fbee3, Add schedule interval to dag
INFO [alembic.runtime.migration] Running upgrade dd4ecb8fbee3 -> 939bb1e647c8, task reschedule fk on cascade delete
INFO [alembic.runtime.migration] Running upgrade c8ffec048a3b -> a56c9515abdc, Remove dag_stat table
INFO [alembic.runtime.migration] Running upgrade 939bb1e647c8 -> 6e96a59344a4, Make TaskInstance.pool not nullable
INFO [alembic.runtime.migration] Running upgrade 6e96a59344a4 -> 74effc47d867, change datetime to datetime2(6) on MSSQL tables
INFO [alembic.runtime.migration] Running upgrade 939bb1e647c8 -> 004c1210f153, increase queue name size limit
(.sandbox) [airflow@localhost airflow]$
Of course, you might move the database from, say, SQLite to Postgres; it is unclear which command is appropriate in that circumstance.
It is also unclear how the webserver and scheduler know where to look for the configuration.
Perhaps they look in airflow.cfg first to find out where the database is, and then look in the database? This seems redundant.
db reset will delete all entries from the metadata database. This includes all DAG runs, Variables, and Connections.
db init is only run once, when Airflow is installed.
Generally we aren't too worried about the DAG runs, but the Variables and Connections can be annoying to recreate, as they often contain secret and sensitive data that may not be duplicated elsewhere as a matter of security best practice.
db init is also idempotent, so it can be run as often as you choose without needing to worry about the database changing.
The above explanation is right, but the commands shown are the old 1.x names.
A working command for reset is: airflow db reset
A working command for init (setting up the db) is: airflow db init
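For reference, a minimal session showing both (a sketch against Airflow 2.x; the -y/--yes flag on reset skips the confirmation prompt, though exact flag spellings can vary between versions):

airflow db init       # first run creates all tables; later runs only apply pending migrations
airflow db init       # safe to repeat: idempotent, nothing is dropped
airflow db reset -y   # drops every table and rebuilds the schema from scratch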
Related
I have created the DAG with the following configuration:
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

job_type = 'daily'
SOURCE_PATH = '/home/ubuntu/daily_data'

def get_to_know_details(job_type, SOURCE_PATH):
    print("************************", job_type, SOURCE_PATH)

with DAG(
    dag_id="transformer_daily_v1",
    is_paused_upon_creation=False,
    default_args=default_args,
    description="transformer to insert data",
    start_date=datetime(2022, 9, 20),
    schedule_interval='31 12 * * *',
    catchup=False,
) as dag:
    task1 = PythonOperator(
        task_id="dag_task_1",
        python_callable=get_to_know_details(job_type, SOURCE_PATH),
    )
Each time I start Airflow using the command
airflow standalone
the DAG function is executed automatically, without my triggering it, as seen in the logs:
standalone | Starting Airflow Standalone
standalone | Checking database is initialized
INFO [alembic.runtime.migration] Context impl SQLiteImpl.
INFO [alembic.runtime.migration] Will assume non-transactional DDL.
WARNI [airflow.models.crypto] empty cryptography key - values will not be stored encrypted.
************************ daily /home/ubuntu/daily_data
WARNI [unusual_prefix_8fc9338bb4cf0c5518fed57dffa1a11abec44c36_example_kubernetes_executor] The example_kubernetes_executor example DAG requires the kubernetes provider. Please install it with: pip install apache-airflow[cncf.kubernetes]
airflow version - 2.2.5
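For comparison, a minimal sketch of the usual PythonOperator pattern (names reused from the snippet above, not taken from a working answer): the function object is passed uncalled and its arguments go in op_args, so the callable runs only when the task executes rather than at DAG parse time.

def get_to_know_details(job_type, SOURCE_PATH):
    print("************************", job_type, SOURCE_PATH)

task1 = PythonOperator(
    task_id="dag_task_1",
    python_callable=get_to_know_details,   # note: no parentheses here
    op_args=[job_type, SOURCE_PATH],       # evaluated at execution time
)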
I am referring to this doc and this article on linking a Postgres database to Airflow.
Particularly, I added this line to the file airflow.cfg:
sql_alchemy_conn = postgresql+psycopg2://airflowadmin:airflowadmin@localhost/airflowdb
where airflowadmin is both the username and the password for the Postgres user, and airflowdb is a Postgres database on which airflowadmin has all privileges.
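To rule out the connection string itself, it can be tested directly with psql, which accepts the same URI form (a sketch; adjust host and port if they differ):

psql "postgresql://airflowadmin:airflowadmin@localhost/airflowdb" -c '\conninfo'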
Now, however, when I initialize the database with airflow db init, I still see SQLite as the linked database. Full output:
DB: sqlite:////home/userxxxx/airflow/airflow.db
[2021-09-07 12:43:53,827] {db.py:702} INFO - Creating tables
INFO [alembic.runtime.migration] Context impl SQLiteImpl.
INFO [alembic.runtime.migration] Will assume non-transactional DDL.
WARNI [airflow.models.crypto] empty cryptography key - values will not be stored encrypted.
WARNI [unusual_prefix_xxxxxxxxxxxxxxxxxxxxxxxxx_example_kubernetes_executor_config] Could not import DAGs in example_kubernetes_executor_config.py: No module named 'kubernetes'
WARNI [unusual_prefix_xxxxxxxxxxxxxxxxxxxxxxxxxxxxx_example_kubernetes_executor_config] Install kubernetes dependencies with: pip install apache-airflow['cncf.kubernetes']
Initialization done
What am I missing?
Make sure the airflow.cfg file you are changing is the same one that is actually being loaded by Airflow. From the CLI, run:
airflow info
Look under the Paths info section and compare it with the path of the folder containing the airflow.cfg file that you are modifying.
airflow info:
Apache Airflow
version | 2.1.2
executor | SequentialExecutor
task_logging_handler | airflow.utils.log.file_task_handler.FileTaskHandler
sql_alchemy_conn | sqlite:////home/vagrant/airflow/airflow.db
dags_folder | /home/vagrant/airflow/dags
plugins_folder | /home/vagrant/airflow/plugins
base_log_folder | /home/vagrant/airflow/logs
remote_base_log_folder |
System info
OS
...
...
Paths info
airflow_home | /home/vagrant/airflow
...
When not defined during the local installation process, the default value of airflow_home is AIRFLOW_HOME=~/airflow, so I guess that may be the cause of your problem.
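A quick way to confirm (a sketch; the export path below is hypothetical and should point at the directory that actually contains your edited airflow.cfg):

export AIRFLOW_HOME=/path/to/your/airflow
airflow info | grep sql_alchemy_conn   # should now print the postgresql+psycopg2 URI
airflow db init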
I'm trying to run Airflow with an Azure SQL database as the backend, using an mssql+pyodbc connection string (all relevant drivers have been installed).
While Airflow is able to connect to the DB and create tables, i.e., airflow initdb runs successfully, I'm facing issues while running airflow scheduler; as a result, the triggered tasks are always in the "running" state.
This is the error I get while running airflow scheduler:
sqlalchemy.exc.ProgrammingError: (pyodbc.ProgrammingError) ('42000', "[42000] [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Incorrect syntax near '1'. (102) (SQLExecDirectW)")
[SQL: SELECT dag.dag_id AS dag_dag_id
FROM dag
WHERE dag.is_paused IS 1 AND dag.dag_id IN (?)]
[parameters: ('example_http_operator',)]
(Background on this error at: http://sqlalche.me/e/13/f405)
I'm using apache-airflow==1.10.11.
If you were able to run Airflow + Azure SQL DB with any configuration, please feel free to jump in.
I found a document and a talk about the configuration for running Airflow + Azure SQL DB. Maybe it's helpful for you.
Ref: Setting up Airflow on Azure & connecting to MS SQL Server
This post also give some configurations about it: Apache Airflow - Connection issue to MS SQL Server using pymssql + SQLAlchemy
For MSSQL as the backend DB, there is a workaround in Airflow#10713. I am using apache-airflow==1.10.15 and it solved the same error as yours.
The suggested command is attached below, but I applied the edit with vi instead of running the sed command.
RUN sed -i 's/import copy/import copy,sqlalchemy/g' /usr/local/lib/python3.6/site-packages/airflow/models/dag.py \
    && sed -i 's/DagModel.is_paused.is_(True)/DagModel.is_paused == sqlalchemy.sql.expression.true()/g' /usr/local/lib/python3.6/site-packages/airflow/models/dag.py
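In effect, the patch rewrites a single query predicate in airflow/models/dag.py. A paraphrase of the before/after (not extra code to run):

# before: renders as "WHERE dag.is_paused IS 1" on MSSQL, which is invalid T-SQL
DagModel.is_paused.is_(True)

# after: renders as "WHERE dag.is_paused = 1", which MSSQL accepts
DagModel.is_paused == sqlalchemy.sql.expression.true()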
This issue is exactly the same as the one below, but the answer did not work for me.
Airflow - Initiation of DB stuck in SQL Server
I have been stuck on this issue for the past two days; any help would be greatly appreciated!
More details:
I am using Docker to run Airflow. While initializing the Airflow DB, it gets stuck on the task below for some time.
DB: mssql+pymssql://sqlserver:***@10.10.18.10:1433/Test_airflow
[2020-03-13 09:51:10,556] {db.py:368} INFO - Creating tables
INFO [alembic.runtime.migration] Context impl MSSQLImpl.
INFO [alembic.runtime.migration] Will assume transactional DDL.
INFO [alembic.runtime.migration] Running upgrade -> e3a246e0dc1, current schema
INFO [alembic.runtime.migration] Running upgrade e3a246e0dc1 -> 1507a7289a2f, create is_encrypted
INFO [alembic.runtime.migration] Running upgrade 1507a7289a2f -> 13eb55f81627, maintain history for compatibility with earlier migrations
INFO [alembic.runtime.migration] Running upgrade 13eb55f81627 -> 338e90f54d61, More logging into task_instance
INFO [alembic.runtime.migration] Running upgrade 338e90f54d61 -> 52d714495f0, job_id indices
INFO [alembic.runtime.migration] Running upgrade 52d714495f0 -> 502898887f84, Adding extra to Log
INFO [alembic.runtime.migration] Running upgrade 502898887f84 -> 1b38cef5b76e, add dagrun
INFO [alembic.runtime.migration] Running upgrade 1b38cef5b76e -> 2e541a1dcfed, task_duration
INFO [alembic.runtime.migration] Running upgrade 2e541a1dcfed -> 40e67319e3a9, dagrun_config
INFO [alembic.runtime.migration] Running upgrade 40e67319e3a9 -> 561833c1c74b, add password column to user
INFO [alembic.runtime.migration] Running upgrade 561833c1c74b -> 4446e08588, dagrun start end
INFO [alembic.runtime.migration] Running upgrade 4446e08588 -> bbc73705a13e, Add notification_sent column to sla_miss
INFO [alembic.runtime.migration] Running upgrade bbc73705a13e -> bba5a7cfc896, Add a column to track the encryption state of the 'Extra' field in connection
INFO [alembic.runtime.migration] Running upgrade bba5a7cfc896 -> 1968acfc09e3, add is_encrypted column to variable table
INFO [alembic.runtime.migration] Running upgrade 1968acfc09e3 -> 2e82aab8ef20, rename user table
INFO [alembic.runtime.migration] Running upgrade 2e82aab8ef20 -> 211e584da130, add TI state index
INFO [alembic.runtime.migration] Running upgrade 211e584da130 -> 64de9cddf6c9, add task fails journal table
INFO [alembic.runtime.migration] Running upgrade 64de9cddf6c9 -> f2ca10b85618, add dag_stats table
INFO [alembic.runtime.migration] Running upgrade f2ca10b85618 -> 4addfa1236f1, Add fractional seconds to mysql tables
INFO [alembic.runtime.migration] Running upgrade 4addfa1236f1 -> 8504051e801b, xcom dag task indices
INFO [alembic.runtime.migration] Running upgrade 8504051e801b -> 5e7d17757c7a, add pid field to TaskInstance
INFO [alembic.runtime.migration] Running upgrade 5e7d17757c7a -> 127d2bf2dfa7, Add dag_id/state index on dag_run table
INFO [alembic.runtime.migration] Running upgrade 127d2bf2dfa7 -> cc1e65623dc7, add max tries column to task instance
When I looked into the MSSQL database, I found the process creating two connections, one waiting and one blocked by the other:
exec sp_who;
spid   ecid   status      loginame    hostname    blk   dbname         cmd                request_id
55     0      sleeping    sqlserver   my_server         Test_airflow   AWAITING COMMAND   0
56     0      suspended   sqlserver   my_server   55    Test_airflow   EXECUTE            0
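For a more detailed look at the blocking chain, the standard SQL Server DMVs can be queried directly (a generic sketch, independent of Airflow):

SELECT session_id, blocking_session_id, wait_type, wait_time, command
FROM sys.dm_exec_requests
WHERE blocking_session_id <> 0;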
After 5 or 10 minutes, I get the logs below in Docker (the table and metadata creation task gets a timeout error).
ERROR [airflow.utils.timeout.timeout] Process timed out, PID: 7
malloc(): unsorted double linked list corrupted
/entrypoint.sh: line 69: 7 Aborted airflow initdb
/usr/local/lib/python3.6/site-packages/airflow/configuration.py:241: FutureWarning: The task_runner setting in [core] has the old default value of 'BashTaskRunner'. This value has been changed to 'StandardTaskRunner' in the running config, but please update your config before Apache Airflow 2.0.
FutureWarning
/usr/local/lib/python3.6/site-packages/airflow/configuration.py:631: DeprecationWarning: Specifying both AIRFLOW_HOME environment variable and airflow_home in the config file is deprecated. Please use only the AIRFLOW_HOME environment variable and remove the config file entry.
warnings.warn(msg, category=DeprecationWarning)
[2020-03-13 13:19:24,141] {settings.py:253} INFO - settings.configure_orm(): Using pool settings. pool_size=0, max_overflow=10, pool_recycle=1800, pid=15
Traceback (most recent call last):
File "src/pymssql.pyx", line 450, in pymssql.Cursor.execute
File "src/_mssql.pyx", line 1064, in _mssql.MSSQLConnection.execute_query
File "src/_mssql.pyx", line 1095, in _mssql.MSSQLConnection.execute_query
File "src/_mssql.pyx", line 1228, in _mssql.MSSQLConnection.format_and_run_query
File "src/_mssql.pyx", line 1639, in _mssql.check_cancel_and_raise
File "src/_mssql.pyx", line 1683, in _mssql.maybe_raise_MSSQLDatabaseException
_mssql.MSSQLDatabaseException: (208, b"Invalid object name 'users'.DB-Lib error message 20018, severity 16:\nGeneral SQL Server error: Check messages from the SQL Server\n")
Version details:
SQL Server --> Microsoft SQL Server 2014 (SP2) (KB3171021) - 12.0.5000.0
Airflow --> Tried 1.10.5 & latest 1.10.9
Python --> 3.6
I'm having trouble getting the LocalExecutor to work.
I created a Postgres database called airflow and granted all privileges to the airflow user. Finally, I updated my airflow.cfg file:
# The executor class that airflow should use. Choices include
# SequentialExecutor, LocalExecutor, CeleryExecutor, DaskExecutor, KubernetesExecutor
executor = LocalExecutor
# The SqlAlchemy connection string to the metadata database.
# SqlAlchemy supports many different database engine, more information
# their website
sql_alchemy_conn = postgresql+psycopg2://airflow:[MY_PASSWORD]@localhost:5432/airflow
Next I ran:
airflow initdb
airflow scheduler
airflow webserver
I thought it was working, but I noticed my DAGs were taking a long time to finish. Upon further inspection of my log files, I noticed that they say Airflow is using the SequentialExecutor.
INFO - Job 319: Subtask create_task_send_email [2020-01-07 12:00:16,997] {__init__.py:51} INFO - Using executor SequentialExecutor
Does anyone know what could be causing this?
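One way to check which executor the loaded configuration actually resolves to (a sketch using Airflow's config API, which is available on 1.10.x):

python -c "from airflow.configuration import conf; print(conf.get('core', 'executor'))"

If this prints SequentialExecutor, the scheduler and webserver are most likely reading a different airflow.cfg than the one edited; comparing AIRFLOW_HOME against the edited file's location, as discussed earlier, is a reasonable next step.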