I have a project requirement to back-up Airflow Metadata DB to some data warehouse (but not using an Airflow DAG). At the same time, the requirement mentions some connection called airflow_db.
I am quite new to Airflow, so I googled a bit on the topic. I am a bit confused about this part. Our Airflow Metadata DB is PostgreSQL (this is built from docker-compose, so I am tinkering on a local install), but when I look at Connections in Airflow Web UI, it says airflow_db is MySQL.
I initially assumed that they are the same, but by the looks of it, they aren't? Can someone explain the difference and what they are for?
Airflow creates airflow_db Conn Id with MySQL by default (see source code)
Default connections are not really useful in production system. It's just a long list of stuff that you are probably not going to use.
Airflow 1.1.10 introduced the ability not to create the default list by setting:
load_default_connections = False in airflow.cfg (See PR)
To give more background the connection list is where hooks find the information needed in order to connect to a service. It's not related to the backend database. Though the backend is db like any db and if you wish to allow hooks to interact with it you can define it in the list like any other connection (which is probably why you have this as option in the default).
Related
I have an airflow DAG which call a particular bash command using a variable. At the backend, we have Aurora DB. Do we know if there are any tables in the Aurora DB which stores information of the variables used in Airflow DAGs? I need to create a report out of it and hence, the ask to access the variables from backend.
I tried using the operational_insights schema but could not find any tables with the desired information.
If you are using an Airflow variable you should be able to query a list of them with the REST API no matter which backend you use.
curl "http://<your Airflow host>/api/v1/variables" --user "login:password"
This is preferred over querying the Airflow metadata database directly because if you accidentally modify or drop a table you can corrupt your Airflow.
With that caveat: the standard table where Airflow variables are stored is variable so after logging into the db SELECT * FROM variable; should return a list.
Again this is for Airflow Variables. From your question I am not entirely sure if you mean that or in general any variables that tasks use. In the latter case you might be looking for the rendered_fields parameter of the task instances, which can also be done using the API.
I have an existing database for which i was looking to create a new clustered environment. I tried the following steps:
Create a new database instance (OS & DB Server).
Take a backup / snapshot from existing database server for all the databases.
Import the snapshot to the new server.
Configure the cluster - referred to various sites but all giving same solution. Example reference site - https://vexxhost.com/resources/tutorials/how-to-configure-a-galera-cluster-with-mariadb-on-ubuntu-12-04/
Ran the command (sudo galera_new_cluster) on the primary server. (Primary server - no issue starting up). But when we tried starting the secondary server - it actually crashed for some reason.
Unfortunately at this point, dont have the logs stored / backed up with me where it failed. But it seemed like it tried to sync in with the primary server - had some failure with that.
As for additional part of the actions performed above. Both the server with same username / password - created a passwordless ssh connection between both the machines. Also, the method of syncing is set to rsync.
Am i missing something or doing it wrong? Is there a better way available on it?
I have recently created a very small Google Compute Engine instance, naively thinking it's one of those easily scalable things Google people keep raving about.
I used the quick deployment feature of Wordpress and it all installed itself nicely, so I started configuring and adding data etc.
However, I then found out that I can't scale an existing instance (i.e. it won't allow me to change the instance type to a bigger one. I don't get why not, but there you go.), so it looks like I need to find a way to migrate my Wordpress installation to a new instance.
Will I simply be able to create a new instance and point it at the persistent disk my small instance currently uses, et voila, Bob's your uncle?
Or do I need to manually get the files and MySql data off the first instance and re-import into an empty new instance?
What's the easiest way?
Any advise or helpful links would be appreciated.
Thanks.
P.S.: Btw, should I try to use the Google Cloud SQL store instead of a local MySql installation?
In order to upgrade your VM:
access the VM's settings in the Developers Console (your project -> Compute -> Compute Engine -> VM instances -> click on the VM's name)
Scroll down to the "Disks" section, and un-check "Delete boot disk when instance is deleted"
Delete the VM in question. Take note that the disk, named after the instance, will remain.
Create a new VM, selecting "Existing disk" under Boot disk - Boot source. In the next box down, select the disk from point 3 above, as well as a bigger machine type.
The resulting new instance will use the existing disk from the old one, with improved hardware / performance.
As for using Cloud SQL in lieu of a VM-installed database, it's perfectly feasible, and allows to adjust the Cloud SQL instance to match your actual use. A few consideration when setting up this kind of instance:
limit the IPs allowed to connect to your Cloud SQL instance to your frontend's IP, and perhaps the workstation's IP or subnet from which you maintain the database out of.
configure Cloud SQL to use SSL certificates.
Sammy's answer covers the important stuff I just wanted to clarify how your files are arranged on the two disks that are attached to your instance:
The data disk contains /var/www/ which is all of the wordpress files. It's mounted on the instance at /wordpress
The boot disk contains everything else, including the MySQL database that was created for the Wordpress installation.
I have a C application that inserts data directly to the database of my Meteor application. The app works fine (withoud delays) when I run it in development mode (with "meteor"). However, if I run the app as a node app (bundled) and with external MongoDB, there's an annoying delay in screen updates (5-10s).
I have noticed some previous discussions about this:
Meteor: server-side db insert delays
Using node ddp-client to insert into a meteor collection from Node
Questions:
Is there any other way than building a server-side API for doing the db inserts through Meteor?
Why the delay is only when using external MongoDB?
Is there a way in Meteor to shorten this database polling interval?
You need to enable oplog tailing. Without oplog tailing, when your C program makes a database write, the Meteor server doesn't realise anything has changed until it polls MongoDB again. With oplog tailing, it can pick up the changes much more quickly and efficiently. In development mode, oplog tailing is enabled automatically, but for production it needs some additional setup.
Your MongoDB must be set up as a replica set (a replica set of one node does work).
You have to pass in a mongo URL for the replica set's local database with the environment variable MONGO_OPLOG_URL.
For more information, see this article.
I have a SQLite database that is used by two processes. I am wondering, with the most recent version of SQLite, while one process (connection) starts a transaction to write to the database will the other process be able to read from the database simultaneously?
I collected information from various sources, mostly from sqlite.org, and put them together:
First, by default, multiple processes can have the same SQLite database open at the same time, and several read accesses can be satisfied in parallel.
In case of writing, a single write to the database locks the database for a short time, nothing, even reading, can access the database file at all.
Beginning with version 3.7.0, a new “Write Ahead Logging” (WAL) option is available, in which reading and writing can proceed concurrently.
By default, WAL is not enabled. To turn WAL on, refer to the SQLite documentation.
SQLite3 explicitly allows multiple connections:
(5) Can multiple applications or multiple instances of the same
application access a single database file at the same time?
Multiple processes can have the same database open at the same time.
Multiple processes can be doing a SELECT at the same time. But only
one process can be making changes to the database at any moment in
time, however.
For sharing connections, use SQLite3 shared cache:
Starting with version 3.3.0, SQLite includes a special "shared-cache"
mode (disabled by default)
In version 3.5.0, shared-cache mode was modified so that the same
cache can be shared across an entire process rather than just within a
single thread.
5.0 Enabling Shared-Cache Mode
Shared-cache mode is enabled on a per-process basis. Using the C
interface, the following API can be used to globally enable or disable
shared-cache mode:
int sqlite3_enable_shared_cache(int);
Each call sqlite3_enable_shared_cache() effects subsequent database
connections created using sqlite3_open(), sqlite3_open16(), or
sqlite3_open_v2(). Database connections that already exist are
unaffected. Each call to sqlite3_enable_shared_cache() overrides all
previous calls within the same process.
I had a similar code architecture as you. I used a single SQLite database which process A read from, while process B wrote to it concurrently based on events. (In python 3.10.2 using the most up to date sqlite3 version). Process B was continually updating the database, while process A was reading from it to check data. My issue was that it was working in debug mode, but not in "release" mode.
In order to solve my particular problem I used Write Ahead Logging, which is referenced in previous answers. After creating my database in Process B (write mode) I added the line:
cur.execute('PRAGMA journal_mode=wal') where cur is the cursor object created from establishing connection.
This set the journal to wal mode which allows for concurrent access for multiple reads (but only one write). In Process A, where I was reading the data, before connecting to the same database I included:
time.sleep(0.5)
Setting a sleep timer before a connection was made to the same database fixed my issue with it not working in "release" mode.
In my case: I did not have to manually set any checkpoints, locks, or transactions. Your use case might be different than mine however, so research is most likely required. Nevertheless, I hope this post helps and saves everyone some time!