In Airflow, HTTP (and other) connections can be defined as environment variables. However, it is hard to use an https scheme for these connections.
Such a connection could be:
export AIRFLOW_CONN_MY_HTTP_CONN=http://example.com
However, defining a secure connection is not possible:
export AIRFLOW_CONN_MY_HTTP_CONN=https://example.com
This is because Airflow strips the scheme (https), and the URL in the final connection object ends up with http as its scheme.
It turns out that it is possible to use https by defining the connection like this:
export AIRFLOW_CONN_MY_HTTP_CONN=https://example.com/https
The second https is called schema in the Airflow code (as in DSNs, e.g. postgresql://user:passw@host/schema). This schema is then used as the scheme when the final URL in the connection object is constructed.
I am wondering if this is by design, or just an unfortunate mix-up of scheme and schema.
For those who land on this question in the future, I can confirm that @jjmurre's answer works well for Airflow 2.1.3.
In this case we need URI-encoded string.
export AIRFLOW_CONN_SLACK='http://https%3a%2f%2fhooks.slack.com%2fservices%2f...'
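For reference, the encoded value can be produced with a small Python helper (just a sketch; the webhook URL below is a placeholder):
import urllib.parse

# Placeholder target URL; substitute the real webhook endpoint.
target = "https://hooks.slack.com/services/..."

# Percent-encode everything, including ":" and "/", so the whole URL fits
# into the host part of the Airflow connection URI.
encoded = urllib.parse.quote(target, safe="")

print("export AIRFLOW_CONN_SLACK='http://" + encoded + "'")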
See this post for more details.
Hope this saves other folks the hour I spent investigating it.
You should use Connections, and then you can specify the schema.
This is what worked for me using Bitnami Airflow:
.env
MY_SERVER=my-conn-type://xxx.com:443/https
docker-compose.yml
environment:
- AIRFLOW_CONN_MY_SERVER=${MY_SERVER}
Problem Statement
When editing in the UI I can modify the extra field to contain {"no_host_key_check": true}
But when I attempt to add this connection from the CLI with this command, which follows the connections documentation format,
airflow connections add local_sftp --conn-uri "sftp://test_user:test_pass@local_spark_sftp_server_1:22/schema?extra='{\"no_host_key_check\": true}"
The connection is added as {"extra": "'{\"no_host_key_check\": true}"}
How do I need to modify my airflow connections add command to properly format this connection configuration?
This comes down to a misunderstanding: all query parameters in the URI passed to the airflow connections add command are treated as extra parameters by default, so there is no need to wrap them in an extra field.
airflow connections add local_sftp --conn-uri "sftp://test_user:test_pass@local_spark_sftp_server_1:22/schema?no_host_key_check=true"
Correctly sets the desired parameter.
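To double-check how the extras landed, one option (a sketch, assuming Airflow 2.x and the connection id used above) is to read the connection back with the base hook:
from airflow.hooks.base import BaseHook

# Fetch the connection that was added via the CLI.
conn = BaseHook.get_connection("local_sftp")

# Query-string extras are stored as strings, so expect {'no_host_key_check': 'true'}.
print(conn.extra_dejson)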
I have a project requirement to back up the Airflow metadata DB to some data warehouse (but not using an Airflow DAG). At the same time, the requirement mentions some connection called airflow_db.
I am quite new to Airflow, so I googled a bit on the topic, but I am a bit confused about this part. Our Airflow metadata DB is PostgreSQL (this is built from docker-compose, so I am tinkering on a local install), but when I look at Connections in the Airflow web UI, it says airflow_db is MySQL.
I initially assumed that they are the same, but by the looks of it, they aren't? Can someone explain the difference and what they are for?
Airflow creates the airflow_db Conn Id with MySQL by default (see source code).
Default connections are not really useful in a production system. It's just a long list of stuff that you are probably not going to use.
Airflow 1.10.10 introduced the ability not to create the default connection list by setting:
load_default_connections = False in airflow.cfg (See PR)
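The same option can also be set through Airflow's standard environment-variable mapping (AIRFLOW__SECTION__KEY), for example:
export AIRFLOW__CORE__LOAD_DEFAULT_CONNECTIONS=False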
To give more background: the connection list is where hooks find the information needed in order to connect to a service. It's not related to the backend database. That said, the backend is a database like any other, and if you wish to allow hooks to interact with it you can define it in the list like any other connection (which is probably why it appears among the defaults).
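For instance, if you point a connection at the Postgres metadata database described in the question, an ordinary hook can query it like any other database. A sketch (the connection id and the provider package are assumptions based on the setup above):
from airflow.providers.postgres.hooks.postgres import PostgresHook

# Assumes a connection id "airflow_db" that points at the Postgres metadata DB.
hook = PostgresHook(postgres_conn_id="airflow_db")

# Read a few rows from one of the metadata tables.
rows = hook.get_records("SELECT dag_id, is_paused FROM dag LIMIT 5")
print(rows)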
In regards to my previous Stack Overflow post here, I've finally upgraded from Airflow version 1.9 to 1.10 since it's now released on PyPI. Using their release guide here, I got Airflow 1.10 working. I then inspected the updates in 1.10 to see how they addressed the bug discovered in Airflow version 1.9 when run on an AWS EC2 instance, and I found that they replaced all the functions that got the server's IP address with a call to the new function get_hostname in https://github.com/apache/incubator-airflow/blob/master/airflow/utils/net.py. Inside that small function you see the comment that says,
Fetch the hostname using the callable from the config or using
socket.getfqdn as a fallback.
So then after that comment you see the code,
callable_path = conf.get('core', 'hostname_callable')
This tells us that in airflow.cfg, under the [core] section, there is a new key/value field called hostname_callable which lets us choose how the server's IP address is fetched. So their fix for the bug seen in Airflow version 1.9 is simply to let us choose how to fetch the IP address if we need to change it. The default value for this new configuration field is seen here https://github.com/apache/incubator-airflow/blob/master/airflow/config_templates/default_airflow.cfg under the [core] section. You can see they have it set as,
[core]
# Hostname by providing a path to a callable, which will resolve the hostname
hostname_callable = socket:getfqdn
So they're using the function call socket:getfqdn, which causes the bug to occur when run on an AWS EC2 instance. I need it to use socket.gethostbyname(socket.gethostname()) instead (again, this is covered in more detail in my previous post).
So my question is: what syntax do I need in order to express socket.gethostbyname(socket.gethostname()) in the configuration's colon notation? For example, the function call socket.getfqdn() is written in the configuration file as socket:getfqdn, and I don't see how to write a nested call that way. Would I write something like socket:gethostbyname(socket.gethostname())?
If it were me, I would create an airflow_local_settings module with a hostname_callable function which returns the necessary value. From the code it looks like you could set the value to airflow_local_settings:hostname_callable.
Basically, in airflow_local_settings:
import socket

def hostname_callable():
    # Resolve the machine's IP address rather than its fully qualified domain name.
    return socket.gethostbyname(socket.gethostname())
Then install your airflow_local_settings module to the computer as you would any other module.
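Then, mirroring the default shown earlier, point the config at the new callable (this assumes the module is importable on every Airflow machine):
[core]
hostname_callable = airflow_local_settings:hostname_callable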
This is not a direct answer but might help someone facing this or similar issues.
I ran into a similar issue when using Airflow locally on macOS Catalina.
Sometimes (seemingly random) the local_task_job.py crashed saying
The recorded hostname mycustomhostname.local does not match this instance's hostname 1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.ip6.arpa
I could solve the issue by editing the airflow.cfg and replacing
hostname_callable = socket:getfqdn
with
hostname_callable = socket:gethostname
I have some Lua scripts embedded in nginx. In one of those scripts I connect to my Redis cache and do it like so:
local redis_host = "127.0.0.1"
local redis_port = 6379
...
local ok, err = red:connect(redis_host, redis_port);
I do not like this because I have to hard-code the host and port. Should I instead use something like an .ini file, parse it in Lua, and get the configuration from there? How is this problem solved in real-world practice?
Besides, in my scripts I use RSA decryption and encryption. For example, this is how I do it now:
local public_key = [[
-----BEGIN PUBLIC KEY-----
MFwwDQYJKoZIhvcNAQEBBQADSwAwSAJBAL7udJ++o3T6lgbFwWfaD/9xUMEZMtbm
GvbI35gEgzjrRcZs4X3Sikm7QboxMJMrfzjQxISPLtsy9+vhbITQNVkCAwEAAQ==
-----END PUBLIC KEY-----
]]
...
local jwt_obj = jwt:verify(public_key, token)
Once again, what I do not like about this is that I have to hard-code the public key. Is it done like this in production, or are there other techniques for storing secrets (like keeping them in environment variables)?
I'm sure some people do it this way in production. It is all a matter of what you're comfortable with and what your standards are. Some things that should determine your approach here:
What is the sensitivity of the data and risk if it were to be available publicly?
What is your deployment process? If you use an infrastructure as code approach or some type of config management then you surely don't want these items sitting embedded within code.
To address the first item, the sensitivity of the data, you'd need to weigh the different ways of securing the secrets. Standard secret stores like AWS Parameter Store and CredStash are built for exactly this purpose; you'd pull the secrets at runtime and load them into memory.
For the second item, you could use a config file that is replaced per deployment.
To get the best of both worlds, you'd need to combine both a secure mechanism for storing secrets and a configuration approach for deployments/updates.
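As a concrete illustration of that combination, a deploy step could pull the secret from a store and render the config file your Lua code reads. A sketch using boto3 and AWS Parameter Store (the parameter name and file path are made up):
import json

import boto3

# Hypothetical parameter holding the Redis settings and public key as JSON.
PARAM_NAME = "/myapp/nginx/secrets"

ssm = boto3.client("ssm")
resp = ssm.get_parameter(Name=PARAM_NAME, WithDecryption=True)
secrets = json.loads(resp["Parameter"]["Value"])

# Write a config file for the Lua scripts as part of the deployment.
with open("/etc/nginx/lua_secrets.json", "w") as fh:
    json.dump(secrets, fh)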
As was mentioned in the comments, there are books written on both of these topics, so the chances of covering them in enough detail in an SO answer are slim.
Both plugins seem to use the same code for redis_gzip_flag and memcached_gzip_flag, and neither provides any instructions about this flag or how to set it, given that Redis strings don't have any flag support.
So what is this flag?
Where do I set it in Redis?
What number should I choose in the nginx config?
I hadn't heard of this, but I found an example here; it looks like you add it manually to your location block when you know the data you're going to be requesting from Redis is gzipped.
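For what it's worth, a hypothetical location block might look like this (a sketch only; the key and the flag number depend on how the gzipped data was written, and the flag must match whatever your writer used):
location /cached {
    set $redis_key $uri;          # key to look up in Redis
    redis_pass 127.0.0.1:6379;    # upstream Redis server
    redis_gzip_flag 1;            # tell nginx the stored value is gzip-compressed
    default_type text/html;
}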