SFTP with Google Cloud Composer - sftp

I need to upload a file via SFTP into an external server through Cloud Composer. The code for the task is as follows:
from airflow import DAG
from airflow.operators.python_operator import PythonVirtualenvOperator
from airflow.operators.dummy_operator import DummyOperator
from datetime import datetime, timedelta

def make_sftp():
    import paramiko
    import pysftp
    import os
    from airflow.contrib.hooks.ssh_hook import SSHHook
    import subprocess
    ssh_hook = SSHHook(ssh_conn_id="conn_id")
    sftp_client = ssh_hook.get_conn().open_sftp()
    return 0

etl_dag = DAG("dag_test",
              start_date=datetime.now(tz=local_tz),
              schedule_interval=None,
              default_args={
                  "owner": "airflow",
                  "depends_on_past": False,
                  "email_on_failure": False,
                  "email_on_retry": False,
                  "retries": 5,
                  "retry_delay": timedelta(minutes=5)})

sftp = PythonVirtualenvOperator(task_id="sftp",
                                python_callable=make_sftp,
                                requirements=["sshtunnel", "paramiko"],
                                dag=etl_dag)
start_pipeline = DummyOperator(task_id="start_pipeline", dag=etl_dag)
start_pipeline >> sftp
In "conn_id" I have used the following options: {"no_host_key_check": "true"}. The DAG runs for a couple of seconds and then fails with the following message:
WARNING - Remote Identification Change is not verified. This wont protect against Man-In-The-Middle attacks
[2022-02-10 10:01:59,358] {ssh_hook.py:171} WARNING - No Host Key Verification. This wont protect against Man-In-The-Middle attacks
Traceback (most recent call last):
  File "/tmp/venvur4zvddz/script.py", line 23, in <module>
    res = make_sftp(*args, **kwargs)
  File "/tmp/venvur4zvddz/script.py", line 19, in make_sftp
    sftp_client = ssh_hook.get_conn().open_sftp()
  File "/usr/local/lib/airflow/airflow/contrib/hooks/ssh_hook.py", line 194, in get_conn
    client.connect(**connect_kwargs)
  File "/opt/python3.6/lib/python3.6/site-packages/paramiko/client.py", line 412, in connect
    server_key = t.get_remote_server_key()
  File "/opt/python3.6/lib/python3.6/site-packages/paramiko/transport.py", line 834, in get_remote_server_key
    raise SSHException("No existing session")
paramiko.ssh_exception.SSHException: No existing session
Do I have to set other options? Thank you!

Configuring the SSH connection with key pair authentication
To SSH into the host as a user with the username “user_a”, generate an SSH key pair for that user and add the public key to the host machine. The steps below set up an SSH connection for the “jupyter” user, which has write permissions.
Run the following commands on the local machine to generate the required SSH key:
ssh-keygen -t rsa -f ~/.ssh/sftp-ssh-key -C user_a
“sftp-ssh-key” → Name of the pair of public and private keys (Public key: sftp-ssh-key.pub, Private key: sftp-ssh-key)
“user_a” → User in the VM that we are trying to connect to
chmod 400 ~/.ssh/sftp-ssh-key
Now, copy the contents of the public key sftp-ssh-key.pub into ~/.ssh/authorized_keys of your host system. Check for necessary permissions for authorized_keys and grant them accordingly using chmod.
I tested the setup with a Compute Engine VM. In the Compute Engine console, edit the VM settings to add the contents of the generated SSH public key to the instance metadata. Detailed instructions can be found here. If you are connecting to a Compute Engine VM, make sure that the instance has a firewall rule that allows the SSH connection.
Upload the private key to the client machine. In this scenario, the client is the Airflow DAG so the key file should be accessible from the Composer/Airflow environment. To make the key file accessible, it has to be uploaded to the GCS bucket associated with the Composer environment. For example, if the private key is uploaded to the data folder in the bucket, the key file path would be /home/airflow/gcs/data/sftp-ssh-key.
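The path mapping described above can be sketched as a small helper. This is only an illustration, assuming the environment bucket is mirrored under /home/airflow/gcs as in the example path; the function name local_path_for and the bucket name example-composer-bucket are hypothetical.

```python
from pathlib import PurePosixPath

# Assumption: the Composer environment bucket is mirrored at this local path.
GCS_MOUNT = PurePosixPath("/home/airflow/gcs")


def local_path_for(gcs_uri: str) -> str:
    """Map gs://<bucket>/<object path> to the path seen by the Airflow task."""
    prefix = "gs://"
    if not gcs_uri.startswith(prefix):
        raise ValueError(f"not a GCS URI: {gcs_uri!r}")
    # Drop the bucket name; only the object path is mirrored locally.
    _bucket, _, object_path = gcs_uri[len(prefix):].partition("/")
    if not object_path:
        raise ValueError("URI has no object path")
    return str(GCS_MOUNT / object_path)


# Hypothetical bucket name, used only for illustration.
print(local_path_for("gs://example-composer-bucket/data/sftp-ssh-key"))
```

The returned string is the value that would go into the key_file setting of the Airflow connection.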
Configuring the SSH connection with password authentication
If password authentication is not configured on the host machine, follow the below steps to enable password authentication.
Set the user password using the below command and enter the new password twice.
sudo passwd user_a
To enable SSH password authentication, you must SSH into the host machine as root to edit the sshd_config file.
/etc/ssh/sshd_config
Then, change the line PasswordAuthentication no to PasswordAuthentication yes. After making that change, restart the SSH service by running the following command as root.
sudo service ssh restart
Password authentication has been configured now.
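If the edit to sshd_config needs to be scripted rather than done by hand, the text change could look roughly like the sketch below. It only transforms the file contents as a string; reading and writing /etc/ssh/sshd_config and restarting the service are left out, and the function name enable_password_auth is hypothetical.

```python
import re

# Matches an active or commented-out PasswordAuthentication directive on one line.
_DIRECTIVE = re.compile(
    r"^[ \t]*#?[ \t]*PasswordAuthentication[ \t]+\w+[ \t]*$", re.MULTILINE
)


def enable_password_auth(config_text: str) -> str:
    """Return sshd_config text with PasswordAuthentication set to yes."""
    if _DIRECTIVE.search(config_text):
        return _DIRECTIVE.sub("PasswordAuthentication yes", config_text)
    # Directive absent: append it at the end of the file.
    return config_text.rstrip("\n") + "\nPasswordAuthentication yes\n"


sample = "Port 22\nPasswordAuthentication no\nUsePAM yes\n"
print(enable_password_auth(sample))
```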
Creating connections and uploading the DAG
1.1 Airflow connection with key authentication
Create a connection in Airflow with the below configuration or use the existing connection.
Extra field
The Extra JSON dictionary would look like this. Here, we have uploaded the private key file to the data folder in the Composer environment's GCS bucket.
{
"key_file": "/home/airflow/gcs/data/sftp-ssh-key",
"conn_timeout": "30",
"look_for_keys": "false"
}
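To avoid hand-typing mistakes in the Extra field, the JSON above can be generated and checked with a few lines of Python. This is a minimal sketch; the helper name build_ssh_extra is hypothetical, and the keys are simply the ones shown in the dictionary above.

```python
import json


def build_ssh_extra(key_file: str, conn_timeout: int = 30) -> str:
    """Build the Extra JSON for the SSH connection as a string."""
    extra = {
        "key_file": key_file,
        # Values are kept as strings, matching the example above.
        "conn_timeout": str(conn_timeout),
        "look_for_keys": "false",
    }
    return json.dumps(extra, indent=2)


extra_json = build_ssh_extra("/home/airflow/gcs/data/sftp-ssh-key")
print(extra_json)
```

The printed text can be pasted into the Extra field of the connection form.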
1.2 Airflow connection with password authentication
If the host machine is configured to allow password authentication, these are the changes to be made in the Airflow connection.
The Extra parameter can be empty.
The Password parameter is user_a's password on the host machine.
The task logs show that the password authentication was successful.
INFO - Authentication (password) successful!
Upload the DAG to the Composer environment and trigger the DAG. I was facing a key validation issue with the latest version of the paramiko library (paramiko==2.9.2). I tried downgrading paramiko, but the older versions do not seem to support OPENSSH keys. I found an alternative, paramiko-ng, in which the validation issue has been fixed, and changed the Python dependency from paramiko to paramiko-ng in the PythonVirtualenvOperator.
from airflow import DAG
from airflow.operators.python_operator import PythonVirtualenvOperator
from airflow.operators.dummy_operator import DummyOperator
from datetime import datetime, timedelta

def make_sftp():
    import paramiko
    from airflow.contrib.hooks.ssh_hook import SSHHook
    ssh_hook = SSHHook(ssh_conn_id="sftp_connection")
    sftp_client = ssh_hook.get_conn().open_sftp()
    print("=================SFTP Connection Successful=================")
    remote_host = "/home/sftp-folder/sample_sftp_file"  # file path in the host system
    local_host = "/home/airflow/gcs/data/sample_sftp_file"  # file path in the client system
    sftp_client.get(remote_host, local_host)  # GET operation to copy file from host to client
    sftp_client.close()
    return 0

etl_dag = DAG("sftp_dag",
              start_date=datetime.now(),
              schedule_interval=None,
              default_args={
                  "owner": "airflow",
                  "depends_on_past": False,
                  "email_on_failure": False,
                  "email_on_retry": False,
                  "retries": 5,
                  "retry_delay": timedelta(minutes=5)})

sftp = PythonVirtualenvOperator(task_id="sftp",
                                python_callable=make_sftp,
                                requirements=["sshtunnel", "paramiko-ng", "pysftp"],
                                dag=etl_dag)
start_pipeline = DummyOperator(task_id="start_pipeline", dag=etl_dag)
start_pipeline >> sftp
Results
The sample_sftp_file has been copied from the host system to the specified Composer bucket.
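The DAG above downloads a file with get, while the question asks about uploading. A minimal sketch of the upload direction is shown below, assuming the client returned by open_sftp() exposes put(localpath, remotepath) as paramiko's SFTPClient does. The helper name upload_file and the recording stub are hypothetical; the stub only lets the snippet run without a real server.

```python
def upload_file(sftp_client, local_path: str, remote_path: str) -> None:
    """Copy a local file to the remote host, then close the SFTP session."""
    try:
        # PUT operation: client (Composer) -> host
        sftp_client.put(local_path, remote_path)
    finally:
        sftp_client.close()


class RecordingClient:
    """Stand-in for an SFTP client, used here only for a dry run."""

    def __init__(self):
        self.calls = []
        self.closed = False

    def put(self, localpath, remotepath):
        self.calls.append((localpath, remotepath))

    def close(self):
        self.closed = True


client = RecordingClient()
upload_file(client,
            "/home/airflow/gcs/data/sample_sftp_file",
            "/home/sftp-folder/sample_sftp_file")
print(client.calls)
```

Inside make_sftp, the same helper would be called with the object returned by ssh_hook.get_conn().open_sftp() instead of the stub.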

Related

Can't access the fastapi page using the public ipv4 address of the deployed aws ec2 instance with uvicorn running service

I was testing a simple FastAPI backend by deploying it on an AWS EC2 instance. The service runs fine on the default port 8000 on my local machine. But when I ran the script on the EC2 instance with
uvicorn main:app --reload
it ran just fine with the following output:
INFO: Will watch for changes in these directories: ['file/path']
INFO: Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
INFO: Started reloader process [4698] using StatReload
INFO: Started server process [4700]
INFO: Waiting for application startup.
INFO: Application startup complete.
Then, in the EC2 security group configuration, TCP traffic on port 8000 was allowed, as shown in the image below.
ec2 security group port detail
Then, to test and access the service, I opened the public IPv4 address with the port as https://ec2-public-ipv4-ip:8000/ in Chrome.
But there is no response whatsoever.
The webpage is as below:
result webpage
The error in the console is as below:
VM697:6747 crbug/1173575, non-JS module files deprecated.
(anonymous) @ VM697:6747
The FastAPI main file contains:
from fastapi import FastAPI, Form, Depends
from fastapi.middleware.cors import CORSMiddleware
from fastapi.encoders import jsonable_encoder
import joblib
import numpy as np
import os
from own.preprocess import Preprocess
import sklearn

col_data = joblib.load("col_bool_mod.z")
app = FastAPI()

@app.get("/predict")
async def test():
    return jsonable_encoder(col_data)

@app.post("/predict")
async def provide(data: list):
    print(data)
    output = main(data)
    return output

def predict_main(df):
    num_folds = len(os.listdir("./models/"))
    result_li = []
    for fold in range(num_folds):
        print(f"predicting for fold {fold} / {num_folds}")
        model = joblib.load(f"./models/tabnet/{fold}_tabnet_reg_adam/{fold}_model.z")
        result = model.predict(df)
        print(result)
        result_li.append(result)
    return np.mean(result_li)

def main(data):
    df = Preprocess(data)
    res = predict_main(df)
    print(res)
    return {"value": f"{np.float64(res).item():.3f}" if res >=0 else f"{np.float64(0).item()}"}
The React frontend runs fine on port 3000 with the same steps, but the FastAPI service on port 8000 is somehow not working.
Thank you for your patience.
I wanted to retrieve the basic API responses from the FastAPI/uvicorn server deployed on an AWS EC2 instance. But there is no response, even with port 8000 open and the instance accessed through its public IPv4 address.
One way to fix the problem is to add the public IPv4 address followed by port 3000 to the CORS origins. But the goal is to hide the GET request data that the browser fetches through port 8000.

Jupyter Password Not Hashed

When I try to set up the jupyter notebook password, I don't get a password hash when I open up the jupyter_notebook_config.json file.
This is the output of the json file:
{
"NotebookApp": {
"password":
"argon2:$argon2id$v=19$m=10240,t=10,p=8$pcTg1mB/X5a3XujQqYq/wQ$/UBQBRlFdzmEmxs6c2IzmQ"
}
}
I've also tried running passwd() from Python, following the "Preparing a hashed password" instructions found online, but it produces the same result as above. No hash.
Can someone please let me know what I'm doing wrong?
I'm trying to set up a Jetson Nano in a similar fashion to the Deep Learning Institute Nano build. With that build you can run Jupyter Lab remotely, so the Nano can run headless. I'm trying to do the same thing, with no luck.
Thanks!
That value is already a hash. argon2 is the default algorithm:
https://github.com/jupyter/notebook/blob/v6.5.2/notebook/auth/security.py#L23
You can pass a different algorithm, such as sha1, if you prefer:
>>> from notebook.auth import passwd
>>> from notebook.auth.security import passwd_check
>>>
>>> password = 'myPass123'
>>>
>>> hashed_argon2 = passwd(password)
>>> hashed_sha1 = passwd(password, 'sha1')
>>>
>>> print(hashed_argon2)
argon2:$argon2id$v=19$m=10240,t=10,p=8$JRz5GPqjOYJu/cnfXc5MZw$LZ5u6kPKytIv/8B/PLyV/w
>>>
>>> print(hashed_sha1)
sha1:c29c6aeeecef:0b9517160ce938888eb4a6ec9ca44e3a31da9519
>>>
>>> passwd_check(hashed_argon2, password)
True
>>>
>>> passwd_check(hashed_sha1, password)
True
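Both strings in the session above share the same shape: the algorithm name, a colon, then the algorithm-specific data. A small sketch that reads the prefix makes it easier to see that the value in jupyter_notebook_config.json is a hash and not a plain password. The helper name hash_algorithm is hypothetical and is not part of the notebook package.

```python
def hash_algorithm(hashed: str) -> str:
    """Return the algorithm prefix of a Jupyter password hash string."""
    algorithm, sep, rest = hashed.partition(":")
    if not sep or not algorithm or not rest:
        raise ValueError("not in '<algorithm>:<data>' form")
    return algorithm


# Values copied from the question and from the session above.
from_config = "argon2:$argon2id$v=19$m=10240,t=10,p=8$pcTg1mB/X5a3XujQqYq/wQ$/UBQBRlFdzmEmxs6c2IzmQ"
from_sha1 = "sha1:c29c6aeeecef:0b9517160ce938888eb4a6ec9ca44e3a31da9519"

print(hash_algorithm(from_config))
print(hash_algorithm(from_sha1))
```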
Check that you don't have a different Jupyter server running on your machine. It happened to me that I kept trying a password on port 8888 while my intended server was on port 8889.
Another time, Anaconda had started a server on localhost:8888 while I was trying to reach a port mapped from a Docker container, also 8888, and the only way to reach the container was through 0.0.0.0:8888.

Apache airflow REST API call fails with 403 forbidden when API authentication is enabled

Apache Airflow REST API fails with 403 forbidden for the call:
"/api/experimental/test"
Configuration in airflow.cfg
[webserver]
authenticate = True
auth_backend = airflow.contrib.auth.backends.password_auth
[api]
rbac = True
auth_backend = airflow.contrib.auth.backends.password_auth
After setting all this, the Docker image is built and run as a container.
Created the airflow user as follows:
airflow create_user -r Admin -u admin -e admin@hpe.com -f Administrator -l 1 -p admin
Login with these credentials works fine in the web UI, whereas login to the REST API does not work.
HTTP Header for authentication:
Authorization BASIC YWRtaW46YWRtaW4=
Airflow version: 1.10.9
Creating the user in the following manner gives access to the Airflow experimental API with credentials.
import airflow
from airflow import models, settings
from airflow.contrib.auth.backends.password_auth import PasswordUser
user = PasswordUser(models.User())
user.username = 'new_user_name'
user.email = 'new_user_email@example.com'
user.password = 'set_the_password'
session = settings.Session()
session.add(user)
session.commit()
session.close()
exit()
A user created with the "airflow create_user" command cannot access the Airflow experimental APIs.
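For reference, the Authorization value from the question is just the Base64 encoding of "admin:admin". The sketch below rebuilds it with the standard library, which can help rule out an encoding mistake before looking at the Airflow side. The helper name basic_auth_header is hypothetical.

```python
import base64


def basic_auth_header(username: str, password: str) -> dict:
    """Build an HTTP Basic Authorization header for the given credentials."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Credentials taken from the create_user command in the question.
print(basic_auth_header("admin", "admin"))
```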

Airflow - Failed to fetch log file from worker. 404 Client Error: NOT FOUND for url

I am running Airflow v1.9 with the Celery Executor. I have 5 Airflow workers running on 5 different machines. The Airflow scheduler is also running on one of these machines. I have copied the same airflow.cfg file to all 5 machines.
I have daily workflows set up in different queues such as DEV, QA, etc. (each worker runs with an individual queue name), and they are running fine.
While scheduling a DAG on one of the workers (no other DAG has been set up for this worker/machine previously), I see the following error in the first task, and as a result the downstream tasks fail:
*** Log file isn't local.
*** Fetching here: http://<worker hostname>:8793/log/PDI_Incr_20190407_v2/checkBCWatermarkDt/2019-04-07T17:00:00/1.log
*** Failed to fetch log file from worker. 404 Client Error: NOT FOUND for url: http://<worker hostname>:8793/log/PDI_Incr_20190407_v2/checkBCWatermarkDt/2019-04-07T17:00:00/1.log
I have configured MySQL for storing the DAG metadata. When I checked task_instance table, I see proper hostnames are populated against the task.
I also checked the log location and found that the log is getting created.
airflow.cfg snippet:
base_log_folder = /var/log/airflow
base_url = http://<webserver ip>:8082
worker_log_server_port = 8793
api_client = airflow.api.client.local_client
endpoint_url = http://localhost:8080
What am I missing here? What configurations do I need to check additionally for resolving this issue?
Looks like the worker's hostname is not being correctly resolved.
Add a file hostname_resolver.py:
import os
import socket
import requests

def resolve():
    """
    Resolves Airflow external hostname for accessing logs on a worker
    """
    if 'AWS_REGION' in os.environ:
        # Return EC2 instance hostname:
        return requests.get(
            'http://169.254.169.254/latest/meta-data/local-ipv4').text
    # Use DNS request for finding out what's our external IP:
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.connect(('1.1.1.1', 53))
    external_ip = s.getsockname()[0]
    s.close()
    return external_ip
And export: AIRFLOW__CORE__HOSTNAME_CALLABLE=airflow.hostname_resolver:resolve
The webserver on the master has to fetch the log from the worker in order to display it on the front-end page, and to do that it looks up the worker's hostname. Here the hostname cannot be resolved, so add the hostname-to-IP mapping to /etc/hosts on the master.
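Before and after editing /etc/hosts, it can help to check from the webserver machine whether the worker hostname stored in the task_instance table resolves at all. The following is a minimal sketch using only the standard library; the function name can_resolve and the hostname worker-hostname-0 are hypothetical placeholders.

```python
import socket


def can_resolve(hostname: str) -> bool:
    """Return True if this machine can resolve the hostname to an IPv4 address."""
    try:
        socket.gethostbyname(hostname)
    except OSError:
        # socket.gaierror (a subclass of OSError) is raised when the lookup fails.
        return False
    return True


# Replace with the hostname recorded for the failing task (placeholder here).
for host in ["worker-hostname-0"]:
    print(host, "resolves" if can_resolve(host) else "does NOT resolve")
```

If the hostname does not resolve here, the webserver cannot build a working URL to port 8793 on the worker either.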
If this happens as part of a Docker Compose Airflow setup, the hostname resolution needs to be passed to the container hosting the webserver, e.g. through extra_hosts:
# docker-compose.yml
version: "3.9"
services:
  webserver:
    extra_hosts:
      - "worker_hostname_0:192.168.xxx.yyy"
      - "worker_hostname_1:192.168.xxx.zzz"
    ...
  ...
More details here.

After changing passwords in vault.yml, deployment fails in Trellis

I had a WordPress site set up using Trellis. Initially I had set up the server and deployed without encrypting vault.yml.
Once everything was working fine, I changed the passwords in vault.yml and encrypted the file. But my deployment fails now.
I get the following error:
TASK [deploy : WordPress Installed?]
**************************
System info:
Ansible 2.6.3; Darwin
Trellis version (per changelog): "Allow customizing Nginx `worker_connections`"
---------------------------------------------------
non-zero return code
Error: Error establishing a database connection. This either means that
the username and password information in your `wp-config.php` file is
incorrect or we can’t contact the database server at `localhost`. This
could mean your host’s database server is down.
fatal: [mysite.org]: FAILED! => {"changed": false, "cmd": ["wp", "core", "is-installed", "--skip-plugins", "--skip-themes", "--require=/srv/www/mysite.org/shared/tmp_multisite_constants.php"], "delta": "0:00:00.224955", "end": "2019-01-04 16:59:01.531111", "failed_when_result": true, "rc": 1, "start": "2019-01-04 16:59:01.306156", "stderr_lines": ["Error: Error establishing a database connection. This either means that the username and password information in your `wp-config.php` file is incorrect or we can’t contact the database server at `localhost`. This could mean your host’s database server is down."], "stdout": "", "stdout_lines": []}
to retry, use: --limit @/Users/praneethavelamuri/Desktop/path/to/my/project/trellis/deploy.retry
Is there a step I missed? I followed these steps:
ansible-playbook server.yml -e env=staging
./bin/deploy.sh staging mysite.org
Change the passwords in staging/vault.yml
Set the vault password
Inform Ansible about the password
Encrypt the file
Commit the file and push the repo
Redeploy, and then I get the error!
I got it solved. I had changed the sudo user password in my vault too. I SSHed into the server, changed the sudo password to the one given in the vault, then provisioned and deployed again, and that solved the issue.

Resources