Cannot add JDBC driver in Sqoop command when running import command using Airflow 2.5.0

I am running a Sqoop import command which imports a table from a MySQL database and loads it into HDFS. I have created the below DAG which performs this activity.
from airflow.models import DAG
from airflow.contrib.operators.sqoop_operator import SqoopOperator
from airflow.utils.dates import days_ago

Dag_Sqoop_Import = DAG(dag_id="SqoopImport",
                       schedule_interval="* * * * *",
                       start_date=days_ago(2))

sqoop_mysql_import = SqoopOperator(conn_id="sqoop_local",
                                   table="shipmethod",
                                   cmd_type="import",
                                   target_dir="/airflow_sqoopImport",
                                   num_mappers=1,
                                   task_id="SQOOP_Import",
                                   dag=Dag_Sqoop_Import)

sqoop_mysql_import
I have also created a SqoopImport connection in Airflow as below.
When I trigger the job, I assume it should execute the command below:
sqoop import --connect jdbc:mysql://192.168.0.15:3306/adventureworks?characterEncoding=latin1 --driver com.mysql.jdbc.Driver --username xxxx --password xxxxxx --autoreset-to-one-mapper --table workorder --target-dir /user/adminn/workorder
But when I check the logs, it is actually executing the command below:
Executing command: sqoop import --username xxxx --password MASKED --num-mappers 1 --connect jdbc:mysql://192.168.0.15:3306/adventureworks?characterEncoding=latin1 --target-dir /airflow_sqoopImport --as-textfile --table shipmethod
And the DAG fails with the error below. I know the cause of this error: I need to add the parameter --driver com.mysql.jdbc.Driver, which should solve it, but I am struggling to add it. Can you please let me know where I am going wrong?
ERROR manager.SqlManager: Error reading from database: java.sql.SQLException: Streaming result set com.mysql.jdbc.RowDataDynamic#5906ebcb is still active. No statements may be issued when any streaming result sets are open and in use on a given connection. Ensure that you have called .close() on any active streaming result sets before attempting more queries.'
Replies appreciated, thanks.

You should provide the driver class as an argument to the operator, not in the connection:
sqoop_mysql_import = SqoopOperator(conn_id="sqoop_local",
                                   table="shipmethod",
                                   cmd_type="import",
                                   target_dir="/airflow_sqoopImport",
                                   driver="com.mysql.jdbc.Driver",
                                   num_mappers=1,
                                   task_id="SQOOP_Import",
                                   dag=Dag_Sqoop_Import)
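With the driver argument set, the command rendered by the operator should now include the --driver flag, roughly like this (illustrative, values taken from the question):
sqoop import --username xxxx --password MASKED --num-mappers 1 --connect jdbc:mysql://192.168.0.15:3306/adventureworks?characterEncoding=latin1 --driver com.mysql.jdbc.Driver --target-dir /airflow_sqoopImport --as-textfile --table shipmethod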

Related

How to connect to HDFS in Airflow?

How to perform HDFS operation in Airflow?
Make sure you install the following Python package:
pip install apache-airflow-providers-apache-hdfs
# Code snippet
# Import packages
from airflow import DAG, settings
from airflow.models import Connection
from airflow.utils.dates import days_ago
from datetime import timedelta
from airflow.operators.bash import BashOperator

# Define the new DAG
dag_execute_hdfs_commands = DAG(
    dag_id='connect_hdfs',
    schedule_interval='@once',
    start_date=days_ago(1),
    dagrun_timeout=timedelta(minutes=60),
    description='executing hdfs commands',
)
# Establish a connection to HDFS
conn = Connection(
    conn_id='webhdfs_default1',
    conn_type='HDFS',
    host='localhost',
    login='usr_id',
    password='password',
    port=9000,
)

session = settings.Session()
# The following lines add the new connection to your Airflow metadata DB.
# Once the DAG has run successfully, comment them out, because we do not want
# to add the same connection "webhdfs_default1" every time we perform HDFS operations.
session.add(conn)   # on your next run comment this out
session.commit()    # persist the connection
session.close()

if __name__ == '__main__':
    dag_execute_hdfs_commands.cli()
Once the above DAG runs successfully, you can perform HDFS operations thereafter.
For example, if you wish to list the files in an HDFS directory, try the following code:
# File listing operation
start_task = BashOperator(
    task_id="start_task",
    bash_command="hdfs dfs -ls /",
    dag=dag_execute_hdfs_commands
)

start_task
You cannot use the connection webhdfs_default with BashOperator; it works with the WebHDFSHook hook, which creates a client to query the WebHDFS server. Currently there are two implemented methods:
check_for_path: to check if a file exists in HDFS
load_file: to upload a file to HDFS
You can access the client to do other operations:
from airflow.providers.apache.hdfs.hooks.webhdfs import WebHDFSHook

webHDFS_hook = WebHDFSHook(webhdfs_conn_id="<your conn id>")
client = webHDFS_hook.get_conn()
client.<operation>
The client is an instance of hdfs.InsecureClient if the conf core.security is not kerberos, and hdfs.ext.kerberos.KerberosClient if it is. Here is the documentation of the hdfs CLI clients; you can check what the available operations are and use them.
There are a lot of available operations, like download, delete, list, read, make_dir, ..., which you can call from a new Airflow operator.
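As a minimal sketch (not from the original answer), assuming the webhdfs_default connection exists, the client could be used from a PythonOperator like this; the directory path and the DAG object are placeholders:

from airflow.operators.python import PythonOperator
from airflow.providers.apache.hdfs.hooks.webhdfs import WebHDFSHook

def list_hdfs_dir():
    # Build a WebHDFS client from the Airflow connection
    hook = WebHDFSHook(webhdfs_conn_id="webhdfs_default")
    client = hook.get_conn()
    hdfs_path = "/some/hdfs/path"  # placeholder directory
    # check_for_path comes from the hook, list() from the hdfs client
    if hook.check_for_path(hdfs_path):
        print(client.list(hdfs_path))

list_hdfs_files = PythonOperator(
    task_id="list_hdfs_files",
    python_callable=list_hdfs_dir,
    dag=dag,  # your DAG object, assumed to be defined elsewhere
)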

Airflow Only works with the Celery, CeleryKubernetes or Kubernetes executors

I have this DAG; nevertheless, when trying to run it, it gets stuck in the Queued state. When I then try to run it manually, I get this error:
Error:
Only works with the Celery, CeleryKubernetes or Kubernetes executors
Code:
from airflow import DAG
from airflow.providers.postgres.hooks.postgres import PostgresHook
from airflow.operators.python import PythonOperator
from datetime import datetime

def helloWorld():
    print('Hello World')

def take_clients():
    hook = PostgresHook(postgres_conn_id="postgres_robert")
    df = hook.get_pandas_df(sql="SELECT * FROM clients;")
    print(df)
    # do what you need with the df....

with DAG(dag_id="test",
         start_date=datetime(2021, 1, 1),
         schedule_interval="@once",
         catchup=False) as dag:

    task1 = PythonOperator(
        task_id="hello_world",
        python_callable=helloWorld)

    task2 = PythonOperator(
        task_id="get_clients",
        python_callable=take_clients)

    task1 >> task2
I guess you are trying to use the Run button from the UI.
This button is enabled only for executors that support it.
In your Airflow setup you are using an executor that doesn't support this command.
In newer Airflow versions the button is simply disabled if you are using an executor that doesn't support it.
I assume that what you are after is to create a new run; in that case you should use the Trigger Run button. If you are looking to re-run a specific task, then use the Clear button.
You are running it with the LocalExecutor; you have to change your executor to Celery, CeleryKubernetes, Kubernetes, or Dask.
If you are using docker-compose, add:
AIRFLOW__CORE__EXECUTOR: CeleryExecutor
otherwise, see the Airflow Executor documentation.
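For a non-Docker setup, a minimal sketch of the equivalent change in airflow.cfg (assuming the Celery broker and result backend are already configured) would be:

[core]
executor = CeleryExecutor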

SFTP with Google Cloud Composer

I need to upload a file via SFTP to an external server through Cloud Composer. The code for the task is as follows:
from airflow import DAG
from airflow.operators.python_operator import PythonVirtualenvOperator
from airflow.operators.dummy_operator import DummyOperator
from datetime import datetime, timedelta

def make_sftp():
    import paramiko
    import pysftp
    import os
    from airflow.contrib.hooks.ssh_hook import SSHHook
    import subprocess

    ssh_hook = SSHHook(ssh_conn_id="conn_id")
    sftp_client = ssh_hook.get_conn().open_sftp()
    return 0

etl_dag = DAG("dag_test",
              start_date=datetime.now(tz=local_tz),
              schedule_interval=None,
              default_args={
                  "owner": "airflow",
                  "depends_on_past": False,
                  "email_on_failure": False,
                  "email_on_retry": False,
                  "retries": 5,
                  "retry_delay": timedelta(minutes=5)})

sftp = PythonVirtualenvOperator(task_id="sftp",
                                python_callable=make_sftp,
                                requirements=["sshtunnel", "paramiko"],
                                dag=etl_dag)

start_pipeline = DummyOperator(task_id="start_pipeline", dag=etl_dag)

start_pipeline >> sftp
In "conn_id" I have used the following options: {"no_host_key_check": "true"}, the DAG runs for a couple of seconds and the fail with the following message:
WARNING - Remote Identification Change is not verified. This wont protect against Man-In-The-Middle attacks\n[2022-02-10 10:01:59,358] {ssh_hook.py:171} WARNING - No Host Key Verification. This wont protect against Man-In-The-Middle attacks\nTraceback (most recent call last):\n File "/tmp/venvur4zvddz/script.py", line 23, in <module>\n res = make_sftp(*args, **kwargs)\n File "/tmp/venvur4zvddz/script.py", line 19, in make_sftp\n sftp_client = ssh_hook.get_conn().open_sftp()\n File "/usr/local/lib/airflow/airflow/contrib/hooks/ssh_hook.py", line 194, in get_conn\n client.connect(**connect_kwargs)\n File "/opt/python3.6/lib/python3.6/site-packages/paramiko/client.py", line 412, in connect\n server_key = t.get_remote_server_key()\n File "/opt/python3.6/lib/python3.6/site-packages/paramiko/transport.py", line 834, in get_remote_server_key\n raise SSHException("No existing session")\nparamiko.ssh_exception.SSHException: No existing session\n'
Do I have to set other options? Thank you!
Configuring the SSH connection with key pair authentication
To SSH into the host as a user with username “user_a”, an SSH key pair should be generated for that user and the public key should be added to the host machine. The following are the steps that create an SSH connection to the “user_a” user, which has write permissions.
Run the following commands on the local machine to generate the required SSH key:
ssh-keygen -t rsa -f ~/.ssh/sftp-ssh-key -C user_a
“sftp-ssh-key” → Name of the pair of public and private keys (Public key: sftp-ssh-key.pub, Private key: sftp-ssh-key)
“user_a” → User in the VM that we are trying to connect to
chmod 400 ~/.ssh/sftp-ssh-key
Now, copy the contents of the public key sftp-ssh-key.pub into ~/.ssh/authorized_keys of your host system. Check for necessary permissions for authorized_keys and grant them accordingly using chmod.
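For example, one way to append the key is the following sketch (the host IP is a placeholder, and it assumes you can already reach the host as user_a, e.g. via password authentication):

# Append the generated public key to user_a's authorized_keys on the host
cat ~/.ssh/sftp-ssh-key.pub | ssh user_a@<host-ip> \
  'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys'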
I tested the setup with a Compute Engine VM. In the Compute Engine console, edit the VM settings to add the contents of the generated SSH public key into the instance metadata. Detailed instructions can be found here. If you are connecting to a Compute Engine VM, make sure that the instance has the appropriate firewall rule to allow the SSH connection.
Upload the private key to the client machine. In this scenario, the client is the Airflow DAG so the key file should be accessible from the Composer/Airflow environment. To make the key file accessible, it has to be uploaded to the GCS bucket associated with the Composer environment. For example, if the private key is uploaded to the data folder in the bucket, the key file path would be /home/airflow/gcs/data/sftp-ssh-key.
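Uploading the key to the Composer bucket can be done with gsutil, for instance (the bucket name is a placeholder):

# Copy the private key into the data/ folder of the Composer environment's bucket;
# it is then visible to the workers under /home/airflow/gcs/data/
gsutil cp ~/.ssh/sftp-ssh-key gs://<your-composer-bucket>/data/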
Configuring the SSH connection with password authentication
If password authentication is not configured on the host machine, follow the below steps to enable password authentication.
Set the user password using the below command and enter the new password twice.
sudo passwd user_a
To enable SSH password authentication, you must SSH into the host machine as root to edit the sshd_config file.
/etc/ssh/sshd_config
Then, change the line PasswordAuthentication no to PasswordAuthentication yes. After making that change, restart the SSH service by running the following command as root.
sudo service ssh restart
Password authentication has been configured now.
Creating connections and uploading the DAG
1.1 Airflow connection with key authentication
Create a connection in Airflow with the below configuration or use the existing connection.
Extra field
The Extra JSON dictionary would look like this. Here, we have uploaded the private key file to the data folder in the Composer environment's GCS bucket.
{
  "key_file": "/home/airflow/gcs/data/sftp-ssh-key",
  "conn_timeout": "30",
  "look_for_keys": "false"
}
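If you prefer the CLI over the UI, a sketch of creating such a connection with the Airflow 2 CLI (flag names differ in older versions; the connection id, host, and login are placeholders matching the DAG below) could be:

airflow connections add 'sftp_connection' \
    --conn-type 'ssh' \
    --conn-host '<vm-external-ip>' \
    --conn-login 'user_a' \
    --conn-extra '{"key_file": "/home/airflow/gcs/data/sftp-ssh-key", "conn_timeout": "30", "look_for_keys": "false"}'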
1.2 Airflow connection with password authentication
If the host machine is configured to allow password authentication, these are the changes to be made in the Airflow connection.
The Extra parameter can be empty.
The Password parameter is user_a's password on the host machine.
The task logs show that the password authentication was successful.
INFO - Authentication (password) successful!
Upload the DAG to the Composer environment and trigger the DAG. I was facing a key validation issue with the latest version of the paramiko library (2.9.2). I tried downgrading paramiko, but the older versions do not seem to support OPENSSH keys. I found an alternative, paramiko-ng, in which the validation issue has been fixed, and changed the Python dependency from paramiko to paramiko-ng in the PythonVirtualenvOperator.
from airflow import DAG
from airflow.operators.python_operator import PythonVirtualenvOperator
from airflow.operators.dummy_operator import DummyOperator
from datetime import datetime, timedelta

def make_sftp():
    import paramiko
    from airflow.contrib.hooks.ssh_hook import SSHHook

    ssh_hook = SSHHook(ssh_conn_id="sftp_connection")
    sftp_client = ssh_hook.get_conn().open_sftp()
    print("=================SFTP Connection Successful=================")

    remote_host = "/home/sftp-folder/sample_sftp_file"      # file path on the host system
    local_host = "/home/airflow/gcs/data/sample_sftp_file"  # file path on the client system
    sftp_client.get(remote_host, local_host)                # GET operation to copy the file from host to client
    sftp_client.close()
    return 0

etl_dag = DAG("sftp_dag",
              start_date=datetime.now(),
              schedule_interval=None,
              default_args={
                  "owner": "airflow",
                  "depends_on_past": False,
                  "email_on_failure": False,
                  "email_on_retry": False,
                  "retries": 5,
                  "retry_delay": timedelta(minutes=5)})

sftp = PythonVirtualenvOperator(task_id="sftp",
                                python_callable=make_sftp,
                                requirements=["sshtunnel", "paramiko-ng", "pysftp"],
                                dag=etl_dag)

start_pipeline = DummyOperator(task_id="start_pipeline", dag=etl_dag)

start_pipeline >> sftp
Results
The sample_sftp_file has been copied from the host system to the specified Composer bucket.

pyodbc - [unixODBC][Driver Manager]Data source name not found, and no default driver specified

I am setting up a system to connect to an AWS Redshift database from Python. I think there's something wrong in the Python script, because I can connect via isql. I've installed all the relevant packages, and I am able to connect via isql as follows:
$ isql rndredshift readonly ***** -v
+---------------------------------------+
| Connected! |
| |
| sql-statement |
| help [tablename] |
| quit |
| |
+---------------------------------------+
SQL> quit
However, my python script is failing to connect. Here's the script:
import pyodbc
import sys

def main():
    redshift_conn_str = assemble_connection_string(
        Driver='{PostgreSQL}',
        Server='10.191.4.97',
        ServerName='rndredshift',
        Port='5439',
        Database='prod',
        Uid='readonly',
        Pwd='*******'
    )
    print("===========")
    print(redshift_conn_str)
    print("===========")
    new_conn2 = pyodbc.connect(redshift_conn_str)
    print(psql.read_sql('select top 10 * from rawdb.raw_imprequest_20150101', new_conn2))

def assemble_connection_string(**kwargs):
    return ';'.join([k + '=' + v for (k, v) in kwargs.items()])

if __name__ == '__main__':
    sys.exit(main())
Here's the output:
===========
Uid=readonly;Database=prod;ServerName=rndredshift;Driver={PostgreSQL}; Server=10.191.4.97;Pwd=********;Port=5439
===========
Traceback (most recent call last):
File "test_redshift.py", line 24, in <module>
sys.exit(main())
File "test_redshift.py", line 17, in main
new_conn2 = pyodbc.connect(redshift_conn_str)
pyodbc.Error: ('IM002', '[IM002] [unixODBC][Driver Manager]Data source name not found, and no default driver specified (0) (SQLDriverConnectW)')
The PostgreSQL driver is installed:
$ odbcinst -q -d
[PostgreSQL]
[MySQL]
And the data source is configured:
$ odbcinst -q -s
[rndredshift]
If you're using DSNs, you're going to need to specify that in your connection string. Also, if you want to use DSN-less connections, I believe the keyword is SERVER and not SERVERNAME.
Try this connection string?
Uid=readonly;Database=prod;DSN=rndredshift;Driver={PostgreSQL};Pwd=********;
Make sure you specify the full server name and port in odbc.ini as well. Also, since you're using PostgreSQL, any reason you're not using the native PostgreSQL driver?
https://wiki.postgresql.org/wiki/Psycopg
Good luck!
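Regarding the odbc.ini hint above, an illustrative DSN entry (a sketch using the values from the question, not verified against this setup) might look like:

[rndredshift]
Description = Redshift cluster via the PostgreSQL ODBC driver
Driver      = PostgreSQL
Servername  = 10.191.4.97
Database    = prod
Port        = 5439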
Also, I've been perplexed over the ways to obtain and install the PostgreSQL driver. When I installed unixODBC, the odbcinst.ini file was created and contained an entry for the PostgreSQL driver that looked like this:
[PostgreSQL]
Description = ODBC for PostgreSQL
Driver = /usr/lib/psqlodbc.so
Setup = /usr/lib/libodbcpsqlS.so
Driver64 = /usr/lib64/psqlodbc.so
Setup64 = /usr/lib64/libodbcpsqlS.so
FileUsage = 1
However, the files for Driver and Driver64 were not on the system. So then I installed postgresql-odbc, which gave me the missing libraries. Is there a better way to do this? As I mentioned earlier, isql works fine, so I'm still thinking it's a Python issue.
I decided to try using the psycopg2 package, and I got a connection to work! Here's my script:
import sys
import psycopg2

def main():
    conn_string = "host='10.191.4.97' dbname='prod' user='readonly' password='****' port='5439'"
    print("===========")
    print(conn_string)
    print("===========")
    new_conn2 = psycopg2.connect(conn_string)
    print("Connected using psycopg2!")

if __name__ == '__main__':
    sys.exit(main())
So, while I'm happy that I can connect, the question still remains about pyodbc and the PostgreSQL connection string. Thoughts?
Here's the connection string:
Uid=readonly;Database=prod;ServerName=rndredshift;Driver={PostgreSQL}; Server=10.191.4.97;Pwd=********;Port=5439
Using DSN instead of ServerName didn't work.

Sqoop query to get SQL Server data into Cloudera Manager

The following command is not working:
sqoop import --connect 'jdbc:sqlserver://IP address;username=user;password=pswd;database=Master' --table [Person].[BusinessEntityContact] --target-dir /home/ubuntu/hdfs/dir
Reference: http://mapredit.blogspot.com/2011/10/sqoop-and-microsoft-sql-server.html (error screenshot: http://i.stack.imgur.com/W5mBB.png)
Your error log shows a SQLServerException that says "Connect timed out. Verify the connection properties." Please check whether you have access from where you run this command, and also that the MSSQL port "1433" is reachable. Then add the number of maps with "-m" to your command.
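For illustration (keeping the placeholder values from the question), the command with the port and an explicit mapper count might look like:

sqoop import \
  --connect 'jdbc:sqlserver://<IP address>:1433;username=user;password=pswd;database=Master' \
  --table [Person].[BusinessEntityContact] \
  --target-dir /home/ubuntu/hdfs/dir \
  -m 1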
Nirmale, can you try the following from your Unix box:
curl http://131.107.174.121:1433
If you get "Empty reply from server", it is OK; if you get an error like "couldn't connect to host", check with your SQL Server admin as to what port the SQL Server is listening on.
The best way to check is using the sqoop list-tables command, something like the following:
sqoop list-tables -connect 'jdbc:sqlserver://IP address;username=user;password=pswd;database=Master' -username --password
