Airflow-pandas-read-sql-query to dataframe
i am trying to connect to SQL server local to get data from a table and process the data using pandas operations but i m failing to figure out how to pass the select query results to a data frame
the below works to clear data in the table
``` sql_command = """ DELETE FROM [TestDB].[dbo].[PythonTestData] """
t3 = MsSqlOperator( task_id = 'run_test_proc',
mssql_conn_id = 'mssql_local',
sql = sql_command,
dag = dag,
database = 'TestDB',
autocommit = True) ```
the intended pandas is
query = 'SELECT * FROM [ClientData] '#where product_name='''+i+''''''
df = pd.read_sql(query, conn)
pn_list = df['ClientID'].tolist()
#print("The original pn_list is : " + str(pn_list))
for i in pn_list:
varw= str(i)
queryw = 'SELECT * FROM [ClientData] where [ClientID]='''+varw+''
dfw = pd.read_sql(queryw, conn)
dfw = dfw.applymap(str)
cols=['product_id','product_name','brand_id']
x=dfw.values.tolist()
x=x[0]
ClientID=x[0]
Name=x[1]
Org=x[2]
Email=x[3]
#print('Name :'+Name+' ,'+'Org :'+Org+' ,'+'Email :'+Email+' ,'+'ClientID :'+ClientID)
salesData_qry= 'SELECT * FROM [TestDB].[dbo].[SalesData] where [ClientID]='''+ClientID+''
salesData_df= pd.read_sql(salesData_qry, conn)
salesData_df['year1'] = salesData_df['Order Date'].dt.strftime('%Y')
salesData_df['OrderMonth'] = salesData_df['Order Date'].dt.strftime('%b')
filename ='Daily_Campaign_Report_'+Name+'_'+Org+'_'+datetime.now().strftime("%Y%m%d_%H%M%S")
p = Path('C:/Users/user/Documents/WorkingData/')
salesData_df.to_csv(Path(p, filename + '.csv'))```
Please point me to correct approach as i m new to airflow
I'm not so clear on how you generate the query code but in order to get dataframe from MsSQL you need to use MsSqlHook:
from airflow.providers.microsoft.mssql.hooks.mssql import MsSqlHook
def mssql_func(**kwargs):
hook = MsSqlHook(conn_id='mssql_local')
df = hook.get_pandas_df(sql="YOUR_QUERY")
#do whatever you need on the df
run_this = PythonOperator(
task_id='mssql_task',
python_callable=mssql_func,
dag=dag
)
this is the code i am using for the dag
def mssql_func(**kwargs):
conn = MsSqlHook.get_connection(conn_id="mssql_local")
hook = conn.get_hook()
df = hook.get_pandas_df(sql="SELECT * FROM [TestDB].[dbo].[ClientData]")
#do whatever you need on the df
print(df)
run_this = PythonOperator(
task_id='mssql_task',
python_callable=mssql_func,
dag=dag
)
Error Log
[2021-01-12 16:07:15,114] {providers_manager.py:159} WARNING - The provider for package 'apache-airflow-providers-imap' could not be registered from because providers for that package name have already been registered
[2021-01-12 16:07:15,618] {base.py:65} INFO - Using connection to: id: mssql_local. Host: localhost, Port: 1433, Schema: dbo, Login: sa, Password: XXXXXXXX, extra: None
[2021-01-12 16:07:15,626] {taskinstance.py:1396} ERROR - (18456, b"Login failed for user 'sa'.DB-Lib error message 20018, severity 14:\nGeneral SQL Server error: Check messages from the SQL Server\nDB-Lib error message 20002, severity 9:\nAdaptive Server connection failed (localhost)\nDB-Lib error message 20002, severity 9:\nAdaptive Server connection failed (localhost)\n")
Traceback (most recent call last):
File "src/pymssql.pyx", line 636, in pymssql.connect
File "src/_mssql.pyx", line 1964, in _mssql.connect
File "src/_mssql.pyx", line 682, in _mssql.MSSQLConnection.__init__
File "src/_mssql.pyx", line 1690, in _mssql.maybe_raise_MSSQLDatabaseException
_mssql.MSSQLDatabaseException: (18456, b"Login failed for user 'sa'.DB-Lib error message 20018, severity 14:\nGeneral SQL Server error: Check messages from the SQL Server\nDB-Lib error message 20002, severity 9:\nAdaptive Server connection failed (localhost)\nDB-Lib error message 20002, severity 9:\nAdaptive Server connection failed (localhost)\n")
Related
Here is my function
getSQL <- function(server="server name", database="database name", Uid="
user name", Pwd="password", Query){
conlink <- paste('driver={SQL Server};server=', server,';database=',database,';Uid=', Uid,
';Pwd=', Pwd,';Encrypt=True;TrustServerCertificate=False', sep="")
conn <- odbcDriverConnect(conlink)
dat <- sqlQuery(channel= conn, Query, stringsAsFactors = F)
odbcCloseAll()
return(dat)
}
When I call the function using
query.cut = "SELECT [measurename]
,[OrgType]
,[year_session]
,[Star]
,[cutvalue]
,[Date]
,[File]
FROM [database name].[dbo].[DST_Merged_Cutpoint]
ORDER BY [year_session] DESC
"
getSQL(Query=query.cut)
I get this error:
Error in sqlQuery(conn, Query, stringsAsFactors = F) :
first argument is not an open RODBC channel
In addition: Warning messages:
1: In odbcDriverConnect(conlink) :
[RODBC] ERROR: state 28000, code 18456, message [Microsoft][ODBC SQL Server Driver][SQL Server]Login failed for user ' insightm8'.
2: In odbcDriverConnect(conlink) :
[RODBC] ERROR: state 01S00, code 0, message [Microsoft][ODBC SQL Server Driver]Invalid connection string attribute
3: In odbcDriverConnect(conlink) :
Error in sqlQuery(conn, Query, stringsAsFactors = F) :
first argument is not an open RODBC channel
How can I fix these errors? Thanks in advance
Take care not to add spaces to UID:
Server]Login failed for user ' insightm8'.
Reproducing this on an SQL Server connection creates the same error.
Try using paste0 instead of paste :
conlink <- paste0('driver={SQL Server};server=', server,';database=',database,';Uid=', Uid,
';Pwd=', Pwd,';Encrypt=True;TrustServerCertificate=False', sep="")
In Spark I am unable to find how to connect to Kudu using SparkR. If I try the following in scala:
import org.apache.kudu.spark.kudu._
import org.apache.kudu.client._
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._
// Read kudu table and select data of August 2018
val df = spark.sqlContext.read.options(Map("kudu.master" -> "198.y.x.xyz:7051","kudu.table" -> "table_name")).kudu
df.createOrReplaceTempView("mytable")
it works perfectly. In SparkR I have been trying to the following:
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sc = sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "2g"), sparkPackages = "org.apache.kudu:kudu-spark2_2.11:1.8.0")
sqlContext <- sparkRSQL.init(sc)
df = read.jdbc(url="198.y.x.xyz:7051",
driver = "jdbc:kudu:sparkdb",
source="jdbc",
tableName = "table_name"
)
I get the following error:
Error in jdbc : java.lang.ClassNotFoundException: jdbc:kudu:sparkdb
Trying the following:
df = read.jdbc(url="jdbc:mysql://198.19.10.103:7051",
tableName = "banglalink_data_table_1"
)
gives:
Error: Error in jdbc : java.sql.SQLException: No suitable driver
I cannot find any help on how to load the correct driver. I think that using the sparkPackages option is correct as it gives no error. What am I doing wrong??
I've created a model in R, published into SQL Server table and validated the model by calling it into SQL Server.
However, I'm failing in an attempt to use the model for prediction over new data.
Here's my script:
DROP PROCEDURE IF EXISTS predict_risk_new_data;
GO
CREATE OR ALTER PROCEDURE predict_risk_new_data (#q nvarchar(max))
AS
BEGIN
DECLARE #model varchar(30) = 'risk_rxLogit'
DECLARE #rx_model varbinary(max) = (SELECT model FROM rx_models WHERE model_name = #model);
EXEC sp_execute_external_script
#language = N'R'
,#script = N'
require("RevoScaleR");
input_data = InputDataSet;
model <- unserialize(rx_model);
prediction <- rxPredict(model, input_data, writeModelVars = TRUE);
OutputDataSet <- cbind(predictions[1], predictions[2]);'
,#input_data_1 = #q
,#parallel = 1
,#params = N'#rx_model varbinary(max), #r_rowsPerRead int'
,#input_data_1_name = N'InputDataSet'
,#rx_model = #rx_model
,#r_rowsPerRead = 100
WITH result sets (("Risk_Pred" float, "ZIP" int));
END
GO;
/*
EXEC predict_risk 'SELECT TOP 100 [ZIP], [Week], [Age], [Risk] FROM dbo.Risk'
*/
Here's the error output:
Msg 39004, Level 16, State 20, Line 223 A 'R' script error occurred
during execution of 'sp_execute_external_script' with HRESULT
0x80004004. Msg 39019, Level 16, State 2, Line 223 An external script
error occurred: Error in unserialize(rx_model) : read error Calls:
source -> withVisible -> eval -> eval -> unserialize
Error in execution. Check the output for more information. Error in
eval(expr, envir, enclos) : Error in execution. Check the output
for more information. Calls: source -> withVisible -> eval -> eval ->
.Call Execution halted
New to R/ML in SQL Server, help would be aprreciated.
Thanks in advance.
When I did something like this I had to add as.raw to the model.
Try this
model <- unserialize(as.raw(rx_model));
I'm trying to get data from AWS SQL Server.
This code works fine from local PC, but it didn't work from shiny-server (ubuntu).
library(dbplyr)
library(dplyr)
library(DBI)
con <- dbConnect(odbc::odbc(),
driver = "FreeTDS",
server = "aws server",
database = "",
uid = "",
pwd = "")
tbl(con, "shops")
dbGetQuery(con,"SELECT *
FROM shops")
"R version 3.4.2 (2017-09-28)"
packageVersion("dbplyr")
[1] ‘1.2.1.9000’
packageVersion("dplyr")
[1] ‘0.7.4’
packageVersion("DBI")
[1] ‘0.7.15’
I have next error:
tbl(con, "shops")
Error: <SQL> 'SELECT *
FROM "shops" AS "zzz2"
WHERE (0 = 1)'
nanodbc/nanodbc.cpp:1587: 42000: [FreeTDS][SQL Server]Incorrect syntax near 'shops'.
But dbGetQuery(con,"SELECT * FROM shops") works fine.
Can you explain what's going wrong?
This is more likely because the FreeTDS driver does not return the class that dbplyr expects to see in order to use the MS SQL translation. The workaround is to take the result of class(con) and then add the following lines right after you connect, but before calling tbl(). Replace the [you class name] with the results of the class(con) call:
sql_translate_env.[your class name] <- dbplyr:::`sql_translate_env.Microsoft SQL Server`
sql_select.[your class name]<- dbplyr:::`sql_select.Microsoft SQL Server`
I am trying to connect to hive using R JDBC library. My code looks like this:
library('DBI')
library('rJava')
library('RJDBC')
hadoop.class.path = list.files(path=c('/usr/hdp/hadoop/'), pattern='jar', full.names=T);
hadoop.lib.path = list.files(path=c('/usr/hdp/hadoop/lib/'), pattern='jar', full.names=T);
hive.class.path = list.files(path=c('/usr/hdp/hive/lib/'), pattern='jar', full.names=T);
mapred.class.path = list.files(path=c('/usr/hdp/hadoop-mapreduce'), pattern='jar', full.names=T);
cp = c(hadoop.class.path, hadoop.lib.path, hive.class.path, mapred.class.path, '/usr/hdp/hadoop-mapreduce/hadoop-mapreduce-client-core.jar')
.jinit(classpath=cp)
drv <- JDBC('org.apache.hive.jdbc.HiveDriver', '/usr/hdp/hive/lib/hive-jdbc.jar')
con <- dbConnect(drv, 'jdbc:hive2://my.cluster.net:10000/default;principal=hive/my.cluster.net#domain.com', 'hive', 'hive')
But when I run, I get the following error:
java.lang.NoClassDefFoundError: Could not initialize class org.apache.hadoop.security.SecurityUtil
However, I checked my /usr/hdp/hadoop/hadoop-commons.jar and found that the class org.apache.hadoop.security.SecurityUtil is there. So what else could be causing this error?