Read Kudu from SparkR - r

In Spark I am unable to find how to connect to Kudu using SparkR. If I try the following in scala:
import org.apache.kudu.spark.kudu._
import org.apache.kudu.client._
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._
// Read kudu table and select data of August 2018
val df = spark.sqlContext.read.options(Map("kudu.master" -> "198.y.x.xyz:7051","kudu.table" -> "table_name")).kudu
df.createOrReplaceTempView("mytable")
it works perfectly. In SparkR I have been trying to the following:
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sc = sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "2g"), sparkPackages = "org.apache.kudu:kudu-spark2_2.11:1.8.0")
sqlContext <- sparkRSQL.init(sc)
df = read.jdbc(url="198.y.x.xyz:7051",
driver = "jdbc:kudu:sparkdb",
source="jdbc",
tableName = "table_name"
)
I get the following error:
Error in jdbc : java.lang.ClassNotFoundException: jdbc:kudu:sparkdb
Trying the following:
df = read.jdbc(url="jdbc:mysql://198.19.10.103:7051",
tableName = "banglalink_data_table_1"
)
gives:
Error: Error in jdbc : java.sql.SQLException: No suitable driver
I cannot find any help on how to load the correct driver. I think that using the sparkPackages option is correct as it gives no error. What am I doing wrong??

Related

How can I reduce the error of SQL query in R?

Here is my function
getSQL <- function(server="server name", database="database name", Uid="
user name", Pwd="password", Query){
conlink <- paste('driver={SQL Server};server=', server,';database=',database,';Uid=', Uid,
';Pwd=', Pwd,';Encrypt=True;TrustServerCertificate=False', sep="")
conn <- odbcDriverConnect(conlink)
dat <- sqlQuery(channel= conn, Query, stringsAsFactors = F)
odbcCloseAll()
return(dat)
}
When I call the function using
query.cut = "SELECT [measurename]
,[OrgType]
,[year_session]
,[Star]
,[cutvalue]
,[Date]
,[File]
FROM [database name].[dbo].[DST_Merged_Cutpoint]
ORDER BY [year_session] DESC
"
getSQL(Query=query.cut)
I get this error:
Error in sqlQuery(conn, Query, stringsAsFactors = F) :
first argument is not an open RODBC channel
In addition: Warning messages:
1: In odbcDriverConnect(conlink) :
[RODBC] ERROR: state 28000, code 18456, message [Microsoft][ODBC SQL Server Driver][SQL Server]Login failed for user ' insightm8'.
2: In odbcDriverConnect(conlink) :
[RODBC] ERROR: state 01S00, code 0, message [Microsoft][ODBC SQL Server Driver]Invalid connection string attribute
3: In odbcDriverConnect(conlink) :
Error in sqlQuery(conn, Query, stringsAsFactors = F) :
first argument is not an open RODBC channel
How can I fix these errors? Thanks in advance
Take care not to add spaces to UID:
Server]Login failed for user ' insightm8'.
Reproducing this on an SQL Server connection creates the same error.
Try using paste0 instead of paste :
conlink <- paste0('driver={SQL Server};server=', server,';database=',database,';Uid=', Uid,
';Pwd=', Pwd,';Encrypt=True;TrustServerCertificate=False', sep="")

Airflow connect to sql server select results to a data frame

Airflow-pandas-read-sql-query to dataframe
i am trying to connect to SQL server local to get data from a table and process the data using pandas operations but i m failing to figure out how to pass the select query results to a data frame
the below works to clear data in the table
``` sql_command = """ DELETE FROM [TestDB].[dbo].[PythonTestData] """
t3 = MsSqlOperator( task_id = 'run_test_proc',
mssql_conn_id = 'mssql_local',
sql = sql_command,
dag = dag,
database = 'TestDB',
autocommit = True) ```
the intended pandas is
query = 'SELECT * FROM [ClientData] '#where product_name='''+i+''''''
df = pd.read_sql(query, conn)
pn_list = df['ClientID'].tolist()
#print("The original pn_list is : " + str(pn_list))
for i in pn_list:
varw= str(i)
queryw = 'SELECT * FROM [ClientData] where [ClientID]='''+varw+''
dfw = pd.read_sql(queryw, conn)
dfw = dfw.applymap(str)
cols=['product_id','product_name','brand_id']
x=dfw.values.tolist()
x=x[0]
ClientID=x[0]
Name=x[1]
Org=x[2]
Email=x[3]
#print('Name :'+Name+' ,'+'Org :'+Org+' ,'+'Email :'+Email+' ,'+'ClientID :'+ClientID)
salesData_qry= 'SELECT * FROM [TestDB].[dbo].[SalesData] where [ClientID]='''+ClientID+''
salesData_df= pd.read_sql(salesData_qry, conn)
salesData_df['year1'] = salesData_df['Order Date'].dt.strftime('%Y')
salesData_df['OrderMonth'] = salesData_df['Order Date'].dt.strftime('%b')
filename ='Daily_Campaign_Report_'+Name+'_'+Org+'_'+datetime.now().strftime("%Y%m%d_%H%M%S")
p = Path('C:/Users/user/Documents/WorkingData/')
salesData_df.to_csv(Path(p, filename + '.csv'))```
Please point me to correct approach as i m new to airflow
I'm not so clear on how you generate the query code but in order to get dataframe from MsSQL you need to use MsSqlHook:
from airflow.providers.microsoft.mssql.hooks.mssql import MsSqlHook
def mssql_func(**kwargs):
hook = MsSqlHook(conn_id='mssql_local')
df = hook.get_pandas_df(sql="YOUR_QUERY")
#do whatever you need on the df
run_this = PythonOperator(
task_id='mssql_task',
python_callable=mssql_func,
dag=dag
)
this is the code i am using for the dag
def mssql_func(**kwargs):
conn = MsSqlHook.get_connection(conn_id="mssql_local")
hook = conn.get_hook()
df = hook.get_pandas_df(sql="SELECT * FROM [TestDB].[dbo].[ClientData]")
#do whatever you need on the df
print(df)
run_this = PythonOperator(
task_id='mssql_task',
python_callable=mssql_func,
dag=dag
)
Error Log
[2021-01-12 16:07:15,114] {providers_manager.py:159} WARNING - The provider for package 'apache-airflow-providers-imap' could not be registered from because providers for that package name have already been registered
[2021-01-12 16:07:15,618] {base.py:65} INFO - Using connection to: id: mssql_local. Host: localhost, Port: 1433, Schema: dbo, Login: sa, Password: XXXXXXXX, extra: None
[2021-01-12 16:07:15,626] {taskinstance.py:1396} ERROR - (18456, b"Login failed for user 'sa'.DB-Lib error message 20018, severity 14:\nGeneral SQL Server error: Check messages from the SQL Server\nDB-Lib error message 20002, severity 9:\nAdaptive Server connection failed (localhost)\nDB-Lib error message 20002, severity 9:\nAdaptive Server connection failed (localhost)\n")
Traceback (most recent call last):
File "src/pymssql.pyx", line 636, in pymssql.connect
File "src/_mssql.pyx", line 1964, in _mssql.connect
File "src/_mssql.pyx", line 682, in _mssql.MSSQLConnection.__init__
File "src/_mssql.pyx", line 1690, in _mssql.maybe_raise_MSSQLDatabaseException
_mssql.MSSQLDatabaseException: (18456, b"Login failed for user 'sa'.DB-Lib error message 20018, severity 14:\nGeneral SQL Server error: Check messages from the SQL Server\nDB-Lib error message 20002, severity 9:\nAdaptive Server connection failed (localhost)\nDB-Lib error message 20002, severity 9:\nAdaptive Server connection failed (localhost)\n")

Can't get data with dbplyr from shiny-server

I'm trying to get data from AWS SQL Server.
This code works fine from local PC, but it didn't work from shiny-server (ubuntu).
library(dbplyr)
library(dplyr)
library(DBI)
con <- dbConnect(odbc::odbc(),
driver = "FreeTDS",
server = "aws server",
database = "",
uid = "",
pwd = "")
tbl(con, "shops")
dbGetQuery(con,"SELECT *
FROM shops")
"R version 3.4.2 (2017-09-28)"
packageVersion("dbplyr")
[1] ‘1.2.1.9000’
packageVersion("dplyr")
[1] ‘0.7.4’
packageVersion("DBI")
[1] ‘0.7.15’
I have next error:
tbl(con, "shops")
Error: <SQL> 'SELECT *
FROM "shops" AS "zzz2"
WHERE (0 = 1)'
nanodbc/nanodbc.cpp:1587: 42000: [FreeTDS][SQL Server]Incorrect syntax near 'shops'.
But dbGetQuery(con,"SELECT * FROM shops") works fine.
Can you explain what's going wrong?
This is more likely because the FreeTDS driver does not return the class that dbplyr expects to see in order to use the MS SQL translation. The workaround is to take the result of class(con) and then add the following lines right after you connect, but before calling tbl(). Replace the [you class name] with the results of the class(con) call:
sql_translate_env.[your class name] <- dbplyr:::`sql_translate_env.Microsoft SQL Server`
sql_select.[your class name]<- dbplyr:::`sql_select.Microsoft SQL Server`

Connecting to HIve JDBC through R Class Error

I am trying to connect to hive using R JDBC library. My code looks like this:
library('DBI')
library('rJava')
library('RJDBC')
hadoop.class.path = list.files(path=c('/usr/hdp/hadoop/'), pattern='jar', full.names=T);
hadoop.lib.path = list.files(path=c('/usr/hdp/hadoop/lib/'), pattern='jar', full.names=T);
hive.class.path = list.files(path=c('/usr/hdp/hive/lib/'), pattern='jar', full.names=T);
mapred.class.path = list.files(path=c('/usr/hdp/hadoop-mapreduce'), pattern='jar', full.names=T);
cp = c(hadoop.class.path, hadoop.lib.path, hive.class.path, mapred.class.path, '/usr/hdp/hadoop-mapreduce/hadoop-mapreduce-client-core.jar')
.jinit(classpath=cp)
drv <- JDBC('org.apache.hive.jdbc.HiveDriver', '/usr/hdp/hive/lib/hive-jdbc.jar')
con <- dbConnect(drv, 'jdbc:hive2://my.cluster.net:10000/default;principal=hive/my.cluster.net#domain.com', 'hive', 'hive')
But when I run, I get the following error:
java.lang.NoClassDefFoundError: Could not initialize class org.apache.hadoop.security.SecurityUtil
However, I checked my /usr/hdp/hadoop/hadoop-commons.jar and found that the class org.apache.hadoop.security.SecurityUtil is there. So what else could be causing this error?

RODBC connectivity to Oracle without tnsnames.ora

I am trying to connect to Oracle from R using RODBC without using tnsnanes.ora.
I have tried following strings, but none of them are working.
> con.text <- paste0("Driver={OracleODBC-11g};Dbq=//oracle.server:1527/database.pdw.prod;Uid=user;Pwd=pswd;")
> con.text <- paste0("Driver={OracleODBC-11g}; ",
"CONNECTSTRING=(DESCRIPTION=(ADDRESS= (PROTOCOL = TCP)(HOST = oracle.server)(PORT = 1527))(CONNECT_DATA=(SERVICE_NAME = database.pdw.prod))); uid=user;pwd=pswd;")
> con.text <- paste0("Driver=", "OracleODBC-11g"
, ";Server=", "oracle.server"
, ";Database=", "database.pdw.prod"
, ";Uid=", "user"
, ";Pwd=", "pwd", ";")
> con.text <- paste0("Driver=", "OracleODBC-11g"
, ";Server=", "oracle.server"
, ";CONNECTSTRING=" , "(DESCRIPTION=(ADDRESS= (PROTOCOL = TCP)(HOST = oracle.server)(PORT = 1527))(CONNECT_DATA=(SERVICE_NAME = database.pdw.prod)))"
, ";Database=", "database.pdw.prod"
, ";Uid=", "user"
, ";Pwd=", "pswd", ";")
> con1 <- odbcDriverConnect(connection = con.text)
But for all these strings I am getting following error:
Warning messages:
1: In odbcDriverConnect(connection = con.text) :
[RODBC] ERROR: state HY000, code 12162, message [unixODBC][Oracle][ODBC][Ora]ORA-12162: TNS:net service name is incorrectly specified
2: In odbcDriverConnect(connection = con.text) : ODBC connection failed
OR
1: In odbcDriverConnect(connection = con.text) :
[RODBC] ERROR: state IM002, code 0, message [unixODBC][Driver Manager]Data source name not found, and no default driver specified
The correct sysntaxis you are looking for is
Conex <- odbcDriverConnect("DRIVER=Oracle en OraClient11g_home2;UID=USERNAME;PWD=PASSWORD;DBQ=//HOSTNAME:PORT/ORACLE_SID;",
believeNRows = FALSE)
Ex
Conex <- odbcDriverConnect("DRIVER=Oracle en OraClient11g_home2;UID=John;PWD=Deere;DBQ=//fcoracleserver.youdomain:1521/TestEnvironment;",
believeNRows = FALSE)
The hard part is to find the name of the Driver, as you can see mine is on spanish.
What I did is I create first a ODBC Conection using the C:\Windows\System32\odbcad32.exe, there you can check the right name of your Oracle or SQL Server driver.
Once you create the conection, you can use
odbcDataSources() on R, to see that conection and to find out the driver. Thats really the hard part.
Hope it helps !

Resources