Connecting to Hive JDBC through R Class Error - r

I am trying to connect to Hive using the RJDBC library in R. My code looks like this:
library('DBI')
library('rJava')
library('RJDBC')
hadoop.class.path = list.files(path=c('/usr/hdp/hadoop/'), pattern='jar', full.names=T);
hadoop.lib.path = list.files(path=c('/usr/hdp/hadoop/lib/'), pattern='jar', full.names=T);
hive.class.path = list.files(path=c('/usr/hdp/hive/lib/'), pattern='jar', full.names=T);
mapred.class.path = list.files(path=c('/usr/hdp/hadoop-mapreduce'), pattern='jar', full.names=T);
cp = c(hadoop.class.path, hadoop.lib.path, hive.class.path, mapred.class.path, '/usr/hdp/hadoop-mapreduce/hadoop-mapreduce-client-core.jar')
.jinit(classpath=cp)
drv <- JDBC('org.apache.hive.jdbc.HiveDriver', '/usr/hdp/hive/lib/hive-jdbc.jar')
con <- dbConnect(drv, 'jdbc:hive2://my.cluster.net:10000/default;principal=hive/my.cluster.net#domain.com', 'hive', 'hive')
But when I run it, I get the following error:
java.lang.NoClassDefFoundError: Could not initialize class org.apache.hadoop.security.SecurityUtil
However, I checked my /usr/hdp/hadoop/hadoop-commons.jar and found that the class org.apache.hadoop.security.SecurityUtil is there. So what else could be causing this error?
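One observation that may help: "Could not initialize class" (as opposed to ClassNotFoundException) means the JVM found SecurityUtil but its static initializer failed on an earlier load attempt, so the usual culprit is a missing or conflicting dependency jar rather than the class itself. A hedged diagnostic sketch, assuming the cp vector from the code above, that tries to load the class directly and surface the underlying Java exception:
```r
# Untested sketch: load the class directly so the real cause of the
# static-initializer failure becomes visible, instead of the cached
# NoClassDefFoundError that later loads report.
library(rJava)
.jinit(classpath = cp)
res <- tryCatch(.jfindClass("org/apache/hadoop/security/SecurityUtil"),
                error = function(e) e)
print(res)                    # the R-side error message
ex <- .jgetEx(clear = TRUE)   # pending Java exception, if any
if (!is.null(ex)) ex$printStackTrace()
```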

Related

Airflow connect to sql server select results to a data frame

Airflow-pandas-read-sql-query to dataframe
I am trying to connect to a local SQL Server to get data from a table and process it using pandas operations, but I am failing to figure out how to pass the SELECT query results to a data frame.
The code below works to clear data in the table:
```
sql_command = """ DELETE FROM [TestDB].[dbo].[PythonTestData] """
t3 = MsSqlOperator(task_id='run_test_proc',
                   mssql_conn_id='mssql_local',
                   sql=sql_command,
                   dag=dag,
                   database='TestDB',
                   autocommit=True)
```
The intended pandas code is:
```
from datetime import datetime
from pathlib import Path

import pandas as pd

# conn: existing database connection (assumed defined elsewhere)
query = "SELECT * FROM [ClientData]"  # where product_name=... filter left commented out
df = pd.read_sql(query, conn)
pn_list = df['ClientID'].tolist()
# print("The original pn_list is : " + str(pn_list))
for i in pn_list:
    varw = str(i)
    queryw = "SELECT * FROM [ClientData] WHERE [ClientID]='" + varw + "'"
    dfw = pd.read_sql(queryw, conn)
    dfw = dfw.applymap(str)
    cols = ['product_id', 'product_name', 'brand_id']
    x = dfw.values.tolist()
    x = x[0]
    ClientID = x[0]
    Name = x[1]
    Org = x[2]
    Email = x[3]
    # print('Name :'+Name+' ,'+'Org :'+Org+' ,'+'Email :'+Email+' ,'+'ClientID :'+ClientID)
    salesData_qry = "SELECT * FROM [TestDB].[dbo].[SalesData] WHERE [ClientID]='" + ClientID + "'"
    salesData_df = pd.read_sql(salesData_qry, conn)
    salesData_df['year1'] = salesData_df['Order Date'].dt.strftime('%Y')
    salesData_df['OrderMonth'] = salesData_df['Order Date'].dt.strftime('%b')
    filename = 'Daily_Campaign_Report_' + Name + '_' + Org + '_' + datetime.now().strftime("%Y%m%d_%H%M%S")
    p = Path('C:/Users/user/Documents/WorkingData/')
    salesData_df.to_csv(Path(p, filename + '.csv'))
```
Please point me to the correct approach, as I am new to Airflow.
I'm not so clear on how you generate the query code, but in order to get a DataFrame from MsSQL you need to use MsSqlHook:
```
from airflow.operators.python import PythonOperator
from airflow.providers.microsoft.mssql.hooks.mssql import MsSqlHook

def mssql_func(**kwargs):
    # note: the hook's connection parameter is mssql_conn_id
    hook = MsSqlHook(mssql_conn_id='mssql_local')
    df = hook.get_pandas_df(sql="YOUR_QUERY")
    # do whatever you need on the df

run_this = PythonOperator(
    task_id='mssql_task',
    python_callable=mssql_func,
    dag=dag
)
```
This is the code I am using for the DAG:
```
def mssql_func(**kwargs):
    conn = MsSqlHook.get_connection(conn_id="mssql_local")
    hook = conn.get_hook()
    df = hook.get_pandas_df(sql="SELECT * FROM [TestDB].[dbo].[ClientData]")
    # do whatever you need on the df
    print(df)

run_this = PythonOperator(
    task_id='mssql_task',
    python_callable=mssql_func,
    dag=dag
)
```
Error Log
[2021-01-12 16:07:15,114] {providers_manager.py:159} WARNING - The provider for package 'apache-airflow-providers-imap' could not be registered from because providers for that package name have already been registered
[2021-01-12 16:07:15,618] {base.py:65} INFO - Using connection to: id: mssql_local. Host: localhost, Port: 1433, Schema: dbo, Login: sa, Password: XXXXXXXX, extra: None
[2021-01-12 16:07:15,626] {taskinstance.py:1396} ERROR - (18456, b"Login failed for user 'sa'.DB-Lib error message 20018, severity 14:\nGeneral SQL Server error: Check messages from the SQL Server\nDB-Lib error message 20002, severity 9:\nAdaptive Server connection failed (localhost)\nDB-Lib error message 20002, severity 9:\nAdaptive Server connection failed (localhost)\n")
Traceback (most recent call last):
File "src/pymssql.pyx", line 636, in pymssql.connect
File "src/_mssql.pyx", line 1964, in _mssql.connect
File "src/_mssql.pyx", line 682, in _mssql.MSSQLConnection.__init__
File "src/_mssql.pyx", line 1690, in _mssql.maybe_raise_MSSQLDatabaseException
_mssql.MSSQLDatabaseException: (18456, b"Login failed for user 'sa'.DB-Lib error message 20018, severity 14:\nGeneral SQL Server error: Check messages from the SQL Server\nDB-Lib error message 20002, severity 9:\nAdaptive Server connection failed (localhost)\nDB-Lib error message 20002, severity 9:\nAdaptive Server connection failed (localhost)\n")

Read Kudu from SparkR

In Spark I am unable to find how to connect to Kudu using SparkR. If I try the following in Scala:
import org.apache.kudu.spark.kudu._
import org.apache.kudu.client._
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._
// Read kudu table and select data of August 2018
val df = spark.sqlContext.read.options(Map("kudu.master" -> "198.y.x.xyz:7051","kudu.table" -> "table_name")).kudu
df.createOrReplaceTempView("mytable")
it works perfectly. In SparkR I have been trying the following:
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sc = sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "2g"), sparkPackages = "org.apache.kudu:kudu-spark2_2.11:1.8.0")
sqlContext <- sparkRSQL.init(sc)
df = read.jdbc(url="198.y.x.xyz:7051",
driver = "jdbc:kudu:sparkdb",
source="jdbc",
tableName = "table_name"
)
I get the following error:
Error in jdbc : java.lang.ClassNotFoundException: jdbc:kudu:sparkdb
Trying the following:
df = read.jdbc(url="jdbc:mysql://198.19.10.103:7051",
tableName = "banglalink_data_table_1"
)
gives:
Error: Error in jdbc : java.sql.SQLException: No suitable driver
I cannot find any help on how to load the correct driver. I think that using the sparkPackages option is correct, as it gives no error. What am I doing wrong?
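For what it's worth, a hedged sketch of an alternative: Kudu is not a JDBC source, so read.jdbc would need a JDBC driver class, not the strings above. The kudu-spark package exposes a DataFrame source, which SparkR can reach through the generic read.df; the source name and option keys below mirror the Scala snippet and are untested from R:
```r
# Untested sketch: read Kudu through the DataFrame source API rather than JDBC.
# kudu.master / kudu.table are the same options used in the Scala example above.
df <- read.df(source = "org.apache.kudu.spark.kudu",
              kudu.master = "198.y.x.xyz:7051",
              kudu.table = "table_name")
createOrReplaceTempView(df, "mytable")
head(df)
```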

Can't get data with dbplyr from shiny-server

I'm trying to get data from AWS SQL Server.
This code works fine from my local PC, but it doesn't work from shiny-server (Ubuntu).
library(dbplyr)
library(dplyr)
library(DBI)
con <- dbConnect(odbc::odbc(),
                 driver = "FreeTDS",
                 server = "aws server",
                 database = "",
                 uid = "",
                 pwd = "")
tbl(con, "shops")
dbGetQuery(con, "SELECT * FROM shops")
"R version 3.4.2 (2017-09-28)"
packageVersion("dbplyr")
[1] ‘1.2.1.9000’
packageVersion("dplyr")
[1] ‘0.7.4’
packageVersion("DBI")
[1] ‘0.7.15’
I get the following error:
tbl(con, "shops")
Error: <SQL> 'SELECT *
FROM "shops" AS "zzz2"
WHERE (0 = 1)'
nanodbc/nanodbc.cpp:1587: 42000: [FreeTDS][SQL Server]Incorrect syntax near 'shops'.
But dbGetQuery(con,"SELECT * FROM shops") works fine.
Can you explain what's going wrong?
This is most likely because the FreeTDS driver does not return the class that dbplyr expects to see in order to use the MS SQL translation. The workaround is to take the result of class(con) and then add the following lines right after you connect, but before calling tbl(). Replace [your class name] with the result of the class(con) call:
sql_translate_env.[your class name] <- dbplyr:::`sql_translate_env.Microsoft SQL Server`
sql_select.[your class name]<- dbplyr:::`sql_select.Microsoft SQL Server`
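For example, if class(con) came back as "FreeTDS" (an assumption; check your own output), the two lines would read:
```r
sql_translate_env.FreeTDS <- dbplyr:::`sql_translate_env.Microsoft SQL Server`
sql_select.FreeTDS <- dbplyr:::`sql_select.Microsoft SQL Server`
tbl(con, "shops")  # should now generate SQL Server flavoured SQL
```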

RODBC connectivity to Oracle without tnsnames.ora

I am trying to connect to Oracle from R using RODBC without using tnsnames.ora.
I have tried the following strings, but none of them work.
> con.text <- paste0("Driver={OracleODBC-11g};Dbq=//oracle.server:1527/database.pdw.prod;Uid=user;Pwd=pswd;")
> con.text <- paste0("Driver={OracleODBC-11g}; ",
"CONNECTSTRING=(DESCRIPTION=(ADDRESS= (PROTOCOL = TCP)(HOST = oracle.server)(PORT = 1527))(CONNECT_DATA=(SERVICE_NAME = database.pdw.prod))); uid=user;pwd=pswd;")
> con.text <- paste0("Driver=", "OracleODBC-11g"
, ";Server=", "oracle.server"
, ";Database=", "database.pdw.prod"
, ";Uid=", "user"
, ";Pwd=", "pwd", ";")
> con.text <- paste0("Driver=", "OracleODBC-11g"
, ";Server=", "oracle.server"
, ";CONNECTSTRING=" , "(DESCRIPTION=(ADDRESS= (PROTOCOL = TCP)(HOST = oracle.server)(PORT = 1527))(CONNECT_DATA=(SERVICE_NAME = database.pdw.prod)))"
, ";Database=", "database.pdw.prod"
, ";Uid=", "user"
, ";Pwd=", "pswd", ";")
> con1 <- odbcDriverConnect(connection = con.text)
But for all these strings I get the following error:
Warning messages:
1: In odbcDriverConnect(connection = con.text) :
[RODBC] ERROR: state HY000, code 12162, message [unixODBC][Oracle][ODBC][Ora]ORA-12162: TNS:net service name is incorrectly specified
2: In odbcDriverConnect(connection = con.text) : ODBC connection failed
OR
1: In odbcDriverConnect(connection = con.text) :
[RODBC] ERROR: state IM002, code 0, message [unixODBC][Driver Manager]Data source name not found, and no default driver specified
The correct syntax you are looking for is:
Conex <- odbcDriverConnect("DRIVER=Oracle en OraClient11g_home2;UID=USERNAME;PWD=PASSWORD;DBQ=//HOSTNAME:PORT/ORACLE_SID;",
believeNRows = FALSE)
Example:
Conex <- odbcDriverConnect("DRIVER=Oracle en OraClient11g_home2;UID=John;PWD=Deere;DBQ=//fcoracleserver.youdomain:1521/TestEnvironment;",
believeNRows = FALSE)
The hard part is finding the name of the driver; as you can see, mine is in Spanish.
What I did first was create an ODBC connection using C:\Windows\System32\odbcad32.exe; there you can check the right name of your Oracle or SQL Server driver.
Once you create the connection, you can use odbcDataSources() in R to see that connection and find out the driver. That's really the hard part.
Hope it helps!
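Applied to the values from the question, the string would look something like the sketch below (the driver name is kept as the question wrote it; yours may differ, which is exactly the hard part described above):
```r
library(RODBC)
# Driver name taken from the question; check odbcDataSources() for the
# exact name your ODBC install registers.
con.text <- paste0("DRIVER={OracleODBC-11g};",
                   "DBQ=//oracle.server:1527/database.pdw.prod;",
                   "UID=user;PWD=pswd;")
con1 <- odbcDriverConnect(connection = con.text, believeNRows = FALSE)
```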

I cannot connect to a PostgreSQL schema.table with the dplyr package

I'm trying to connect to Postgres with dplyr functions:
my_db <- src_postgres(dbname = 'mdb1252', user = "diego", password = "pass")
my_db
src: postgres 9.2.5 [postgres@localhost:5432/mdb1252]
tbls: alf, alturas, asociad, atenmed, base, bfa_boys_p_exp, bfa_boys_z_exp,
bfa_girls_p_exp, bfa_girls_z_exp, bres, c21200012012, c212000392011, c212000532011,
c21200062012, c212006222012, c212007352012, c212012112013, c212012242012,
c212012452012, c2222012242012, calles, cap, cap0110, casos_tbc_tr09, casos_tbctr09,
casosvadela, catpo, cbcvl, cie09, cie10, cie103d, cie103dantigua, cie10c, cie9a,
cie9mc, clasiarc, coalc, coddepto, codedades, codest, codlocaerbio, codprov, coheb,
cohec, cohep, cohiv, coho09_20110909_m, coign, combl, comet, comp, comport, conev,
conymad, copri, corci3cod, corci910, cores, corin, cotab, cutoi, cutto, def0307,......
but when I try to connect to a tbl:
my_tbl <- tbl(my_db, 'def0307')
Error in postgresqlExecStatement(conn, statement, ...) :
RS-DBI driver: (could not Retrieve the result : ERROR: no existe la relación «def0307»
LINE 1: SELECT * FROM "def0307" WHERE 0=1;
^
)
I think the problem is a schema issue, because the SQL should be:
SELECT * FROM mortalidad.def0307
I tried my_tbl <- tbl(my_db, 'mortalidad.def0307') and
my_tbl <- tbl(my_db, c('mortalidad','def0307')) without success.
I'm having a lot of fun working with dplyr; I come from SQL, but I'd like to resolve this while building my dplyr skills.
Thanks in advance.
dplyr finally has a solution to this problem thanks to version 0.7, recently announced by Hadley Wickham. The DBI and dbplyr libraries greatly simplify the connection between dplyr and PostgreSQL.
con <- DBI::dbConnect(RPostgreSQL::PostgreSQL(),
                      host = "database.rstudio.com",
                      user = "hadley",
                      password = rstudioapi::askForPassword("Database password"))
tbl <- dplyr::tbl(con, dbplyr::in_schema('mortalidad', 'def0307'))
You might want this,
db = src_postgres(dbname = 'mdb1252', user = "diego", password = "pass",
                  options = "-c search_path=mortalidad")
If anybody ends up here with the same problem, here is what works for me (taken from @Diego's comment from Feb 6 '14):
postgre_table <- function(src, schema, table) {
  paste('SELECT * FROM', paste(schema, table, sep = '.')) %>%
    sql() %>%
    tbl(src = src)
}
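Usage with the src and table names from the question would then look like:
```r
my_tbl <- postgre_table(my_db, 'mortalidad', 'def0307')
```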
