RImpala: Query Fails with Larger Data

check1 <- rimpala.query("select * from sum2")
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, :
java.sql.SQLException: Method not supported
dim(sum2) is 49501 rows and 18 columns.
check1 <- rimpala.query("select * from sum3")
dim(sum3) is 102 rows and 6 columns.
It worked with the smaller table.
Sorry that I can't provide a reproducible example for this. Has anyone encountered the same problem with a larger data size? Any ideas on how to solve this? Thanks.

As noted elsewhere on Stack Overflow, RImpala does not implement executeUpdate and so cannot run any query that modifies state. I suspect you hit your error not by running a larger SELECT query but because you tried to insert, update, or delete some data.
If you'd like to use Impala from R, I'd recommend using dplyrimpaladb.

The RImpala (v0.1.6) build has been updated with support for executing DDL queries using executeUpdate.
The latest build contains the following fixes / additions:
- Support for DDL query execution.
- A fetchSize parameter in the query function to specify the number of records retrieved in one round-trip read from Impala.
- A fix for queries failing when NULL values are returned.
- Compatibility with CDH 5.x.x.
You can run DDL queries using the query function as illustrated below:
rimpala.query(Q="drop table sample_table",isDDL="true")
You can also specify the fetchSize in the query function to aid reading large data efficiently.
rimpala.query(Q="select * from sample_table",fetchSize="10000")
Please find the latest build on CRAN: http://cran.r-project.org/web/packages/RImpala/index.html
Source Code : https://github.com/Mu-Sigma/RImpala

I had the same problem with the RImpala package and recommend using the RJDBC package instead:
library(RJDBC)
drv <- JDBC(driverClass = "org.apache.hive.jdbc.HiveDriver",
            classPath = list.files("path_to_jars", pattern = "jar$", full.names = TRUE),
            identifier.quote = "`")
conn <- dbConnect(drv, "jdbc:hive2://localhost:21050/;auth=noSasl")
check1 <- dbGetQuery(conn, "select * from sum3")
I used these jar files and everything works as expected:
https://downloads.cloudera.com/impala-jdbc/impala-jdbc-0.5-2.zip
For more information and a speed comparison look at this blog post:
http://datascience.la/r-and-impala-its-better-to-kiss-than-using-java/

Related

rquery: Connect to specific schema in Postgres DB

The rquery package has been out for some time now, but the documentation is still very sparse. There isn't even a tag for it on SO yet; this question will create it.
Maybe there is someone who can help me nevertheless.
I want to connect to a schema in my Postgres DB via rquery to read the data into R with all the speed it promises.
With the following code it works for all the tables in the public schema.
library(RPostgres)
library(rquery)
con <- dbConnect(RPostgres::Postgres(),
                 host = #####,
                 dbname = #####,
                 user = #####,
                 password = ######)
df <- db_td(con, "tablename") %.>%
  execute(con, .)
Now, when I want to access a table in a specific schema, db_td() has the argument qualifiers =, which is an
optional named ordered vector of strings carrying
additional db hierarchy terms, such as schema
So I did:
db_td(db, "tablename", qualifiers = c(schema = "schema"))
But:
Error in result_create(conn@ptr, statement) : Failed to prepare
query: ERROR: relation »tablename« does not exist LINE 1: SELECT
* FROM "tablename" LIMIT 1
So the qualifiers = argument seems to be completely ignored.
My question is thus pretty basic:
How can I connect to a schema in a Postgres DB via rquery?
All my attempts to solve this "within" rquery seem to fail miserably, but you can work around it by doing something like:
dbExecute(con, "SET search_path = foo_schema, public;")
before you run db_td.
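Putting it together, a minimal sketch of the workaround (schema and table names are placeholders):
library(rquery)
# make foo_schema the first schema searched for unqualified table names
DBI::dbExecute(con, "SET search_path = foo_schema, public;")
# db_td() now resolves the unqualified name against foo_schema
df <- db_td(con, "tablename") %.>%
  execute(con, .)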
I think it's caused by rq_colnames doing:
paste0("SELECT * FROM ", quote_identifier(db, table_name),
       " LIMIT 1")
and hence not doing anything with its qualifiers; at least this matches the error I get back.
Maybe report a bug/issue with rquery if this isn't enough.
I have created an issue on GitHub. So far the regular rquery release indeed doesn't have schema ability. The development version of rquery (1.3.4), however, has basic schema ability as of today.
To be installed via:
library(devtools)
install_github("WinVector/rquery", host = "https://api.github.com")
Here's a short guide. It seems to have been intended to work just as I was trying in my question.
Be careful though: rquery hasn't been fully tested in schema mode and some things might not work.
EDIT: rquery now has full schema support.
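With a schema-aware build, the qualifiers = call from the question is expected to work as written; a sketch under that assumption (schema name is a placeholder):
library(rquery)
# qualifiers = c(schema = ...) adds the schema term to the table reference
df <- db_td(con, "tablename", qualifiers = c(schema = "my_schema")) %.>%
  execute(con, .)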

Avoiding warning message “There is a result object still in use” when using dbSendQuery to create table on database

Background:
I use dbplyr and dplyr to extract data from a database, then I use the command dbSendQuery() to build my table.
Issue:
After the table is built, if I run another command I get the following warning:
Warning messages:
1: In new_result(connection@ptr, statement) : Cancelling previous query
2: In connection_release(conn@ptr) :
  There is a result object still in use.
  The connection will be automatically released when it is closed.
Question:
Because I don't have a result to fetch (I am sending a command to build a table), I'm not sure how to avoid this warning. At the moment I disconnect after building the table and the warning goes away. Is there anything I can do to avoid it?
Currently everything works, I just have this warning. I'd just like to avoid it as I assume I should be clearing something after I've built my table.
Code sample
# establish connection
con <- DBI::dbConnect(<connection stuff here>)
# connect to table and database
transactions <- tbl(con, in_schema("DATABASE_NAME", "TABLE_NAME"))
# build query string
query_string <- "SELECT * FROM some_table"
# drop current version of table
DBI::dbSendQuery(con, "DROP TABLE MY_DB.MY_TABLE")
# build new version of table
DBI::dbSendQuery(con, paste("CREATE TABLE MY_DB.MY_TABLE AS (", query_string, ") WITH DATA"))
Even though you're not retrieving anything with a SELECT clause, DBI still allocates a result set after every call to DBI::dbSendQuery().
Give DBI::dbClearResult() a try between the DBI::dbSendQuery() calls.
DBI::dbClearResult() does:
Clear A Result Set
Frees all resources (local and remote) associated with a result set. In some cases (e.g., very large result sets) this can be a critical step to avoid exhausting resources (memory, file descriptors, etc.).
The example of the man page should give a hint how the function should be called:
con <- dbConnect(RSQLite::SQLite(), ":memory:")
rs <- dbSendQuery(con, "SELECT 1")
print(dbFetch(rs))
dbClearResult(rs)
dbDisconnect(con)
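Applied to the code from the question, a minimal sketch would capture each result and clear it before the next statement:
res <- DBI::dbSendQuery(con, "DROP TABLE MY_DB.MY_TABLE")
DBI::dbClearResult(res)
res <- DBI::dbSendQuery(con, paste("CREATE TABLE MY_DB.MY_TABLE AS (", query_string, ") WITH DATA"))
DBI::dbClearResult(res)
For statements that return no rows, DBI::dbExecute() is a convenient alternative, since it sends the statement and frees the result in one call.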

Executing a stored Oracle procedure in R using ROracle

I'm having trouble executing/calling an Oracle procedure in R via ROracle. I've tried many different ways of calling the procedure and I keep getting the same errors.
I've had no problem running SELECT queries, but calling a procedure is proving difficult. I've used both the oracleProc and dbSendQuery functions, but to no avail; neither of them works. The ROracle documentation is pathetic when it comes to examples of calling procedures.
Let's say the Oracle procedure is called MYPROC in MYSCHEMA. The procedure is very simple, with NO parameters (it involves reading a few tables and writing to a table).
When I execute the procedure directly in Oracle Developer, there is no problem:
The following works in Oracle Developer (but not in R)
EXEC MYSCHEMA.MYPROC;
Then I try to call the same procedure from R (via ROracle) and it gives me an error. I've tried many different ways of calling the procedure and I get the same errors:
# This didn't work in R
> require(ROracle)
> LOAD_query <- oracleProc(con1, "BEGIN EXEC MYSCHEMA.MYPROC; END;")
This is the error I get:
Error in .oci.oracleProc(conn, statement, data = data, prefetch =
prefetch, :
# Then I tried the following and it still didn't work
> LOAD_query <- oracleProc(con1, "EXEC MYSCHEMA.MYPROC;")
This is the error I got (a bit different from the one above):
Error in .oci.oracleProc(conn, statement, data = data, prefetch =
prefetch, : ORA-00900: invalid SQL statement
# So then I tried dbSendQuery, which works perfectly fine with any SELECT statement, but it didn't work either
> LOAD_query <- dbSendQuery(con1, "BEGIN EXEC MYSCHEMA.MYPROC; END;")
This is the error I get (same as the first one):
Error in .oci.SendQuery(conn, statement, data = data, prefetch =
prefetch, :
# I even tried the following to exhaust all possibilities. Still no luck; I get the same error as above:
> LOAD_query <- oracleProc(con1, "BEGIN EXEC MYSCHEMA.MYPROC(); END;")
My procedure doesn't have any parameters. As I mentioned, it works just fine when called in Oracle Developer.
I've run out of ideas for getting such a ridiculously simple call to work in R! I am only interested in getting this to work via ROracle, though.
Did you create (compile) the procedure first? For example:
dbGetQuery(con, "CREATE PROCEDURE MYPROC ... ")
Then try to execute the procedure like this (note that EXEC is SQL*Plus / SQL Developer shorthand, not valid PL/SQL, so it cannot appear inside a BEGIN ... END; block):
oracleProc(con, "BEGIN MYPROC(); END;")
You're right that ROracle::oracleProc documentation is not good. This example helped me:
https://community.oracle.com/thread/4058424
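For completeness, a sketch of the full call against the question's setup, assuming con1 is an open ROracle connection:
library(ROracle)
# a PL/SQL anonymous block calls the procedure directly; no EXEC keyword
oracleProc(con1, "BEGIN MYSCHEMA.MYPROC(); END;")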

R: How to use RJDBC to download blob data from an Oracle database?

Does anyone know of a way to download blob data from an Oracle database using RJDBC package?
When I do something like this:
library(RJDBC)
drv <- JDBC(driverClass=..., classPath=...)
conn <- dbConnect(drv, ...)
blobdata <- dbGetQuery(conn, "select blobfield from blobtable where id=1")
I get this message:
Error in .jcall(rp, "I", "fetch", stride) :
java.sql.SQLException: Invalid column type.: getString not implemented for class oracle.jdbc.driver.T4CBlobAccessor
Well, the message is clear, but I still hope there is a way to download blobs. I read something about getBinary() as a way of getting blob information. Can I find a solution in that direction?
The problem is that RJDBC tries to convert every SQL data type it reads to either double or String in Java. Typically this trick works because the JDBC driver for Oracle has routines to convert different data types to String (accessed by the getString() method of the java.sql.ResultSet class). For BLOB columns, though, getString() is not supported; RJDBC still tries calling it, which results in an error.
I tried digging into the guts of RJDBC to see if I could get it to call the proper function for BLOB columns, and apparently the solution requires modifying the fetch S4 method in this package and also the result-grabbing Java class within the package. I'll try to get this patch to the package maintainers. Meanwhile, here is a quick and dirty fix using rJava (assuming conn and the query string as in your example):
library(rJava)
# the query from the question
q <- "select blobfield from blobtable where id = 1"
# create a plain JDBC statement through the connection's underlying Java object
s <- .jcall(conn@jc, "Ljava/sql/Statement;", "createStatement")
r <- .jcall(s, "Ljava/sql/ResultSet;", "executeQuery", q, check = FALSE)
listraws <- list()
col_num <- 1L  # position of the BLOB column in the result set
i <- 1
while (.jcall(r, "Z", "next")) {
  # getBytes() returns the BLOB as a Java byte[], which maps to an R raw vector
  listraws[[i]] <- .jcall(r, "[B", "getBytes", col_num)
  i <- i + 1
}
This retrieves a list of raw vectors in R. The next steps depend on the nature of the data; in my application these vectors represent PNG images and can be handled pretty much as file connections by the png package.
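For instance, if the fetched blobs are PNG files, each raw vector can be written straight to disk (the file naming is just for illustration):
for (j in seq_along(listraws)) {
  writeBin(listraws[[j]], paste0("blob_", j, ".png"))
}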
Done using R 3.1.3, RJDBC 0.2-5, Oracle 11.2, and the OJDBC driver for JDK >= 1.6.

sqlFetch Table not found error

After I use
cn <- odbcConnect(...)
to connect to MS SQL Server, I can successfully get data using:
tmp <- sqlQuery(cn, "select * from MyTable")
But if I use
tmp <- sqlFetch(cn,"MyTable")
R complains with "Error in odbcTableExists(channel, sqtable) : table not found on channel". Did I miss anything here?
Assuming you work on Windows: when you define your "dsn" in Control Panel > Administrative Tools > System and Security > Data Sources (ODBC), you have to select a default database as well. If you do that, your code should work as expected.
So the problem is not in your R code, but in your "dsn" configuration, which in my opinion does not contain the needed reference to a database.
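Alternatively, you can skip the DSN dialog and name the database directly in a DSN-less connection string; a sketch with placeholder server and database names:
library(RODBC)
# the database= entry gives odbcTableExists the context it needs to find "MyTable"
cn <- odbcDriverConnect(
  "driver={SQL Server};server=MYSERVER;database=MyDatabase;trusted_connection=yes")
tmp <- sqlFetch(cn, "MyTable")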
