Is there a package for object-relational mapping in R?

(By object-relational mapping, I mean what is described here: Wikipedia: Object-relational mapping.)
Here is how I could imagine this working in R: a kind of "virtual data frame" is linked to a database and returns the results of SQL queries when it is accessed. For instance, head(virtual_df) would actually return the result of select * from mapped_table limit 6 on the mapped database.
I have found this post by John Myles White, but there seems to have been no progress in the last 3 years.
Is there a working package that implements this?
If not:
Would it be useful?
What would be the best way to implement it (S4?)?

The very recent dplyr package implements this (amongst other amazing features).
Here are illustrations from the examples of the src_mysql() function:
# Connection basics ---------------------------------------------------------
# To connect to a database, first create a src:
my_db <- src_mysql(host = "blah.com", user = "hadley", password = "pass")

# Then reference a tbl within that src
my_tbl <- tbl(my_db, "my_table")

# Methods -------------------------------------------------------------------
batting <- tbl(lahman_mysql(), "Batting")
dim(batting)
colnames(batting)
head(batting)
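To illustrate the ORM-like behaviour: verbs applied to such a tbl are translated to SQL and only executed when you ask for the data. A minimal sketch, assuming the Lahman MySQL example database from the dplyr docs is reachable (playerID, yearID and HR are columns of the Batting table):
library(dplyr)

# verbs build up a query; nothing is fetched yet
recent <- batting %>%
  filter(yearID > 2005) %>%
  select(playerID, yearID, HR)

recent           # printing runs the query with a small LIMIT, like head()
collect(recent)  # pulls the full result into a local data frame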

There is an old unsupported package, SQLiteDF, that does that. Build it from source and ignore the numerous error messages.
> # from example(sqlite.data.frame)
>
> library(SQLiteDF)
> iris.sdf <- sqlite.data.frame(iris)
> iris.sdf$Petal.Length[1:10] # $ done via SQL
[1] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5

Looks like John Myles White has given up on it.
There is a bit of a workaround explained here.

I don't think it would be useful. R is not a real OOP language. The "central" data structure in R is the data frame, so there is no need for object-relational mapping here. What you want is a mapping between SQL tables and data frames, and the RMySQL and RODBC packages provide just that:
dbGetQuery to return the results of a query as a data frame, and dbWriteTable to insert data into a table or do a bulk update (from a data frame).
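A minimal sketch of that round trip with DBI and RMySQL; the connection details and table names here are placeholders, not from the original answer:
library(DBI)
library(RMySQL)

con <- dbConnect(MySQL(), host = "blah.com", user = "hadley",
                 password = "pass", dbname = "lahman")

# table -> data frame
batting <- dbGetQuery(con, "SELECT * FROM Batting WHERE yearID > 2005")

# data frame -> table (bulk insert)
dbWriteTable(con, "batting_recent", batting, overwrite = TRUE)

dbDisconnect(con)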

As an experienced R user, I would not use this. First off, this 'virtual data frame' would be slow to use, since you would constantly need to synchronize between R memory and the database. It would also require locking the database table, since otherwise you would get unpredictable results from other edits happening at the same time.
Finally, I do not think R is suited for implementing a different evaluation of promise objects. Doing myFrame$foo[ myFrame$foo > 40 ] would still fetch the full foo column, since you cannot possibly implement a full translation scheme from R to SQL.
Therefore, I prefer to load a data frame from a query, use it, and write it back to the database if required.

Next to the various driver packages for querying databases (DBI, RODBC, RJDBC, RMySQL, ...) and dplyr, there is also sqldf: https://cran.r-project.org/web/packages/sqldf/
It automatically imports data frames into a database and lets you query the data via SQL. At the end the database is deleted.
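For example, a minimal sketch (by default sqldf copies the data frame into a temporary SQLite database, runs the query, and then drops the database):
library(sqldf)

sqldf("SELECT Species, COUNT(*) AS n FROM iris GROUP BY Species")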

The closest match is probably dbplyr.
In R you work with tables, not rows.
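A minimal sketch of the dbplyr approach, using an in-memory SQLite database so it is self-contained (the table and the summary below are just an illustration):
library(dplyr)
library(dbplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "mtcars", mtcars)

cars_db <- tbl(con, "mtcars")
query <- cars_db %>%
  filter(cyl == 4) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE))

show_query(query)  # inspect the SQL that will be sent
collect(query)     # execute it and bring the result back as a data frame

DBI::dbDisconnect(con)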

Related

How to use "Spark Connection" with databricks to extract data from SQL database and then convert to R dataframe?

I am trying to extract data from an Azure SQL database into an R notebook on Azure Databricks to run an R script on it. Because of the difficulties I have experienced using jdbc and DBI (detailed here: "No suitable driver" error when using sparklyr::spark_read_jdbc to query Azure SQL database from Azure Databricks and here: How to get SQL database connection compatible with DBI::dbGetQuery function when converting between R script and databricks R notebook?) I have decided to use the inbuilt spark connector as such (credentials changed for security reasons):
%scala
// Connect to database:
import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._

// Acquire a DataFrame collection (val collection)
val dave_connection = Config(Map(
  "url"          -> "servername.database.windows.net",
  "databaseName" -> "databasename",
  "dbTable"      -> "myschema.mytable",
  "user"         -> "username",
  "password"     -> "userpassword"
))
val collection = sqlContext.read.sqlDB(dave_connection)
collection.show()
This works in the sense that it displays the data, but as someone who doesn't know the first thing about scala or spark I now have no idea how to get it into an R or R-compatible dataframe.
I have tried to see what kind of object "collection" is, but:
%scala
getClass(collection)
returns only:
notebook:1: error: too many arguments for method getClass: ()Class[_ <: $iw]
getClass(collection)
And trying to access it using sparklyr implies that it doesn't actually exist, e.g.
library(sparklyr)
sc <- spark_connect(method="databricks")
sdf_schema(collection)
returns:
Error in spark_dataframe(x) : object 'collection' not found
I feel like this may well be pretty obvious to anyone who understands Scala, but I don't (I come from an analyst rather than a computer science background), and I just want to get this data into an R dataframe to perform analyses on it. (I know Databricks is all about parallelisation and scaling, but I'm not performing any parallelised functions on this dataset; the only reason I'm using Databricks is because my work PC doesn't have sufficient memory to run my analyses locally!)
So, does anyone have any ideas on how I can convert this spark object "collection" into an R dataframe?
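One possible approach (not part of the original thread, and the view name below is just an assumption) is to register the Scala DataFrame as a temporary view and then read that view from R via sparklyr:
# In the Scala cell, after collection is created (hypothetical view name):
#   collection.createOrReplaceTempView("collection_view")

library(sparklyr)
library(dplyr)

sc <- spark_connect(method = "databricks")

# reference the temp view through sparklyr and pull it into a local R data frame
r_df <- tbl(sc, "collection_view") %>% collect()
head(r_df)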

How to use custom SQL function in dbplyr?

I would like to calculate the Jaro-Winkler string distance in a database. If I bring the data into R (with collect) I can easily use the stringdist function from the stringdist package.
But my data is very large and I'd like to filter on Jaro-Winkler distances before pulling the data into R.
There is SQL code for Jaro-Winkler (https://androidaddicted.wordpress.com/2010/06/01/jaro-winkler-sql-code/ and a version for T-SQL), but I'm not sure how best to get that SQL code to work with dbplyr. I'm happy to try to map the stringdist function to the Jaro-Winkler SQL code, but I don't know where to start on that. Even something simpler, like executing the SQL code directly from R on the remote data, would be great.
I had hoped that SQL translation in the dbplyr documentation might help, but I don't think so.
You can build your own SQL functions in R. They just have to produce a string that is a valid SQL query. I don't know the Jaro-Winkler distance, but I can provide an example for you to build from:
union_all <- function(table_a, table_b, list_of_columns) {
  # extract the database connection from one of the tbls
  connection <- table_a$src$con

  # assemble the SQL: (query for table_a) UNION ALL (query for table_b)
  sql_query <- build_sql(con = connection,
                         sql_render(table_a),
                         "\nUNION ALL\n",
                         sql_render(table_b))

  return(tbl(connection, sql(sql_query)))
}

unioned_table <- union_all(table_1, table_2, c("who", "where", "when"))
Two key commands here are:
sql_render, which takes a dbplyr table and returns the SQL code that produces it
build_sql, which assembles a query from strings.
You have choices for your execution command:
tbl(connection, sql(sql_query)) will return the resulting table
dbExecute(db_connection, as.character(sql_query)) will execute a query without returning the result (useful for dropping tables, creating indexes, etc.)
Alternatively, if you define the function in SQL as a user-defined function, you can then simply use the name of that function as if it were an R function (in a dbplyr query). When R can't find the function locally, it simply passes it to the SQL back-end and assumes it will be a function that is available in SQL-land.
This is a great way to decouple the logic. The downside is that the dbplyr expression is now dependent on the database back-end; you can't run the same code on a local data set. One way around that is to create a UDF that mimics an existing R function: dplyr will then use the local R function and dbplyr will use the SQL UDF.
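A sketch of that pass-through behaviour, assuming a hypothetical SQL UDF named jaro_winkler has already been created on the database side, and a remote tbl remote_tbl with columns name_a and name_b (all of these names are assumptions):
library(dplyr)

# jaro_winkler() is unknown to R, so dbplyr passes it through verbatim to the
# database, where a UDF of that name is assumed to exist
remote_tbl %>%
  mutate(dist = jaro_winkler(name_a, name_b)) %>%
  filter(dist > 0.9) %>%
  collect()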
You can use sql() which runs whatever raw SQL you provide.
Example
Here the lubridate equivalent doesn't work on a database backend, so instead I place custom SQL code inside sql(), like so:
your_dbplyr_object %>%
  mutate(week = sql("EXTRACT(WEEK FROM meeting_date)"))

R: How to use RJDBC to download blob data from oracle database?

Does anyone know of a way to download blob data from an Oracle database using RJDBC package?
When I do something like this:
library(RJDBC)
drv <- JDBC(driverClass=..., classPath=...)
conn <- dbConnect(drv, ...)
blobdata <- dbGetQuery(conn, "select blobfield from blobtable where id=1")
I get this message:
Error in .jcall(rp, "I", "fetch", stride) :
java.sql.SQLException: Ongeldig kolomtype.: getString not implemented for class oracle.jdbc.driver.T4CBlobAccessor
Well, the message is clear ("Ongeldig kolomtype" is Dutch for "invalid column type"), but I still hope there is a way to download blobs. I read something about getBinary() as a way of getting blob information. Can I find a solution in that direction?
The problem is that RJDBC tries to convert every SQL data type it reads to either double or String in Java. Typically the trick works because the JDBC driver for Oracle has routines to convert different data types to String (accessed by the getString() method of the java.sql.ResultSet class). For BLOBs, though, the getString() method has been discontinued at some point. RJDBC still tries calling it, which results in an error.
I tried digging into the guts of RJDBC to see if I could get it to call the proper function for BLOB columns; apparently the solution requires modifying the fetch S4 method in this package and also the result-grabbing Java class within the package. I'll try to get this patch to the package maintainers. Meanwhile, a quick and dirty fix using rJava (assuming conn and q as in your example):
s <- .jcall(conn@jc, "Ljava/sql/Statement;", "createStatement")
r <- .jcall(s, "Ljava/sql/ResultSet;", "executeQuery", q, check = FALSE)
listraws <- list()
col_num <- 1L
i <- 1
while (.jcall(r, "Z", "next")) {
  listraws[[i]] <- .jcall(r, "[B", "getBytes", col_num)
  i <- i + 1
}
This retrieves a list of raw vectors in R. The next steps depend on the nature of the data; in my application these vectors represent PNG images and can be handled pretty much like file connections by the png package.
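For example, a follow-up sketch under the assumption that the BLOBs are PNG payloads (the file names are arbitrary):
# write each BLOB to disk
for (j in seq_along(listraws)) {
  writeBin(listraws[[j]], sprintf("blob_%03d.png", j))
}

# or decode one in memory with the png package
# img <- png::readPNG(listraws[[1]])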
Done using R 3.1.3, RJDBC 0.2-5, Oracle 11-2 and OJDBC driver for JDK >= 1.6

RImpala: Query Failed When Larger Data

check1<-rimpala.query("select * from sum2")
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, :
java.sql.SQLException: Method not supported
dim(sum2) is 49501 rows and 18 columns.
check1 <- rimpala.query("select * from sum3")
dim(sum3) is 102 rows and 6 columns.
It worked with a smaller sample size.
Sorry that I can't provide a reproducible example for this. Has anyone encountered the same problem with a larger data size? Any ideas on how to solve this? Thanks.
As noted elsewhere on StackOverflow, RImpala does not implement executeUpdate and so cannot run any query that modifies state. I suspect you hit your error not by running a larger SELECT query but rather because you tried to insert, update, or delete some data.
If you'd like to use Impala from R, I'd recommend using dplyrimpaladb.
The RImpala (v0.1.6) build has been updated with support for executing DDL queries via executeUpdate.
The latest build contains the following fixes / additions:
Support for DDL query execution.
A fetchSize parameter in the query function to set the number of records retrieved in one round-trip read from Impala.
Fix for queries failing when NULL values are returned.
Compatibility with CDH 5.x.x
You can run DDL queries using the query function as illustrated below:
rimpala.query(Q="drop table sample_table",isDDL="true")
You can also specify the fetchSize in the query function to aid reading large data efficiently.
rimpala.query(Q="select * from sample_table",fetchSize="10000")
Please find the latest build on CRAN: http://cran.r-project.org/web/packages/RImpala/index.html
Source Code : https://github.com/Mu-Sigma/RImpala
I have the same problem with the RImpala package and recommend using the RJDBC package instead:
library(RJDBC)
drv <- JDBC(driverClass = "org.apache.hive.jdbc.HiveDriver",
            classPath = list.files("path_to_jars", pattern = "jar$", full.names = TRUE),
            identifier.quote = "`")
conn <- dbConnect(drv, "jdbc:hive2://localhost:21050/;auth=noSasl")
check1 <- dbGetQuery(conn, "select * from sum3")
I used these jar files and everything works as expected:
https://downloads.cloudera.com/impala-jdbc/impala-jdbc-0.5-2.zip
For more information and a speed comparison look at this blog post:
http://datascience.la/r-and-impala-its-better-to-kiss-than-using-java/

How to read data from Cassandra with R?

I am using R 2.14.1 and Cassandra 1.2.11. I have a separate program which has written data to a single Cassandra table, but I am failing to read it back from R.
The Cassandra schema is defined like this:
create table chosen_samples (id bigint , temperature double, primary key(id))
I have first tried the RCassandra package (http://www.rforge.net/RCassandra/)
> # install.packages("RCassandra")
> library(RCassandra)
> rc <- RC.connect(host ="192.168.33.10", port = 9160L)
> RC.use(rc, "poc1_samples")
> cs <- RC.read.table(rc, c.family="chosen_samples")
The connection seems to succeed but the parsing of the table into data frame fails:
> cs
Error in data.frame(..dfd. = c("#\"ffffff", "#(<cc><cc><cc><cc><cc><cd>", :
duplicate row.names:
I have also tried using JDBC connector, as described here: http://www.datastax.com/dev/blog/big-analytics-with-r-cassandra-and-hive
> # install.packages("RJDBC")
> library(RJDBC)
> cassdrv <- JDBC("org.apache.cassandra.cql.jdbc.CassandraDriver", "/Users/svend/dev/libs/cassandra-jdbc-1.2.5.jar", "`")
But this one fails like this:
Error in .jfindClass(as.character(driverClass)[1]) : class not found
Even though the location to the java driver is correct
$ ls /Users/svend/dev/libs/cassandra-jdbc-1.2.5.jar
/Users/svend/dev/libs/cassandra-jdbc-1.2.5.jar
You have to download apache-cassandra-2.0.10-bin.tar.gz, cassandra-jdbc-1.2.5.jar and cassandra-all-1.1.0.jar.
There is no need to install Cassandra on your local machine; just put the cassandra-jdbc-1.2.5.jar and cassandra-all-1.1.0.jar files in the lib directory of the unzipped apache-cassandra-2.0.10-bin.tar.gz. Then you can use
library(RJDBC)
drv <- JDBC("org.apache.cassandra.cql.jdbc.CassandraDriver",
            list.files("D:/apache-cassandra-2.0.10/lib",
                       pattern = "jar$", full.names = TRUE))
That works on my Unix machine but not on my Windows machine.
Hope that helps.
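Connecting and querying with the driver set up above would then look roughly like this (a sketch; the JDBC URL format for the old cassandra-jdbc driver is an assumption and may differ between driver versions):
conn <- dbConnect(drv, "jdbc:cassandra://192.168.33.10:9160/poc1_samples")
cs <- dbGetQuery(conn, "select id, temperature from chosen_samples limit 10")
dbDisconnect(conn)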
This question is old now, but since it's one of the top hits for R and Cassandra I thought I'd leave a simple solution here, as I found frustratingly little up-to-date support for what I thought would be a fairly common task.
Sparklyr makes this pretty easy to do from scratch now, as it exposes a Java context so the Spark-Cassandra-Connector can be used directly. I've wrapped the bindings up in this simple package, crassy, but it's not necessary to use it.
I mostly made it to demystify the config around how to make sparklyr load the connector, and because the syntax for selecting a subset of columns is a little unwieldy (assuming no Scala knowledge).
Column selection and partition filtering are supported. These were the only features I thought were necessary for general Cassandra use cases, given CQL can't be submitted directly to the cluster.
I've not found a solution to submitting more general CQL queries which doesn't involve writing custom scala, however there's an example of how this can work here.
Right, I found an (admittedly ugly) way, simply by calling Python from R, parsing the NAs manually and re-assigning the data frame's names in R, like this:
# install.packages("rPython")
# (don't forget to "pip install cql")
library(rPython)

python.exec("import sys")

# adding libraries from the virtualenv
python.exec("sys.path.append('/Users/svend/dev/pyVe/playground/lib/python2.7/site-packages/')")

python.exec("import cql")
python.exec("connection = cql.connect('192.168.33.10', cql_version='3.0.0')")
python.exec("cursor = connection.cursor()")
python.exec("cursor.execute('use poc1_samples')")
python.exec("cursor.execute('select * from chosen_samples')")

# coding Python None into NA (rPython seems to just return nothing)
python.exec("rep = lambda x: '__NA__' if x is None else x")
python.exec("def getData(): return [rep(num) for line in cursor for num in line]")
data <- python.call("getData")

df <- as.data.frame(matrix(unlist(data), ncol = 15, byrow = TRUE))
names(df) <- c("temperature", "maxTemp", "minTemp",
               "dewpoint", "elevation", "gust", "latitude", "longitude",
               "maxwindspeed", "precipitation", "seelevelpressure", "visibility",
               "windspeed")

# and decoding NAs
parsena <- function(x) if (x == "__NA__") NA else x
df <- as.data.frame(lapply(df, parsena))
Does anybody have a better idea?
I had the same error message when executing an Rscript with an RJDBC connection via a batch file (R 3.2.4, Teradata driver).
Also, when run in RStudio it worked fine on the second run but not the first.
What helped was to explicitly call:
library(rJava)
.jinit()
It is not enough to just download the driver; you also have to download the dependencies and put them into your Java classpath (macOS: /Library/Java/Extensions), as stated on the project main page.
Include the Cassandra JDBC dependencies in your classpath: download dependencies
As for the RCassandra package, right now it's still too primitive compared to RJDBC.
