How to read data from Cassandra with R?

I am using R 2.14.1 and Cassandra 1.2.11. A separate program has written data to a single Cassandra table, and I am failing to read it back from R.
The Cassandra schema is defined like this:
create table chosen_samples (id bigint , temperature double, primary key(id))
I have first tried the RCassandra package (http://www.rforge.net/RCassandra/)
> # install.packages("RCassandra")
> library(RCassandra)
> rc <- RC.connect(host ="192.168.33.10", port = 9160L)
> RC.use(rc, "poc1_samples")
> cs <- RC.read.table(rc, c.family="chosen_samples")
The connection seems to succeed, but the parsing of the table into a data frame fails:
> cs
Error in data.frame(..dfd. = c("#\"ffffff", "#(<cc><cc><cc><cc><cc><cd>", :
duplicate row.names:
I have also tried using JDBC connector, as described here: http://www.datastax.com/dev/blog/big-analytics-with-r-cassandra-and-hive
> # install.packages("RJDBC")
> library(RJDBC)
> cassdrv <- JDBC("org.apache.cassandra.cql.jdbc.CassandraDriver", "/Users/svend/dev/libs/cassandra-jdbc-1.2.5.jar", "`")
But this one fails like this:
Error in .jfindClass(as.character(driverClass)[1]) : class not found
The location of the java driver is correct, though:
$ ls /Users/svend/dev/libs/cassandra-jdbc-1.2.5.jar
/Users/svend/dev/libs/cassandra-jdbc-1.2.5.jar

You have to download apache-cassandra-2.0.10-bin.tar.gz, cassandra-jdbc-1.2.5.jar and cassandra-all-1.1.0.jar.
There is no need to install Cassandra on your local machine; just put the cassandra-jdbc-1.2.5.jar and cassandra-all-1.1.0.jar files in the lib directory of the unzipped apache-cassandra-2.0.10-bin.tar.gz. Then you can use
library(RJDBC)
drv <- JDBC("org.apache.cassandra.cql.jdbc.CassandraDriver",
            list.files("D:/apache-cassandra-2.0.10/lib",
                       pattern="jar$", full.names=TRUE))
This works on my Unix machine but not on my Windows machine.
Hope that helps.

This question is old now, but since it's one of the top hits for R and Cassandra, I thought I'd leave a simple solution here, as I found frustratingly little up-to-date support for what I thought would be a fairly common task.
Sparklyr makes this pretty easy to do from scratch now, as it exposes a java context so the Spark-Cassandra-Connector can be used directly. I've wrapped up the bindings in this simple package, crassy, but it's not necessary to use.
I mostly made it to demystify the config around how to make sparklyr load the connector, and because the syntax for selecting a subset of columns is a little unwieldy (assuming no Scala knowledge).
Column selection and partition filtering are supported. These were the only features I thought were necessary for general Cassandra use cases, given CQL can't be submitted directly to the cluster.
I've not found a solution for submitting more general CQL queries that doesn't involve writing custom Scala; however, there's an example of how this can work here.
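For reference, a minimal sketch of the direct route without crassy (the connector coordinates and the local master are assumptions; adjust them to your Spark and Scala versions):
library(sparklyr)
library(dplyr)
config <- spark_config()
# ask sparklyr to fetch the Spark-Cassandra-Connector package (version is an assumption)
config$sparklyr.defaultPackages <- "datastax:spark-cassandra-connector:2.0.0-s_2.11"
config[["spark.cassandra.connection.host"]] <- "192.168.33.10"
sc <- spark_connect(master = "local", config = config)
# read the Cassandra table from the original question into a Spark DataFrame
samples <- spark_read_source(
  sc,
  name = "chosen_samples",
  source = "org.apache.spark.sql.cassandra",
  options = list(keyspace = "poc1_samples", table = "chosen_samples")
)
samples %>% select(id, temperature) %>% collect()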

Right, I found an (admittedly ugly) way: simply calling Python from R, encoding the missing values manually, and re-assigning the data frame's names in R, like this:
# install.packages("rPython")
# (don't forget to "pip install cql")
library(rPython)
python.exec("import sys")
# adding libraries from virtualenv
python.exec("sys.path.append('/Users/svend/dev/pyVe/playground/lib/python2.7/site-packages/')")
python.exec("import cql")
python.exec("connection=cql.connect('192.168.33.10', cql_version='3.0.0')")
python.exec("cursor = connection.cursor()")
python.exec("cursor.execute('use poc1_samples')")
python.exec("cursor.execute('select * from chosen_samples' )")
# encode Python's None as "__NA__" (rPython seems to just drop None)
python.exec("rep = lambda x: '__NA__' if x is None else x")
python.exec("def getData(): return [rep(num) for line in cursor for num in line]")
data <- python.call("getData")
# ncol must match the number of columns the query returns (13 names below)
df <- as.data.frame(matrix(unlist(data), ncol=13, byrow=TRUE))
names(df) <- c("temperature", "maxTemp", "minTemp",
               "dewpoint", "elevation", "gust", "latitude", "longitude",
               "maxwindspeed", "precipitation", "seelevelpressure", "visibility", "windspeed")
# and decoding the NA's
parsena <- function(x) if (x == "__NA__") NA else x
df <- as.data.frame(lapply(df, parsena))
Does anybody have a better idea?

I had the same error message when executing an Rscript with an RJDBC connection via a batch file (R 3.2.4, Teradata driver).
Also, when run in RStudio it worked fine on the second run but not the first.
What helped was to explicitly call:
library(rJava)
.jinit()
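If the class is still not found after the JVM is initialized, the driver jar can also be added to the running JVM's classpath explicitly (a sketch; the jar path is a placeholder):
.jaddClassPath("/path/to/driver.jar")  # placeholder path to your JDBC driver jar
print(.jclassPath())                   # inspect what the JVM actually sees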

It is not enough to just download the driver; you also have to download its dependencies and put them on your Java classpath (macOS: /Library/Java/Extensions), as stated on the project main page:
Include the Cassandra JDBC dependencies in your classpath: download dependencies
As for the RCassandra package, right now it's still too primitive compared to RJDBC.

Related

sparklyr hadoopConfiguration

I apologize that this question will be hard to make fully reproducible because it involves a running spark context (referred to as sc below), but I am trying to set a hadoopConfiguration in sparklyr, specifically for accessing swift/objectStore objects from RStudio sparklyr as a Spark object, but in general for a scala call to hadoopConfiguration. Something like (scala code):
sc.hadoopConfiguration.set(f"fs.swift.service.$name.auth.url", "https://identity.open.softlayer.com/v3/auth/tokens")
where sc is a running spark context. In SparkR I can run (R code)
hConf = SparkR:::callJMethod(sc, "hadoopConfiguration")
SparkR:::callJMethod(hConf, "set", paste("fs.swift.service.keystone.auth.url"), paste("https://identity.open.softlayer.com/v3/auth/tokens",sep=""))
In sparklyr I have tried every incantation of this that I can think of, but my best guess is (again R code)
sc %>% invoke("set", paste("fs.swift.service.keystone,auth.url"), paste("https://identity.open.softlayer.com/v3/auth/tokens",sep=""))
but this results in the non-verbose error (note the irregular spelling) of
Error in enc2utf8(value) : argumemt is not a character vector
Of course I tried to encode the inputs in every way I can think of (naturally enc2utf8(value) being the first, but many others, including lists and as.character(as.list(...)), which appears to be a favorite of sparklyr coders). Any suggestions would be greatly appreciated. I have combed the source code for sparklyr and cannot find any mention of hadoopConfiguration in the sparklyr github, so I am afraid I am missing something very basic in the core configuration. I have also tried to pass these configs in the config.yml in the spark_connect() core call; while this works for setting "fs.swift.service.keystone.auth.url" as a sc$config$fs.swift.service.keystone.auth.url setting, it apparently fails to set this as a core hadoopConfiguration.
By the way, I am using Spark 1.6, Scala 2.10, R 3.2.1, and sparklyr_0.4.19.
I figured this out:
set_swift_config <- function(sc) {
  # get the underlying (scala) spark context
  ctx <- spark_context(sc)
  # wrap it in a java spark context
  jsc <- invoke_static(
    sc,
    "org.apache.spark.api.java.JavaSparkContext",
    "fromSparkContext",
    ctx
  )
  # set the swift configs:
  hconf <- jsc %>% invoke("hadoopConfiguration")
  hconf %>% invoke("set", "fs.swift.service.keystone.auth.url",
                   "https://identity.open.softlayer.com/v3/auth/tokens")
}
which can be run with set_swift_config(sc).
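A quick sanity check is to read the value back through the same invoke chain (a sketch):
ctx <- spark_context(sc)
hconf <- ctx %>% invoke("hadoopConfiguration")
hconf %>% invoke("get", "fs.swift.service.keystone.auth.url")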

Error: unexpected symbol in RScript - No further information provided about the line or syntax generating error

I have read the many posts related to R syntax errors, but every one points to using the error message to figure out where the error occurs. My situation is different in that the error is generic. See below:
Error: unexpected symbol in "RScript correlation_presalesfinal3.R"
RStudio executes it fine.
It is an incredibly simple script, and I am wondering if it has to do with how I am constructing my Postgres syntax. Does R require line break symbols between the statements (select, from, group by etc)?
That is the only thing I can think of. I am trying to compare a separate R-generated correlation to one generated directly by PostgreSQL. This particular piece is the call to PostgreSQL to calculate the correlation directly.
I appreciate your help!
Here is the code:
#Written by Laura for Standard Imp
#Install if necessary (definitely on the first run)
install.packages("RColorBrewer")
install.packages("gplots")
install.packages("RSclient")
install.packages("RPostgreSQL")
#libraries in use
library(RColorBrewer)
library(gplots)
library(RSclient)
library(RPostgreSQL)
# Establish connection to PostgreSQL using RPostgreSQL
drv <- dbDriver("PostgreSQL")
# Full version of connection setting
con <- dbConnect(drv, dbname="db",host="ip",port=5432,user="user",password="pwd")
# (replace dbname, host, user and password above with your own settings)
myLHSRHSFinalTable <- dbGetQuery(con,"select l1.a_lhsdescription as LHS, l2.a_rhsdescription as RHS, l7.a_scenariodescription as Scenario, corr(l3.driver_metric, l4.driver_metric) as Amount from schema_name.table_name l3 join schema_name.table_name l4 on L3.Time_ID = l4.Time_ID join schema_name.opera_00004_dim_lhs l1 on l3.LHS_ID = l1.member_id join schema_name.opera_00004_dim_rhs l2 on l4.RHS_ID = l2.member_id join schema_name.opera_00004_dim_scenario l7 on l3.scenario_id = l7.member_id join schema_name.opera_00004_dim_time l8 on l3.time_id = l8.member_id where l7.a_scenariodescription = 'Actual'
group by l1.a_lhsdescription , l2.a_rhsdescription, l7.a_scenariodescription ")
myLHSRHSFinalTable
write.csv(myLHSRHSFinalTable, file = "data_load_stats_final.csv")
# Close PostgreSQL connection
dbDisconnect(con)
Your description of the problem possibly lacks enough detail for people to answer, but in my situation I ran into this same error message because I was executing Rscript from within the R shell. Neither the R documentation nor the help is always clear about where commands are to be executed.
If you're working from the 'terminal' you use Rscript to run an R script, whereas if you're working from within the 'R shell' you use source() to run an R script.
As I'm still a newbie, I'm sure this answer is too much of an oversimplification, but it works to solve the basic error I was getting.
My script file called output.R can be executed from the 'terminal' command line prompt ("$") within my Linux system by the command:
$ Rscript output.R
Or alternatively from within R, by first running the R shell, then executing the command at the R prompt (">")
$ R
> source("output.R")

R: How to use RJDBC to download blob data from oracle database?

Does anyone know of a way to download blob data from an Oracle database using RJDBC package?
When I do something like this:
library(RJDBC)
drv <- JDBC(driverClass=..., classPath=...)
conn <- dbConnect(drv, ...)
blobdata <- dbGetQuery(conn, "select blobfield from blobtable where id=1")
I get this message:
Error in .jcall(rp, "I", "fetch", stride) :
java.sql.SQLException: Ongeldig kolomtype.: getString not implemented for class oracle.jdbc.driver.T4CBlobAccessor
Well, the message is clear ("Ongeldig kolomtype" is Dutch for "invalid column type"), but still I hope there is a way to download blobs. I read something about getBinary() as a way of getting blob information. Can I find a solution in that direction?
The problem is that RJDBC tries to convert every SQL data type it reads to either double or String in Java. Typically this trick works because the Oracle JDBC driver has routines to convert different data types to String (accessed through the getString() method of the java.sql.ResultSet class). For BLOBs, though, the getString() method was dropped at some point. RJDBC still tries calling it, which results in the error.
I tried digging into the guts of RJDBC to see if I could get it to call the proper function for BLOB columns; apparently the solution requires modifying the fetch S4 method in this package and also the result-grabbing Java class within the package. I'll try to get this patch to the package maintainers. Meanwhile, a quick and dirty fix using rJava (assuming conn and q as in your example):
s <- .jcall(conn@jc, "Ljava/sql/Statement;", "createStatement")
r <- .jcall(s, "Ljava/sql/ResultSet;", "executeQuery", q, check=FALSE)
listraws <- list()
col_num <- 1L   # the blob is the first (and only) column in the result set
i <- 1
while (.jcall(r, 'Z', 'next')) {
  listraws[[i]] <- .jcall(r, '[B', 'getBytes', col_num)
  i <- i + 1
}
This retrieves a list of raw vectors in R. The next steps depend on the nature of the data; in my application these vectors represent PNG images and can be handled pretty much as file connections by the png package.
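For instance, a minimal sketch of persisting the first blob to disk (the file name, and the assumption that it holds a PNG, are illustrative):
# write the first raw vector out as a binary file
writeBin(listraws[[1]], "blob1.png")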
Done using R 3.1.3, RJDBC 0.2-5, Oracle 11-2 and OJDBC driver for JDK >= 1.6

Making a connection using rmongodb in R on Mac OS x

I've been assigned to analyze some data stored in MongoDB. I'm a complete newbie to MongoDB, but I can manage if I can read the data and convert it to an R data table or data frame. If possible, I'd like to do just enough to get the MongoDB data into R.
I'm trying to get access to the data using the rmongodb package in R version 3.1.2 on Mac OS X Yosemite via RStudio 0.98.953. I've tried this so far:
install.packages("rmongodb")
library(rmongodb)
#up to here, it works
mongo <- mongo.create(host='localhost')
mongo <- mongo.create(host='127.0.0.1')
mongo <- mongo.create()
Each of the mongo <- assignment statements results in the same error:
Unable to connect to localhost:27017, error code = 2
and
mongo.is.connected(mongo)
returns FALSE.
If this is an essential part of the answer, we can use "db=test" as the database. For what it's worth, the datasets are stored in "~/Desktop/MyExample" and consist of four files with the extension "bson" and their analogues ending with ".metadata.json", as well as a "system.indexes.bson" file.
Any ideas? Thanks in advance!

sqlFetch Table not found error

After I use
cn <- odbcConnect(...)
to connect to MS SQL Server, I can successfully get data using:
tmp <- sqlQuery(cn, "select * from MyTable")
But if I use
tmp <- sqlFetch(cn,"MyTable")
R complains with "Error in odbcTableExists(channel, sqtable) : table not found on channel". Did I miss anything here?
Assuming you are working on Windows: when you define your "dsn" in Control Panel > Administrative Tools > System and Security > Data Sources (ODBC), you have to select a database as well. If you do that, your code should work as expected.
So the problem is not in your R code but in your "dsn" definition, which in my opinion does not contain the required reference to a database.
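Alternatively, the database can be named in the connection itself rather than in the DSN. A minimal sketch with RODBC (the server and database names are placeholders):
library(RODBC)
# the connection string carries the database, so sqlFetch can resolve "MyTable"
cn <- odbcDriverConnect("driver={SQL Server};server=myserver;database=mydatabase;trusted_connection=yes")
tmp <- sqlFetch(cn, "MyTable")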
