Making a connection using rmongodb in R on Mac OS X - r

I've been assigned to analyze some data that are contained in a MongoDB format. I'm a complete newbie to MongoDB, but I can manage if I can read the data and convert it to an R data table or data frame. If possible, I'd like to do just enough to get the MongoDB data into R.
I'm trying to get access to the data using the rmongodb package in R version 3.1.2 on Mac OS X Yosemite via RStudio 0.98.953. I've tried this so far:
install.packages("rmongodb")
library(rmongodb)
#up to here, it works
mongo <- mongo.create(host='localhost')
mongo <- mongo.create(host='127.0.0.1')
mongo <- mongo.create()
Each of the mongo <- assignment statements results in the same error:
Unable to connect to localhost:27017, error code = 2
and
mongo.is.connected(mongo)
returns FALSE.
If this is an essential part of the answer, we can use "db=test" as the database. For what it's worth, the datasets are stored in "~/Desktop/MyExample" and consist of four files with the extension "bson" and their analogues ending with ".metadata.json", as well as a "system.indexes.bson" file.
Any ideas? Thanks in advance!
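(For anyone landing here: a minimal sketch of what the read might look like once the connection problem is solved. It assumes a local mongod is actually running and that the BSON dump has been restored into the "test" database, e.g. with mongorestore; the collection name "MyCollection" is a placeholder.)

library(rmongodb)

# connect to the local mongod, pointing at the "test" database
mongo <- mongo.create(host = "127.0.0.1", db = "test")

if (mongo.is.connected(mongo)) {
  # list the namespaces to find the right collection
  mongo.get.database.collections(mongo, db = "test")
  # pull an entire collection ("MyCollection" is a placeholder) as a list of documents
  docs <- mongo.find.all(mongo, "test.MyCollection")
  # flatten to a data frame; this assumes flat, consistently keyed documents
  df <- do.call(rbind.data.frame, docs)
  mongo.destroy(mongo)
}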

Related

How to use "Spark Connection" with databricks to extract data from SQL database and then convert to R dataframe?

I am trying to extract data from an Azure SQL database into an R notebook on Azure Databricks so that I can run an R script on it. Because of the difficulties I have experienced using JDBC and DBI (detailed here: "No suitable driver" error when using sparklyr::spark_read_jdbc to query Azure SQL database from Azure Databricks, and here: How to get SQL database connection compatible with DBI::dbGetQuery function when converting between R script and databricks R notebook?), I have decided to use the inbuilt Spark connector as follows (credentials changed for security reasons):
%scala
//Connect to database:
import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._
// Acquire a DataFrame collection (val collection)
val dave_connection = Config(Map(
  "url"          -> "servername.database.windows.net",
  "databaseName" -> "databasename",
  "dbTable"      -> "myschema.mytable",
  "user"         -> "username",
  "password"     -> "userpassword"
))
val collection = sqlContext.read.sqlDB(dave_connection)
collection.show()
This works in the sense that it displays the data, but as someone who doesn't know the first thing about scala or spark I now have no idea how to get it into an R or R-compatible dataframe.
I have tried to see what kind of object "collection" is, but:
%scala
getClass(collection)
returns only:
notebook:1: error: too many arguments for method getClass: ()Class[_ <: $iw]
getClass(collection)
And trying to access it using sparklyr implies that it doesn't actually exist, e.g.
library(sparklyr)
sc <- spark_connect(method="databricks")
sdf_schema(collection)
returns:
Error in spark_dataframe(x) : object 'collection' not found
I feel like this may well be pretty obvious to anyone who understands Scala, but I don't (I come from an analyst rather than a computer science background), and I just want to get this data into an R dataframe to perform analyses on it. (I know Databricks is all about parallelisation and scaling, but I'm not performing any parallelised functions on this dataset; the only reason I'm using Databricks is that my work PC doesn't have sufficient memory to run my analyses locally!)
So, does anyone have any ideas on how I can convert this spark object "collection" into an R dataframe?
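One approach (not tested against this exact setup) is to register the Scala DataFrame as a temporary view, e.g. collection.createOrReplaceTempView("collection_view") in the Scala cell — the view name is arbitrary — and then read that view from an R cell with SparkR, roughly:

# R cell, run after the Scala cell has registered the temp view
library(SparkR)

# reference the temp view as a SparkDataFrame
collection_sdf <- sql("SELECT * FROM collection_view")

# pull it into an ordinary R data.frame (assumes the result fits in driver memory)
collection_df <- collect(collection_sdf)
str(collection_df)

The sparklyr equivalent should be dplyr::tbl(sc, "collection_view") followed by collect(), though I haven't verified that on Databricks.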

Convert SparkR DataFrame to H2O Frame

Using SparkR, I am wondering if it is possible to convert a Spark DataFrame into an H2O frame?
I have seen examples of converting R data.frames to h2o frames, but, sadly, this is not a viable option (data size).
I know it is possible to use sparklyr and rsparkling to create an H2O frame, but I am not using Hive, Hadoop, sparklyr, or rsparkling.
Instead, my goal is to convert the sdf from this:
set.seed(123)
df<- data.frame(ColA=rep(c("dog", "cat", "fish", "shark"), 4), ColB=rnorm(16), ColC=rep(seq(1:8),2))
sdf<- SparkR::createDataFrame(df)
into this:
as.h2o(sdf, destination_frame = "hsdf") # fails, came from Spark (SparkR)
as.h2o(df, destination_frame = "hdf") # succeeds, but this is a regular R data.frame
Hopefully, someone has figured out a way to do this using what SparkR can provide. I think it would be a huge boon to R users.
There is no support for converting between H2O and Spark frames natively in either the h2o or the SparkR packages. You would have to use rsparkling (which depends on sparklyr) or do a conversion from Spark DataFrame -> R data.frame -> H2O Frame.
You mentioned Hadoop and HIVE... just to clarify, neither of those are requirements for using rsparkling::as_h2o_frame().
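A minimal sketch of that second route (Spark DataFrame -> R data.frame -> H2O Frame), which of course only helps if the collected data fits in the driver's memory:

library(SparkR)
library(h2o)
h2o.init()

# bring the SparkDataFrame back into a local R data.frame
rdf <- collect(sdf)

# then push the local data.frame into H2O
hsdf <- as.h2o(rdf, destination_frame = "hsdf")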
Since none of the above worked for me, my solution was:
Save the Spark dataframe out as a folder of CSV files (see the write.df sketch after this answer).
Use lapply to read each CSV file back in with the rio package:
tmp <- lapply(list.files("data/csvfolder.csv"), function(x) rio::import(paste0("data/csvfolder.csv/", x)))
df00 <- do.call("rbind", tmp)
Then use "df00" as an ordinary R dataframe however you wish.
Hope that works for you! collect() and as.data.frame() can struggle depending on the type and size of the data being used.
Cheers
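As referenced above, the first step (writing the Spark dataframe out as a folder of CSV files) might look like this with SparkR, assuming sdf from the question; the header and mode options are my assumptions:

# write the SparkDataFrame out as a folder of CSV part files
SparkR::write.df(sdf, path = "data/csvfolder.csv",
                 source = "csv", mode = "overwrite", header = "true")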

When using SparkR to process data, where does the program actually run?

I am new to Spark and SparkR, and my question is below:
When I write the code below:
1) Set up the environment and start a Spark session:
sparkR.session(master = "my/spark/master/on/one/server/standaloneMode", sparkConfig = list(spark.driver.memory = "4g", spark.sql.warehouse.dir = "my/hadoop_home/bin"), sparkPackages = "com.databricks:spark-avro_2.11:3.0.1")
Then I wrote:
rund <- data.frame(V1 = runif(10000000,100,10000),V2 =runif(10000000,100,10000))
df <- as.DataFrame(rund)
Here is the thing:
1) Where does the program do the 'splitting'? On my local machine or on the server?
2) Also, could anyone tell me where exactly the as.DataFrame() call runs? On my computer or on the server that was set up as the Spark standalone master?
SparkR is an interface for Spark. This means that some R functions are overridden by the SparkR package to provide a user experience similar to what you already know from R. You probably should have a look at the documentation to see which Spark functions are available: https://spark.apache.org/docs/latest/api/R/index.html
Those functions typically ingest SparkDataFrames, which you can create, for example, with the as.DataFrame function. A SparkDataFrame is a reference to a distributed data frame that lives in your Spark cluster.
In your example you've created a local R data frame rund. The runif functions also were executed locally in your R instance.
# executed in your local R instance
rund <- data.frame(V1 = runif(10000000,100,10000),V2 =runif(10000000,100,10000))
The df object, however, is a SparkDataFrame, which is created in your Spark cluster. as.DataFrame is called from R, but the actual SparkDataFrame only exists in your cluster.
df <- as.DataFrame(rund)
To easily distinguish between R and Spark data frames, you can use the class function:
> class(rund)
[1] "data.frame"
> class(df)
[1] "SparkDataFrame"
attr(,"package")
[1] "SparkR"
In general a SparkDataFrame can be used as input for the various functions the SparkR package has to offer, for example to group or sort your SparkDataFrame in Spark. The Spark operations are executed when a Spark action is called. An example of such an action is collect. It triggers the transformations in Spark, retrieves the computed data from your Spark cluster and creates a corresponding R data frame in your local R instance. If you have a look at the documentation you can see whether a function can ingest a SparkDataFrame:
##S4 method for signature 'SparkDataFrame'
collect(x, stringsAsFactors = FALSE)
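To make that concrete, here is a tiny example of the action being triggered (continuing with df from above; the collected data must fit in local memory):

# nothing is computed until an action such as collect() is called
local_df <- collect(df)
class(local_df)  # "data.frame" again, now living in the local R session
head(local_df)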
Moreover it is possible to execute custom R code in your Spark cluster by using user-defined functions: https://spark.apache.org/docs/latest/sparkr.html#applying-user-defined-function.
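As a small illustration of that last point, dapplyCollect runs an R function on each partition inside the cluster and collects the result (a sketch, reusing df from the example above):

# the anonymous function is executed on the Spark workers, not in the local R session
res <- dapplyCollect(df, function(pdf) {
  pdf$V3 <- pdf$V1 + pdf$V2  # pdf is an ordinary R data.frame holding one partition
  pdf
})
head(res)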

R: How to use RJDBC to download blob data from oracle database?

Does anyone know of a way to download blob data from an Oracle database using RJDBC package?
When I do something like this:
library(RJDBC)
drv <- JDBC(driverClass=..., classPath=...)
conn <- dbConnect(drv, ...)
blobdata <- dbGetQuery(conn, "select blobfield from blobtable where id=1")
I get this message:
Error in .jcall(rp, "I", "fetch", stride) :
java.sql.SQLException: Ongeldig kolomtype.: getString not implemented for class oracle.jdbc.driver.T4CBlobAccessor
Well, the message is clear enough ('Ongeldig kolomtype' is Dutch for 'invalid column type'), but still I hope there is a way to download blobs. I read something about getBinary() as a way of getting blob information. Can I find a solution in that direction?
The problem is that RJDBC tries to convert the SQL data type it reads to either double or String in Java. Typically the trick works because the JDBC driver for Oracle has routines to convert different data types to String (accessed by the getString() method of the java.sql.ResultSet class). For BLOBs, though, the getString() method was discontinued at some point. RJDBC still tries calling it, which results in an error.
I tried digging into the guts of RJDBC to see if I can get it to call the proper function for BLOB columns, and apparently the solution requires modification of the fetch S4 method in this package and also the result-grabbing Java class within the package. I'll try to get this patch to the package maintainers. Meanwhile, here is a quick and dirty fix using rJava directly (assuming conn as in your example and q holding the query string):
library(rJava)  # loaded by RJDBC anyway, but being explicit doesn't hurt

# build a plain JDBC statement from the existing RJDBC connection
s <- .jcall(conn@jc, "Ljava/sql/Statement;", "createStatement")
r <- .jcall(s, "Ljava/sql/ResultSet;", "executeQuery", q, check = FALSE)

listraws <- list()
col_num <- 1L  # the BLOB column is the first (and only) column in the result set
i <- 1
while (.jcall(r, "Z", "next")) {
  # getBytes() returns the BLOB contents as a Java byte array, i.e. an R raw vector
  listraws[[i]] <- .jcall(r, "[B", "getBytes", col_num)
  i <- i + 1
}
This retrieves list of raw vectors in R. The next steps depend on the nature of data - in my application these vectors represent PNG images and can be handled pretty much as file connections by png package.
Done using R 3.1.3, RJDBC 0.2-5, Oracle 11-2 and OJDBC driver for JDK >= 1.6
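As a follow-up to the note about PNG content: png::readPNG accepts a raw vector directly, so (assuming the blobs really are PNG images) decoding one element might look like:

library(png)
# decode the first blob straight from the raw vector into an image array
img <- readPNG(listraws[[1]])
dim(img)  # height x width x channels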

How to read data from Cassandra with R?

I am using R 2.14.1 and Cassandra 1.2.11. I have a separate program which has written data to a single Cassandra table, and I am failing to read that data back from R.
The Cassandra schema is defined like this:
create table chosen_samples (id bigint , temperature double, primary key(id))
I have first tried the RCassandra package (http://www.rforge.net/RCassandra/)
> # install.packages("RCassandra")
> library(RCassandra)
> rc <- RC.connect(host ="192.168.33.10", port = 9160L)
> RC.use(rc, "poc1_samples")
> cs <- RC.read.table(rc, c.family="chosen_samples")
The connection seems to succeed but the parsing of the table into data frame fails:
> cs
Error in data.frame(..dfd. = c("#\"ffffff", "#(<cc><cc><cc><cc><cc><cd>", :
duplicate row.names:
I have also tried using JDBC connector, as described here: http://www.datastax.com/dev/blog/big-analytics-with-r-cassandra-and-hive
> # install.packages("RJDBC")
> library(RJDBC)
> cassdrv <- JDBC("org.apache.cassandra.cql.jdbc.CassandraDriver", "/Users/svend/dev/libs/cassandra-jdbc-1.2.5.jar", "`")
But this one fails like this:
Error in .jfindClass(as.character(driverClass)[1]) : class not found
Even though the path to the Java driver is correct:
$ ls /Users/svend/dev/libs/cassandra-jdbc-1.2.5.jar
/Users/svend/dev/libs/cassandra-jdbc-1.2.5.jar
You have to download apache-cassandra-2.0.10-bin.tar.gz, cassandra-jdbc-1.2.5.jar and cassandra-all-1.1.0.jar.
There is no need to install Cassandra on your local machine; just put the cassandra-jdbc-1.2.5.jar and the cassandra-all-1.1.0.jar files in the lib directory of the unzipped apache-cassandra-2.0.10-bin.tar.gz. Then you can use
library(RJDBC)
drv <- JDBC("org.apache.cassandra.cql.jdbc.CassandraDriver",
            list.files("D:/apache-cassandra-2.0.10/lib",
                       pattern = "jar$", full.names = TRUE))
That works on my Unix machine but not on my Windows machine.
Hope that helps.
This question is old now, but since it's one of the top hits for R and Cassandra I thought I'd leave a simple solution here, as I found frustratingly little up-to-date support for what I thought would be a fairly common task.
Sparklyr makes this pretty easy to do from scratch now, as it exposes a java context so the Spark-Cassandra-Connector can be used directly. I've wrapped up the bindings in this simple package, crassy, but it's not necessary to use.
I mostly made it to demystify the config around how to make sparklyr load the connector, and as the syntax for selecting a subset of columns is a little unwieldy (assuming no Scala knowledge).
Column selection and partition filtering are supported. These were the only features I thought were necessary for general Cassandra use cases, given CQL can't be submitted directly to the cluster.
I've not found a solution to submitting more general CQL queries which doesn't involve writing custom Scala; however, there's an example of how this can work here.
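For reference, the general shape of reading a Cassandra table through sparklyr and the Spark-Cassandra-Connector directly (without crassy) looks roughly like the sketch below; the connector coordinates, host and master are assumptions to adapt to your own Spark/Scala versions and cluster:

library(sparklyr)
library(dplyr)

conf <- spark_config()
# pull in the connector from spark-packages; the version must match your Spark/Scala build
conf$sparklyr.defaultPackages <- "datastax:spark-cassandra-connector:2.0.0-s_2.11"
conf$spark.cassandra.connection.host <- "192.168.33.10"

sc <- spark_connect(master = "local", config = conf)

# expose the Cassandra table as a Spark table
samples_tbl <- spark_read_source(
  sc, name = "chosen_samples",
  source  = "org.apache.spark.sql.cassandra",
  options = list(keyspace = "poc1_samples", table = "chosen_samples")
)

# column selection / filtering happens in Spark; collect() brings the result into R
chosen_samples_df <- samples_tbl %>% select(id, temperature) %>% collect()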
Right, I found an (admittedly ugly) way, simply by calling Python from R, handling the NAs manually and re-assigning the data frame's names in R, like this:
# install.packages("rPython")
# (don't forget to "pip install cql")
library(rPython)
python.exec("import sys")
# adding libraries from virtualenv
python.exec("sys.path.append('/Users/svend/dev/pyVe/playground/lib/python2.7/site-packages/')")
python.exec("import cql")
python.exec("connection=cql.connect('192.168.33.10', cql_version='3.0.0')")
python.exec("cursor = connection.cursor()")
python.exec("cursor.execute('use poc1_samples')")
python.exec("cursor.execute('select * from chosen_samples' )")
# encode Python's None as a placeholder for NA (rPython seems to just return nothing otherwise)
python.exec("rep = lambda x : '__NA__' if x is None else x")
python.exec( "def getData(): return [rep(num) for line in cursor for num in line ]" )
data <- python.call("getData")
df <- as.data.frame(matrix(unlist(data), ncol=15, byrow=T))
names(df) <- c("temperature", "maxTemp", "minTemp",
"dewpoint", "elevation", "gust", "latitude", "longitude",
"maxwindspeed", "precipitation", "seelevelpressure", "visibility", "windspeed")
# and decoding NA's
parsena <- function (x) if (x=="__NA__") NA else x
df <- as.data.frame(lapply(df, parsena))
Does anybody have a better idea?
I had the same error message when executing Rscript with an RJDBC connection via a batch file (R 3.2.4, Teradata driver).
Also, when run in RStudio it worked fine on the second run but not the first.
What helped was to explicitly call:
library(rJava)
.jinit()
It is not enough to just download the driver; you also have to download the dependencies and put them on your Java classpath (macOS: /Library/Java/Extensions), as stated on the project main page.
Include the Cassandra JDBC dependencies in your classpath: download dependencies.
As for the RCassandra package, right now it's still too primitive compared to RJDBC.
