R-Studio IDE on Databricks

I am trying to import a dataset from the Databricks File System (DBFS) into RStudio, which is running on a Databricks cluster, and I am facing the issue below.
> sparkDF <- read.df(source = "parquet", path = "/tmp/lrs.parquet", header = "true", inferSchema = "true")
Error: Error in load : java.lang.SecurityException: No token to
authorize principal
at com.databricks.sql.acl.ReflectionBackedAclClient$$anonfun$com$databricks$sql$acl$ReflectionBackedAclClient$$token$2.apply(ReflectionBackedAclClient.scala:137)
at com.databricks.sql.acl.ReflectionBackedAclClient$$anonfun$com$databricks$sql$acl$ReflectionBackedAclClient$$token$2.apply(ReflectionBackedAclClient.scala:137)
at scala.Option.getOrElse(Option.scala:121)
at com.databricks.sql.acl.ReflectionBackedAclClient.com$databricks$sql$acl$ReflectionBackedAclClient$$token(ReflectionBackedAclClient.scala:137)
at com.databricks.sql.acl.ReflectionBackedAclClient$$anonfun$getValidPermissions$1.apply(ReflectionBackedAclClient.scala:86)
at com.databricks.sql.acl.ReflectionBackedAclClient$$anonfun$getValidPermissions$1.apply(ReflectionBackedAclClient.scala:81)
at com.databricks.sql.acl.ReflectionBackedAclClient.stripReflectionException(ReflectionBackedAclClient.scala:73)
at com.databricks.sql.acl.Refle
The DBFS location is correct; any suggestions or blog posts are welcome!

The syntax for reading data with R on Databricks depends on whether you are reading into Spark or into R on the driver. See below:
# reading into Spark
sparkDF <- read.df(source = "parquet",
                   path = "dbfs:/tmp/lrs.parquet")
# reading into R
r_df <- read.csv("/dbfs/tmp/lrs.csv")
When reading into Spark, use the dbfs:/ prefix; when reading into R directly, use /dbfs/.

We should prefix the directory path with /dbfs.
For example: /dbfs/tmp/lrs.parquet
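Applied to the call from the question, the fix is just the path prefix (a minimal sketch; header and inferSchema are CSV reader options and have no effect on Parquet, so they can be dropped):
# SparkR: the original call, fixed to use the dbfs:/ prefix
sparkDF <- read.df(path = "dbfs:/tmp/lrs.parquet", source = "parquet")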

How to export sf object to GDB using RPyGeo in R (Windows)?

I have a bunch of sf objects I'd like to export to GDB from R. I'm running R 4.0.2 on Windows 10. In this case the sf objects are all vector point data. The main reasons to export to GDB are to keep longer field names (the shapefile truncation is very annoying), and because GDBs are more desirable storage locations for our workflows.
Yes, I know about the ArcGisBinding package. I've got it to work in a test script, but it's pretty unstable, often crashing and requiring a restart of R. This is a problem, because the sf objects I'd like to export come after an already long Rmd that reads in, formats, and cleans the data. So it's not a simple matter of re-running the script until arc.write doesn't break. I could break up the script, but then I'd still have to read in a bunch of shapefiles. One option I haven't yet explored is using reticulate to call a Python script instead of trying to do everything in R, but we're trying to do our analysis all in one place, if possible.
I'm pretty sure I've managed to set up RPyGeo appropriately, first setting my Python path using the reticulate package. I'm doing it this way because IT restrictions mean I can't edit PATH variables on my machine.
# package calls
library(sf)
library(spData)
library(reticulate)

# set python version in reticulate
py_path <- "C:/Program Files/ArcGIS/Pro/bin/Python/envs/arcgispro-py3/python.exe"
reticulate::use_python(python = py_path, required = TRUE)

# call RPyGeo
library(RPyGeo) # for potential point export

# output gdb
out.gdb <- "C:/LOCAL_PROJECTS/Output/Output.gdb"

# RPyGeo parameters
# Note that, in order to use RPyGeo, you need a working ArcMap or ArcGIS Pro installation.
# The python path (py_path above) will change depending on which version of Arc one is using.
arcpy <- rpygeo_build_env(workspace = out.gdb,
                          overwrite = TRUE,
                          extensions = c("Spatial", "DataInteroperability"),
                          path = py_path)
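Before debugging the export itself, it can help to confirm the Python bridge works at all; a small sanity-check sketch using arcpy's standard GetInstallInfo():
# if the bridge is working, this prints ArcGIS product and version info
arcpy$GetInstallInfo()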
I've tried a bunch of different tools to export an sf object, here using dummy data also used in the RPyGeo vignette:
data(nz, package = "spData")
arcpy$Copy_management(in_data = nz, out_data = "nz_test")
arcpy$Copy_management(in_data = nz, out_data = file.path(out.gdb, "nz"))
arcpy$FeatureClassToGeodatabase_conversion(Input_Features = nz, Output_Geodatabase = out.gdb)
arcpy$FeatureClassToFeatureClass_conversion(in_features = nz, out_path = out.gdb, out_name = "nz")
arcpy$QuickExport_interop(Input = nz, Output = file.path(out.gdb, "nz"))
arcpy$CopyFeatures_management(in_features = nz, out_feature_class = file.path(out.gdb, "nz"))
arcpy$CopyFeatures_management(in_features = nz, out_feature_class = "nz")
Each time I get an error, for example:
Error in py_call_impl(callable, dots$args, dots$keywords) :
RuntimeError: Object: Error in executing tool
Detailed traceback:
File "C:\Program Files\ArcGIS\Pro\Resources\ArcPy\arcpy\management.py", line 3232, in CopyFeatures
raise e
File "C:\Program Files\ArcGIS\Pro\Resources\ArcPy\arcpy\management.py", line 3229, in CopyFeatures
retval = convertArcObjectToPythonObject(gp.CopyFeatures_management(*gp_fixargs((in_features, out_feature_class, config_keyword, spatial_grid_1, spatial_grid_2, spatial_grid_3), True)))
File "C:\Program Files\ArcGIS\Pro\Resources\ArcPy\arcpy\geoprocessing\_base.py", line 511, in <lambda>
return lambda *args: val(*gp_fixargs(args, True))
I'm not an expert in ArcPy by any means, nor am I an expert in tracing errors inside packages. Am I making a simple syntax mistake? Is there something else that I'm missing? Any help would be much appreciated!
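One observation worth checking: the calls above pass the sf object nz itself as in_features/in_data, but arcpy tools generally expect a catalog path or layer name, not an R object. A possible workaround, sketched under that assumption and untested, is to write the sf object to disk first and hand arcpy a path (the GeoPackage layer path below is a guess and may need adjusting for your ArcGIS version):
library(sf)
# write the sf object to a temporary GeoPackage so arcpy receives a file path, not an R object
tmp_gpkg <- file.path(tempdir(), "nz.gpkg")
st_write(nz, dsn = tmp_gpkg, layer = "nz", delete_dsn = TRUE)
# pass the on-disk layer to the arcpy tool; "main.nz" is a guess at how Pro names GPKG layers
arcpy$FeatureClassToFeatureClass_conversion(
  in_features = file.path(tmp_gpkg, "main.nz"),
  out_path = out.gdb,
  out_name = "nz")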

send2cy doesn't work in RStudio ~ CyREST, Cytoscape

When I run the send2cy function in RStudio, I get an error.
# Basic setup
library(igraph)
library(RJSONIO)
library(httr)
dir <- "/currentdir/"
setwd(dir)
port.number = 1234
base.url = paste("http://localhost:", toString(port.number), "/v1", sep="")
print(base.url)
# Load list of edges as a data frame
network.df <- read.table("./data/eco_EM+TCA.txt")
# Convert it into an igraph object
network <- graph.data.frame(network.df, directed = TRUE)
# Remove duplicate edges & loops
g.tca <- simplify(network, remove.multiple = TRUE, remove.loops = TRUE)
# Name it
g.tca$name = "Ecoli TCA Cycle"
# This function will be published as a part of utility package, but not ready yet.
source('./utility/cytoscape_util.R')
# Convert it into Cytoscape.js JSON
cygraph <- toCytoscape(g.tca)
send2cy(cygraph, 'default%20black', 'circular')
Error in file(con, "r") : cannot open the connection
Called from: file(con, "r")
But I don't get the error when I use the send2cy function from terminal R (running R from the terminal by calling R).
Any advice is welcome.
I tested your script with local copies of the network data and utility script, and with updated file paths. The script ran fine for me in RStudio.
Given the error message you are seeing ("Error in file..."), I suspect the issue is with your local files and file paths, somehow in an RStudio-specific way.
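A quick way to test that theory (a small sketch): run the following in both RStudio and terminal R; if the outputs differ, the relative paths in the script resolve to different locations in the two environments:
# compare the working directory and file visibility in both environments
getwd()
file.exists("./data/eco_EM+TCA.txt")
file.exists("./utility/cytoscape_util.R")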
FYI: an updated, consolidated set of R scripts for Cytoscape is available here: https://github.com/cytoscape/cytoscape-automation/tree/master/for-scripters/R. I don't think anything has changed significantly, but perhaps trying in a new context will resolve the issue you are facing.

Download a custom dataset in Azure ML Jupyter/iPython Notebook using R

I need to download a custom dataset in an Azure Jupyter/iPython Notebook.
My ultimate goal is to install an R package. To be able to do this, the package (the dataset) needs to be downloaded in code. I followed the steps outlined by Andrie de Vries in the comments section of this post: Jupyter Notebooks with R in Azure ML Studio.
Uploading the package as a ZIP file was without problems, but when I run the code in my notebook I get an error:
Error in curl(x$DownloadLocation, handle = h, open = conn): Failure when receiving data from the peer
Traceback:
download.datasets(ws, "plotly_3.6.0.tar.gz.zip")
lapply(1:nrow(datasets), function(j) get_dataset(datasets[j, . ], ...))
FUN(1L[[1L]], ...)
get_dataset(datasets[j, ], ...)
curl(x$DownloadLocation, handle = h, open = conn)
So I simplified my code into:
library("AzureML")
ws <- workspace()
ds <- datasets(ws)
ds$Name
data <- download.datasets(ws, "plotly_3.6.0.tar.gz.zip")
head(data)
Where "plotly_3.6.0.tar.gz.zip" is the name of my dataset of data type "Zip".
Unfortunately, this results in the same error.
To rule out data type issues, I also tried to download another dataset of mine, which is of data type "Dataset": same error again.
Next, I changed the dataset I want to download to one of the sample datasets of Azure ML Studio.
"text.preprocessing.zip" is of datatype Zip
data <- download.datasets(ws, "text.preprocessing.zip")
"Flight Delays Data" is of datatype GenericCSV
data <- download.datasets(ws, "Flight Delays Data")
Both of the sample datasets can be downloaded without problems.
So why can't I download my own saved dataset?
I could not find anything helpful in the documentation of the download.datasets function, neither on rdocumentation.org nor on cran.r-project.org (pages 17-18).
Try this:
library(AzureML)
ws <- workspace(
  id = "your AzureML ID",
  auth = "your AzureML Key"
)
name <- "Name of your saved data"
data <- download.datasets(ws, name)
It seems the error I got was due to a bug in the (then early) Azure ML Studio.
I tried again after Daniel Prager's reply, only to find out my code works as expected without any changes. Adding the id and auth parameters was not needed.
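For completeness, once the zip downloads correctly, the remaining step toward the original goal (installing the package) is plain base R; a sketch, assuming the wrapper zip has been saved to the working directory and contains the source tarball:
# unpack the wrapper zip, then install the source tarball it contains
unzip("plotly_3.6.0.tar.gz.zip")
install.packages("plotly_3.6.0.tar.gz", repos = NULL, type = "source")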

Installing pdftotext on Windows (for use with R, 'tm' package)

I am having trouble using R, 'tm' package, to read in .pdf files.
Specifically, I try to run the following code:
library(tm)
filename = "myfile.pdf"
tmp1 <- readPDF(PdftotextOptions="-layout")
doc <- tmp1(elem=list(uri=filename),language="en",id="id1")
doc[1:15]
...which gives me the error:
Error in readPDF(PdftotextOptions = "-layout") :
unused argument (PdftotextOptions = "-layout")
I assume this is due to the fact that the pdftotext program (part of xpdf, http://www.foolabs.com/xpdf/download.html) has not been installed correctly on my machine, so that R cannot access it.
What are the steps to install xpdf/pdftotext correctly such that the above R code can be executed? (I am aware of similar questions already posted, however they don't address the same issue)
PdftotextOptions is not a parameter of readPDF. readPDF has a control parameter, which expects a list, so the correct use would be:
if (all(file.exists(Sys.which(c("pdfinfo", "pdftotext"))))) {
  tmp1 <- readPDF(control = list(text = "-layout"))
  doc <- tmp1(elem = list(uri = filename), language = "en", id = "id1")
}
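If editing the system PATH isn't an option, the xpdf binaries folder can be prepended to the PATH for the current R session only; a sketch, assuming xpdf was unpacked to C:/xpdf/bin64 (adjust to your install location):
# make pdftotext/pdfinfo discoverable for this session without touching system settings
Sys.setenv(PATH = paste("C:/xpdf/bin64", Sys.getenv("PATH"), sep = .Platform$path.sep))
Sys.which("pdftotext")  # should now return the full path to pdftotext.exe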
Set the working directory to the xpdf binaries folder:
setwd('C:/xpdf/bin64')
It works for me.

How to read data from Cassandra with R?

I am using R 2.14.1 and Cassandra 1.2.11. A separate program has written data to a single Cassandra table, and I am failing to read it back from R.
The Cassandra schema is defined like this:
create table chosen_samples (id bigint , temperature double, primary key(id))
I have first tried the RCassandra package (http://www.rforge.net/RCassandra/)
> # install.packages("RCassandra")
> library(RCassandra)
> rc <- RC.connect(host ="192.168.33.10", port = 9160L)
> RC.use(rc, "poc1_samples")
> cs <- RC.read.table(rc, c.family="chosen_samples")
The connection seems to succeed, but parsing the table into a data frame fails:
> cs
Error in data.frame(..dfd. = c("#\"ffffff", "#(<cc><cc><cc><cc><cc><cd>", :
duplicate row.names:
I have also tried using JDBC connector, as described here: http://www.datastax.com/dev/blog/big-analytics-with-r-cassandra-and-hive
> # install.packages("RJDBC")
> library(RJDBC)
> cassdrv <- JDBC("org.apache.cassandra.cql.jdbc.CassandraDriver", "/Users/svend/dev/libs/cassandra-jdbc-1.2.5.jar", "`")
But this one fails like this:
Error in .jfindClass(as.character(driverClass)[1]) : class not found
This happens even though the location of the Java driver is correct:
$ ls /Users/svend/dev/libs/cassandra-jdbc-1.2.5.jar
/Users/svend/dev/libs/cassandra-jdbc-1.2.5.jar
You have to download apache-cassandra-2.0.10-bin.tar.gz, cassandra-jdbc-1.2.5.jar, and cassandra-all-1.1.0.jar.
There is no need to install Cassandra on your local machine; just put the cassandra-jdbc-1.2.5.jar and cassandra-all-1.1.0.jar files in the lib directory of the unzipped apache-cassandra-2.0.10-bin.tar.gz. Then you can use:
library(RJDBC)
drv <- JDBC("org.apache.cassandra.cql.jdbc.CassandraDriver",
            list.files("D:/apache-cassandra-2.0.10/lib",
                       pattern = "jar$", full.names = TRUE))
This works on my Unix machine but not on my Windows machine.
Hope that helps.
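To complete the picture, connecting and querying with that driver would then look like this (a sketch; the jdbc:cassandra URL format is the one the old cassandra-jdbc driver uses, and the host and keyspace are the ones from the question):
# open a connection to the keyspace and pull the table into a data frame
conn <- dbConnect(drv, "jdbc:cassandra://192.168.33.10:9160/poc1_samples")
df <- dbGetQuery(conn, "select * from chosen_samples")
dbDisconnect(conn)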
This question is old now, but since it's one of the top hits for R and Cassandra, I thought I'd leave a simple solution here, as I found frustratingly little up-to-date support for what I thought would be a fairly common task.
sparklyr makes this pretty easy to do from scratch now, as it exposes a Java context, so the Spark-Cassandra-Connector can be used directly. I've wrapped up the bindings in this simple package, crassy, but it's not necessary to use it.
I mostly made it to demystify the config around how to make sparklyr load the connector, and because the syntax for selecting a subset of columns is a little unwieldy (assuming no Scala knowledge).
Column selection and partition filtering are supported. These were the only features I thought were necessary for general Cassandra use cases, given CQL can't be submitted directly to the cluster.
I've not found a solution for submitting more general CQL queries that doesn't involve writing custom Scala; however, there's an example of how this can work here.
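For reference, the sparklyr route sketched in outline (the connector coordinates and version here are illustrative, not a pinned recommendation; check the Spark-Cassandra-Connector compatibility matrix for your Spark version):
library(sparklyr)
conf <- spark_config()
# have sparklyr pull the connector from spark-packages at connect time
conf$sparklyr.defaultPackages <- "datastax:spark-cassandra-connector:2.0.0-s_2.11"
conf$spark.cassandra.connection.host <- "192.168.33.10"
sc <- spark_connect(master = "local", config = conf)
# expose the Cassandra table as a Spark DataFrame
samples <- spark_read_source(sc, name = "chosen_samples",
                             source = "org.apache.spark.sql.cassandra",
                             options = list(keyspace = "poc1_samples",
                                            table = "chosen_samples"))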
Right, I found an (admittedly ugly) way: simply calling Python from R, parsing the NAs manually, and re-assigning the data frame names in R, like this:
# install.packages("rPython")
# (don't forget to "pip install cql")
library(rPython)
python.exec("import sys")
# adding libraries from virtualenv
python.exec("sys.path.append('/Users/svend/dev/pyVe/playground/lib/python2.7/site-packages/')")
python.exec("import cql")
python.exec("connection=cql.connect('192.168.33.10', cql_version='3.0.0')")
python.exec("cursor = connection.cursor()")
python.exec("cursor.execute('use poc1_samples')")
python.exec("cursor.execute('select * from chosen_samples' )")
# coding python None into NA (rPython seems to just return nothing)
python.exec("rep = lambda x : '__NA__' if x is None else x")
python.exec( "def getData(): return [rep(num) for line in cursor for num in line ]" )
data <- python.call("getData")
df <- as.data.frame(matrix(unlist(data), ncol=15, byrow=T))
names(df) <- c("temperature", "maxTemp", "minTemp",
"dewpoint", "elevation", "gust", "latitude", "longitude",
"maxwindspeed", "precipitation", "seelevelpressure", "visibility", "windspeed")
# and decoding NA's
parsena <- function (x) if (x=="__NA__") NA else x
df <- as.data.frame(lapply(df, parsena))
Does anybody have a better idea?
I had the same error message when executing Rscript with an RJDBC connection via a batch file (R 3.2.4, Teradata driver).
Also, when run in RStudio it worked fine on the second run, but not the first.
What helped was explicitly calling:
library(rJava)
.jinit()
It is not enough to just download the driver; you also have to download the dependencies and put them into your Java classpath (macOS: /Library/Java/Extensions), as stated on the project main page.
Include the Cassandra JDBC dependencies in your classpath: download dependencies.
As for the RCassandra package, right now it's still too primitive compared to RJDBC.
