sparklyr hadoopConfiguration

I apologize that this question will be hard to make fully reproducible because it involves a running Spark context (referred to as sc below). I am trying to set a hadoopConfiguration in sparklyr, specifically for accessing Swift/ObjectStore objects from RStudio sparklyr as a Spark object, but in general for a Scala call to hadoopConfiguration. Something like (Scala code):
sc.hadoopConfiguration.set(f"fs.swift.service.$name.auth.url", "https://identity.open.softlayer.com/v3/auth/tokens")
where sc is a running spark context. In SparkR I can run (R code)
hConf = SparkR:::callJMethod(sc, "hadoopConfiguration")
SparkR:::callJMethod(hConf, "set", paste("fs.swift.service.keystone.auth.url"), paste("https://identity.open.softlayer.com/v3/auth/tokens",sep=""))
in sparklyr I have tried every incantation of this that I think of, but my best guess is (again R code)
sc %>% invoke("set", paste("fs.swift.service.keystone.auth.url"), paste("https://identity.open.softlayer.com/v3/auth/tokens",sep=""))
but this results in the non-verbose error (and irregular spelling) of
Error in enc2utf8(value) : argumemt is not a character vector
Of course I tried to encode the inputs in every way I can think of (naturally enc2utf8(value) being the first, but many others, including lists and as.character(as.list(...)), which appears to be a favorite for sparklyr coders). Any suggestions would be greatly appreciated. I have combed the source code for sparklyr and cannot find any mention of hadoopConfiguration in the sparklyr GitHub, so I am afraid that I am missing something very basic in the core configuration. I have also tried to pass these configs in the config.yml in the spark_connect() call, but while this works in setting "fs.swift.service.keystone.auth.url" as an sc$config$fs.swift.service.keystone.auth.url setting, it apparently fails to set it as a core hadoopConfiguration.
By the way, I am using Spark 1.6, Scala 2.10, R 3.2.1, and sparklyr_0.4.19.

I figured this out:
set_swift_config <- function(sc) {
  # get the underlying Spark context
  ctx <- spark_context(sc)

  # wrap it in a Java Spark context
  jsc <- invoke_static(
    sc,
    "org.apache.spark.api.java.JavaSparkContext",
    "fromSparkContext",
    ctx
  )

  # set the swift configs on the Hadoop configuration
  hconf <- jsc %>% invoke("hadoopConfiguration")
  hconf %>% invoke("set", "fs.swift.service.keystone.auth.url",
                   "https://identity.open.softlayer.com/v3/auth/tokens")
}
which can be run with set_swift_config(sc).
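For reference, a minimal usage sketch (the container and file names below are placeholders, not taken from the question):

library(sparklyr)

sc <- spark_connect(master = "local")
set_swift_config(sc)

# Swift paths should now resolve against the keystone service configured above, e.g.
# df <- spark_read_csv(sc, "mydata", "swift://mycontainer.keystone/myfile.csv")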

Related

Writing a partitioned parquet file with SparkR

I've got two scripts, one in R and a short second one in pyspark that uses the output. I'm trying to copy that functionality into the first script for simplicity.
The second script is very simple -- read a bunch of csv files and emit them as partitioned parquet:
spark.read.csv(path_to_csv, header = True) \
    .repartition(partition_column).write \
    .partitionBy(partition_column).mode('overwrite') \
    .parquet(path_to_parquet)
This should be equally simple in R but I can't figure out how to match the partitionBy functionality in SparkR. I've got this so far:
library(SparkR); library(magrittr)
read.df(path_to_csv, 'csv', header = TRUE) %>%
  repartition(col = .$partition_column) %>%
  write.df(path_to_parquet, 'parquet', mode = 'overwrite')
This successfully writes one parquet file for each value of partition_column. The issue is the emitted files have the wrong directory structure; whereas Python produces something like
/path/to/parquet/
partition_column=key1/
file.parquet.gz
partition_column=key2/
file.parquet.gz
...
R produces only
/path/to/parquet/
file_for_key1.parquet.gz
file_for_key2.parquet.gz
...
Am I missing something? The partitionBy function in SparkR appears to apply only in the context of window functions, and I don't see anything else in the manual that could be related. Perhaps there's a way to pass something in ..., but I don't see any examples in the documentation or from a search online.
Partitioning of the output is not supported in Spark <= 2.x.
However, it will be supported in SparkR >= 3.0.0 (SPARK-21291 - R partitionBy API), with the following syntax:
write.df(
  df, path_to_parquet, "parquet", mode = "overwrite",
  partitionBy = "partition_column"
)
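With that in place, the questioner's pipeline could be written in one pass, e.g. (a sketch assuming SparkR >= 3.0.0 and the paths/column names from the question):

library(SparkR); library(magrittr)

read.df(path_to_csv, "csv", header = TRUE) %>%
  repartition(col = .$partition_column) %>%
  write.df(path_to_parquet, "parquet", mode = "overwrite",
           partitionBy = "partition_column")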
Since the corresponding PR modifies only R files, you should be able to patch any SparkR 2.x distribution if upgrading to a development version is not an option:
git clone https://github.com/apache/spark.git
git checkout v2.4.3 # Or whatever branch you use
# https://github.com/apache/spark/commit/cb77a6689137916e64bc5692b0c942e86ca1a0ea
git cherry-pick cb77a6689137916e64bc5692b0c942e86ca1a0ea
R -e "devtools::install('R/pkg')"
In client mode this should be required only on the driver node. The cherry-pick and reinstall may emit some warnings, but these are not fatal and shouldn't cause any serious issues.

How to resolve "sql(sqlContext...)' is deprecated" warning in SparkR

I'm building a new version of some old code using SparkR. Given a block like this:
hiveContext <- sparkRHive.init(sc)
hive_db = 'our_database'
db <- sql(hiveContext, paste0("use ", hive_db))
I'm told that 'sparkRHive.init' is deprecated. Use 'sparkR.session' instead. So, okay, fine, I now have:
hiveContext <- sparkR.session(sc)
hive_db = 'our_database'
db <- sql(hiveContext, paste0("use ", hive_db))
This runs, but now Spark warns 'sql(sqlContext...)' is deprecated. Use 'sql(sqlQuery)' instead. I'm at a loss for what kind of input it's expecting here and would like to resolve this. Has anyone figured out what to do here?
Since Spark 2.0, sql and a number of other functions (like createDataFrame) don't require an SQLContext instance. Just:
sql(paste0("use ", hive_db))
Internally this will use getSparkSession to retrieve a session object.
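Putting it together, the original block can be modernized roughly like this (a sketch; enableHiveSupport = TRUE takes the place of the old Hive context):

library(SparkR)
sparkR.session(enableHiveSupport = TRUE)   # replaces sparkRHive.init(sc)

hive_db <- "our_database"
sql(paste0("use ", hive_db))               # no context argument needed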

When using SparkR to process data, where does the program actually run?

I am new to Spark and SparkR, and my question is below.
When I wrote the code below:
1) Set up the environment and start a Spark session:
sparkR.session(master = "my/spark/master/on/one/server/standaloneMode", sparkConfig = list(spark.driver.memory = "4g", spark.sql.warehouse.dir = "my/hadoop_home/bin", sparkPackages = "com.databricks:spark-avro_2.11:3.0.1"))
Then I wrote:
rund <- data.frame(V1 = runif(10000000,100,10000),V2 =runif(10000000,100,10000))
df <- as.DataFrame(rund)
Here is the thing:
1) Where does the program do the 'splitting'? On my local machine or on the server?
2) Also, could anyone tell me where exactly the program runs the code as.DataFrame()? On my computer or on the server that was set up in Spark standalone mode?
SparkR is an interface for Spark. This means that some R functions are overridden by the SparkR package to provide a user experience similar to what you already know from R. You should probably have a look at the documentation to see which Spark functions are available: https://spark.apache.org/docs/latest/api/R/index.html
Those functions typically ingest SparkDataFrames, which you can create, for example, with the as.DataFrame function. A SparkDataFrame is a reference to a distributed data frame that lives in your Spark cluster.
In your example you've created a local R data frame rund. The runif calls were also executed locally in your R instance.
# executed in your local R instance
rund <- data.frame(V1 = runif(10000000,100,10000),V2 =runif(10000000,100,10000))
The df object, however, is a SparkDataFrame, which will be created in your Spark cluster. as.DataFrame is called in your local R session, but the actual data will only exist in your cluster.
df <- as.DataFrame(rund)
To easily distinguish between R and Spark data frames, you can use the class function:
> class(rund)
[1] "data.frame"
> class(df)
[1] "SparkDataFrame"
attr(,"package")
[1] "SparkR"
In general a SparkDataFrame can be used as input for the various functions the SparkR package has to offer, for example to group or sort your SparkDataFrame in Spark. The Spark operations are executed when a Spark action is called. An example of such an action is collect: it triggers the transformations in Spark, retrieves the computed data from your Spark cluster, and creates a corresponding R data frame in your local R instance. If you have a look at the documentation you can see whether a function can ingest a SparkDataFrame:
## S4 method for signature 'SparkDataFrame'
collect(x, stringsAsFactors = FALSE)
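To make this concrete, a small sketch using the df from above; collect is what actually pulls the data back to the driver:

local_df <- collect(df)   # triggers execution in the cluster
class(local_df)           # back to a plain "data.frame" in your local R session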
Moreover, it is possible to execute custom R code in your Spark cluster by using user-defined functions: https://spark.apache.org/docs/latest/sparkr.html#applying-user-defined-function.
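For example, spark.lapply ships a plain R function to the workers and returns the results as an ordinary local list (a minimal sketch):

# the function is distributed and executed on the cluster, one element per task
squares <- spark.lapply(1:10, function(x) x^2)
str(squares)   # a regular local R list back on the driver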

Trying to find R equivalent for SetConf from Java

In Java, you can do something like:
sc.setConf('spark.sql.parquet.binaryAsString','true')
What would the equivalent be in R? I've looked at the methods available on the sc object and can't find any obvious way of doing this.
Thanks
You can set environment variables during SparkContext initialization. sparkR.init has a number of optional arguments, including:
sparkEnvir - a list of environment variables to set on worker nodes
sparkExecutorEnv - a list of environment variables to be used when launching executors
In your case something like this should do the trick:
sparkEnvir <- list('spark.sql.parquet.binaryAsString'='true')
sc <- sparkR.init(master, app_name, sparkEnvir=sparkEnvir)
I found the solution to the problem.
We can do the following:
sql(sqlContext,'SET spark.sql.parquet.binaryAsString=true')
This fixes everything.
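For anyone on Spark 2.x, where the sqlContext argument is itself deprecated, the same option can be set without a context (a sketch, not from the original answers):

library(SparkR)
# pass it as session configuration when the session is created...
sparkR.session(sparkConfig = list(spark.sql.parquet.binaryAsString = "true"))

# ...or set it at runtime with a context-free sql() call
sql("SET spark.sql.parquet.binaryAsString=true")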

How to read data from Cassandra with R?

I am using R 2.14.1 and Cassandra 1.2.11. I have a separate program which has written data to a single Cassandra table, and I am failing to read it from R.
The Cassandra schema is defined like this:
create table chosen_samples (id bigint , temperature double, primary key(id))
I have first tried the RCassandra package (http://www.rforge.net/RCassandra/)
> # install.packages("RCassandra")
> library(RCassandra)
> rc <- RC.connect(host ="192.168.33.10", port = 9160L)
> RC.use(rc, "poc1_samples")
> cs <- RC.read.table(rc, c.family="chosen_samples")
The connection seems to succeed, but the parsing of the table into a data frame fails:
> cs
Error in data.frame(..dfd. = c("#\"ffffff", "#(<cc><cc><cc><cc><cc><cd>", :
duplicate row.names:
I have also tried using JDBC connector, as described here: http://www.datastax.com/dev/blog/big-analytics-with-r-cassandra-and-hive
> # install.packages("RJDBC")
> library(RJDBC)
> cassdrv <- JDBC("org.apache.cassandra.cql.jdbc.CassandraDriver", "/Users/svend/dev/libs/cassandra-jdbc-1.2.5.jar", "`")
But this one fails like this:
Error in .jfindClass(as.character(driverClass)[1]) : class not found
Even though the location of the Java driver is correct:
$ ls /Users/svend/dev/libs/cassandra-jdbc-1.2.5.jar
/Users/svend/dev/libs/cassandra-jdbc-1.2.5.jar
You have to download apache-cassandra-2.0.10-bin.tar.gz, cassandra-jdbc-1.2.5.jar and cassandra-all-1.1.0.jar.
There is no need to install Cassandra on your local machine; just put the cassandra-jdbc-1.2.5.jar and cassandra-all-1.1.0.jar files in the lib directory of the unzipped apache-cassandra-2.0.10-bin.tar.gz. Then you can use
library(RJDBC)
drv <- JDBC("org.apache.cassandra.cql.jdbc.CassandraDriver",
            list.files("D:/apache-cassandra-2.0.10/lib",
                       pattern = "jar$", full.names = TRUE))
This works on my Unix machine but not on my Windows machine.
Hope that helps.
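From there, the connection and query would presumably look something like this (a sketch; the jdbc:cassandra URL scheme is what the old cassandra-jdbc driver used, and the host, port and keyspace are taken from the question):

library(RJDBC)
drv <- JDBC("org.apache.cassandra.cql.jdbc.CassandraDriver",
            list.files("D:/apache-cassandra-2.0.10/lib",
                       pattern = "jar$", full.names = TRUE))

conn <- dbConnect(drv, "jdbc:cassandra://192.168.33.10:9160/poc1_samples")
cs   <- dbGetQuery(conn, "select * from chosen_samples")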
This question is old now, but since it's one of the top hits for R and Cassandra I thought I'd leave a simple solution here, as I found frustratingly little up-to-date support for what I thought would be a fairly common task.
sparklyr makes this pretty easy to do from scratch now, as it exposes a Java context so the Spark Cassandra Connector can be used directly. I've wrapped up the bindings in this simple package, crassy, but it's not necessary to use it.
I mostly made it to demystify the config around making sparklyr load the connector, and because the syntax for selecting a subset of columns is a little unwieldy (assuming no Scala knowledge).
Column selection and partition filtering are supported. These were the only features I thought were necessary for general Cassandra use cases, given CQL can't be submitted directly to the cluster.
I've not found a solution for submitting more general CQL queries that doesn't involve writing custom Scala; however, there's an example of how this can work here.
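For reference, the bare sparklyr route (without crassy) looks roughly like this; the connector coordinates, versions and host below are assumptions, so adjust them to your cluster and Spark/Scala versions:

library(sparklyr)

conf <- spark_config()
# load the Spark Cassandra Connector as a Spark package (version is an assumption)
conf$sparklyr.defaultPackages <- "com.datastax.spark:spark-cassandra-connector_2.11:2.3.0"
conf$spark.cassandra.connection.host <- "192.168.33.10"

sc <- spark_connect(master = "local", config = conf)

# read the keyspace/table from the question through the connector's data source API
samples <- spark_read_source(
  sc, name = "chosen_samples",
  source = "org.apache.spark.sql.cassandra",
  options = list(keyspace = "poc1_samples", table = "chosen_samples")
)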
Right, I found an (admittedly ugly) way, simply by calling Python from R, parsing the NAs manually and re-assigning the data frame's names in R, like this:
# install.packages("rPython")
# (don't forget to "pip install cql")
library(rPython)
python.exec("import sys")
# adding libraries from virtualenv
python.exec("sys.path.append('/Users/svend/dev/pyVe/playground/lib/python2.7/site-packages/')")
python.exec("import cql")
python.exec("connection=cql.connect('192.168.33.10', cql_version='3.0.0')")
python.exec("cursor = connection.cursor()")
python.exec("cursor.execute('use poc1_samples')")
python.exec("cursor.execute('select * from chosen_samples' )")
# coding Python None into NA (rPython seems to just return nothing)
python.exec("rep = lambda x : '__NA__' if x is None else x")
python.exec( "def getData(): return [rep(num) for line in cursor for num in line ]" )
data <- python.call("getData")
df <- as.data.frame(matrix(unlist(data), ncol=15, byrow=T))
names(df) <- c("temperature", "maxTemp", "minTemp",
"dewpoint", "elevation", "gust", "latitude", "longitude",
"maxwindspeed", "precipitation", "seelevelpressure", "visibility", "windspeed")
# and decoding NA's
parsena <- function (x) if (x=="__NA__") NA else x
df <- as.data.frame(lapply(df, parsena))
Does anybody have a better idea?
I had the same error message when executing an Rscript with an RJDBC connection via a batch file (R 3.2.4, Teradata driver).
Also, when run in RStudio it worked fine on the second run but not the first.
What helped was to explicitly call:
library(rJava)
.jinit()
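In other words, for the Cassandra case above the full sequence would look something like this (a sketch, reusing the driver path from the question):

library(rJava)
.jinit()   # start the JVM explicitly before RJDBC touches it

library(RJDBC)
cassdrv <- JDBC("org.apache.cassandra.cql.jdbc.CassandraDriver",
                list.files("/Users/svend/dev/libs", pattern = "jar$", full.names = TRUE))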
It is not enough to just download the driver; you also have to download the dependencies and put them on your Java classpath (macOS: /Library/Java/Extensions), as stated on the project's main page:
Include the Cassandra JDBC dependencies in your classpath: download dependencies
As for the RCassandra package, right now it's still too primitive compared to RJDBC.
