How to store SparkR result into an R object?

I'm still new to the world of Azure Databricks, and the use of SparkR remains quite obscure to me, even for very simple tasks.
It took me a very long time to figure out how to count distinct values, and I'm not sure this is the right way to go:
library(SparkR)
sparkR.session()
DW <- sql("select * from db.mytable")
nb.per <- head(summarize(DW, n_distinct(DW$VAR)))
I thought I had found it, but nb.per is not a plain value; it is still a data.frame:
class(nb.per)
[1] "data.frame"
I then tried:
nb.per <- as.numeric(head(summarize(DW, n_distinct(DW$VAR))))
It seems to work, but I'm pretty sure there is a better way to achieve this?
Thanks !

Since you are using Spark SQL anyway, a very simple approach would be:
nb.per <- `[[`(SparkR::collect(SparkR::sql("select count(distinct VAR) from db.mytable")), 1)
Or, using the SparkR DataFrame API:
DW <- SparkR::tableToDF("db.mytable")
nb.per <- `[[`(SparkR::collect(SparkR::agg(DW, SparkR::countDistinct(SparkR::column("VAR")))), 1)

The SparkR::sql function returns a SparkDataFrame.
To use it as a regular R data.frame, you can simply coerce it:
as.data.frame(sql("select * from db.mytable"))
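Putting the two ideas together, here is a minimal sketch of getting the distinct count back as a plain R number (the table db.mytable and column VAR are just the placeholders from the question):
library(SparkR)
sparkR.session()
# aggregate on the Spark side, then pull the single value into base R
sdf <- SparkR::sql("SELECT count(DISTINCT VAR) AS n FROM db.mytable")
nb.per <- SparkR::collect(sdf)$n  # collect() returns a base R data.frame
class(nb.per)                     # "numeric"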

Related

Error in switch(class(arraydata), matrix = { : EXPR must be a vector of length 1

Has anyone encountered a similar error when using the ord function?
My code has not worked since I started using the new version of R (4.0.2), and I wonder whether it is related to made4 updates.
I would really appreciate it if someone could give me a hint.
This is the problematic piece of code:
res_coa <- ord(tdata, classvec=Species, type="coa")
where tdata is a matrix and Species is a factor.
I had the same problem and fixed it by converting my data into a data.frame. You can try something like:
res_coa <- ord(data.frame(tdata), classvec=Species, type="coa")

sparkR sql() returns string

We have Parquet data saved on a server and I am trying to use the SparkR sql() function in the following ways:
df <- sql("SELECT * FROM parquet.`<path to parquet file>`")
head(df)
show(df) # returns "<SQL> SELECT * FROM parquet.`<path to parquet file>`"
and
createOrReplaceTempView(df, "table")
df2 <- sql("SELECT * FROM table")
show(df2) # returns "<SQL> SELECT * FROM table"
In both cases what I get is the SQL query as a string instead of the Spark DataFrame. Does anyone have any idea why this happens and why I don't get the DataFrame?
Thank you
Don't use the show statement; use showDF() instead, or View(head(df2, num = 20L)).
This was a very silly problem. The solution is just to use the fully qualified name of the function,
SparkR::sql(...)
instead of the short name. Apparently the function sql is masked by another package.
The show method documentation in SparkR states that if eager evaluation is not enabled, it returns the class and type information of the Spark object, so you should use showDF instead.
Besides that, apparently, the sql method is masked in your R session, and you should call it with an explicit package declaration, like this:
df <- SparkR::sql("select 1")
showDF(df)
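A quick way to confirm the masking in your own session is to ask which attached packages define an object called sql (the packages shown in the comment are only an example; the output depends on what you have loaded):
find("sql")
# e.g. [1] "package:dbplyr"  "package:SparkR"
# The first match on the search path wins, so an unqualified sql() call
# may dispatch to another package's version rather than SparkR's.
df <- SparkR::sql("select 1")  # qualify the call to be explicit
showDF(df)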

select command not working in R even after installing the library dplyr

Error message: could not find function "select"
After installing the dplyr package, which contains the select function, this error is not expected, but I am still getting it.
I want to select a particular column of the dataset, but the dollar-sign operator is not working either.
I think I've had this problem as well and I'm not sure what causes it. However, I can usually solve the problem by specifying the package before the command as in the code below.
dplyr::select()
Hope this helps.
#THATguy nailed it! That will solve your problem. This error is often caused by multiple loaded packages exporting a function with the same name. In this case specifically, the function "select" exists in both 'dplyr' and 'MASS'. If you just type select in your code, it is likely to pick up the MASS version; if your intention is to select only certain columns of a data frame, you want the select from 'dplyr'. For example:
df <- read.csv("df.csv") %>%  # bring in the data frame
  dplyr::select(-x, -y, -z)   # remove the x, y, and z columns from the data frame
Or, if you want to keep certain columns, drop the '-' in front of the variable names.
There are various ways you can try to solve this problem:
Restart the R session with Ctrl + Shift + F10.
Use dplyr::select() if that's the select function you want.
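To illustrate the masking described above, a short sketch (the data frame here is just a made-up example):
library(dplyr)
library(MASS)  # MASS also exports select(), which now masks dplyr::select()
df <- data.frame(x = 1:3, y = 4:6, z = 7:9)
# select(df, x)           # fails or misbehaves because MASS::select() wins
dplyr::select(df, x, y)   # qualifying the call resolves the conflict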

Error: could not find function "includePackage"

I am trying to execute the random forest algorithm on SparkR, with Spark 1.5.1 installed. I don't have a clear idea why I am getting the error:
Error: could not find function "includePackage"
Further, even if I use the mapPartitions function in my code, I get the error:
Error: could not find function "mapPartitions"
Please find the code below:
rdd <- SparkR:::textFile(sc, "http://localhost:50070/explorer.html#/Datasets/Datasets/iris.csv", 5)
includePackage(sc, randomForest)
rf <- mapPartitions(rdd, function(input) {
  ## my function code for RF
})
This is more of a comment and a follow-up question than an answer (I am not allowed to comment because of reputation), but to take this further: if we use the collect method to convert the RDD back into an R data.frame, isn't that counterproductive? If the data is too large, it would take too long to process in R.
Also, does this mean we could use any R package, say markovChain or neuralnet, with the same methodology?
Kindly check the functions that can be used in SparkR: http://spark.apache.org/docs/latest/api/R/index.html
This list does not include mapPartitions() or includePackage().
# For reading csv in SparkR
sparkRdf <- read.df(sqlContext, "./nycflights13.csv",
                    "com.databricks.spark.csv", header = "true")
# A possible way to use `randomForest` is to convert the SparkR data frame to an R data frame
Rdf <- collect(sparkRdf)
# compute as usual in R code
install.packages("randomForest")
library(randomForest)
......
# convert back to a SparkR data frame
sparkRdf <- createDataFrame(sqlContext, Rdf)

Use neo4j with R

Is there an R library that supports Neo4j? I would like to construct an R graph (e.g. igraph) from Neo4j or, vice versa, store an R graph in Neo4j.
More precisely, I am looking for something similar to bulbflow for Python.
Update
There is a new Neo4j driver for R that looks promising: http://nicolewhite.github.io/RNeo4j/. I have changed the accepted answer.
This link might be helpful. I'm going to connect Neo4j with R in the coming days and will try the provided link first. Hope it helps.
I tried it out and it works well. Here is the function that works.
First, install and load the packages, then define the function:
install.packages('RCurl')
install.packages('RJSONIO')
library('bitops')
library('RCurl')
library('RJSONIO')
query <- function(querystring) {
  h <- basicTextGatherer()
  curlPerform(url = "localhost:7474/db/data/ext/CypherPlugin/graphdb/execute_query",
              postfields = paste('query', curlEscape(querystring), sep = '='),
              writefunction = h$update,
              verbose = FALSE)
  result <- fromJSON(h$value())
  # print(result)
  data <- data.frame(t(sapply(result$data, unlist)))
  names(data) <- result$columns  # set column names before returning
  print(data)
  data
}
And this is an example of calling the function:
q <- "start a = node(50) match a-->b RETURN b"
data <- query(q)
Consider the RNeo4j driver. The function shown above is incomplete: it cannot return single column data and there is no NULL handling.
https://github.com/nicolewhite/RNeo4j
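As a rough sketch of what using RNeo4j looks like (the connection URL and the query are placeholders, and the exact API may differ between versions):
library(RNeo4j)
# connect to a local Neo4j instance
graph <- startGraph("http://localhost:7474/db/data/")
# run a Cypher query and get the result back as an R data.frame
df <- cypher(graph, "MATCH (a)-->(b) RETURN b.name LIMIT 10")
head(df)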
I tried to use the R script (thanks a lot for providing it), and it seems to me that you can directly use:
/db/data/cypher
instead of
db/data/ext/CypherPlugin/graphdb/execute_query
(with Neo4j 2.0).
Not sure if it fits your requirements, but have a look at Gephi:
http://gephi.org/
