R callback functions using sparklyr

I hope to use Spark's mapPartitions and reduce functions (http://spark.apache.org/docs/latest/programming-guide.html) from sparklyr.
This is easy in pyspark: all I need is plain Python code, and I can simply pass Python functions as callbacks.
For example, in pyspark, I can use those two functions as follows:
mapdata = self.rdd.mapPartitions(mycbfunc1(myparam1))
res = mapdata.reduce(mycbfunc2(myparam2))
However, this does not seem possible in R, for example with the sparklyr library. I checked RSpark, but it appears to be just another way of querying/wrangling data in R, nothing more.
I would appreciate it if someone could let me know how to use these two functions in R, with R callback functions.

In SparkR you could use internal functions - hence the prefix SparkR::: - to accomplish the same.
newRdd = SparkR:::toRDD(self)
mapdata = SparkR:::mapPartitions(newRdd, function(x) { mycbfunc1(x, myparam1) })
res = SparkR:::reduce(mapdata, function(x, y) { mycbfunc2(x, y, myparam2) })
I believe sparklyr interfaces only with the DataFrame / DataSet API.
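For context, a minimal end-to-end sketch, assuming Spark 2.x and the hypothetical callbacks mycbfunc1()/mycbfunc2() from the question. The ::: functions are unexported internals, so they may change or disappear between Spark releases; note also that reduce expects a binary function:
library(SparkR)
sparkR.session()
df <- createDataFrame(faithful)  # any SparkDataFrame will do
rdd <- SparkR:::toRDD(df)        # drop down to the internal RDD API
mapdata <- SparkR:::mapPartitions(rdd, function(part) mycbfunc1(part, myparam1))
res <- SparkR:::reduce(mapdata, function(x, y) mycbfunc2(x, y, myparam2))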

Related

Choose function to load from an R package

I like using the function reshape from the matlab package, but I then need to specify base::sum(m) each time I want to sum the elements of my matrix, or else matlab::sum is called, which only sums by columns...
I need to load the package gtools to use the rdirichlet function, but then gtools::logit masks pracma::logit, which I like better...
I guess there is no such thing as:
library(loadOnly = "rdirichlet", from = "gtools")
or
library(loadEverythingFrom = "matlab", except = "sum")
... because functions from the matlab package may internally rely on matlab::sum, so the latter must be loaded. But is there no way to get this behavior from the user's point of view? Something that would feel like:
library(pracma)
library(matlab)
library(gtools)
sum <- base::sum
logit <- pracma::logit
... but that would not spoil your ls() with all these small utility functions?
Maybe I need to define my own default namespace?
To avoid spoiling your ls, you can do something like this:
.ns <- new.env()
.ns$sum <- base::sum
.ns$logit <- pracma::logit
attach(.ns)
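Once attached, .ns sits just behind the global environment on the search path, ahead of the packages, so these bindings win without showing up in your workspace:
ls()                  # no sum or logit cluttering the global environment
sum(matrix(1:4, 2))   # 10: base::sum, not matlab's column-wise sum
logit(0.5)            # 0: pracma::logit
As an aside, R 3.6.0 and later support this directly: library() gained include.only and exclude arguments, e.g. library(gtools, include.only = "rdirichlet") or library(matlab, exclude = "sum").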
To my knowledge there is no easy answer to what you want to achieve. The only dirty hack I can think of is to download the source of the packages "matlab", "gtools", "pracma" and delete the offending functions from their NAMESPACE file prior to installation from source (with R CMD INSTALL package).
However, I would recommend using the explicit notation pracma::logit, because it improves readability of your code for other people and yourself in the future.
This site gives a good overview of package namespaces:
http://r-pkgs.had.co.nz/namespace.html

R Functions require package declaration when they are included from another file?

I am writing some data manipulation scripts in R, and I finally decided to create an external .r file and call my functions from there. But it started giving me problems when I try to call some functions. A simple example:
This one works with no problem:
change_column_names <- function(file, new_columns, seperation){
  new_data <- read.table(file, header = TRUE, sep = seperation)
  colnames(new_data) <- new_columns
  write.table(new_data, file = file, sep = seperation, quote = FALSE, row.names = FALSE)
}
change_column_names("myfile.txt",c("Column1", "Column2", "Cost"),"|")
When I create a file "data_manipulation.r", put the above change_column_names function in there, and do this:
sys.source("data_manipulation.r")
change_column_names("myfile.txt",c("Column1", "Column2", "Cost"),"|")
it does not work: it gives me a could not find function "read.table" error. I fixed it by changing the calls to utils:::read.table and utils:::write.table.
But this is getting frustrating. Now I have the same issue with the aggregate function, and I do not even know which package it belongs to.
My questions: Which package does aggregate belong to? How can I easily find out which package a function comes from? Is there a cleaner way to handle this issue?
By default, sys.source() evaluates in baseenv(), whose parent is the empty environment, so packages on the search path (such as utils, where read.table lives) are not visible to the sourced functions. You probably should just be using source() instead, which evaluates in the global environment.
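A minimal sketch of both fixes, using the file name from the question:
source("data_manipulation.r")
# or, if you want to keep sys.source(), point it at an environment that
# can see the attached packages:
sys.source("data_manipulation.r", envir = globalenv())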
You can also see where functions come from by looking at their environment.
environment(aggregate)
# <environment: namespace:stats>
For the first part of your question: If you want to find what package a function belongs to, and that function is working properly you can do one of two (probably more) things:
1.) Access the help files:
?aggregate will show the package it belongs to at the top of the help file.
Another way is to simply type aggregate without any arguments into the R console:
> aggregate
function (x, ...)
UseMethod("aggregate")
<bytecode: 0x7fa7a2328b40>
<environment: namespace:stats>
The namespace is the package it belongs to.
2.) Both of the functions you are having trouble with live in packages that are attached in every normal R session (read.table in utils, aggregate in stats), so they should always be available. I was not able to recreate the issue. Try using source instead of sys.source and let me know if it alleviates your error.
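For the record, two more ways to locate a function's home package, both from base R's utils package:
find("aggregate")         # "package:stats"
getAnywhere("aggregate")  # also finds unexported or masked definitions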

How to parallelize a for loop in R

I have this loop and I'm wondering about the different ways to parallelize it:
for (i in 1:nrow(dataset)){
  dataset$dayDiff[i] = dataset$close[i] - dataset$open[i]
}
I was thinking of using lapply but I don't see how to use a list in this context. Maybe I would use foreach from the foreach package with a parallel backend, but I don't know how to use it.
There is no good reason to use a loop here. Simply do dataset$dayDiff <- dataset$close - dataset$open. R is vectorized.
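For completeness, here is a hedged sketch of the foreach route the question mentions, using a doParallel backend (dataset and its columns are taken from the question). It is purely illustrative: for an element-wise difference the vectorized one-liner above is both simpler and faster, because each iteration does far too little work to repay the parallel overhead.
library(doParallel)   # also attaches foreach and parallel
cl <- makeCluster(2)  # two worker processes
registerDoParallel(cl)
dataset$dayDiff <- foreach(i = seq_len(nrow(dataset)), .combine = c) %dopar% {
  dataset$close[i] - dataset$open[i]  # one difference per iteration
}
stopCluster(cl)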

Creating R package containing a dataset and an R function which uses the data

I am creating an R package containing a dataset and an R function which uses the data.
The R function looks like this:
myFun <- function(iobs){
  data(MyData)
  return(MyData[iobs,])
}
When I do the usual "R CMD check myPack" business, it gives me a NOTE saying
* checking R code for possible problems ... NOTE
myFun: no visible binding for global variable ‘MyData’
Is there a way to fix this problem?
You can use lazy-loading for this.
Just put
LazyData: yes
in your DESCRIPTION file and remove
data(MyData)
from your function.
Due to lazy-loading, your MyData object will be available in your package's namespace, so there is no need to call data().
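With lazy-loading in place, the function body reduces to the subsetting itself. If R CMD check still emits the no visible binding NOTE, declaring the dataset via utils::globalVariables() at the top level of your package code is the usual way to silence it:
myFun <- function(iobs){
  MyData[iobs, ]
}
utils::globalVariables("MyData")  # optional: suppresses the check NOTE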
Two alternatives to the lazy-data approach. Both rely on the list argument of data():
data(list = 'MyData')
Define it as a default argument of the function (perhaps not ideal, since the caller can then override it):
myFun <- function(iobs,
                  myData = local({ data(list = 'MyData', envir = environment()); MyData })){
  # local() loads the dataset into a throwaway environment and returns the
  # object itself; a bare data(list = 'MyData') would only return its name
  return(myData[iobs, ])
}
Load into an empty environment, then extract using [[:
myFun2 <- function(iobs){
  e <- new.env(parent = emptyenv())
  data(list = 'MyData', envir = e)
  e[['MyData']][iobs, ]
}
Note that e$MyData[iobs, ] should also work.
I would also suggest using drop = FALSE as safe practice, so the result retains the same class as MyData, e.g.
MyData[iobs, , drop = FALSE]
This may not be an issue given the specifics of this function and the structure of MyData, but it is good programming practice, especially within packages where you want robust code.
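A quick illustration of what drop controls, with a throwaway matrix:
m <- matrix(1:6, nrow = 3)
m[1, ]                # drops to a plain vector
m[1, , drop = FALSE]  # stays a 1 x 2 matrix, same class as m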

Use neo4j with R

Is there an R library that supports neo4j? I would like to construct an R graph (e.g. igraph) from neo4j or, vice versa, store an R graph in neo4j.
More precisely, I am looking for something similar to bulbflow for Python.
Update
There is a new neo4j driver for R that looks promising: http://nicolewhite.github.io/RNeo4j/. I have changed the accepted answer.
This link might be helpful. I'm going to connect neo4j with R in the coming days and will first try the approach from the provided link. Hope it helps.
I tried it out and it works well. Here is the function that works.
First, install and load the packages, then define the function:
install.packages('RCurl')
install.packages('RJSONIO')
library('bitops')
library('RCurl')
library('RJSONIO')
query <- function(querystring) {
  h <- basicTextGatherer()
  curlPerform(url = "localhost:7474/db/data/ext/CypherPlugin/graphdb/execute_query",
              postfields = paste('query', curlEscape(querystring), sep = '='),
              writefunction = h$update,
              verbose = FALSE)
  result <- fromJSON(h$value())
  data <- data.frame(t(sapply(result$data, unlist)))
  names(data) <- result$columns  # set the column names before returning
  data                           # return the data frame itself
}
and this is an example of calling the function:
q <- "start a = node(50) match a-->b RETURN b"
data <- query(q)
Consider the RNeo4j driver instead. The function shown above is incomplete: it cannot return single-column data and there is no NULL handling.
https://github.com/nicolewhite/RNeo4j
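A minimal sketch with RNeo4j, assuming a local server and reusing node 50 from the example above (startGraph() and cypher() are the entry points documented at the link):
library(RNeo4j)
graph <- startGraph("http://localhost:7474/db/data/")
data <- cypher(graph, "START a = node(50) MATCH (a)-->(b) RETURN b.name")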
I tried to use the R script (thanks a lot for providing it) and it seems to me that, with neo4j 2.0, you can directly use /db/data/cypher instead of db/data/ext/CypherPlugin/graphdb/execute_query.
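A hedged sketch of that adjustment: the /db/data/cypher endpoint takes a JSON body rather than form-encoded fields, so the curlPerform() call changes roughly as follows (untested; the exact payload format is an assumption):
query2 <- function(querystring) {
  h <- basicTextGatherer()
  curlPerform(url = "http://localhost:7474/db/data/cypher",
              httpheader = c('Content-Type' = 'application/json'),
              postfields = toJSON(list(query = querystring)),  # JSON, not form-encoded
              writefunction = h$update)
  fromJSON(h$value())
}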
Not sure if it fits your requirements, but have a look at Gephi: http://gephi.org/.
