Defining a Torch Class in R package "torch"

This post is related to my earlier question, "How to define a Python Class which uses R code, but called from rTorch?".
I came across the torch package in R (https://torch.mlverse.org/docs/index.html), which allows defining a Dataset class. However, I also need to be able to define a model class, like class MyModelClass(torch.nn.Module) in Python. Is this possible in the torch package in R?
When I tried to do it with reticulate it did not work; there were conflicts like:
ImportError: /User/homes/mreichstein/miniconda3/envs/r-torch/lib/python3.6/site-packages/torch/lib/libtorch_python.so: undefined symbol: _ZTINSt6thread6_StateE
It also would not make much sense, since torch isn't wrapping Python.
But it also loses a lot of the flexibility that rTorch has (though see my problem in the post linked above).
Thanks for any help!
Markus

You can do that directly using R's torch package, which seems quite comprehensive, at least for basic tasks.
Neural networks
Here is how to create the equivalent of nn.Sequential:
library(torch)

# example layer sizes (choose values that fit your data)
D_in <- 10; H <- 32; D_out <- 1

model <- nn_sequential(
  nn_linear(D_in, H),
  nn_relu(),
  nn_linear(H, D_out)
)
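To sanity-check the sequential model, you can run a forward pass on a random batch (a minimal sketch; the batch size of 16 is arbitrary):
y_pred <- model(torch_randn(16, D_in))
y_pred$shape  # 16 x D_out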
Below is a custom nn_module (the analogue of torch.nn.Module) implementing a simple dense (torch.nn.Linear-like) layer (source):
library(torch)

dense <- nn_module(
  classname = "dense",
  # the initialize function runs whenever we instantiate the model
  initialize = function(in_features, out_features) {
    # just so you see when this function is called
    cat("Calling initialize!")
    # we use nn_parameter to indicate that these tensors are special
    # and should be treated as parameters by `nn_module`.
    self$w <- nn_parameter(torch_randn(in_features, out_features))
    self$b <- nn_parameter(torch_zeros(out_features))
  },
  # this function is called whenever we call our model on input.
  forward = function(x) {
    cat("Calling forward!")
    torch_mm(x, self$w) + self$b
  }
)
model <- dense(3, 1)
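Calling the instance then runs forward(); here is a minimal usage sketch (the input just matches the 3-in, 1-out layer created above):
x <- torch_randn(2, 3)  # 2 observations with 3 features
model(x)                # prints "Calling forward!" and returns a 2 x 1 tensor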
Another example, this time using nn_linear (torch.nn.Linear) layers to create a neural network (source):
two_layer_net <- nn_module(
  "two_layer_net",
  initialize = function(D_in, H, D_out) {
    self$linear1 <- nn_linear(D_in, H)
    self$linear2 <- nn_linear(H, D_out)
  },
  forward = function(x) {
    x %>%
      self$linear1() %>%
      nnf_relu() %>%
      self$linear2()
  }
)
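As a quick illustration that the training building blocks mentioned further down are available too, here is a minimal sketch (not from the original answer) that instantiates two_layer_net and runs a single SGD step on random data; the dimensions and learning rate are arbitrary:
net <- two_layer_net(D_in = 3, H = 16, D_out = 1)
x <- torch_randn(8, 3)
y <- torch_randn(8, 1)
opt <- optim_sgd(net$parameters, lr = 0.01)
opt$zero_grad()
loss <- nnf_mse_loss(net(x), y)
loss$backward()
opt$step()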
There are also other resources, like here (using control flow and weight sharing).
Other
Looking at the reference, it seems most of the layers are already provided (I didn't notice transformer layers at a quick glance, but this is minor).
As far as I can tell, the basic building blocks for neural networks, their training, etc. are in place (even JIT, so sharing models between languages should be possible).

Related

What's the most simple approach to name-spacing R files with `file::function`

Criteria for answer to this question
Given the following function (within its own script)
# something.R
hello <- function(x){
  paste0("hello ", x)
}
What is the minimal amount of setup that will enable the following?
library(something)
x <- something::hello('Sue')
# x now has value: "hello Sue"
Context
In Python it's very simple to have a directory containing some code and utilise it as
# here foo is a directory
from foo import bar
bar( ... )
I'm not sure how to do something similar in R though.
I'm aware there's source(file.R), but this puts everything into the global namespace. I'm also aware there's library(package), which provides package::function. What I'm not sure about is whether there's a simple approach to using this namespacing within R. The packaging tutorials I've found seem quite involved (in comparison to Python).
I don't know if there is a real benefit in creating a namespace just for one quick function; it is just not the way it is supposed to be done (I think).
But anyway, here is a rather minimalistic solution:
First install once: install.packages("namespace")
The function you wanted to call in the namespace:
hello <- function(x){
  paste0("hello ", x)
}
Create your namespace, assign the function, and export it:
ns <- namespace::makeNamespace("newspace")
assign("hello", hello, envir = ns)
base::namespaceExport(ns, ls(ns))
Now you can call your function with your new namespace:
newspace::hello("you")
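which should return:
#> [1] "hello you"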
Here's the quickest workflow I know to produce a package, using RStudio. The default package already contains a hello function, which I overwrote with your code.
Notice there was also a box "create package based on source files", which I didn't use but you might.
A package done this way will contain exported, undocumented, untested functions.
If you want to learn how to document functions, choose what to export, write tests and run checks, include objects other than functions, include compiled code, or share on GitHub or CRAN, this book describes the workflow used by thousands of users and is designed so you can usually read sections independently.
If you don't want to do it from the GUI, you can use utils::package.skeleton() to build a package folder, and remotes::install_local() to install it:
Reproducible setup
# create a file containing the function definition
# (this is where your current function is located)
function_path <- tempfile(fileext = ".R")
cat('
hello <- function(x){
  paste0("hello ", x)
}
', file = function_path)
# where you store your package code
package_path <- tempdir()
Solution:
# create the package directory at the given location
package.skeleton("something", code_file = function_path, path = package_path)
# remove the sample doc to make remotes::install_local happy
unlink(file.path(package_path, "something", "man/"), recursive = TRUE)
# install the package
remotes::install_local(file.path(package_path, "something"))
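After installation, a quick check that the namespaced call from the question works as intended:
library(something)
x <- something::hello('Sue')
x
#> [1] "hello Sue"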

R Package Development - Global Variables vs Parameter [duplicate]

I'm developing a package in R. I have a bunch of functions, some of them need some global variables. How do I manage global variables in packages?
I've read something about environments, but I do not understand how they work, or if this is even the way to go about things.
You can use package-local variables through an environment. These variables will be available to multiple functions in the package, but not (easily) accessible to the user, and they will not interfere with the user's workspace. A quick and simple example is:
pkg.env <- new.env()
pkg.env$cur.val <- 0
pkg.env$times.changed <- 0

inc <- function(by = 1) {
  pkg.env$times.changed <- pkg.env$times.changed + 1
  pkg.env$cur.val <- pkg.env$cur.val + by
  pkg.env$cur.val
}

dec <- function(by = 1) {
  pkg.env$times.changed <- pkg.env$times.changed + 1
  pkg.env$cur.val <- pkg.env$cur.val - by
  pkg.env$cur.val
}

cur <- function() {
  cat('the current value is', pkg.env$cur.val, 'and it has been changed',
      pkg.env$times.changed, 'times\n')
}

inc()
inc()
inc(5)
dec()
dec(2)
inc()
cur()
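For reference, running the sequence of calls above leaves pkg.env$cur.val at 5 after 6 changes (0 + 1 + 1 + 5 - 1 - 2 + 1), so cur() prints:
#> the current value is 5 and it has been changed 6 times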
You could set an option, e.g.
options("mypkg-myval"=3)
1+getOption("mypkg-myval")
[1] 4
In general, global variables are evil. The underlying principle of why they are evil is that you want to minimize the interconnections in your package. These interconnections often cause functions to have side effects, i.e. the outcome depends not only on the input arguments but also on the value of some global variable. Especially when the number of functions grows, this can be hard to get right and hell to debug.
For global variables in R see this SO post.
Edit in response to your comment:
An alternative could be to just pass around the needed information to the functions that need it. You could create a new object which contains this info:
token_information = list(token1 = "087091287129387",
                         token2 = "UA2329723")
and require all functions that need this information to have it as an argument:
do_stuff = function(arg1, arg2, token)
do_stuff(arg1, arg2, token = token_information)
In this way it is clear from the code that token information is needed in the function, and you can debug the function on its own. Furthermore, the function has no side effects, as its behavior is fully determined by its input arguments. A typical user script would look something like:
token_info = create_token(token1, token2)
do_stuff(arg1, arg2, token_info)
I hope this makes things more clear.
The question is unclear:
Just one R process or several?
Just on one host, or across several machines?
Is there common file access among them or not?
In increasing order of complexity, I'd use a file, a SQLite backend via the RSQLite package, or (my favourite :) the rredis package to set to / read from a Redis instance.
You could also create a list of tokens and add it to R/sysdata.rda with usethis::use_data(..., internal = TRUE) (see the sketch after this list). The data in this file is internal, but accessible by all functions. The only problem would arise if you only wanted some functions to access the tokens, which would be better served by:
the environment solution already proposed above; or
creating a hidden helper function that holds the tokens and returns them. Then just call this hidden function inside the functions that use the tokens, and (assuming it is a list) you can inject them to their environment with list2env(..., envir = environment()).
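As a minimal sketch of the sysdata.rda approach (run from the package root; the token values are just the placeholders from the earlier example):
tokens <- list(token1 = "087091287129387", token2 = "UA2329723")
usethis::use_data(tokens, internal = TRUE)
# package functions can now refer to `tokens` directly; it is not exported to users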
If you don't mind adding a dependency to your package, you can use an R6 object from the package of the same name, as suggested in the comments to @greg-snow's answer.
R6 objects are actual environments with the possibility of adding public and private methods; they are very lightweight and can be a good, more rigorous option for sharing a package's global variables without polluting the global environment.
Compared to @greg-snow's solution, it allows stricter control of your variables (you can add methods that check for types, for example). The drawbacks are the dependency and, of course, learning the R6 syntax.
library(R6)

MyPkgOptions = R6::R6Class(
  "mypkg_options",
  public = list(
    get_option = function(x) private$.options[[x]]
  ),
  active = list(
    var1 = function(x) {
      if (missing(x)) private$.options[['var1']]
      else stop("This is an environment parameter that cannot be changed")
    },
    var2 = function(x) {
      if (missing(x)) private$.options[['var2']]
      else stop("This is an environment parameter that cannot be changed")
    }
  ),
  private = list(
    .options = list(
      var1 = 1,
      var2 = 2
    )
  )
)
# Create an instance
mypkg_options = MyPkgOptions$new()
# Fetch values from active fields
mypkg_options$var1
#> [1] 1
mypkg_options$var2
#> [1] 2
# Alternative way
mypkg_options$get_option("var1")
#> [1] 1
mypkg_options$get_option("var3")
#> NULL
# Variables are locked unless you add a method to change them
mypkg_options$var1 = 3
#> Error in (function (x) : This is an environment parameter that cannot be changed
Created on 2020-05-27 by the reprex package (v0.3.0)

Parallel computing when using ‘CORElearn' in R

I'm using ReliefF for feature selection (via the package called "CORElearn"). It worked very well before, but later on I wanted to speed up my code. Since I use bootstrapping (each loop does exactly the same thing, including running ReliefF), I'm using the 'parallel' package for parallel computing. But I realized that every time it comes to the ReliefF part, the code just gets stuck there.
The relevant code is as follows:
library(parallel)
library(CORElearn)

num.round <- 10 # number of rounds for bootstrap
rounds.btsp <- seq(1, num.round) # sequence of round numbers, used for parallel computing

boot.strap <- function(round.btsp) {
  ## some code using other feature selection methods
  print('Finish feature selection using other methods') # I can get this output

  # use ReliefF to rank the features
  data.ref <- data.frame(t(x.train.resample), y.train.resample, check.names = F) # check.names = F avoids changing '-' to '.'
  print('Start using attrEval') # I get this output, but then I get stuck here
  estReliefF <- attrEval('y.train.resample', data.ref, estimator = 'ReliefFexpRank', ReliefIterations = 30)
  names(estReliefF) <- fea.name # needed because 'attrEval' annoyingly changes the '-' in the names to '.'
  print('Start using estReliefF') # I never get here
  fea.rank.ref <- estReliefF[order(abs(estReliefF), decreasing = T)]
  fea.rank.ref <- data.frame(importance = fea.rank.ref)
  fea.rank.name.ref <- rownames(fea.rank.ref) # the ranked feature list for this round
  return(fea.rank.name.ref)
}

results.btsp <- mclapply(rounds.btsp, boot.strap, mc.cores = num.round)
What I'm thinking now is that the function 'attrEval' itself uses multiple cores for parallel computing (I read that in the documentation: https://cran.r-project.org/web/packages/CORElearn/CORElearn.pdf), so there is somehow a conflict with the 'parallel' package that I'm using. When I change 'num.round' to 1 there's no problem running the code (but even if I set it to 2, it won't work).
The server that I'm using has 80 cores.
Is there a way to solve this? I'm thinking that shutting down the parallel computing inside the function 'attrEval' might be a solution, even though I don't know how to do that.
Having multiple levels of parallelism can be tricky. Unfortunately, the CORElearn package does not let you directly control the number of threads it uses. Since it uses OpenMP for parallel execution, you can try to set the environment variable OMP_NUM_THREADS appropriately, e.g.
Sys.setenv(OMP_NUM_THREADS = 8)
num.round <- 10
This way there should be 10 groups of 8 cores, each group handling one bootstrap round.
I got a solution from Marko, a contributor to the package: disable multithreading in CORElearn by passing the parameter maxThreads = 1 to 'attrEval'.
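Applied to the call in the question, that would look like this (a sketch; all other arguments unchanged):
estReliefF <- attrEval('y.train.resample', data.ref,
                       estimator = 'ReliefFexpRank',
                       ReliefIterations = 30,
                       maxThreads = 1)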

Creating R package containing a dataset and an R function which uses the data

I am creating an R package containing a dataset and an R function which uses the data.
The R function looks like this:
myFun <- function(iobs){
  data(MyData)
  return(MyData[iobs, ])
}
When I do the usual "R CMD check myPack" business, it gives me the following NOTE:
* checking R code for possible problems ... NOTE
myFun: no visible binding for global variable ‘MyData’
Is there way to fix this problem?
You can use lazy-loading for this.
Just put
LazyData: yes
in your DESCRIPTION file and remove
data(MyData)
from your function.
Due to lazy-loading, your MyData object will be available in your package's namespace, so there is no need to call data().
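With that change, the function from the question reduces to (a sketch):
myFun <- function(iobs){
  MyData[iobs, ]
}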
Two alternatives to the lazy-data approach. Both rely on using the list argument to data():
data(list = 'MyData')
Define it as a default argument of the function (maybe not ideal, as it can then be changed):
myFun <- function(iobs, myData = data(list = 'MyData')){
  return(myData[iobs, ])
}
Load it into an empty environment, then extract it using [[:
myFun2 <- function(iobs){
  e <- new.env(parent = emptyenv())
  data(list = 'MyData', envir = e)
  e[['MyData']][iobs, ]
}
Note that e$MyData[iobs, ] should also work.
I would also suggest setting the drop argument explicitly as safe practice: drop = FALSE retains the same class as MyData, e.g. MyData[iobs, , drop = FALSE]. This may not be an issue given the specifics of this function and the structure of MyData, but it is good programming practice, especially within packages where you want robust code.

