I am trying to load some source files so that their functions are available on cluster nodes (I am experimenting with clusterEvalQ from the "parallel" package). The problem is that, for some reason, some of the functions that are normally loaded when I simply call source() from within a script are not loaded when I do it via clusterEvalQ(). I am sourcing files that contain multiple function definitions on the compute nodes with clusterEvalQ() - apparently there is this "tail" argument at the end of the source function that prevents the last function from being loaded. How do I go about fixing that?
I have seen that there is another question addressing the same issue, but my problem is different: everything is loaded except the last thing defined in the source file.
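Roughly what I am doing ("myFunctions.R" stands in for my actual source files):

library(parallel)
cl <- makeCluster(4)
clusterEvalQ(cl, source("myFunctions.R"))   # most functions show up on the nodes, but the last one defined in the file does not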
Thank you for improving the formatting, guys - I rarely ask questions on Stack Overflow.
My current workaround is to put a dummy empty function in the most important source files.
The clusterEvalQ function is originally part of the snow package, which parallel uses to do some of its parallelization. In that package there are two functions typically used to "pass stuff to nodes". These are:
1) clusterEvalQ
This function is used to evaluate expressions on each node. Typically used to load packages through library or require. The snow documentation says:
clusterEvalQ evaluates a literal expression on each cluster node. It is a cluster version of evalq, and
is a convenience function defined in terms of clusterCall.
I'm not sure how this would work for evaluating a source call on each node, because honestly I've never tried. When I source functions I usually go with...
2) clusterExport
This function passes objects from the current workspace on to the nodes. It can be used in conjunction with source because sourced functions are part of the workspace just like any other object (you can source before setting up the cluster, and then pass the sourced functions to the nodes):
clusterExport assigns the values on the master of the variables named in list to variables of the
same names in the global environments of each node.
The list argument is actually a character vector of object names you want to pass on to the nodes. I usually take the lazy route (because I keep my workspace clean in the first place) and do:
clusterExport(localCl, list=ls())
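Put together, a minimal sketch of that workflow ("myFunctions.R" and someSourcedFunction() are placeholder names):

library(parallel)
localCl <- makeCluster(4)
source("myFunctions.R")                  # the sourced functions are now objects in the master workspace
clusterExport(localCl, ls())             # copy everything in the workspace to each node
res <- parLapply(localCl, 1:10, someSourcedFunction)
stopCluster(localCl)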
Hope this helps!
Related
Sometimes I use attach with some subset terms to work with odd dimensions of study data. To prevent "masking" variables in the environment (really, to avoid the warning message itself) I simply call detach() to remove whatever dataset I was working with from the R search path. When I get muddled in scripting, I may end up calling detach a few times. Interestingly, if I call it enough, R removes functions that are loaded at start-up as part of packages such as utils, stats, and graphics. Why does detach remove these functions?
R removes base functions from the search path, like plot and ? and so on.
The functions that were removed are often called “base” functions, but they are not part of the actual ‹base› package. Rather, plot is from the package ‹graphics› and ? is from the package ‹utils›, both of which are among R’s default packages and are therefore attached by default. Both packages are attached after package:base, and you’re accidentally detaching them with your extra detach calls. (package:base itself cannot be detached; this is important because, if it could be, you couldn’t reattach it: the functions necessary for that are inside package:base.)
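A quick way to watch this happen (sketch):

search()   # ".GlobalEnv", "package:stats", "package:graphics", ..., "package:base"
detach()   # with no arguments, detaches whatever sits at position 2 of the search path
detach()   # repeat it enough and package:graphics, package:utils, etc. get detached
search()   # the detached packages no longer appear, so plot and ? are no longer found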
To expand on this, attach and detach are usually used in conjunction with package environments rather than data sets: to enable the use of functions from a package without explicitly typing the package name (e.g. graphics::plot), the library function attaches these packages. When loading R, some packages are attached by default. You can find more information about this in Hadley Wickham’s Advanced R.
As you noticed, you can also attach and detach data sets. However, this is generally discouraged (quite strongly, in fact). Instead, you can use data transformation functions from the base package (e.g. with and transform, as noted by Moody_Mudskipper in a comment) or from a data manipulation package (‹dplyr› is state of the art; an alternative is ‹data.table›).
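For example, instead of attach()/detach(), a rough sketch using with and transform (mydata, height, and weight are hypothetical):

with(mydata, mean(height / weight))                    # evaluate an expression inside the data frame
mydata <- transform(mydata, ratio = height / weight)   # add a column without touching the search path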
Q. How does one write an R package function that references an R package dataset in a way that is simple/friendly/efficient for an R user? For example, how does one handle afunction() that calls adataset from a package?
What I think may not be simple/friendly/efficient:
The user is required to run data(adataset) before running afunction(...), or else they receive Error: ... object 'adataset' not found. I have noticed some packages have built-in datasets that can be called any time the package is loaded, for example iris, which one can call without bringing it into the Global Environment.
Possible options which I have entertained:
Write data(NamedDataSet) directly into the function (see the sketch after this list). Is this a bad idea? I thought perhaps it could be, looking at memory and given my limited understanding of function environments.
Code the structure of the dataset directly into the function. I think this works depending on the size of the data, but it makes me wonder how to go about proper documentation in the package.
Change nothing. Given a large enough dataset, maybe it does not make sense to do anything other than having the user read it in before calling the function.
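To make the first option concrete, a rough sketch of what I imagine (the envir = environment() part is just one guess at how to keep the data out of the user's workspace):

afunction <- function(...) {
  data("NamedDataSet", envir = environment())   # load into the function's environment rather than the default .GlobalEnv
  # ... work with NamedDataSet ...
}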
Any comments are appreciated.
You might find these resources about writing R data packages useful:
the "External Data" section of R Packages
Creating an R data package
Creating an R data package (overview)
In particular, take note of the DESCRIPTION file and usage of the line LazyData: true. This is how datasets are made available without having to use data(), as in the iris example that you mention.
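A minimal sketch of how that fits together (the package name mypackage is hypothetical):

# In DESCRIPTION:
#   LazyData: true
# with the dataset saved under data/, e.g.
#   save(adataset, file = "data/adataset.rda")
library(mypackage)
head(adataset)    # available lazily; no explicit data(adataset) call needed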
background
I am the maintainer of two packages that I would like to add to CRAN. They were rejected because some functions assign variables to .GlobalEnv. Now I am trying to find a different but equally convenient way to handle these variables.
the packages
Both packages belong to the Database of Odorant Responses, DoOR. We collect published odor-response data in the DoOR.data package; the algorithms in the DoOR.functions package merge these data into a single consensus dataset.
intended functionality
The data package contains a precomputed consensus dataset. The user is able to modify the underlying original datasets (e.g. add his own datasets, remove some...) and compute a new consensus dataset. Thus the functions must be able to access the modified datasets.
The easiest way was to load the complete package data into the .GlobalEnv (via a function) and then modify the data there. This was also straightforward for the user, as he saw the relevant datasets in his "main" environment. The problem is that writing into the user's environment is bad practice and CRAN wouldn't accept the package this way (understandably).
things I tried
assigning only modified datasets to .GlobalEnv, non-explicitly via parent.frame() - Hadley pointed out that this is still bad; in the end we are writing into the user's environment.
writing only modified datasets into a dedicated new environment door.env <- new.env().
door.env is not in the search path, thus data in it is ignored by the functions
putting it into the search path with attach(door.env), as I learned, creates a new environment in the search path, so any further edits to door.env will again be ignored by the functions
it is complicated for the user to see and edit the data in door.env; I'd rather have a solution where a user would not have to learn environment handling
So, bottom line: with all the solutions I tried I ended up with multiple copies of datasets in different environments, and I am afraid that this confuses the average user of our package (including me :))
Hopefully someone has an idea of where to store data, easily accessible to user and functions.
EDIT: the data is stored as *.RData files under /data in the DoOR.data package. I tried using LazyData: true to avoid loading all the data into the .GlobalEnv. This works well, but the problems with the manipulated/updated data remain.
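For reference, a rough sketch of the dedicated-environment attempt described above (the object names are made up):

door.env <- new.env()
assign("my.responses", modified.data, envir = door.env)   # store a user-modified dataset
attach(door.env)   # attaches a *copy* of door.env to the search path,
                   # so later assign() calls into door.env are not seen by the functions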
[Revised based on suggestion of exporting names.]
I have been working on an R package that is nearing about 100 functions, maybe more.
I want to have, say, 10 visible functions and each may have 10 "invisible" sub-functions.
Is there an easy way to select which functions are visible, and which are not?
Also, in the interest of avoiding 'diff', is there a command like "all.equal" that can be applied to two different packages to see where they differ?
You can make a file called NAMESPACE in the base directory of your package. In it you can define which functions you want to export to the user, and you can also import functions from other packages. Exporting makes a function available to the user, and importing makes a function from another package available to your package without exposing it to the user (useful if you just need one function and don't want to require your users to load another package when they load yours).
A truncated part of my package's NAMESPACE:
useDynLib(qgraph)
export(qgraph)
(...)
importFrom(psych,"principal")
(...)
import(plyr)
which respectively loads the compiled functions, makes the function qgraph() available, imports the principal function from psych, and imports from plyr all functions that are exported in plyr's NAMESPACE.
For more details read:
http://cran.r-project.org/doc/manuals/R-exts.pdf
I think you should organise your package and code the way you feel most comfortable with; it is your package after all. NAMESPACE can be used to control what gets exposed to the user up-front, as others have mentioned, and you don't need to document all the functions, just the main user-called ones: add \alias{} tags to the Rd files for the support functions you don't want people to know too much about, or hide them on a package.internals.Rd man page.
That being said, if you want people to help develop your package, or run with it and do amazing things, the better organised it is the easier that job will be. So lay out your functions logically: perhaps one file per function, named after the function, or group all the related functions into a single R file. But be consistent in whichever approach you take.
If you have generic functions that have more general use, consider splitting those functions out into a separate package that others can use, without having to depend on your mega package with the extra cruft that is more specific. Your package can then depend on this generic package, as can packages of other authors. But don't split packages up just for the sake of making them smaller.
The answer is almost certainly to create a package. Some rules of thumb may help in your design choice:
A package should solve one problem
If you have functions that solve a different problem, put them in a separate package
For example, have a look at the ggplot2 package:
ggplot2 is a package that creates wonderful graphics
It imports plyr, a package that gives a consistent syntax and approach to solve the Split, Apply, Combine problem
It depends on reshape2, a package with only a few functions that turn wide data into long, and vice versa.
The point is that all of these packages were written by a single author, i.e. Hadley Wickham.
If you do decide to make a package, you can control the visibility of your functions:
Only functions that are exported are directly visible in the namespace
You can additionally mark some functions with the keyword internal, which will prevent them appearing in automatically generated lists of functions.
If you decide to develop your own package, I strongly recommend the devtools package, and reading the devtools wiki
If your reformulated question is about 'how to organise large packages', then this may apply:
NAMESPACE allows for very fine-grained exporting of functions: your user would see 10 visible functions
even the invisible functions are accessible if you or your users know about them; that is done via the ::: (triple colon) operator (see the sketch after this list)
packages do come in all sizes and shapes; one common rule about 'when to split' may be to split off functionality as soon as it is of use in different contexts
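A quick sketch of the difference (mypackage, visible_fun, and hidden_helper are hypothetical names):

mypackage::visible_fun(x)      # exported, part of the documented interface
mypackage:::hidden_helper(x)   # unexported, reachable only via ':::'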
As for diff on packages: huh? Packages are not usually so similar that one would need a comparison function. The diff command is indeed quite useful on source code. You could use a hash function on binary code if you really wanted to, but I am still puzzled as to why one would want to.
One way of parallelization in R is through the snowfall package. To send custom functions to workers you can use sfExport() (see Joris' post here).
I have a custom function that depends on functions from non-base packages that are not loaded automagically. Thus, when I run my function in parallel, R craps out because certain functions are not available (think of the packages spatstat, splancs, sp...). So far I've solved this by calling library() inside my custom function. This loads the packages on the first run and presumably does nothing on subsequent iterations. Still, I was wondering if there's another way of telling each worker to load the package on the first iteration and be done with it (or am I missing something and each iteration starts as a tabula rasa?).
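What I currently do looks roughly like this (sketch; the function body is made up):

myCustomFunction <- function(x) {
  library(spatstat)   # attaches the package on the worker; on later calls it is already attached
  # ... do the actual spatial work ...
}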
I don't understand the question.
Packages are loaded via library(), and most of the parallel execution functions support that. For example, the snow package uses
clusterEvalQ(cl, library(boot))
to 'quietly' (i.e. not return a value) evaluate the given expression, here a call to library(), on each node. Most of the parallel execution frameworks have something like that.
Why again would you need something different, and what exactly does not work here?
There's a specific command for that in snowfall, sfLibrary(). See also ?"snowfall-tools". Calling library manually on every node is strongly discouraged. sfLibrary is basically a wrapper around the solution Dirk gave based on the snow package.
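A minimal sketch of the snowfall way (assuming a local 4-CPU setup and the hypothetical myCustomFunction from the question):

library(snowfall)
sfInit(parallel = TRUE, cpus = 4)
sfLibrary(spatstat)              # loads spatstat on every worker
sfExport("myCustomFunction")     # ships the custom function to the workers
res <- sfLapply(1:10, myCustomFunction)
sfStop()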