How can I list only functions and those that come from a package - r

I use the foreach package to parallelize some stuff and I am tired of indicating 5 functions in .export every time I need to use it.
I know I can do foreach(..., .export = ls(.GlobalEnv)), but this transfers a lot of data to the workers and slows me down (there can be big tables defined).
So the question is: how can I list only the functions in the .GlobalEnv?
I did that:
getAllFunctions <- function(envir = .GlobalEnv) {
  allClasses <- sapply(grep(x = ls(envir), pattern = '^%', value = TRUE, invert = TRUE),
                       FUN = function(x) class(eval(parse(text = x))))
  fnNames <- names(allClasses)[allClasses == 'function']
  return(fnNames)
}
But that's ugly (and gives everything), and I'm sure there is a more idiomatic way.

From the comments:
as.list(.GlobalEnv)[sapply(.GlobalEnv, is.function)]
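For foreach's .export you need the function names rather than the function objects themselves; a minimal sketch building on the comment above (the helper name getFunctionNames is illustrative, not from the original post):
# Illustrative helper: the names of all functions in an environment.
getFunctionNames <- function(envir = .GlobalEnv) {
  names(Filter(is.function, as.list(envir)))
}

# With the usual foreach/doParallel setup, only the functions get shipped:
# foreach(i = 1:10, .export = getFunctionNames()) %dopar% myFun(i)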

Related

Determining which version of a function is active when many packages are loaded

If I have multiple packages loaded that define functions of the same name, is there an easy way to determine which version of the function is currently the active one? Like, let's say I have base R, the tidyverse, and a bunch of time series packages loaded. I'd like a function which_package("intersect") that would tell me the package name of the active version of the intersect function. I know you can go back and look at all the warning messages you received when loading packages, but I think that sort of manual search is not only tedious but also error-prone.
There is a function here that does sort of what I want, except it produces a table for all conflicts rather than the value for one function. I would actually be quite happy with that, and would also accept a similar function as an answer, but I have had problems with the implementation of the function given. As applied to my examples, it inserts vast amounts of white space and many duplicates of the package names (e.g. the %>% function shows up with 132 packages listed), making the output hard to read and hard to use. It seems like it should be easy to remove the white space and duplicates, and I have spent considerable time on various approaches that I expected to work but which had no impact on the outcome.
So, for an example of many conflicts:
install.packages(pkgs = c("tidyverse", "fpp3", "tsbox", "rugarch", "Quandl",
                          "DREGAR", "dynlm", "zoo", "GGally", "dyn", "ARDL",
                          "bigtime", "BigVAR", "dLagM", "VARshrink"))
lapply(X = c("tidyverse", "fable", "tsbox", "rugarch", "Quandl", "DREGAR",
             "dynlm", "zoo", "GGally", "dyn", "ARDL", "bigtime", "BigVAR",
             "dLagM", "VARshrink"),
       FUN = library, character.only = TRUE)
You can pull this information with your own helper function.
which_package <- function(fun) {
  if (is.character(fun)) fun <- getFunction(fun)
  stopifnot(is.function(fun))
  x <- environmentName(environment(fun))
  if (!is.null(x)) return(x)
}
This will return R_GlobalEnv for functions that you define in the global environment. There is also the packageName function if you really want to restrict it to packages only.
For example
library(MASS)
library(dplyr)
which_package(select)
# [1] "dplyr"

Global Variables as Function Parameters

In an R project, we have a global dataframe df that is to be used inside a function my_func(). The dataframe will not be changed, but it will be used as a "read-only" table.
Can you please advise on what the best practice is:
Include the dataframe in the parameters of the function, as in
my_func <- function(df) {
  a <- df[1, 2]
}
OR
Not include it in the parameters, just use it (read it) in the function body, as in
my_func <- function() {
  a <- df[1, 2]
}
In an ideal world, data enters a function as an argument and leaves it as a return value. That is a good principle, and it is also preferable for code reuse. Right now you may be convinced that you will only ever call this code on df (a bad name, by the way, as there is already a function called df in R, and that can lead to terrible error messages).
The only exception to this rule, and the reason why <<- exists (*), may occasionally be performance.
However, in the read-only case there are no performance gains, as R behaves cleverly: function arguments are only copied when they are modified, not when they are merely read.
You will need to install the microbenchmark package for the following code to run:
expl <- data.frame(a = rep("Hello world.", 1e8),
                   b = rep(1, 1e8))

fun1 <- function(dataframe) return(sum(dataframe$b))
fun2 <- function() return(sum(expl$b))

microbenchmark::microbenchmark(fun1(expl), fun2())
Try it and you will see that there is no performance gain in fun2 over fun1, even though the dataframe has considerable size.
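You can watch this copy-on-modify behaviour directly with base R's tracemem (a small sketch; it assumes R was built with memory profiling, which the standard binaries are):
d <- data.frame(x = 1:5)
tracemem(d)   # start reporting whenever d is copied

read_only <- function(df) sum(df$x)
read_only(d)  # nothing reported: the argument is only read

modify <- function(df) { df$x <- 0; invisible(df) }
modify(d)     # tracemem reports a copy: the modification triggers it
untracemem(d)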
Edit:
(*) As I have learned from Konrad Rudolph's comment below, <<- can be useful when giving data to the parent environment, not necessarily the global one. A very interesting read, even if not strictly on topic here: http://adv-r.had.co.nz/Functional-programming.html#mutable-state
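A minimal sketch of that parent-environment pattern (a counter closure, in the spirit of the linked chapter):
make_counter <- function() {
  count <- 0
  function() {
    # <<- assigns to `count` in the enclosing environment of this
    # function (created by make_counter), not the global environment.
    count <<- count + 1
    count
  }
}

counter <- make_counter()
counter()  # 1
counter()  # 2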

R Package development: overriding a function from one package with a function from another?

I am currently working on developing two packages; below is a simplified version of my problem:
In package A I have some functions (say "sum_twice"), and one of them calls another function inside the package (say "slow_sum").
However, in package B, I wrote another function (say "fast_sum") with which I wish to replace the slow function in package A.
Now, how do I manage this "overriding" of the "slow_sum" function with the "fast_sum" function?
Here is a simplified example of such functions (just to illustrate):
############################
##############
# Functions in package A
slow_sum <- function(x) {
  sum_x <- 0
  for (i in seq_along(x)) sum_x <- sum_x + x[i]
  sum_x
}

sum_twice <- function(x) {
  x2 <- rep(x, 2)
  slow_sum(x2)
}

##############
# A function in package B
fast_sum <- function(x) { sum(x) }
############################
If I only do something like slow_sum <- fast_sum, this would not work, since "sum_twice" uses "slow_sum" from the NAMESPACE
of package A.
I tried using the following function when loading package "B":
assignInNamespace(x = "slow_sum", value = B:::fast_sum, ns = "A")
This indeed works. However, it makes the CRAN checks return both a NOTE about how I should not use ":::", and also a warning for using assignInNamespace (since it is not considered very safe).
However, I am at a loss.
What would be a way to have "sum_twice" use "fast_sum" instead of
"slow_sum"?
Thank you upfront for any feedback or suggestion,
With regards,
Tal
p.s: this is a double post from here.
UPDATE: motivation for this question
I am developing two packages. One is based solely on R and works fine (but is a bit slow); it is dendextend (now on CRAN). The other is meant to speed up the first package by using Rcpp (this is dendextendRcpp, which is on github). The second package speeds up the first by overriding some basic functions the first package uses. But in order for the higher-level functions in the first package to use the lower-level functions in the second package, I have to use assignInNamespace, which leads CRAN to throw warnings+NOTEs, which ended up getting the package rejected from CRAN (until these warnings are resolved).
The problem is that I have no idea how to approach this issue. The only solutions I can think of are either merging the two packages (making them harder to maintain, and automatically requiring a larger dependency structure for people who want to use the package), or copy-pasting the higher-level functions from dendextend into dendextendRcpp and letting them mask the originals. But I find that MUCH less elegant (because it means I would need to copy-paste MANY functions, forcing more duplicate-code maintenance). Any other ideas? Thanks.
We could put this in sum_twice:
my_sum_ch <- getOption("my_sum",
                       if ("package:fastpkg" %in% search()) "fast_sum"
                       else "slow_sum")
my_sum <- match.fun(my_sum_ch)
If the "my_sum" option were set then that version of my_sum would be used and if not it would make the decision based on whether or not fastpkg had been loaded.
The solution I ended up using (thanks to Uwe and Kurt) is to use "local" to create a localized environment with the package options. If you're curious, the function is called "dendextend_options", and is here:
https://github.com/talgalili/dendextend/blob/master/R/zzz.r
Here is an example for its use:
dendextend_options <- local({
  options <- list()
  function(option, value) {
    # ellipsis <- list(...)
    if (missing(option)) return(options)
    if (missing(value))
      options[[option]]
    else
      options[[option]] <<- value
  }
})
dendextend_options("a")
dendextend_options("a", 1)
dendextend_options("a")
dendextend_options("a", NULL)
dendextend_options("a")
dendextend_options()
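Tying this back to the earlier example, sum_twice could consult such an option at call time; a hedged sketch (the option name "sum_function" is made up for illustration):
# Inside package A: pick the implementation at call time.
sum_twice <- function(x) {
  x2 <- rep(x, 2)
  sum_fun <- dendextend_options("sum_function")
  if (is.null(sum_fun)) sum_fun <- slow_sum
  sum_fun(x2)
}

# Package B could then register its faster version when it is loaded:
dendextend_options("sum_function", fast_sum)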

Options for caching / memoization / hashing in R

I am trying to find a simple way to use something like Perl's hash functions in R (essentially caching), as I intended to do both Perl-style hashing and write my own memoisation of calculations. However, others have beaten me to the punch and have packages for memoisation. The more I dig, the more I find, e.g. memoise and R.cache, but the differences aren't readily clear. In addition, it's not clear how else one can get Perl-style hashes (or Python-style dictionaries) and write one's own memoization, other than to use the hash package, which doesn't seem to underpin the two memoization packages.
Since I can find no information on CRAN or elsewhere to distinguish between the options, perhaps this should be a community wiki question on SO: What are the options for memoization and caching in R, and what are their differences?
As a basis for comparison, here is a list of the options I've found. Also, it seems to me that all depend on hashing, so I'll note the hashing options as well. Key/value storage is somewhat related, but opens a huge can of worms regarding DB systems (e.g. BerkeleyDB, Redis, MemcacheDB and scores of others).
It looks like the options are:
Hashing
digest - provides hashing for arbitrary R objects.
Memoization
memoise - a very simple tool for memoization of functions (a short sketch of digest and memoise follows this list).
R.cache - offers more functionality for memoization, though it seems some of the functions lack examples.
Caching
hash - Provides caching functionality akin to Perl's hashes and Python dictionaries.
Key/value storage
These are basic options for external storage of R objects.
stashr
filehash
Checkpointing
cacher - this seems to be more akin to checkpointing.
CodeDepends - An OmegaHat project that underpins cacher and provides some useful functionality.
DMTCP (not an R package) - appears to support checkpointing in a bunch of languages, and a developer recently sought assistance testing DMTCP checkpointing in R.
Other
Base R supports: named vectors and lists, row and column names of data frames, and names of items in environments. It seems to me that using a list is a bit of a kludge. (There's also pairlist, but it is deprecated.)
The data.table package supports rapid lookups of elements in a data table.
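To give a feel for the two most common entry points, here is a minimal sketch of digest and memoise (assuming both packages are installed; the toy function is mine, not from any of these packages):
library(memoise)

# digest: hash an arbitrary R object down to a short string, e.g. for keys
digest::digest(list(mean = 2.3, sd = 3.0))  # returns a hex string

# memoise: wrap a slow function so repeated calls hit a cache
slow_square <- function(x) { Sys.sleep(1); x^2 }
fast_square <- memoise(slow_square)
fast_square(4)  # slow: computed
fast_square(4)  # instant: served from the cache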
Use case
Although I'm mostly interested in knowing the options, I have two basic use cases that arise:
Caching: Simple counting of strings. [Note: This isn't for NLP, but general use, so NLP libraries are overkill; tables are inadequate because I prefer not to wait until the entire set of strings is loaded into memory. Perl-style hashes are at the right level of utility.]
Memoization of monstrous calculations.
These really arise because I'm digging into the profiling of some slooooow code and I'd really like to just count simple strings and see if I can speed up some calculations via memoization. Being able to hash the input values, even if I don't memoize, would let me see if memoization can help.
Note 1: The CRAN Task View on Reproducible Research lists a couple of the packages (cacher and R.cache), but there is no elaboration on usage options.
Note 2: To aid others looking for related code, here are a few notes on some of the authors or packages. Some of the authors use SO. :)
Dirk Eddelbuettel: digest - a lot of other packages depend on this.
Roger Peng: cacher, filehash, stashR - these address different problems in different ways; see Roger's site for more packages.
Christopher Brown: hash - Seems to be a useful package, but the links to ODG are down, unfortunately.
Henrik Bengtsson: R.cache & Hadley Wickham: memoise -- it's not yet clear when to prefer one package over the other.
Note 3: Some people use memoise/memoisation others use memoize/memoization. Just a note if you're searching around. Henrik uses "z" and Hadley uses "s".
I did not have luck with memoise because it caused a 'too deep recursion' problem with some functions of a package I tried it with. With R.cache I had better luck. The following is more annotated code I adapted from the R.cache documentation. The code shows different options for doing caching:
# Workaround to avoid a question when loading the R.cache library
dir.create(path = "~/.Rcache", showWarnings = FALSE)
library("R.cache")
setCacheRootPath(path = "./.Rcache")  # Create .Rcache at current working dir
# In case we need the cache path, but not used in this example.
cache.root <- getCacheRootPath()

simulate <- function(mean, sd) {
  # 1. Try to load cached data, if already generated
  key <- list(mean, sd)
  data <- loadCache(key)
  if (!is.null(data)) {
    cat("Loaded cached data\n")
    return(data)
  }
  # 2. If not available, generate it.
  cat("Generating data from scratch...")
  data <- rnorm(1000, mean = mean, sd = sd)
  Sys.sleep(1)  # Emulate slow algorithm
  cat("ok\n")
  saveCache(data, key = key, comment = "simulate()")
  data
}

data <- simulate(2.3, 3.0)
data <- simulate(2.3, 3.5)
a <- 2.3
b <- 3.0
data <- simulate(a, b)  # Will load cached data, params are checked by value

# Clean up
file.remove(findCache(key = list(2.3, 3.0)))
file.remove(findCache(key = list(2.3, 3.5)))
simulate2 <- function(mean, sd) {
  data <- rnorm(1000, mean = mean, sd = sd)
  Sys.sleep(1)  # Emulate slow algorithm
  cat("Done generating data from scratch\n")
  data
}

# Easy step to memoize a function; this works with any
# function from an external package.
mzs <- addMemoization(simulate2)

data <- mzs(2.3, 3.0)
data <- mzs(2.3, 3.5)
data <- mzs(2.3, 3.0)  # Will load cached data

# It is also possible to reassign the function name, but note that
# different memoizations of the same function will return the same
# cached result if the input params are the same.
simulate2 <- addMemoization(simulate2)
data <- simulate2(2.3, 3.0)
# If the expression being evaluated depends on
# "input" objects, then these must be specified
# explicitly as "key" objects.
for (ii in 1:2) {
  for (kk in 1:3) {
    cat(sprintf("Iteration #%d:\n", kk))
    res <- evalWithMemoization({
      cat("Evaluating expression...")
      a <- kk
      Sys.sleep(1)
      cat("done\n")
      a
    }, key = list(kk = kk))
    # The expression above is skipped on the repeated run
    print(res)

    # Sanity checks
    stopifnot(a == kk)

    # Clean up
    rm(a)
  } # for (kk ...)
} # for (ii ...)
For simple counting of strings (and not using table or similar), a multiset data structure seems like a good fit. The environment object can be used to emulate this.
# Define the insert function for a multiset
msetInsert <- function(mset, s) {
  if (exists(s, mset, inherits = FALSE)) {
    mset[[s]] <- mset[[s]] + 1L
  } else {
    mset[[s]] <- 1L
  }
}

# First we generate a bunch of strings
n <- 1e5L    # Total number of strings
nus <- 1e3L  # Number of unique strings
ustrs <- paste("Str", seq_len(nus))
set.seed(42)
strs <- sample(ustrs, n, replace = TRUE)

# Now we use an environment as our multiset
mset <- new.env(TRUE, emptyenv())  # Ensure hashing is enabled

# ...and insert the strings one by one...
for (s in strs) {
  msetInsert(mset, s)
}

# Now we should have nus unique strings in the multiset
identical(nus, length(mset))

# And the names should be correct
identical(sort(ustrs), sort(names(as.list(mset))))

# ...and an example of getting the count for a specific string
mset[["Str 3"]]  # "Str 3" instance count (97)
Related to @biocyperman's solution: R.cache provides a wrapper for the loading, evaluating, and saving steps, so you can simplify the code like this:
simulate <- function(mean, sd) {
  key <- list(mean, sd)
  data <- evalWithMemoization(key = key, expr = {
    cat("Generating data from scratch...")
    data <- rnorm(1000, mean = mean, sd = sd)
    Sys.sleep(1)  # Emulate slow algorithm
    cat("ok\n")
    data
  })
}

hiding personal functions in R

I have a few convenience functions in my .Rprofile, such as this handy function for returning the size of objects in memory. Sometimes I like to clean out my workspace without restarting and I do this with rm(list=ls()) which deletes all my user created objects AND my custom functions. I'd really like to not blow up my custom functions.
One way around this seems to be creating a package with my custom functions so that my functions end up in their own namespace. That's not particularly hard, but is there an easier way to ensure custom functions don't get killed by rm()?
Combine attach and sys.source to source into an environment and attach that environment. Here I have two functions in file my_fun.R:
foo <- function(x) {
  mean(x)
}

bar <- function(x) {
  sd(x)
}
Before I load these functions, they are obviously not found:
> foo(1:10)
Error: could not find function "foo"
> bar(1:10)
Error: could not find function "bar"
Create an environment and source the file into it:
> myEnv <- new.env()
> sys.source("my_fun.R", envir = myEnv)
Still not visible as we haven't attached anything
> foo(1:10)
Error: could not find function "foo"
> bar(1:10)
Error: could not find function "bar"
and when we do so, they are visible, and because we have attached a copy of the environment to the search path the functions survive being rm()-ed:
> attach(myEnv)
> foo(1:10)
[1] 5.5
> bar(1:10)
[1] 3.027650
> rm(list = ls())
> foo(1:10)
[1] 5.5
I still think you would be better off with your own personal package, but the above might suffice in the meantime. Just remember the copy on the search path is just that, a copy. If the functions are fairly stable and you're not editing them then the above might be useful but it is probably more hassle than it is worth if you are developing the functions and modifying them.
A second option is to just name them all .foo rather than foo as ls() will not return objects named like that unless argument all = TRUE is set:
> .foo <- function(x) mean(x)
> ls()
character(0)
> ls(all = TRUE)
[1] ".foo" ".Random.seed"
Here are two ways:
1) Have each of your function names start with a dot, e.g. .f instead of f. ls will not list such functions unless you use ls(all.names = TRUE); therefore they won't be passed to your rm command.
or,
2) Put this in your .Rprofile
attach(list(
  f = function(x) x,
  g = function(x) x * x
), name = "MyFunctions")
The functions will appear as a component named "MyFunctions" on your search list rather than in your workspace, and they will be accessible almost the same as if they were in your workspace. search() will display your search list and ls("MyFunctions") will list the names of the functions you attached. Since they are not in your workspace, the rm command you normally use won't remove them. If you do wish to remove them, use detach("MyFunctions").
Gavin's answer is wonderful, and I just upvoted it. Merely for completeness, let me toss in another one:
R> q("no")
followed by
M-x R
to create a new session---which re-reads the .Rprofile. Easy, fast, and cheap.
Other than that, private packages are the way in my book.
Another alternative: keep the functions in a separate file which is sourced within .RProfile. You can re-source the contents directly from within R at your leisure.
I find that often my R environment gets cluttered with various objects when I'm creating or debugging a function. I wanted a way to efficiently keep the environment free of these objects while retaining personal functions.
The simple function below was my solution. It does 2 things:
1) deletes all non-function objects that do not begin with a capital letter and then
2) saves the environment as an RData file
(requires the R.oo package)
cleanup <- function(filename = "C:/mymainR.RData") {
  library(R.oo)
  # Create a dataframe listing all personal objects
  everything <- ll(envir = 1)
  # Get the objects that are not functions
  nonfunction <- as.vector(everything[everything$data.class != "function", 1])
  # Non-function objects that do not begin with a capital letter should be deleted
  trash <- nonfunction[grep('^[[:lower:]]', nonfunction)]
  remove(list = trash, pos = 1)
  # Save the R environment
  save.image(filename)
  print(paste("New, CLEAN R environment saved in", filename))
}
In order to use this function, 3 rules must always be kept:
1) Keep all data external to R.
2) Use names that begin with a capital letter for non-function objects that I want to keep permanently available.
3) Obsolete functions must be removed manually with rm.
Obviously this isn't a general solution for everyone...and potentially disastrous if you don't live by rules #1 and #2. But it does have numerous advantages: a) fear of my data getting nuked by cleanup() keeps me disciplined about using R exclusively as a processor and not a database, b) my main R environment is so small I can backup as an email attachment, c) new functions are automatically saved (I don't have to manually manage a list of personal functions) and d) all modifications to preexisting functions are retained. Of course the best advantage is the most obvious one...I don't have to spend time doing ls() and reviewing objects to decide whether they should be rm'd.
Even if you don't care for the specifics of my system, the "ll" function in R.oo is very useful for this kind of thing. It can be used to implement just about any set of clean up rules that fit your personal programming style.
Patrick Mohr
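For instance, a different rule set might drop all non-function objects above some size threshold; a hedged sketch (assuming R.oo is installed and using ll()'s documented columns):
library(R.oo)

# ll() returns a data frame with one row per object, including its
# class and size, which makes rule-based cleanup straightforward.
info <- ll(envir = .GlobalEnv)
big <- info[info$data.class != "function" & info$objectSize > 1e6, "member"]
rm(list = big, envir = .GlobalEnv)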
An nth, quick-and-dirty option would be to use lsf.str() when calling rm(), to keep all the functions in the current workspace. ...and it lets you name the functions as you wish.
# Anchor each function name so only exact matches are kept
pattern <- paste0('^', lsf.str(), '$', collapse = "|")
rm(list = ls()[-grep(pattern, ls())])
I agree, it may not be the best practice, but it gets the job done! (and I have to selectively clean after myself anyway...)
Similar to Gavin's answer, the following loads a file of functions but without leaving an extra environment object around:
if ('my_namespace' %in% search()) detach('my_namespace')
source('my_functions.R', attach(NULL, name = 'my_namespace'))
This removes the old version of the namespace if it was attached (useful for development), then attaches an empty new environment called my_namespace and sources my_functions.R into it. If you don't remove the old version you will build up multiple attached environments of the same name.
Should you wish to see which functions have been loaded, look at the output for
ls('my_namespace')
To unload, use
detach('my_namespace')
These attached functions, like a package, will not be deleted by rm(list=ls()).
