Check if a function name is used in existing CRAN packages

I am creating an R package that I plan to submit to CRAN. How can I check if any of my function names conflict with function names in packages already on CRAN? Before my package goes public, it's still easy to change the names of functions, and I'd like to be a good citizen and avoid conflicts where possible.
For instance, the packages MASS and dplyr both have functions called "select". I'd like to avoid that sort of collision.

There are a lot of packages (9008 at the moment, Aug 2016), so it is almost certainly better to only look at a subset you want to avoid clashes with. Also, to re-emphasise some of the good advice in the comments (just for the record in case comments get deleted, or hidden):
sharing function names with other packages is not really a big problem, and not worth avoiding beyond, perhaps, clashes with common packages that are most likely to be loaded at the same time (thanks @Nicola and @Joran)
Unnecessarily avoiding re-use of names "leads to bad function names because the good ones are taken" (@Konrad Rudolph)
But, if you really want to check all the packages, perhaps to at least know which packages use the same names as yours, you can get a vector of the package names by
crans <- available.packages()[, "Package"]
head(crans)
#          A3      abbyyR         abc   ABCanalysis    abc.data    abcdeFBA
#        "A3"    "abbyyR"       "abc" "ABCanalysis"  "abc.data"  "abcdeFBA"
length(crans)
# [1] 9008
You can then install them in bulk using
N = 4 # only using the 1st 4 packages here -
# doing it for the whole lot will take a lot of time and disk space!!!
install.packages(crans[1:N])
Then you can get a list of the function names in these packages with
existing_functions <- sapply(crans[1:N], function(p) ls(getNamespace(p)))
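With that in hand, checking for clashes is a simple intersection. A minimal sketch (the my_functions vector is a hypothetical stand-in for your own package's exports):
my_functions <- c("select", "summarise", "my_helper")  # hypothetical names

# any clashes at all?
intersect(my_functions, unlist(existing_functions))

# which of the scanned packages contain each clashing name?
Filter(length, lapply(existing_functions, intersect, my_functions))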

Related

Determining which version of a function is active when many packages are loaded

If I have multiple packages loaded that define functions of the same name, is there an easy way to determine which version of the function is currently the active one? Like, let's say I have base R, the tidyverse, and a bunch of time series packages loaded. I'd like a function which_package("intersect") that would tell me the package name of the active version of the intersect function. I know you can go back and look at all the masking warnings you received when loading packages, but I think that sort of manual search is not only tedious but also error-prone.
There is a function here that does sort of what I want, except it produces a table for all conflicts rather than the value for one function. I would actually be quite happy with that, and would also accept a similar function as an answer, but I have had problems with the implementation of the function given. As applied to my examples, it inserts vast amounts of white space and many duplicates of the package names (e.g. the %>% function shows up with 132 packages listed), making the output hard to read and hard to use. It seems like it should be easy to remove the white space and duplicates, but I have spent considerable time on various approaches that I expected to work and which had no impact on the outcome.
So, for an example of many conflicts:
install.packages(c("tidyverse", "fpp3", "tsbox", "rugarch", "Quandl", "DREGAR",
                   "dynlm", "zoo", "GGally", "dyn", "ARDL", "bigtime", "BigVAR",
                   "dLagM", "VARshrink"))
lapply(c("tidyverse", "fable", "tsbox", "rugarch", "Quandl", "DREGAR", "dynlm",
         "zoo", "GGally", "dyn", "ARDL", "bigtime", "BigVAR", "dLagM", "VARshrink"),
       library, character.only = TRUE)
You can pull this information with your own helper function:
which_package <- function(fun) {
    if (is.character(fun)) fun <- getFunction(fun)
    stopifnot(is.function(fun))
    x <- environmentName(environment(fun))
    if (!is.null(x)) return(x)
}
This will return R_GlobalEnv for functions that you define in the global environment. There is also the packageName function if you really want to restrict it to packages only.
For example
library(MASS)
library(dplyr)
which_package(select)
# [1] "dplyr"

R: How to check whether the libraries I am loading are actually used in my code?

I have used several R packages in my study, and all the libraries are loaded together at the beginning of my code. And here is the problem: I have experimented with various functions from these packages, but the final code does not use everything I tried. As a result, I am loading libraries that I do not actually use.
Would there be any way to check the libraries to know if they really are necessary for my code?
Start by restarting R with a fresh environment, no libraries loaded. For this demonstration, I'm going to define two functions:
zoo1 <- function() na.locf(1:10)
zoo2 <- function() zoo::na.locf(1:10)
With no libraries loaded, let's try something:
codetools::checkUsage(zoo1)
# <anonymous>: no visible global function definition for 'na.locf'
codetools::checkUsage(zoo2)
# (no output: zoo2 resolves na.locf via zoo::, so nothing is unresolved)
library(zoo)
# Attaching package: 'zoo'
# The following objects are masked from 'package:base':
#     as.Date, as.Date.numeric
codetools::checkUsage(zoo1)
# (no output now that zoo is attached)
Okay, so we know we can check a single function to see if it is abusing scope and/or using non-base functions. Let's assume that you've loaded your script full of functions (but not the calls to require or library), so let's do this process for all of them. Let's first unload zoo, so that we'll see a complaint again about our zoo1 function:
detach("package:zoo", unload=TRUE)
Now let's iterate over all functions:
allfuncs <- Filter(function(a) is.function(get(a)), ls())
str(sapply(allfuncs, function(fn) capture.output(codetools::checkUsage(get(fn))), simplify=FALSE))
# List of 2
# $ zoo1: chr "<anonymous>: no visible global function definition for 'na.locf'"
# $ zoo2: chr(0)
Now you know to look in the function named zoo1 for a call to na.locf. It'll be up to you to find in which not-yet-loaded package this function resides, but that might be more reasonable, depending on the number of packages you are loading.
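For that last step (finding which installed but not-yet-attached package provides a function), help.search() scans the help indices of every installed package:
help.search("na.locf")   # shorthand: ??na.locf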
Some side-thoughts:
If you have a script file that does not have everything comfortably ensconced in functions, then just wrap all of the global R code into a single function, say bigfunctionfortest <- function() { as the first line and } as the last. Then source the file and run codetools::checkUsage(bigfunctionfortest).
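As a sketch of that wrapping trick (the file name and body here are made up for illustration):
# contents of script-wrapped.R: the script's global code, fenced in one function
bigfunctionfortest <- function() {
    dat <- data.frame(x = c(1, NA, 3))
    na.locf(dat$x)
}

source("script-wrapped.R")
codetools::checkUsage(bigfunctionfortest)
# <anonymous>: no visible global function definition for 'na.locf'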
Package developers have to go through a process that uses this, so that the Imports: and Depends: fields of DESCRIPTION and the import directives in NAMESPACE (another ref: http://r-pkgs.had.co.nz/namespace.html) will be correct. One good trick that prevents "namespace pollution" is loading the namespace but not the package ... and though that may sound confusing, it often results in writing zoo::na.locf for all non-base functions. This gets old quickly (especially if you are using dplyr and such, where most of your daily functions are non-base), suggesting that oft-used functions should be imported directly instead of the package being referenced wholesale. If you're familiar with Python, then:
# R
library(zoo)
na.locf(c(1,2,NA,3))
is analogous to
# fake-python
from zoo import *
na_locf([1,2,None,3])
(if that package/function exists). Then the non-polluting variant looks like:
# R
zoo::na.locf(c(1,2,NA,3))
# fake-python
import zoo
zoo.na_locf([1,2,None,3])
where the function's package (and/or subdir packaging) must be used explicitly. There is no ambiguity. It is explicit. This is by some/many considered "A Good Thing (tm)".
(Language-philes will likely say that library(zoo) and from zoo import * are not exactly the same ... a better way to describe what is happening is that they bring everything from zoo into the search path of functions, potentially causing masking as we saw in a console message earlier; while the :: functionality only loads the namespace but does not add it to the search path. Lots of things going on in the background.)
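You can see that distinction directly in a fresh session; a small sketch, again using zoo:
loadNamespace("zoo")              # loads the namespace, attaches nothing
"zoo" %in% loadedNamespaces()     # TRUE
"package:zoo" %in% search()       # FALSE: plain na.locf(...) would still fail
zoo::na.locf(c(1, 2, NA, 3))      # but the :: form works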

Using datasets in an R package

I am trying to get the latest version of my package (https://github.com/jmcurran/relSim) onto CRAN. It has been rejected because a data set included in the package is used inside a function that is not exported (i.e. the user cannot call it without the ::: operator). A code snippet:
testIS = function(nc = c(3, 2), locus = 1, seed = 123456) {
    set.seed(seed)
    np = 2 * nc[2]
    freqs = USCaucs$freqs
    # ... (remainder of the function elided)
}
The dataset is included in the package, and as per Hadley's advice I have LazyData: true in my DESCRIPTION file. However I get this note from https://win-builder.r-project.org which I don't know how to resolve.
* checking R code for possible problems ... [11s] NOTE
testIS: no visible binding for global variable 'USCaucs'
Undefined global functions or variables:
  USCaucs
I find this especially frustrating since, as I said, this function is not even exported (and it runs without complaint because the package loads this dataset). All help appreciated.
The solution appears to involve a little duplication. At the suggestion of Thomas Lumley, I placed the object in R/sysdata.rda as well as having it in data/USCaucs.rda. I followed Hadley Wickham's suggestion to use devtools::use_data with the argument internal set to TRUE so that it was saved in the correct manner for a package.
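For the record, the call described above looks like this (devtools::use_data was current at the time; the same helper now lives in the usethis package):
devtools::use_data(USCaucs, internal = TRUE)  # writes R/sysdata.rda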
As noted, this solution involves duplicating the data. This isn't an issue for a small object such as the one I have here, but I'd like to think there is a more elegant solution out there.

Extract English words from a text in R

I have a text and I need to extract all English words from it. For instance I want to have a function which would analyse the vector
vector <- c("picture", "carpet", "lamp", "notaword", "anothernotaword")
And return only English words from this vector i.e. "picture", "carpet", "lamp"
I do understand that the definition of "English word" depends on the dictionary but I would be satisfied even with a basic dictionary.
You could use the qdapDictionaries package, which I maintain (no need for the parent package qdap to be installed). If your data is more complex, you may need tools like tolower etc. to make it work. The idea here is basically to see where a known word list (GradyAugmented; see ?GradyAugmented) intersects with your words. Here are two very similar approaches; the first is likely slightly faster, depending on the data:
vector <- c("picture", "carpet", "lamp", "notaword", "anothernotaword")
library(qdapDictionaries)
vector[vector %in% GradyAugmented]
## [1] "picture" "carpet" "lamp"
intersect(vector, GradyAugmented)
## [1] "picture" "carpet" "lamp"
The error you are receiving when installing qdap suggests that @Ben Bolker is correct: you will need a newer version (I'd suggest the latest) of data.table installed (use packageVersion("data.table") to check this). That is an oversight on my part in not requiring a minimal version of data.table; I thought setDT (a function in the data.table package) had always been around, but apparently it is not in your version. But to solve this particular problem you don't need to install the parent qdap package, just qdapDictionaries.

available CRAN vignettes

There's the available.packages() function to list all packages available on CRAN. Is there a similar function to find all available vignettes? If not how would I get a list of all vignettes and the packages they're associated with?
As a corner case to keep in mind the data.table package has 3 vignettes associated with it.
EDIT: Per Andrie's response I realize I wasn't clear. I know about the vignette function for finding all the available local vignettes, I'm after a way to get all the vignettes of all packages on CRAN.
I seem to recall looking at this in response to some SO question (can't find it now) and deciding that, since the information isn't included in the output of available.packages(), nor in the result of applying readRDS to <CRAN mirror>/web/packages/packages.rds (a trick from @Jeroen Ooms), I couldn't think of a non-scraping way to do it ...
Here's my scraper, applied to the first 100 packages (yielding 44 vignettes):
pkgs <- unname(available.packages()[, 1])[1:100]
vindex_urls <- paste0(getOption("repos"), "/web/packages/", pkgs,
                      "/vignettes/index.rds")
getf <- function(x) {
    ## I think there should be a way to do this directly
    ## with readRDS(url(...)) but I can't get it to work
    suppressWarnings(
        download.file(x, "tmp.rds", quiet = TRUE))
    readRDS("tmp.rds")
}
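(Incidentally, the direct-connection approach the comment above wishes for does work if the URL connection is wrapped in gzcon(), since .rds files are gzip-compressed; a sketch:)
getf2 <- function(x) {
    con <- gzcon(url(x, open = "rb"))
    on.exit(close(con))
    readRDS(con)   # no temporary file needed
}
Continuing with the scraper: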
library(plyr)
vv <- llply(vindex_urls,
            .progress = "text",
            function(x) {
                if (inherits(z <- try(getf(x), silent = TRUE),
                             "try-error")) NULL else z
            })
tmpf <- function(x, n) {
    if (is.null(x)) NULL else data.frame(pkg = n, x)
}
vframe <- do.call(rbind, mapply(tmpf, vv, pkgs, SIMPLIFY = FALSE))
rownames(vframe) <- NULL
head(vframe[, c("pkg", "Title")])
There may be ways to clean this up/make it more compact, but it seems to work OK. Your scrape once/update occasionally strategy seems reasonable. Or if you wanted you could scrape daily (or weekly or whatever seems reasonable) and save/post the results somewhere publicly accessible, then include a function with that URL hard-coded in the package ... or even create a nicely formatted HTML table, with links, that the whole world could use (and then add Viagra ads to the page, and $$PROFIT$$ ...)
edit: wrapped both the download and the readRDS in a function, so I can wrap the whole thing in try
The functions vignette() and browseVignettes() list all vignettes of packages installed on your machine.
vignette(package="data.table")
Vignettes in package ‘data.table’:

datatable-faq      Frequently asked questions (source, pdf)
datatable-intro    Quick introduction (source, pdf)
datatable-timings  Timings of common tasks (source, pdf)
browseVignettes() is especially helpful since it creates a web page with hyperlinks:
browseVignettes(package="data.table")
Vignettes found by browseVignettes(package = "data.table")
Vignettes in package data.table
Frequently asked questions - PDF R LaTeX/noweb
Quick introduction - PDF R LaTeX/noweb
Timings of common tasks - PDF R LaTeX/noweb
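For the local case it is also worth knowing that vignette() returns its results as a character matrix, so the same information can be used programmatically:
v <- vignette(package = "data.table")
head(v$results[, c("Package", "Item", "Title")])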
