Package that downloads data from the internet during installation

Is anyone aware of a package that downloads a dataset from the internet during the installation process and then prepares and saves it so that it is available when loading the package with library(packageName)? Are there any drawbacks to this approach (besides the obvious one that package installation will fail if the data source is unavailable or the data format has changed)?
EDIT: Some background. The data is three tab-separated files in a ZIP archive, owned by a federal statistics office and generally freely accessible. I have R code which downloads, extracts, and prepares the data; in the end, three data frames are created which could be saved in .RData format.
I am thinking about creating two packages: A "data" package that provides the data, and a "code" package that operates on it.

I did this mockup before you posted your edit. I presume it would work, but I haven't tested it. I've commented it so you can see what you would need to change. The idea is to check whether an expected object is available in the current working environment. If it is not, check whether the file that contains the data is in the current working directory. If that is not found, prompt the user to download the file, then proceed from there.
myFunction <- function(this, that, dataset = NULL) {
  # We're giving the user a chance to specify the dataset.
  # Maybe they have already downloaded it and saved it.
  # (The NULL default lets the argument be omitted safely.)
  if (is.null(dataset)) {
    # Check to see if the object is already in the workspace.
    # If it is not, check to see whether the .RData file that
    # contains the object is in the current working directory.
    if (!exists("OBJECTNAME", where = 1)) {
      if (isTRUE(list.files(
          pattern = "^DATAFILE.RData$") == "DATAFILE.RData")) {
        load("DATAFILE.RData")
      } else {
        # If neither of those is successful, prompt the user
        # to download the dataset.
        ans <- readline(
          "DATAFILE.RData dataset not found in working directory.\nOBJECTNAME object not found in workspace.\n\nDownload and load the dataset now? (y/n) ")
        if (ans != "y")
          return(invisible())
        # I usually use RCurl in case the URL is https
        require(RCurl)
        baseURL <- "http://some/base/url/"
        # Here, we actually download the data
        temp <- getBinaryURL(paste0(baseURL, "DATAFILE.RData"))
        # Here we load the data
        load(rawConnection(temp), envir = .GlobalEnv)
        message("OBJECTNAME data downloaded from \n",
                paste0(baseURL, "DATAFILE.RData \n"),
                "and added to your workspace\n\n")
        rm(temp, baseURL)
      }
    }
    dataset <- OBJECTNAME
  }
  TEMP <- dataset
  ## Other fun stuff with TEMP, this, and that.
}
Two packages, hosted at GitHub
Here's another approach, building on the comments between @juba and me. The basic concept is to have, as you describe, one package for the code and one for the data. This function would be part of the package that contains your code. It will:
Check to see if the data package is installed
Check to see if the version of the data package you have installed matches the version at GitHub, which we will assume is the most up-to-date version.
If any of the checks fail, it asks the user whether they want to update their installation of the package. In this case, for demonstration, I've linked to one of my packages in progress at GitHub. This should give you an idea of what you need to substitute to get it to work with your own package once you've hosted it there.
CheckVersionFirst <- function() {
  # Check to see if installed
  if (!"StataDCTutils" %in% installed.packages()[, 1]) {
    Checks <- "Failed"
  } else {
    # Compare the installed version number with the one at GitHub
    require(RCurl)
    temp <- getURL(
      "https://raw.githubusercontent.com/mrdwab/StataDCTutils/master/DESCRIPTION")
    CurrentVersion <- gsub("^\\s|\\s$", "",
                           gsub(".*Version:(.*)\\nDate.*", "\\1", temp))
    if (packageVersion("StataDCTutils") >= CurrentVersion) {
      Checks <- "Passed"
    } else {
      Checks <- "Failed"
    }
  }
  switch(
    Checks,
    Passed = { message("Everything looks OK! Proceeding!") },
    Failed = {
      ans <- readline(
        "StataDCTutils is either outdated or not installed. Update now? (y/n) ")
      if (ans != "y")
        return(invisible())
      require(devtools)
      # Current devtools takes the repo in "username/repo" form
      install_github("mrdwab/StataDCTutils")
    })
  # Some cool things you want to do after you are sure the data is there
}
Try it out with CheckVersionFirst().
Note: This will succeed only if you religiously remember to update the version number in your DESCRIPTION file every time you push a new version of the data to GitHub!
So, to clarify/recap/expand, the basic idea would be to:
Periodically push the updated version of your data package to Github, being sure to change the version number of the data package in its DESCRIPTION file when you do so.
Integrate this CheckVersionFirst() function as an .onLoad event in your code package. (Obviously modify the function to match your account and package name).
Change the commented line that reads # Some cool things you want to do after you are sure the data is there to reflect the cool things you actually want to do, which would probably start with library(YOURDATAPACKAGE) to load the data....
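A minimal sketch of that .onLoad wiring, assuming your code package has the customary zzz.R file (the package and function names are the ones used above):
.onLoad <- function(libname, pkgname) {
  # Run the version check every time the code package is loaded
  CheckVersionFirst()
}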

This method may not be efficient, but it is a good workaround. If you are making a package that needs regularly updated data, first make a package which contains that data. It does not need any functions, but I like the concept of a setter (which you might not need in this case) and a getter.
Then, when you make your package, list the 'data' package as a dependency. This way, whenever someone installs your package, they will always have the latest data.
On your part, you'll just have to swap out the data in your 'data' package and upload it to the repo you want.
If you don't know how to build a package, check ?package.skeleton, R CMD build, and R CMD check.
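For example, the relevant line in the code package's DESCRIPTION file would look something like this (the data package name is a placeholder):
Depends: myDataPackage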

Related

How to get the github repo url for all the packages on CRAN?

I would like to extract the GitHub repo URL for all the packages on CRAN. I first tried reading the CRAN page and getting the table of all the package names, which also contains the URL of each package's description page, since I want to extract the GitHub repo URL from the description page. But I can't get the complete URL. Could you please help me with this? Or is there a better way to get the repo URL for all packages?
To clarify: I want to filter for the packages that do have an official GitHub repo, such as xfun or fddm. I found that I can extract the username and repo name from each package's description on CRAN and put them into a GitHub-formatted URL, since most of them follow the same format, https://github.com/{username}/{reponame}; for example, for the package xfun it would be https://github.com/yihui/xfun.
So far I have built the URL for a few of them (three so far), and I am wondering how I could get the URL for all of them. I know the glue package can substitute the elements into a URL, and to build the URLs by replacing the elements (username and reponame) I have tried the map() and map_dfr() functions. But they return an error: Error in parse_url(url) : length(url) == 1 is not TRUE
Here is my code :
get <- map_dfr(dat, ~ {
  username <- dat$user
  reponame <- dat$package
  pkg_url <- GET(glue::glue("https://github.com/{username}/{reponame}"))
})
Could you please help me with this? Thanks a lot! :)
I want to suggest a different method for getting where you want.
As discussed in the comments, not all R packages have public GitHub repos.
Here is a version of some code, from an answer to another question by Dirk Eddelbuettel, that retrieves information from CRAN's package database, including the package name and the URL field. If a package has a public GitHub repo, it is very likely that the authors have included that information in the URL field. There may be a few packages where the GitHub repo is merely guessable (i.e. the GitHub user name matches, e.g., the identifier in the maintainer's e-mail address, and the GitHub repo name matches the package name), but it seems like a lot of work to do all that guessing (and to query GitHub to check whether each guess was correct) for a relatively low return.
getPackageRDS <- function() {
  description <- sprintf("%s/web/packages/packages.rds",
                         getOption("repos")["CRAN"])
  con <- if (substring(description, 1L, 7L) == "file://") {
    file(description, "rb")
  } else {
    url(description, "rb")
  }
  on.exit(close(con))
  db <- readRDS(gzcon(con))
  rownames(db) <- NULL
  return(db)
}
dd <- as.data.frame(getPackageRDS())
dd2 <- subset(dd, grepl("github.com", URL))
## clean up (multiple URLs, etc.)
dd2$URL <- sapply(strsplit(dd2$URL, "[, \n]"),
                  function(x) trimws(grep("github.com", x, value = TRUE)[1]))
As of today (25 May 2021) there are 17665 packages in total on CRAN, of which 6184 have "github.com" in the URL field. Here are the first few results:
Package URL
5 abbyyR http://github.com/soodoku/abbyyR
12 ABCoptim http://github.com/gvegayon/ABCoptim
16 abctools https://github.com/dennisprangle/abctools
18 abdiv https://github.com/kylebittinger/abdiv
20 abess https://github.com/abess-team/abess
23 ABHgenotypeR http://github.com/StefanReuscher/ABHgenotypeR
The URL field may still not be completely clean, but this should get you most of the way there.
An alternative approach would be to use the githubinstall package, which works by downloading a data frame that has been generated by crawling GitHub looking for R packages.
library(githubinstall)
dd3 <- gh_list_packages()
At present there are 34491 packages in this list, so obviously it includes a lot of stuff that's not on CRAN. You could intersect this list of packages with information from available.packages() ...
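For instance, a rough sketch of that intersection (I am assuming here that gh_list_packages() returns the package names in a column called package_name; check the actual column names first):
cran_pkgs <- rownames(available.packages())
dd4 <- subset(dd3, package_name %in% cran_pkgs)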

How to manage dependencies with renv explicitly

I would prefer to have a config file that lists the packages needed for the project, rather than relying on renv::init() to scrape the project and find everything I need (it often can't).
So my question is: how do I explicitly tell renv which packages are required for a project? An example would be appreciated.
The renv package does all sorts of fancy things: installing from several different locations, setting up a project-specific library so that you can control the versions used by a project, etc. If you need that stuff, I think you're out of luck: as far as I can see, it has no way to accept a list of dependencies; it needs to scan your source to find them. I suppose you could include a function like
loadPackages <- function() {
  requireNamespace("foo")
  requireNamespace("bar")
  ...
}
to make it easier for renv to find your required packages, but if it's failing in some other way (e.g. you have incomplete files that don't parse properly), this won't help.
If you don't need all that fancy stuff, you could use the function below:
needsPackages <- function(pkgs, install = TRUE, update = FALSE,
                          load = FALSE, attach = FALSE) {
  missing <- c()
  for (p in pkgs) {
    if (!nchar(system.file(package = p)))
      missing <- c(missing, p)
  }
  if (length(missing)) {
    missing <- unique(missing)
    if (any(install)) {
      toinstall <- intersect(missing, pkgs[install])
      install.packages(toinstall)
      for (p in missing)
        if (!nchar(system.file(package = p)))
          stop("Did not install: ", p)
    } else {
      stop("Missing packages: ", paste(missing, collapse = ", "))
    }
  }
  if (any(update))
    update.packages(oldPkgs = pkgs[update], ask = FALSE, checkBuilt = TRUE)
  for (p in pkgs[load])
    loadNamespace(p)
  for (p in pkgs[attach])
    library(p, character.only = TRUE)
}
which is what I've used in one project. You call it as
needsPackages(c("foo", "bar"))
and it installs the missing ones. It can also update, load, or attach them. It just uses the standard install.packages function to install from CRAN, with no fancy selection of install locations or maintenance of particular package versions. If you do use something simple like this, you should run sessionInfo() afterwards to record package version numbers, in case you need to return to the same state later. (Though returning to that state will probably be painful!)
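For example, one way to keep that record on disk (the file name is arbitrary):
writeLines(capture.output(sessionInfo()), "sessionInfo.txt")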
There are two possible ways forward here:
Configure renv to use "explicit" snapshots, as described in https://rstudio.github.io/renv/reference/snapshot.html#snapshot-type -- this workflow requires that you list your package requirements in your DESCRIPTION file (see the sketch after this list);
Manually use renv::init(bare = TRUE) + renv::install(<packages>) (or your own package installation functions) to install the packages you need for your project, building the list of <packages> from some separate source that you maintain.
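A minimal sketch of the first option, assuming your required packages are listed in the Imports field of a DESCRIPTION file at the project root:
# Tell renv to consider only the packages declared in DESCRIPTION
renv::settings$snapshot.type("explicit")
renv::snapshot()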
If you have a specific workflow that you wish renv supported, you could consider filing a feature request at https://github.com/rstudio/renv/issues.

Update Package Automatically at Start-up

I find it annoying that I have to click Tools -> Update Packages every time I load RStudio. I could use update.packages(c("ggplot2")), for instance, to update my packages in .Rprofile, but the issue is that it won't look for other packages (dependencies). For instance, I have to update the "seriation" and "digest" packages every time I start RStudio, and these packages are not loaded by me at start-up. Does anyone have code to automatically check and update all packages at start-up? If so, can you please share it here? I extensively googled this topic and searched through SO, and it seems that the popular opinion is to use RStudio's menu. Here's the thread I am referring to: How to update R2jags in R?
One way I can think of doing this is in .Rprofile:
a <- installed.packages()
b <- data.frame(a[, 1])
and then calling this function: https://gist.github.com/stevenworthington/3178163
However, I am not quite sure whether this is the best method.
Another linked thread is: Load package at start-up
I created the thread above.
I'd appreciate any thoughts.
I found this on the internet (I don't remember where) when I was struggling with the same problem, though you still need to run this program yourself. Hope this helps.
all.packages <- installed.packages()
r.version <- paste(version[['major']], '.', version[['minor']], sep = '')
for (i in 1:nrow(all.packages)) {
  package.name <- all.packages[i, "Package"]
  # The "Built" column records which R version the package was built under;
  # reinstall anything built under a different R version.
  package.built <- all.packages[i, "Built"]
  if (package.built != r.version) {
    print(paste('Installing', package.name))
    install.packages(package.name)
  }
}
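For comparison, a simpler route is to lean on base R's own updater from .Rprofile (this assumes a CRAN mirror is already set in your repos option, and that you are happy updating your whole library at every startup):
.First <- function() {
  # Update all outdated packages non-interactively at startup
  try(update.packages(ask = FALSE, checkBuilt = TRUE), silent = TRUE)
}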

How do we set constant variables while building R packages?

We are building a package in R for our service (a robo-advisor here in Brazil) and we send requests all the time to our external API inside our functions.
As it is the first time we have built a package, we have some questions. :(
When we use our package to run some scripts, we will need some information such as api_path, login, and password.
How do we place this information inside our package?
Here is a real example:
get_asset_daily <- function(asset_id) {
  api_path <- "https://api.verios.com.br"
  url <- paste0(api_path, "/assets/", asset_id, "/dailies?asc=d")
  data <- fromJSON(url)  # fromJSON() here comes from the jsonlite package
  data
}
Sometimes we use a staging version of the API and have to constantly switch paths. How should we refer to the path inside our function?
Should we set a global environment variable, a package environment variable, just define api_path in our scripts, or use a package config file?
How do we do that?
Thanks for your help in advance.
Ana
One approach would be to use R's options interface. Create a file zzz.R in the R directory (this is the customary name for this file) with the following:
.onLoad <- function(libname, pkgname) {
  options(api_path = '...', username = 'name', password = 'pwd')
}
This will set these options when the package is loaded into memory.
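As an illustration, the function from the question could then read the option instead of hard-coding the path (a sketch; the fallback default here is my assumption):
get_asset_daily <- function(asset_id) {
  # Fall back to the production path if the option is unset
  api_path <- getOption("api_path", default = "https://api.verios.com.br")
  url <- paste0(api_path, "/assets/", asset_id, "/dailies?asc=d")
  jsonlite::fromJSON(url)
}
Switching to the staging API is then a single options(api_path = ...) call in your script, with no change to the package code.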

Save package settings between sessions

Is there a definitive way to save options or information pertaining to a certain package between sessions?
For example, say somebody made a game and released it as an R package. If they wanted to save high scores and not have them reset each time R started a new session, what would be the best way to do this? Currently I can only think of storing a file in the user's home directory, but I'm not sure I like that approach.
This may be an approach. I created a dummy package with a dummy function (any function I create is bound to be a dummy function) and a data set I called scores, which I set as follows:
scores <- NA
Then I created the package with the scores data set.
Then I used the following to change the data set from within R.
loc <- paste0(find.package("new"), "/data")  # note: lowercase "data" in an installed package
unlink(paste0(loc, "/scores.rda"), recursive = TRUE, force = FALSE)
scores <- 10
save(scores, file = paste0(loc, "/scores.rda"))
Then, when I unloaded the library and reloaded it again, the data set now says:
> scores
[1] 10
Could this be modified to do what you want? You'd have to have it save in between somehow, but I am not sure how to do that without messing with the .Last function.
EDIT:
It appears this option is not viable: when you compile it as a package and use lazy loading, the data sets are saved as Rdata.rdb and Rdata.rdx, not as .rda files. That means the approach I use above is kinda worthless, since we want the data to be recognized automatically.
EDIT 2:
This approach works, and I tried it on a package I made. You can't use lazy loading of the data, and you have to explicitly use data(scores), either at top level or inside the function you're calling. I also assigned scores to .scores in the global environment the first time it was created, and used exists inside the function to see if it exists. If .scores existed, I assigned it to scores within the function. Once you unload the library and load it again, you never have to worry about that again.
Maybe an alternative is to save this as a function somehow that can be altered using Josh's advice here: Permanently replacing a function
I guess there is no way to store settings without saving them to disk or a database, one way or another. It can be done silently, though, by putting the code below in your ~/.Rprofile. However, if you have packages that save settings in ways other than using options, you will need to add them manually.
I know this is exactly what you said you did not want, but it might at least spark some debate.
.Last <- function() {
  my.options <- options()
  save(my.options, file = "~/.Roptions.Rdata")
}
.First <- function() {
  tryCatch({
    load("~/.Roptions.Rdata")
    do.call(options, my.options)
    rm(my.options)
  }, error = function(...) {})
}
To my surprise, try(..., silent = TRUE) gives a warning on startup if ~/.Roptions.Rdata does not exist, which is why I used tryCatch instead.
The modern answer to this problem is well explained at https://blog.r-hub.io/2020/03/12/user-preferences/
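As a quick illustration of what that post describes, base R (>= 4.0.0) provides tools::R_user_dir() for exactly this purpose (the package and file names below are placeholders):
# Persist settings in the user's config directory for this package
cfg_dir <- tools::R_user_dir("yourpackage", which = "config")
dir.create(cfg_dir, recursive = TRUE, showWarnings = FALSE)
saveRDS(list(high_score = 100), file.path(cfg_dir, "settings.rds"))
# In a later session:
settings <- readRDS(file.path(cfg_dir, "settings.rds"))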
I think I will be trying the hoardr package! Here is an example that worked for me :)
x <- hoardr::hoard()
x$cache_path_set("yourpackage", type = "user_cache_dir")
x$mkdir()
scores <- data.frame(
  user = c("one", "two", "three"),
  score = c(500, 200, 1100)
)
save(scores, file = file.path(x$cache_path_get(), "scores.rdata"))
x$list()
x$details()
# new session
x <- hoardr::hoard()
x$cache_path_set("yourpackage", type = "user_cache_dir")
x$list()
x$details()
load(file = file.path(x$cache_path_get(), "scores.rdata"))
PS - you can see a working example in the rnoaa package on GitHub at "ropensci/rnoaa". Check their R/onload.r file! I can expand if needed.
